PostgreSQL - Deleting Duplicate Rows using Subquery
Last Updated: 26 Aug, 2024
In PostgreSQL, handling duplicate rows is a common task, especially when working with large datasets. Fortunately, PostgreSQL provides several techniques to efficiently delete duplicate rows, and one of the most effective approaches is using subqueries.
In this article, we will demonstrate how to identify and remove duplicate rows while keeping the row with either the lowest or highest ID, depending on your requirements.
Setting Up a Sample Table
For demonstration, let's set up a sample table named 'basket' that stores fruits:
CREATE TABLE basket(
    id SERIAL PRIMARY KEY,
    fruit VARCHAR(50) NOT NULL
);
INSERT INTO basket(fruit) values('apple');
INSERT INTO basket(fruit) values('apple');
INSERT INTO basket(fruit) values('orange');
INSERT INTO basket(fruit) values('orange');
INSERT INTO basket(fruit) values('orange');
INSERT INTO basket(fruit) values('banana');
SELECT * FROM basket;
This should return the following rows:
 id | fruit
----+--------
  1 | apple
  2 | apple
  3 | orange
  4 | orange
  5 | orange
  6 | banana
Now that we have set up the sample table, we will query for the duplicates using the following.
Query:
SELECT
    fruit,
    COUNT( fruit )
FROM
    basket
GROUP BY
    fruit
HAVING
    COUNT( fruit ) > 1
ORDER BY
    fruit;
This should return the fruits that appear more than once, along with their counts:
 fruit  | count
--------+-------
 apple  |     2
 orange |     3
Deleting Duplicate Rows with a Subquery
To delete the duplicate rows while keeping the row with the lowest ID, you can use a subquery with the 'ROW_NUMBER()' window function. This method ensures that only one row per fruit is retained and all other duplicates are removed.
Query:
DELETE FROM basket
WHERE id IN
    (SELECT id
     FROM
        (SELECT id,
                ROW_NUMBER() OVER( PARTITION BY fruit ORDER BY id ) AS row_num
         FROM basket ) t
     WHERE t.row_num > 1 );
Explanation:
- The inner subquery assigns a row number to each row within each partition (grouped by 'fruit'), ordered by 'id'.
- The ROW_NUMBER() function starts counting from 1 for each group, so the first row in each group is retained, and the rest are marked for deletion.
- The outer DELETE statement removes the rows identified by the subquery.
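Before running the DELETE, it can help to preview which rows the subquery would select. The following is a minimal sketch against the sample 'basket' table; it runs only the window query and deletes nothing:

```sql
-- Preview only: list every row with its row number inside its fruit group.
-- Rows with row_num > 1 are the ones the keep-lowest-id DELETE would remove.
SELECT id,
       fruit,
       ROW_NUMBER() OVER (PARTITION BY fruit ORDER BY id) AS row_num
FROM basket
ORDER BY fruit, id;
```

Assuming the six sample rows received the ids 1 through 6, the rows with ids 2, 4, and 5 get a row number greater than 1, so those are the rows the DELETE would remove.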
Keeping the Row with the Highest ID
If you want to keep the duplicate row with the highest ID instead, sort the window in descending order ('ORDER BY id DESC') so that the highest ID receives row number 1:
DELETE FROM basket
WHERE id IN
    (SELECT id
     FROM
        (SELECT id,
                ROW_NUMBER() OVER( PARTITION BY fruit ORDER BY id DESC ) AS row_num
         FROM basket ) t
     WHERE t.row_num > 1 );
This query will retain the row with the highest ID for each duplicate group and delete all other duplicates.
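Because a DELETE is destructive, one cautious way to try the keep-highest-ID variant is inside a transaction. The sketch below assumes the sample 'basket' table. It uses PostgreSQL's RETURNING clause to list the deleted rows, then rolls the change back:

```sql
-- Sketch: try the keep-highest-id delete inside a transaction.
BEGIN;

DELETE FROM basket
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               -- DESC gives the highest id row number 1, so that row is kept
               ROW_NUMBER() OVER (PARTITION BY fruit ORDER BY id DESC) AS row_num
        FROM basket
    ) t
    WHERE t.row_num > 1
)
RETURNING id, fruit;  -- lists the rows that were deleted

-- Undo the delete while experimenting; use COMMIT instead to make it permanent.
ROLLBACK;
```

Once the returned rows look right, replacing ROLLBACK with COMMIT applies the deletion.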
Deleting Duplicates Based on Multiple Columns
To delete duplicates based on the values of multiple columns, use the following query template.
Query:
DELETE FROM table_name
WHERE id IN
    (SELECT id
     FROM
        (SELECT id,
                ROW_NUMBER() OVER( PARTITION BY column_1, column_2 ORDER BY id ) AS row_num
         FROM table_name ) t
     WHERE t.row_num > 1 );
Explanation:
- The PARTITION BY clause includes multiple columns ('column_1', 'column_2'), ensuring duplicates are identified based on the combination of those columns.
- The rest of the logic remains the same.
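As a concrete illustration of the multi-column template, here is a small sketch. The table 'basket_colors' and its 'color' column are hypothetical and are not part of the original example:

```sql
-- Hypothetical table for illustration only.
CREATE TABLE basket_colors (
    id    SERIAL PRIMARY KEY,
    fruit VARCHAR(50) NOT NULL,
    color VARCHAR(50) NOT NULL
);

INSERT INTO basket_colors (fruit, color) VALUES
    ('apple', 'red'),
    ('apple', 'red'),     -- duplicate of the first row
    ('apple', 'green'),   -- same fruit, different color: not a duplicate
    ('orange', 'orange');

-- Rows count as duplicates only when both fruit and color match.
DELETE FROM basket_colors
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY fruit, color ORDER BY id) AS row_num
        FROM basket_colors
    ) t
    WHERE t.row_num > 1
);
```

In this sketch only the second ('apple', 'red') row is removed. The ('apple', 'green') row stays because its combination of values is unique.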
Verifying the Result
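The multi-column statement deletes every row whose combination of 'column_1' and 'column_2' values repeats an earlier row, keeping the row with the lowest 'id' in each group. For the 'basket' example, verify the result by re-running the duplicate check: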
Query:
SELECT
    fruit,
    COUNT( fruit )
FROM
    basket
GROUP BY
    fruit
HAVING
    COUNT( fruit ) > 1
ORDER BY
    fruit;
Output:
 fruit | count
-------+-------
(0 rows)
If the deletion was successful, this query should return an empty result set, indicating no duplicates remain.
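A complementary check, sketched here for the sample 'basket' table, compares the total row count with the number of distinct fruits:

```sql
-- If no duplicates remain, the two counts are equal.
SELECT COUNT(*)              AS total_rows,
       COUNT(DISTINCT fruit) AS distinct_fruits
FROM basket;
```

For the sample data, both counts should be 3 (apple, orange, banana) after a successful cleanup.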