PostgreSQL - Deleting Duplicate Rows using Subquery

Last Updated : 26 Aug, 2024

In PostgreSQL, handling duplicate rows is a common task, especially when working with large datasets. Fortunately, PostgreSQL provides several techniques to efficiently delete duplicate rows, and one of the most effective approaches is using subqueries.

In this article, we will demonstrate how to identify and remove duplicate rows while keeping the row with either the lowest or highest ID, depending on your requirements.

Setting Up a Sample Table

For demonstration, let's set up a sample table named 'basket' that stores fruit names, and insert a few duplicate values:

CREATE TABLE basket(
    id SERIAL PRIMARY KEY,
    fruit VARCHAR(50) NOT NULL
);
INSERT INTO basket(fruit) values('apple');
INSERT INTO basket(fruit) values('apple');

INSERT INTO basket(fruit) values('orange');
INSERT INTO basket(fruit) values('orange');
INSERT INTO basket(fruit) values('orange');

INSERT INTO basket(fruit) values('banana');
SELECT * FROM basket;

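On a freshly created table, this returns the six inserted rows: two 'apple' rows (ids 1 and 2), three 'orange' rows (ids 3 to 5), and one 'banana' row (id 6).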

Now that we have set up the sample table, we will query for the duplicates using the following.

Query:

SELECT
    fruit,
    COUNT( fruit )
FROM
    basket
GROUP BY
    fruit
HAVING
    COUNT( fruit )> 1
ORDER BY
    fruit;

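With the sample data, this returns two rows: 'apple' with a count of 2 and 'orange' with a count of 3. 'banana' is not listed because it appears only once.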
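
The grouped query above reports counts only. To see which individual rows are involved, a window count can be used instead. The following is a minimal sketch against the same 'basket' table; the alias names 'copies' and 't' are chosen here for illustration.

```sql
-- List every row whose fruit value appears more than once, with its id.
SELECT id, fruit
FROM (
    SELECT id,
           fruit,
           COUNT(*) OVER (PARTITION BY fruit) AS copies
    FROM basket
) t
WHERE copies > 1
ORDER BY fruit, id;
```

On the sample data this lists the two 'apple' rows and the three 'orange' rows, which makes it easier to decide which copy to keep.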

Deleting Duplicate Rows with a Subquery

To delete the duplicate rows while keeping the row with the lowest ID, you can use a subquery with the 'ROW_NUMBER()' window function. This method ensures that only one row per fruit is retained, and all other duplicates are removed.

Query:

DELETE FROM basket
WHERE id IN
    (SELECT id
    FROM 
        (SELECT id,
         ROW_NUMBER() OVER( PARTITION BY fruit ORDER BY  id ) AS row_num
        FROM basket ) t
        WHERE t.row_num > 1 );

Explanation:

The inner subquery assigns a row number to each row within each partition (grouped by 'fruit'), ordered by 'id'.
The ROW_NUMBER() function starts counting from 1 for each group, so the first row in each group is retained, and the rest are marked for deletion.
The outer DELETE statement removes the rows identified by the subquery.
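
Before running the DELETE, it can help to preview the row numbers that the inner subquery produces. This sketch runs only the SELECT part against the sample 'basket' table, so nothing is modified.

```sql
-- Preview: number the rows inside each fruit group, lowest id first.
-- Rows with row_num > 1 are the ones the DELETE would remove.
SELECT id,
       fruit,
       ROW_NUMBER() OVER (PARTITION BY fruit ORDER BY id) AS row_num
FROM basket
ORDER BY fruit, id;
```

With the sample data, the rows with ids 2, 4, and 5 receive a row number greater than 1, so those are the rows the DELETE targets.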

Keeping the Row with the Highest ID

If you want to keep the row with the highest id in each group instead, sort the rows in descending order inside the subquery by adding DESC to the ORDER BY clause:

DELETE FROM basket
WHERE id IN
    (SELECT id
    FROM 
        (SELECT id,
         ROW_NUMBER() OVER( PARTITION BY fruit ORDER BY  id DESC ) AS row_num
        FROM basket ) t
        WHERE t.row_num > 1 );

This query will retain the row with the highest ID for each duplicate group and delete all other duplicates.
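
Because a DELETE cannot be undone once committed, one cautious option is to run it inside a transaction and inspect the deleted rows first. The sketch below uses the highest-ID variant (ORDER BY id DESC) together with PostgreSQL's RETURNING clause. Ending with ROLLBACK is an assumption made here for a dry run; replace it with COMMIT to keep the change.

```sql
BEGIN;

-- Delete all but the highest id in each fruit group and show what was removed.
DELETE FROM basket
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY fruit ORDER BY id DESC) AS row_num
        FROM basket
    ) t
    WHERE t.row_num > 1
)
RETURNING id, fruit;

-- Dry run: undo the deletion. Use COMMIT instead to make it permanent.
ROLLBACK;
```

On the original sample data, the returned rows would be ids 1, 3, and 4, leaving ids 2, 5, and 6 in place if the transaction were committed.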

Deleting Duplicates Based on Multiple Columns

If rows should count as duplicates only when several columns match, list all of those columns in the PARTITION BY clause. Here is the query template:

Query:

DELETE FROM table_name
WHERE id IN
    (SELECT id
    FROM 
        (SELECT id,
         ROW_NUMBER() OVER( PARTITION BY column_1, column_2 ORDER BY  id ) AS row_num
        FROM table_name ) t
        WHERE t.row_num > 1 );

Explanation:

The PARTITION BY clause includes multiple columns ('column_1', 'column_2'), ensuring duplicates are identified based on the combination of those columns.
The rest of the logic remains the same.
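
As a concrete illustration of the template, the sketch below uses a hypothetical table 'basket_items' with a 'fruit' and a 'color' column. The table, its columns, and its data are assumptions made for this example and are not part of the original setup.

```sql
-- Hypothetical table: a row is a duplicate only if fruit AND color both match.
CREATE TABLE basket_items (
    id SERIAL PRIMARY KEY,
    fruit VARCHAR(50) NOT NULL,
    color VARCHAR(50) NOT NULL
);

INSERT INTO basket_items (fruit, color) VALUES
    ('apple', 'red'),
    ('apple', 'red'),
    ('apple', 'green'),
    ('orange', 'orange'),
    ('orange', 'orange');

-- Keep the lowest id for each (fruit, color) combination.
DELETE FROM basket_items
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY fruit, color ORDER BY id) AS row_num
        FROM basket_items
    ) t
    WHERE t.row_num > 1
);
```

On a freshly created table, this removes the second ('apple', 'red') row and the second ('orange', 'orange') row. The ('apple', 'green') row stays because its combination of values is unique.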

Verifying the Result

Each of the statements above deletes every row that repeats an earlier combination of the partitioned columns and keeps exactly one row per combination. To verify the result on the 'basket' table, rerun the duplicate check from earlier:

Query:

SELECT
    fruit,
    COUNT( fruit )
FROM
    basket
GROUP BY
    fruit
HAVING
    COUNT( fruit )> 1
ORDER BY
    fruit;

Output:

If the deletion was successful, this query should return an empty result set, indicating no duplicates remain.


RajuKumar19
