BlackWaspTM

This web site uses cookies. By using the site you accept the cookie policy.This message is for compliance with the UK ICO law.

SQL Server
SQL 2005+

Removing Duplicates From a SQL Server Table

It is possible to have duplicated data in a SQL Server database table. This may happen when a user enters the same information twice, or when importing data with constraints disabled. In either case, you may wish to delete the duplicated rows.

Removing Duplicates

Duplicated data in a SQL Server table can cause various problems depending upon the applications that use the information. Duplicate information is often caused by user error, should a user enter the same information twice. In such cases, it is likely that a primary key exists and that the duplicates exist with differing keys. It is also possible for a table to exist with complete duplicates, where every column contains the same information. This can happen where no primary key or unique constraints exist, or where these constraints have been disabled during a bulk insert or data import. In either of these situations, it is common to wish to remove the duplicates and leave the table with only unique rows.

In this article we will examine processes for removing duplicates in both of the above scenarios. The sample Transact-SQL code demonstrates de-duplication processes using two example tables. In real-world situations you should ensure that your data is recoverable before attempting these types of operations.

Removing Duplicates from a Table with a Primary Key

For the first example, we will remove duplicate data from a table with a primary key. In this case, the primary key values will be unique but the other data in the table will contain duplicates. To create the sample table and populate it with data, execute the following script:

CREATE TABLE Customers
(
    ID INT PRIMARY KEY CLUSTERED,
    FirstName VARCHAR(20),
    LastName VARCHAR(20)
)

INSERT INTO Customers VALUES (1, 'Bob', 'Brown')
INSERT INTO Customers VALUES (2, 'Jim', 'Jones')
INSERT INTO Customers VALUES (3, 'Jim', 'Jones')
INSERT INTO Customers VALUES (4, 'Jim', 'Jones')
INSERT INTO Customers VALUES (5, 'Bob', 'Brown')
INSERT INTO Customers VALUES (6, 'Pam', 'Penny')

Finding the duplicates is relatively simple. You can use a standard SELECT statement to retrieve all rows from the table, grouped by the columns that should be unique. The following query retrieves the distinct set of first and last names. Two aggregate columns are also included. The first of these returns the lowest ID value for the name. The second returns the number of rows that contain the first and last name.

SELECT min(ID), FirstName, LastName, Count(*) FROM Customers
GROUP BY FirstName, LastName

The results of the query are similar to the data that we wish the table to contain after de-duplication. We can therefore simply delete all rows from the table where the ID of the row is not within this list. To do so, execute the following DELETE statement:

DELETE FROM Customers WHERE ID NOT IN
(SELECT min(ID) FROM Customers GROUP BY FirstName, LastName)

To verify that the duplicate rows have been successfully removed, select all data from the table:

SELECT * FROM Customers

Removing Complete Duplicates

The previous method for removing duplicates works because the primary key values are unique. If non-unique primary key values existed, or if no primary key constraint exists, the process becomes more complicated. To demonstrate, we need a new table that is populated with data including complete duplicates. Execute the following to create such a table:

CREATE TABLE CustomersNoKey
(
    FirstName VARCHAR(20),
    LastName VARCHAR(20)
)

INSERT INTO CustomersNoKey VALUES ('Bob', 'Brown')
INSERT INTO CustomersNoKey VALUES ('Jim', 'Jones')
INSERT INTO CustomersNoKey VALUES ('Jim', 'Jones')
INSERT INTO CustomersNoKey VALUES ('Jim', 'Jones')
INSERT INTO CustomersNoKey VALUES ('Bob', 'Brown')
INSERT INTO CustomersNoKey VALUES ('Pam', 'Penny')

Identifying the duplicate data is still achieved using a query with grouping. Executing the following gives a distinct list of names and the number of times that they appear in the table.

SELECT FirstName, LastName, Count(*) FROM CustomersNoKey
GROUP BY FirstName, LastName

Deleting the duplicate rows is more difficult, as there are no IDs that we can use to identify the rows to keep. Instead we will follow a three-stage process. The steps are:

  • Retrieve the grouped results, including the count of matching rows, and store them in a temporary table.
  • Use a cursor to loop through every row in the temporary table.
  • For each temporary table row, delete the appropriate number of rows from the original table. The number of rows to remove will be one less than the aggregated count value.

To create the temporary table, add the following code. Don't execute this until all steps are in place.

CREATE TABLE #TempCust
(
    FirstName VARCHAR(20),
    LastName VARCHAR(20),
    Count INT
)

The temporary table is populated with an INSERT statement and the results of the previous query. To improve efficiency, you could add a HAVING clause to exclude rows with a Count of one, as these do not represent duplicated data. For the purposes of this article the additional clause is unnecessary.

INSERT INTO #TempCust (FirstName, LastName, Count)
SELECT FirstName, LastName, Count(*) FROM CustomersNoKey
GROUP BY FirstName, LastName

Next we need to declare the cursor. This will operate upon every row from the temporary table.

DECLARE @FirstName VARCHAR(20), @LastName VARCHAR(20), @Count INT

DECLARE DuplicateCursor CURSOR
FOR SELECT FirstName, LastName, Count FROM #TempCust

The cursor can now be opened and the first row retrieved.

OPEN DuplicateCursor
FETCH NEXT FROM DuplicateCursor INTO @FirstName, @LastName, @Count

The cursor will process every row from the temporary table within a WHILE loop. During each iteration, rows will be deleted from the original table. The DELETE statement's TOP clause is used to delete one less row that the count value from the temporary table, leaving one unique row in each case.

WHILE @@FETCH_STATUS = 0
BEGIN
    DELETE TOP (@Count-1) FROM CustomersNoKey
    WHERE FirstName = @FirstName AND LastName = @LastName

    FETCH NEXT FROM DuplicateCursor INTO @FirstName, @LastName, @Count
END

To complete the Transact-SQL batch, the cursor should be closed and deallocated and the temporary table deleted.

CLOSE DuplicateCursor
DEALLOCATE DuplicateCursor
DROP TABLE #TempCust

You can now execute the entire script to delete the duplicates. On completion, run the following query to confirm that only unique rows remain.

SELECT * FROM CustomersNoKey
27 June 2010