BlackWaspTM
SQL Server
SQL 2005+

Removing Duplicates From a SQL Server Table (2)

It is possible to have duplicated data in a SQL Server database table. This may happen when a user enters the same information twice, or when importing data with constraints disabled. In either case, you may wish to delete the duplicated rows.

Removing Complete Duplicates

The previous method for removing duplicates works because the primary key values are unique. If non-unique primary key values existed, or if no primary key constraint exists, the process becomes more complicated. To demonstrate, we need a new table that is populated with data including complete duplicates. Execute the following to create such a table:

CREATE TABLE CustomersNoKey
(
    FirstName VARCHAR(20),
    LastName VARCHAR(20)
)

INSERT INTO CustomersNoKey VALUES ('Bob', 'Brown')
INSERT INTO CustomersNoKey VALUES ('Jim', 'Jones')
INSERT INTO CustomersNoKey VALUES ('Jim', 'Jones')
INSERT INTO CustomersNoKey VALUES ('Jim', 'Jones')
INSERT INTO CustomersNoKey VALUES ('Bob', 'Brown')
INSERT INTO CustomersNoKey VALUES ('Pam', 'Penny')

Identifying the duplicate data is still achieved using a query with grouping. Executing the following gives a distinct list of names and the number of times that they appear in the table.

SELECT FirstName, LastName, Count(*) FROM CustomersNoKey
GROUP BY FirstName, LastName

Deleting the duplicate rows is more difficult, as there are no IDs that we can use to identify the rows to keep. Instead we will follow a three-stage process. The steps are:

  • Retrieve the grouped results, including the count of matching rows, and store them in a temporary table.
  • Use a cursor to loop through every row in the temporary table.
  • For each temporary table row, delete the appropriate number of rows from the original table. The number of rows to remove will be one less than the aggregated count value.

To create the temporary table, add the following code. Don't execute this until all steps are in place.

CREATE TABLE #TempCust
(
    FirstName VARCHAR(20),
    LastName VARCHAR(20),
    Count INT
)

The temporary table is populated with an INSERT statement and the results of the previous query. To improve efficiency, you could add a HAVING clause to exclude rows with a Count of one, as these do not represent duplicated data. For the purposes of this article the additional clause is unnecessary.

INSERT INTO #TempCust (FirstName, LastName, Count)
SELECT FirstName, LastName, Count(*) FROM CustomersNoKey
GROUP BY FirstName, LastName

Next we need to declare the cursor. This will operate upon every row from the temporary table.

DECLARE @FirstName VARCHAR(20), @LastName VARCHAR(20), @Count INT

DECLARE DuplicateCursor CURSOR
FOR SELECT FirstName, LastName, Count FROM #TempCust

The cursor can now be opened and the first row retrieved.

OPEN DuplicateCursor
FETCH NEXT FROM DuplicateCursor INTO @FirstName, @LastName, @Count

The cursor will process every row from the temporary table within a WHILE loop. During each iteration, rows will be deleted from the original table. The DELETE statement's TOP clause is used to delete one less row that the count value from the temporary table, leaving one unique row in each case.

WHILE @@FETCH_STATUS = 0
BEGIN
    DELETE TOP (@Count-1) FROM CustomersNoKey
    WHERE FirstName = @FirstName AND LastName = @LastName

    FETCH NEXT FROM DuplicateCursor INTO @FirstName, @LastName, @Count
END

To complete the Transact-SQL batch, the cursor should be closed and deallocated and the temporary table deleted.

CLOSE DuplicateCursor
DEALLOCATE DuplicateCursor
DROP TABLE #TempCust

You can now execute the entire script to delete the duplicates. On completion, run the following query to confirm that only unique rows remain.

SELECT * FROM CustomersNoKey
27 June 2010