
SQL 2005+Removing Duplicates From a SQL Server Table
It is possible to have duplicated data in a SQL Server database table. This may happen when a user enters the same information twice, or when importing data with constraints disabled. In either case, you may wish to delete the duplicated rows.
Removing Duplicates
Duplicated data in a SQL Server table can cause various problems depending upon the applications that use the information. Duplicate information is often caused by user error, should a user enter the same information twice. In such cases, it is likely that a primary key exists and that the duplicates exist with differing keys. It is also possible for a table to exist with complete duplicates, where every column contains the same information. This can happen where no primary key or unique constraints exist, or where these constraints have been disabled during a bulk insert or data import. In either of these situations, it is common to wish to remove the duplicates and leave the table with only unique rows.
In this article we will examine processes for removing duplicates in both of the above scenarios. The sample Transact-SQL code demonstrates de-duplication processes using two example tables. In real-world situations you should ensure that your data is recoverable before attempting these types of operations.
Removing Duplicates from a Table with a Primary Key
For the first example, we will remove duplicate data from a table with a primary key. In this case, the primary key values will be unique but the other data in the table will contain duplicates. To create the sample table and populate it with data, execute the following script:
CREATE TABLE Customers
(
ID INT PRIMARY KEY CLUSTERED,
FirstName VARCHAR(20),
LastName VARCHAR(20)
)
INSERT INTO Customers VALUES (1, 'Bob', 'Brown')
INSERT INTO Customers VALUES (2, 'Jim', 'Jones')
INSERT INTO Customers VALUES (3, 'Jim', 'Jones')
INSERT INTO Customers VALUES (4, 'Jim', 'Jones')
INSERT INTO Customers VALUES (5, 'Bob', 'Brown')
INSERT INTO Customers VALUES (6, 'Pam', 'Penny')
Finding the duplicates is relatively simple. You can use a standard SELECT statement to retrieve all rows from the table, grouped by the columns that should be unique. The following query retrieves the distinct set of first and last names. Two aggregate columns are also included. The first of these returns the lowest ID value for the name. The second returns the number of rows that contain the first and last name.
SELECT min(ID), FirstName, LastName, Count(*) FROM Customers
GROUP BY FirstName, LastName
The results of the query are similar to the data that we wish the table to contain after de-duplication. We can therefore simply delete all rows from the table where the ID of the row is not within this list. To do so, execute the following DELETE statement:
DELETE FROM Customers WHERE ID NOT IN
(SELECT min(ID) FROM Customers GROUP BY FirstName, LastName)
To verify that the duplicate rows have been successfully removed, select all data from the table:
SELECT * FROM Customers
27 June 2010