Friday, November 25, 2011

Delete Duplicate Records – Rows

There are different way to do this.

Solution - 1

Table in example is has ID as Identity Column and Columns which have duplicate data are DuplicateColumn1, DuplicateColumn2 and DuplicateColumn3. (The table must have identity column, which will be used to identify the duplicate records.)


DELETE
FROM MyTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)


Solution - 2

set nocount on
drop table tbl1
create table tbl1
(
col1 int
)
insert into tbl1 values(1)
insert into tbl1 values(1)
insert into tbl1 values(1)
insert into tbl1 values(2)
insert into tbl1 values(2)
insert into tbl1 values(2)
insert into tbl1 values(2)
insert into tbl1 values(3)
insert into tbl1 values(3)

set rowcount 1
select ‘start’
while @@rowcount > 0 delete a from tbl1 a where (select count(*) from tbl1 b where a.col1 = b.col1)>1
set rowcount 0
select * from tbl1
set nocount off

Note: This will be a time-consuming method. Alternates are move distinct data to temp table or use row_number() function.

Solution - 3

If you are looking for deleting multiple records (duplicate with more than 2 records like 3,4,5 duplicate records, etc) and without key column then the following method can be adopted.

Lets say your table is tab1.

SELECT DISTINCT * INTO tab2 FROM tab1 — distinct of all columns

DROP TABLE tab1

EXEC sp_rename ‘tab2′,’tab1′

Note:
Take a back up of your original table before using dropping for your reference. You may need to grant permission of tab2 as given for tab1.

Triggers and indexes defined on that table tab1 will have to recreate. To resolve this issue, you may follow Solution - 1 or 4.

Solution - 4

Here is a single query which will answer all your questions but this still needs Primary key or identity column to be added to the table though.
DELETE FROM MyTable
WHERE EXISTS (
SELECT * FROM MyTable AS b
WHERE
b.[col1] = MyTable.[col1]
AND b.[col2] = MyTable.[col2]
AND b.[col3] = MyTable.[col3]
GROUP BY
b.[col1], b.[col2], b.[col3]
HAVING
MyTable.[ID] > MIN(b.[ID])
)
If you want most recent records to be present in your DB and delete multiple Old duplicate records, You can make use of
DELETE FROM MyTable
WHERE EXISTS (
SELECT * FROM MyTable AS b
WHERE
b.[col1] = MyTable.[col1]
AND b.[col2] = MyTable.[col2]
AND b.[col3] = MyTable.[col3]
GROUP BY
b.[col1], b.[col2], b.[col3]
HAVING
MyTable.[ID] MIN(b.[ID])
)
Hope this answers all your questions.
If you dont have Primary key or Identity Column then you can always create it and run this query and delete the Primary Key or identity column.

Solution - 5

This can be easily achieved without crating unique identifier by using CTE and ROW_NUMBER (), the modified query for sql server 2005 goes here.

WITH T1 AS (SELECT ROW_NUMBER ( ) OVER ( PARTITION BY ID, FNAME, LNAME ORDER BY ID ) AS RNUM FROM DUPLICATE )
DELETE FROM T1 WHERE RNUM > 1

Solution - 6

Delete record in a single query its OK. But if you can write subquery, but without use of temp table, without use cursor, and without use of insert identity column, here is the solution.

CREATE TABLE #Table1 (col1 int, col2 int)
INSERT INTO #Table1 VALUES (1, 1000)
INSERT INTO #Table1 VALUES (2, 2000)
INSERT INTO #Table1 VALUES (2, 2000)
INSERT INTO #Table1 VALUES (3, 3000)
INSERT INTO #Table1 VALUES (3, 3000)
INSERT INTO #Table1 VALUES (4, 4000)

SELECT * FROM #Table1;

WITH T1 AS (SELECT (ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1)) AS RNum FROM #Table1)
DELETE FROM T1 WHERE RNum IN (SELECT a.RNum FROM T1 AS a, T1 AS b WHERE a.RNum > b.RNum GROUP BY a.RNum)

SELECT * FROM #Table1

DROP TABLE #Table1

ROW_NUMBER() will solve your purpose.