08 November 2008

SQL Reloaded: Dealing with data duplicates on SQL Server

Duplicates or duplications can be defined as "result rows that are identical to one or more other rows" [1]. They occur frequently in data collections (e.g. Excel, Access, etc.) with concurrent access that have minimum or no validation at all. More users and/or more frequent the updates on the same information, higher the risks of having duplicated data. Duplicates occur also in modern architectures when validation wasn’t addressed correspondingly or from usage related issues, for example, in information systems the huge amount of data entry makes it hard to overlook documents already processed (e.g. purchase orders, invoices, payments, etc.).

Subject to duplication are whole records, a group of attributes (fields) or only single attributes. I depends from case to case. Often duplicates are easy to identify - it’s enough to let somebody who has the proper knowledge to look over them. But what you do when the volume of data is too large or when is need to automate the process as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to Data Mining techniques.

I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on table that tracks information about Assets:

-- create test table
CREATE TABLE [dbo].[Assets](
 [ID] [int] NOT NULL,
 [CreationDate] smalldatetime NOT NULL,
 [Vendor] [varchar](50) NULL,
 [Asset] [varchar](50) NULL,
 [Model] [varchar](50) NULL,
 [Owner] [varchar](50) NULL,
 [Tag] [varchar](50) NULL,
 [Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

Here's some test data:

-- insert test data (SQL Server 2000+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO dbo.Assets
VALUES  ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO dbo.Assets
VALUES  ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES  ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO dbo.Assets
VALUES  ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    
Let’s check table’s content:

-- review the data
SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
ID CreationDate Vendor Asset Model Owner Tag Quantity
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Normally, a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag and Vendor combination appears more than once:

-- retrieve the duplicates
SELECT Vendor, Tag
FROM dbo.Assets A    
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output:
Vendor Tag
Dell DD0001
Microsoft WX0001

In many cases showing only the duplicated values would be enough, however there are cases in which is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:

-- retrieve duplicates with details
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- duplicates
	  SELECT Vendor, Tag
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:

-- select unique records
SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.

Oracle and SQL Server introduced the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking. In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:

-- retrieve duplicates via ranking functions 
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Things are simple when records’ creation date is available and it's unique for the key used to validate the data, because a descending order would allow selecting the last record first. An Id can be added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number analytic function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. Actually, the dense_rank function assigns the same rank for the same occurrence of the values appearing in ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a field unique within a partition, while row_number() will return a sequence no matter of the attributes used in ORDER BY clause. 

Both functions should provide similar performance. Conversely, when the selection of records within a partition is not important, it’s better to use the row_number() function, which would need to do less processing.

Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most of the cases the unique identifier for a record is a sequential unique number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

The same technique can be used to delete the duplicates from a table:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

When an Id is not available, then we have most probably to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria we chose for selection, and then apply the technique based on Id as above.

Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available, and it's unique:

-- nonduplicated records (SQL server 2000+)  
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- last entry
	  SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

2. Duplicated records can be created by merging two sets of data with UNION ALL, in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one however this will burden the database with more checks. It is recommended to find other solutions, when possible!

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. Here’s the single multi-row insertion statement:

-- insert test data (SQL Server 2005+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should work also in Oracle with two amendments, attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function, as below:
 
-- insert test data (Oracle)
INSERT INTO dbo.Assets<br />
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
[1] S Hull (1998) Business Objects Glossary
[2] Microsoft Learn (2023) dense_rank() (link)
[3] Microsoft Learn (2023) row_number() (link)

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.