Removing duplicate rows in small batches based on date column

Due to double scheduling, some duplicate rows were inserted into the Windows.TableStats table in JiMetrics. To confirm that the table has duplicates, here is the T-SQL script I used. Note that this same script should work in all other major RDBMS platforms like MySQL, Oracle, etc. Adjust table and column names to fit your needs.

SELECT HostID,
       InstanceID,
       DbName,
       SchemaName,
       TableName,
       CAST (CollectionDate AS DATE),
       COUNT (*)
  FROM Windows.TableStats
GROUP BY HostID,
	 InstanceID,
         DbName,
         SchemaName,
         TableName,
         CAST (CollectionDate AS DATE)
HAVING COUNT (*) > 1

So the duplicates need to be removed. In Microsoft SQL Server, that can be accomplished with a Common Table Expression (CTE) using the ROW_NUMBER ranking function, which is a pretty elegant solution.

However, when using DELETE to remove potentially large number of rows, say millions, it is always advisable to do that in small batches. Otherwise, you are risking running out of temp, log space, or even disk space, not to mention that you are potentially creating bigger and coarser locks than necessary.

So I decided to remove the duplicates on a daily basis to lessen the impact on the database instance. A runningDate variable is used and incremented by one until the desired end date. Within each iteration of the loop, that day’s duplicates are removed. Tweak it to suit your needs.

I think things like this make a good job interview question for a database administrator.

DECLARE @runningDate DATE
SET @runningDate = '20140101'

WHILE (@runningDate < '20140406')
BEGIN
   WITH dupRows
        AS (SELECT RecordID,
                   ROW_NUMBER ()
                      OVER (PARTITION BY HostID,
                                         InstanceID,
					 DbName,
                                         SchemaName,
                                         TableName,
                                         cast (CollectionDate AS DATE)
                            ORDER BY
                               HostID,
			       InstanceID,
                               DbName,
                               SchemaName,
                               TableName)
                      AS RankID
              FROM Windows.TableStats
             WHERE CAST (CollectionDate AS DATE) = @runningDate)
   DELETE dupRows
    WHERE RankID > 1

   SET @runningDate = DATEADD (DAY, 1, @runningdate)
END

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.