Removing duplicate rows in small batches based on date column

Due to double scheduling, some duplicate rows were inserted into the Windows.TableStats table in JiMetrics. To confirm that the table has duplicates, here is the T-SQL script I used. Note that this same script should work in all other major RDBMS platforms like MySQL, Oracle, etc. Adjust table and column names to fit your needs.

[code language=”sql”]
SELECT HostID,
InstanceID,
DbName,
SchemaName,
TableName,
CAST (CollectionDate AS DATE),
COUNT (*)
FROM Windows.TableStats
GROUP BY HostID,
InstanceID,
DbName,
SchemaName,
TableName,
CAST (CollectionDate AS DATE)
HAVING COUNT (*) > 1
[/code]

So the duplicates need to be removed. In Microsoft SQL Server, that can be accomplished with a Common Table Expression (CTE) using the ROW_NUMBER ranking function, which is a pretty elegant solution.

However, when using DELETE to remove potentially large number of rows, say millions, it is always advisable to do that in small batches. Otherwise, you are risking running out of temp, log space, or even disk space, not to mention that you are potentially creating bigger and coarser locks than necessary.

So I decided to remove the duplicates on a daily basis to lessen the impact on the database instance. A runningDate variable is used and incremented by one until the desired end date. Within each iteration of the loop, that day’s duplicates are removed. Tweak it to suit your needs.

I think things like this make a good job interview question for a database administrator.

[code language=”sql”]
DECLARE @runningDate DATE
SET @runningDate = ‘20140101’

WHILE (@runningDate < '20140406') BEGIN WITH dupRows AS (SELECT RecordID, ROW_NUMBER () OVER (PARTITION BY HostID, InstanceID, DbName, SchemaName, TableName, cast (CollectionDate AS DATE) ORDER BY HostID, InstanceID, DbName, SchemaName, TableName) AS RankID FROM Windows.TableStats WHERE CAST (CollectionDate AS DATE) = @runningDate) DELETE dupRows WHERE RankID > 1

SET @runningDate = DATEADD (DAY, 1, @runningdate)
END
[/code]

The Ji Village News

Removing duplicate rows in small batches based on date column

Leave a Reply