Code to purge data based on creation date

Only one thing can spark more joy than cleaning up old data. Doggos.

Yesterday, I talked about batching deletes, and the day before I talked about how much I like to delete data, and why it’s necessary to do it, despite being a data packrat.

Today, let’s look at just one scenario, and how I like to handle it. In my experience, this is one of the most common purge requirements (and probably the easiest), so first, let’s look at the business requirements.

Purge requirements

In this scenario, we’re going to keep the data for X days after it’s created. Then we delete it. That’s it. X could be 3 days or 3 years–it doesn’t matter, we’ll follow the same design pattern.
In today’s world, we generate loads of log data, sensor data, telemetry data, etc. All that data is super duper valuable. But only for a while. Eventually, all that granular data becomes less useful, and isn’t worth keeping around. Maybe it gets aggregated, summarized, or maybe it just gets thrown out.

You’ll have a lot of data with more complex requirements, but I think you’ll also be surprised at how much data has simple date-based retention based on its creation date.

Table design

When I design a table for this type of business requirement, I always like to use the same pattern, unless there’s some reason this design pattern won’t work:

1) Use an Identity column. If it’s log data, there’s not really a natural key, so using a surrogate key makes sense. Even if there does seem to be a natural key, it’s not a great choice here. The data life cycle is an important part of the data’s identity (see what I did there?), and the fact that the data is only being retained for X days is a good reason to skip the natural key, and use a surrogate key in the form of an Identity column. You should make the ID column the primary key & clustered index.

2) Include the date & time the data was inserted. We’re retaining data for X days after it’s created…so…we kinda need to know when it was created. Seems obvious, but I’ve seen a lot of tables where retention wasn’t considered when the table was created, and the retention requirements only get figured out once the table is huge. You should create a non-clustered index on the creation date.

Both of these columns will be ever-increasing. That means that while data gets inserted to the “end” of the table, we can delete from the “beginning” of the table–and we’ll have two indexes to help us identify what to delete & what to keep. Why don’t I just use the datetime column, and skip the identity column? Date/time makes for a terrible primary key. Sometimes two rows get inserted at exactly the same time, and then your uniqueness goes to hell. ID columns solve that.
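Here’s a minimal sketch of what that pattern looks like. The table, column, and index names here are just placeholders for illustration:

CREATE TABLE dbo.AppLog (
    ID bigint IDENTITY(1,1) NOT NULL,
    CreateDate datetime2(0) NOT NULL
        CONSTRAINT DF_AppLog_CreateDate DEFAULT (SYSDATETIME()),
    LogMessage nvarchar(max) NULL,
    CONSTRAINT PK_AppLog PRIMARY KEY CLUSTERED (ID)
);

--Non-clustered index on the creation date, to support finding rows to purge
CREATE NONCLUSTERED INDEX IX_AppLog_CreateDate ON dbo.AppLog (CreateDate);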

If you’re looking for a real world example of someone smarter than me using the design pattern, check out dbo.CommandLog from Ola Hallengren’s SQL Server Maintenance Solution.

CREATE TABLE [dbo].[CommandLog](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [DatabaseName] [sysname] NULL,
    [SchemaName] [sysname] NULL,
    [ObjectName] [sysname] NULL,
    [ObjectType] [char](2) NULL,
    [IndexName] [sysname] NULL,
    [IndexType] [tinyint] NULL,
    [StatisticsName] [sysname] NULL,
    [PartitionNumber] [int] NULL,
    [ExtendedInfo] [xml] NULL,
    [Command] [nvarchar](max) NOT NULL,
    [CommandType] [nvarchar](60) NOT NULL,
    [StartTime] [datetime] NOT NULL,
    [EndTime] [datetime] NULL,
    [ErrorNumber] [int] NULL,
    [ErrorMessage] [nvarchar](max) NULL,
    CONSTRAINT [PK_CommandLog] PRIMARY KEY CLUSTERED ([ID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
);

You’ll notice the ID column for a surrogate key, and the StartTime column for the data creation time. Ola doesn’t include a non-clustered index on StartTime, but I add one to assist in my cleanup. Ola’s code ensures StartTime gets populated reliably; depending on your situation, you might want a default constraint or trigger to guarantee the same.
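If you want to add that non-clustered index yourself, it’s a one-liner (the index name here is just my own convention, not something Ola’s scripts create):

CREATE NONCLUSTERED INDEX IX_CommandLog_StartTime
    ON dbo.CommandLog (StartTime);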

Time to clean up

Let’s look at the algorithm we want to use to do the cleanup. For the sake of demonstration, let’s use Ola’s CommandLog table. And let’s assume we need to retain data in the log for 60 days. Let’s also assume we’ve done some testing and determined that we want to delete in batches of 5000 rows. And like I discussed yesterday, we want to take a break between batches to prevent blowing up the transaction log, so we’ll wait for half a second between each delete.

Step 1) Find the range of ID values we want to delete. We’ll use that date/time column to find the range of IDs.

DECLARE @StartID int, @MaxID int;

--StartID is where we start, and MaxID is the maximum ID that we want to delete
SELECT @StartID = MIN(ID), @MaxID = MAX(ID) 
FROM dbo.CommandLog
WHERE StartTime < DATEADD(DAY,-60,GETDATE());

Step 2) First, let’s set up our loop to do the batches. This is the hard part, so I like to get the logic sorted out before I figure out what the actual work/delete statement is.

--We'll start at the start, and increment up until we hit that MaxID
WHILE (@StartID < @MaxID)
    BEGIN 
        --increment @StartID by our batch size
        SELECT @StartID = @StartID + 5000;
        --Delete stuff
        --Wait for half a second before looping again
        WAITFOR DELAY '00:00:00.5'
    END;

Step 3) OK, now we need to write a DELETE statement. That --Delete stuff comment looks great, but it’s not very functional.

DELETE TOP (5000) x 
FROM dbo.CommandLog AS x 
WHERE x.ID < @StartID AND x.ID <= @MaxID;

And now we just have to stitch that all together…

DECLARE @StartID int, @MaxID int;

--StartID is where we start, and MaxID is the maximum ID that we want to delete
SELECT @StartID = MIN(ID), @MaxID = MAX(ID) 
FROM dbo.CommandLog
WHERE StartTime < DATEADD(DAY,-60,GETDATE());

--We'll start at the start, and increment up until we hit that MaxID
WHILE (@StartID < @MaxID)
    BEGIN 
        --increment @StartID by our batch size
        SELECT @StartID = @StartID + 5000;
        --Delete stuff
        DELETE TOP (5000) x 
        FROM dbo.CommandLog AS x 
        WHERE x.ID < @StartID AND x.ID <= @MaxID;
        --Wait for half a second before looping again
        WAITFOR DELAY '00:00:00.5'
    END;

Now you just have to schedule it. Set up a SQL Agent job to run once a week and delete that data.
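If you want a starting point for that, something along these lines creates a weekly Agent job. The job name, schedule, and the dbo.Purge_CommandLog wrapper procedure are all hypothetical; put the batched delete script (or a procedure that wraps it) in the job step, and adjust the schedule to taste:

USE msdb;
GO
--Create the job
EXEC dbo.sp_add_job
    @job_name = N'Purge - CommandLog';

--Job step: run the purge (dbo.Purge_CommandLog is a hypothetical wrapper for the script above)
EXEC dbo.sp_add_jobstep
    @job_name = N'Purge - CommandLog',
    @step_name = N'Delete rows older than 60 days',
    @subsystem = N'TSQL',
    @database_name = N'DBA',
    @command = N'EXEC dbo.Purge_CommandLog;';

--Schedule: weekly, Sundays at 2am
EXEC dbo.sp_add_jobschedule
    @job_name = N'Purge - CommandLog',
    @name = N'Weekly - Sunday 2am',
    @freq_type = 8,                --weekly
    @freq_interval = 1,            --Sunday
    @freq_recurrence_factor = 1,   --every 1 week
    @active_start_time = 020000;   --02:00:00

--Target the local server
EXEC dbo.sp_add_jobserver
    @job_name = N'Purge - CommandLog',
    @server_name = N'(LOCAL)';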

I have an easy button

I found myself writing code that copies this pattern to purge data over and over and over again. Eventually, I automated it. In my DBA database on GitHub, I’ve created a stored procedure that will handle deletes for this exact scenario.

Instead of writing code specific to the table, I just need to execute my Cleanup_TableByID procedure:

EXEC dbo.Cleanup_TableByID
    @DbName = 'DBA',
    @SchemaName = 'dbo',
    @TableName = 'CommandLog',
    @DateColumnName = 'StartTime',
    @IDColumnName = 'ID',
    @RetainDays = 60,
    @ChunkSize = 5000,
    @LoopWaitTime = '00:00:00.5',
    @Debug = 0;

Take a look at my Cleanup_TableByID procedure, and install my DBA database. Installation is easy (directions are in the readme), and it’s free (it’s published under the GPL).


5 Comments

  1. Good stuff Andy! I have to do this often on non-logging tables that are very active. What I end up doing is putting the clustered index key(s) for the records to be deleted into a temp table, and then creating a clustered index on the temp table that matches the base table. I then delete in batches based on a join between the temp and base tables, using the mutual clustered index key(s) as the join criteria. (There’s a rough sketch of this approach after the comments.)

  2. You never explicitly say it – but let’s say I had such a log table with no StartDate.

    To remedy would you?
    – Add StartDate column with a date of today
    – Set job to delete when StartDate is 60 days ago
    – wait 2 months (just to be sure we don’t delete something we need)
    – enjoy nightly deletes .. eventually

    In this scenario – would you also add in a limiter on your batched delete, be it time based or rowcount based, to prevent that fateful day 60 days in the future from running longer than my maintenance window?

    I think I would – at least temporarily.

    • If I don’t have StartDate, and need to delete based on that, I would absolutely follow the steps you list. Depending on how much pre-existing data I have, I may choose to do a one-time manual cleanup to clear out older data before letting the automation take over.

      That one-time purge requires estimating “90 days ago” based on the ID values & rate of insert. Additionally, depending on how much data is being purged in your initial cleanup, it might be easier to do the Truncate Dance (sketched after the comments):
      * Pull ~60-90 days of data out into a separate table
      * Truncate the existing table
      * Put the ~60-90 days of data back

  3. How would you handle situations where you can’t really delete _everything_ based just on time, but have to delete a person and all of their activity once they haven’t had any activity in ### days? Until that time, we want to keep their data.
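Regarding the temp-table approach in comment 1 above: here’s a rough sketch of what that pattern could look like. The dbo.UserActivity table, its ActivityID clustered key, and the CreateDate column are all made-up names for illustration, and the batch size & delay are just examples:

--Stage the clustered key values for the rows to be deleted
SELECT a.ActivityID
INTO #ToDelete
FROM dbo.UserActivity AS a
WHERE a.CreateDate < DATEADD(DAY,-60,GETDATE());

--Clustered index on the temp table that matches the base table's clustered key
CREATE CLUSTERED INDEX CX_ToDelete ON #ToDelete (ActivityID);

--Delete in batches, joining on the mutual clustered key
DECLARE @RowCount int = 1;
WHILE (@RowCount > 0)
    BEGIN
        DELETE TOP (5000) a
        FROM dbo.UserActivity AS a
        JOIN #ToDelete AS d ON d.ActivityID = a.ActivityID;
        SELECT @RowCount = @@ROWCOUNT;
        --Wait for half a second before looping again
        WAITFOR DELAY '00:00:00.5';
    END;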
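And for the one-time “Truncate Dance” I mention in my reply above, a rough sketch against CommandLog might look like this. The holding table name is made up, and this assumes you can do it during a quiet window when nothing is writing to the table:

--1) Pull the rows you want to keep out into a separate table
SELECT *
INTO dbo.CommandLog_Keep
FROM dbo.CommandLog
WHERE StartTime >= DATEADD(DAY,-60,GETDATE());

--2) Truncate the existing table
TRUNCATE TABLE dbo.CommandLog;

--3) Put the kept rows back, preserving the original ID values
SET IDENTITY_INSERT dbo.CommandLog ON;
INSERT INTO dbo.CommandLog
    (ID, DatabaseName, SchemaName, ObjectName, ObjectType, IndexName, IndexType,
    StatisticsName, PartitionNumber, ExtendedInfo, Command, CommandType,
    StartTime, EndTime, ErrorNumber, ErrorMessage)
SELECT ID, DatabaseName, SchemaName, ObjectName, ObjectType, IndexName, IndexType,
    StatisticsName, PartitionNumber, ExtendedInfo, Command, CommandType,
    StartTime, EndTime, ErrorNumber, ErrorMessage
FROM dbo.CommandLog_Keep;
SET IDENTITY_INSERT dbo.CommandLog OFF;

DROP TABLE dbo.CommandLog_Keep;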
