The thrill of purging old data

I get a thrill from roller coasters, but for some people it just makes them purge their lunch.

DBAs & Data professionals tend to be digital pack rats. My personal Google Drive is filled with random files that I refused to part with–even though it’s of questionable value. I lament that I do not have a copy of every paper I typed in high school or college–and I envy those who do. You never know when you’ll need notes from that meeting 6 years ago, or when you’ll need that funny GIF you created for an inside joke you no longer remember.

Still, at work, we are often challenged with the need to delete or purge data. Old data doesn’t need to be retained forever. It may no longer be of use to the business, or it may be archived somewhere safe (just in case!). At some point, the cost of retaining the data exceeds the value of keeping it. In some cases, keeping it longer than necessary may even be a liability–in a time when data breaches are so common, hoarding data makes the inevitable breach much more painful. Of course, there’s GDPR to consider–EU citizens can request to have their data removed from your system.

Thus, we must eventually delete data. I’ll admit, I’m an adrenaline junkie for the rush I get from running TRUNCATE TABLE in production. It’s better than any drug I can buy on the black market. Sometimes I wonder if I could ever get hooked on truncating tables. However, this need to delete data comes with some practical complications, too.

It’s easier if you plan to delete data

If you’re with me that deleting data is inevitable, then it’s best to plan for it with your schema design. How long will you keep data before you delete it? Is the retention based on the creation of the data? Is it based on the last time it was modified? Maybe it’s based on whether a user is active or not (defining “active user” might turn into a whole other ordeal!)…

Regardless of the criteria, take a moment to think about this when you’re designing your schema. We usually get distracted by our primary responsibility of making sure data is always available when we need it–and we forget that we also need to make sure data is unavailable when it’s no longer needed. Are we collecting the right information when the data is created, so that we can delete it when the time comes?

If retention is based on creation time, we’ll need to make sure we retain a CreateDate (or whatever you want to call it…feel free to blow up the comments because of my terrible naming convention) on the row. If the retention is based on that “active user” scenario, maybe you need to track when a user last logged into the system… Most likely, it’s not the DBA or developer who will decide these rules–it’s the business users, legal department, or Product team who will make the decisions. We, as data professionals, need to make sure we ask the right questions at design time to ensure that we can accommodate the business requirements later. Should it be deleted after a certain amount of time? What is that amount of time? What action starts (or resets) the clock for deletion?

Having those conversations up front lets you ensure you have the data & processes to delete it later. Have a new table, with three year data retention? Set up the process to delete it NOW, not in three years. You might have a new job in three years–kicking the can down the road isn’t a smart idea. Set up the purge process today even if it doesn’t have to do work for a while. That will ensure that your data is treated properly when the time comes.

You can’t delete everything

As fun as it is to delete stuff, you can’t delete everything. There’s a saying that “you can’t un-ring a bell”–and sometimes data is that bell. If you sold a widget, you probably need to retain all the details of that sale for a number of years for legal reasons. Finance people get really upset if last year’s financial data changes (or is unavailable) because I deleted some data. Oops.

No matter how much of a thrill you get from deleting stuff, you’ll need to handle every data point on a case-by-case basis. That means you’ll need somewhere to document these details, too. It doesn’t have to be fancy–but it has to happen, and it has to be consistent. Make a note of how long you need to keep the data, and who made that decision. It could be as simple as some comments accompanying the table definition in source control. Or using extended properties on the table to track it’s retention requirements.

The important thing is to think about it in advance, and keep track of those decisions.

Now, if you’ll excuse me, I’m going to go truncate dbo.SalesOrderDetail for an adrenaline rush.


Additional Reading