Partitioning and existing Indexes
Question
I have a table that consumes almost 1.2 TB of space and has lots of historical data which are mostly irrelevant. Since performing bulk DELETE is going to cause a lot of heavy-lifting, I’m trying to use the concept of horizontal partitioning to create multiple partitions, each containing early data (for eg: 2017,2018… 2021) so that I can simply truncate those partitions which is less than an year, so that I can only retain the relevant operational data.
This is the plan of action that I have in my mind
- Create a partition function that would split the data on a yearly range
- Define the partition Scheme.
- Modify the table to use and apply partition
- Truncate partitions containing irrelevant historical data
I stated the above operations in SQL Server (except the truncate part) like
-> Create partition CREATE PARTITION FUNCTION Yearly_Exp (datetime2) AS RANGE RIGHT FOR VALUES ('2016-01-01 00:00:00', '2017-01-01 00:00:00', '2018-01-01 00:00:00' '2019-01-01 00:00:00', '2020-01-01 00:00:00', '2021-01-01 00:00:00'); -> Define the partition Scheme. CREATE PARTITION SCHEME Yearly_Scheme AS PARTITION Yearly_Exp ALL TO ([PRIMARY]); -> Modify the table to use and apply partition CREATE UNIQUE CLUSTERED INDEX Partition_Ind ON existing_table(some_timestamp_col, id) ON Yearly_Scheme(some_timestamp_col);
Now coming to my actual question, since the table has been around for a while, there are pre-existing indexes in the table. To be more precise the table has
- Two non-clustered indexes in which one of the indexes has my partitioning key as the index column
- One clustered Index applied through the Primary Key constraint.
Should I take any action to alter these indexes prior to partitioning? (Since there is an index already present with the partitioning key as the indexing column) or should I leave them as it is?
Does the index performance gets affected after partitioning?
asked 2021-09-12 by Haagenti
Answer
To summarize your goals, you want to avoid the "heavy lifting" of large or batched deletes to minimize blocking & preserve uptime of your application.
I don’t think your current plan will accomplish that goal
I don’t think your current plan will accomplish that goal, because partitioning involves re-clustering the table, which is a size-of-data operation and can be quite impactful.
SQL Server can only have one clustered index. Your table already has a clustered index on your PK. Therefore, you would need to drop the existing clustered index (temporarily creating a heap), then add the new clustered index, then add a non-clustered PK constraint to replace the dropped clustered PK. The time while the table is temporarily a heap would have a performance impact, until both the new PK & new Clustered index are created.
But the operation would also cause all non-clustered indexes to be re-written twice. The way non-clustered indexes are stored, they contain the clustering key or a Row ID (for heaps) to point back to the rest of the row. Dropping the existing clustered index would re-write the non-clustered indexes to change the clustering key to a RID, then adding the new clustered index would re-write them again to change the RID reference to the new clustering key. The re-writing of all the non-clustered indexes would happen in a single transaction, which on a large table would be a long/large transaction and could contribute to significant log growth.
Functionally, you could add steps to drop the existing clustered PK, and then add it back as a non-clustered PK, and your plan might produce the desired end result, but would be more impactful than you intend.
Instead, you should consider migrating to a new table.
Instead, you should consider migrating to a new table. I wrote a detailed blog post about Changing a column from int to bigint, without downtime, but the same process can be used for your need.
- Create a new, empty table that has your new (partitioned) structure.
- Create a trigger on the old table to double-write inserts/updates/deletes to the new table.
- Migrate the data just you want to keep from the old table to the new table. If there is data you want to purge now, don’t bother migrating it. Use batches to traverse the PK to copy all data to the new table (This is easiest with an
identity
column, but possible with any PK). - Rename the old table to a new name, and rename the new table to the original name.
- Drop the old table.
This is still a (single) table rebuild, but preserves uptime by making the "rebuild the whole table" process a single background process. Building the new table can be done in the days or weeks in advance of the table rename, in a resumable, controllable fashion. It also reduces the impactful "switch" to a relatively fast table rename, rather than several large index rebuilds.
answered 2021-09-12 by Andy Mallon