I hate sp_estimate_data_compression_savings


If you’re working with compressed indexes, SQL Server provides a system stored procedure to help test the space savings of implementing data compression: sp_estimate_data_compression_savings. Starting in SQL Server 2019, it can even be used to estimate savings with columnstore.

I really don’t like sp_estimate_data_compression_savings. In fact, I kind of hate it. It’s not always very accurate–and even when it is accurate, the results can be misleading. Before I get ranty about why I don’t like it, let’s look at it in action.

First, some data to play with

We’ll create a fresh new database, with a fresh table, and pump it full of data:

USE master
-- Create a DB to play in
DROP DATABASE IF EXISTS CompressionTest;
GO
CREATE DATABASE CompressionTest;
ALTER DATABASE CompressionTest 
    SET RECOVERY SIMPLE;
GO
USE CompressionTest
GO
CREATE TABLE dbo.ReallyBigTable (
    ReallyBigTableID bigint identity(1,1),
    StoreID tinyint,
    CustomerID int,
    ProductID int,
    TransactionTypeID int,
    CONSTRAINT PK_ReallyBigTable PRIMARY KEY CLUSTERED (ReallyBigTableID)
);
GO
-- Load the table by cross joining sys.columns with itself;
-- GO 100 repeats the batch 100 times to pump in roughly 100 million rows
INSERT INTO dbo.ReallyBigTable (StoreID, CustomerID, ProductID, TransactionTypeID)
SELECT  StoreID             = c1.column_id,
        CustomerID          = c1.object_id,
        ProductID           = c2.object_id%10000,
        TransactionTypeID   = c2.system_type_id
FROM sys.columns AS c1
CROSS JOIN sys.columns AS c2;
GO 100

Estimate the space savings now

First, let’s take a look at the size of the table:

EXEC sp_spaceused ReallyBigTable;

The table is 104 million rows, and 2.99 GB.

Now, let’s take a look at what sp_estimate_data_compression_savings estimates for space savings with different levels of data compression:

EXEC sp_estimate_data_compression_savings 'dbo','ReallyBigTable',NULL,NULL,'NONE';
EXEC sp_estimate_data_compression_savings 'dbo','ReallyBigTable',NULL,NULL,'ROW';
EXEC sp_estimate_data_compression_savings 'dbo','ReallyBigTable',NULL,NULL,'PAGE';

A selection of the results estimates the sizes as follows:

Compression Level | Size with current compression (KB) | Estimated size with new compression (KB) | Estimated savings
NONE | 3120936 | 3107200 | 0.44%
ROW | 3120936 | 3120936 | 0.00%
PAGE | 3120936 | 1364376 | 56.28%

I’m not going to dispute any of those numbers… at least not right now. This table is a tidy little thing that was freshly built with very full pages… What happens if we play around with the data some?

SET NOCOUNT ON;
-- Prime @@ROWCOUNT so the loop body runs at least once
SELECT 1;
WHILE @@ROWCOUNT > 0
BEGIN
    -- Delete even-numbered IDs in batches of 1 million until none remain
    DELETE TOP (1000000) rbt
    FROM dbo.ReallyBigTable AS rbt
    WHERE ReallyBigTableID%2 = 0;
END;

I’ve essentially just deleted half the rows by deleting everything with an even ID number. This is a good way to simulate low page density. (In real life, you’d get here through “normal” means of page splits and whatnot.) If we look at the size of the table now:

EXEC sp_spaceused ReallyBigTable;

We can see that the number of rows is half what it was at the start (as intended), but the size hasn’t changed any. Fewer rows in the same amount of space indicates we’ve got low fill on these pages. We can confirm that by looking at the index stats. This query confirms we’re at 49.8% page fullness:

SELECT avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats  
    (DB_ID(), OBJECT_ID(N'dbo.ReallyBigTable'), NULL, NULL , 'DETAILED')
WHERE index_level = 0;  

OK, now that I’ve messed with the page fullness, let’s look at the index compression estimates again. This gives a better “real life” scenario of running this procedure on Prod, where your indexes are likely not all tidy, freshly rebuilt and perfect:

EXEC sp_estimate_data_compression_savings 'dbo','ReallyBigTable',NULL,NULL,'NONE';
EXEC sp_estimate_data_compression_savings 'dbo','ReallyBigTable',NULL,NULL,'ROW';
EXEC sp_estimate_data_compression_savings 'dbo','ReallyBigTable',NULL,NULL,'PAGE';

Compression Level | Size with current compression (KB) | Estimated size with new compression (KB) | Estimated savings
NONE | 3120936 | 1547360 | 50.42%
ROW | 3120936 | 1711280 | 45.17%
PAGE | 3120936 | 1452264 | 53.47%

Now these numbers look pretty odd. A few things stand out:

  • It looks like all three compression levels will save about the same amount of space.
  • With half as many rows, the page compression estimate is larger than the prior estimate for the full-sized table.
  • The table currently has no compression, but estimating for NONE shows significant space savings.

I don’t trust those estimates

Just like I don’t fully trust estimated execution plans, I don’t fully trust estimated compression savings. I want to see the actual results. Let’s rebuild my table with each compression level, and see what the size is after each rebuild:

ALTER INDEX PK_ReallyBigTable ON dbo.ReallyBigTable REBUILD WITH (DATA_COMPRESSION=NONE);
EXEC sp_spaceused ReallyBigTable;
ALTER INDEX PK_ReallyBigTable ON dbo.ReallyBigTable REBUILD WITH (DATA_COMPRESSION=ROW);
EXEC sp_spaceused ReallyBigTable;
ALTER INDEX PK_ReallyBigTable ON dbo.ReallyBigTable REBUILD WITH (DATA_COMPRESSION=PAGE);
EXEC sp_spaceused ReallyBigTable;

Compression Level | Actual size (KB) | Previous Estimate (KB) | Variance from Estimate
NONE | 1556496 | 1547360 | 0.6%
ROW | 850192 | 1711280 | 101.3%
PAGE | 639624 | 1452264 | 127.0%

Well, I was right–those were some terrible estimates. How can they be off by more than double?

Why are the estimates so bad?

Estimated compression savings are estimates–they aren’t intended to be exact. Like with any estimate, it’s somewhere between a guess and a calculation. That calculation/guess is only going to be as good as the data it’s based on.

The sp_estimate_data_compression_savings procedure works by collecting a sampling of data, compressing it, and extrapolating the results. The procedure’s algorithm is roughly:

  • Create a temp table called #sample_tableDBA05385A6FF40F888204D05C7D56D2B (yes, that exact name is hard-coded into the procedure) that matches the specified table in schema & compression
  • Insert a sample of data from the specified table into the temp table. If the table (the actual row data: clustered index or heap) is less than 5000 pages, then use a 100% sample, else calculate the sample percent using the formula: 100 * 5000 / used page count
    • The exact rows in the sample are determined by the undocumented table-valued function sys.generate_table_sample_ddl(), which is hidden away in the resource database.
    • sys.generate_table_sample_ddl() simply uses SELECT...FROM [table] TABLESAMPLE (N PERCENT) to return that sample.
  • Create the specified index(es) on the temp table
  • Check the size of the sampled temp table & its indexes
  • Compress with the specified compression level
  • Get the newly compressed size of the table & indexes
  • Extrapolate the sample size/results out to the full table results
  • Clean up temp objects
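
To make that flow a bit more concrete, here’s a rough, hand-rolled sketch of the sampling math against the demo table from this post. This is illustrative only, not the procedure’s actual code: the 5000-page threshold and TABLESAMPLE pattern come from the behavior described above, but the scratch-table definition is mine, and reading the used page count from sys.dm_db_partition_stats is my approximation of wherever the procedure actually gets it.

-- Illustrative sketch only: approximating the procedure's sampling math,
-- not its actual internal code. Assumes the dbo.ReallyBigTable demo table.
DECLARE @used_pages bigint,
        @sample_pct decimal(5,2),
        @sql        nvarchar(max);

-- Row data only: the heap (index_id = 0) or clustered index (index_id = 1)
SELECT @used_pages = SUM(used_page_count)
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.ReallyBigTable')
    AND index_id IN (0, 1);

-- 100% sample for small tables; otherwise aim for roughly 5000 sampled pages
SET @sample_pct = CASE WHEN @used_pages <= 5000 THEN 100
                       ELSE 100.0 * 5000 / @used_pages END;

-- Pull the sample the same basic way the procedure does (TABLESAMPLE),
-- into a scratch table that could then be indexed, compressed, & measured
CREATE TABLE #sample (
    ReallyBigTableID bigint NOT NULL,
    StoreID tinyint,
    CustomerID int,
    ProductID int,
    TransactionTypeID int
);

SET @sql = N'INSERT INTO #sample
             SELECT ReallyBigTableID, StoreID, CustomerID, ProductID, TransactionTypeID
             FROM dbo.ReallyBigTable TABLESAMPLE ('
         + CAST(@sample_pct AS nvarchar(10)) + N' PERCENT);';
EXEC sys.sp_executesql @sql;

SELECT @used_pages AS used_pages,
       @sample_pct AS sample_percent,
       COUNT(*)    AS sampled_rows
FROM #sample;

Against the 390,117-page demo table, that math lands on roughly a 1.28% sample, which is the number I complain about below.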

I have two big problems with this algorithm:

  1. The sampling is unscientific, and you aren’t able to influence sample size. In my example, the table is 3,120,936 KB, or 390,117 pages. Using the formula above, that is 1.28%. If you’re familiar with statistics in SQL Server, then you probably know that under-sampling can cause performance problems, and selecting an ideal sample size can be a complicated task. Similarly, a low-sampled compression estimate may not represent your data well, and can result in bad estimates.
  2. The pre-compression sizes are based on current index size, but the post-compression sizes are based on ideally-rebuilt indexes. This makes for an apples-to-oranges comparison when trying to judge your compression savings. To do an apples-to-apples comparison, you really want all size comparisons to be based on freshly rebuilt indexes.

How do I estimate space savings?

I don’t estimate. I test it for real.

My ideal way to test compression space savings is to restore the database to a non-production server, then perform index rebuilds at NONE, ROW, and PAGE compression levels. Then I can use the actual space savings to compare my current compression setting to the other two and calculate real-world results. Having the full database restored in non-production with the changed compression level also lets me do some performance testing. I want to see how the changed compression settings affect both my index/statistics maintenance jobs and the standard workload that uses the table. Compression introduces CPU overhead on both reads & writes, so you’ll want to test that aspect and weigh the performance impact against the size benefit.
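
If you want a starting point for that, here’s a minimal sketch of the rebuild-and-measure loop against the demo table from this post. It assumes a restored, non-production copy of the database; the #results table, the cursor over the three compression levels, and the used-pages × 8 KB size calculation are all just my choices for illustration, not anything built in.

-- Minimal sketch: rebuild at each compression level and record the actual size.
-- Run this against a restored, non-production copy of the database.
IF OBJECT_ID('tempdb..#results') IS NOT NULL
    DROP TABLE #results;
CREATE TABLE #results (
    compression_level nvarchar(10),
    size_kb           bigint
);

DECLARE @level nvarchar(10),
        @sql   nvarchar(max);

DECLARE level_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT v.compression_level
    FROM (VALUES (N'NONE'), (N'ROW'), (N'PAGE')) AS v(compression_level);

OPEN level_cursor;
FETCH NEXT FROM level_cursor INTO @level;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Rebuild the index at this compression level...
    SET @sql = N'ALTER INDEX PK_ReallyBigTable ON dbo.ReallyBigTable '
             + N'REBUILD WITH (DATA_COMPRESSION = ' + @level + N');';
    EXEC sys.sp_executesql @sql;

    -- ...then record the real, post-rebuild size (used pages are 8 KB each)
    INSERT INTO #results (compression_level, size_kb)
    SELECT @level, SUM(used_page_count) * 8
    FROM sys.dm_db_partition_stats
    WHERE object_id = OBJECT_ID(N'dbo.ReallyBigTable');

    FETCH NEXT FROM level_cursor INTO @level;
END;
CLOSE level_cursor;
DEALLOCATE level_cursor;

SELECT * FROM #results;

From there, the same restored copy is where I’d run the maintenance-job and workload tests before committing to a compression change in production.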