When STRING_SPLIT() falls short

Aaron attempting a STONE_SPLIT at a recent curling competition.

This morning, my friend Aaron Bertrand (blog|twitter) posted a plea for help with STRING_SPLIT() improvements. I’m very much in Aaron’s corner on this, having written about the olden days myself.

Why STRING_SPLIT() is supposed to be awesome

CLR was fast, but inaccessible to many DBAs. XML was almost as good, but ugh, XML. And multi-statement table valued functions are evil but prevalent.

STRING_SPLIT() promised to be both super fast, and lift the code maintenance problem for DBAs who don’t know .NET and struggle with the crazy XML syntax.

Except, as Aaron explains, it falls short on everything but the most basic scenario.

My anecdote

At a prior gig, I helped support a database that backed an app written with (older versions of) Java & Hibernate. Java/Hibernate don’t support table-valued parameters, so it was a very common scenario to pass in a CSV of ID values, and split it in the database. Hibernate’s alternative is to parameterize the single ID, then iterate row by agonizing row over the set–which is painful and an even worse choice than a table-valued multi-statement split function.

After upgrading to SQL Server 2017 (my first project of the gig), we started moving our biggest “problem queries” to stored procedures, and making use of STRING_SPLIT(). Hibernate would continue to pass in its [n]varchar CSV of values, and then in the stored procedure, we’d throw in a CROSS APPLY STRING_SPLIT(@csv) or WHERE SomeID IN STRING_SPLIT(@csv). In most cases, we found that this approach didn’t give us the big awesome performance gain we were looking for. In some cases, the query got worse!

Worse, you ask?

Yep. Worse. I’m a little fuzzy on the details, and too busy/lazy to build the repro (if you have one, send it along & I’ll give you the credit for picking up my slack!), but here’s the gist. The @csv value might be a couple of hundred characters long, which would cause the function to use a return type of varchar with a length matching @csv. This made for some wonky memory grants, and added implicit conversions. While the implicit conversions went in the right direction (varchar–>int), it’s an extra step. And every extra step increases the number of possible plans, and increased the chances that you get a suboptimal plan–and we got suboptimal plans in spades!

We were also occasionally being bit by duplicate values in the CSV being returned twice in the result set–an application bug, sure, but one that the prior homemade function defended against.

The workaround in our case was easy. At the start of the procedure, we’d have something like this:

DECLARE @IdList TABLE (Id int);

INSERT INTO @IdList (Id)
SELECT DISTINCT value
FROM STRING_SPLIT(@csv);

Then, we’d use @IdList in the query, rather than an inline STRING_SPLIT(). Sure, its not that difficult of a workaround, but when you’ve got THOUSANDS of places where you’re looking to drop in this (theoretically) better-performing option, the extra step adds to complexity & slows down adoption.

Go vote!

Go vote on Aaron’s UserVoice post, but don’t JUST vote. Compelling user stories are the most valuable thing in convincing the SQL Server Product Team to make a change. Tell your own story for what STRING_SPLIT enhancements are important to you, either on Aaron’s blog post, or on this one.