Parsing file names from paths using T-SQL

These files are way neater than the ones on my laptop.

SQL Server is great at a lot of things. I spend a lot of time blogging about all the cool things that SQL Server does. But (you knew there was a “but”, didn’t you) SQL Server isn’t great at everything. Lets write about something that SQL Server is bad at.

SQL Server sucks at parsing strings

That’s not hyperbole. If you’ve tried to write your own function to parse CSVs into a table, you know it’s painful. Judging by the number of comments on this other article about splitting CSVs, this is clearly a topic that people spend a lot of time on. Thankfully, SQL Server 2016 finally introduced a SPLIT_STRING() function that performs well and handles most use cases….but we’re not here to split CSVs.

SQL Server sucks at parsing strings

Yea, I really can’t say it enough. The string manipulation & analysis functions that SQL Server does provide tend to be clunky. I can never remember how they work. Why does CHARINDEX() take three parameters, but PATINDEX() only takes two parameters? Why do those functions only work left-to-right, but not right-to-left? Oh, there’s a REVERSE() function I can apply to the string, but if I reverse the string to search right-to-left, I have to reverse it back on output.

SQL Server sucks at parsing strings

I feel like any time I need to do anything remotely involved, I end up with nested functions. Sometimes a lot of nested functions. Sometimes you have to repeat the same bits of function multiple times. It can make your code hard to read, especially if you don’t already know what it’s supposed to be doing.

Why are we here again?

Oh, right, parsing file & directory names. As a DBA, there seems to be a constant trickle of automation where I need to parse a file name out of a full path. PowerShell does a much better job at this than T-SQL. Go ahead and write a comment below that I should use PowerShell (or Python, or .NET, or some other language) to parse file names. No, really… go ahead, I’ll wait…

Look, if I’m in the database already, sometimes it’s easier to just use T-SQL. It’s convenient. So let’s parse this stuff in T-SQL, even though we know there’s a better way. While I’m being defensive about my coding, I’m only going to deal with Windows paths. Even though SQL Server runs on Linux now, this code won’t work with Linux & their backwards (uhhh….forwards) slashes.

First, let’s take a sample file path that we can use to practice parsing. I’m going to use this file (that I recently wrote about) on my laptop: C:\Users\Andy\Documents\GitHub\dba-database\stored-procedures\dbo.Alert_Blocking.sql

Starting with plain English

Writing & reading code is easier if you understand the logic before attacking the code. I find this to be particularly important when you anticipate complicated code. SQL Server sucks at parsing strings, so I anticipate complicated code.

How do you identify the directory from a file path? That’s just everything up to the last slash–and I like to include that final slash to make it clear it’s a directory.

How do you identify the file name from a file path? That’s just everything after the final slash.

The key here is going to be identifying that final slash, and grabbing the text on either side.

Time to parse strings

We’re going to use CHARINDEX to find the position of the final slash…and here’s the trick… we want to start at the end and work right-to-left. So we’re actually getting the the position from the tail end. We do this because there could be a bunch of slashes in the path, and if we attack this left-to-right, we would need to loop through finding slashes until we stopped finding them.

In this example, we see that the string is 84 characters long, and the final slash is 23 characters from the end:

Lets pop that into a CTE, and use it to grab the stuff from the left & right sides of that slash. Heck, let’s use the LEFT and RIGHT functions to do it.

To get the directory path, we’ll apply the LEFT function to the FilePath, but for how many characters? The total length minus the stuff after the final slash…plus one if we want the directory to include that final slash.

To get the file name, we’ll apply the RIGHT function to the FilePath, and since FinalSlashPos is counting from the right, we’ll use that for the number of characters….minus one so that we don’t include the slash.

We’ve got this far, being very careful to avoid those off-by-one errors with our position:

While we’re parsing stuff, we can parse the file name into the file extension and a “bare” file name.

Getting the file extension is pretty easy. Remember all the logic we just did to parse the file name out of the full path? Do the same thing over again but using a dot instead of a slash.

You might have guessed–getting the bare file name is similar to getting the directory from the path.

Now here’s the catch: This code assumes that you’re actually parsing a Windows file path, and the files all have extensions. In short, if your value is not LIKE '%\%.%', then the code is going to error.

Making the code useful

I don’t parse file paths all that often. In fact, every time it comes up, I usually forget the finer points and have to re-learn how it works. I’m pretty lazy, so I don’t want to have to keep re-learning the same thing. I’d rather put the code in my DBA database, and then just call some function and know that it works. That’s exactly what I’ve done.

I created an inline table-valued function called dbo.ParseFilePath(). Now, if I need to parse files & directories out of paths, I can just APPLY this function.

SQL Server might suck at parsing strings, but at least it’ll be easier next time.

 

12 Comments

  1. I think this would be simpler (and maybe faster) if you replaced:

    FileExtension = RIGHT(FilePath, CHARINDEX(‘.’, REVERSE(FilePath)) -1),
    with:
    FileExtension = PARSENAME(FilePath, 1),

    • Hey Simon– I possibly could use `PARSENAME` for the file extension in most cases, but following the same parsing pattern is also simpler (albeit in a different way than perhaps you use the word). Performance-wise, why don’t you give it a try and report back? (Be sure to make sure it works on files with many periods in the file name– Server.database.full.year.month.day.hour minute.second.compressed.BAK is a silly, but perfectly valid filename.)

    • @SimonCBirch,

      Excellent suggestion on using PARSENAME but the “ObjectName” operand of PARSENAME is a pretty good clue as to why you might not want to use PARSENAME to try to isolate the extension. The “ObjectName”, as it is in many functions, is of the datatype “SYSNAME”, which is an alias for NVARCHAR(128). If your full file path is longer than that (and it frequently is nowadays), PARSENAME won’t work correctly. Further, if there is no extension, which also means there is no “dot” in the path name, PARSENAME([objectname],1) will return the entire “ObjectName”.

  2. Maybe you should consider naming the function dbo.ParseFullFilePath(). Cause it fails for short relative file paths like ‘test.sql’. Also it parses imho .\files.tar.gz incorrectly. Cause it has two extensions, but returned as one, it should be .tar.gz instead of .gz.

    • Should I hard-code .tar.gz to handle the double-file extension, to avoid breaking the function for other files that just happen to have a dotted file name (like dbo.Alert_Blocking.sql, which I use in the examples)?
      Even the, .tar.gz is a convention for files that are tarballed, then g-zipped. However, technically (pedantically) speaking, these are two separate file extensions that represent two separate things. You would tarball MyStuff into MyStuff.tar. Then you would g-zip that into MyStuff.tar.gz. To un-do the process, you need to first un-g-zip, then un-tarball. Sure, some applications will do both steps to make it easier on you, but it remains a two-step process.

      • Double file extensions are common, perhaps more common than files with period characters in the file name, though the distinction is not objective.
        There are many types of so-called “tar ball”: (TAR.GZ / TAR.BZ[2] / TAR.XZ), so I would propose that if the file name contains ‘.tar.’, that ‘tar’ and all after represent the extension. More safely, consider ‘tar’ only if it is the second-to-last extension, which will allow future compression formats, e.g. ‘foo.tar.abc’, and non-conforming names such as ‘foo.tar.gzip’

  3. This is a perfect example where SQL CLR is much better suited than T-SQL for working a problem. File properties such as name, create date, modified date, extension, path, read-only attribute, archive bit… all of these and more are available directly in .NET via CLR, without the hassle or risk of error with a roll-your-own solution. There are certainly security questions to be answered before enabling CLR, yet despite years of secure deployments, resistance still exists to using it. (A pressing need for a robust TVF that would return these properties for files from a passed-in directory path was one reason why I decided to dive into CLR a few years back. While it took a bit of one-time fiddling in SSMS with keys, logins, and assemblies, it did not, as some people think, open a glaring security portal or require setting TRUSTWORTHY ON.)

  4. If you change the FinalSlashPos calculation to CHARINDEX(‘\’, REVERSE(‘\’ + @FilePath), 1), you’ll eliminate errors when a filename is passed without a path.

  5. Parsing Windows File Services?

    Yeah, PowerShell is light years ahead of this, I guess this works, but its really too much work
    How about something like this (or 6 other PowerShell ways to parse files):

    PushD ~MyNetwork-Drive~ # Connect to a folder Windows knows about
    $Here = Get-Location # Grab the location of this drive and foldername
    $This-Drive-Letter = Split-Path -Path $(Get-Location) -Qualifier # Grab only the drive letter of this location
    $This-Folder-Name = Split-Path -Path $Get-Location) -NoQualifier # Grab only the foldername of this location

    C:\PS> Dir | Select FullName, Name, Length, LastWriteTime | Format-Table -Autosize # Grab the files in this folder, list the full pathname, filename, file in bytes, and date of last change – present in columns

    And like I say, there’s 5 other ways to bring all this goodness out to use.

    Good luck, but first, make a short visit to your friendly PowerShell web site to pick up some tips there.

    Regards-

    • Yup. Like I said in the post, “PowerShell does a much better job at this than T-SQL.”
      Sometimes it makes more sense to walk someplace instead of driving the Maserati.

    • Wait, Powershell is better at parsing files? T-SQL is terrible at it? Wow, it’s almost like Andy didn’t mention that in the first page of the post.

      Oh, and I have this relatively small table and I need to parse some file paths. Now I have about 6 pages of T-SQL code around it for other purposes including some fairly complex queries that T-SQL is really good at. Is your suggestion that I re-write all of that code into Powershell? Or maybe run some T-SQL, then switch to Powershell, and then go back to T-SQL? Note: I’m not actually going to be using the results for anything, just storing them in other columns. Oh, and I should mention my “small” table has ~30 million rows. How is the performance on Powershell going to be at that volume?

      Now please understand, I’m not knocking Powershell. It’s an amazing tool. Knowing how to parse a path in Powershell is important, as is knowing Powershell in general. But it’s also important to know how to do things multiple ways, and to choose the best tool for the specific job you are working.

      Regards
      Kenneth

1 Trackback / Pingback

  1. Dealing With String Parsing In T-SQL – Curated SQL

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.