Using regular expressions to manipulate data in a SPARQL query

A pure, standards-compliant SPARQL query.

xkcd frame

I have often lamented that SPARQL’s REGEX function only returns a boolean value. It’s handy in FILTER tests because it lets you use regular expressions to create more complex conditions about the results that you do or don’t want returned by a query. Instead of just returning True or False, though, I wished that it would let me grab the pieces of a string that match the regular expression pattern and recombine them into new values, like I can with the regular expression support of most programming languages.

I only recently noticed that SPARQL’s REPLACE function, which comes right after REGEX in the SPARQL query specification, supports regular expressions, so I can do this regex string manipulation in SPARQL after all.

JavaScript is one of the languages whose regular expression support lets you do that. In “Calling your own JavaScript functions from SPARQL queries” I showed how once you write a JavaScript function that does some regex string manipulation, you can then call that function from a SPARQL query being executed with Jena ARQ. (Soon I’ll be showing how to do that with GraphDB on the Ontotext blog.) The demo in my earlier blog entry used a regular expression in a JavaScript function to normalize some U.S. phone numbers.

The SPARQL query below demonstrates why I didn’t need to call those JavaScript functions. Using SPARQL’s REPLACE function and the same input data as that demo, I can normalize the same phone numbers using nothing but pure W3C-compliant SPARQL.

PREFIX rdfs: <>
PREFIX v:    <>

SELECT ?name ?phoneNum ?fixedPhone
WHERE {
  ?s v:given-name ?name ;
     v:homeTel ?phoneNum .
  BIND (replace(?phoneNum, ".*(\\d\\d\\d).*(\\d\\d\\d).*(\\d\\d\\d\\d).*",
                "$1-$2-$3") AS ?fixedPhone)
}

The regular expression in the replace() function call’s second argument looks for two three-digit sequences and then a four-digit sequence, ignoring everything before, after, or in between. Then it returns the found strings separated by hyphens.
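To experiment with the pattern without loading any data, a query can supply its own test strings inline. This is only a minimal sketch: the two VALUES strings are copied from the sample data in this entry, and the query deliberately has no prefixes or triple patterns.

```sparql
# Try the pattern on inline test strings; no data file or prefixes needed.
SELECT ?phoneNum ?fixedPhone
WHERE {
  VALUES ?phoneNum { "1 (203) 446-5478" "9232765135" }
  # $1, $2, and $3 refer to the three parenthesized digit groups.
  BIND (replace(?phoneNum, ".*(\\d\\d\\d).*(\\d\\d\\d).*(\\d\\d\\d\\d).*",
                "$1-$2-$3") AS ?fixedPhone)
}
```

Because the whole pattern is wrapped in leading and trailing .* sequences, it consumes the entire input string, so the replacement string becomes the complete result rather than a patch inside the original value.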

Here is the sample data from that earlier blog entry; note the different punctuation and spacing used with the four phone numbers:

@prefix v: <> .
@prefix d: <> .

d:i9771 v:given-name "Cindy" ;
        v:homeTel "1 (203) 446-5478" .

d:i0432 v:given-name "Richard" ;
        v:homeTel "   (729)556-5135   " .

d:i8301 v:given-name "Craig" ;
        v:homeTel "9232765135" .

d:i8309 v:given-name "Leigh" ;
        v:homeTel "843-5544" .

The result after running the query above with this data shows the phone numbers from the data and the results of the replace() calls. Leigh’s seven-digit number doesn’t match the pattern, so replace() returns it unchanged:

| name    | phoneNum         | fixedPhone   |
|---------|------------------|--------------|
| Craig   | 9232765135       | 923-276-5135 |
| Leigh   | 843-5544         | 843-5544     |
| Richard | (729)556-5135    | 729-556-5135 |
| Cindy   | 1 (203) 446-5478 | 203-446-5478 |
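A value that replace() leaves alone looks the same in the output as one that was already in the right format. One possible way to tell the two apart, offered here as an assumption about how you might handle it and not as part of the original demo, is to test the same pattern with REGEX first. The ?matched variable name and the "unrecognized" marker string are hypothetical.

```sparql
SELECT ?phoneNum ?matched ?fixedPhone
WHERE {
  VALUES ?phoneNum { "9232765135" "843-5544" }
  # REGEX matches anywhere in the string, so no leading or trailing .* is needed.
  BIND (regex(?phoneNum, "\\d\\d\\d.*\\d\\d\\d.*\\d\\d\\d\\d") AS ?matched)
  # Only rewrite values that fit the pattern; mark the rest for review.
  BIND (IF(?matched,
           replace(?phoneNum, ".*(\\d\\d\\d).*(\\d\\d\\d).*(\\d\\d\\d\\d).*",
                   "$1-$2-$3"),
           "unrecognized") AS ?fixedPhone)
}
```

With these two inputs, ?matched should be true for the first value and false for the second, which has only seven digits.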

As the SPARQL query spec tells us, REPLACE corresponds to the XPath fn:replace function. The fn:replace documentation in turn points to a separate “Regular expression syntax” section that lists available flags such as i for case-insensitive matching and m for multiline matching.
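Flags go in an optional fourth argument to REPLACE. The following sketch uses made-up label strings, not anything from the phone number demo, to show the i flag:

```sparql
SELECT ?label ?fixed
WHERE {
  # Hypothetical test strings that differ only in case.
  VALUES ?label { "Home TEL" "home tel" }
  # The fourth argument "i" makes the match case-insensitive,
  # so both "TEL" and "tel" are replaced.
  BIND (replace(?label, "tel", "phone", "i") AS ?fixed)
}
```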

Those links ultimately lead to an escape character table in the XML Schema Part 2 specification. This table lists the typical regular expression codes: for example, \s matches white space characters and \d matches a numeric digit. Note that the \d codes in the SPARQL query above are inside a quoted string, so the backslash itself needs escaping; that’s why you see two backslashes before each d in my query’s regular expression.
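The same double-backslash rule applies to the other escape codes. As a small sketch, using one of the sample phone numbers and two illustrative variable names (?trimmed and ?digitsOnly) that are not part of the original query:

```sparql
SELECT ?phoneNum ?trimmed ?digitsOnly
WHERE {
  VALUES ?phoneNum { "   (729)556-5135   " }
  # \\s+ anchored at either end removes leading and trailing white space.
  BIND (replace(?phoneNum, "^\\s+|\\s+$", "") AS ?trimmed)
  # \\D is the complement of \\d, so this deletes every non-digit character.
  BIND (replace(?phoneNum, "\\D", "") AS ?digitsOnly)
}
```

Here ?trimmed should come out as (729)556-5135 and ?digitsOnly as 7295565135. An empty replacement string is how REPLACE deletes whatever the pattern matches.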

The REPLACE function’s ability to find substrings and delete or rearrange them in RDF literal data should be very handy for data cleanup and enhancement. I’m sorry I didn’t notice it before!

Comments? Reply to my tweet (or even better, my Mastodon message) announcing this blog entry.

Excerpt from xkcd comic by Randall Munroe, CC BY-NC 2.5.