Using regular expressions to manipulate data in a SPARQL query

A pure, standards-compliant SPARQL query.

February 25, 2024

I have often lamented that SPARQL’s REGEX function only returns a boolean value. It’s handy in FILTER tests because it lets you use regular expressions to express more complex conditions about the results that you do or don’t want a query to return. Instead of just getting true or false, though, I wished that I could grab the pieces of a string that match a regular expression pattern and recombine them into new values, as I can with the regular expression support of most programming languages.

I only recently noticed that SPARQL’s REPLACE function, which comes right after REGEX in the SPARQL query specification, supports regular expressions, so I can do this regex string manipulation in SPARQL after all.

One of those other languages is JavaScript. In Calling your own JavaScript functions from SPARQL queries I showed how once you write a JavaScript function that does some regex string manipulation, you can then call that function from a SPARQL query being executed with Jena ARQ. (Soon I’ll be showing how to do that with GraphDB on the Ontotext blog.) The demo in my earlier blog entry used a regular expression in a JavaScript function to normalize some U.S. phone numbers.

The SPARQL query below demonstrates why I didn’t need to call those JavaScript functions. Using SPARQL’s REPLACE function and the same input data as that demo, I can normalize the same phone numbers using nothing but pure W3C-compliant SPARQL.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX v:    <http://www.w3.org/2006/vcard/ns#>

SELECT ?name ?phoneNum ?fixedPhone
WHERE {
  ?s v:given-name ?name ;
     v:homeTel ?phoneNum .
  # Capture groups of 3, 3, and 4 digits, then rejoin them with hyphens.
  BIND (replace(?phoneNum,".*(\\d\\d\\d).*(\\d\\d\\d).*(\\d\\d\\d\\d).*",
                "$1-$2-$3") AS ?fixedPhone)
}

The regular expression in the replace() function call’s second argument looks for two three-digit sequences followed by a four-digit sequence, ignoring everything before, after, or in between. Because the pattern matches the entire string, the third argument replaces the whole value with the three captured groups ($1, $2, and $3) separated by hyphens.

Here is the sample data from that earlier blog entry; note the different punctuation and spacing used with the four phone numbers:

@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix d: <http://learningsparql.com/ns/data#> .

d:i9771 v:given-name "Cindy" ;
        v:homeTel "1 (203) 446-5478" .

d:i0432 v:given-name "Richard" ;
        v:homeTel "   (729)556-5135   " .

d:i8301 v:given-name "Craig" ;
        v:homeTel "9232765135" .

d:i8309 v:given-name "Leigh" ;
        v:homeTel "843-5544" .

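The result of running the query above with this data shows the phone numbers from the data and the results of the replace() calls. Leigh’s seven-digit number has no area code, so the pattern doesn’t match it and replace() returns that string unchanged: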

name	phoneNum	fixedPhone
Craig	9232765135	923-276-5135
Leigh	843-5544	843-5544
Richard	(729)556-5135	729-556-5135
Cindy	1 (203) 446-5478	203-446-5478
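
To make rows like Leigh's easier to spot, one option is to test the same pattern with REGEX and return the boolean next to the REPLACE result. The following is only a sketch of that idea; the ?pattern and ?matched variable names are assumptions added for illustration and are not part of the original query.

```sparql
PREFIX v: <http://www.w3.org/2006/vcard/ns#>

SELECT ?name ?phoneNum ?fixedPhone ?matched
WHERE {
  ?s v:given-name ?name ;
     v:homeTel ?phoneNum .
  # ?pattern and ?matched are illustrative names, not from the original query.
  BIND (".*(\\d\\d\\d).*(\\d\\d\\d).*(\\d\\d\\d\\d).*" AS ?pattern)
  # REPLACE returns the input unchanged when the pattern finds no match.
  BIND (REPLACE(?phoneNum, ?pattern, "$1-$2-$3") AS ?fixedPhone)
  # REGEX reports whether the pattern matched at all.
  BIND (REGEX(?phoneNum, ?pattern) AS ?matched)
}
```

With the sample data above, ?matched should be true for Cindy, Richard, and Craig and false for Leigh, whose number has too few digits for the pattern.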

As the SPARQL query spec tells us, this function corresponds to the XPath fn:replace function. The documentation for fn:replace points in turn to a separate Regular expression syntax section that lists the available flags, such as i for case-insensitive matching and m for multiline matching. In SPARQL, you pass these flags as an optional fourth argument to REPLACE.
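
As a minimal sketch of how a flag changes the result, the query below passes "i" as REPLACE's optional fourth argument. The "TEL: 843-5544" literal and the variable names are made up for illustration and do not come from the sample data.

```sparql
SELECT ?withFlag ?withoutFlag
WHERE {
  # The literal "TEL: 843-5544" is a made-up value for illustration.
  # With the "i" flag, the lowercase pattern matches the uppercase prefix.
  BIND (REPLACE("TEL: 843-5544", "tel: ", "", "i") AS ?withFlag)
  # Without the flag nothing matches, so the string comes back unchanged.
  BIND (REPLACE("TEL: 843-5544", "tel: ", "") AS ?withoutFlag)
}
```

Here ?withFlag should come back as "843-5544", while ?withoutFlag keeps the original "TEL: 843-5544".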

Those links ultimately lead to an escape character table in the XML Schema Part 2 specification. This table lists the typical regular expression codes; for example, \s matches white space characters and \d matches a digit. Note that the \d codes in the SPARQL query above are inside a quoted string, so each backslash itself needed escaping; that’s why you see two backslashes before each d in my query’s regular expression.
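
The same double-backslash rule applies to the other escape codes. The sketch below, which assumes the same sample data, uses \s to strip white space and rewrites the phone pattern with {n} quantifiers so that each digit group needs only one \d. The ?noSpaces variable name is an assumption added for illustration.

```sparql
PREFIX v: <http://www.w3.org/2006/vcard/ns#>

SELECT ?phoneNum ?noSpaces ?fixedPhone
WHERE {
  ?s v:homeTel ?phoneNum .
  # "\\s" in the query text reaches the regex engine as \s,
  # so this deletes every white space character.
  BIND (REPLACE(?phoneNum, "\\s", "") AS ?noSpaces)
  # Same pattern as the main query, written with {n} quantifiers
  # instead of repeated \\d codes.
  BIND (REPLACE(?phoneNum, ".*(\\d{3}).*(\\d{3}).*(\\d{4}).*", "$1-$2-$3") AS ?fixedPhone)
}
```

The quantifier form is only a more compact spelling; it should produce the same ?fixedPhone values as the original pattern.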

The REPLACE function’s ability to find substrings and delete or rearrange them in RDF literal data should be very handy for data cleanup and enhancement. I’m sorry I didn’t notice it before!


Excerpt from xkcd comic by Randall Munroe, CC BY-NC 2.5.
