Federated SPARQL queries

Using a Jena extension.

Much of the promise of RDF and Linked Data is the ease of pulling data from multiple sources and combining it. I recently discovered the SERVICE extension that Jena’s ARQ query engine adds to SPARQL, which lets you send subqueries off to multiple SPARQL endpoints and then combine the results. Because a given SPARQL endpoint may be an interface to a triplestore, a relational data store, or something else, the ability to query several endpoints with one query is very nice.

The Jena project’s “ARQ - Basic Federated SPARQL Query” page describes the use of this keyword. Before I start quoting from that page, I want to jump right in with an example that worked for me. In one query, it pulls birth date and spouse information about Arnold Schwarzenegger from DBpedia and a list of his movies and their release dates from the Linked Movie Database:

PREFIX imdb: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>


SELECT ?birthDate ?spouseName ?movieTitle ?movieDate {
  { SERVICE <http://dbpedia.org/sparql>
    { SELECT ?birthDate ?spouseName WHERE {
        ?actor rdfs:label "Arnold Schwarzenegger"@en ;
               dbpo:birthDate ?birthDate ;
               dbpo:spouse ?spouseURI .
        ?spouseURI rdfs:label ?spouseName .
        FILTER ( lang(?spouseName) = "en" )
      }
    }
  }
  { SERVICE <http://data.linkedmdb.org/sparql>
    { SELECT ?actor ?movieTitle ?movieDate WHERE {
        ?actor imdb:actor_name "Arnold Schwarzenegger" .
        ?movie imdb:actor ?actor ;
               dcterms:title ?movieTitle ;
               dcterms:date ?movieDate .
      }
    }
  }
}

You can run this query yourself at the sparql.org RDF Query Demo page.

Before you start modeling your own queries on this, it’s worth reading the Jena documentation page mentioned above, especially the “Performance Considerations” part:

This feature is a basic building block to allow remote access in the middle of a query, not a general solution to the issues in distributed query evaluation. The algebra operation is executed without regard to how selective the pattern is. So the order of the query will affect the speed of execution. Because it involves HTTP operations, asking the query in the right order matters a lot. Don’t ask for the whole of a bookstore just to find book whose title comes from a local RDF file - ask the bookshop a query with the title already bound from earlier in the query.

As an example, both subqueries above specifically ask for information about Schwarzenegger instead of trying to scan the complete databases looking for matches.
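
To make the ordering advice concrete, here is a minimal sketch in the spirit of the bookstore example quoted above. The local file mybooks.rdf and the endpoint http://example.org/books/sparql are hypothetical placeholders, and the sketch assumes, as the quoted passage describes, that the engine evaluates the patterns in the order written:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?title ?author
FROM <mybooks.rdf>                # hypothetical local RDF file
WHERE {
  # Matched first against the local data, so ?title is already bound ...
  ?book dc:title ?title .

  # ... by the time the remote endpoint is asked about it.
  # The endpoint URL is a placeholder, not a real service.
  SERVICE <http://example.org/books/sparql> {
    ?remoteBook dc:title ?title ;
                dc:creator ?author .
  }
}
```

Putting the SERVICE block first would instead ask the remote endpoint for every title and creator it has, which is the "whole of a bookstore" case the documentation warns about.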

Two parts of this trick are non-standard SPARQL, but may become part of SPARQL 1.1: subqueries and the SERVICE keyword. As Lee Feigenbaum’s slide on the latter points out, the SPARQL Working Group is using ARQ’s SERVICE keyword as a starting point in thinking about how a query can target multiple endpoints.

My query above against the two different SPARQL endpoints also works from within TopQuadrant’s TopBraid Suite of products, so I’m sure I’ll be using this on work-related projects more and more.

3 Comments

By Taylor on January 4, 2010 9:07 PM

I knew we’d get you using Jena sooner or later. It’s got the best sparql IMHO.

By Karl Glatz on June 7, 2010 9:14 AM

Nice blog post!

I’m not able to test your query on the sparql.org web page; I got an “Error 500: No dataset description for query”. Any suggestions?

By Bob on June 7, 2010 10:15 AM

Maybe one of the endpoints was down when you tried it. I just pasted the query above at the demo page, and it ran without an error, but it returned headers with no results under them. DBpedia has changed the URI for the birthDate property to http://dbpedia.org/property/birthDate and no longer has a spouse value for Schwarzenegger, so the ?birthDate and ?spouseURI variables didn’t get bound.

The cleanup of DBpedia’s ontologies is obviously a good thing overall, but can break some queries. I have no idea why someone would remove his spouse value. Maria Shriver does have a spouse value of Schwarzenegger.
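
For anyone adapting the query, here is a rough sketch of how the DBpedia half might be adjusted. It assumes the property URI mentioned above is still current. The dbpprop prefix name and the OPTIONAL wrapper around the spouse pattern are illustrative choices, and the sketch has not been tested against the live endpoint:

```sparql
PREFIX dbpprop: <http://dbpedia.org/property/>   # prefix name chosen for illustration
PREFIX dbpo:    <http://dbpedia.org/ontology/>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?birthDate ?spouseName WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    SELECT ?birthDate ?spouseName WHERE {
      ?actor rdfs:label "Arnold Schwarzenegger"@en ;
             dbpprop:birthDate ?birthDate .
      # OPTIONAL keeps the birth date row even when no spouse is recorded.
      OPTIONAL {
        ?actor dbpo:spouse ?spouseURI .
        ?spouseURI rdfs:label ?spouseName .
        FILTER ( lang(?spouseName) = "en" )
      }
    }
  }
}
```

With the spouse pattern made optional, a missing spouse value should leave ?spouseName unbound instead of emptying the whole result.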

Bob