SPARQL and relational databases: getting started

Asking about tables and columns and doing a simple join.

In an earlier project I did querying relational data with SPARQL, I wanted to demonstrate how adding OWL metadata made it possible to answer useful queries that couldn’t have been answered with just the original data. I did this using two simple databases of one table each, but recently, while helping someone who was also using the D2RQ interface to provide access to some relational data, I decided to get more comfortable with using SPARQL to query data from multi-table relational databases.

I wanted to find ways to:

picture of globe

  • list a database’s tables

  • list the columns in those tables

  • run a query that joined data from at least two of the tables

It was all easier than I though it would be.

For a database to query, I went to the sample database section of MySQL’s documentation page and got the “world” database. This database has a country table with columns for country names, country codes unique within the table, and other information about the country such as its population and head of state. Another table, named countrylanguage, has a column for a country code to link to the country table, a column for a language spoken in that country, another for a figure showing the percentage of the country’s residents who speak that language, and a fourth column for a boolean IsOfficial value.

The following SQL query against that database lists the name of each country, a language spoken there, and what percentage of the population speak that language, with the rows sorted from high percentages to low:

SELECT Name, Language, Percentage FROM country, countrylanguage 
WHERE country.Code = countrylanguage.CountryCode
ORDER BY Percentage DESC;

Here’s how I developed a SPARQL query doing the same thing.

As I described in the writeup of my earlier project, the first step of using D2RQ with a particular database is pointing its generate-mapping utility at a database installed in a running copy of MySQL (or one of the other relational database managers that D2RQ supports) to generate an SQL-to-SPARQL mapping file for that database. Next, you start up the d2r-server program with that mapping file as a parameter to run a server that provides a SPARQL endpoint for that database.

You could then send SPARQL queries to that server’s SPARQL endpoint using Curl. For exploring the data, though, I prefer the SNORQL interface, because I can enter a query on a web form, click the Go button, and browse the result. (The SNORQL interface in the latest official release of D2RQ doesn’t get along with Firefox 3.0—apparently there’s already a fixed version in D2RQ’s cvs tree—so I used Chrome for this.)

When exploring a relational database, I first want to know what its tables are and what columns are in those tables. To do this with SPARQL queries, the D2RQ mapping file that I generated earlier provided some good clues. (Keep in mind that there may be other SPARQL interfaces to relational databases in the future, and they may not all map the relational structures to RDF the same way.)

First: what are the tables in the database? D2RQ treats each table row as a resource of type tablename. For example, it treats the country table’s row for France as a resource identified as http://localhost:2020/resource/country/FRA, which has a http://www.w3.org/1999/02/22-rdf-syntax-ns#type of http://localhost:2020/resource/vocab/country. So, by asking for a list of all the types with the following query (with DISTINCT added so that each only shows up once),

SELECT DISTINCT ?o WHERE {
  ?s  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o
}

SNORQL’s browse mode gives me this list (SNORQL defines “vocab” as the prefix for the http://localhost:2020/resource/vocab/ URI), which corresponds to the tables declared in the world database:

vocab:country 
vocab:countrylanguage 
vocab:city 

What are the columns of these tables? For an example of one column’s name, the world database’s country table has a row for France with a figure of 59225700 in the Population column. D2RQ represents the column name as a property named tablename_columnname, so the triple describing France’s population tells us that http://localhost:2020/resource/country/FRA has a http://localhost:2020/resource/vocab/country_Population of 59225700. How do we list all these predicate names? As I mentioned in How you can explore a new set of linked data, the following query lists all the predicates in an RDF-based data collection:

SELECT DISTINCT ?p WHERE {?s ?p ?o}

Displaying all predicates in the world database, though, includes a few extras along with the ones representing the table columns. The table columns are all in the http://localhost:2020/resource/vocab namespace, so the best way I could think of to query for predicates that were in that namespace was like this:

SELECT DISTINCT ?p 
WHERE { ?s ?p ?o.
        FILTER(regex(str(?p),"http://localhost:2020/resource/vocab/")).
}

It’s pretty kludgy, so I’d love to hear of a better alternative. It selects all the predicates in the data set and then only passes along the ones whose URLs, when converted to a string, have “http://localhost:2020/resource/vocab/” in them. If another namespace used http://localhost:2020/resource/vocab/foo as its URL, this FILTER would select predicates in that namespace as well, because it has that “http://localhost:2020/resource/vocab/” substring in there, but I couldn’t find a better way to do this. (Although SPARQL implements a subset of XPath, the namespace-uri function is not part of this subset.)

I won’t show you all 24 predicate names that that this query returns, but for the SPARQL version of the SQL query above that listed country names, languages, and percentages, I picked out the tablename_columnname predicates I needed from the list of 24 and used them to create this query:

SELECT ?name ?language ?percentage 
WHERE { ?s1 vocab:country_Name ?name;
            vocab:country_Code ?ccode.
        ?s2 vocab:countrylanguage_CountryCode ?ccode;
            vocab:countrylanguage_Language ?language;
            vocab:countrylanguage_Percentage ?percentage.
}
ORDER BY DESC(?percentage)

It’s not a very fancy query. The most interesting part is the use of the ccode variable to connect information in triples from the world database’s country table with information in triples from the database’s countrylanguage table. When I first started this I had no idea how I was going to do this SPARQL equivalent of an SQL join, but once I sat down and tried it was intuitive enough. I won’t show you all 984 results rows, but here’s a selection from the middle to give you the flavor of what’s there:

"Réunion"        "Creole French"  91.5
"Germany"        "German"         91.3
"Seychelles"     "Seselwa"        91.3
"Romania"        "Romanian"       90.7
"American Samoa" "Samoan"         90.6
"Syria"          "Arabic"         90.0
"Swaziland"      "Swazi"          89.9
"Bahamas"        "Creole English" 89.7
"Chile"          "Spanish"        89.7
"Sweden"         "Swedish"        89.5

Now I feel more confident about forging ahead to explore sets of relational data that I can access with SPARQL queries using the D2RQ interface.

4 Comments

By Irene Polikoff on October 20, 2008 9:18 PM

Bob,

Queries like SELECT DISTINCT ?p WHERE {?s ?p ?o} are OK for RDF stores because of their specialized indexing strategies, but I don’t think this would work well with D2RQ against a relational database.

It would be OK if the RDBMS had a very small amount of data. But on a database of any size, such queries are not practical. I suspect D2RQ will just try to get all the data in the entire database and then try to figure out predicates. Same for query you are suggesting for getting the table names.

D2RQ translates database schema in the following way - all tables get represented as owl:Class, all columns are exposed as properties connected to the classes using rdfs:domain statements.

So, a much less expensive way to get all tables is simply by using SELECT ?table WHERE {?table rdf:type owl:Class}

Then to get columns you can do ?column rdfs:domain ?table. In fact, this single query will get you all tables with the corresponding columns

SELECT ?table ?column WHERE{?column rdfs:domain ?table}

Regards,

Irene

By Bob DuCharme on October 20, 2008 10:55 PM

Thanks Irene! That all makes sense to me, but
I didn’t see any use of the http://www.w3.org/2002/07/owl# namespace (or even the string “owl”) in the mapping file generated by D2RQ, and

  SELECT ?table WHERE {?table rdf:type }

got me no results. This also got me no results:

SELECT ?table ?column WHERE{?column rdfs:domain ?table}

By Richard Cyganiak on October 21, 2008 3:29 PM

Bob, I enjoy your posts on practical RDF production and SPARQL usage a lot. Keep it up!

Irene is right that the queries for listing the classes and properties are not very efficient. However, her proposed replacements unfortunately don’t work – D2RQ only translates the instance data (records) from the DB to RDF, but does not provide an RDF view on the schema level. Information about classes and properties is on the schema level.

It would be nice if these statements were queryable through SPARQL, but this remains to be done for future versions of D2RQ.

So, at the moment, the queries you suggested are the best that can be done. They will work well enough for databases up to a few 100k records if the number of tables/columns isn’t very large. Note that Snorql uses variations of your queries to list classes and properties.

By Irene Polikoff on October 23, 2008 12:04 AM

Hmm… interesting. It works exactly as described for me. Of course, I use a version of D2RQ bundled with TopBraid Composer. And I know that Holger has made some changes. This could be one of them.