Living in a materialized world

Managing inferenced triples with named graphs.

Living in the Material World album cover

I’ve often thought that named graphs could provide an infrastructure for managing inferenced triples, and a recent Twitter exchange with Adrian Gschwend inspired me to follow through with a little demo.

Before I describe this demo I’m going to review some basic ideas about RDF inferencing and database denormalization. Then I’ll describe one approach to managing your own inferencing with an RDF version of database denormalization.

Inferencing

As I wrote in the “What Is Inferencing?” section of the “RDF Schema, OWL, and Inferencing” chapter of my book Learning SPARQL, “Webster’s New World College Dictionary defines ‘infer’ as ‘to conclude or decide from something known or assumed.’ When you do RDF inferencing, your existing triples are the ‘something known,’ and your inference tools will infer new triples from them.” If you have triples saying that Lassie is an instance of dog, and dog is a subclass of mammal, and mammal is a subclass of animal, then an inferencing tool such as a SPARQL engine that implements RDFS will recognize the implications of the rdfs:subClassOf predicate used to make the last two statements. This means that if you query for all instances of mammal or animal, it will include Lassie in the result.
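
In Turtle, those starting triples might look something like this, using a made-up ex: namespace:

@prefix ex: <http://www.example.com/ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Lassie a ex:Dog .
ex:Dog rdfs:subClassOf ex:Mammal .
ex:Mammal rdfs:subClassOf ex:Animal .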

The “Using SPARQL to Do Your Inferencing” section of that same chapter shows how a query like the following can implement some inferencing for this RDFS property if your SPARQL engine doesn’t have this feature built in:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

CONSTRUCT { ?instance a ?super . }
WHERE { 
  ?instance a ?subclass . 
  ?subclass rdfs:subClassOf ?super . 
}

You’d need to write such rules for all of the parts of RDFS and OWL that you wanted to implement—and even that might not be enough. Once the query above created a triple saying that Lassie is a mammal, it would be done, but a proper inferencing engine would then infer from that new triple that Lassie is also an animal.

The above technique can still be useful for simple inferencing like implementation of the rdfs:subPropertyOf property for data integration, as long as your subproperties don’t have subproperties, so I’ll call this technique “one-pass inferencing.” (I wrote about the implementation of similar inferencing in Driving Hadoop data integration with standards-based models instead of code.)
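
If your subproperties do have subproperties (or your class hierarchy runs more than one level deep), a SPARQL 1.1 property path is one way to follow the whole chain in a single query, though it still only covers the hierarchy part of RDFS rather than everything a full inferencing engine would do. A minimal sketch for the subproperty case:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT { ?s ?superProp ?o . }
WHERE {
  ?s ?p ?o .
  ?p rdfs:subPropertyOf+ ?superProp .
}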

Database denormalization

To oversimplify a bit, relational database normalization is the process of working out which properties should be stored in which tables to avoid redundancy, because redundancy generally leads to inefficiency. When you store your customers’ addresses and information about each item that they ordered, you don’t want these in the same table; if one customer ordered three different items, then storing a copy of the address with all three items would take up unnecessary space and make it more difficult to update the address if that customer moves. If you store a unique customer number with the address in the customers table and also with each of the customer’s orders in a separate orders table, then when you want to list customer addresses with the items that each customer ordered, you tell the database system to do a join of the tables using the customer number to cross-reference the information.

Sometimes day-to-day operations of a large database system require millions of complex joins to fulfill common requests. This can lead a database administrator to introduce some redundancy in certain tables to increase the efficiency of these requests. We call this denormalization. Because of the potential problems of these redundancies, this requires careful management—perhaps clearing out and repopulating the denormalized tables every night at 2AM.

Storing RDF triples that could otherwise be inferred dynamically, or “materializing” those triples, is similar. They’re considered redundant: if you have all the information necessary to infer a certain piece of information, why store that information in your dataset? Because repeated inferencing of that information would mean repeatedly spending compute power on the same task. Relying on dynamic inferencing also limits your choice of SPARQL processors, because different processors support different levels of inferencing depending on their support for RDFS and the various OWL profiles; many can’t do any inferencing at all.

Because you can do your own one-pass inferencing with CONSTRUCT queries (and with INSERT queries if you are using a triplestore that supports SPARQL UPDATE), you can do your own materializing to get the effects of denormalization.
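
For example, assuming your triplestore supports SPARQL 1.1 Update, the INSERT version of the earlier subclass query would look something like this:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

INSERT { ?instance a ?super . }
WHERE {
  ?instance a ?subclass .
  ?subclass rdfs:subClassOf ?super .
}

The new triples land in the default graph; the named graph approach below gives you more control over where they go and how to clean them up later.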

Using named graphs to manage materialized triples

The rest of this assumes that you are familiar with querying and updating of SPARQL named graphs. To be honest, I use these rarely enough that I re-read the “Named Graphs” section of my book’s “Updating Data with SPARQL” chapter as a review before I assembled the steps below.

I mentioned above how the manager of a relational database might have to clear out and repopulate the denormalized tables periodically so that their information stays synchronized with the canonical data. With RDF, we can store materialized triples in named graphs to enable a similar effect. The steps below walk through one possible scenario for this using the Jena Fuseki triplestore.

Imagine that my company has two subsidiaries, company1 and company2, that use different schemas to keep track of their employees, and I’m using RDFS inferencing to treat all that data as if it conformed to the same schema. Here is a sample of company1 data:

# company1.ttl

@prefix c1d: <http://learningsparql.com/ns/company1/data#> . 
@prefix c1m: <http://learningsparql.com/ns/company1/model#> . 

c1d:rich c1m:firstName "Richard" . 
c1d:rich c1m:lastName "Mutt" . 
c1d:rich c1m:phone "342-667-9256" . 

c1d:jane c1m:firstName "Jane" . 
c1d:jane c1m:lastName "Smith" . 
c1d:jane c1m:phone "546-700-2543" . 

Here is some company2 data:

# company2.ttl 

@prefix c2d: <http://learningsparql.com/ns/company2/data#> . 
@prefix c2m: <http://learningsparql.com/ns/company2/model#> . 

c2d:i432 c2m:firstname "Nanker Phelge" . 
c2d:i432 c2m:surname "Mutt" . 
c2d:i432 c2m:homephone "879-334-5234" . 

c2d:i245 c2m:firstname "Cindy" . 
c2d:i245 c2m:surname "Marshall" . 
c2d:i245 c2m:homephone "634-452-4678" . 

The two datasets use different properties in different namespaces (such as c1m:lastName vs. c2m:surname) to keep track of the same kinds of information.

The “upload files” tab of Fuseki’s web-based interface includes a “Destination graph name” field with a prompt of “Leave blank for default graph”. I specified a graph name of company1 when I uploaded company1.ttl and Fuseki gave this graph a full name of http://localhost:3030/myDataset/data/company1 because it was running on the default port of 3030 on my computer. (All of my queries below define d: as a prefix for http://localhost:3030/myDataset/data/, so I’ll use that to save some typing here.)
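
(If you'd rather script this step than use the web form, SPARQL 1.1 Update's LOAD operation can do the same kind of thing, assuming the server is allowed to read the source location; the file URL here is just a hypothetical example:)

PREFIX d: <http://localhost:3030/myDataset/data/>

# hypothetical file URL; use whatever location your server can read
LOAD <file:///somewhere/company1.ttl> INTO GRAPH d:company1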

After uploading company2.ttl into a d:company2 named graph, I uploaded the following bit of modeling into a named graph called d:model. It names the company1 and company2 properties as subproperties of equivalent schema.org properties.

# integrationModel.ttl

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix c1m: <http://learningsparql.com/ns/company1/model#> . 
@prefix c2m: <http://learningsparql.com/ns/company2/model#> . 
@prefix schema: <http://schema.org/> . 

c1m:firstName rdfs:subPropertyOf schema:givenName . 
c1m:lastName rdfs:subPropertyOf schema:familyName . 
c1m:phone rdfs:subPropertyOf schema:telephone . 

c2m:firstname rdfs:subPropertyOf schema:givenName . 
c2m:surname rdfs:subPropertyOf schema:familyName . 
c2m:homephone  rdfs:subPropertyOf schema:telephone . 

If I loaded all of the above triples into a triplestore that could do inferencing, I could query for schema:givenName, schema:familyName, and schema:telephone values right away and get all of the company1 and company2 data with that one query. For this example, though, I’m going to show how to do one-pass inferencing to set the stage for a query that can retrieve all that data using the schema.org property names.

The next step was to do that inferencing—that is, to create the inferred triples. Before updating data in a triplestore with an INSERT command, it’s good to do a CONSTRUCT query to double-check that you’ll be creating what you had hoped to, so I ran the following query. It looks in a dataset’s default graph and any named graphs for resources that have properties that are subproperties of other properties and then creates triples using those superproperties:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX d: <http://localhost:3030/myDataset/data/>

CONSTRUCT  { ?s ?superProp ?o }
WHERE
{
   { ?s ?p ?o }
   UNION
   { GRAPH ?g { ?s ?p ?o } }
   GRAPH d:model {?p rdfs:subPropertyOf ?superProp } .
}

For example, when this query sees that c2m:firstname is a subproperty of schema:givenName and that c2d:i245 has a c2m:firstname of “Cindy”, it constructs a triple saying that c2d:i245 has a schema:givenName of “Cindy”. In other words, it expresses the original fact using a schema.org property in addition to the property from company2’s schema.

The complete result of this query showed all of the company1 and company2 data but using the schema.org properties instead of their original schemas. Being the result of a CONSTRUCT query, though, these triples are temporary.
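
For example, the result included triples like these for one employee from each company (prefix declarations omitted):

c1d:jane schema:givenName "Jane" .
c1d:jane schema:familyName "Smith" .
c1d:jane schema:telephone "546-700-2543" .

c2d:i245 schema:givenName "Cindy" .
c2d:i245 schema:familyName "Marshall" .
c2d:i245 schema:telephone "634-452-4678" .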

I was then ready to run the INSERT version of this query so that the new triples would become part of my dataset. That is, I was ready to do the actual inferencing. The following similar query inserts those triples into their own d:inferredData named graph so that when the time comes to update this redundant data it will be simple to clean out these materialized triples.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX d: <http://localhost:3030/myDataset/data/>

INSERT { GRAPH d:inferredData  { ?s ?superProp ?o }}
WHERE
{
  { ?s ?p ?o }
  UNION
  { GRAPH ?g { ?s ?p ?o } }
  GRAPH d:model {?p rdfs:subPropertyOf ?superProp } .
}

I used this next query to check that all the triples I had seen in the CONSTRUCT query's result really had been added to the new d:inferredData graph by the INSERT request. They had:

PREFIX d: <http://localhost:3030/myDataset/data/>

SELECT ?s ?p ?o
WHERE
{ 
   GRAPH d:inferredData { ?s ?p ?o } 
}

At this point I had integrated data from the two companies to conform to a common, standard model, and I could proceed with all the benefits of this arrangement as I queried across the two sets of employees by using the shared schema.
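
For example, a query along these lines pulls a single phone list across both companies using only the schema.org property names:

PREFIX d: <http://localhost:3030/myDataset/data/>
PREFIX schema: <http://schema.org/>

SELECT ?givenName ?familyName ?phone
WHERE
{
  GRAPH d:inferredData
  {
    ?employee schema:givenName ?givenName ;
              schema:familyName ?familyName ;
              schema:telephone ?phone .
  }
}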

But, let’s say that Jane Smith changes her contact number from 546-700-2543 to 546-111-2222. This gets updated in the original company1 data in the d:company1 named graph with the following update request:

PREFIX d: <http://localhost:3030/myDataset/data/>
PREFIX c1d: <http://learningsparql.com/ns/company1/data#> 
PREFIX c1m: <http://learningsparql.com/ns/company1/model#> 

DELETE
{ GRAPH d:company1 { c1d:jane c1m:phone "546-700-2543" . } }
INSERT
{ GRAPH d:company1 { c1d:jane c1m:phone "546-111-2222" . } }
WHERE
{ GRAPH d:company1 { c1d:jane c1m:phone "546-700-2543" . } }

If I query the schema.org version of the data for Jane’s phone number I will still get her old one. This is easy enough to fix; first I blow away all the materialized triples,

PREFIX d: <http://localhost:3030/myDataset/data/>
DROP GRAPH d:inferredData

and then I regenerate up-to-date versions with the same INSERT command I used earlier. Problem solved. (If I have terabytes of triples of employee data, this DROP GRAPH followed by a new inferencing pass is the part that I’d do at 2AM each morning.)
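
Because SPARQL 1.1 Update lets you put several operations in one request, separated by semicolons, that whole refresh could be a single request, something like this:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX d: <http://localhost:3030/myDataset/data/>

DROP SILENT GRAPH d:inferredData ;

INSERT { GRAPH d:inferredData { ?s ?superProp ?o } }
WHERE
{
  { ?s ?p ?o }
  UNION
  { GRAPH ?g { ?s ?p ?o } }
  GRAPH d:model { ?p rdfs:subPropertyOf ?superProp } .
}

(The SILENT keyword keeps the request from failing if the graph isn't there yet.)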

Applying these steps

I did all this by going to various Fuseki screens and pasting queries in. Fuseki has a nice feature in which after you run any query—even an update query—it shows you the URL and the curl command that would execute the same operation. This lets you string together these steps in a shell script that automates their execution, which would be handy for a production application. Instead of pasting all those queries into web forms I could just run that script, or, in the case of the 2AM updates, have a cron job run the script as I slept.

For a production application, there are a few other things I might change. For example, if there were millions of triples of company1 data and millions of triples of company2 data I might do the inferencing over just one or the other instead of everything at once. Assuming that they got updated on different schedules (because they are, after all, different companies) this would skip some unnecessary processing.
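
A sketch of what that might look like for just the company1 data:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX d: <http://localhost:3030/myDataset/data/>

INSERT { GRAPH d:inferredData { ?s ?superProp ?o } }
WHERE
{
  GRAPH d:company1 { ?s ?p ?o }
  GRAPH d:model { ?p rdfs:subPropertyOf ?superProp }
}

Giving each company's materialized triples its own named graph (say, a d:inferredCompany1) would take this further, letting you drop and rebuild them on independent schedules.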

The ultimate lesson is that while named graphs are not particularly popular in typical SPARQL usage, they can be useful for managing large collections of triples in which different sets of triples play different roles, and the materializing of inferenced triples is one nice example.

Are you using named graphs for any production application? Let me know at @bobdc or at @learningsparql.