Transforming data with inferencing and (partial!) schemas

An excellent compromise between schemas and "schemaless" development.

If you’re working with more than one RDF dataset, then the use of RDFS to identify little subsets of those datasets and to specify relationships between components of those subsets can help your knowledge graph and the applications that use it become useful a lot sooner.

I originally planned to title this “Partial schemas!” but as I assembled the example I realized that in addition to demonstrating the value of partial, incrementally-built schemas, the steps shown below also show how inferencing with schemas can implement transformations that are very useful in data integration. In the right situations this can be even better than SPARQL, because instead of using code—whether procedural or declarative—the transformation is driven by the data model itself.

Also, the models here are RDF Schemas, better known as RDFS. When people talk about RDF inferencing, they’re often talking about some of the more advanced inferencing that the superset (actually, supersets) of RDFS known as OWL can do. Many people don’t realize how much you can do with simple RDFS inferencing.

Schema, no schema, or… some schema

For most of the history of data processing on computers, people needed to spell out the structure of their data before they could actually start accumulating data. For example, when using relational databases, you can’t add a row to a table unless you (or someone) has already specified all the columns that are going to be in that table and all of their types. In fact, you probably have to do this for all the database’s other tables as well, because the tables aren’t really ready until their relationships have all been straightened out through the process of normalization.

The rise of NoSQL databases—especially MongoDB—and the fact that schemas were optional for XML got developers excited about the ability to add any data of any structure they wished to a dataset. Since then, blogs have been full of debates about the value of developing with vs. without schemas. Not enough people appreciate the wonderful compromise offered by RDF knowledge graphs, where partial schemas can give you the best of both worlds, so I wanted to demonstrate that.

Finding a big mess of RDF

I wanted to start with an RDF dataset that was bigger and more complex than I needed so that I could show how a schema for just a subset of it could help to get only the parts that I wanted. On a page for the YouTube offering of a search engine API company I found a 26K sample of JSON that their API would return on a search for “Star Wars”, so I used AtomGraph’s JSON2RDF to convert that to an RDF file that I called ytstarwars.ttl.

This turned the JSON’s unnamed containers into a lot of triples with blank nodes in the RDF. The structural relationships of these triples were easier to see after a look at the JSON, which had some header data followed by an array named movie_results containing JSON objects about movies, each with title, description, and several other properties. After that was a similar array named video_results whose objects had title and description properties and others (not identical to the movie object properties) that I didn’t care about.

After telling JSON2RDF what base URI to use in its output, I got a lot of RDF triples with a predicate of t:video_results, blank node subjects that represented the array, and blank node objects that represented the videos themselves. To describe the individual videos, the RDF included triples with these videos as subjects and predicate-object pairs such as (t:title “Star Wars: The Empire Strikes Back”) and a t:link property holding the video’s URL.
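To make the shapes involved concrete, here is a simplified sketch (not the actual API payload: I have invented the blank node labels, shortened the values, and used t: to stand in for the base URI given to JSON2RDF). JSON input shaped like this:

{
  "movie_results": [
    { "title": "Star Wars: The Empire Strikes Back", "description": "..." }
  ],
  "video_results": [
    { "title": "LEGO® Star Wars™ The Build Zone", "description": "..." }
  ]
}

comes out of JSON2RDF as triples shaped roughly like these:

_:doc t:movie_results _:m1 .
_:m1  t:title         "Star Wars: The Empire Strikes Back" ;
      t:description   "..." .

_:doc t:video_results _:v1 .
_:v1  t:title         "LEGO® Star Wars™ The Build Zone" ;
      t:description   "..." .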

It sounds messy but isn’t too bad when you flip back and forth between the JSON and the RDF that JSON2RDF created. The fun part was transforming this RDF into something simpler and cleaner—not with the SPARQL CONSTRUCT queries that I would typically use to turn one set of RDF into another, but with a schema and inferencing.

Transforming with a schema

I started the schema with just this:

# pschema1.ttl
@prefix t:    <http://example.com/> .   # placeholder for the JSON2RDF base URI
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

t:Video a rdfs:Class .
t:title rdfs:domain t:Video .

The first triple declares t:Video to be a class.

We often use the rdfs:domain property to say “this property is associated with this class”, which is a typical thing to do in a data model, but the second triple above actually does more than that: it says that if a resource has a t:title property, then an inferencing parser should infer that this resource is an instance of the t:Video class. (Or, in triple terms: if the parser finds a triple with t:title as its predicate, then infer a new triple saying that the found triple’s subject is an instance of the specified class.)

Several of the command line utilities that come with Apache Jena let you use an --rdfs switch to point to a vocabulary file of triples to use for inferencing. Here’s how I used Jena’s riot utility to parse the Turtle version of the YouTube Star Wars query result with inferencing based on the schema above:

riot --rdfs=pschema1.ttl ytstarwars.ttl > temp.ttl

The result is a copy of the input with triples like these added:

_:Ba465efcc265d609003ef1776e61da647 rdf:type t:Video .
_:Ba465efcc265d609003ef1776e61da647 t:title "LEGO® Star Wars™ The Build Zone" .

In addition to videos in the search results, there were also movie results from the movie_results array, so let’s declare the same rdfs:Class and rdfs:domain triples for them and then do more inferencing…

But there’s a problem. Movie results also have t:title properties, and the schema above says that anything with a t:title is a video result. How can the schema distinguish between videos and movies, and how can we say that both videos and movies have titles?

I mentioned earlier that the RDF created by AtomGraph includes triples with predicates of t:video_results, blank node subjects that represent the video results array, and blank node objects that represent the members of the array—the videos themselves. It also includes similar t:movie_results triples to store movies.

The first draft of the schema above used the rdfs:domain property to say that if a triple has a particular predicate then the resource represented by its subject is an instance of a particular class. The second draft uses a different part of the RDFS vocabulary: rdfs:range.

# pschema2.ttl
@prefix t:    <http://example.com/> .   # placeholder for the JSON2RDF base URI
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

t:Video a rdfs:Class .
t:Movie a rdfs:Class .

t:video_results rdfs:range t:Video . 
t:movie_results rdfs:range t:Movie . 

Unlike the rdfs:domain property, the rdfs:range property tells the inferencing engine that if a particular property is used as a triple’s predicate, then that triple’s object is a member of a particular class. The t:video_results triple in this new schema tells the inferencing engine that when it sees the triple {_:blankNode1 t:video_results _:blankNode2} in the input, it should create the triple {_:blankNode2 a t:Video}. The other rdfs:range triple in the schema does something similar to say that the objects of t:movie_results triples are instances of t:Movie.

The first two triples in the new schema declare those two classes, but strictly speaking this isn’t necessary. If an inferred triple says that a resource is an instance of a particular class, then as far as the inference engine is concerned that class exists. It’s still worth declaring the classes, though, because an important reason to have schemas in the first place is to show the structure of the data to the people using that data so that they can get more out of it.

Running a similar riot command with the new schema (riot --rdfs=pschema2.ttl ytstarwars.ttl > temp.ttl) then creates new triples such as the following:

_:B43a50d34335d3e6c8db6403bc5bea2cf a t:Movie .
_:B2f3a9c7d55b4e5ab6272a20db6a16b97 a t:Video . 

How do we show that the title, description, and link properties in the triples generated by AtomGraph apply to videos and movies but not necessarily to other classes that may come up in this data? With another incremental modeling step: we’ll make the Movie and Video classes subclasses of another class (in this case, CreativeWork from schema.org; I may as well take advantage of an existing standard to make the data more interoperable with other applications) and declare that the properties go with that superclass:

# pschema3.ttl
@prefix t:    <http://example.com/> .   # placeholder for the JSON2RDF base URI
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix s:    <http://schema.org/> .

t:Video a rdfs:Class ;
        rdfs:subClassOf s:CreativeWork . 

t:Movie a rdfs:Class ;
        rdfs:subClassOf s:CreativeWork . 

t:video_results rdfs:range t:Video . 
t:movie_results rdfs:range t:Movie . 

t:title rdfs:domain s:CreativeWork .
t:link rdfs:domain s:CreativeWork .
t:description rdfs:domain s:CreativeWork .

Here are some of the triples generated by riot from that schema, with blank node names and t:description values shortened to fit here better:

_:Ba2f a t:Video .
_:Ba2f a s:CreativeWork .
_:Ba2f t:title "2020 Portrayed by Star Wars" .
_:Ba2f t:link "" .
_:Ba2f t:description "A Parody of Star Wars in which..." .

_:B166 a t:Movie .
_:B166 a s:CreativeWork .
_:B166 t:title "Star Wars: The Empire Strikes Back" .
_:B166 t:link "" .
_:B166 t:description "Discover the conflict between good and ..." .

There is a lot more modeling that I could do with this data. I could take greater advantage of the schema.org ontology, and maybe Dublin Core as well, so that my data interoperates better with other data and applications. The remainder of the data converted by AtomGraph has more properties and classes that I may or may not care about. If I do, I can add more to my schema; if I don’t, I’m done.

The value of inferencing from schemas is really just a bonus to this exercise. The original key points I meant to prove here are:

  • A little schema can provide a little value right away.

  • Incrementally building on it can provide more and more value.

  • Your schema doesn’t need to cover all of your input data.

In my last blog entry I wrote about the excellent “Knowledge Graphs” paper (pdf) written by some experts in many related topics as a product of a Schloss Dagstuhl conference in 2018. One bit of that paper that I quoted is very relevant to this blog entry as well:

Graphs allow maintainers to postpone the definition of a schema, allowing the data – and its scope – to evolve in a more flexible manner than typically possible in a relational setting, particularly for capturing incomplete knowledge.

This idea of letting the data and its schema evolve in a more flexible manner is especially great for data integration projects. My example here started off with a (somewhat) big mess of RDF; if you’re working with more than one RDF dataset—maybe with some converted from other formats such as JSON or relational databases—then the use of RDFS to identify little subsets of those datasets and to specify relationships between components of those subsets can help your knowledge graph and the applications that use it become useful a lot sooner.

It works at the other end of the scale as well. For proof of concept work, a small bit of data with a small schema can help to prove your concept. From there, incrementally adding to this data and schema can get those who saw the proved concept more and more interested as you build it up. This agile approach goes over well with software developers, who have good reasons to be suspicious of starting off with a large complex schema. (I actually consider a schema with no corresponding data to only be a schema proposal: how do we know that the schema is doing a good job? The academic world is full of these, although they are more often known as ontologies.)

Note that I did all of this without any SPARQL. I would probably use some SPARQL as one more step to pull out the inferred triples instead of keeping all those original triples about the JSON file’s structure that AtomGraph generated, but that would just be a convenience. The main work of transforming the data subset that I had into the model that I wanted was still performed with the RDFS model.
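That cleanup step could be a CONSTRUCT query along these lines (a sketch, not code from the project; the t: prefix URI is a placeholder standing in for the JSON2RDF base URI):

# cleanup.rq: keep only the inferred CreativeWork data
PREFIX t: <http://example.com/>
PREFIX s: <http://schema.org/>

CONSTRUCT {
  ?work a s:CreativeWork ;
        t:title       ?title ;
        t:description ?description .
}
WHERE {
  ?work a s:CreativeWork ;
        t:title       ?title ;
        t:description ?description .
}

Running that over the inferencing output with Jena’s arq utility (arq --data=temp.ttl --query=cleanup.rq) would keep just the movie and video descriptions and drop the blank-node plumbing that mirrored the JSON file’s structure.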

I’ve written another example of how incremental schema development can benefit an application in Driving Hadoop data integration with standards-based models instead of code. (Note the subtitle: “RDFS Models!”) The main point at the time was to show how this could all work on a Hadoop infrastructure. I took RDF generated from two different employee databases with two different structures, built a small model that integrated subsets of them, ran a script that performed the integration, expanded the model, and ran the same script to perform a larger integration with no changes to the script itself. Hadoop or no Hadoop, it’s another nice demonstration of how RDFS inferencing with gradually growing schemas can help you take advantage of existing datasets that were not originally designed for your application.