Reification is a red herring

And you don't need property graphs to assign data to individual relationships.

RDF's very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better.

I recently tweeted that the ZDNet article Back to the future: Does graph database success hang on query language? was the best overview of the graph database world(s) that I’d seen so far, and I also warned that many such “overviews” were often just Neo4j employees plugging their own product. (The Neo4j company is actually called Neo Technology.) The most extreme example of this is the free O’Reilly book Graph Databases, which is free because it’s being given away by its three authors’ common employer: Neo Technology! The book would have been more accurately titled “Building Graph Applications with Cypher”, the Neo4j query language. This 238-page book on graph databases manages to mention SPARQL and Gremlin only twice each. The ZDNet article above does a much more balanced job of covering RDF and SPARQL, Gremlin and Tinkerpop, and Cypher and Neo4j.

The DZone article RDF Triple Stores vs. Labeled Property Graphs: What’s the Difference? is by another Neo employee, field engineer Jesús Barrasa. It doesn’t mention Tinkerpop or Gremlin at all, but does a decent job of describing the different approach that property graph databases such as Neo4j and Tinkerpop take in describing graphs of nodes and edges when compared with RDF triplestores. Its straw man arguments about RDF’s supposed deficiencies as a data model reminded me of a common theme I’ve seen over the years.

The fundamental thing that most people don’t get about RDF, including many people who are successfully using it to get useful work done, is that RDF’s very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better. Just because RDF doesn’t require the use of schemas doesn’t mean that it can’t use them; the RDF Schema Language lets you declare classes, properties, and information about these that you can use to drive user interfaces, to enable more efficient and readable queries, and to do all the other things that people typically use schemas for. Even better, you can develop a schema for the subset of the data you care about (as opposed to being forced to choose between a schema for the whole data set or no schema at all, as with XML), which is great for data integration projects, and then build your schema up from there.

Barrasa writes of property graphs that “[t]he important thing to remember here is that both the nodes and relationships have an internal structure, which differentiates this model from the RDF model. By internal structure, I mean this set of key-value pairs that describe them.” This is the first important difference between RDF and property graphs: in the latter, nodes and edges can each have their own separate set (implemented as an array in Neo4j) of key-value pairs. Of course, nodes in RDF don’t need this; to say that the node for Jack has an attribute-value pair of (hireDate, “2017-04-12”), we simply make another triple with Jack as the subject and these as the predicate and object.
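In Turtle, using the d: and m: prefixes that appear in the examples later in this post (the specific property names here are mine, for illustration), that looks like this:

@prefix d: <http://learningsparql.com/ns/data/> .
@prefix m: <http://learningsparql.com/ns/model/> .

# Jack's hire date and his relationships are all just triples with
# d:Jack as the subject; no separate internal key-value structure is needed.
d:Jack m:hireDate  "2017-04-12" ;
       m:reportsTo d:Jill .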

Describing the other key difference, Barrasa writes that while the nodes of property graphs have unique identifiers, “[i]n the same way, edges, or connections between nodes–which we call relationships–have an ID”. Property graph edges are unique at the instance level; if Jane reportsTo Jack and Jack reportsTo Jill, the two reportsTo relationships here each have their own unique identifier and their own set of key-value pairs to store information about each edge.

He writes that in RDF “[t]he predicate will represent an edge–a relationship–and the object will be another node or a literal value. But here, from the point of view of the graph, that’s going to be another vertex.” Not necessarily, at least for the literal values; these represent the values in RDF’s equivalent of the key-value pairs–the non-relationship information being attached to a node such as (hireDate, “2017-04-12”) above. This ability is why a node doesn’t need its own internal key-value data structure.

He begins his list of differences between property graphs and RDF with the big one mentioned above: “Difference #1: RDF Does Not Uniquely Identify Instances of Relationships of the Same Type,” which is certainly true. But his example, which he describes as “an RDF graph in which Dan cannot like Ann three times”, is very artificial.


One of his “RDF workarounds” for using RDF to describe that Dan liked Ann three times is reification, in which we convert each triple to four triples: one saying that a given resource is an RDF statement, a second identifying the resource’s subject, a third naming the predicate, and a fourth naming the object. This way, the statement itself has identity, and we can add additional information about it as triples that use the statement’s identifier as a subject and additional predicates and objects as key-value pairs such as (time, “2018-03-04T11:43:00”) to show when a particular “like” took place. Barrasa writes “This is quite ugly”; I agree, and it can also do bad things to storage requirements.
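To make the ugliness concrete, here is a rough Turtle sketch of reifying a single “Dan likes Ann” statement; the d:statement1 identifier and the m:likes and m:time property names are made up for illustration:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix d:   <http://learningsparql.com/ns/data/> .
@prefix m:   <http://learningsparql.com/ns/model/> .

# Four triples just to restate "Dan likes Ann", plus one more for the
# time of this particular like.
d:statement1 a rdf:Statement ;
             rdf:subject   d:Dan ;
             rdf:predicate m:likes ;
             rdf:object    d:Ann ;
             m:time        "2018-03-04T11:43:00" .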

In my 15 years of working with RDF, I have never felt the need to use reification. It’s funny that the 2004 RDF Primer 1.0 has a section on reification, while the 2014 RDF Primer 1.1 (of which I am proud to be listed in the Acknowledgments) doesn’t even mention it: simpler modeling techniques are available, so reification was rarely if ever used.

By “modeling techniques” I mean “declaring and then using a model”, although in RDF, you don’t even have to declare it. If you want to keep track of separate instances of employees, or games, or buildings, you can declare any of these as a class and then create instances of it; similarly, if you want to keep track of separate instances of a particular relationship, declare a class for that relationship and then create instances of it.

How would we apply this to Barrasa’s example, where he wants to keep track of information about Likes? We use a class called Like, where each instance identifies who liked whom. (When I first wrote that previous sentence, I wrote that we can declare a class called Like, but again, we don’t need to declare it to use it. Declaring it is better for serious applications where multiple developers must work together, because part of the point of a schema is to give everyone a common frame of reference about the data they’re working with.) The instance could also identify the date and time of the Like, comments associated with it, and anything else you want to add as a set of key-value pairs for each Like instance, implemented as just more triples.

Here’s an example. After optional declarations of the relevant class and its properties, the following data has four Likes showing who liked whom when, plus a “foo” value to demonstrate associating arbitrary metadata with each Like.

@prefix d:    <http://learningsparql.com/ns/data/> .
@prefix m:    <http://learningsparql.com/ns/model/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . 


# Optional schema.
m:Like  a rdfs:Class .          # A class...
m:liker rdfs:domain m:Like .    # and properties that go with this class.
m:liked rdfs:domain m:Like .
m:foo   rdfs:domain m:Like .


[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T11:43:00" ;
   m:foo "bar" .


[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T11:58:00" ;
   m:foo "baz" .


[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T12:04:00" ;
   m:foo "bat" .


[] a m:Like ;
   m:liker d:Ann ;
   m:liked d:Dan ;
   m:time "2018-03-04T12:06:00" ;
   m:foo "bam" .

Instead of making up specific identifiers for each Like, I made them blank nodes so that the RDF processing software will generate identifiers and keep track of them.
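If you do want stable identifiers, so that triples elsewhere can refer to a specific Like, the same pattern works with an explicit IRI; the d:like1 name here is made up for illustration:

@prefix d: <http://learningsparql.com/ns/data/> .
@prefix m: <http://learningsparql.com/ns/model/> .

# The same pattern with an explicit, made-up IRI instead of a blank node.
d:like1 a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann ;
        m:time  "2018-03-04T11:43:00" ;
        m:foo   "bar" .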

As to Barrasa’s use case of counting how many times Dan liked Ann, it’s pretty easy with SPARQL:

PREFIX d: <http://learningsparql.com/ns/data/> 
PREFIX m: <http://learningsparql.com/ns/model/>


SELECT (count(*) AS ?likeCount) WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann .
}

(This query would actually work with just the m:liker and m:liked triple patterns, as shown after the output below, but as with the example that I tweeted to Dan Brickley about, declaring your RDF resources as instances of classes can lay the groundwork for more efficient and readable queries.) Here is ARQ’s output for this query:

-------------
| likeCount |
=============
| 3         |
-------------
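For the record, here is the shorter version mentioned in the parenthetical above; against the data shown earlier it returns the same count:

PREFIX d: <http://learningsparql.com/ns/data/>
PREFIX m: <http://learningsparql.com/ns/model/>

SELECT (count(*) AS ?likeCount) WHERE {
  ?like m:liker d:Dan ;
        m:liked d:Ann .
}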

Let’s get a little fancier. Instead of counting all of Dan’s likes of Ann, we’ll just list the ones from before noon on March 4, sorted by their foo values:

PREFIX d: <http://learningsparql.com/ns/data/> 
PREFIX m: <http://learningsparql.com/ns/model/>


SELECT ?fooValue ?time WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann ;
        m:time ?time ;
        m:foo ?fooValue .
  FILTER (?time < "2018-03-04T12:00")
}
ORDER BY ?fooValue

And here is ARQ’s result for this query:

------------------------------------
| fooValue | time                  |
====================================
| "bar"    | "2018-03-04T11:43:00" |
| "baz"    | "2018-03-04T11:58:00" |
------------------------------------

After working through a similar example for modeling flights between New York and San Francisco, Barrasa begins a sentence “Because we can’t create such a simple model in RDF…” This is ironic; the RDF model is simpler than the Labeled Property Graph model, because it’s all subject-predicate-object triples without the use of additional data structures attached to the graph nodes and edges. His RDF version would have been much simpler if he had just created instances of a class called Flight, because again, while the base model of RDF is the simple triple, more complex models can easily be created by declaring classes, properties, and information about those classes and properties–which we can do by just creating new triples!
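I haven’t reproduced his flight data here, but the pattern would be the same as with the Likes above; a sketch with made-up property names and values might look like this:

@prefix d: <http://learningsparql.com/ns/data/> .
@prefix m: <http://learningsparql.com/ns/model/> .

# One flight, modeled as an instance of a (made-up) Flight class, with
# its details stored as plain triples about that instance.
[] a m:Flight ;
   m:from     d:NewYork ;
   m:to       d:SanFrancisco ;
   m:date     "2018-03-10" ;
   m:duration "PT6H30M" .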

To summarize, complaints about RDF that focus on reification are so 2004, and they are a red herring, because they distract from the greater power that RDF’s modeling abilities bring to application development.

A funny thing happened after writing all this, though. As part of my plans to look into Tinkerpop and Gremlin and potential connections to RDF as a next step, I was looking into Stardog and Blazegraph’s common support of both. I found a Blazegraph page called Reification Done Right where I learned of Olaf Hartig and Bryan Thompson’s 2014 paper Foundations of an Alternative Approach to Reification in RDF. If Blazegraph has implemented their ideas, then there is a lot of potential there. And if the Blazegraph folks brought this with them to Amazon Neptune, that would be even more interesting, although apparently that hasn’t shown up yet.
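As I understand the paper (and I haven’t tried any of this against Blazegraph yet), the core idea is to let a triple itself appear as the subject of other triples, with a syntax along these lines; this is my rough approximation, reusing the prefixes from the examples above:

@prefix d: <http://learningsparql.com/ns/data/> .
@prefix m: <http://learningsparql.com/ns/model/> .

# My approximation of the paper's embedded-triple syntax: the
# "Dan likes Ann" triple itself gets annotated with a time.
<< d:Dan m:likes d:Ann >> m:time "2018-03-04T11:43:00" .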