Really linking them, not doing ETL.

Lately I’ve been thinking about some aspects of RDF technology that I have taken for granted as basic building blocks of dataset design but that Knowledge Graph fans who are new to RDF may not be fully aware of—especially when they compare RDF to alternative ways to build knowledge graphs. A key building block is the ability to link independently created knowledge graphs.

For a little historical perspective: before Tim Berners-Lee invented the web, hypertext systems were all very closed systems. A Storyspace story (one of which I still own on a three-and-a-half-inch floppy disk) could not link to an Apple HyperCard “stack”, and a HyperCard stack could not link to a Storyspace story. The World Wide Web let any hypertext page anywhere in the world link to any other, and just look how far that has scaled.

Imagine that you and I want to create relational data and use it in the same SQL system. We can’t just go off and each define our own database schema and expect our two databases to work together. The design work must be coordinated so that our respective contributions are essentially designed as a single system. Otherwise, the data from your system must be read (Extracted from your system), converted to be compatible with my system (Transformed), and then Loaded into my system—a process known in the industry as ETL. If the data in your system later gets updated, my system’s users won’t know it until we repeat the whole process or invoke some custom ETL process to identify and retrieve the new parts.

This was never the case with independently designed web pages, because anyone’s page could link to anyone else’s web page, and it’s not the case with RDF knowledge graphs. If I make one available on the public Internet, you can connect yours to mine so that as your and my datasets evolve, the connections themselves can remain the same but you’ll gain the benefits of the updated datasets. If we’re using different identifiers to refer to the same things, a little modeling can be part of the connection to indicate which things are the same, and then you’re off and running using the two datasets as one knowledge graph.

The format of RDF graph node identifiers follows a published IETF standard. The identifiers themselves are universally unique (as with Java package names, they’re built off of domain names, which lets domain owners establish their own naming conventions), so a link from your data to one of my graph nodes will simply work. This was the “linked” part of Linked Data, and getting back to the once-revolutionary possibility of any hypertext document linking to any other hypertext document, it gives a better idea of what the “semantic web” was about: the world-wide linking of, not just documents, but (in more 2021 terminology) knowledge graphs, especially when modeling of the graph can be part of the graph itself.

(Just to whet your appetite, I’m going to demonstrate all of this below by linking a new graph of the Beatles’ favorite drinks and sports to a remote graph I made several years ago about who played what instruments on which songs.)

There are two basic steps to linking your knowledge graph to someone else’s. As with HTML documents, you don’t need any kind of permission or cooperation from anyone on the destination system to make the link if that destination is available via HTTP.

1. Either use the same resource identifiers that the graph you are linking to does or add some modeling that maps your identifiers to theirs.
2. Use a SPARQL query (and implicitly, the SPARQL protocol—the “P” in SPARQL that defines a standard way to transmit queries and results back and forth) to ask an endpoint for a graph subset meeting the conditions for your application.

The nodes of a graph need identifiers, and some non-RDF graph storage systems keep these under the covers to make your life simpler. If you can see these identifiers, though, both in your own graphs and in the graphs of others, it’s easier to identify nodes in the different graphs and create connections between them, joining their host graphs into a larger graph. (And RDF URI identifiers are no more difficult to read than URLs, which aren’t that difficult to read… unless you’re in SharePoint world, in which case you have my sympathy.) If you know that two graphs use different identifiers for the same resource, your own data model can assert that both identifiers reference the same resource—with the data model statements just being additional edges on your own graph—and then standards-compliant (often free!) software can take advantage of those assertions.

To quote Pascal Hitzler’s recent Communications of the ACM article A Review of the Semantic Web Field (which uses the abbreviation IRI for “Internationalized Resource Identifiers”, a superset of URIs that allows a broader range of character choices),

What is usually associated with the term “linked data” is that linked data consists of a (by now rather large) set of RDF graphs that are linked in the sense that many IRI identifiers in the graphs also appear in other, sometimes multiple, graphs. In a sense, the collection of all these linked RDF graphs can be understood as one very big RDF graph.

Retrieving remote data from a SPARQL endpoint

A SPARQL query can use the SERVICE keyword to request data from another graph via an endpoint. It can then combine the retrieved data with local data and use the combination as a larger graph with more helpful connections than the local data has. For example, let’s say that after reviewing the website The Beatles Interview Database you’ve compiled the following data about the Fab Four, and you have it stored locally in a file called BeatlesFaves.ttl:

@prefix b: <http://www.bobdc.com/ns/beatles/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Sources:
# http://www.beatlesinterviews.org/db1964.0614b.beatles.html
# http://www.beatlesinterviews.org/db1964.0906.beatles.html

wd:Q2632 rdfs:label "Ringo Starr" ;
    b:favoriteDrink "bourbon" ;
    b:favoriteBritishGroup "Rolling Stones" .

wd:Q1203 rdfs:label "John Lennon" ;
    b:favoriteBritishGroup "Rolling Stones" .

wd:Q2599 rdfs:label "Paul McCartney" ;
    b:favoriteBritishGroup "The Searchers" ;
    b:favoriteDrink "scotch and Coke" .

wd:Q2643 rdfs:label "George Harrison" ;
    b:favoriteBritishGroup "The Animals" .


This data uses the Wikidata identifiers to identify the individual Beatles so that the data will more easily integrate with other data that may use these identifiers—just as Pascal described above—such as Wikidata itself. (The Wikidata identifiers are easy to find; just look for “Wikidata item” on the left side of any Wikipedia page, such as Ringo’s.)

You have learned that several years ago some guy (OK, me) published an RDF graph of data at http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl about who played what instruments on which Beatles songs. (The creation of this dataset is described at SPARQL queries of Beatles recording sessions along with some fun queries.) The URIs in that dataset that identify the musicians are built from the musicians’ names instead of using Wikidata URIs. (There were so many musicians that I didn’t want to look all of them up in Wikidata manually, and some have names that are common enough that automating the lookup wouldn’t have worked too well.)

To show that wd:Q2632 from one graph is the same as m:RingoStarr from the other I created a triple using owl:sameAs. This predicate basically says “all facts about each of these two resources are true for the other one, so they are effectively the same resource”.

My use of an OWL predicate required me to use a SPARQL processor that could handle more than RDFS. (See Transforming data with inferencing and (partial!) schemas for examples of RDFS inferencing as part of a graph processing pipeline.) I only needed a little more than RDFS; “RDFS Plus” is a non-standard superset that adds owl:sameAs support and a few other useful OWL bits to a SPARQL processor without committing to a full implementation of one of the official OWL Profiles.

To get this owl:sameAs support I used the free version of the GraphDB triplestore, which I’ve also used recently because of its GeoSPARQL support. “RDFS Support” is something you select when creating a GraphDB repository, so I did that and unchecked GraphDB’s “Disable owl:SameAs” checkbox. (I’m guessing that this checkbox is available because overuse of owl:sameAs can use a lot of computing cycles.)

After loading the BeatlesFaves.ttl data above, I loaded the following mapToWikidata.ttl file:

@prefix m:   <http://learningsparql.com/ns/musician/> .
@prefix wd:  <http://www.wikidata.org/entity/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

wd:Q2632 owl:sameAs m:RingoStarr  .
wd:Q1203 owl:sameAs m:JohnLennon .
wd:Q2599 owl:sameAs m:PaulMcCartney .
wd:Q2643 owl:sameAs m:GeorgeHarrison .


After doing this, a query of this repository for all the triples showed statements like {m:GeorgeHarrison rdfs:label "George Harrison"}, which was not a triple in either of the loaded knowledge graphs but was inferred from the combination, so I knew I was all set.
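One simple way to run that kind of check yourself is an ASK query, which returns a boolean telling you whether any triples match the pattern. This sketch assumes the prefixes from the files above:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX m:    <http://learningsparql.com/ns/musician/>

# True only if inference has combined the loaded graphs: the label was
# asserted about wd:Q2643 in BeatlesFaves.ttl, and the m:GeorgeHarrison
# identifier is connected to it only by the owl:sameAs mapping.
ASK { m:GeorgeHarrison rdfs:label "George Harrison" }
```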

The SPARQL query

I could have read the http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl file into GraphDB just like I read in BeatlesFaves.ttl and mapToWikidata.ttl, but that would be the old-fashioned ETL approach, where querying across datasets is really a query of a single dataset created by copying them all into one place. What if the remote dataset got updated with the names of the cellists on “The Long and Winding Road”, which are currently not there? I would have to either identify the new triples added to the remote data or reload the whole remote dataset. Instead of reading in the entire remote dataset, I would rather read the data that I need from it dynamically at query time to make sure that I had the most recent data.

I can do this with SPARQL’s SERVICE keyword. This specifies the URL of a SPARQL endpoint and a query to send to it. The following query uses this keyword to find out the favorite British band of the bass player from “The Long and Winding Road”:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX i:    <http://learningsparql.com/ns/instrument/>
PREFIX s:    <http://learningsparql.com/ns/schema/>
PREFIX b:    <http://www.bobdc.com/ns/beatles/>

SELECT ?britishGroup
WHERE {
  ?bassist b:favoriteBritishGroup ?britishGroup .
  SERVICE <https://dydra.com/bobdc/beatles-musicians/sparql>
  { SELECT ?bassist
    WHERE { ?song a s:Song ;
                  rdfs:label "The Long And Winding Road" ;
                  i:bass ?bassist .
    }
  }
}


(Fun fact I had never noticed before: John plays bass on that.) For this demo, I stored the data on Dydra, which made it very easy to create a free account, upload data, make it available via a SPARQL endpoint, and then set user access levels to that data. The data can be maintained on Dydra easily enough, so that a call to a Dydra endpoint really is retrieval from a dynamic database.

The inner query above asks the remote data about the song’s bass player, binding the URI representing the bassist to the ?bassist variable. The outer query then asks for the favorite British group of this bass player, which turned out to be the Rolling Stones.

Note that the ?bassist variable will store the identifier http://learningsparql.com/ns/musician/JohnLennon and the locally-stored data says that the Stones were the favorite British band of resource wd:Q1203. That’s why I added a modeling triple wd:Q1203 owl:sameAs m:JohnLennon and used GraphDB, a triplestore that supports owl:sameAs as part of the RDFS superset that it supports. Remember, not all triplestores do, so that’s something to think about when planning an application.
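If your triplestore doesn’t offer owl:sameAs inferencing, you can still get the same answer by walking the mapping triples explicitly in the query itself. Here is a sketch of that approach, assuming the mapToWikidata.ttl triples are loaded alongside BeatlesFaves.ttl:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX i:    <http://learningsparql.com/ns/instrument/>
PREFIX s:    <http://learningsparql.com/ns/schema/>
PREFIX b:    <http://www.bobdc.com/ns/beatles/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT ?britishGroup
WHERE {
  # Follow the mapping triple as an ordinary graph edge: find the
  # local (Wikidata) identifier declared sameAs the remote one.
  ?localId owl:sameAs ?bassist ;
           b:favoriteBritishGroup ?britishGroup .
  SERVICE <https://dydra.com/bobdc/beatles-musicians/sparql>
  { SELECT ?bassist
    WHERE { ?song a s:Song ;
                  rdfs:label "The Long And Winding Road" ;
                  i:bass ?bassist .
    }
  }
}
```

The trade-off is that this only follows the sameAs edge in the one direction you wrote it, while a triplestore that supports owl:sameAs applies it symmetrically and transitively for you.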

This ability to send a subquery off to a remote system and then use the result locally is an important aspect of both the SPARQL query language and the SPARQL protocol, which has its own standardized specification. When you consider different systems that may play roles in building and using knowledge graphs, keep in mind that SPARQL’s mechanics for tying local and remote data together are both standardized and widely implemented.

(A background note: I had also planned to show another query that retrieved the recording session data from http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl using SPARQL’s FROM keyword. When I investigated why some SPARQL processors did retrieve remote RDF files specified by this keyword and some didn’t, I learned that as a security consideration this retrieval is not required.)
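For reference, that approach would have looked something like the following; whether the processor actually dereferences the URL named in the FROM clause is up to the implementation:

```sparql
PREFIX s: <http://learningsparql.com/ns/schema/>

# Some SPARQL processors will retrieve this file over HTTP and query
# it; the spec allows but does not require them to do so.
SELECT ?song
FROM <http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl>
WHERE { ?song a s:Song . }
```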

SPARQL’s ability to link together different RDF knowledge graphs—even when those graphs aren’t necessarily using the same identifiers to refer to the same resources—provides another huge benefit: it reduces the need for large complex schemas (typically, ontologies) to create useful knowledge graphs. Imagine that I create a small RDF knowledge graph that achieves certain goals, and then you create another that achieves different goals, and then a third person realizes that these two graphs are both related to the application that she is working on. Ideally, you and I would have each included a schema (which is just more triples!) listing the classes and properties we used; even small schemas would help people like this third person take advantage of our datasets. Whether we made schemas available or not, though, she can use the technique described above to connect the two graphs into a whole that is greater than the sum of its parts, growing into a larger knowledge graph the way that the collection of HTML pages available via HTTP has grown into the World Wide Web since 1993.