Taking some RDF beyond what it could do in a relational database

Part 2 of 2.

February 27, 2022

In my last posting I described Carnegie Mellon University’s Index of Digital Humanities Conferences project, which makes over 60 years of Digital Humanities research abstracts and relevant metadata available both on the project’s website and as a zipped file of CSV that they update often. I also described how I developed scripts to convert all that CSV to some pretty nice RDF and made the scripts available on GitHub. I finished with a promise to follow up by showing some of the things we can do with RDF versions of this data that we can’t do (or at least, can’t do nearly as easily) with the relational version. And here we are.

Easier addition of new properties that only apply to a few instances of some classes

What if you want to store additional data about the abstracts, conferences, or authors? For example, what if you want to store the hashtags associated with the conferences? The Chesapeake Digital Humanities Consortium 2020 conference (<http://rdfdata.org/dha/conference/i170> in my RDF data) has a dha:url value of https://chesapeakedh.github.io/conference-2020. That’s the conference home page, and if I go there I see that the conference hashtag is #CDHC20. When I’m at a conference (or not there and wishing that I was), Twitter searches for the conference’s hashtag can tell me about interesting things that are going on or about to go on. This means that a Twitter hashtag is a hook to additional information about the conference, as you can see with a search on #CDHC20.

Let’s say that you could only find hashtags for 15% of the conferences. If you were storing the full dataset in relational tables, would it be worth adding a new column to the conferences table to store a value that would be blank for 85% of the rows? In this particular case, it’s not even up to me. I would have to convince the team at Carnegie Mellon to add this column to their conferences table and populate it.

With RDF, I don’t have to worry about any of this. I can create the data when I have it as more triples like this:

<http://rdfdata.org/dha/conference/i170> dha:hashtag "#CDHC20" .

(RDF geek note: Instead of storing the hashtag as a literal string value, I was tempted to store it as the URL for the Twitter search, because resource URIs as objects can then link to other resources. I left it as a string value because the same hashtag might be used with other social media such as Instagram.)
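To make the contrast with a mostly empty table column concrete, here is a minimal Python sketch that models triples as plain tuples rather than using an RDF library. Only the i170 URI, its dha:url value, and its hashtag come from the data above; the i999 conference and its URL are hypothetical placeholders.

```python
# A toy model: triples as (subject, predicate, object) tuples.
# Only the i170 values come from the post; i999 is a hypothetical conference.
CONF = "http://rdfdata.org/dha/conference/"
DHA = "http://rdfdata.org/dha/ns/dh-abstracts/"

triples = {
    (CONF + "i170", DHA + "url", "https://chesapeakedh.github.io/conference-2020"),
    (CONF + "i999", DHA + "url", "https://example.org/hypothetical-conference"),
}

# Adding a new property touches only the subject we have data for.
triples.add((CONF + "i170", DHA + "hashtag", "#CDHC20"))


def values(subject, predicate):
    """Return every object stored for a subject/predicate pair."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


print(values(CONF + "i170", DHA + "hashtag"))
print(values(CONF + "i999", DHA + "hashtag"))
```

The conference without a known hashtag simply has no such triple, so there is nothing to leave blank and no schema to change.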

Linking to other data sets out there (Linked Data!)

I can also add triples of data that enrich the metadata stored with the project. For example, the RDF I created shows that seven works have a keyword value of http://rdfdata.org/dha/keyword/i6995, which has the label “TEI”. Wikipedia tells us that the Text Encoding Initiative is “a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s”. They’ve been putting classic works of literature, along with copious metadata, into XML ever since XML was a four-letter word.

If the Text Encoding Initiative has a Wikipedia page, then it also has triples in Wikidata. These show the project’s Twitter handle, its Library of Congress authority ID, its home page, and much more. Just as I added the hashtag value for the Chesapeake Digital Humanities conference above with a triple, I can add another triple that connects the Index of Digital Humanities Conferences URI for TEI to all that great information about it in Wikidata:

<http://rdfdata.org/dha/keyword/i6995> dha:wikidata <http://www.wikidata.org/entity/Q780920> .

This makes the available metadata about the seven Digital Humanities Conferences works tagged this way much richer.

Easy federation and integration of new data

This goal blurs a bit with “Linking to other data sets out there” described above, because if you can link to a dataset with a SPARQL endpoint such as Wikidata then you can send it a CONSTRUCT query and retrieve data from it to store with your local data. The “Using standards instead of ad-hoc namespaces” section of part one of this blog entry was another step toward this kind of integration, because much of the point of using shared vocabularies is the ability to connect your data to other datasets that use the same vocabularies.
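As a rough illustration of that CONSTRUCT idea, the following Python sketch builds a query that asks for every triple whose subject is the Wikidata entity for TEI, and wraps it in a GET URL for the public Wikidata endpoint. The function names are my own for this example, the sketch sends nothing over the network, and a real request would also need to choose a result format.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"


def construct_query(entity_uri):
    """Build a CONSTRUCT query that copies every triple whose subject is entity_uri."""
    return "CONSTRUCT { <%s> ?p ?o } WHERE { <%s> ?p ?o }" % (entity_uri, entity_uri)


def request_url(entity_uri):
    """Return a GET URL for the query; nothing is sent over the network here."""
    return ENDPOINT + "?" + urlencode({"query": construct_query(entity_uri)})


tei = "http://www.wikidata.org/entity/Q780920"
print(construct_query(tei))
print(request_url(tei))
```

The triples that come back could then be loaded alongside the local conference data, where the dha:wikidata link ties them to the TEI keyword.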

Other data sources offer interesting potential connections to the Digital Humanities conference data. One is the Virtual International Authority File, or VIAF. This has some fairly official data about authors and their works that you can retrieve in RDF. Author names may not always be completely unique, but looking at this data I realized that many authors are self-disambiguating. If your name is “John Smith”, you know that many other authors have that name, and your middle name is Francis, you may choose to use “John Francis Smith” or some variation such as “J. Frank Smith” or “Jack F. Smith” as your author name to make it easier for people to find the work that you wrote.

The RDF that my script generated from the Carnegie Mellon data included this in appellations.ttl:

<http://rdfdata.org/dha/appellation/i13>
        rdf:type        dha:Appellation ;
        dha:id          "13" ;
        dha:first_name  "A. Charles" ;
        dha:last_name   "Muller" .

VIAF has A. Charles Muller at https://viaf.org/viaf/117299466/#Muller,_A._Charles,_1953-, with 117299466 being their database’s unique identifier for this author. We can use that identifier to create the URL https://viaf.org/viaf/117299466/rdf.xml and then download 111 triples about him. We can also download various versions of the entire VIAF dataset, but that is too many gigabytes for me to do some quick experiments with. If the full dataset were loaded into a triple store, a SPARQL query that concatenates the dha:first_name and dha:last_name values above could help to automate the connection of conference paper authors to VIAF records.
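The two mechanical steps here, building the download URL from a VIAF identifier and assembling candidate name strings from the appellation values, can be sketched in a few lines of Python. The function names are hypothetical, and whether either name form matches what a VIAF record actually stores is an assumption that would need checking against the data.

```python
def viaf_rdf_url(viaf_id):
    """Build the RDF/XML download URL for a VIAF identifier, following the URL pattern in the text."""
    return "https://viaf.org/viaf/%s/rdf.xml" % viaf_id


def display_names(first_name, last_name):
    """Return the name in 'First Last' and 'Last, First' order as candidate match strings."""
    return ["%s %s" % (first_name, last_name), "%s, %s" % (last_name, first_name)]


# Values copied from the appellation i13 triples shown above.
appellation = {"id": "13", "first_name": "A. Charles", "last_name": "Muller"}

print(viaf_rdf_url("117299466"))
print(display_names(appellation["first_name"], appellation["last_name"]))
```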

Inferencing: finding new facts and connections

Authors of the conference papers made up their own keywords to assign to their works instead of selecting from a curated taxonomy, so it’s one big flat list. I did a little curation myself to give the list some hierarchy that would make it easier to find relationships between more relevant papers.

Over two dozen of the keywords were some variation on “TEI” or “Text Encoding Initiative”. In my GitHub project’s newrdf directory I added some triples to the SKOS scheme I described in part one called keywordScheme. The modelTriples.ttl file in that directory begins like this:

@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix dha:   <http://rdfdata.org/dha/ns/dh-abstracts/> .
@prefix dhak:  <http://rdfdata.org/dha/keyword/> .

dhak:r10001 a               skos:Concept ;
            skos:inScheme   dha:keywordScheme ;
            skos:prefLabel  "Text Encoding Initiative (TEI)" .

dhak:i1100 skos:broader dhak:r10001 . # generated tei
dhak:i2639 skos:broader dhak:r10001 . # tei and structural markup
dhak:i2641 skos:broader dhak:r10001 . # tei encoding
dhak:i2642 skos:broader dhak:r10001 . # tei markup

First, it defines a new SKOS concept called “Text Encoding Initiative (TEI)”. The triples after that say that each of the relevant SKOS concepts generated from the Carnegie Mellon CSV by my automated conversion has this new concept as its skos:broader value, just as “dachshund” in an animal taxonomy might have a broader value of “dog” to group together the different breeds. After the dhak:i2642 triple shown above there are 22 more about other TEI-related keywords. (I was tempted to automate the creation of all of these by looking for a substring of “tei” in the generated keyword concepts, but existing keywords like “Wittgenstein” and “Frankenstein” showed me that this was a bad idea.)
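The substring problem is easy to reproduce. This Python sketch compares a naive substring test with a whole-word regular expression on four of the labels mentioned above. It is not how I built the file, which I curated by hand, and a whole-word match would still miss labels that spell out “Text Encoding Initiative”, so it could only ever produce candidates for review.

```python
import re

# Whole-word match: "tei" must not be embedded in a longer word.
TEI_WORD = re.compile(r"\btei\b", re.IGNORECASE)

labels = ["generated tei", "tei markup", "Wittgenstein", "Frankenstein"]

# Naive substring test: also catches the "tei" inside "...stein".
naive = [label for label in labels if "tei" in label.lower()]

# Word-boundary test: keeps only labels where "tei" stands alone.
whole_word = [label for label in labels if TEI_WORD.search(label)]

print(naive)
print(whole_word)
```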

The git repository where I stored all the files for this conversion project has a readme file that shows some queries demonstrating the value added by this additional data modeling of the otherwise flat keyword list. A SPARQL query for all the works tagged “tei” retrieves a list of 90 of them. A query for all works tagged with something in the taxonomic subtree of “Text Encoding Initiative (TEI)” finds 132, so adding a little bit of semantics in the form of explicit relationships between related topics made it possible to find more papers about the TEI. A third query in the readme counts how many TEI-related papers were submitted each year, for results that could be turned into a chart of the TEI’s popularity at these conferences over time.

The “inferencing” here is the deduction, based on the little bit of modeling that I did, of connections that were not otherwise explicit between resources described by the dataset.
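The logic behind the subtree query and the per-year count can be sketched in plain Python. The skos:broader links below are the four from the modelTriples.ttl excerpt. The works, years, and the i9999 keyword are invented for illustration, and this is a stand-in for the readme's SPARQL queries, not a copy of them. In SPARQL, a property path such as skos:broader* expresses the same upward walk.

```python
from collections import Counter

# skos:broader links from the modelTriples.ttl excerpt (child -> parent).
broader = {"i1100": "r10001", "i2639": "r10001", "i2641": "r10001", "i2642": "r10001"}

# Hypothetical (work, year, keyword) tags, invented for illustration.
tags = [
    ("w1", 2001, "i1100"),
    ("w2", 2001, "i2641"),
    ("w3", 2005, "r10001"),
    ("w4", 2005, "i9999"),  # an unrelated keyword
]


def in_subtree(keyword, root):
    """Follow broader links upward; True if we reach root (or start there)."""
    seen = set()
    while keyword is not None and keyword not in seen:
        if keyword == root:
            return True
        seen.add(keyword)
        keyword = broader.get(keyword)
    return False


# Distinct (work, year) pairs tagged anywhere under the TEI concept.
tei_pairs = {(work, year) for work, year, kw in tags if in_subtree(kw, "r10001")}
tei_works = sorted(work for work, _ in tei_pairs)
per_year = Counter(year for _, year in tei_pairs)

print(tei_works)
print(sorted(per_year.items()))
```

Works w1 and w2 are found only because of the added broader links; without them, a search on the parent concept alone would return just w3.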

The triples in modelTriples.ttl that enable this, like the RDF triples about conference hashtags, demonstrate how RDF lets someone add value to a dataset that is outside of their control. As long as the id values in the original database keep identifying the same things, we can turn them into URIs that let us connect new kinds of data to the original dataset. It’s another great example of the new possibilities that become available when you use RDF to store your data.

