SPARQL queries of git repository data

If we're going to think of git data as a graph...

SPARQL and Git logos

Justin Dowdy recently created an open source project to convert the metadata in a git repository to RDF, and I’ve been having some fun with it. Before getting into the details, here is a brief demo: a SPARQL query that I wrote to list all of the 2019 commits in my misc GitHub repo:

PREFIX dcterms: <http://purl.org/dc/terms/> 
PREFIX wd:      <http://www.wikidata.org/entity/> 
PREFIX x:       <http://www.w3.org/2001/XMLSchema#>
PREFIX gist:    <https://ontologies.semanticarts.com/gist/> 

SELECT ?title ?dateTime WHERE {
  ?commit a wd:Q20058545 ;  # it's an instance of the commit class
            dcterms:subject ?subject ;
            gist:atDateTime ?dateTime . 
  ?subject  dcterms:title ?title .
  FILTER (?dateTime >= "2019-01-01T00:00:00"^^x:dateTime && 
          ?dateTime < "2020-01-01T00:00:00"^^x:dateTime)
}

It produced this result:

title                                     dateTime
-----                                     --------
adding sqlite rdf files                   2019-07-13T16:19:39-04:00
added tableList.scr                       2019-07-13T16:21:39-04:00
adding readme                             2019-07-28T12:00:55-04:00
added files to go with 2019-10 blog entry 2019-10-20T16:46:07-04:00
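
The same graph pattern also works for summaries. A minimal sketch, assuming nothing beyond the class and property used in the query above, that counts the commits in each year instead of listing a single year's commits:

```sparql
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>

SELECT ?year (COUNT(DISTINCT ?commit) AS ?commits) WHERE {
  ?commit a wd:Q20058545 ;           # same commit class as above
          gist:atDateTime ?dateTime .
}
GROUP BY (year(?dateTime) AS ?year)  # year() extracts the year from an xsd:dateTime
ORDER BY ?year
```

The ?year and ?commits variable names are only illustrative. COUNT(DISTINCT ...) guards against a commit being counted twice if it happens to carry more than one gist:atDateTime value in the same year.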

Justin’s software that makes this all possible is at https://github.com/justin2004/git_to_rdf.

Once I installed that software and created a /home/bob/temp/rdf directory, the following variation on the command line from Justin’s GitHub page read my local copy of the misc repo and put 35,353 triples about it in two files in /mnt/temp/rdf:

/home/bob/git/git_to_rdf/git_to_rdf.sh \
  --repository /mnt/git/misc  --output /mnt/temp/rdf

(Referencing /home/bob/temp/rdf as /mnt/temp/rdf is a Docker thing that I don’t completely understand myself. Justin said that he is working to simplify that.) I loaded the new triples into Jena Fuseki and tried a few of my “Queries to explore a dataset” that I typically use, which is how I found out that it had 35K triples.
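
Counting a dataset's triples takes only a short query. A generic sketch that should run against any SPARQL 1.1 endpoint, including Fuseki, with ?tripleCount being a variable name picked just for illustration:

```sparql
# Count every triple in the default graph.
SELECT (COUNT(*) AS ?tripleCount) WHERE {
  ?s ?p ?o .
}
```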

To really understand the possibilities, read Justin’s blog entry “Git Repositories as RDF Graphs”. I especially like how it explains that he didn’t necessarily have to make “thoughtful” RDF (well-modeled RDF that takes advantage of standard vocabularies), and why and how he did so anyway. His blog entry also includes a nice diagram of his data model, generated with RDFox, that you’ll want to keep handy while you develop any queries for your own git repo data converted to RDF.

Several of his sample queries will be especially useful for querying git repos that have commits from multiple people. He demonstrates these with RDF generated from the repo for the cURL utility that I have written about here many times. My misc repo that I used to generate RDF only has commits from me, so these sample queries were less useful to me, but they still provided a good model for how to get at certain kinds of repo information.

To build on what he wrote there, I wanted to create at least one more query that was different from his examples, so I wrote this one to find the commits that used blocks of text with the word “music” in them:

PREFIX wd:      <http://www.wikidata.org/entity/> 
PREFIX gist:    <https://ontologies.semanticarts.com/gist/> 
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?commitTitle ?commitTime ?filename ?textLine  WHERE {
  
  ?commit a wd:Q20058545 ; # it's a commit
          gist:hasPart ?part ;
          dcterms:subject ?commitSubject ;
          gist:atDateTime ?commitTime . 
  
  ?commitSubject dcterms:title ?commitTitle .
  
  ?part gist:produces  ?contiguousLines .
  
  ?contiguousLines gist:occursIn ?file ; 
                   <http://example.com/containedTextContainer> ?textContainer . 
  
  ?file gist:name ?filename .
  ?textContainer ?line ?textLine .
  
  FILTER(contains(?textLine,"music"))
}

And here is the result:

query result
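
The path from a commit to the names of the files it touched, as used in the query above, can also feed an aggregate. A sketch, assuming those properties connect commits and files the same way throughout the data, that ranks files by how many commits touched them:

```sparql
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>

SELECT ?filename (COUNT(DISTINCT ?commit) AS ?commitCount) WHERE {
  ?commit a wd:Q20058545 ;            # it's a commit
          gist:hasPart ?part .
  ?part gist:produces ?contiguousLines .
  ?contiguousLines gist:occursIn ?file .
  ?file gist:name ?filename .
}
GROUP BY ?filename
ORDER BY DESC(?commitCount)
```

Because it follows gist:produces, this only sees commits that are linked to a file through that property, so treat the counts as approximate until you have checked them against Justin's data model diagram.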

This combination of the world’s most popular version control system and the ability to manipulate metadata about what it contains could provide the basis for a Content Management System in the broader original sense of the term: something to manage the storage and workflow of multiple kinds of content for multiple kinds of publication media. (In recent years the term’s meaning has narrowed to mean “platform to help automate web publishing”.)

That’s just one of the possibilities. Read Justin’s blog entry and see what ideas it gives you!

