Hidden gems included with Jena’s command line utilities

Lots of ways to manipulate your RDF from the open-source multiplatform tool kit

On page 5 of my book Learning SPARQL I described how the open source RDF processing framework Apache Jena includes command line utilities called arq and sparql that let you run SPARQL queries with a simple command line like this:

arq --data mydata.ttl --query myquery.rq

At the time, arq supported some SPARQL extensions that sparql didn’t; I don’t even remember what they were, and I tended to use arq just because the name is shorter. I have since learned that, with support for those extensions added to sparql, there is now no particular difference between the two.

Jena (which recently celebrated release 4.0.0) includes Linux and Windows versions of many other utilities in addition to arq and sparql. I’ve mentioned several here when I used one or another to accomplish a particular task, and I thought it would be nice to summarize some of the ones that I have and have not mentioned before. I may be repeating some earlier explanations, but it should be handy to have them in one place.

You’ll find Linux utilities such as arq and shacl in Jena’s bin directory and corresponding Windows utilities such as arq.bat and shacl.bat in its bat directory.

Remember that, like arq and sparql, many of these support additional command line parameters beyond the ones I show here. Use --help with each to find out more. I’ve tried to demonstrate what I found most useful about each.

You can find more background about some of these utilities on the Jena documentation pages ARQ - Command Line Applications (which covers more than just arq) and the “Command line tools” section of the Reading and Writing RDF in Apache Jena page.

And thanks to Andy Seaborne for reviewing a draft of this!

rdfdiff

Use the rdfdiff utility to compare two dataset files. It’s like the venerable UNIX command diff, except that it looks for different triples instead of lines. The order of the input triples doesn’t matter to rdfdiff, and it can compare data files in different serializations. For example, here is a little RDF/XML file:

<!-- joereceiving.rdf -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:d="http://whatever/" > 
  <rdf:Description rdf:about="http://whatever/emp3">
    <d:dept>receiving</d:dept>
    <d:name>joe</d:name>
    <d:insurance rdf:resource="http://www.uhc.com"/>
  </rdf:Description>
</rdf:RDF>

Here is a Turtle file with roughly the same information:

# joereceiving.ttl

@prefix w: <http://whatever/> .

w:emp3 w:name "Joseph" ;
       w:dept "receiving" ;
       w:insurance <http://www.uhc.com> .

I ran this command to compare the two, also including the names of their formats:

rdfdiff joereceiving.rdf joereceiving.ttl RDF/XML TURTLE

I got this output:

< [http://whatever/emp3, http://whatever/name, "joe"]
> [http://whatever/emp3, http://whatever/name, "Joseph"]

Like the text file comparison utility diff, the report uses < as a prefix to show you what was in the first file but not the second and > to show you what was in the second but not the first.

As with many other Jena utilities, you can use the URL of a remote file instead of the name of a local file for either or both of the first two arguments.
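
As a quick example, either (or both) of the first two arguments could be a URL; this one is just a placeholder:

rdfdiff http://example.com/joereceiving.rdf joereceiving.ttl RDF/XML TURTLE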

shacl

In Validating RDF data with SHACL I described how to use an open source tool developed by TopQuadrant to validate RDF data against constraints described using the W3C SHACL standard. Jena includes a shacl utility that does the same kind of validation, and when I ran it with the employees.ttl file linked from that blog entry, all of the examples described there worked with Jena’s shacl as well.

Because the employees.ttl file had class definitions, instance data, and SHACL shapes all defined within that one file, I passed that filename as both the --data and --shapes parameter when I ran this command line tool:

shacl validate --data employees.ttl --shapes employees.ttl

It found all of my test constraint violations:

  • After I uncommented the data’s e2 example, shacl reported that it was missing the required hr:jobGrade value.
  • After I uncommented the e3 example, it reported that its hr:jobGrade value was not an integer.
  • After I uncommented the e4 example, it reported that its hr:jobGrade value fell outside the allowed range.

As the SHACL specification requires, the validation reports produced by shacl were themselves sets of triples, whether it found violations or not. This makes it easier to fit the tool into an RDF processing pipeline.
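
For example, when nothing is wrong, the report boils down to a little graph along these lines (a sketch using the standard SHACL report vocabulary, not necessarily the tool’s exact output):

@prefix sh: <http://www.w3.org/ns/shacl#> .

[] a sh:ValidationReport ;
   sh:conforms true .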

Adding -v for “verbose” after shacl validate in that command line adds additional information to the output.
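
In other words, the verbose version of the command above is:

shacl validate -v --data employees.ttl --shapes employees.ttl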

The utility’s print option outputs the shapes in the file. It can do this as regular RDF, compact SHACL syntax (surprisingly useful if you have a lot of shapes), or the default: a simple text representation.

shacl print --out=RDF employees.ttl     # out=RDF, compact, or text

qparse and uparse

The qparse utility parses a query and can do various things with it, as described by its --help option. I recently learned that it can pretty-print queries, so if the spacing and indentation of a query that you’re trying to read is a mess, qparse can clean it up, capitalize keywords, and even add line numbers.

Here is a sloppily formatted little query:

# namedept.rq
prefix w: <http://whatever/> Select
* WHERE { ?s w:name ?name . optiONAL {       ?s w:dept ?dept } }

I run this command,

qparse --query namedept.rq

and I get this output:

PREFIX  w:    <http://whatever/>

SELECT  *
WHERE
  { ?s  w:name  ?name
    OPTIONAL
      { ?s  w:dept  ?dept }
  }

Adding --num to the command line would add line numbers to the output.
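
So, to see a numbered version of the output above:

qparse --num --query namedept.rq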

The uparse utility can do the same thing for update queries. The following pretty-prints the file updatetest.ru:

uparse --file=updatetest.ru

Further documentation about both commands is available in the Jena documentation.

rsparql

This sends a local query to a SPARQL endpoint specified with a URL. I would typically use curl for this, but after reviewing rsparql’s --help output I see that it makes it easier to specify that you want the results in text, XML, JSON, CSV, or TSV. When sending a SPARQL query with curl, you can’t assume that the endpoint supports all of these result formats, and you probably have to look up their MIME types, because I certainly haven’t memorized them.

The following sends the SPARQL query in the 5triples.rq file to the Wikidata endpoint and then outputs the results at the command line:

rsparql --query 5triples.rq --service=https://query.wikidata.org/sparql
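
If you want to try this yourself, any small query will do in place of my 5triples.rq; for example:

# A stand-in query: grab any five triples
SELECT * WHERE { ?s ?p ?o } LIMIT 5

And, assuming rsparql follows arq’s convention here (its --help will confirm), --results picks the output format:

rsparql --query 5triples.rq --service=https://query.wikidata.org/sparql --results=CSV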

rupdate

This sends a local update request to a SPARQL endpoint specified with a URL. It will have to be one where you have update permission, which may well be a locally running copy of Fuseki. The following executes the update request stored in updatetest.ru on the test1 dataset in the locally running copy of Fuseki (assuming that fuseki-server was started up with the --update parameter, as described below):

rupdate --service=http://localhost:3030/test1 --update=updatetest.ru

rdfparse

This parses an RDF/XML document. People don’t use RDF/XML much anymore, and with good reason, but if you find any RDF/XML this is a simple way to convert it. The riot utility, described below, is even better, but I especially like the -R switch available with rdfparse; this tells it to search through an arbitrary XML document and extract any triples stored within embedded rdf:RDF elements. That can be great for processing some RDF that was embedded into XML before JSON-LD or even RDFa were around. Here’s a nice arbitrary XML document that I called xproduct1.xml:

<myDoc>

  <header><whatev/></header>

  <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:d="http://whatever/" > 
    <rdf:Description rdf:about="http://whatever/emp1">
      <d:dept>shipping</d:dept>
      <d:name>jane</d:name>
    </rdf:Description>
  </rdf:RDF>

  <arbitraryElement/>

  <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:d="http://whatever/" > 
    <rdf:Description rdf:about="http://whatever/emp3">
      <d:dept>receiving</d:dept>
      <d:name>joe</d:name>
    </rdf:Description>
  </rdf:RDF>

</myDoc>

I run the following command,

rdfparse -R xproduct1.xml 

and it produces this nice ntriples output:

<http://whatever/emp1> <http://whatever/dept> "shipping" .
<http://whatever/emp1> <http://whatever/name> "jane" .
<http://whatever/emp3> <http://whatever/dept> "receiving" .
<http://whatever/emp3> <http://whatever/name> "joe" .

Working with Fuseki datasets from the command line

Jena includes several utilities that let you work with datasets created using Jena’s Fuseki SPARQL server. Their ability to load and update data can be very helpful in an automated system that uses Fuseki as its backend data store.

To create some of this data to test with, I used the following command to start up Fuseki in a mode that would allow updates to data that it was storing:

fuseki-server --update

When you go to Fuseki’s GUI interface at http://localhost:3030 and tell it that you want to create a new dataset, you have to choose from three types of dataset: in-memory ones that will not persist from session to session, “Persistent” ones that use the older TDB format, and “Persistent (TDB2)” ones that use the more advanced TDB2 format. For my examples below I just created TDB2 datasets. TDB versions of the commands are also included with Jena, but if you’re creating a new dataset, you may as well use TDB2.

Most of these utilities expect you to specify a path to an assembler file that tells them which Fuseki dataset to operate on. I never tried making my way through the Jena Assembler howto documentation, but I recently noticed that Fuseki creates assembler files for us, so I don’t have to worry about their structure and syntax. When I used Fuseki’s GUI to create a TDB2 dataset called test1, Fuseki created the assembler file apache-jena-fuseki/run/configuration/test1.ttl, so I knew where to point the command line utilities.
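
For the curious, the core of the assembler file that Fuseki generates is a TDB2 dataset description roughly along these lines (an abridged sketch, not the exact generated contents; the real test1.ttl also describes the Fuseki service itself, so look at your own generated copy rather than reusing this):

@prefix tdb2: <http://jena.apache.org/2016/tdb#> .

<#dataset> a tdb2:DatasetTDB2 ;
    tdb2:location "run/databases/test1" .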

These command line tools won’t work with the Fuseki datasets if you have Fuseki running, because Fuseki locks the files. My examples below assume that I have created the test1 dataset described above, used the web-based interface to upload data to it (although, as we’ll see, this can be done with command line tools as well), and then shut down the Fuseki server.

Additional information about these commands is available at TDB2 - Command Line Tools.

Dumping dataset contents

The following command showed me the contents of that TDB2 dataset at the command line:

tdb2.tdbdump --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl 

Querying a Fuseki dataset

With a SPARQL query stored in myquery.rq, this command queries the test1 dataset and outputs the results at the command line:

tdb2.tdbquery --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl --query myquery.rq

Setting the output format works the same way as with arq. Run tdb2.tdbquery --help to find out more.
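
For example, assuming that tdb2.tdbquery shares arq’s --results option (its --help will confirm), this returns the results as CSV:

tdb2.tdbquery --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl --query myquery.rq --results=CSV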

Updating a Fuseki dataset

With the file updatetest.ru storing a SPARQL INSERT update request that inserts a single triple, the following command didn’t show anything at the command line,

tdb2.tdbupdate --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl --update updatetest.ru

but when I restarted the Fuseki server and used the web-based interface to query dataset test1 for all of its triples, I saw the triple inserted by the updatetest.ru request alongside the triples that had been there before.
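
For reference, an update request that inserts a single triple, as updatetest.ru does, looks something like this (the subject and values here are just made up):

# An INSERT DATA request that adds one triple
PREFIX w: <http://whatever/>
INSERT DATA { w:emp4 w:dept "shipping" . }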

Loading a data file into a Fuseki dataset

The following loaded the triples in the file furniture.ttl into the test1 dataset (which I confirmed the same way I did with my previous example) and displayed some status messages:

tdb2.tdbloader --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl furniture.ttl

It’s best to make sure that there are no parsing problems with the file you load before you load it. A quick way to do that is with the --validate parameter of the riot command:

riot --validate furniture.ttl

Other command line utilities for Fuseki datasets

The following commands all work on the dataset whose assembler file you point to with the --tdb parameter:

  • tdb2.tdbstats outputs a LISPy set of parenthesized expressions telling you about the dataset.

  • tdb2.tdbbackup creates a gzipped copy of the dataset’s triples.

  • I tried tdb2.tdbcompact and got a status message of “Compacted in 0.570s”; someday I’ll try this with a larger dataset to really investigate the effect.
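
Following the same pattern as the other TDB2 commands above, example invocations look like this:

tdb2.tdbstats   --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl
tdb2.tdbbackup  --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl
tdb2.tdbcompact --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl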

riot

Jena includes many command line utilities that I won’t describe here because riot (“RDF I/O Technology”) combines them all into one utility that I have been using more and more lately. I mentioned in Pulling Turtle RDF triples from the Google Knowledge Graph how it can accept triples via standard input, which was great for the use case that I described there of converting Google Knowledge Graph JSON-LD to Turtle triples on the fly.

We’ve already seen another nice use of riot above: validating a file of triples before loading it into a dataset stored on a server.

Converting serializations

To simply convert an RDF file from one serialization to another, use the riot --output parameter to name the new serialization:

riot --output=JSONLD emps.ttl

The Jena utilities nquads, ntriples, rdfxml, trig, and turtle are all specialized versions of riot that produce the named serializations with no need for an --output parameter.
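
For example, this writes an N-Triples version of the same emps.ttl file that the riot command above converted to JSON-LD:

ntriples emps.ttl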

Counting triples

When I want to know how many triples are in a Turtle file, here’s what I usually do:

  1. Look around my hard disk for a query file that uses COUNT to count all the triples.
  2. Give up looking.
  3. Look up the COUNT syntax in my book “Learning SPARQL”.
  4. Write another query file for counting all the triples.

Now I can just use riot with this simple command line:

riot --count furniture.ttl

It also works with quads.

Concatenating

Jena includes an rdfcat utility that outputs the concatenated contents of any data files listed on its command line. First, though, it outputs a header that says “DEPRECATED: Please use ‘riot’ instead”. Providing multiple data file names as arguments when running riot (I think I just made another pun on the name) will by default output an ntriples version of their concatenated triples, with status messages showing where each one starts. Adding --quiet suppresses the status messages, and --output lets you specify a different output serialization.
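
Putting those pieces together, a quiet concatenation into a single Turtle file looks something like this (the file names are just placeholders):

riot --quiet --output=Turtle file1.ttl file2.nt > combined.ttl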

Inferencing

Jena includes an infer utility that does inferencing from an RDFS model, but I no longer bother with it because riot can do this as well. The following little RDFS model shows that two properties from the Oracle and Microsoft sample relational databases are subproperties of similar schema.org properties:

# empmodel.ttl
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix schema:   <http://schema.org/> . 
@prefix oraclehr: <http://snee.com/vocab/schema/OracleHR#> .
@prefix nw:       <http://snee.com/vocab/schema/SQLServerNorthwind#> .

oraclehr:employees_first_name rdfs:subPropertyOf schema:givenName  . 
oraclehr:employees_last_name  rdfs:subPropertyOf schema:familyName . 
nw:employees_FirstName        rdfs:subPropertyOf schema:givenName  . 
nw:employees_LastName         rdfs:subPropertyOf schema:familyName . 

Here is some data using the Oracle and Microsoft properties:

# emps.ttl
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix schema:   <http://schema.org/> . 
@prefix oraclehr: <http://snee.com/vocab/schema/OracleHR#> .
@prefix nw:       <http://snee.com/vocab/schema/SQLServerNorthwind#> .

oraclehr:employees_100 oraclehr:employees_last_name "King" ;
    oraclehr:employees_first_name "Steven" .

nw:employees_2 nw:employees_LastName "Fuller" ;
    nw:employees_FirstName "Andrew" . 

This command tells riot to do inferencing on emps.ttl using the RDFS modeling in empmodel.ttl:

riot --rdfs empmodel.ttl emps.ttl

And here is the ntriples result, with some whitespace added for readability:

<http://snee.com/vocab/schema/OracleHR#employees_100>
  <http://snee.com/vocab/schema/OracleHR#employees_last_name> "King" .
  
<http://snee.com/vocab/schema/OracleHR#employees_100>
  <http://schema.org/familyName> "King" .
  
<http://snee.com/vocab/schema/OracleHR#employees_100>
  <http://snee.com/vocab/schema/OracleHR#employees_first_name> "Steven" .
  
<http://snee.com/vocab/schema/OracleHR#employees_100>
  <http://schema.org/givenName> "Steven" .
  
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
  <http://snee.com/vocab/schema/SQLServerNorthwind#employees_LastName> "Fuller" .
  
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
  <http://schema.org/familyName> "Fuller" .
  
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
  <http://snee.com/vocab/schema/SQLServerNorthwind#employees_FirstName> "Andrew" .
  
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
  <http://schema.org/givenName> "Andrew" .

The new triples show that these employees have schema.org properties in addition to the original OracleHR and Northwind properties. This ability makes this kind of inferencing great for data integration, as I described in Driving Hadoop data integration with standards-based models instead of code. (In that post I used the Python library rdflib to do the same kind of inferencing, but that’s the beauty of standards: having a choice of tools to implement the same expected behavior.)