Last month, in The W3C standard constraint language for RDF: SHACL, I described the history of this new standard that lets us define constraints on RDF data and an open source tool that lets us identify where such constraints were violated. The presence of the standard and tools that let us implement the standard will be a big help to the use of RDF in production environments.
There’s a lot you can do with SHACL, enough that the full collection of features available and their infrastructure can appear a bit complicated. I wanted to create some simple constraints for some simple data and then use the shaclvalidate.sh tool to identify which parts of the data violated which constraints, and it went very nicely.
I started by going through a TopQuadrant tutorial that builds some SHACL exercises using their TopBraid Composer GUI tool (free edition available by selecting “Free Edition” from the “Product” field on the TopBraid Composer Installation page). Then, after I examined the triples that Composer generated when I followed the tutorial’s steps, I created my own new example called employees.ttl to run with shaclvalidate.sh. (To make my example as stripped-down as possible, I used a text editor for this, not Composer.) You can download my file right here; below I describe the file a few lines at a time to show what I was doing and how the pieces fit together.
I started off with declarations for prefixes, a class, and a few properties for that class:
@prefix hr: <http://learningsparql.com/ns/humanResources#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

#### Regular RDFS modeling ####

hr:Employee a rdfs:Class .

hr:name
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee .

hr:hireDate
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:date .

hr:jobGrade
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:integer .
There is nothing new and interesting there, but it’s worth reviewing why these declarations are useful: so that applications using instances of this class know more about it and can do more with it. For example, when generating a form to let users edit Employee instances, an application noting that hr:hireDate has an rdfs:range of xsd:date might provide a date-picking widget on the form instead of just providing a text field to fill out. (And, if the application sees an additional property for this class declared someday, it can automatically generate a field for the new property on the edit form, so that this model really is driving application behavior.) These rdfs:range values are not there so that an automated process can check whether that instance data conforms to these types, although some applications may have done that. This is the hole that SHACL fills, as we will see below.
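To make the form-generation idea concrete, here is a minimal Python sketch of an application choosing a widget from each property's rdfs:range. The widget names and the two dictionaries are hypothetical, invented only for illustration; a real application would read the ranges from the RDF model rather than hard-coding them.

```python
# Hypothetical mapping from rdfs:range values to form widgets; the widget
# names are made up for illustration.
WIDGET_FOR_RANGE = {
    "xsd:date": "date-picker",
    "xsd:integer": "number-field",
}

# The rdfs:range declarations from the model above (hr:name declares none).
PROPERTY_RANGES = {
    "hr:name": None,
    "hr:hireDate": "xsd:date",
    "hr:jobGrade": "xsd:integer",
}


def build_form(property_ranges):
    """Return a {property: widget} dict, falling back to a plain text field."""
    return {
        prop: WIDGET_FOR_RANGE.get(rng, "text-field")
        for prop, rng in property_ranges.items()
    }


form = build_form(PROPERTY_RANGES)
```

Note that nothing here checks any instance data: the ranges only steer how the form is built, which is the gap described above.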
#### Additional SHACL modeling ####

hr:Employee
   # Following two lines are an alternative to the line above
   #hr:EmployeeShape
   #   sh:targetClass hr:Employee ;
   a sh:NodeShape ;
   sh:property hr:nameShape ;
   sh:property hr:jobGradeShape .

hr:nameShape
   sh:path hr:name ;
   sh:datatype xsd:string ;
   sh:minCount 1 ;
   sh:maxCount 1 .

hr:jobGradeShape
   sh:path hr:jobGrade ;
   sh:datatype xsd:integer ;
   sh:minCount 1 ;
   sh:maxCount 1 ;
   sh:minInclusive 1;
   sh:maxInclusive 7 .
The SHACL vocabulary is associated here with the prefix sh:. Some of the best documentation of this vocabulary is right where it should be: in the rdfs:comment values of the class and property declarations in https://www.w3.org/ns/shacl.ttl. (As we’ll see, the spec itself is also a good place to find out what’s what.)
Above, we see that hr:Employee, which had already been declared to be an rdfs:Class, is also declared to be an sh:NodeShape. To quote a few of the shacl.ttl vocabulary file’s rdfs:comment values, “a shape is a collection of constraints that may be targeted for certain nodes,” “a node shape is a shape that specifies constraint [sic] that need to be met with respect to focus nodes,” and (quoting the spec this time) “an RDF term that is validated against a shape using the triples from a data graph is called a focus node.” So, declaring hr:Employee to also be a sh:NodeShape lets it serve as a collection of constraints for certain nodes.
Note the commented-out alternative lines after that first one. Instead of making the existing hr:Employee class also serve as a collection of constraints for instances of that class, we could declare a separate new class as an instance of sh:NodeShape (in the commented-out example, a new instance called hr:EmployeeShape) and go on to define the constraints there. How would the validator know that hr:EmployeeShape was storing constraints for the hr:Employee class? Because, as the last commented-out line shows, its sh:targetClass property would point to the hr:Employee class. (Thanks to my former TopQuadrant colleague Holger for helping me to understand how that works.)
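As a rough illustration of the two targeting options, the following Python sketch picks focus nodes from triples held as plain tuples. It is a toy under stated assumptions, not how shaclvalidate.sh works internally: the function names are invented, d:x1 and hr:Department are made-up data, and subclass reasoning and the other SHACL target types are left out.

```python
RDF_TYPE = "rdf:type"

# Instance data as (subject, predicate, object) tuples; d:x1 and
# hr:Department are made up to show a node that is not targeted.
DATA = [
    ("d:e1", RDF_TYPE, "hr:Employee"),
    ("d:e2", RDF_TYPE, "hr:Employee"),
    ("d:x1", RDF_TYPE, "hr:Department"),
]

# Option 1: the class itself doubles as the node shape.
SHAPES_IMPLICIT = [
    ("hr:Employee", RDF_TYPE, "rdfs:Class"),
    ("hr:Employee", RDF_TYPE, "sh:NodeShape"),
]

# Option 2: a separate shape points at the class with sh:targetClass.
SHAPES_EXPLICIT = [
    ("hr:EmployeeShape", RDF_TYPE, "sh:NodeShape"),
    ("hr:EmployeeShape", "sh:targetClass", "hr:Employee"),
]


def target_classes(shape, shape_triples):
    """Return the classes whose instances the shape applies to."""
    classes = {o for s, p, o in shape_triples
               if s == shape and p == "sh:targetClass"}
    types = {o for s, p, o in shape_triples if s == shape and p == RDF_TYPE}
    if "rdfs:Class" in types:
        # A shape that is also a class targets its own instances.
        classes.add(shape)
    return classes


def focus_nodes(shape, shape_triples, data_triples):
    """Return nodes typed with a target class (no subclass reasoning)."""
    classes = target_classes(shape, shape_triples)
    return sorted(s for s, p, o in data_triples
                  if p == RDF_TYPE and o in classes)


implicit = focus_nodes("hr:Employee", SHAPES_IMPLICIT, DATA)
explicit = focus_nodes("hr:EmployeeShape", SHAPES_EXPLICIT, DATA)
```

Both calls select the same two employees, which is the point: the two modeling styles differ in where the constraints live, not in which nodes get validated.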
After naming the place to store the constraints, we create some using the SHACL vocabulary’s sh:property property. The rdfs:comment for this property in shacl.ttl tells us that it “Links a shape to its property shapes.” In the SHACL files created by TopBraid Composer, it links to property shapes grouped together with blank nodes, but as you can see above, I pointed them at shapes for the Employee name and jobGrade properties that have their own URIs.
The hr:nameShape and hr:jobGradeShape property shapes above are pretty self-explanatory. To show that one value for each must be included with each instance of hr:Employee, I gave each an sh:minCount and a sh:maxCount value of 1. The property shapes also have data types specified, and unlike the use of the rdfs:range specifications for these properties above, these will be used for validation. For hr:jobGradeShape, I also added sh:minInclusive and sh:maxInclusive values to restrict any data values to be from 1 to 7.
The last part of employees.ttl has four instances of hr:Employee. The first meets all the defined constraints:
d:e1 a hr:Employee;
   hr:name "Barry Wom" ;
   hr:hireDate "2017-06-03" ;
   hr:jobGrade 6 .
When I comment out the other three instances and run shaclvalidate on the file, it gives me back a validation report, in the form of triples, about how everything is cool:
@prefix sh: <http://www.w3.org/ns/shacl#> .

[ a sh:ValidationReport ;
  sh:conforms true
] .
The next instance lacks the required hr:jobGrade value:
d:e2 a hr:Employee;
   hr:name "Ron Nasty" ;
   hr:hireDate "2017-08-11" .
After I uncommented this instance in employees.ttl, shaclvalidate told me this about it:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix hr: <http://learningsparql.com/ns/humanResources#> .

[ a sh:ValidationReport ;
  sh:conforms false ;
  sh:result [ a sh:ValidationResult ;
              sh:focusNode d:e2 ;
              sh:resultMessage "Less than 1 values" ;
              sh:resultPath hr:jobGrade ;
              sh:resultSeverity sh:Violation ;
              sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
              sh:sourceShape hr:jobGradeShape
            ]
] .
As I mentioned last month, returning these validation reports as triples makes it easier to plug the process into a larger automated workflow, and here we see that when constraints are violated, the triples include information to incorporate into that larger workflow: for example, to build a message to display in a pop-up message box. You could also query accumulated validation reports with SPARQL to identify patterns of what kinds of violations happened how often.
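Here is a small Python sketch of both uses: building a pop-up message from one result and counting violations by constraint component. Plain Python stands in for the SPARQL query, and the results are assumed to have already been pulled out of the report triples into dictionaries. The first entry mirrors the report above; the d:e5 and d:e6 entries, their messages, and the function names are hypothetical.

```python
from collections import Counter

# Results as dicts keyed by sh:ValidationResult property names. The first
# comes from the report above; d:e5 and d:e6 are made up for illustration.
RESULTS = [
    {"focusNode": "d:e2", "resultPath": "hr:jobGrade",
     "resultMessage": "Less than 1 values",
     "sourceConstraintComponent": "sh:MinCountConstraintComponent"},
    {"focusNode": "d:e5", "resultPath": "hr:name",
     "resultMessage": "Less than 1 values",
     "sourceConstraintComponent": "sh:MinCountConstraintComponent"},
    {"focusNode": "d:e6", "resultPath": "hr:name",
     "resultMessage": "Too many values (hypothetical message text)",
     "sourceConstraintComponent": "sh:MaxCountConstraintComponent"},
]


def popup_text(result):
    """Format one validation result as a single line for a message box."""
    return (f"{result['focusNode']}: {result['resultPath']}: "
            f"{result['resultMessage']}")


def tally(results):
    """Count how often each constraint component was violated."""
    return Counter(r["sourceConstraintComponent"] for r in results)


message = popup_text(RESULTS[0])
counts = tally(RESULTS)
```

With real reports loaded into a triplestore, the same tally would be a SPARQL query that groups sh:ValidationResult nodes by their sh:sourceConstraintComponent value and counts them.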
The third employee tests the SHACL validator’s ability to detect data type violations, because the hr:jobGrade value is not an integer:
d:e3 a hr:Employee;
   hr:name "Stig O'Hara" ;
   hr:hireDate "2017-03-14" ;
   hr:jobGrade 3.14 .
shaclvalidate does just fine with that:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix hr: <http://learningsparql.com/ns/humanResources#> .

[ a sh:ValidationReport ;
  sh:conforms false ;
  sh:result [ a sh:ValidationResult ;
              sh:focusNode d:e3 ;
              sh:resultMessage "Value does not have datatype xsd:integer" ;
              sh:resultPath hr:jobGrade ;
              sh:resultSeverity sh:Violation ;
              sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
              sh:sourceShape hr:jobGradeShape ;
              sh:value 3.14
            ]
] .
The last employee instance tests the SHACL validator’s ability to detect a value that falls outside of a specified range, because hr:jobGrade is greater than 7:
d:e4 a hr:Employee;
   hr:name "Dirk McQuickly" ;
   hr:hireDate "2017-01-08" ;
   hr:jobGrade 8 .
This isn’t a problem either:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix hr: <http://learningsparql.com/ns/humanResources#> .

[ a sh:ValidationReport ;
  sh:conforms false ;
  sh:result [ a sh:ValidationResult ;
              sh:focusNode d:e4 ;
              sh:resultMessage "Value is not <= 7" ;
              sh:resultPath hr:jobGrade ;
              sh:resultSeverity sh:Violation ;
              sh:sourceConstraintComponent sh:MaxInclusiveConstraintComponent ;
              sh:sourceShape hr:jobGradeShape ;
              sh:value 8
            ]
] .
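To see why each of the four employees passes or fails, the following toy checker in Python mimics the count, datatype, and range checks on plain dictionaries. It is only a sketch of the logic and an assumption about how such checks can be modeled, not the validator's actual implementation: Python types stand in for RDF datatypes, the component labels are shortened, and hr:hireDate is left out because no property shape covers it.

```python
# The two property shapes from employees.ttl, written as plain dicts.
PROPERTY_SHAPES = [
    {"path": "hr:name", "datatype": str, "minCount": 1, "maxCount": 1},
    {"path": "hr:jobGrade", "datatype": int, "minCount": 1, "maxCount": 1,
     "minInclusive": 1, "maxInclusive": 7},
]

# The four employees; str stands in for xsd:string, int for xsd:integer,
# and float for the decimal literal 3.14.
EMPLOYEES = {
    "d:e1": {"hr:name": ["Barry Wom"], "hr:jobGrade": [6]},
    "d:e2": {"hr:name": ["Ron Nasty"]},
    "d:e3": {"hr:name": ["Stig O'Hara"], "hr:jobGrade": [3.14]},
    "d:e4": {"hr:name": ["Dirk McQuickly"], "hr:jobGrade": [8]},
}


def check(node, properties, shapes):
    """Return (focus node, path, violated component) tuples for one node."""
    violations = []
    for shape in shapes:
        path = shape["path"]
        values = properties.get(path, [])
        if len(values) < shape.get("minCount", 0):
            violations.append((node, path, "MinCount"))
        if len(values) > shape.get("maxCount", float("inf")):
            violations.append((node, path, "MaxCount"))
        for value in values:
            if type(value) is not shape["datatype"]:
                violations.append((node, path, "Datatype"))
            # Range checks only make sense for numeric values.
            if isinstance(value, (int, float)):
                if "minInclusive" in shape and value < shape["minInclusive"]:
                    violations.append((node, path, "MinInclusive"))
                if "maxInclusive" in shape and value > shape["maxInclusive"]:
                    violations.append((node, path, "MaxInclusive"))
    return violations


report = [violation
          for node, props in EMPLOYEES.items()
          for violation in check(node, props, PROPERTY_SHAPES)]
```

The resulting list has one entry each for d:e2, d:e3, and d:e4 and none for d:e1, matching the focus nodes and constraint components in the validation reports above.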
I deliberately picked simple examples to see how difficult they would be to implement, and as with many powerful software systems, my only problem was navigating the detailed documentation of the architecture and many features to find the parts that I wanted.
What other built-in constraints are available besides sh:maxInclusive? See for yourself in section 4 of the spec: Core Constraint Components. (For a nice quick skim of the available constraints, just look through that section’s entries in the spec’s table of contents.)
If you’ve done much work with RDF, you’re going to enjoy this.