Validating RDF data with SHACL

Setting some constraints--then violating them!


Last month, in "The W3C standard constraint language for RDF: SHACL," I described the history of this new standard, which lets us define constraints on RDF data, and an open source tool that lets us identify where such constraints were violated. The availability of the standard, and of tools that implement it, will be a big help to the use of RDF in production environments.

There’s a lot you can do with SHACL–enough that the full collection of features available and their infrastructure can appear a bit complicated. I wanted to create some simple constraints for some simple data and then use the shaclvalidate.sh tool to identify which parts of the data violated which constraints, and it went very nicely.

I started by going through a TopQuadrant tutorial that builds some SHACL exercises using their TopBraid Composer GUI tool (free edition available by selecting “Free Edition” from the “Product” field on the TopBraid Composer Installation page). Then, after I examined the triples that Composer generated when I followed the tutorial’s steps, I created my own new example called employees.ttl to run with shaclvalidate.sh. (To make my example as stripped-down as possible, I used a text editor for this, not Composer.) You can download my file right here; below I describe the file a few lines at a time to show what I was doing and how the pieces fit together.
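For reference, running the validator is a one-liner. Assuming the distribution's bin directory is on your PATH, the invocation looks something like this (the -datafile parameter name is what I remember from the tool's README, so check the documentation that came with your copy):

```shell
# Validate the instance data against the shapes in the same file;
# the validation report comes back on standard output as Turtle.
shaclvalidate.sh -datafile employees.ttl
```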

I started off with declarations for prefixes, a class, and a few properties for that class:

@prefix hr: <http://learningsparql.com/ns/humanResources#> .
@prefix d:  <http://learningsparql.com/ns/data#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .


#### Regular RDFS modeling ####


hr:Employee a rdfs:Class .


hr:name
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee .


hr:hireDate
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:date .


hr:jobGrade
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:integer .

There is nothing new or interesting there, but it’s worth reviewing why these declarations are useful: so that applications using instances of this class know more about it and can do more with it. For example, when generating a form to let users edit Employee instances, an application noting that hr:hireDate has an rdfs:range of xsd:date might provide a date-picking widget on the form instead of just a text field to fill out. (And, if the application someday sees an additional property declared for this class, it can automatically generate a field for the new property on the edit form, so that this model really is driving application behavior.) These rdfs:range values are not there so that an automated process can check whether instance data conforms to these types, although some applications may do that anyway. This is the hole that SHACL fills, as we will see below.

#### Additional SHACL modeling ####


hr:Employee
# Following two lines are an alternative to the line above
#hr:EmployeeShape
#  sh:targetClass hr:Employee ;
   a sh:NodeShape ;
   sh:property hr:nameShape ;
   sh:property hr:jobGradeShape .


hr:nameShape
   sh:path hr:name ;
   sh:datatype xsd:string ;
   sh:minCount 1 ;
   sh:maxCount 1 .


hr:jobGradeShape
   sh:path hr:jobGrade ;
   sh:datatype xsd:integer ;
   sh:minCount 1 ;
   sh:maxCount 1 ;
   sh:minInclusive 1 ;
   sh:maxInclusive 7 .

The SHACL vocabulary is associated here with the prefix sh:. Some of the best documentation of this vocabulary is right where it should be–in rdfs:comment values of the class and property declarations in https://www.w3.org/ns/shacl.ttl. (As we’ll see, the spec itself is also a good place to find out what’s what.)

Above, we see that hr:Employee, which had already been declared to be an rdfs:Class, is also declared to be an sh:NodeShape. To quote a few of the shacl.ttl vocabulary file’s rdfs:comment values, “a shape is a collection of constraints that may be targeted for certain nodes,” “a node shape is a shape that specifies constraint [sic] that need to be met with respect to focus nodes,” and (quoting the spec this time) “an RDF term that is validated against a shape using the triples from a data graph is called a focus node.” So, declaring hr:Employee to also be a sh:NodeShape lets it serve as a collection of constraints for certain nodes.

Note the commented-out alternative lines after that first one. Instead of making the existing hr:Employee class also serve as a collection of constraints for instances of that class, we could declare a separate new resource as an instance of sh:NodeShape (in the commented-out example, a new node called hr:EmployeeShape) and go on to define the constraints there. How would the validator know that hr:EmployeeShape was storing constraints for the hr:Employee class? Because, as the last commented-out line shows, its sh:targetClass property would point to the hr:Employee class. (Thanks to my former TopQuadrant colleague Holger for helping me to understand how that works.)
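Spelled out in full, that alternative would look like this. It declares the same constraints; they just hang off a separate shape resource instead of the class itself:

```turtle
hr:EmployeeShape
   a sh:NodeShape ;
   sh:targetClass hr:Employee ;  # tells the validator which instances to check
   sh:property hr:nameShape ;
   sh:property hr:jobGradeShape .
```

Keeping shapes separate from the class declaration like this can be handy when the class's model is maintained somewhere you can't (or shouldn't) edit.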

After naming the place to store the constraints, we create some using the SHACL vocabulary’s sh:property property. The rdfs:comment for this property in shacl.ttl tells us that it “Links a shape to its property shapes.” In the SHACL files created by TopBraid Composer, it links to property shapes grouped together with blank nodes, but as you can see above, I pointed them at shapes for the Employee name and jobGrade properties that have their own URIs.

The hr:nameShape and hr:jobGradeShape property shapes above are pretty self-explanatory. To show that one value for each must be included with each instance of hr:Employee, I gave each an sh:minCount and an sh:maxCount value of 1. The property shapes also have data types specified, and unlike the rdfs:range specifications for these properties above, these will be used for validation. For hr:jobGradeShape, I also added sh:minInclusive and sh:maxInclusive values to restrict any data values to be from 1 to 7.

The last part of employees.ttl has four instances of hr:Employee. The first meets all the defined constraints:

d:e1
   a hr:Employee;
   hr:name "Barry Wom" ;
   hr:hireDate "2017-06-03" ;
   hr:jobGrade 6 .

When I comment out the other three instances and run shaclvalidate on the file, it gives me back a validation report, in the form of triples, about how everything is cool:

@prefix sh:    <http://www.w3.org/ns/shacl#> .


[ a            sh:ValidationReport ;
  sh:conforms  true
] .

The next instance lacks the required hr:jobGrade value:

d:e2
   a hr:Employee;
   hr:name "Ron Nasty" ;
   hr:hireDate "2017-08-11" . 

After I uncommented this instance in employees.ttl, shaclvalidate told me this about it:

@prefix d:     <http://learningsparql.com/ns/data#> .
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix hr:    <http://learningsparql.com/ns/humanResources#> .


[ a            sh:ValidationReport ;
  sh:conforms  false ;
  sh:result    [ a                             sh:ValidationResult ;
                 sh:focusNode                  d:e2 ;
                 sh:resultMessage              "Less than 1 values" ;
                 sh:resultPath                 hr:jobGrade ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:MinCountConstraintComponent ;
                 sh:sourceShape                hr:jobGradeShape
               ]
] .

As I mentioned last month, returning these validation reports as triples makes it easier to plug the process into a larger automated workflow, and here we see that when constraints are violated, the triples include information to incorporate into that larger workflow–for example, to build a message to display in a pop-up message box. You could also query accumulated validation reports with SPARQL to identify patterns of what kinds of violations happened how often.
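As a sketch of that last idea, a SPARQL query along these lines, run against a graph of accumulated validation reports, would tally violations by constraint component:

```sparql
PREFIX sh: <http://www.w3.org/ns/shacl#>

SELECT ?component (COUNT(?result) AS ?violationCount)
WHERE {
  ?report a sh:ValidationReport ;
          sh:result ?result .
  ?result sh:sourceConstraintComponent ?component .
}
GROUP BY ?component
ORDER BY DESC(?violationCount)
```

Grouping on sh:resultPath or sh:focusNode instead would show which properties or which resources cause the most trouble.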

The third employee tests the SHACL validator’s ability to detect data type violations, because the hr:jobGrade value is not an integer:

d:e3
   a hr:Employee;
   hr:name "Stig O'Hara" ;
   hr:hireDate "2017-03-14" ;
   hr:jobGrade 3.14 .

shaclvalidate does just fine with that:

@prefix d:     <http://learningsparql.com/ns/data#> .
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix hr:    <http://learningsparql.com/ns/humanResources#> .


[ a            sh:ValidationReport ;
  sh:conforms  false ;
  sh:result    [ a                             sh:ValidationResult ;
                 sh:focusNode                  d:e3 ;
                 sh:resultMessage              "Value does not have datatype xsd:integer" ;
                 sh:resultPath                 hr:jobGrade ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:DatatypeConstraintComponent ;
                 sh:sourceShape                hr:jobGradeShape ;
                 sh:value                      3.14
               ]
] .

The last employee instance tests the SHACL validator’s ability to detect a value that falls outside of a specified range, because hr:jobGrade is greater than 7:

d:e4
   a hr:Employee;
   hr:name "Dirk McQuickly" ;
   hr:hireDate "2017-01-08" ;
   hr:jobGrade 8 .

This isn’t a problem either:

@prefix d:     <http://learningsparql.com/ns/data#> .
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix hr:    <http://learningsparql.com/ns/humanResources#> .


[ a            sh:ValidationReport ;
  sh:conforms  false ;
  sh:result    [ a                             sh:ValidationResult ;
                 sh:focusNode                  d:e4 ;
                 sh:resultMessage              "Value is not <= 7" ;
                 sh:resultPath                 hr:jobGrade ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:MaxInclusiveConstraintComponent ;
                 sh:sourceShape                hr:jobGradeShape ;
                 sh:value                      8
               ]
] .

I deliberately picked simple examples to see how difficult they would be to implement, and as with many powerful software systems, my only problem was navigating the detailed documentation of the architecture and many features to find the parts that I wanted.

What other built-in constraints are available besides sh:datatype, sh:minCount, sh:maxCount, sh:minInclusive, and sh:maxInclusive? See for yourself in section 4 of the spec: Core Constraint Components. (For a nice quick skim of the available constraints, just look through that section’s entries in the spec’s table of contents.)

If you’ve done much work with RDF, you’re going to enjoy this.

1912 farm and garden supply catalog image courtesy of flickr