What else can I do with RDFS?

Schemas can be a little fancier and even more useful with no need for OWL.

In my last blog entry, What is RDFS?, I described how the RDF Schema language lets you define RDF vocabularies, with the definitions themselves being RDF triples. We saw how simple class and property name definitions in a schema can, as machine-readable documentation for a dataset’s structure, provide greater interoperability for data and applications built around the same domain. Today we’ll look at how RDF schemas can store additional kinds of valuable information to add to what we saw in the sample schemas last time, and then we’ll look at some of the cool things that RDF schemas let you do.

More data modeling

When we use RDFS to define class and property names we can also define relationships between them. The following expands on the schema from last time to define relations between classes, between properties, and between classes and properties:

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix vcard:   <http://www.w3.org/2006/vcard/ns#> .
@prefix emp:     <http://www.snee.com/schema/employees/> .
@prefix ex:      <http://www.snee.com/example/> .
@prefix dcterms: <http://purl.org/dc/terms/> . 

emp:Person rdf:type rdfs:Class ;
          rdfs:label "person" . 

emp:Employee a rdfs:Class ; 
            rdfs:subClassOf emp:Person ;
            rdfs:label "employee" . 

vcard:given-name  rdf:type rdf:Property ;
                  rdfs:domain emp:Person ;
                  rdfs:label "given name".

vcard:family-name rdf:type rdf:Property ;
                  rdfs:domain emp:Person ;
                  rdfs:label "family name" ;
                  rdfs:label "apellido"@es . 

emp:hireDate a rdf:Property ;
            rdfs:domain  emp:Employee ;
            rdfs:label   "hire date" ;
            rdfs:comment "The first day an employee was on the payroll."  ;
            rdfs:subPropertyOf dcterms:date . 

emp:reportsTo a rdf:Property ; 
             rdfs:domain emp:Employee ;
             rdfs:range  emp:Employee ;
             rdfs:label  "reports to" .

The first thing that this schema has that the earlier one didn’t is a triple saying that emp:Employee is a subclass of emp:Person. If an inferencing parser saw that employees ex:id2 (Heidi Smith) and ex:id3 (Jane Berger) are instances of the emp:Employee class, it would know that they are also instances of the emp:Person class.
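To make that subclass inference concrete, here is a minimal sketch in plain Python. It is not how any particular RDFS engine is implemented; it simply assumes triples are stored as (subject, predicate, object) tuples of prefixed names, and the function name infer_types is made up for this illustration.

```python
# Triples as plain (subject, predicate, object) tuples of prefixed names.
# This representation is an assumption made for the sketch, not an RDF library.
triples = {
    ("emp:Employee", "rdfs:subClassOf", "emp:Person"),
    ("ex:id2", "rdf:type", "emp:Employee"),
    ("ex:id3", "rdf:type", "emp:Employee"),
}


def infer_types(triples):
    """Apply the RDFS subclass rule until no new rdf:type triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for sub, sup in subclass_pairs:
                new = (s, "rdf:type", sup)
                # An instance of the subclass is also an instance of the superclass.
                if o == sub and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


result = infer_types(triples)
new_triples = sorted(result - triples)
print(new_triples)
# [('ex:id2', 'rdf:type', 'emp:Person'), ('ex:id3', 'rdf:type', 'emp:Person')]
```

Because the loop repeats until nothing changes, the same function would also follow a longer chain of subclasses.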

Now that we know how to declare classes and indicate which is a subclass of another, we can build class hierarchies. These will be familiar to people who have used most modern programming languages. However, few if any of these programming languages let you also build property hierarchies. The schema above declares the emp:hireDate property to be a subproperty of the popular Dublin Core vocabulary’s dcterms:date property.

What does this buy you? For one thing, a tool that generates a user interface for this human resources data might not recognize the emp:hireDate property, but if it does the inferencing to find out that this property is a specialized version of the standard dcterms:date one, it might know that a date widget would be more appropriate to represent this field on an editing form than a plain text box.
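A form generator could make that decision with a lookup like the following sketch. The widget names and the choose_widget function are hypothetical, and the sketch assumes each property has at most one superproperty, which RDFS itself does not require.

```python
# Superproperty links taken from the schema above (one parent per property
# is a simplifying assumption of this sketch).
SUPERPROPERTY = {"emp:hireDate": "dcterms:date"}

# Hypothetical mapping from well-known properties to UI widgets.
WIDGET_FOR_PROPERTY = {"dcterms:date": "date picker"}


def choose_widget(prop):
    """Walk up the property hierarchy until a property with a known widget turns up."""
    seen = set()
    while prop is not None and prop not in seen:
        if prop in WIDGET_FOR_PROPERTY:
            return WIDGET_FOR_PROPERTY[prop]
        seen.add(prop)  # guard against accidental cycles in the hierarchy
        prop = SUPERPROPERTY.get(prop)
    return "text box"  # fallback when nothing in the hierarchy is recognized


print(choose_widget("emp:hireDate"))   # date picker
print(choose_widget("emp:reportsTo"))  # text box
```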

The Turtle version of the RDFS schema for the Dublin Core DCMI Metadata Terms vocabulary includes nine triples with the predicate and object rdfs:subPropertyOf <http://purl.org/dc/elements/1.1/date>. These show us that properties such as dcterms:available, dcterms:created, and dcterms:dateAccepted are dates. You might guess that from a property named “dateAccepted”, but you wouldn’t know this about a “created” property without this machine-readable way to describe the semantics of that property. (I rarely use the term “semantics”, but when I do use it, I mean it.)

The next new thing to note in this schema, now that we’ve seen how to define relationships between classes, and between properties, is how this schema defines a relationship between a property and a class. The first rdfs:domain triple above associates the vcard:given-name property with the emp:Person class. (Remember that if emp:Employee is a subclass of emp:Person, then this property is now associated with emp:Employee as well.) Is there anything wrong with associating a property defined in a standard vocabulary with my own thing that I’m defining in my own vocabulary? Absolutely not; it’s actually a good thing, because it provides a standards-based context for the thing I’m defining for my own application.

As the W3C RDFS Recommendation tells us, “rdfs:domain is an instance of rdf:Property that is used to state that any resource that has a given property is an instance of one or more classes”. Given this, my schema is saying that if an RDF resource has a vcard:given-name property, then we can infer that that resource is an instance of emp:Person. (If this leads to an inference that the office dog is a person, I should re-evaluate my class hierarchy and which properties are associated with which classes.)
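The domain rule can be sketched in a few lines of plain Python. The tuple representation, the infer_from_domain name, and the ex:rex resource (standing in for the office dog) are all assumptions made up for this illustration.

```python
schema = {("vcard:given-name", "rdfs:domain", "emp:Person")}

data = {
    ("ex:id2", "vcard:given-name", "Heidi"),
    ("ex:rex", "vcard:given-name", "Rex"),  # hypothetical office dog
}


def infer_from_domain(schema, data):
    """Any subject that uses a property is an instance of that property's domain."""
    domains = {}
    for prop, pred, cls in schema:
        if pred == "rdfs:domain":
            domains.setdefault(prop, set()).add(cls)
    return {
        (s, "rdf:type", cls)
        for s, p, o in data
        for cls in domains.get(p, ())
    }


inferred = infer_from_domain(schema, data)
print(sorted(inferred))
# [('ex:id2', 'rdf:type', 'emp:Person'), ('ex:rex', 'rdf:type', 'emp:Person')]
```

Note that the rule fires for ex:rex just as readily as for ex:id2; the inference is only as sensible as the schema behind it.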

Sometimes we forget that RDFS and OWL were invented to enable this kind of inferencing across data found on the web. They were not invented to help us define data structures, but as I’ve shown, RDFS is handy for at least documenting them. Continuing with my user-interface-generation example, a system generating an edit form for an Employee instance would know from this schema’s rdfs:domain triples that this editing form should include vcard:given-name, vcard:family-name, emp:hireDate, and emp:reportsTo fields. (And, as I mentioned last time, it should know that the form would be easier to read if these fields were labeled with the properties’ rdfs:label values and not the actual property names.)
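Here is one way such a form generator might pick its fields, sketched in plain Python. The dictionaries restate the schema above by hand (English labels only), and form_fields and class_and_ancestors are hypothetical helper names, not part of any library.

```python
# Hand-copied from the schema above; a real tool would read these from the triples.
SUBCLASS_OF = {"emp:Employee": "emp:Person"}
DOMAIN = {
    "vcard:given-name": "emp:Person",
    "vcard:family-name": "emp:Person",
    "emp:hireDate": "emp:Employee",
    "emp:reportsTo": "emp:Employee",
}
LABEL = {
    "vcard:given-name": "given name",
    "vcard:family-name": "family name",
    "emp:hireDate": "hire date",
    "emp:reportsTo": "reports to",
}


def class_and_ancestors(cls):
    """Return the class followed by its superclasses, nearest first."""
    chain = []
    while cls is not None and cls not in chain:
        chain.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return chain


def form_fields(cls):
    """List (property, label) pairs whose domain is the class or one of its superclasses."""
    ancestors = class_and_ancestors(cls)
    return [
        (prop, LABEL.get(prop, prop))  # fall back to the property name if unlabeled
        for prop, domain in DOMAIN.items()
        if domain in ancestors
    ]


for prop, label in form_fields("emp:Employee"):
    print(f"{label}: <{prop}>")
```

Asking for emp:Person fields instead returns only the two vcard properties, because emp:hireDate and emp:reportsTo are tied to the subclass.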

Software developers who recognize the ability to define class hierarchies may be a bit confused by the relationship between classes and properties in RDFS. In standard object-oriented modeling, when you define a class, you define the properties used by that class, and some may be inherited from superclasses. In RDFS, you define classes and properties separately and then associate them, if you like, with the rdfs:domain property. (The fact that properties can have their own hierarchies is something else that can take object-oriented developers some time to get accustomed to.)

The rdfs:range value defined for the emp:reportsTo property is another way to define a relationship between a class and a property. According to the RDFS Recommendation, it “is used to state that the values of a property are instances of one or more classes”. We saw that if emp:reportsTo has an rdfs:domain of emp:Employee, then “X reports to Y” means that X is an emp:Employee; if emp:reportsTo has an rdfs:range of emp:Employee, we can infer from the same statement that Y is also an emp:Employee. In other words, an employee reports to another employee. Even if we don’t plan on doing this kind of inferencing with rdfs:range, it’s still useful to indicate what kind of values to expect for a given property. For example, the application generating a form to edit employee data could generate a drop-down list of employee names on the “reports to” part of the form instead of a plain text box.
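The following sketch shows both inferences coming out of a single “X reports to Y” triple, and then uses the result to build the choices for a drop-down list. The statement that ex:id3 reports to ex:id2 is invented for the example, as are the function and variable names.

```python
schema = {
    ("emp:reportsTo", "rdfs:domain", "emp:Employee"),
    ("emp:reportsTo", "rdfs:range", "emp:Employee"),
}

# Hypothetical data: the post does not say who reports to whom.
data = {("ex:id3", "emp:reportsTo", "ex:id2")}


def infer_types(schema, data):
    """Type subjects from rdfs:domain and objects from rdfs:range."""
    inferred = set()
    for prop, rule, cls in schema:
        for s, p, o in data:
            if p != prop:
                continue
            if rule == "rdfs:domain":
                inferred.add((s, "rdf:type", cls))
            elif rule == "rdfs:range":
                inferred.add((o, "rdf:type", cls))
    return inferred


types = infer_types(schema, data)

# Candidates for the "reports to" drop-down: everything known to be an employee.
dropdown = sorted(s for s, p, o in types if o == "emp:Employee")
print(dropdown)
# ['ex:id2', 'ex:id3']
```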

More support of interesting applications

I’ve written other blog entries about how I applied the ideas described above to various useful projects.

Drive a (mobile!) user interface

In Using SPARQL queries from native Android apps I describe how I used the MIT App Inventor toolkit to create a native Android app that lets the user pick a clothing product and a color and a size for that product before sending the selected information off to a web server. The choices of products, colors, and sizes are all stored in an RDFS model; screenshots from my phone show how the list of color choices expanded after I added a new one to the RDF schema that stored the model. This blog entry also describes how additions to the RDFS model (with no changes to the Android app) would enable support in the app for other spoken languages besides English.

Data integration

Driving Hadoop data integration with standards-based models instead of code describes a data integration demo that combines data from Microsoft’s SQL Server Northwind sample database with data from Oracle’s sample HR database. Both databases include human resources data but use different names (for example, LastName and last_name) for similar properties. Using Python and a SPARQL query, the demo collects data from the two sources and represents them using a common vocabulary. The system uses an RDFS model both to define this vocabulary and (this part is crucial) to define the mapping from the two data sources to this vocabulary using the rdfs:subPropertyOf property mentioned above. After running the demo and expanding the RDFS model to cover more of the input, running the demo again integrates more of the source data with only that expansion of the model. No changes to the Python scripts were necessary.
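The mapping idea can be illustrated with a small sketch. This is not the code from that demo, which used SPARQL; the nw:, hr:, and common: prefixes, the resource names, and the sample values are all assumptions made up here to show how rdfs:subPropertyOf triples alone can drive the integration.

```python
# The "model": each source-specific property is declared a subproperty
# of a property in the common vocabulary (all names hypothetical).
model = {
    ("nw:LastName", "rdfs:subPropertyOf", "common:lastName"),
    ("hr:last_name", "rdfs:subPropertyOf", "common:lastName"),
}

# Illustrative rows from the two sources, already expressed as triples.
data = {
    ("nw:emp1", "nw:LastName", "Davolio"),
    ("hr:emp100", "hr:last_name", "King"),
}


def apply_subproperty_rule(model, data):
    """If p is a subproperty of q, then (s, p, o) also implies (s, q, o).

    Single pass only: chains of subproperties are not followed in this sketch.
    """
    parents = {}
    for sub, pred, sup in model:
        if pred == "rdfs:subPropertyOf":
            parents.setdefault(sub, set()).add(sup)
    return data | {
        (s, parent, o)
        for s, p, o in data
        for parent in parents.get(p, ())
    }


result = apply_subproperty_rule(model, data)
last_names = sorted(o for s, p, o in result if p == "common:lastName")
print(last_names)
# ['Davolio', 'King']
```

Covering another source column means adding one more triple to model; the function itself stays untouched, which is the point the demo makes about leaving the scripts alone.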

All the ideas I’ve described about this project so far are pretty simple. The novelty of the article was that I set it all up to happen on a Hadoop cluster distributed across multiple systems, because that was especially hot at the time.

Because this article was written to accompany something I did for IBM Data Magazine, it doesn’t assume familiarity with RDF as much as other entries on my blog, so if you’re new to RDF that might be helpful.

Transform data with partial schemas

My more recent blog entry Transforming data with inferencing and (partial!) schemas describes how, if you have a big mess of more data than you need, an RDF schema for the subset of that data that you actually want can be very useful. This is especially true when you use inferencing to transform the data. I’ll quote the whole first paragraph of that blog posting here:

I originally planned to title this “Partial schemas!” but as I assembled the example I realized that in addition to demonstrating the value of partial, incrementally-built schemas, the steps shown below also show how inferencing with schemas can implement transformations that are very useful in data integration. In the right situations this can be even better than SPARQL, because instead of using code—whether procedural or declarative—the transformation is driven by the data model itself.

Here’s another paragraph from after the piece walks through the demo:

This idea of letting the data and its schema evolve in a more flexible manner is especially great for data integration projects. My example here started off with a (somewhat) big mess of RDF; if you’re working with more than one RDF dataset—maybe with some converted from other formats such as JSON or relational databases—then the use of RDFS to identify little subsets of those datasets and to specify relationships between components of those subsets can help your knowledge graph and the applications that use it become useful a lot sooner.

Again, you’ll see many of the techniques outlined in today’s blog post put to good use in that project.

A bit more useful background

When using certain standards, it’s easy to assume that the specification itself is long and full of technical jargon. The W3C RDF Schema Recommendation is neither very long nor hard to read, as I wrote in RDFS: The primary document, so I recommend it. The RDF Schema Wikipedia page also nicely summarizes what RDFS offers and what kinds of things you can do with it.

I have been referring to inferencing quite casually, although my “Data integration” and “Transform data with partial schemas” examples do go into more detail about actually performing it. You may also find Living in a materialized world useful; it covers the potential role and mechanics of RDFS inferencing.

And, Hidden gems included with Jena’s command line utilities describes how an open source multi-platform Apache Jena tool can perform RDFS inferencing for you.

Let me know how you end up using RDFS! There is a lot of potential there that has been unused for too long.


CC BY 2.0 photo by Howard Duncan