
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">

  <channel>
    <title>bobdc blog</title>
    <link>https://www.bobdc.com/blog/</link>
    <description>
      Recent content from bobdc blog | 
      Bob DuCharme&#39;s weblog.
    </description>
    <generator>Hugo | gohugo.io | Theme twenty-sixteen</generator><language>en-us</language><lastBuildDate>Sun, 08 Mar 2026 09:53:24 -0400</lastBuildDate>
    
        <atom:link href="https://www.bobdc.com/blog/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>The best way to talk about AI: don&#39;t say &#39;AI&#39; so much; say what you really mean</title>
      <link>https://www.bobdc.com/blog/stopsayingai/</link>
      <pubDate>Sun, 25 Jan 2026 09:45:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/stopsayingai/</guid>
      
      
      <description><div>Be more specific to help reduce the hype.</div><div>&lt;p&gt;What do 1960s LISP programs for natural language understanding, 1980s Prolog programs for expert systems, and today&amp;rsquo;s use of large language models have in common? Nothing, really, except they&amp;rsquo;ve all been referred to as Artificial Intelligence.&lt;/p&gt;
&lt;p&gt;AI is not a technology. It&amp;rsquo;s a marketing term that tech industry people have used to discuss many different technologies over the last 60 years. What&amp;rsquo;s different now is the scale at which this marketing term is being used to sell us services. People like Sam Altman talk about generalized artificial intelligence because it gets them lots of media coverage that helps them market the services that are their real goal: to have us all sign up to give them money every month so that their artificial intelligence can write our emails for us and read the emails that other people send us so that we don&amp;rsquo;t have to. (And, of course, to collect lots of ad revenue along the way; look what it did for Google!)&lt;/p&gt;
&lt;p&gt;We could reduce a lot of the silliness out there if these discussions — especially the many overly optimistic and overly pessimistic ones — named the specific technology under discussion instead of calling it &amp;ldquo;AI&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;University of Washington linguistics professor Emily Bender &lt;a href=&#34;https://medium.com/@emilymenonbender/opening-remarks-on-ai-in-the-workplace-new-crisis-or-longstanding-challenge-eb81d1bee9f&#34;&gt;describes the term&amp;rsquo;s marketing role well&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[AI] is a marketing term.  It&amp;rsquo;s a way to make certain kinds of automation sound sophisticated, powerful, or magical and as such it’s a way to dodge accountability by making the machines sound like autonomous thinking entities rather than tools that are created and used by people and companies. It’s also the name of a subfield of computer science concerned with making machines that “think like humans” but even there it was started as a marketing term in the 1950s to attract research funding to that field.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Interestingly, when I try to find out more about AI as a marketing term, I mostly just find pages — especially advertising — about &lt;a href=&#34;https://en.wikipedia.org/wiki/Artificial_intelligence_marketing&#34;&gt;using AI to help you do marketing&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Bender goes on to say that &amp;ldquo;discussions of this technology become much clearer when we replace the term AI with the word &amp;lsquo;automation&amp;rsquo;&amp;rdquo;. She makes some nice points to support that, but I find it too simplistic to globally replace &amp;ldquo;AI&amp;rdquo; with a single term. We&amp;rsquo;d all communicate our ideas better by naming the specific technology being marketed as AI, which, as I said above, has been a set of different technologies since the term was coined. (I do highly recommend her &lt;a href=&#34;https://www.dair-institute.org/maiht3k/&#34;&gt;Mystery AI Hype Theater 3000&lt;/a&gt; podcast with sociologist Dr. Alex Hanna. The title is a tribute to &lt;a href=&#34;https://en.wikipedia.org/wiki/Mystery_Science_Theater_3000&#34;&gt;Mystery Science Theater 3000&lt;/a&gt;, a late twentieth-century television show in which two hosts added their own hilarious commentary to bad old science fiction movies; Bender and Hanna analyze bad academic papers on AI technology and add their own sarcastic commentary.)&lt;/p&gt;
&lt;p&gt;I see the current meaning of AI as being &amp;ldquo;the use of generative text chatbot interfaces to work with popular large language models&amp;rdquo;. Only about five years ago, AI usually meant the use of &lt;a href=&#34;../semantic-web-semantics-vs-vect/&#34;&gt;vector embedding models&lt;/a&gt; with neural networks (with so many layers that they were &amp;ldquo;deep,&amp;rdquo; so it wasn&amp;rsquo;t just learning; it was &amp;ldquo;deep learning&amp;rdquo;!) to identify patterns that could help people make predictions: was that chest x-ray anomalous? Was that series of financial transactions unusual enough to maybe indicate fraud? Instead of &amp;ldquo;AI&amp;rdquo; I would call that &amp;ldquo;machine learning&amp;rdquo;; the biggest, most successful versions of it led to the Large Language Models that dominate our AI discussions today as they predict which words might make the most sense after a given set of input words.&lt;/p&gt;
&lt;p&gt;To complement a historical approach to the different things that AI has meant over the decades, at the recent &lt;a href=&#34;https://summit.graphwise.ai/graphwise-ai-summit-2025-gm5&#34;&gt;Graphwise AI Summit&lt;/a&gt; I learned from &lt;a href=&#34;https://graphrag.info/2025/07/25/graph-rag-curation-lowering-the-ai-noise-floor-draft/&#34;&gt;Alan Morrison&lt;/a&gt; about the excellent &lt;a href=&#34;https://www.jeffwinterinsights.com/insights/chatgpt-venn-diagram&#34;&gt;Where Does ChatGPT Fit in the Field of AI?&lt;/a&gt; Venn diagram by consultant &lt;a href=&#34;https://www.jeffwinterinsights.com/&#34;&gt;Jeff Winter&lt;/a&gt;:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/jeffWinterAIVennDiagram.png&#34; class=&#34;centered&#34; width=&#34;720&#34; alt=&#34;Jeff Winter AI Venn diagram&#34;/&gt;
&lt;p&gt;Compared with my linear, historical approach, Winter&amp;rsquo;s diagram is more top-down. By describing the various technologies that have been associated with Artificial Intelligence and showing their relationships, the diagram can also help us focus our discussions on what we can get out of which of these technologies rather than just referring to everything in it as &amp;ldquo;AI&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Many references to the term remind me of how non-technical people refer to &amp;ldquo;the&amp;rdquo; cloud as if there were just one. There are multiple clouds out there, and a cloud-based application uses one or more of them: AWS, or maybe Azure or Google Cloud or one of the others. There are many AI-related systems out there, and referring to all of them with one term poses a danger worse than the vagueness of describing &amp;ldquo;the&amp;rdquo; cloud: it lets people think that there is one thing that combines the capabilities of all the different &amp;ldquo;AI&amp;rdquo; technologies. That&amp;rsquo;s dangerous because it feeds the &lt;a href=&#34;https://www.nytimes.com/2025/09/03/opinion/ai-gpt5-rethinking.html&#34;&gt;panic about superhuman &amp;ldquo;generalized&amp;rdquo; AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In life in general, using a larger vocabulary helps us express ourselves better. With all the panic and blather about AI these days, using the broader vocabulary available for discussing these technologies can help everyone better appreciate their potential good and bad attributes and then plan for appropriate usage. Saying &amp;ldquo;LLM&amp;rdquo; is more specific; you can be even more specific by saying which LLM: ChatGPT, Claude, Gemini, or whatever. Maybe even better: say the name of the relevant tool. Don&amp;rsquo;t say &amp;ldquo;I&amp;rsquo;ll send you the AI summary of the meeting you missed&amp;rdquo; when you can say &amp;ldquo;I&amp;rsquo;ll send you the Copilot summary of the meeting you missed&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Recently, as I chatted with someone behind a counter while we waited for my credit card to be approved, he told me that his son was studying machine learning at a German university. I thanked him for saying &amp;ldquo;machine learning&amp;rdquo; instead of &amp;ldquo;AI&amp;rdquo;. Being more specific about what we mean makes for clearer communication, especially if we&amp;rsquo;re using an alternative to a term whose meaning keeps changing and means different things to different audiences.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/115956331408873197&#34;&gt;Mastodon&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3mdayr3cz4s2a&#34;&gt;Bluesky&lt;/a&gt; posts announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2026">2026</category>
      
      <category domain="https://www.bobdc.com//categories/ai">AI</category>
      
    </item>
    
    <item>
      <title>My GraphRAG Curator interview</title>
      <link>https://www.bobdc.com/blog/graphragcuratorinterview/</link>
      <pubDate>Tue, 18 Nov 2025 10:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/graphragcuratorinterview/</guid>
      
      
      <description><div>Discussing graphs, RAG, GraphRAG, music...</div><div>&lt;p&gt;&lt;a href=&#39;https://graphrag.info/&#39;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/GRClogo.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;260pt&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/alanmorrison/&#34;&gt;Alan Morrison&lt;/a&gt; of &lt;a href=&#34;https://graphrag.info/&#34;&gt;GraphRAG Curator&lt;/a&gt; recently interviewed me about my work both inside and outside of Graphwise, and it was a lot of fun. You can see it &lt;a href=&#34;https://www.youtube.com/watch?v=OyJPXUR8CXc&#34;&gt;on YouTube&lt;/a&gt; and read a &lt;a href=&#34;https://graphrag.info/2025/11/17/bob-ducharme-what-metadata-helps-businesses-most-with/&#34;&gt;transcript&lt;/a&gt; on the GraphRAG Curator website.&lt;/p&gt;
&lt;p&gt;Reading the transcript later, it felt like a first draft that I&amp;rsquo;m not allowed to edit. So, for example, when Alan asked &amp;ldquo;if there was some really common issue across companies that you knew you could change, what would you change to help us out with the governance problem that we have with data?&amp;rdquo; my answer was &amp;ldquo;It might be a cop-out, but for any given company, what are you trying to do?&amp;rdquo; In other words, I couldn&amp;rsquo;t think of general advice and would find it better to focus on each company&amp;rsquo;s individual goals.&lt;/p&gt;
&lt;p&gt;Later, though, I thought of something that could potentially help all companies, at least from a knowledge graph perspective: doing a complete inventory of all of your organization&amp;rsquo;s information assets might turn up some nice surprises for your enterprise knowledge graph. The first few things that you find will be the main systems that you use most often, but keep looking and you may find things you forgot about that relate to those first few datasets and can enrich your use of them. A little extra metadata for each dataset may be all you need to get them contributing to your knowledge hub.&lt;/p&gt;
&lt;p&gt;For example, let&amp;rsquo;s say a few years ago someone did a special project about a certain subset of their company&amp;rsquo;s inventory and the result was a set of spreadsheets. People might think &amp;ldquo;well, that wasn&amp;rsquo;t official company data but just some back-of-the-napkin experiments, and it would therefore not be in the domain of serious data governance&amp;rdquo;, but maybe it could still contribute in some interesting new ways. After converting the spreadsheets to RDF, which is easy enough, the insights gained from that project could become a long-term part of the more official data about those inventory items, even though these old spreadsheets would typically not be considered one of the company&amp;rsquo;s important assets. The flexibility of RDF makes it simple to add a few new properties to a few entities in your dataset with no need for a schema revision procedure.&lt;/p&gt;
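&lt;p&gt;As a hypothetical sketch (the &lt;code&gt;ex:&lt;/code&gt; prefix and property names here are made up for illustration), that enrichment could be nothing more than a few new triples added alongside the existing data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Turtle: two insights from the old spreadsheet project, added to an
# existing inventory item. No schema revision procedure is required;
# the new properties simply coexist with the item&#39;s other triples.
@prefix ex: &amp;lt;http://example.com/inventory/&amp;gt; .

ex:item42 ex:reorderLeadTimeDays 14 ;
          ex:seasonalDemandPeak &#34;Q4&#34; .
&lt;/code&gt;&lt;/pre&gt;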
&lt;p&gt;Here are links to some of the projects that Alan and I discussed, ranging from serious videos about Graphwise GraphDB capabilities to fun things I&amp;rsquo;ve done on my own, but nearly all with some RDF angle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Recent blog entry &lt;a href=&#34;../stopsemanticweb/&#34;&gt;Let&amp;rsquo;s stop saying &amp;ldquo;semantic web&amp;rdquo;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;2021 blog entry &lt;a href=&#34;../dontneedowl/&#34;&gt;You probably don&amp;rsquo;t need OWL&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The video &lt;a href=&#34;https://www.youtube.com/watch?v=dCndx2QJRIQ&#34;&gt;Build a shopping chatbot in four minutes with GraphDB Talk to Your Graph 2.0&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The video &lt;a href=&#34;https://www.youtube.com/watch?v=VDNoYhFdXIM&#34;&gt;Using GraphDB&amp;rsquo;s Lucene Connector&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;My 2017 blog entry &lt;a href=&#34;../sparql-queries-of-beatles-reco/&#34;&gt;SPARQL queries of Beatles recording sessions&lt;/a&gt;, which includes a link to the RDF data that you can download&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;My 2016 blog entry &lt;a href=&#34;../converting-between-midi-and-rd/&#34;&gt;Converting between MIDI and RDF: readable MIDI and more fun with RDF&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;2012 blog entry &lt;a href=&#34;../a-brief-opinionated-history-of/&#34;&gt;A brief, opinionated history of XML&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://iptc.org/standards/sport-schema/&#34;&gt;IPTC&amp;rsquo;s specialization of schema.org for sports data&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bandcamp&amp;rsquo;s &lt;a href=&#34;https://daily.bandcamp.com/best-contemporary-classical/&#34;&gt;Best contemporary classical&lt;/a&gt; roundups&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is the Dewey Decimal System available as a SKOS dataset? The W3C&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/2001/sw/wiki/SKOS/Datasets#DDC_Dewey_Decimal_Classification&#34;&gt;SKOS/Datasets&lt;/a&gt; page points to one at &lt;a href=&#34;http://dewey.info&#34;&gt;http://dewey.info&lt;/a&gt;, along with a &lt;a href=&#34;http://dewey.info/sparql.php&#34;&gt;SPARQL endpoint&lt;/a&gt;, but neither of those links seems to work anymore.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope the interview is worthwhile if you are interested in knowledge graphs and Graph RAG applications.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/115571615466394754&#34;&gt;Mastodon&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3m5w5lb5s3c2j&#34;&gt;Bluesky&lt;/a&gt; posts announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2025">2025</category>
      
      <category domain="https://www.bobdc.com//categories/ai">AI</category>
      
      <category domain="https://www.bobdc.com//categories/knowledge-graphs">knowledge-graphs</category>
      
    </item>
    
    <item>
      <title>Let&#39;s stop saying &#39;semantic web&#39;</title>
      <link>https://www.bobdc.com/blog/stopsemanticweb/</link>
      <pubDate>Sun, 28 Sep 2025 10:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/stopsemanticweb/</guid>
      
      
      <description><div>Like a startup pivot, the technology turned out to be great for things other than a new kind of &#39;web&#39;.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/semwebfordummies.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;240px&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt; 
&lt;p&gt;&amp;ldquo;Semantic web technology&amp;rdquo; refers to technology designed to create something that never got created. That&amp;rsquo;s OK. Lots of great things were and continue to be created.&lt;/p&gt;
&lt;h1 id=&#34;the-world-wide-web&#34;&gt;The (World-Wide) Web&lt;/h1&gt;
&lt;p&gt;Tim Berners-Lee created the original World Wide Web in 1990 by assembling five things that he had developed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;A simple protocol for different systems to request and deliver resources (usually, text files) over a network: HTTP&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A syntax to uniquely identify such resources: URLs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A markup language to represent simple yet structured documents in these delivered files, in which an important part of the structure was the ability to represent hypertext links to other such documents: HTML&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A program that responded to HTTP requests by sending the requested file or delivering an appropriate HTTP status code if unable to deliver it: a web server&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A program that could request a document from one of these  servers and, if it was an HTML document, render it on a screen with headings, bulleted and numbered lists, hypertext links, and other document components displayed appropriately for humans to read it: a web client&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Eventually, servers loaded with documents full of links, and clients to request and read them, accumulated into the World Wide Web as we know it.&lt;/p&gt;
&lt;h1 id=&#34;the-semantic-web&#34;&gt;The &amp;ldquo;Semantic&amp;rdquo; web&lt;/h1&gt;
&lt;p&gt;The next idea was to build on these things (mostly 1, 2, and 4) to create a web of machine-readable data that would complement the human-readable web of data. The new web would use a simple yet flexible standardized data model (RDF) with universal identifiers modeled on the URLs from component 2 above to make, in addition to documents, machine-readable data available to any program that could use HTTP across the same network used for the World-Wide Web. This might be a browser, but could also be a simple Perl or Python script.&lt;/p&gt;
&lt;p&gt;As a bonus, this new web would use recent knowledge representation advances to store bits of meaning about the terms in any given data model. For example, if the model says that the property &lt;code&gt;locatedIn&lt;/code&gt; is transitive and a certain piece of inventory is located in room A and room A is located in building B, an automated system could &lt;a href=&#34;https://www.bobdc.com/blog/trying-out-blazegraph/&#34;&gt;infer that this piece of inventory is in building B&lt;/a&gt;, and you could get more out of your data.&lt;/p&gt;
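&lt;p&gt;As a rough sketch of that example (the &lt;code&gt;ex:&lt;/code&gt; names below are made up for illustration), a SPARQL property path can follow a chain of &lt;code&gt;locatedIn&lt;/code&gt; triples even without an inferencing engine:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Sample data in Turtle:
#   ex:item42 ex:locatedIn ex:roomA .
#   ex:roomA  ex:locatedIn ex:buildingB .

PREFIX ex: &amp;lt;http://example.com/&amp;gt;

# The + property path follows one or more locatedIn steps, so ?place
# gets bound to both ex:roomA and ex:buildingB.
SELECT ?place
WHERE { ex:item42 ex:locatedIn+ ?place }
&lt;/code&gt;&lt;/pre&gt;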
&lt;p&gt;A model could also say that Employee is a subclass of Person. This provides some semantics of these terms but was a data modeling capability that most people already took for granted from other systems. The ability to say that creationDate was a subproperty of modifiedDate was new for most people with an object-oriented background, but I have not seen this latter capability get a lot of use. (I&amp;rsquo;m sure there are plenty of usage examples out there, but they don&amp;rsquo;t cross my path much.)&lt;/p&gt;
&lt;p&gt;This ability to store bits of meaning, or semantics, led people to start calling this potential network of data files the &amp;ldquo;semantic&amp;rdquo; web. Over time, not many people and systems took advantage of this aspect of the new web, so it turned out that this was not a great name for it. There were other issues as well.&lt;/p&gt;
&lt;h1 id=&#34;a-web-of-machine-readable-data&#34;&gt;A web of machine-readable data?&lt;/h1&gt;
&lt;p&gt;As with most Internet resources, there are two approaches to making RDF data available on the web: as static files or as dynamically generated data.&lt;/p&gt;
&lt;p&gt;There are plenty of static RDF files out there, but not with connections between them that would form any kind of web the way that HTML files link to each other. In the early days of RDF technology, people talked about &lt;a href=&#34;http://xmlns.com/foaf/spec/&#34;&gt;Friend of a Friend&lt;/a&gt; files as a step toward replacing commercial social networking sites with our own decentralized RDF-based version. In this version, our FOAF file&amp;rsquo;s triples would describe the personal data that we want to make available and also link to the FOAF files of our friends. &lt;a href=&#34;https://snee.com/bob/foaf.ttl&#34;&gt;My FOAF file&lt;/a&gt; is still on the hosting service that I use, and so is &lt;a href=&#34;http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf&#34;&gt;Tim Berners-Lee&amp;rsquo;s&lt;/a&gt;; as they show, we&amp;rsquo;re both friends of Norm Walsh, making Tim a friend of my friend. In my book &lt;a href=&#34;https://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;, example &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/ex166.rq&#34;&gt;ex166.rq&lt;/a&gt; uses Berners-Lee&amp;rsquo;s FOAF file to demonstrate how the &lt;code&gt;FROM&lt;/code&gt; keyword can retrieve data from remote resources.&lt;/p&gt;
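&lt;p&gt;A minimal FOAF file (with a made-up person and URL for illustration) shows how these links between files worked: a &lt;code&gt;foaf:knows&lt;/code&gt; triple describes the friend, and an &lt;code&gt;rdfs:seeAlso&lt;/code&gt; triple points to the friend&amp;rsquo;s own FOAF file:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .

# &amp;lt;#me&amp;gt; is the person that this FOAF file describes.
&amp;lt;#me&amp;gt; a foaf:Person ;
    foaf:name &#34;Alice Example&#34; ;
    foaf:knows [ a foaf:Person ;
                 foaf:name &#34;Bob Example&#34; ;
                 rdfs:seeAlso &amp;lt;https://example.org/bob/foaf.ttl&amp;gt; ] .
&lt;/code&gt;&lt;/pre&gt;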
&lt;p&gt;But, as I wrote &lt;a href=&#34;https://www.bobdc.com/blog/replace-facebook-with-foaf-twi/&#34;&gt;fifteen years ago&lt;/a&gt;, &amp;ldquo;actual FOAF files have been used for little more than demos&amp;rdquo;, and I don&amp;rsquo;t know of any other sets of RDF files available on the public web that form any kind of useful network. (I should mention that Berners-Lee&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Solid_(web_decentralization_project)&#34;&gt;Solid&lt;/a&gt; project is doing interesting work helping people and organizations to share data in ways that let them control their privacy, all with RDF underneath.)&lt;/p&gt;
&lt;p&gt;The best way to share RDF data dynamically is by making it available over a SPARQL endpoint. &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt; and &lt;a href=&#34;https://www.dbpedia.org/&#34;&gt;DBpedia&lt;/a&gt; are two amazing examples of SPARQL endpoints that have given so much to so many people in so many disciplines. For many people, their whole motivation for learning SPARQL was to gain programmatic access to the data that these two endpoints provide.&lt;/p&gt;
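&lt;p&gt;A tiny example of the kind of query that draws people to these endpoints (runnable at the Wikidata Query Service, which predefines the &lt;code&gt;wd:&lt;/code&gt; and &lt;code&gt;wdt:&lt;/code&gt; prefixes):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# List ten resources that are an instance of (P31) house cat (Q146).
SELECT ?cat
WHERE { ?cat wdt:P31 wd:Q146 . }
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;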
&lt;p&gt;But, despite the growth of the &lt;a href=&#34;https://lod-cloud.net/&#34;&gt;Linked Data Cloud&lt;/a&gt; over the years, outside of the life sciences world there don&amp;rsquo;t seem to be many endpoints available besides Wikidata and DBpedia anymore. Sure, there are some, but classic ones from brand-name organizations like &lt;a href=&#34;http://data.nytimes.com/&#34;&gt;http://data.nytimes.com/&lt;/a&gt; and &lt;a href=&#34;http://nasataxonomy.jpl.nasa.gov/fordevelopers/&#34;&gt;http://nasataxonomy.jpl.nasa.gov/fordevelopers/&lt;/a&gt; no longer work. After clicking many random nodes on the Linked Data Cloud diagram to find the relevant endpoints, I see the &amp;ldquo;!&amp;rdquo; failure triangle icon on nearly every SPARQL endpoint&amp;rsquo;s entry, like this:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/resource-unavailable.png&#34; class=&#34;centered&#34; width=&#34;500px&#34;/&gt; 
&lt;p&gt;Of course, organizations like the &lt;a href=&#34;https://www.weather.gov/documentation/services-web-api&#34;&gt;U.S. Weather Service&lt;/a&gt; that dynamically generate machine-readable data these days usually generate plain JSON, and rarely &lt;a href=&#34;https://json-ld.org/&#34;&gt;JSON-LD&lt;/a&gt;. There are no persistent URIs or references to other persistent URIs. Each one is basically an API to a silo, but as public APIs they still perform a valuable service. They&amp;rsquo;re just not part of any kind of &amp;ldquo;semantic web&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;There are still some nice SKOS and data model RDF datasets available, like those at the &lt;a href=&#34;https://id.loc.gov/&#34;&gt;Library of Congress&lt;/a&gt; and &lt;a href=&#34;https://agrovoc.fao.org/&#34;&gt;AGROVOC&lt;/a&gt;; these are good to push because they encourage data interoperability. And, &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt; has provided an excellent, RDF-based data model that (because of the ease with which RDF data models can be extended) has provided many data projects out there with a nice dose of interoperability.&lt;/p&gt;
&lt;h1 id=&#34;the-pivot&#34;&gt;The pivot&lt;/h1&gt;
&lt;p&gt;Does this mean that the technology was a failure? Not at all. I see it as a very successful &lt;a href=&#34;https://en.wikipedia.org/wiki/Lean_startup#Pivot&#34;&gt;pivot&lt;/a&gt;—a re-application of one or more of an organization&amp;rsquo;s technologies to a new domain to address new use cases, like &lt;a href=&#34;https://philmckinney.medium.com/i-completely-wrote-off-twitter-as-the-stupidest-pivot-ever-e2851f4b117d&#34;&gt;Twitter&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Slack_(software)#History&#34;&gt;Slack&lt;/a&gt; did in their early days. Instead of creating a web of interconnected public RDF datasets that are spread around the world, organizations have created their own internal webs that aren&amp;rsquo;t necessarily &amp;ldquo;semantic&amp;rdquo; but make it easier to share and evolve collections of datasets within those organizations. The data may be stored locally or using a cloud provider like &lt;a href=&#34;https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-rdf.html&#34;&gt;AWS&lt;/a&gt;. There are a wide variety of commercial and open source triplestores to store this data as well as tools to create SPARQL endpoint front ends to relational database managers so that legacy relational data can contribute to these internal webs.&lt;/p&gt;
&lt;p&gt;How about the term &amp;ldquo;Linked Data&amp;rdquo;? There isn&amp;rsquo;t much public RDF data to link to outside of the data behind the endpoints mentioned above. But the globally unique nature of URIs makes it easy for a resource&amp;rsquo;s triples within one dataset to reference a resource in any other accessible dataset, which may be public or may be stored on another server behind the same firewall as the referencing document. I think that &amp;ldquo;Linked Data&amp;rdquo; is a perfectly good term to describe the modern version of this, but it is starting to feel a bit old.&lt;/p&gt;
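&lt;p&gt;As a sketch (the &lt;code&gt;ex:&lt;/code&gt; names are made up; the DBpedia URI stands in for whatever public resource a local triple might point at), such a cross-dataset reference is just a triple whose object lives in someone else&amp;rsquo;s dataset:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix ex:  &amp;lt;http://example.com/ourdata/&amp;gt; .
@prefix dbr: &amp;lt;http://dbpedia.org/resource/&amp;gt; .

# A local resource referencing a resource in a remote public dataset.
ex:office1 ex:locatedInCity dbr:Charlottesville .
&lt;/code&gt;&lt;/pre&gt;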
&lt;p&gt;Twelve years ago I &lt;a href=&#34;http://www.bobdc.com/blog/coming-soon-new-expanded-editi/&#34;&gt;described&lt;/a&gt; how the second edition of my book &amp;ldquo;Learning SPARQL&amp;rdquo; had &amp;ldquo;55% more pages! 23% fewer mentions of the semantic web!&amp;rdquo; This was a bit of a joke, but even then I could see the pattern of RDF technologies (the term that I was &lt;a href=&#34;https://www.bobdc.com/blog/selling-rdf-technology-to-big/&#34;&gt;already using&lt;/a&gt; instead of  &amp;ldquo;semantic web&amp;rdquo;) doing quite well without being focused on a semantic version of the World Wide Web.&lt;/p&gt;
&lt;p&gt;After I had nearly finished drafting this blog entry, I watched a &lt;a href=&#34;https://www.youtube.com/watch?v=HkeudvN0u_c&#34;&gt;YouTube interview with Andreas Blumauer&lt;/a&gt;, the Senior Vice President of Growth at my employer &lt;a href=&#34;https://graphwise.ai/&#34;&gt;Graphwise&lt;/a&gt; (and the former CEO and co-founder of PoolParty, which merged with Ontotext to form Graphwise). At one point he says &amp;ldquo;It shouldn&amp;rsquo;t be called &amp;lsquo;semantic web&amp;rsquo; anymore but &amp;lsquo;semantic enterprise standards&amp;rsquo; because we see a lot of that adoption of RDF technologies, SKOS, SPARQL &amp;ndash; all that is now really all around in enterprises which want to implement an enterprise knowledge graph. So that&amp;rsquo;s pretty much the same technology under the hood and enterprise knowledge graphs have adopted the semantic web standards.&amp;rdquo; It was nice to see him confirm these ideas that I had been drafting notes about. The public knowledge graph (as we would now call it) that was part of the dream of the semantic web didn&amp;rsquo;t really happen, but a large, growing number of enterprises are seeing the benefits of having their own enterprise knowledge graphs. (Also, he and I have never discussed this, so it was nice to see him use my preferred term of &amp;ldquo;RDF technologies&amp;rdquo; there.)&lt;/p&gt;
&lt;h1 id=&#34;semantics-sneaking-in-there-after-all&#34;&gt;Semantics sneaking in there after all?&lt;/h1&gt;
&lt;p&gt;Combining the text analysis technology that both PoolParty and Ontotext brought to the merger with LLMs has given the &amp;ldquo;meaning&amp;rdquo; of words a growing role in the resulting applications as we use knowledge graphs to address tasks like hallucination minimization. This is a different kind of semantics from the &amp;ldquo;semantic web&amp;rdquo; kind, as I explained in &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-semantics-vs-vect/&#34;&gt;Semantic web semantics vs. vector embedding machine learning semantics&lt;/a&gt; nine years ago, and it&amp;rsquo;s the kind that provides the foundation for the large language model usage that is apparently what &amp;ldquo;AI&amp;rdquo; means these days. As with other aspects of the pivot, it&amp;rsquo;s not really what was planned, but it has worked out well, making all kinds of great applications possible without much usage of the OWL classes and properties that were originally supposed to provide the semantic web&amp;rsquo;s semantics.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/115282413624037756&#34;&gt;Mastodon&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3lzvppedyis26&#34;&gt;Bluesky&lt;/a&gt; posts announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2025">2025</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Correcting some outdated &#34;Learning SPARQL&#34; examples</title>
      <link>https://www.bobdc.com/blog/updating-2nd-ed-examples/</link>
      <pubDate>Sun, 27 Jul 2025 12:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/updating-2nd-ed-examples/</guid>
      
      
      <description><div>Revising some queries to accommodate revised data.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/2ndEdCoverBig.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;240px&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt; 
&lt;p&gt;O&amp;rsquo;Reilly books such as &lt;a href=&#34;https://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt; have an &lt;a href=&#34;https://www.oreilly.com/catalog/errata.csp?isbn=0636920030829&#34;&gt;errata&lt;/a&gt; page where anyone can submit corrections for the book, and I appreciate all entries. Some are just basic typos, which is embarrassing. Some are examples that no longer work because a certain SPARQL endpoint is no longer up or, in several cases, because &lt;a href=&#34;https://www.dbpedia.org/&#34;&gt;DBpedia&lt;/a&gt; entries got revised to describe resources using different properties than they did when the book was published.&lt;/p&gt;
&lt;p&gt;For example, page 50 of the book&amp;rsquo;s second edition showed this example, which no longer works at &lt;a href=&#34;https://dbpedia.org/snorql/&#34;&gt;DBpedia&amp;rsquo;s SNORQL interface&lt;/a&gt; to the site&amp;rsquo;s SPARQL endpoint.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# filename: ex048.rq
# See update in ex048a.rq.

PREFIX d: &amp;lt;http://dbpedia.org/ontology/&amp;gt;

SELECT ?artistName ?albumName 
WHERE
{
  ?album d:producer :Timbaland .
  ?album d:musicalArtist ?artist . 
  ?album rdfs:label ?albumName . 
  ?artist rdfs:label ?artistName . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;a href=&#34;http://dbpedia.org/ontology/musicalArtist&#34;&gt;http://dbpedia.org/ontology/musicalArtist&lt;/a&gt; property still exists, but DBpedia no longer uses it to describe Timbaland. Also, I shouldn&amp;rsquo;t have assumed that everything he produced was an album; some are singles, so I adjusted a variable name as well. The following now works in SNORQL:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# filename: ex048a.rq - 2025-07-23 update of ex048.rq
 
PREFIX d: &amp;lt;http://dbpedia.org/ontology/&amp;gt;

SELECT ?artistName ?musicalWork
WHERE
{
  ?album d:producer :Timbaland .
  ?album d:artist ?artist . 
  ?album rdfs:label ?musicalWork . 
  ?artist rdfs:label ?artistName . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The comments at the top of the two queries above describe the relationship of the original and revised versions. As the &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/readme.txt&#34;&gt;readme.txt&lt;/a&gt; file that I included with the examples says, &amp;ldquo;A file whose name has the form exNNNa.rq is a 2025 update of the file exNNN.rq from the book&amp;rsquo;s second edition examples.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I have done this for five examples so far and will be going through my notes and the submitted errata to create more corrected versions to make available on the learningsparql.com &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/index.html&#34;&gt;examples&lt;/a&gt; page. You should see more &lt;code&gt;exNNNa.rq&lt;/code&gt; files showing up on that page over the next few months.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/114926095348240889&#34;&gt;Mastodon&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3luxi7gejr22o&#34;&gt;Bluesky&lt;/a&gt; posts announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2025">2025</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>ChatGPT and Copilot as OWL processors</title>
      <link>https://www.bobdc.com/blog/chatgpt-copilot-owl/</link>
      <pubDate>Sun, 25 May 2025 11:20:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/chatgpt-copilot-owl/</guid>
      
      
      <description><div>Pretty impressive.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/owlChatgptCopilot.png&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;280px&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;I asked &lt;a href=&#34;https://en.wikipedia.org/wiki/ChatGPT&#34;&gt;ChatGPT&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Microsoft_Copilot&#34;&gt;Copilot&lt;/a&gt; to parse my two favorite home-grown OWL examples, do the appropriate inferencing, and show me the results, and I was impressed.&lt;/p&gt;
&lt;p&gt;First of all, &lt;a href=&#34;../dontneedowl&#34;&gt;you probably don&amp;rsquo;t need OWL&lt;/a&gt;, but I have one example about querying furniture inventory and another about querying people that I think nicely demonstrate the value of OWL for certain use cases. The first, which I created when &lt;a href=&#34;../trying-out-blazegraph&#34;&gt;Trying out Blazegraph&lt;/a&gt; and have used for other demos since, has triples that identify some chairs and desks and say what rooms they are each in. It also says which rooms are in which buildings. These triples aren&amp;rsquo;t enough to tell a SPARQL processor which furniture is in building 101, but if we have a triple saying that &lt;code&gt;dm:locatedIn&lt;/code&gt; is a transitive property, and we have two more triples telling us that &lt;code&gt;dm:Chair&lt;/code&gt; and &lt;code&gt;dm:Desk&lt;/code&gt; are both subclasses of &lt;code&gt;dm:Furniture&lt;/code&gt;, then an OWL processor can tell us which furniture is in which building. Blazegraph handled it fine at the time, and GraphDB and several other processors have done so since.&lt;/p&gt;
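&lt;p&gt;The data itself isn&amp;rsquo;t reproduced here, but the following Turtle sketch shows the general shape of the triples just described. (The resource names come from the ChatGPT exchange below; the exact triples in the original file may differ.)&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# A sketch of the furniture data&amp;rsquo;s general shape; not the literal file.
@prefix dm:   &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .
@prefix d:    &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix owl:  &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .

# The model triples that enable the inferencing:
dm:locatedIn a owl:TransitiveProperty .
dm:Chair rdfs:subClassOf dm:Furniture .
dm:Desk  rdfs:subClassOf dm:Furniture .

# Sample instance triples:
d:chair35 a dm:Chair ;
          dm:locatedIn d:room202 .
d:desk59  a dm:Desk ;
          dm:locatedIn d:room202 .
d:room202 dm:locatedIn d:building200 .
&lt;/code&gt;&lt;/pre&gt;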
&lt;p&gt;ChatGPT and Copilot both did this quite well. It was especially nice to see them explain the implications of the distinction between a regular SPARQL processor and one that uses inferencing.&lt;/p&gt;
&lt;h2 id=&#34;understanding-the-owl-properties-used-to-describe-my-data&#34;&gt;Understanding the OWL properties used to describe my data&lt;/h2&gt;
&lt;p&gt;Below is an edited version of my interaction with ChatGPT about the furniture example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here is some RDF data in the turtle syntax that uses RDFS and OWL to define a data model. The RDF data also includes some instance data. I am going to ask you some questions about it:&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then pasted the furniture data mentioned above; here is the response:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Thanks for sharing the RDF data. It defines a simple ontology involving buildings, rooms, and furniture, and models where each item is located using a transitive property dm:locatedIn.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I’m ready for your questions—what would you like to ask about this data?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;List any furniture in building 200.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Furniture and Their Locations:&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;d:chair35 → locatedIn d:room202 → locatedIn d:building200&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;d:desk59 → locatedIn d:room202 → locatedIn d:building200&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;ChatGPT also explained its logic for getting that answer, which I won&amp;rsquo;t reproduce here to get to the interesting part faster. In an earlier version of this exchange, at this point I asked:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you show me a SPARQL query that is the equivalent of my question about which chairs are in building 200?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It responded with this and some additional explanation:&lt;/p&gt;
&lt;pre&gt;&lt;b&gt;PREFIX d: &lt;http://learningsparql.com/ns/data#&gt;
PREFIX dm: &lt;http://learningsparql.com/ns/demo#&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;

SELECT ?chair
WHERE {
  ?chair rdf:type dm:Chair .
  ?chair dm:locatedIn+ d:building200 .
}
&lt;/b&gt;&lt;/pre&gt;
&lt;p&gt;(It&amp;rsquo;s a good SPARQL habit to have the first triple pattern specify the class of the returned data!) I then asked:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How would that return any results if the input data didn&amp;rsquo;t have any triples of the form ?chair dm:locatedIn+ d:building200 ?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Awesome question — and you&amp;rsquo;re absolutely right to question that!&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You&amp;rsquo;re pointing out something subtle but important: SPARQL doesn&amp;rsquo;t do OWL reasoning (like transitive inference) by default unless you either:&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Use a triplestore or reasoner that supports OWL inference (like Apache Jena/Fuseki with reasoning enabled, or GraphDB, Stardog, etc.).&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Explicitly model the transitive paths in the data using property paths (as in dm:locatedIn+) — but only if those intermediate triples exist in the data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then the explanations continued. I was happy to see it spell out the distinction between SPARQL querying that takes advantage of inferencing and SPARQL querying that doesn&amp;rsquo;t, because this is an important point when learning about RDF technology. I was also happy to see it mention GraphDB (the product that I spend most of my professional time documenting or &lt;a href=&#34;https://www.youtube.com/watch?v=dCndx2QJRIQ&#34;&gt;making videos about&lt;/a&gt;) as one of the main tools that can do SPARQL querying with inferencing.&lt;/p&gt;
&lt;p&gt;Trying all of this out with Copilot gave me roughly the same results.&lt;/p&gt;
&lt;h2 id=&#34;inferring-class-membership&#34;&gt;Inferring class membership&lt;/h2&gt;
&lt;p&gt;A very old-school use of OWL is to declare classes and then, instead of listing which instances belong to which classes (which is what most systems with any notion of classes and instances do), to list conditions for membership in each class and then query to see what their instances are.&lt;/p&gt;
&lt;p&gt;OWL classes that are defined by describing which instances meet which conditions are known as restriction classes. In some cases, these conditions may be so complex that the calculations must run overnight. This led to various OWL profiles that standardized on specific subsets of inferencing capabilities so that implementations optimized for certain inferencing tasks wouldn&amp;rsquo;t necessarily take so long to run.&lt;/p&gt;
&lt;p&gt;Example &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/ex424.ttl&#34;&gt;ex424.ttl&lt;/a&gt; from my book &lt;a href=&#34;https://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt; demonstrates the use of restriction classes by listing several musicians, where they are from, and what instruments they play, with no explicit indication of any of them being members of classes. The dataset includes a &lt;code&gt;dm:Guitarist&lt;/code&gt; class defined as someone who plays the guitar and a &lt;code&gt;dm:Texan&lt;/code&gt; class defined as someone who has &amp;ldquo;TX&amp;rdquo; as their &lt;code&gt;dm:stateOfBirth&lt;/code&gt; value. A third class, &lt;code&gt;dm:TexasGuitarPlayer&lt;/code&gt;, is defined as the intersection of the first two classes. (Set operations are a popular tool for defining OWL restriction classes.)&lt;/p&gt;
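&lt;p&gt;The following Turtle sketch shows roughly how classes like these can be declared. It illustrates the technique rather than reproducing the literal contents of &lt;code&gt;ex424.ttl&lt;/code&gt;, so details such as the &lt;code&gt;dm:plays&lt;/code&gt; property name are assumptions.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Illustrative sketch only; property names such as dm:plays are assumed.
@prefix dm:  &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .
@prefix d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix owl: &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .

dm:Guitarist owl:equivalentClass
  [ a owl:Restriction ;
    owl:onProperty dm:plays ;
    owl:hasValue d:Guitar ] .

dm:Texan owl:equivalentClass
  [ a owl:Restriction ;
    owl:onProperty dm:stateOfBirth ;
    owl:hasValue &amp;#34;TX&amp;#34; ] .

# Defined with a set operation: the intersection of the two classes above.
dm:TexasGuitarPlayer owl:equivalentClass
  [ owl:intersectionOf ( dm:Guitarist dm:Texan ) ] .
&lt;/code&gt;&lt;/pre&gt;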
&lt;p&gt;I told Copilot the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The following is an RDF OWL model with some instance data, written in the Turtle syntax. Perform any necessary inferencing with this as input and then I will ask you some questions about the resulting triples.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I&amp;rsquo;d be happy to help with RDF OWL inferencing! Please provide the Turtle syntax model, and I&amp;rsquo;ll process it accordingly. Once I&amp;rsquo;ve performed the necessary inferencing, you can ask me questions about the resulting triples.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then pasted the &lt;code&gt;ex424.ttl&lt;/code&gt; content. Here is how it responded:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/copilotAndTXGuitarPlayers.png&#34; alt=&#34;[Copilot result of OWL inferencing]&#34; border=&#34;0&#34; /&gt;
&lt;p&gt;I had one more request for it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;List the instances of &lt;code&gt;http://learningsparql.com/ns/demo#TexasGuitarPlayer&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We&amp;rsquo;ve already seen the answer above, but I wanted to see if it understood that &lt;code&gt;http://learningsparql.com/ns/demo#TexasGuitarPlayer&lt;/code&gt; was the full URI version of &lt;code&gt;dm:TexasGuitarPlayer&lt;/code&gt;. It did:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Based on the inferencing we performed, the instances of &lt;code&gt;dm:TexasGuitarPlayer&lt;/code&gt; (&lt;code&gt;http://learningsparql.com/ns/demo#TexasGuitarPlayer&lt;/code&gt;) are:&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;d:m2&lt;/code&gt; (Charlie Christian)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;d:m6&lt;/code&gt; (Roky Erickson)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;These musicians are both born in Texas (&lt;code&gt;d:TX&lt;/code&gt;) and play guitar (&lt;code&gt;d:Guitar&lt;/code&gt;), fulfilling the criteria for being classified as &lt;code&gt;dm:TexasGuitarPlayer&lt;/code&gt;. Let me know if you’d like to explore further details or additional queries!&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(As a tech writer, I was happy to see it put the prefixed names in a monospaced font.) With the same input, ChatGPT inferred the &lt;code&gt;dm:TexasGuitarPlayer&lt;/code&gt; class membership just fine.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re a really big OWL fan, it would be interesting to push the boundaries of the OWL profiles that I described above by trying out their most sophisticated features with ChatGPT, Copilot, and other LLM-based chat tools.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/114569318555462241&#34;&gt;Mastodon&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3lpz224et522j&#34;&gt;Bluesky&lt;/a&gt; posts announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2025">2025</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
      <category domain="https://www.bobdc.com//categories/ai">AI</category>
      
    </item>
    
    <item>
      <title>Converting RDFS schemas to SHACL constraints</title>
      <link>https://www.bobdc.com/blog/rdfs2shacl/</link>
      <pubDate>Sun, 09 Mar 2025 11:10:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdfs2shacl/</guid>
      
      
      <description><div>With SPARQL, of course.</div><div>&lt;p&gt;(This may look like a long blog entry, but it&amp;rsquo;s mostly sample schemas, data, and shapes. It should be a quick read.)&lt;/p&gt;
&lt;blockquote id=&#34;id202455&#34; class=&#34;pullquote&#34;&gt;if RDF technology uses triples to express &lt;i&gt;everything&lt;/i&gt; (except queries), why not automate the creation of SHACL constraints from RDF schema declarations?&lt;/blockquote&gt;
&lt;p&gt;In a blog entry titled &lt;a href=&#34;../whatisrdfspart2/&#34;&gt;What else can I do with RDFS?&lt;/a&gt; I described how the triple &lt;code&gt;{ vcard:given-name rdfs:domain emp:Person }&lt;/code&gt; lets us infer that a resource with a &lt;code&gt;vcard:given-name&lt;/code&gt; value is an instance of class &lt;code&gt;emp:Person&lt;/code&gt;. I then wrote &amp;ldquo;Sometimes we forget that RDFS and OWL were invented to enable this kind of inferencing across data found on the web. They were not invented to help us define data structures, but as I’ve shown, RDFS is handy to at least document them.&amp;rdquo;&lt;/p&gt;
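&lt;p&gt;Spelled out as triples, that inference looks something like this (the instance triple is invented for illustration):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Schema triple:
vcard:given-name rdfs:domain emp:Person .

# Hypothetical instance triple:
d:x vcard:given-name &amp;#34;Alice&amp;#34; .

# What an RDFS reasoner can infer from the two triples above:
d:x a emp:Person .
&lt;/code&gt;&lt;/pre&gt;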
&lt;p&gt;In the data processing world, the purpose of schemas is usually to describe the structure of some data so that a person or process working with that data knows what to expect. If a standard automated process flags parts of the data that don&amp;rsquo;t comply with the schema, that&amp;rsquo;s a Good Thing—it means that the person working with the data doesn&amp;rsquo;t need to write error-checking code to do that.&lt;/p&gt;
&lt;p&gt;As I described above, this was not the reason for RDF schemas, but they&amp;rsquo;ve still been a handy way to describe the structure of a given dataset. Using these schemas for error checking is not an incorrect use of them; &lt;a href=&#34;https://www.w3.org/TR/rdf11-schema/#ch_domainrange&#34;&gt;section 4 of the RDF Schema specification&lt;/a&gt; tells us &amp;ldquo;Different applications will use this information in different ways. For example, data checking tools might use this to help discover errors in some data set, an interactive editor might suggest appropriate values, and a reasoning application might use it to infer additional information from instance data.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Some people thought that OWL would make it easier to describe these constraints, but when it came to actually enforcing them, OWL just made things more complicated. So, the W3C eventually published the &lt;a href=&#34;https://www.bobdc.com/blog/validating-rdf-data-with-shacl/&#34;&gt;Shapes Constraint Language&lt;/a&gt; standard, or SHACL. This makes it relatively easy to specify typical constraints such as &amp;ldquo;an instance of Employee must have a family name and given name value&amp;rdquo; and &amp;ldquo;an instance of employee must have another employee instance as its &lt;code&gt;emp:reportsTo&lt;/code&gt; value&amp;rdquo;.&lt;/p&gt;
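&lt;p&gt;A SHACL shape expressing those two constraints might look something like this sketch, where the &lt;code&gt;emp:&lt;/code&gt; namespace and the name properties are assumptions made for the example:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Hypothetical sketch; emp: namespace and name property names are assumed.
@prefix sh:  &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix emp: &amp;lt;http://www.example.com/ns/employees#&amp;gt; .

emp:EmployeeShape a sh:NodeShape ;
  sh:targetClass emp:Employee ;
  sh:property [ sh:path emp:familyName ; sh:minCount 1 ] ;
  sh:property [ sh:path emp:givenName  ; sh:minCount 1 ] ;
  sh:property [ sh:path emp:reportsTo  ; sh:class emp:Employee ] .
&lt;/code&gt;&lt;/pre&gt;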
&lt;p&gt;If I want to write out a list of classes and properties that are in a given dataset, though, it&amp;rsquo;s still much simpler with RDFS. Then I had an idea: if RDF technology uses triples to express &lt;em&gt;everything&lt;/em&gt; (except queries), why not automate the creation of SHACL constraints from RDF schema declarations? It turned out to be surprisingly easy.&lt;/p&gt;
&lt;p&gt;Here is a sample schema excerpt for a community orchestra. It declares classes for musicians and instruments and describes two properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;m:Musician&lt;/code&gt; class&amp;rsquo;s &lt;code&gt;m:plays&lt;/code&gt; property, whose value is an instance of &lt;code&gt;m:Instrument&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the same class&amp;rsquo;s &lt;code&gt;m:joined&lt;/code&gt; property, which shows the date that the musician joined the group&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix m:    &amp;lt;http://learningsparql.com/ns/music#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .

m:Musician a rdfs:Class . 
m:Instrument a rdfs:Class .

m:plays a rdf:Property ;
        rdfs:domain m:Musician ;
        rdfs:range m:Instrument . 

m:joined a rdf:Property ; 
         rdfs:domain m:Musician ;
         rdfs:range xsd:date . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;My goal was to write a SPARQL &lt;code&gt;CONSTRUCT&lt;/code&gt; query that created SHACL shapes from the schema above to flag the following errors when they come up in instance data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an &lt;code&gt;m:plays&lt;/code&gt; triple whose value was not an instance of &lt;code&gt;m:Instrument&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;an &lt;code&gt;m:joined&lt;/code&gt; triple whose value was not a proper ISO 8601 date&lt;/li&gt;
&lt;li&gt;a musician with more than one &lt;code&gt;m:joined&lt;/code&gt; value&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The query that creates these shapes should not be about this specific data but work more generally with other object and datatype property values. Having this work with both &lt;a href=&#34;https://www.w3.org/TR/owl-ref/#Property&#34;&gt;object properties and datatype properties&lt;/a&gt; was very important for handling a wide variety of data structures.&lt;/p&gt;
&lt;p&gt;A query to do this was briefer than I thought it would be:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
PREFIX sh:   &amp;lt;http://www.w3.org/ns/shacl#&amp;gt;

CONSTRUCT {
  ?class a sh:NodeShape ;
  sh:targetClass ?class ;
  sh:property [
     a sh:PropertyShape ; 
     sh:path ?property ;
     ?rangePredicate ?propertyRange ;
     sh:minCount 1 ;
     sh:maxCount 1 
   ]
}
WHERE {
  ?class a rdfs:Class .
  ?property rdfs:domain ?class ;
            rdfs:range ?propertyRange .
  BIND(IF(contains(xsd:string(?propertyRange),
          &amp;#34;http://www.w3.org/2001/XMLSchema#&amp;#34;),
	  sh:datatype, sh:class) AS ?rangePredicate) . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I won&amp;rsquo;t describe the details of the SHACL syntax that it creates, because you can look that up yourself. The only somewhat tricky part of the query was identifying whether a declared property was a datatype property or an object property. The &lt;code&gt;IF()&lt;/code&gt; call that does this assumes that if a property is not a datatype property, it&amp;rsquo;s an object property; if you have more complex data, you can nest &lt;code&gt;IF()&lt;/code&gt; function calls to cover more complex cases.&lt;/p&gt;
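&lt;p&gt;Such nesting might look something like the following fragment, which checks for a second, custom datatype namespace before falling back to &lt;code&gt;sh:class&lt;/code&gt;. The &lt;code&gt;my-datatypes&lt;/code&gt; namespace is invented for illustration:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Hypothetical sketch; the my-datatypes namespace is invented.
BIND(IF(contains(xsd:string(?propertyRange),
        &amp;#34;http://www.w3.org/2001/XMLSchema#&amp;#34;),
        sh:datatype,
        IF(contains(xsd:string(?propertyRange),
           &amp;#34;http://example.com/ns/my-datatypes#&amp;#34;),
           sh:datatype,
           sh:class)) AS ?rangePredicate) .
&lt;/code&gt;&lt;/pre&gt;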
&lt;p&gt;The query adds &lt;code&gt;sh:minCount&lt;/code&gt; and &lt;code&gt;sh:maxCount&lt;/code&gt; values of 1 for all properties so that each property is required and can have only one value. An orchestra member may actually play more than one instrument, so the SHACL shapes that this query outputs can be easily edited to account for that. For me, the real value of the query above is to automate the creation of the shapes and their relationships, leaving me to do easy things like adjusting the count values by hand.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s see it in action. Here are the SHACL shapes that the &lt;code&gt;CONSTRUCT&lt;/code&gt; query created from the musician schema above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix m:    &amp;lt;http://learningsparql.com/ns/music#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix sh:   &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .

m:Musician  rdf:type    sh:NodeShape;
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:datatype  xsd:date;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      m:joined
                        ];
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:class     m:Instrument;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      m:plays
                        ];
        sh:targetClass  m:Musician .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Do these shapes do what they&amp;rsquo;re supposed to do? In the following sample data, the musician kim has two different &lt;code&gt;m:joined&lt;/code&gt; values. Musician pat has only one, but it&amp;rsquo;s not a proper ISO 8601 date. Also, pat has an &lt;code&gt;m:plays&lt;/code&gt; value of &lt;code&gt;m:kim&lt;/code&gt;, which is not an instance of the &lt;code&gt;m:Instrument&lt;/code&gt; class.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix m: &amp;lt;http://learningsparql.com/ns/music#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix sh: &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .

### instance data ###

m:guitar a m:Instrument .

m:piano a m:Instrument . 

m:kim a m:Musician ;
   m:joined &amp;#34;2024-10-12&amp;#34;^^xsd:date ;
   m:joined &amp;#34;2024-10-13&amp;#34;^^xsd:date ;
   m:plays m:guitar . 
   
m:pat a m:Musician ;
   m:joined &amp;#34;2023-03-13&amp;#34; ;
   m:plays m:kim .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Validating the shapes created by the &lt;code&gt;CONSTRUCT&lt;/code&gt; query against this instance data (using the &lt;a href=&#34;https://shacl.org/playground/&#34;&gt;SHACL Playground&lt;/a&gt;, &lt;a href=&#34;https://graphdb.ontotext.com/documentation/10.8/shacl-validation.html&#34;&gt;GraphDB&lt;/a&gt;, and &lt;a href=&#34;https://www.bobdc.com/blog/jenagems/#shacl&#34;&gt;Jena&amp;rsquo;s SHACL validator&lt;/a&gt;), I got this result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix m:    &amp;lt;http://learningsparql.com/ns/music#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix sh:   &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .

[ rdf:type     sh:ValidationReport;
  sh:conforms  false;
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  m:pat;
                 sh:resultMessage              &amp;#34;DatatypeConstraint[xsd:date]: Expected xsd:date : Actual xsd:string : Node \&amp;#34;2023-03-13\&amp;#34;&amp;#34;;
                 sh:resultPath                 m:joined;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:DatatypeConstraintComponent;
                 sh:sourceShape                _:b0;
                 sh:value                      &amp;#34;2023-03-13&amp;#34;
               ];
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  m:pat;
                 sh:resultMessage              &amp;#34;ClassConstraint[&amp;lt;http://learningsparql.com/ns/music#Instrument&amp;gt;]: Expected class :&amp;lt;http://learningsparql.com/ns/music#Instrument&amp;gt; for &amp;lt;http://learningsparql.com/ns/music#kim&amp;gt;&amp;#34;;
                 sh:resultPath                 m:plays;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:ClassConstraintComponent;
                 sh:sourceShape                [] ;
                 sh:value                      m:kim
               ];
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  m:kim;
                 sh:resultMessage              &amp;#34;maxCount[1]: Invalid cardinality: expected max 1: Got count = 2&amp;#34;;
                 sh:resultPath                 m:joined;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:MaxCountConstraintComponent;
                 sh:sourceShape                _:b0
               ]
] .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It looks like the shapes created by the &lt;code&gt;CONSTRUCT&lt;/code&gt; query did their job. (Isn&amp;rsquo;t it great that, along with RDFS schemas and SHACL shapes, the validation output is also expressed in triples? This means that you can make it part of a pipeline that combines additional steps into a complex workflow.)&lt;/p&gt;
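&lt;p&gt;For example, because the report is itself triples, a SPARQL query along these lines (a sketch using only standard SHACL vocabulary properties that appear in the report above) could pull out just the problem nodes and messages for the next step of such a pipeline:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX sh: &amp;lt;http://www.w3.org/ns/shacl#&amp;gt;

SELECT ?focusNode ?path ?message
WHERE {
  ?report a sh:ValidationReport ;
          sh:result ?result .
  ?result sh:focusNode ?focusNode ;
          sh:resultPath ?path ;
          sh:resultMessage ?message .
}
&lt;/code&gt;&lt;/pre&gt;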
&lt;p&gt;I also tried it with this next scheme, where the &lt;code&gt;hr:Employee&lt;/code&gt; class&amp;rsquo;s &lt;code&gt;hr:reportsTo&lt;/code&gt; property should have a value that is another &lt;code&gt;hr:Employee&lt;/code&gt; instance, and the &lt;code&gt;hr:jobGrade&lt;/code&gt; value must be an integer:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix hr:   &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .
@prefix d:    &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix sh:  &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .

hr:Employee a rdfs:Class .

hr:reportsTo a rdf:Property ;
rdfs:domain hr:Employee ;
rdfs:range hr:Employee . 

hr:name
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee .

hr:hireDate
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:date .

hr:jobGrade
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:integer .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;CONSTRUCT&lt;/code&gt; query above created these shapes from that:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:    &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix hr:   &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix sh:   &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .

hr:Employee  rdf:type   sh:NodeShape;
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:datatype  xsd:date;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      hr:hireDate
                        ];
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:class     hr:Employee;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      hr:reportsTo
                        ];
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:datatype  xsd:integer;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      hr:jobGrade
                        ];
        sh:targetClass  hr:Employee .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;My sample test instance data for that has an employee e3 who reports to &lt;code&gt;d:d1&lt;/code&gt;, a resource not mentioned elsewhere in the data as an instance of &lt;code&gt;hr:Employee&lt;/code&gt; or anything else. Employee e3 also has a non-integer job grade.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:    &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix hr:   &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix sh:   &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .

d:e1
   a hr:Employee;
   hr:name &amp;#34;Barry Wom&amp;#34; ;
   hr:hireDate &amp;#34;2017-06-03&amp;#34;^^xsd:date ;
   hr:reportsTo d:e3 ; 
   hr:jobGrade 5 .

d:e3
   a hr:Employee;
   hr:name &amp;#34;Stig O&amp;#39;Hara&amp;#34; ;
   hr:hireDate &amp;#34;2017-03-14&amp;#34;^^xsd:date ;
   hr:jobGrade 3.14 ;
   hr:reportsTo d:d1 .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When the employee shapes created by the SPARQL query are run against this sample data, the validator finds both problems:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:    &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix hr:   &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix sh:   &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .

[ rdf:type     sh:ValidationReport;
  sh:conforms  false;
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  d:e3;
                 sh:resultMessage              &amp;#34;ClassConstraint[&amp;lt;http://learningsparql.com/ns/humanResources#Employee&amp;gt;]: Expected class :&amp;lt;http://learningsparql.com/ns/humanResources#Employee&amp;gt; for &amp;lt;http://learningsparql.com/ns/data#d1&amp;gt;&amp;#34;;
                 sh:resultPath                 hr:reportsTo;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:ClassConstraintComponent;
                 sh:sourceShape                [] ;
                 sh:value                      d:d1
               ];
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  d:e3;
                 sh:resultMessage              &amp;#34;DatatypeConstraint[xsd:integer]: Expected xsd:integer : Actual xsd:decimal : Node 3.14&amp;#34;;
                 sh:resultPath                 hr:jobGrade;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:DatatypeConstraintComponent;
                 sh:sourceShape                [] ;
                 sh:value                      3.14
               ]
] .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Any SHACL fan is going to think of other things that the &lt;code&gt;CONSTRUCT&lt;/code&gt; query can deduce from a regular RDFS schema in order to add more useful triples to the SHACL shapes created from that schema. Let me know what you come up with!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/114133176596206659&#34;&gt;Mastodon&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3ljxe4hgqmc25&#34;&gt;Bluesky&lt;/a&gt; posts announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2025">2025</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/shacl">SHACL</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>Filtering foreign literals out of SPARQL query results</title>
      <link>https://www.bobdc.com/blog/filterforeignliterals/</link>
      <pubDate>Sun, 26 Jan 2025 10:20:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/filterforeignliterals/</guid>
      
      
      <description><div>And only the foreign literals.</div><div>&lt;blockquote class=&#34;pullquote&#34;&gt;At first I was treating this like an overly complex logic puzzle, wondering how I could get literals that were (not (not English)).&lt;/blockquote&gt;
&lt;p&gt;It&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#func-lang&#34;&gt;easy enough&lt;/a&gt; for a SPARQL query to specify that you only want literal values that are tagged with a particular spoken language such as English or French. Recently I had a more complex condition to express, one that comes up fairly often: how do I retrieve all the data for a particular resource &lt;em&gt;except&lt;/em&gt; the literals tagged with a foreign language? I want all the triples with object property values, and I want all the triples with literal values, regardless of type, unless they are tagged with a language other than English. (Obviously, you can substitute another language tag as the only one whose values you want to see.)&lt;/p&gt;
&lt;p&gt;This came up when I was playing with &lt;a href=&#34;https://yago-knowledge.org/&#34;&gt;YAGO&lt;/a&gt;, but it has also happened when I was working with &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt; and &lt;a href=&#34;https://www.dbpedia.org/&#34;&gt;DBpedia&lt;/a&gt;. These are such international data collections that many of the string literal values are available in many languages, which is great, but when I retrieve all the data for a given resource, I see lots and lots of string values that I don&amp;rsquo;t need.&lt;/p&gt;
&lt;p&gt;For example, try a &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ1369941%3E%20%3Fp%20%3Fo%20%0A%7D&#34;&gt;Wikidata query about data for bebop bassist Tommy Potter&lt;/a&gt;. Of the 156 triples that get returned, 25 are &lt;code&gt;rdfs:label&lt;/code&gt; values for his name tagged for different languages (but usually showing &amp;ldquo;Tommy Potter&amp;rdquo;), 3 are &lt;code&gt;skos:altLabel&lt;/code&gt; values for his name tagged with different languages, and 27 triples are &lt;code&gt;schema:description&lt;/code&gt;  values with the English one being &amp;ldquo;American jazz double bassist (1918–1988)&amp;rdquo; and the rest being variations on that in other languages.&lt;/p&gt;
&lt;p&gt;If I ask for triples whose objects are tagged as English language values, like this,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
  &amp;lt;http://www.wikidata.org/entity/Q1369941&amp;gt; ?p ?o 
  FILTER( lang(?o) = &amp;#34;en&amp;#34;)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I&amp;rsquo;ll only get three search results: the English version of each of the three properties mentioned above. I&amp;rsquo;ll miss out on literal values that aren&amp;rsquo;t tagged as English, whether they are strings or other data types such as the one for Potter&amp;rsquo;s birthday. I&amp;rsquo;ll also miss out on triples that have a URI as an object.&lt;/p&gt;
&lt;p&gt;To come up with a good FILTER, I switched from querying for Tommy Potter data to querying the following small test data set that  I created:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix ls: &amp;lt;http://www.learningsparql/ns/test&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; . 
@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .

ls:someEntity rdfs:label &amp;#34;some entity&amp;#34; ;
			  rdfs:label &amp;#34;Some Entity&amp;#34;@en ;
			  rdfs:label &amp;#34;alguna entidad&amp;#34;@es ;
			  ls:created &amp;#34;2025-01-23&amp;#34;^^xsd:date ;
			  ls:amount  4 ;
			  ls:rating  3.14 ;
			  ls:needs  ls:someOtherEntity . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I wanted a query that would retrieve all of this data except for the &amp;ldquo;alguna entidad&amp;rdquo; triple. I wanted the one with the &amp;ldquo;en&amp;rdquo; language tag, the string with no language tag, the three typed literals, and the triple that has a URI as an object.&lt;/p&gt;
&lt;p&gt;At first I was treating this like an overly complex logic puzzle, wondering how I could get literals that were (not (not English)). I finally realized that it would be much easier to have a boolean OR ask for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The triples where the objects are URIs.&lt;/li&gt;
&lt;li&gt;Literals that are tagged as being in English.&lt;/li&gt;
&lt;li&gt;Literals that have no language tag. This would get the first &amp;ldquo;some entity&amp;rdquo; triple in my sample data, but perhaps more importantly, it would get the &lt;code&gt;ls:created&lt;/code&gt;, &lt;code&gt;ls:amount&lt;/code&gt;, and &lt;code&gt;ls:rating&lt;/code&gt; values.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following does this.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
   ?s ?p ?o .
   FILTER( ISIRI(?o) || (lang(?o) = &amp;#34;en&amp;#34;)  ||  (!(langMatches(lang(?o),&amp;#34;*&amp;#34;))) )
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The first two filter conditions are basic SPARQL: if a triple&amp;rsquo;s object is an IRI, we want it; if the triple&amp;rsquo;s object has a language tag of &amp;ldquo;en&amp;rdquo; for English, we want it.&lt;/p&gt;
&lt;p&gt;The third filter condition uses the &lt;code&gt;langMatches()&lt;/code&gt; function. I had forgotten about this one but was reminded by the section &amp;ldquo;Checking, Adding and Removing Spoken Language Tags&amp;rdquo; of my book &lt;a href=&#34;https://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;. Without the &lt;code&gt;!&lt;/code&gt; to do a boolean NOT, the &lt;code&gt;langMatches()&lt;/code&gt; expression in this query with &amp;ldquo;*&amp;rdquo; as an argument would return True for any value of &lt;code&gt;?o&lt;/code&gt; that has any language tag; with the boolean NOT it returns True for any value that has no language tag. So, it does the job described by the third bullet above.&lt;/p&gt;
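The three-way test is easy to sanity-check outside of SPARQL. Here is a small Python sketch of the same boolean logic, using hypothetical (value, language-tag) pairs to stand in for the RDF terms in my test data (None marks an IRI object and an empty string marks a literal with no language tag):

```python
# Model each triple's object as (value, lang_tag): lang_tag is None for
# IRI objects and "" for untagged or typed literals. Hypothetical data
# mirroring the "some entity" test set above.
objects = [
    ("some entity", ""),           # plain literal, no language tag
    ("Some Entity", "en"),         # English-tagged literal
    ("alguna entidad", "es"),      # Spanish-tagged literal: drop this one
    ("2025-01-23", ""),            # typed literal, so no language tag
    ("4", ""),                     # typed literal
    ("3.14", ""),                  # typed literal
    ("ls:someOtherEntity", None),  # IRI object
]

def keep(lang_tag):
    """Mirror the FILTER: an IRI, or tagged 'en', or no language tag."""
    is_iri = lang_tag is None
    return is_iri or lang_tag == "en" or lang_tag == ""

kept = [value for value, tag in objects if keep(tag)]
```

Running this keeps six of the seven values, dropping only the Spanish-tagged one, which is exactly what the FILTER condition should do.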
&lt;p&gt;For my &amp;ldquo;some entity&amp;rdquo; sample data this query returned everything but the &amp;ldquo;alguna entidad&amp;rdquo;@es triple, as I had hoped. For the query of Tommy Potter data, you can &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ1369941%3E%20%3Fp%20%3Fo%20.%0A%20%20%20%20FILTER%28%20ISIRI%28%3Fo%29%20%7C%7C%20%28lang%28%3Fo%29%20%3D%20%22en%22%29%20%20%7C%7C%20%20%28%21%28langMatches%28lang%28%3Fo%29%2C%22%2a%22%29%29%29%20%29%0A%0A%7D&#34;&gt;see for yourself&lt;/a&gt; that it returns 104 rows instead of 156, with no literal values tagged with a language other than English. The results include only one row for each of the &lt;code&gt;rdfs:label&lt;/code&gt;, &lt;code&gt;rdfs:description&lt;/code&gt;, and &lt;code&gt;skos:altLabel&lt;/code&gt; values. (Changing the &amp;ldquo;en&amp;rdquo; in the Tommy Potter query to &amp;ldquo;de&amp;rdquo; for German and &amp;ldquo;es&amp;rdquo; for Spanish got the expected results.)&lt;/p&gt;
&lt;p&gt;If anyone can suggest a more efficient version of that boolean &lt;code&gt;FILTER&lt;/code&gt; condition I&amp;rsquo;d love to see it, but meanwhile I&amp;rsquo;m sure I&amp;rsquo;ll be pasting it into a lot more queries in the future when I explore large international datasets.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;2025-02-01 update: I have learned that langMatches(lang(?o),&amp;quot;&amp;quot;) does the same thing as (!(langMatches(lang(?o),&amp;quot;*&amp;quot;))), which simplifies the expression further. Thanks &lt;a href=&#34;https://www.linkedin.com/feed/update/urn:li:activity:7289303997264351232?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7289303997264351232%2C7289571596011278336%29&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287289571596011278336%2Curn%3Ali%3Aactivity%3A7289303997264351232%29&#34;&gt;
Mohammad Hossein Rimaz&lt;/a&gt; and &lt;a href=&#34;https://mstdn.social/@janmartinkeil/113901220387268807&#34;&gt;Jan Martin Keil&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
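With that simplification in place, a sketch of the whole query (based on the commenters' suggestion, so test it against your own SPARQL engine before relying on it) would be:

```sparql
SELECT * WHERE {
   ?s ?p ?o .
   # IRI objects, English-tagged literals, or literals with no language tag
   FILTER( ISIRI(?o) || (lang(?o) = "en") || langMatches(lang(?o), "") )
}
```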
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/113895362038956690&#34;&gt;Mastodon message&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3lgnqkgrf3k2l&#34;&gt;Bluesky post&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2025">2025</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Parsing JSON with Python</title>
      <link>https://www.bobdc.com/blog/pythonjson/</link>
      <pubDate>Sun, 15 Dec 2024 10:10:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/pythonjson/</guid>
      
      
      <description><div>My personal quick reference</div><div>&lt;p&gt;It seems like every few months I have a project where I need to parse some JSON and pull out certain parts. Maybe the JSON came in JSON files, or maybe I retrieved it from an API. The duration between each of these occasions is long enough that I&amp;rsquo;ve had to relearn some basics each time, so a year or two ago I made a sample JSON file that demonstrates a few data structures and features, and then I wrote a Python demo script that parses them. Now I look at that script to review the basics each time I need to do this.&lt;/p&gt;
&lt;p&gt;I usually need to pull out a subset of that JSON and convert it to RDF triples. If it&amp;rsquo;s &lt;a href=&#34;../json-ld&#34;&gt;JSON-LD&lt;/a&gt;, I don&amp;rsquo;t need any Python parsing because it&amp;rsquo;s already an RDF serialization format, so I can feed it to any proper RDF parser as-is, but it&amp;rsquo;s rarely JSON-LD.&lt;/p&gt;
&lt;p&gt;Another option is AtomGraph&amp;rsquo;s &lt;a href=&#34;../json2rdf&#34;&gt;JSON2RDF&lt;/a&gt;. This converts any JSON at all to RDF, but if I only need a small subset of the data, I then need to create a SPARQL query to run against the JSON2RDF output so that I can pull out the parts that I want and convert them to the RDF classes and properties that I need. I would also have to build and install JSON2RDF on the platform where I&amp;rsquo;m running this, which was not an option on the server where I recently had to work with some JSON.&lt;/p&gt;
&lt;p&gt;My sample demo data to parse is pretty close to the test input that I used when I wrote about JSON2RDF:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;color&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;red&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;amount&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;arrayTest&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;north&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;south&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;east&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;escaped \&amp;#34;test\&amp;#34; string&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;west&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;boolTest&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#66d9ef&#34;&gt;true&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;nullTest&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#66d9ef&#34;&gt;null&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;addressBookEntry&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;givenName&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Richard&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;familyName&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Mutt&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;address&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;street&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;1 Main St&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;city&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Springfield&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;zip&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;10045&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I read and output it with this Python:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#!/usr/bin/env python3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; json
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;f &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; open(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;jsondemo.js&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;data &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; json&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;load(f)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;color&amp;#34;&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;amount&amp;#34;&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Pull something out of the middle of an array&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;arrayTest&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;boolTest&amp;#34;&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;nullTest&amp;#34;&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Use a boolean value&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;boolTest&amp;#34;&lt;/span&gt;]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;So boolean!&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Dig down into a data structure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;addressBookEntry&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;address&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;city&amp;#34;&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;-- mydata properties: --&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(p)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;-- list addressBookEntry property names and values: --&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;addressBookEntry&amp;#34;&lt;/span&gt;]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(p &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;: &amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; str(data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;addressBookEntry&amp;#34;&lt;/span&gt;][p]))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Testing whether values are present.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;familyName&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;addressBookEntry&amp;#34;&lt;/span&gt;]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;There is a family name value.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;else&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;There is no family name value.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;phone&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; data[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mydata&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;addressBookEntry&amp;#34;&lt;/span&gt;]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;There is a phone value.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;else&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;There is no phone value.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;f&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;close()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It has print statements and comments describing the demonstrated tasks, so I don&amp;rsquo;t need to describe them here. Here is the output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;red
3
escaped &amp;#34;test&amp;#34; string
True
None
So boolean!
Springfield
-- mydata properties: --
color
amount
arrayTest
boolTest
nullTest
addressBookEntry
-- list addressBookEntry property names and values: --
givenName: Richard
familyName: Mutt
address: {&amp;#39;street&amp;#39;: &amp;#39;1 Main St&amp;#39;, &amp;#39;city&amp;#39;: &amp;#39;Springfield&amp;#39;, &amp;#39;zip&amp;#39;: &amp;#39;10045&amp;#39;}
There is a family name value.
There is no phone value.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I hope that someday, when someone asks themselves, as I have asked myself every few months, &amp;ldquo;how do I deal with that little bit of JSON in Python again?&amp;rdquo;, this demo can save them a few minutes.&lt;/p&gt;
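One related shortcut worth remembering: the dictionary get() method combines the presence test and the lookup, returning a default value when a key is missing. A small sketch using a fragment shaped like the addressBookEntry above:

```python
import json

# A fragment shaped like the addressBookEntry in the sample file.
entry = json.loads('{"givenName": "Richard", "familyName": "Mutt"}')

# get() returns the value if the key exists and the supplied default if
# it doesn't, replacing the if/else presence tests in the demo script.
print(entry.get("familyName", "(no family name)"))  # Mutt
print(entry.get("phone", "(no phone)"))             # (no phone)
```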
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/113657519628296478&#34;&gt;Mastodon message&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3lde4mgel6k23&#34;&gt;Bluesky post&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/json">JSON</category>
      
    </item>
    
    <item>
      <title>Amazon&#39;s failed folksonomy and Kevin Federline </title>
      <link>https://www.bobdc.com/blog/federline/</link>
      <pubDate>Sat, 30 Nov 2024 12:01:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/federline/</guid>
      
      
      <description><div>What could go wrong? </div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/federlineAlbum.jpg&#34; alt=&#34;[Kevin Federline Playing with Fire album cover]&#34; border=&#34;0&#34; width=&#34;240&#34; align=&#34;right&#34; style=&#34;margin-left: 30px; margin-bottom: 30px&#34; /&gt;
&lt;p&gt;A few years ago I wrote about some &lt;a href=&#34;../firstmetadata&#34;&gt;metadata that was 4,000 years old&lt;/a&gt;. Today I wanted to write about another high point in the history of metadata — well, some sort of point: the failure of Amazon&amp;rsquo;s folksonomy and the &lt;a href=&#34;https://www.amazon.com/gp/product/tags-on-product/B000IU3YLY/&#34;&gt;Playing with Fire&lt;/a&gt; album by Kevin Federline, the former Mr. Britney Spears. Throughout this discussion I&amp;rsquo;ll show some of the tags that Amazon users added to this album,  and I&amp;rsquo;ll also provide a link to a page where you can see the top 100 entries. (As you&amp;rsquo;ll see from the link in the previous sentence, these tags are no longer on Amazon&amp;rsquo;s page for that album.)&lt;/p&gt;
&lt;blockquote&gt;
worst album ever&lt;br/&gt;
an assault to decency&lt;br/&gt;
bird vomit&lt;br/&gt;
every track ought to be hidden
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/Folksonomy&#34;&gt;Wikipedia page for &amp;ldquo;folksonomy&amp;rdquo;&lt;/a&gt; begins by defining it as &amp;ldquo;a classification system in which end users apply public tags to online items, typically to make those items easier for themselves or others to find later. Over time, this can give rise to a classification system based on those tags and how often they are applied or searched for, in contrast to a taxonomic classification designed by the owners of the content and specified when it is published&amp;rdquo;. The rest of the page gives a lot of good background. (It also says that the &amp;ldquo;study of the structuring or classification of folksonomy is termed folksontology&amp;rdquo;, which looks like good work going on, but I sure hope they come up with a better name.)&lt;/p&gt;
&lt;p&gt;For another good definition, slide 6 of a &lt;a href=&#34;http://www.greenchameleon.com/uploads/iKMS_Christine_Connors_slides.pdf&#34;&gt;presentation&lt;/a&gt; (pdf) by &lt;a href=&#34;https://www.linkedin.com/in/cjmconnors/&#34;&gt;Christine Connors&lt;/a&gt;, who is currently at Raytheon but whose name will be familiar to anyone who has done much work with taxonomies and data modeling, has a nice summary of what folksonomies are: a &amp;ldquo;people&amp;rsquo;s classification management&amp;rdquo; system that draws on the wisdom of the crowd to apply user-generated tags to digital resources  so that, as with any tagging system, people can find the resources they need more easily.&lt;/p&gt;
&lt;p&gt;When folksonomies first became popular, some people became overly excited. For example, the 2007 paper &lt;a href=&#34;https://hal.science/hal-00531169/file/2007FolksonomySHORT.pdf&#34;&gt;Folksonomy: the New Way to Serendipity&lt;/a&gt; (pdf) claimed that &amp;ldquo;folksonomy allows various modalities of curious explorations: a cultural exploration and a social exploration.&amp;rdquo;&lt;/p&gt;
&lt;blockquote&gt;
music to make you long for the sweet release of death&lt;br/&gt;
should be working at wendys&lt;br/&gt;
tiresome and vulgar&lt;br/&gt;
vanilla ice
&lt;/blockquote&gt;
&lt;p&gt;A lot of professional taxonomists didn&amp;rsquo;t like folksonomies, because without some measure of control you don&amp;rsquo;t really have a &lt;a href=&#34;../what-is-a-taxonomy#id202592&#34;&gt;controlled vocabulary&lt;/a&gt;. You have a free-for-all. Quality taxonomies are built around a governance process to ensure that terms are added, revised, or deleted in an organized way. This process also ensures that these terms are presented consistently, and the resulting metadata helps users find the resources they need more efficiently. I attended the &lt;a href=&#34;https://www.taxonomybootcamp.com/2024/default.aspx&#34;&gt;Taxonomy Bootcamp&lt;/a&gt; conference for several years when folksonomies were hotter, and based on many presentations that I saw, it was clear that this hotness had many in the community feeling a greater need to prove the value of their profession.&lt;/p&gt;
&lt;p&gt;The Wikipedia page includes a &lt;a href=&#34;https://en.wikipedia.org/wiki/Folksonomy#Benefits_and_disadvantages&#34;&gt;pretty good list of the advantages and disadvantages&lt;/a&gt; of folksonomies, but it omits one of the key perceived advantages at the time: free metadata. Metadata&amp;rsquo;s purpose is to add value to data by making it easier to navigate, and when Amazon decided to let anyone tag any product with anything, they thought they were getting people to add value for nothing. It was an early hint about where the eventual &amp;ldquo;creator economy&amp;rdquo; would go: people doing free work for a multi-billion dollar company for the sense of community and for the satisfaction in seeing their work on the big famous platform, but certainly not for any worthwhile amount of money.&lt;/p&gt;
&lt;p&gt;What could go wrong? Let&amp;rsquo;s look at Exhibit A. Thanks to the &lt;a href=&#34;https://web.archive.org/&#34;&gt;Wayback Machine&lt;/a&gt; we can look at the &lt;a href=&#34;https://web.archive.org/web/20080115123201/http://www.amazon.com/Playing-Fire-Kevin-Federline/dp/tags-on-product/B000IU3YLY&#34;&gt;January 15, 2008 version&lt;/a&gt; of Amazon&amp;rsquo;s page for Mr. Federline&amp;rsquo;s debut album. Let&amp;rsquo;s look at an excerpt:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/federlineTags.png&#34; border=&#34;0&#34; width=&#34;360&#34; style=&#34;display: block; margin-left: auto; margin-right: auto; &#34; alt=&#34;Excerpt from 2008 Federline album Amazon page&#34;/&gt;
&lt;p&gt;Lots of interesting information here!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first tag ever applied was &amp;ldquo;poser loser&amp;rdquo;&lt;/li&gt;
&lt;li&gt;The last tag was &amp;ldquo;stupid&amp;rdquo;&lt;/li&gt;
&lt;li&gt;The most popular tag was &amp;ldquo;talentless&amp;rdquo;, which 45 people applied&lt;/li&gt;
&lt;li&gt;The second most popular was &amp;ldquo;music to make you long for the sweet release of death&amp;rdquo;, which 27 people applied&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And so on. As you can see, the tags in that list were links, but they no longer work. Back then, each of these linked to a page that listed everything else with that tag. The first one went to the excellent URL &lt;a href=&#34;http://www.amazon.com/tag/talentless&#34;&gt;http://www.amazon.com/tag/talentless&lt;/a&gt;, which no longer has anything on it, but you can see a &lt;a href=&#34;https://web.archive.org/web/20081222022439/http://www.amazon.com/tag/talentless&#34;&gt;Wayback Machine version&lt;/a&gt; from that era that includes some of Paris Hilton&amp;rsquo;s work. At one point that older version also let you &amp;ldquo;narrow by popular tags&amp;rdquo; like &amp;ldquo;horrible&amp;rdquo; and &amp;ldquo;trash&amp;rdquo; as a sort of faceted search.&lt;/p&gt;
&lt;blockquote&gt;
frisbee&lt;br/&gt;
i left justin for this&lt;br/&gt;
pure concentrated evil&lt;br/&gt;
cole slaw
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Kevin_Federline&#34;&gt;Wikipedia&amp;rsquo;s Kevin Federline page&lt;/a&gt; says that the album is &amp;ldquo;commonly considered to be one of the worst albums ever released&amp;rdquo;. That links to the Wikipedia page titled &lt;a href=&#34;https://en.wikipedia.org/wiki/List_of_music_considered_the_worst&#34;&gt;List of music considered the worst&lt;/a&gt;, where it is the first entry under &amp;ldquo;2000s–2020s&amp;rdquo; (which, to be fair, is sorted chronologically). The album is not even on Spotify, but if you really want to hear an awful attempt at West Coast hiphop, it&amp;rsquo;s &lt;a href=&#34;https://www.youtube.com/watch?v=8LZdTnYmj18&#34;&gt;on YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Because of tags like these, Amazon eventually realized that folksonomies were not adding worthwhile value. They discontinued the use of folksonomy tags and eventually removed the &lt;a href=&#34;https://web.archive.org/web/20160309205856/http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&amp;amp;nodeId=16238571&#34;&gt;About Tags&lt;/a&gt; page, which explained that they had once offered this feature but no longer did. As it says, &amp;ldquo;We&amp;rsquo;ve since continued to innovate on our more popular features such as Wish Lists, Customer Reviews, and Improve Your Recommendations&amp;rdquo;. More innovations in the creator economy!&lt;/p&gt;
&lt;p&gt;So taxonomists didn&amp;rsquo;t have to worry about proving the value of their profession. In fact, as more people learn that carefully curated knowledge graphs are a useful tool for reducing the number of hallucinations coming from Large Language Models, they are coming to appreciate that taxonomists&amp;rsquo; skills are an excellent fit for curating those knowledge models. I&amp;rsquo;m sure there are still plenty of Content Management Systems in production where users can make up their own tags, but I&amp;rsquo;ll bet trained taxonomists are reviewing and normalizing those contributions as part of their job of improving the system&amp;rsquo;s metadata.&lt;/p&gt;
&lt;blockquote&gt;
makes baby jesus cry&lt;br/&gt;
dumbass&lt;br/&gt;
rich wife&lt;br/&gt;
ear bleach
&lt;/blockquote&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to my &lt;a href=&#34;https://mas.to/@bobdc/113573018025138224&#34;&gt;Mastodon message&lt;/a&gt; or &lt;a href=&#34;https://bsky.app/profile/bobdc.bsky.social/post/3lc6ls5xjtk2n&#34;&gt;Bluesky post&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/amazon">Amazon</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>RDF serialization formats</title>
      <link>https://www.bobdc.com/blog/trig/</link>
      <pubDate>Sun, 27 Oct 2024 10:45:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/trig/</guid>
      
      
      <description><div>Starring TriG</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/trigPlaneSquare.png&#34; alt=&#34;[TRIG biplane]&#34; border=&#34;0&#34; width=&#34;240&#34; align=&#34;right&#34; style=&#34;margin-left: 30px; margin-bottom: 30px&#34; /&gt;
&lt;p&gt;For this month’s blog entry I originally planned to create a reference for RDF serialization formats. My idea was to create a table listing all the known formats, with links to their specs (when they have one), their age, origin, a sample, and some opinionated comments–for example, why creating new documents in RDF/XML made sense in 1999 but no longer does.&lt;/p&gt;
&lt;p&gt;I found a few nice existing surveys, so instead of creating a new one I will list those:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The GraphDB &lt;a href=&#34;https://graphdb.ontotext.com/documentation/10.7/rdf-formats.html&#34;&gt;RDF formats&lt;/a&gt; documentation page was created before I joined Ontotext (which, by the way, is merging with &lt;a href=&#34;https://semantic-web.com/&#34;&gt;Semantic Web Company&lt;/a&gt; of &lt;a href=&#34;https://www.poolparty.biz/&#34;&gt;PoolParty&lt;/a&gt; fame to become &lt;a href=&#34;https://graphwise.ai/&#34;&gt;Graphwise&lt;/a&gt;) and has good information about the important formats.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://docs.aws.amazon.com/neptune/latest/userguide/sparql-media-type-support.html&#34;&gt;Neptune list&lt;/a&gt; has some nice descriptions of each.&lt;/li&gt;
&lt;li&gt;The Medium article &lt;a href=&#34;https://medium.com/wallscope/understanding-linked-data-formats-rdf-xml-vs-turtle-vs-n-triples-eb931dbe9827&#34;&gt;Understanding Linked Data Formats&lt;/a&gt; doesn&amp;rsquo;t cover many formats but it does include examples of the ones that are listed.&lt;/li&gt;
&lt;li&gt;The W3C &lt;a href=&#34;https://www.w3.org/wiki/RdfSyntax&#34;&gt;RDFSyntax&lt;/a&gt; page is a bit out of date (saying, for example, that Turtle is still in the process of becoming a W3C Recommendation) but it&amp;rsquo;s interesting for the historical perspective it provides on attempts to create serialization formats.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I tend to just use Turtle for everything. I have found N-Triples useful for &lt;a href=&#34;../driving-hadoop-data-integratio/&#34;&gt;certain experiments&lt;/a&gt; because you can split a file up at any line breaks (for example, with some shell text processing utilities) and be confident that the pieces will all be syntactically correct.&lt;/p&gt;
&lt;p&gt;When I stored data with named graphs I used N-Quads, which is N-Triples with a graph name URI added to each line that represents a triple in a named graph. (N-Triples and especially N-Quads are difficult to write about: a single statement can&amp;rsquo;t contain a line break, and you must use full URIs instead of prefixed names, so it&amp;rsquo;s hard to come up with realistic examples that properly fit on a line of a book page or browser paragraph.)&lt;/p&gt;
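&lt;p&gt;For example, here is a single hypothetical N-Quads statement (made-up example.com URIs); the subject, predicate, object, and graph name must all stay on one line, each spelled out as a full URI:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://example.com/ns/data#x&amp;gt; &amp;lt;http://example.com/ns/demo#tag&amp;gt; &amp;#34;one&amp;#34; &amp;lt;http://example.com/ns/data#g1&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;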
&lt;p&gt;I never paid attention to &lt;a href=&#34;https://www.w3.org/TR/trig/&#34;&gt;TriG&lt;/a&gt;, which is now a bit embarrassing because TriG 1.1 has been a Recommendation for over ten years and it looks like a great way to represent data in named graphs. It&amp;rsquo;s basically Turtle with the SPARQL syntax for specifying triples in named graphs.&lt;/p&gt;
&lt;p&gt;As an example, let&amp;rsquo;s first look at the update request &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/ex338.ru&#34;&gt;example 338&lt;/a&gt; from my book &lt;a href=&#34;https://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;. I used it as part of an example in my &lt;a href=&#34;../selectingall/&#34;&gt;previous blog entry&lt;/a&gt; to create two graphs and then put two triples in each of them as well as adding two triples to the default graph:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# filename: ex338.ru

PREFIX d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt;

INSERT DATA
{
  d:x dm:tag &amp;#34;one&amp;#34; . 
  d:x dm:tag &amp;#34;two&amp;#34; . 

  GRAPH d:g1
  { 
    d:x dm:tag &amp;#34;three&amp;#34; . 
    d:x dm:tag &amp;#34;four&amp;#34; . 
  }

  GRAPH d:g2
  { 
    d:x dm:tag &amp;#34;five&amp;#34; . 
    d:x dm:tag &amp;#34;six&amp;#34; . 
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Instead of running that update query to load those six triples, I could have just loaded the following TriG file and gotten the same result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .

{
  d:x dm:tag &amp;#34;one&amp;#34; . 
  d:x dm:tag &amp;#34;two&amp;#34; .
}

GRAPH d:g1 {
    d:x dm:tag &amp;#34;three&amp;#34; . 
    d:x dm:tag &amp;#34;four&amp;#34; .
}

GRAPH d:g2 {
    d:x dm:tag &amp;#34;five&amp;#34; . 
    d:x dm:tag &amp;#34;six&amp;#34; .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;GRAPH&lt;/code&gt; keywords in this TriG sample, which you&amp;rsquo;ll recognize as the SPARQL way to say &amp;ldquo;here comes a named graph&amp;rdquo;, are actually optional. The curly braces around the &amp;ldquo;one&amp;rdquo; and &amp;ldquo;two&amp;rdquo; triples to delimit the default graph&amp;rsquo;s triples are also optional. With or without these optional bits, anyone who has been using Turtle and SPARQL for a few years will find TriG&amp;rsquo;s syntax to be intuitive.&lt;/p&gt;
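&lt;p&gt;Dropping those optional bits gives an even terser file that loads the same six triples. This version is just a sketch, but it follows the TriG 1.1 grammar: default graph triples sit at the top level with no braces, and a named graph&amp;rsquo;s label simply precedes its curly braces:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .

d:x dm:tag &amp;#34;one&amp;#34; .
d:x dm:tag &amp;#34;two&amp;#34; .

d:g1 {
    d:x dm:tag &amp;#34;three&amp;#34; .
    d:x dm:tag &amp;#34;four&amp;#34; .
}

d:g2 {
    d:x dm:tag &amp;#34;five&amp;#34; .
    d:x dm:tag &amp;#34;six&amp;#34; .
}
&lt;/code&gt;&lt;/pre&gt;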
&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/TriG_(syntax)&#34;&gt;Wikipedia page about TriG&lt;/a&gt; has another nice example.&lt;/p&gt;
&lt;p&gt;As I mentioned above, the &lt;a href=&#34;https://www.w3.org/TR/trig/&#34;&gt;current TriG Recommendation&lt;/a&gt; is release 1.1. Release 1.2 is &lt;a href=&#34;https://www.w3.org/TR/rdf12-trig/&#34;&gt;underway&lt;/a&gt;, and it looks like its main goal is to incorporate reification, or the ability to make RDF statements about RDF statements, as with the ongoing work on Turtle. As part of the work on the next version of SPARQL, it would be nice to see &lt;code&gt;CONSTRUCT&lt;/code&gt; queries use this kind of syntax to support the creation of quads in queries. An &lt;a href=&#34;https://jena.apache.org/documentation/query/construct-quad.html&#34;&gt;extension in Apache Jena&lt;/a&gt; has supported this &lt;a href=&#34;https://mvnrepository.com/artifact/org.apache.jena/jena-arq&#34;&gt;since 2015&lt;/a&gt;.&lt;/p&gt;
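&lt;p&gt;To sketch the idea in the style that the Jena extension documents (treat this as a hedged illustration rather than standard SPARQL), the &lt;code&gt;CONSTRUCT&lt;/code&gt; template itself contains a &lt;code&gt;GRAPH&lt;/code&gt; block, so the query produces quads:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Copy every named graph triple into the result as a quad.
CONSTRUCT { GRAPH ?g { ?s ?p ?o } }
WHERE     { GRAPH ?g { ?s ?p ?o } }
&lt;/code&gt;&lt;/pre&gt;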
&lt;p&gt;Either way, I&amp;rsquo;m sure I&amp;rsquo;ll be using TriG a lot more in the future. I just used it at work on Friday!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://x.com/bobdc/status/1850551990431928811&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/113379966709454414&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://creativecommons.org/licenses/by-sa/2.0/&#34;&gt;CC BY-SA 2.0&lt;/a&gt; &lt;a href=&#34;https://www.flickr.com/photos/biker_jun/14137025813/&#34;&gt;biplane photo&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/photos/biker_jun/&#34;&gt;Jun&lt;/a&gt; (and cropped)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Selecting all the triples from all the graphs</title>
      <link>https://www.bobdc.com/blog/selectingall/</link>
      <pubDate>Sun, 29 Sep 2024 11:30:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/selectingall/</guid>
      
      
      <description><div>But the default graph?</div><div>&lt;p&gt;In my book &lt;a href=&#34;https://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt; I often use a query for all the triples in a dataset (that is, all the triples in the default graph and all the triples in any named graphs) that I now realize needs some revision to be more accurate.&lt;/p&gt;
&lt;p&gt;To see the issue that I ran into, first imagine running the update request in &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/ex338.ru&#34;&gt;example 338&lt;/a&gt; from the book on an empty dataset. It inserts two triples into the default graph and two each into two named graphs:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# filename: ex338.ru

PREFIX d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt;

INSERT DATA
{
  d:x dm:tag &amp;#34;one&amp;#34; . 
  d:x dm:tag &amp;#34;two&amp;#34; . 

  GRAPH d:g1
  { 
    d:x dm:tag &amp;#34;three&amp;#34; . 
    d:x dm:tag &amp;#34;four&amp;#34; . 
  }

  GRAPH d:g2
  { 
    d:x dm:tag &amp;#34;five&amp;#34; . 
    d:x dm:tag &amp;#34;six&amp;#34; . 
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Next we run &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/ex332.rq&#34;&gt;example 332&lt;/a&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# filename: ex332.rq

SELECT ?g ?s ?p ?o
WHERE
{
  { ?s ?p ?o }
  UNION
  { GRAPH ?g { ?s ?p ?o } }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Contrasting this query with a &lt;code&gt;SELECT * WHERE {?s ?p ?o}&lt;/code&gt; query earlier in the book, I wrote &amp;ldquo;This really is the List All Triples query, because it lists a union of all triples in the default graph and all the triples in any named graph along with the associated graph names&amp;rdquo;. When run with the Jena &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Fuseki&lt;/a&gt; triplestore, it lists the six triples shown in example 338 above with the associated graph names next to the last four.&lt;/p&gt;
&lt;p&gt;I had assumed that a triple is either in a named graph or in a default graph, but I have recently learned that it&amp;rsquo;s not always that simple. For example, according to the SPARQL query specification&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#exampleDatasets&#34;&gt;Examples of RDF Datasets&lt;/a&gt; section, &amp;ldquo;One possible arrangement of graphs in an RDF Dataset is to have the default graph be the RDF merge of some or all of the information in the named graphs&amp;rdquo;. According to my experiments, the &lt;a href=&#34;https://graphdb.ontotext.com/&#34;&gt;GraphDB&lt;/a&gt;, &lt;a href=&#34;https://blazegraph.com/&#34;&gt;Blazegraph&lt;/a&gt;, and &lt;a href=&#34;https://github.com/RDFLib/rdflib&#34;&gt;RDFLib&lt;/a&gt; query engines each assume that named graph triples are also in the default graph. With these query engines, running the query above with the data above gets me a list of ten query results because the &amp;ldquo;three&amp;rdquo;, &amp;ldquo;four&amp;rdquo;, &amp;ldquo;five&amp;rdquo;, and &amp;ldquo;six&amp;rdquo; triples appear with their graph names, and because of the &lt;code&gt;{?s ?p ?o}&lt;/code&gt; before the &lt;code&gt;UNION&lt;/code&gt; keyword, they also show up as part of the default graph.&lt;/p&gt;
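&lt;p&gt;A quick diagnostic for checking which behavior a given store uses is to count the triples visible without any &lt;code&gt;GRAPH&lt;/code&gt; wrapper. With the six triples above, a store whose default graph holds only its own triples should report 2, while one that treats the default graph as the union of all graphs should report 6:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT (COUNT(*) AS ?defaultGraphTripleCount)
WHERE { ?s ?p ?o }
&lt;/code&gt;&lt;/pre&gt;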
&lt;p&gt;As I learned in a &lt;a href=&#34;https://lists.apache.org/thread/1foxsmqbn5575yms9vxh8xj7cwv136cg&#34;&gt;conversation with Andy Seaborne&lt;/a&gt; on the Jena mailing list, you can configure Fuseki to do this. From now on, though, when I want to list all of a dataset&amp;rsquo;s triples by first listing those that aren&amp;rsquo;t in a named graph and then listing the ones that are with their graph names, I&amp;rsquo;ll use this new query below. It uses the &lt;code&gt;MINUS&lt;/code&gt; keyword to explicitly exclude named graph triples from the set of default graph triples being retrieved by the clause before the &lt;code&gt;UNION&lt;/code&gt; keyword:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?g ?s ?p ?o
WHERE
{
    { ?s ?p ?o
      MINUS { GRAPH ?g {?s ?p ?o} }
    }
  UNION
  { GRAPH ?g { ?s ?p ?o } }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Using the data above, this query returns the same six rows with Fuseki, GraphDB, and RDFLib. (Blazegraph returns six rows, but with &lt;code&gt;bd:nullGraph&lt;/code&gt; as the &lt;code&gt;?g&lt;/code&gt; value for the &amp;ldquo;one&amp;rdquo; and &amp;ldquo;two&amp;rdquo; triples.)&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not a super efficient query, but asking for absolutely all the triples in a dataset rarely is. With small datasets it&amp;rsquo;s a quick way to answer the question &amp;ldquo;what do we have here&amp;rdquo;, so I use it often when showing the effect of various keywords and syntax in a SPARQL query. I&amp;rsquo;ll be using this new query often enough that I already have it as one of the &lt;a href=&#34;https://graphdb.ontotext.com/documentation/10.7/sparql-queries.html#save-and-share-queries&#34;&gt;Saved queries&lt;/a&gt; that GraphDB lets you keep handy regardless of what dataset or project you&amp;rsquo;re working on.&lt;/p&gt;
&lt;!-- image is from https://x.com/bobdc/status/1399068522760843270/photo/1 --&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/tripleDumpsters.png&#34; class=&#34;centered&#34; alt=&#34;Triples dumpsters&#34;/&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://x.com/bobdc/status/1840417941306663408&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/113221623437527774&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Editing schemas, ontologies, and SKOS taxonomies with VocBench</title>
      <link>https://www.bobdc.com/blog/vocbench/</link>
      <pubDate>Sun, 25 Aug 2024 11:06:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/vocbench/</guid>
      
      
      <description><div>A free GUI tool.</div><div>&lt;p&gt;&lt;a href=&#39;https://vocbench.uniroma2.it/&#39;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/VocBenchLogo.png&#34; alt=&#34;[VocBench logo]&#34; border=&#34;0&#34; align=&#34;right&#34; style=&#34;margin-left: 30px; margin-bottom: 30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;According to the &lt;a href=&#34;https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System&#34;&gt;Wikipedia page for SKOS&lt;/a&gt;, &lt;a href=&#34;https://vocbench.uniroma2.it/&#34;&gt;VocBench&lt;/a&gt; &amp;ldquo;is an open-source, web-based RDF/OWL/SKOS/SKOS-XL editor developed by a collaboration between the Food and Agriculture Organization (FAO) of the United Nations, the University of Rome Tor Vergata and the Malaysian research centre MIMOS&amp;rdquo;. I&amp;rsquo;m usually happy to create schemas, ontologies, and SKOS taxonomies by hand with Turtle files in a &lt;a href=&#34;https://www.gnu.org/software/emacs/&#34;&gt;text editor&lt;/a&gt;, but I thought that a free, cross-platform graphical tool for creating and editing these was worth investigating—especially if it lets you create and edit instance data to go with your schemas and ontologies. VocBench did a fine job with all of these.&lt;/p&gt;
&lt;p&gt;Like my former employer TopQuadrant&amp;rsquo;s former product TopBraid Composer, VocBench&amp;rsquo;s overall layout appears to be descended from the open source ontology editor &lt;a href=&#34;https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_(software)&#34;&gt;Protégé&lt;/a&gt;. (The last time I tried Protégé and its more modern successor &lt;a href=&#34;https://protegewiki.stanford.edu/wiki/WebProtege&#34;&gt;WebProtégé&lt;/a&gt;, I still found them confusing, with more documentation about advanced features than about the simple basics. I gave up.)&lt;/p&gt;
&lt;p&gt;My test goals with VocBench were simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Load a sample existing RDF schema and  use VocBench&amp;rsquo;s web-based interface to:
&lt;ul&gt;
&lt;li&gt;Create a new class based on an existing one&lt;/li&gt;
&lt;li&gt;Create a property that has the new class as a domain&lt;/li&gt;
&lt;li&gt;Edit some instances of the new class and check whether the new properties are part of the edit form&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Query some data with SPARQL&lt;/li&gt;
&lt;li&gt;Load some existing SKOS and see how that looks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It was important for me to start off with existing files that were already on my hard disk. This way, I could verify that the tool&amp;rsquo;s standards support was really there and that using the tool didn&amp;rsquo;t depend on special things that it added to the model files.&lt;/p&gt;
&lt;h2 id=&#34;installing-and-running&#34;&gt;Installing and running&lt;/h2&gt;
&lt;p&gt;First I downloaded the most recent zip file from &lt;a href=&#34;https://bitbucket.org/art-uniroma2/vocbench3/downloads/&#34;&gt;https://bitbucket.org/art-uniroma2/vocbench3/downloads/&lt;/a&gt;. Unzipping this created a directory named &lt;code&gt;semanticturkey-12.2/&lt;/code&gt;. (VocBench&amp;rsquo;s home page tells us that its &amp;ldquo;business and data access layers are realized by Semantic Turkey, an open-source platform for Knowledge Acquisition and Management realized by the ART Research Group at the University of Rome Tor Vergata&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;Next, I ran the &lt;code&gt;semanticturkey.sh&lt;/code&gt; script in the &lt;code&gt;semanticturkey-12.2/bin&lt;/code&gt; directory, which also has a &lt;code&gt;semanticturkey.bat&lt;/code&gt; file for Windows users. As the &lt;a href=&#34;https://vocbench.uniroma2.it/doc/&#34;&gt;VocBench documentation&lt;/a&gt; describes, the next step is to send your browser to  &lt;a href=&#34;http://localhost:1979/vocbench3&#34;&gt;http://localhost:1979/vocbench3&lt;/a&gt;. You have to register and make up a password,  and then at &lt;a href=&#34;http://localhost:1979/vocbench3/#/Sysconfig&#34;&gt;http://localhost:1979/vocbench3/#/Sysconfig&lt;/a&gt; you set up the server. I just accepted all the default settings on the setup screen.&lt;/p&gt;
&lt;p&gt;Their documentation suggests that the next step is a &amp;ldquo;quick &lt;a href=&#34;https://vocbench.uniroma2.it/doc/user/test_drive.jsf&#34;&gt;Test Drive&lt;/a&gt;&amp;rdquo;. The Test Drive page is too long and detailed to do quickly, so I just figured out the following myself.&lt;/p&gt;
&lt;p&gt;I started by selecting &lt;strong&gt;Projects&lt;/strong&gt; on the main menu and then &lt;strong&gt;Create&lt;/strong&gt; on the Projects screen. The next screen, which is shown as the second screen shot on the Test Drive page, has many properties that you can fill out, but I went with the bare minimum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Project name&lt;/strong&gt;: emptest&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Base URI&lt;/strong&gt;: &lt;code&gt;http://example.org/emptest/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: RDFS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lexicalization&lt;/strong&gt;: RDFS (the default value; you&amp;rsquo;ll see that I picked a different value for this for my SKOS test below)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I clicked &lt;strong&gt;Create&lt;/strong&gt; in the lower right, and then a popup told me &amp;ldquo;Project created successfully&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Back on the &lt;strong&gt;Projects&lt;/strong&gt; screen I clicked the radio button in the &amp;ldquo;Accessed&amp;rdquo; column of the &lt;code&gt;emptest&lt;/code&gt; row. This seemed to be the quickest way to make it the current project.&lt;/p&gt;
&lt;p&gt;In the upper left VocBench then showed &amp;ldquo;Current project: emptest&amp;rdquo; and a &amp;ldquo;Global Data Management&amp;rdquo; dropdown menu to the right of that. The dropdown has the handy choices to &lt;strong&gt;Load data&lt;/strong&gt;, &lt;strong&gt;Export data&lt;/strong&gt;, &lt;strong&gt;Clear data&lt;/strong&gt;, and a few others. I picked &lt;strong&gt;Load data&lt;/strong&gt; and imported the following &lt;code&gt;person.ttl&lt;/code&gt; file:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix ex:     &amp;lt;http://example.com/&amp;gt; .
@prefix schema: &amp;lt;http://schema.org/&amp;gt; .
@prefix rdfs:   &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf:    &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; . 

ex:Person a rdfs:Class .

schema:familyName a rdf:Property ;
                  rdfs:domain ex:Person .

schema:givenName a rdf:Property ;
                 rdfs:domain ex:Person .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After doing this, selecting &lt;strong&gt;Data&lt;/strong&gt; from the main menu showed this in a widget for editing the class hierarchy:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/vocbench1.png&#34; class=&#34;centered&#34;  alt=&#34;VocBench class hierarchy&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;I selected &lt;code&gt;ex:Person&lt;/code&gt; there and then clicked &amp;ldquo;Create subClass&amp;rdquo; (the second of the four buttons above it—throughout these instructions, whenever I refer to an icon by name and you don&amp;rsquo;t see the name, mouse over the buttons until you find the tooltip with that name) to create an Employee subclass, and that was added to the visual hierarchy.&lt;/p&gt;
&lt;p&gt;To add a &lt;code&gt;startDate&lt;/code&gt; property to go with that, I selected the &lt;strong&gt;Property&lt;/strong&gt; tab above the class hierarchy and selected its &lt;strong&gt;Create property&lt;/strong&gt; button. I picked datatypeProperty from this button&amp;rsquo;s dropdown menu and entered &amp;ldquo;startDate&amp;rdquo; as the property name, and then  I saw this new property added to the list on the &lt;strong&gt;Property&lt;/strong&gt; tab on the left.&lt;/p&gt;
&lt;p&gt;With the new property selected there, on the right side of the screen I clicked the &lt;strong&gt;Add domain&lt;/strong&gt; button to the right of the Domains header. I then selected Employee from the class hierarchy widget that it displayed, and clicked &lt;strong&gt;OK&lt;/strong&gt; on the &lt;strong&gt;Add domain&lt;/strong&gt; dialog box. Below the Domain header was a Range header with a similar button that let me add &lt;code&gt;xsd:date&lt;/code&gt; as a range for &lt;code&gt;startDate&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To really see what this was creating, I went back to the &amp;ldquo;Global Data Management&amp;rdquo; dropdown and picked &lt;strong&gt;Export Data&lt;/strong&gt;. On the lower-right of the next screen I changed the Export Format from RDF/XML to Turtle (it is 2024, after all) and clicked &lt;strong&gt;Submit&lt;/strong&gt;. Here is the &lt;code&gt;export.ttl&lt;/code&gt; file that this created, without its prefix declarations:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://example.org/emptest/&amp;gt; a owl:Ontology .

ex:Person a rdfs:Class .

schema:familyName a rdf:Property;
  rdfs:domain ex:Person .

schema:givenName a rdf:Property;
  rdfs:domain ex:Person .

:Employee a rdfs:Class;
  rdfs:subClassOf ex:Person .

:startDate a owl:DatatypeProperty;
  rdfs:range xsd:date;
  rdfs:domain :Employee .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;That was just what I was hoping for.&lt;/p&gt;
&lt;p&gt;Going back to the  &lt;strong&gt;Class&lt;/strong&gt; tab we see an instances panel underneath it, where the first button lets you add an instance for the selected class:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/vocbench2.png&#34; class=&#34;centered&#34;  alt=&#34;VocBench class hierarchy&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;Clicking it displays a popup where you create a URI for the new instance; clicking &lt;strong&gt;Ok&lt;/strong&gt; on that message box adds it to the instance list. Once it&amp;rsquo;s there, you can select it, and then the form on the right lets you edit it. Selecting the blue &lt;strong&gt;Add property&lt;/strong&gt; button to the right of &lt;strong&gt;Other properties&lt;/strong&gt; on that form displayed a popup that gave me the opportunity to add or edit the &lt;code&gt;startDate&lt;/code&gt;, &lt;code&gt;familyName&lt;/code&gt;, or &lt;code&gt;givenName&lt;/code&gt; properties for the new instance. I was happy to see that the editing of an instance&amp;rsquo;s data automates the use of both assigned and inherited properties on the editing widget for that data.&lt;/p&gt;
&lt;p&gt;Selecting SPARQL from the main menu gave me a SPARQL screen where it was easy enough to write and execute SPARQL queries, so all that was left was to load some existing SKOS and see what VocBench did with it. I took example &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/ex327.ttl&#34;&gt;ex327.ttl&lt;/a&gt; from my book &amp;ldquo;Learning SPARQL&amp;rdquo; and created a new VocBench project for it. For this project, I set both Model and Lexicalization to SKOS. (Don&amp;rsquo;t forget to set a Base URI, as I did several times when trying to create a new project.) After I created this new project and made it the current one I imported the &lt;code&gt;ex327.ttl&lt;/code&gt; file. I then picked &lt;strong&gt;Data&lt;/strong&gt; from the main menu and after expanding the widget in the &lt;strong&gt;Concept&lt;/strong&gt; tab a little I saw this:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/vocbench3.png&#34; class=&#34;centered&#34;  alt=&#34;VocBench SKOS  hierarchy&#34; width=&#34;400&#34;/&gt;
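&lt;p&gt;A hierarchy like that is driven by ordinary SKOS triples. The following fragment is just an illustrative sketch with made-up example.com concepts, not the book&amp;rsquo;s &lt;code&gt;ex327.ttl&lt;/code&gt;, but loading something similar should produce the same kind of nesting in the &lt;strong&gt;Concept&lt;/strong&gt; tab:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; .
@prefix ex:   &amp;lt;http://example.com/&amp;gt; .

ex:food a skos:Concept ;
    skos:prefLabel &amp;#34;food&amp;#34;@en .

ex:fruit a skos:Concept ;
    skos:prefLabel &amp;#34;fruit&amp;#34;@en ;
    skos:broader ex:food .
&lt;/code&gt;&lt;/pre&gt;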
&lt;p&gt;Of course VocBench can do much more than what I&amp;rsquo;ve shown here, but my first priority was to see how it covered the basics, and it did just great with these. The rest of its Test Drive page shows plenty of potential follow-up exercises.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://x.com/bobdc/status/1827727700238172614&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/113023336395446176&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
    </item>
    
    <item>
      <title>SPARQLing anything</title>
      <link>https://www.bobdc.com/blog/sparqlanything/</link>
      <pubDate>Sun, 21 Jul 2024 11:45:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparqlanything/</guid>
      
      
      <description><div>MS Office files, XML, markdown, plain text, and more.</div><div>&lt;p&gt;&lt;a href=&#34;https://sparql-anything.cc/&#34;&gt;SPARQL Anything&lt;/a&gt;  is an open source tool that lets you use SPARQL to query data in a long list of popular formats: XML, JSON, CSV, HTML, Excel, Text, Binary, EXIF, File System, Zip/Tar, Markdown, YAML, Bibtex, DOCx, and PPTx. It has a lot of great documentation and features, but I&amp;rsquo;ll start here with an example of it in action.&lt;/p&gt;
&lt;p&gt;As you&amp;rsquo;ll see on &lt;a href=&#34;https://github.com/SPARQL-Anything/sparql.anything&#34;&gt;its GitHub page&lt;/a&gt;, there is a command line interface and a server version. I downloaded the jar file from its &lt;a href=&#34;https://github.com/SPARQL-Anything/sparql.anything/releases&#34;&gt;releases page&lt;/a&gt; with the goal of sending a SPARQL query to this spreadsheet, which I called &lt;code&gt;xlsxtest.xlsx&lt;/code&gt;:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/xlstest.png&#34; class=&#34;centered&#34; width=&#34;400&#34; alt=&#34;sample spreadsheet&#34;/&gt;
&lt;p&gt;(I created this spreadsheet with OpenOffice, but saved it as an MS Office Excel file and it worked just fine.)&lt;/p&gt;
&lt;p&gt;I then put the following SPARQL query in the file &lt;code&gt;sa1.rq&lt;/code&gt;. Note how its &lt;code&gt;SERVICE&lt;/code&gt; parameter includes the name of the spreadsheet file above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;CONSTRUCT { ?s ?p ?o }
WHERE {
    SERVICE &amp;lt;x-sparql-anything:xlsxtest.xlsx&amp;gt; {
        ?s ?p ?o
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I called the jar file with &lt;code&gt;sa1.rq&lt;/code&gt; as the query file (run it with no parameters to see a wide choice of other parameters) and redirected the output to a Turtle file:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;java -jar ~/temp/sparql-anything-0.9.0.jar -q sa1.rq &amp;gt; xlsxtest.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is the Turtle file:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[ a       &amp;lt;http://sparql.xyz/facade-x/ns/root&amp;gt; ;
  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_1&amp;gt;
          [ &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_1&amp;gt;
                    &amp;#34;Given-name&amp;#34; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_2&amp;gt;
                    &amp;#34;Family-name&amp;#34; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_3&amp;gt;
                    &amp;#34;Hire-date&amp;#34; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_4&amp;gt;
                    &amp;#34;random int&amp;#34;
          ] ;
  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_2&amp;gt;
          [ &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_1&amp;gt;
                    &amp;#34;Grace&amp;#34; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_2&amp;gt;
                    &amp;#34;Lee&amp;#34; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_3&amp;gt;
                    &amp;#34;45150.0&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#double&amp;gt; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_4&amp;gt;
                    &amp;#34;3.0&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#double&amp;gt;
          ] ;
  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_3&amp;gt;
          [ &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_1&amp;gt;
                    &amp;#34;Johnson&amp;#34; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_2&amp;gt;
                    &amp;#34;Frank&amp;#34; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_3&amp;gt;
                    &amp;#34;44887.0&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#double&amp;gt; ;
            &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#_4&amp;gt;
                    &amp;#34;54.0&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#double&amp;gt;
          ]
] .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It&amp;rsquo;s nice to see that it didn&amp;rsquo;t turn &lt;em&gt;all&lt;/em&gt; the values into strings: it recognized the numeric values and typed them accordingly. The structure of this output, which uses a lot of blank nodes, conforms to a model developed by the SPARQL Anything developers called &lt;a href=&#34;https://github.com/SPARQL-Anything/sparql.anything/blob/v1.0-DEV/Facade-X.md&#34;&gt;Facade-X&lt;/a&gt;. Before digging into that, I just played around with SPARQL queries of the above triples and came up with this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;

SELECT ?rowID ?cellID ?value WHERE {
   ?root a &amp;lt;http://sparql.xyz/facade-x/ns/root&amp;gt; ;
   ?rowID ?rowContents.
   ?rowContents ?cellID ?value . 
}
ORDER BY ?rowID ?cellID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running that query on the Turtle created by SPARQL Anything (I used Jena&amp;rsquo;s &lt;code&gt;arq --query xlsxtest.rq --data xlsxtest.ttl&lt;/code&gt;) gave me this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-------------------------------------------
| rowID  | cellID | value                 |
===========================================
| rdf:_1 | rdf:_1 | &amp;#34;Given-name&amp;#34;          |
| rdf:_1 | rdf:_2 | &amp;#34;Family-name&amp;#34;         |
| rdf:_1 | rdf:_3 | &amp;#34;Hire-date&amp;#34;           |
| rdf:_1 | rdf:_4 | &amp;#34;random int&amp;#34;          |
| rdf:_2 | rdf:_1 | &amp;#34;Grace&amp;#34;               |
| rdf:_2 | rdf:_2 | &amp;#34;Lee&amp;#34;                 |
| rdf:_2 | rdf:_3 | &amp;#34;45150.0&amp;#34;^^xsd:double |
| rdf:_2 | rdf:_4 | &amp;#34;3.0&amp;#34;^^xsd:double     |
| rdf:_3 | rdf:_1 | &amp;#34;Johnson&amp;#34;             |
| rdf:_3 | rdf:_2 | &amp;#34;Frank&amp;#34;               |
| rdf:_3 | rdf:_3 | &amp;#34;44887.0&amp;#34;^^xsd:double |
| rdf:_3 | rdf:_4 | &amp;#34;54.0&amp;#34;^^xsd:double    |
-------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It&amp;rsquo;s a pretty nice representation of the original spreadsheet.&lt;/p&gt;
&lt;p&gt;The SPARQL Anything server looks cool, but I liked that I could do all of the above with just the downloaded jar file and no configuration or setup.&lt;/p&gt;
&lt;p&gt;At 5:06 of the 15:34 video &lt;a href=&#34;https://www.youtube.com/watch?v=Ak3bykN2dgI&#34;&gt;Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything project - Enrico Daga&lt;/a&gt; one of the key SPARQL Anything developers gives some good background on the philosophy of Facade-X. To summarize, it models things as lists of lists. Blank nodes, as we saw above, play a large role. A paper published by Enrico and his colleagues for the &lt;a href=&#34;https://dl.acm.org/doi/10.1145/3555312&#34;&gt;ACM Transactions on Internet Technology&lt;/a&gt; (also available on one of the &lt;a href=&#34;https://sparql.xyz/FacadeX_TOIT.pdf&#34;&gt;SPARQL Anything&lt;/a&gt; websites) describes Facade-X in more detail.&lt;/p&gt;
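&lt;p&gt;To make that &amp;ldquo;lists of lists&amp;rdquo; idea a bit more concrete: Facade-X attaches rows to the root resource, and cells to rows, with the RDF container membership properties &lt;code&gt;rdf:_1&lt;/code&gt;, &lt;code&gt;rdf:_2&lt;/code&gt;, and so on, so a row or cell&amp;rsquo;s position can be recovered by parsing the predicate URI. Here is a little Python sketch of my own (not part of SPARQL Anything) showing the idea:&lt;/p&gt;

```python
# Sketch: recover the 1-based position encoded in an rdf:_n container
# membership property, as Facade-X uses them for rows and cells.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def membership_index(predicate_uri):
    """Return n for an rdf:_n predicate URI, or None if it isn't one."""
    if predicate_uri.startswith(RDF_NS + "_"):
        suffix = predicate_uri[len(RDF_NS) + 1:]
        if suffix.isdigit():
            return int(suffix)
    return None

print(membership_index(RDF_NS + "_1"))    # 1 (the header row, or a row's first cell)
print(membership_index(RDF_NS + "_4"))    # 4
print(membership_index(RDF_NS + "type"))  # None: rdf:type is not a membership property
```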
&lt;p&gt;For fun, I used SPARQL Anything to send SPARQL queries to some other formats as well, like PPTx and markdown. My query for most of these was simply &lt;code&gt;CONSTRUCT {?s ?p ?o} WHERE {?s ?p ?o}&lt;/code&gt; because I just wanted to convert the various formats to RDF and see what that looked like. More sophisticated &lt;code&gt;CONSTRUCT&lt;/code&gt; or &lt;code&gt;SELECT&lt;/code&gt; queries could pull out information modeled for specific applications.&lt;/p&gt;
&lt;p&gt;I can picture SPARQL Anything being useful in many, many projects. For example: about a half dozen of its possible input formats are typically used for unstructured natural language text. Turning that into triples, where the object of each triple stores a document or paragraph of natural language text, is a great way to hand these documents off to RDF-based text analysis tools such as the &lt;a href=&#34;https://spacy.io/&#34;&gt;spaCy&lt;/a&gt; entity recognition library that I wrote about in my post &lt;a href=&#34;../spacy&#34;&gt;Entity recognition from within a SPARQL query&lt;/a&gt;. (Ontotext&amp;rsquo;s GraphDB Free triplestore supports other entity recognition libraries that would benefit from these conversions as well.)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m sure there are many other potential applications as an increasing number of projects seek to pull information from commonly used file formats to add to Knowledge Graphs. SPARQL Anything  is an excellent contribution to anyone&amp;rsquo;s toolbox of potential RDF workflow pipeline steps.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://x.com/bobdc/status/1815055289982349751&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/112825332181616751&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Querying for audio on Wikidata</title>
      <link>https://www.bobdc.com/blog/wikidataaudio/</link>
      <pubDate>Sat, 22 Jun 2024 11:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/wikidataaudio/</guid>
      
      
      <description><div>Music and more.</div><div>&lt;!-- picture source: https://flickr.com/photos/hwmobs/51389438903/ --&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/AustralianBrassBand.jpg&#34; alt=&#34;[Brass Band in Ballarat, Victoria - early 1900s]&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;440&#34; style=&#34;margin-left: 30px; margin-bottom: 30px&#34;/&gt;
&lt;p&gt;For a long time I&amp;rsquo;ve thought that it would be fun to use SPARQL queries of Wikidata to create music playlists that can be played back.  While researching last month&amp;rsquo;s blog entry &lt;a href=&#34;../querywatchmovies&#34;&gt;Use SPARQL to query for movies, then watch them&lt;/a&gt; I learned about the &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P724&#34;&gt;P724&lt;/a&gt; Internet Archive ID property, and that turned out to be an excellent hook for finding Wikidata audio recordings that we can listen to.&lt;/p&gt;
&lt;p&gt;In that entry, my query for the films of Frank Capra searched for resources that have a P31 value of Q11424 (that is, they are instances of film) and a P724 value pointing to some movie that we can watch. This brought up the question: what other types besides film have P724 values? We can answer this with a simple query:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT DISTINCT ?name WHERE {
  ?s wdt:P724 ?internetArchiveID ;
     wdt:P31 ?type . 

  ?type rdfs:label ?name . 
  FILTER ( lang(?name) = &amp;#34;en&amp;#34; )
}
LIMIT 100
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(Instead of showing query results here, I will provide links so you can run each yourself.) The &lt;a href=&#34;https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fname%20WHERE%20%7B%0A%20%20%3Fs%20wdt%3AP724%20%3FinternetArchiveID%20%3B%0A%20%20%20%20%20wdt%3AP31%20%3Ftype%20.%20%0A%0A%20%20%3Ftype%20rdfs%3Alabel%20%3Fname%20.%20%0A%20%20FILTER%20%28%20lang%28%3Fname%29%20%3D%20%22en%22%20%29%0A%7D%0ALIMIT%20100%0A&#34;&gt;list of types that have Internet Archive IDs&lt;/a&gt; includes many interesting possibilities. (I limited it to 100 results because it was threatening to time out.)&lt;/p&gt;
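&lt;p&gt;You can also run these queries from a script rather than from the browser form: the Wikidata Query Service accepts the query text as a &lt;code&gt;query&lt;/code&gt; parameter in a GET request to &lt;code&gt;https://query.wikidata.org/sparql&lt;/code&gt;. This Python sketch of my own just builds such a URL; fetch it with the HTTP library of your choice (an &lt;code&gt;Accept&lt;/code&gt; header of &lt;code&gt;text/csv&lt;/code&gt; gets you CSV results):&lt;/p&gt;

```python
# Sketch: build a GET URL that runs a SPARQL query on the
# Wikidata Query Service endpoint.
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_query_url(sparql):
    """Percent-encode the query and attach it to the endpoint URL."""
    return ENDPOINT + "?" + urlencode({"query": sparql})

query = """SELECT DISTINCT ?name WHERE {
  ?s wdt:P724 ?internetArchiveID ;
     wdt:P31 ?type .
  ?type rdfs:label ?name .
  FILTER ( lang(?name) = "en" )
}
LIMIT 100"""

print(wikidata_query_url(query))
```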
&lt;p&gt;While I encourage you to explore the various values that this retrieves, I decided to focus on Q105543609, &lt;a href=&#34;https://www.wikidata.org/wiki/Q105543609&#34;&gt;musical work/composition&lt;/a&gt;. I found that many had a &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P136&#34;&gt;genre&lt;/a&gt; property, so I listed all the possible values that came up for that:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT DISTINCT ?genre ?genreName WHERE {
  ?s wdt:P31  wd:Q105543609; # a musical composition
     wdt:P51 ?recording ;    # where a recording exists
     wdt:P136 ?genre .       # that is tagged with a genre
  ?genre rdfs:label ?genreName . 
   FILTER( lang(?genreName) = &amp;#34;en&amp;#34; )
}
ORDER BY ?genreName
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When I &lt;a href=&#34;https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fgenre%20%3FgenreName%20WHERE%20%7B%0A%20%20%3Fs%20wdt%3AP31%20%20wd%3AQ105543609%3B%20%23%20a%20musical%20composition%0A%20%20%20%20%20wdt%3AP51%20%3Frecording%20%3B%20%20%20%20%23%20where%20a%20recording%20exists%0A%20%20%20%20%20wdt%3AP136%20%3Fgenre%20.%20%20%20%20%20%20%20%23%20that%20is%20tagged%20with%20a%20genre%0A%20%20%3Fgenre%20rdfs%3Alabel%20%3FgenreName%20.%20%0A%20%20%20FILTER%28%20lang%28%3FgenreName%29%20%3D%20%22en%22%20%29%0A%7D%0AORDER%20BY%20%3FgenreName%0A&#34;&gt;ran this&lt;/a&gt;  I found 150 genres.&lt;/p&gt;
&lt;p&gt;I got a little too excited until I rediscovered one of the common issues with metadata: just because people can tag something with structured metadata doesn&amp;rsquo;t mean that they do, so there are very few recordings associated with many of these tags. For example, the following asks for spirituals,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
  ?s wdt:P31  wd:Q105543609; # a musical composition
     wdt:P51 ?recording ;    # with a recording
     wdt:P136 wd:Q212024 ;   # that has a genre of spiritual
     rdfs:label ?name . 
  ?wppage schema:about ?s .
   FILTER(contains(str(?wppage),&amp;#34;//en.&amp;#34;)) # Only the English Wikipedia pages  
   FILTER( lang(?name) = &amp;#34;en&amp;#34; )               
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;but if you &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%3Fs%20wdt%3AP31%20%20wd%3AQ105543609%3B%20%23%20a%20musical%20composition%0A%20%20%20%20%20wdt%3AP51%20%3Frecording%20%3B%20%20%20%20%23%20with%20a%20recording%0A%20%20%20%20%20wdt%3AP136%20wd%3AQ212024%20%3B%20%20%20%23%20that%20has%20a%20genre%20of%20spiritual%0A%20%20%20%20%20rdfs%3Alabel%20%3Fname%20.%20%0A%20%20%3Fwppage%20schema%3Aabout%20%3Fs%20.%0A%20%20%20FILTER%28contains%28str%28%3Fwppage%29%2C%22%2F%2Fen.%22%29%29%20%23%20Only%20the%20English%20Wikipedia%20pages%20%20%0A%20%20%20FILTER%28%20lang%28%3Fname%29%20%3D%20%22en%22%20%29%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%7D%0A&#34;&gt;run it&lt;/a&gt; you&amp;rsquo;ll only find two recordings. (Of the two, the &lt;a href=&#34;https://commons.wikimedia.org/wiki/File:The_Old_Time_Religion_-_Tuskegee_Institute_Singers.flac&#34;&gt;1915 Tuskegee Institute Singers 78RPM record of &amp;ldquo;The Old Time Religion&amp;rdquo;&lt;/a&gt; is a wonderful Wikimedia find.)&lt;/p&gt;
&lt;p&gt;That last query and several of the remaining ones also ask for the associated Wikipedia page, which was handy when one of my queries would turn up a recording that made me think &amp;ldquo;wait, what IS this?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Replacing the genre value in that query with others gave me some interesting results. Using wd:Q102932 for &amp;ldquo;avant-garde&amp;rdquo; got me a MIDI file and an Ogg Vorbis &amp;ldquo;recording&amp;rdquo; of John Cage&amp;rsquo;s famous silent piece &lt;a href=&#34;https://en.wikipedia.org/wiki/4%E2%80%B233%E2%80%B3&#34;&gt;4&#39;33&amp;quot;&lt;/a&gt;. A genre of wd:Q9734 for &amp;ldquo;symphony&amp;rdquo; found two movements of Beethoven&amp;rsquo;s 7th and no other recordings. The one recording tagged as wd:Q7749 for &amp;ldquo;rock and roll&amp;rdquo; was the U.S. Air Force band playing &amp;ldquo;When the Saints Go Marching In&amp;rdquo;, which reminds me of the old semantic web saying &amp;ldquo;anyone can say anything about anything&amp;rdquo;. (Considering their arrangement, I was tempted to change it to the wd:Q906647 category for &amp;ldquo;dixieland jazz&amp;rdquo;, but because the &lt;a href=&#34;https://www.wikidata.org/wiki/Q1753926&#34;&gt;Wikidata page&lt;/a&gt; lists the song as a &amp;ldquo;gospel hymn&amp;rdquo;, I changed &amp;ldquo;rock and roll&amp;rdquo; to that.)&lt;/p&gt;
&lt;p&gt;Another genre is &amp;ldquo;national anthem&amp;rdquo;. A &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%3Fs%20wdt%3AP31%20%20wd%3AQ105543609%3B%20%23%20a%20musical%20composition%0A%20%20%20%20%20wdt%3AP51%20%3Frecording%20%3B%20%20%20%20%23%20with%20a%20recording%0A%20%20%20%20%20wdt%3AP136%20wd%3AQ23691%20%3B%20%20%20%23%20that%20has%20a%20genre%20of%20national%20anthem%0A%20%20%20%20%20rdfs%3Alabel%20%3Fname%20.%20%0A%20%20%3Fwppage%20schema%3Aabout%20%3Fs%20.%0A%20%20%20FILTER%28contains%28str%28%3Fwppage%29%2C%22%2F%2Fen.%22%29%29%20%23%20Only%20the%20English%20Wikipedia%20pages%20%20%0A%20%20%20FILTER%28%20lang%28%3Fname%29%20%3D%20%22en%22%20%29%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%7D%0A&#34;&gt;query for that&lt;/a&gt; only gave one result, but genre values aren&amp;rsquo;t the only way to query for specific types of recordings. Instead of looking for recordings that are instances of musical composition, I can just look for those that are instances of national anthem. This turned up 369 recordings:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?anthemName ?wppage ?recording WHERE {
   ?anthem wdt:P31 wd:Q23691 ; # is a national anthem
      wdt:P51 ?recording ;
           rdfs:label ?anthemName . 
   ?wppage schema:about ?anthem .
         
   FILTER( lang(?anthemName) = &amp;#34;en&amp;#34; )    
   FILTER(contains(str(?wppage),&amp;#34;//en.&amp;#34;)) # Only the English Wikipedia pages  
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3FanthemName%20%3Fwppage%20%3Frecording%20WHERE%20%7B%0A%20%20%20%3Fanthem%20wdt%3AP31%20wd%3AQ23691%20%3B%20%23%20is%20a%20national%20anthem%0A%20%20%20%20%20%20wdt%3AP51%20%3Frecording%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FanthemName%20.%20%0A%20%20%20%3Fwppage%20schema%3Aabout%20%3Fanthem%20.%0A%20%20%20%20%20%20%20%20%20%0A%20%20%20FILTER%28%20lang%28%3FanthemName%29%20%3D%20%22en%22%20%29%20%20%20%20%0A%20%20%20FILTER%28contains%28str%28%3Fwppage%29%2C%22%2F%2Fen.%22%29%29%20%23%20Only%20the%20English%20Wikipedia%20pages%20%20%0A%7D%0A&#34;&gt;Running that&lt;/a&gt; can let you create a pretty crazy playlist.&lt;/p&gt;
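&lt;p&gt;To follow through on the playlist idea: once a query like that has given you titles and recording URLs, a few lines of Python (my own sketch; the sample row is just for illustration) can turn them into an M3U playlist file that most audio players will open:&lt;/p&gt;

```python
# Sketch: write (title, recording URL) pairs from a query's results
# as an extended M3U playlist.
def make_m3u(rows):
    """rows: iterable of (title, url) pairs; returns the playlist text."""
    lines = ["#EXTM3U"]
    for title, url in rows:
        lines.append("#EXTINF:-1," + title)  # -1 means unknown duration
        lines.append(url)
    return "\n".join(lines) + "\n"

rows = [  # illustrative row, not actual query output
    ("La Marseillaise", "https://example.org/recordings/marseillaise.ogg"),
]
with open("anthems.m3u", "w", encoding="utf-8") as f:
    f.write(make_m3u(rows))
```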
&lt;p&gt;&lt;a href=&#34;https://www.wikidata.org/wiki/Property:P870&#34;&gt;Instrumentation&lt;/a&gt; was another interesting property to use when searching for music. I started by asking, for all the recordings, which instrumentation values were used:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT DISTINCT ?instrumentation WHERE {
  ?s  wdt:P51 ?recording ;
      wdt:P870 ?instrumentationURI .
  ?instrumentationURI rdfs:label ?instrumentation.
  FILTER ( lang(?instrumentation) = &amp;#34;en&amp;#34; )    
}
ORDER BY ?instrumentation
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20DISTINCT%20%3Finstrumentation%20WHERE%20%7B%0A%20%20%3Fs%20%20wdt%3AP51%20%3Frecording%20%3B%0A%20%20%20%20%20%20wdt%3AP870%20%3FinstrumentationURI%20.%0A%20%20%3FinstrumentationURI%20rdfs%3Alabel%20%3Finstrumentation.%0A%20%20FILTER%20%28%20lang%28%3Finstrumentation%29%20%3D%20%22en%22%20%29%20%20%20%20%0A%7D%0AORDER%20BY%20%3Finstrumentation%0A&#34;&gt;Running it&lt;/a&gt; showed 44 results. One was viola, so I wondered how many have that as their instrumentation value:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?recording WHERE {
  ?s  wdt:P51 ?recording ;
      wdt:P870 ?instrumentationURI .
  ?instrumentationURI rdfs:label &amp;#34;viola&amp;#34;@en.
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3Frecording%20WHERE%20%7B%0A%20%20%3Fs%20%20wdt%3AP51%20%3Frecording%20%3B%0A%20%20%20%20%20%20wdt%3AP870%20%3FinstrumentationURI%20.%0A%20%20%3FinstrumentationURI%20rdfs%3Alabel%20%22viola%22%40en.%0A%7D%0AORDER%20BY%20%3Finstrumentation%0A&#34;&gt;Running this one&lt;/a&gt; showed nine pieces whose instrumentation includes a viola. For example, Mozart&amp;rsquo;s &amp;ldquo;A Little Night Music&amp;rdquo; has four instrumentation values:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT DISTINCT ?instrumentation  WHERE {
  wd:Q12025 wdt:P51 ?recording ;
               wdt:P870 ?instrumentationURI .
    ?instrumentationURI rdfs:label ?instrumentation.
    FILTER ( lang(?instrumentation) = &amp;#34;en&amp;#34; )    
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20DISTINCT%20%3Finstrumentation%20%20WHERE%20%7B%0A%20%20wd%3AQ12025%20wdt%3AP51%20%3Frecording%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP870%20%3FinstrumentationURI%20.%0A%20%20%20%20%3FinstrumentationURI%20rdfs%3Alabel%20%3Finstrumentation.%0A%20%20%20%20FILTER%20%28%20lang%28%3Finstrumentation%29%20%3D%20%22en%22%20%29%20%20%20%20%0A%7D%0A&#34;&gt;You will see&lt;/a&gt; that it was written for a string orchestra.&lt;/p&gt;
&lt;p&gt;As you can see, for many of these I would see which properties were used with resources that had recordings and then do more queries with those properties. There are bird calls, historic speeches from the early days of audio recording, and all kinds of things to explore. I&amp;rsquo;m sure I&amp;rsquo;ll be doing more.&lt;/p&gt;
&lt;p&gt;I will leave you with one of my more successful queries:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?name ?wppage ?recording WHERE {
  
  ?composerURL rdfs:label &amp;#34;Johann Sebastian Bach&amp;#34;@en .   
  ?instrumentationURI rdfs:label &amp;#34;harpsichord&amp;#34;@en . 
  
  ?s  wdt:P51 ?recording ;
      wdt:P870 ?instrumentationURI ; 
      rdfs:label ?name ;
      wdt:P86 ?composerURL .
  ?wppage schema:about ?s . 
  FILTER( lang(?name) = &amp;#34;en&amp;#34; )    
  FILTER(contains(str(?wppage),&amp;#34;//en.&amp;#34;)) # Only the English Wikipedia pages
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3Fname%20%3Fwppage%20%3Frecording%20WHERE%20%7B%0A%20%20%0A%20%20%3FcomposerURL%20rdfs%3Alabel%20%22Johann%20Sebastian%20Bach%22%40en%20.%20%20%20%0A%20%20%3FinstrumentationURI%20rdfs%3Alabel%20%22harpsichord%22%40en%20.%20%0A%20%20%0A%20%20%3Fs%20%20wdt%3AP51%20%3Frecording%20%3B%0A%20%20%20%20%20%20wdt%3AP870%20%3FinstrumentationURI%20%3B%20%0A%20%20%20%20%20%20rdfs%3Alabel%20%3Fname%20%3B%0A%20%20%20%20%20%20wdt%3AP86%20%3FcomposerURL%20.%0A%20%20%3Fwppage%20schema%3Aabout%20%3Fs%20.%20%0A%20%20FILTER%28%20lang%28%3Fname%29%20%3D%20%22en%22%20%29%20%20%20%20%0A%20%20FILTER%28contains%28str%28%3Fwppage%29%2C%22%2F%2Fen.%22%29%29%20%23%20Only%20the%20English%20Wikipedia%20pages%0A%7D%0A&#34;&gt;Running this&lt;/a&gt; will give you recordings of 14 J.S. Bach harpsichord pieces.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://x.com/bobdc/status/1804899348913717428&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/112666611022796557&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">wikidata</category>
      
    </item>
    
    <item>
      <title>Use SPARQL to query for movies, then watch them</title>
      <link>https://www.bobdc.com/blog/querywatchmovies/</link>
      <pubDate>Sun, 26 May 2024 12:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/querywatchmovies/</guid>
      
      
      <description><div>On YouTube and more.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/hisGirlFriday.png&#34; alt=&#34;[still from His Girl Friday]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34; width=&#34;240&#34;/&gt;
&lt;p&gt;I recently learned about &lt;a href=&#34;https://wikiflix.toolforge.org/#/&#34;&gt;WikiFlix&lt;/a&gt;, which lets you search for streamable movies on the Internet. It was assembled by &lt;a href=&#34;https://pro.europeana.eu/person/sandra-fauconnier&#34;&gt;Sandra Fauconnier&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Magnus_Manske&#34;&gt;Magnus Manske&lt;/a&gt;. (Magnus played a major role in developing MediaWiki, which I&amp;rsquo;ve blogged about several times.) Sandra has provided some &lt;a href=&#34;https://commons.wikimedia.org/wiki/User:Spinster/Thoughts_about_WikiFlix_(and_dynamic_multimedia_portals)&#34;&gt;good background&lt;/a&gt; on the history and goals of WikiFlix on Wikimedia.&lt;/p&gt;
&lt;p&gt;When I sent her some geeky questions about the role of Wikidata in that, she told me about a great Wikidata property that I hadn&amp;rsquo;t known about: &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P1651&#34;&gt;P1651&lt;/a&gt;, or &amp;ldquo;YouTube video ID&amp;rdquo;. It usually links to a video of an entire movie. Once I started playing with it, it didn&amp;rsquo;t take me long to come up with this query for the titles and YouTube links of films with Cary Grant:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?filmTitle ?youTubeURL  WHERE {
  ?castMember rdfs:label &amp;#34;Cary Grant&amp;#34;@en . 
  ?film wdt:P161 ?castMember ;
        rdfs:label ?filmTitle ;
        wdt:P1651 ?youtubeID .
  FILTER(lang(?filmTitle) = &amp;#34;en&amp;#34;)
  BIND(URI(CONCAT(&amp;#39;https://www.youtube.com/watch?v=&amp;#39;,?youtubeID)) AS ?youTubeURL)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(I didn&amp;rsquo;t include prefix declarations in the query because I only used prefixes that are predeclared in Wikidata.) The last line just adds the P1651 value to the usual YouTube stub and converts the result to a URI (or, for our purposes, a URL, because it locates something). As you&amp;rsquo;ll see if you &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3FfilmTitle%20%3FyouTubeURL%20%20WHERE%20%7B%0A%20%20%3FcastMember%20rdfs%3Alabel%20%22Cary%20Grant%22%40en%20.%20%0A%20%20%3Ffilm%20wdt%3AP161%20%3FcastMember%20%3B%0A%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FfilmTitle%20%3B%0A%20%20%20%20%20%20%20%20wdt%3AP1651%20%3FyoutubeID%20.%0A%20%20FILTER%28lang%28%3FfilmTitle%29%20%3D%20%22en%22%29%0A%20%20BIND%28URI%28CONCAT%28%27https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D%27%2C%3FyoutubeID%29%29%20AS%20%3FyouTubeURL%29%0A%7D%0A%0A&#34;&gt;run this query on Wikidata&lt;/a&gt;, the &lt;code&gt;?youTubeURL&lt;/code&gt; links will take you to the listed movies on YouTube.&lt;/p&gt;
&lt;p&gt;Some of the links actually lead to a page saying that the video has been taken down because of a copyright claim. The earlier the film was made, the more likely it is to be available on YouTube, so let&amp;rsquo;s list them sorted by date:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?releaseDate ?filmTitle ?youTubeURL  WHERE {
  ?castMember rdfs:label &amp;#34;Cary Grant&amp;#34;@en . 
  ?film wdt:P161 ?castMember ;
        wdt:P577 ?releaseDate ; 
        rdfs:label ?filmTitle ;
        wdt:P1651 ?youtubeID .
  FILTER(lang(?filmTitle) = &amp;#34;en&amp;#34;)
  BIND(URI(CONCAT(&amp;#39;https://www.youtube.com/watch?v=&amp;#39;,?youtubeID)) AS ?youTubeURL)
}
ORDER BY ?releaseDate
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If you &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3FreleaseDate%20%3FfilmTitle%20%3FyouTubeURL%20%20WHERE%20%7B%0A%20%20%3FcastMember%20rdfs%3Alabel%20%22Cary%20Grant%22%40en%20.%20%0A%20%20%3Ffilm%20wdt%3AP161%20%3FcastMember%20%3B%0A%20%20%20%20%20%20%20%20wdt%3AP577%20%3FreleaseDate%20%3B%20%0A%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FfilmTitle%20%3B%0A%20%20%20%20%20%20%20%20wdt%3AP1651%20%3FyoutubeID%20.%0A%20%20FILTER%28lang%28%3FfilmTitle%29%20%3D%20%22en%22%29%0A%20%20BIND%28URI%28CONCAT%28%27https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D%27%2C%3FyoutubeID%29%29%20AS%20%3FyouTubeURL%29%0A%7D%0AORDER%20BY%20%3FreleaseDate%0A&#34;&gt;run that on Wikidata&lt;/a&gt; you&amp;rsquo;ll see them listed by date. The links for the first few movies in the result worked fine for me. Some films link to multiple YouTube URLs; I thought those were worth leaving in case the first one you try doesn&amp;rsquo;t work.&lt;/p&gt;
&lt;p&gt;Of course movies have all kinds of metadata to query for, which adds to the fun. For example, I could query for Cary Grant movies directed by Howard Hawks:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?releaseDate ?filmTitle ?youTubeURL  WHERE {
  ?castMember rdfs:label &amp;#34;Cary Grant&amp;#34;@en . 
  ?director rdfs:label &amp;#34;Howard Hawks&amp;#34;@en . 
  ?film wdt:P161 ?castMember ;
        wdt:P577 ?releaseDate ; 
        rdfs:label ?filmTitle ;
        wdt:P57 ?director ; 
        wdt:P1651 ?youtubeID .
  FILTER(lang(?filmTitle) = &amp;#34;en&amp;#34;)
  BIND(URI(CONCAT(&amp;#39;https://www.youtube.com/watch?v=&amp;#39;,?youtubeID)) AS ?youTubeURL)
}
ORDER BY ?releaseDate
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You&amp;rsquo;ll see &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3FreleaseDate%20%3FfilmTitle%20%3FyouTubeURL%20%20WHERE%20%7B%0A%20%20%3FcastMember%20rdfs%3Alabel%20%22Cary%20Grant%22%40en%20.%20%0A%20%20%3Fdirector%20rdfs%3Alabel%20%22Howard%20Hawks%22%40en%20.%20%0A%20%20%3Ffilm%20wdt%3AP161%20%3FcastMember%20%3B%0A%20%20%20%20%20%20%20%20wdt%3AP577%20%3FreleaseDate%20%3B%20%0A%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FfilmTitle%20%3B%0A%20%20%20%20%20%20%20%20wdt%3AP57%20%3Fdirector%20%3B%20%0A%20%20%20%20%20%20%20%20wdt%3AP1651%20%3FyoutubeID%20.%0A%20%20FILTER%28lang%28%3FfilmTitle%29%20%3D%20%22en%22%29%0A%20%20BIND%28URI%28CONCAT%28%27https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D%27%2C%3FyoutubeID%29%29%20AS%20%3FyouTubeURL%29%0A%7D%0AORDER%20BY%20%3FreleaseDate%0A&#34;&gt;three movies&lt;/a&gt; in the results. (I have certainly seen &amp;ldquo;Bringing Up Baby&amp;rdquo; and &amp;ldquo;His Girl Friday&amp;rdquo; but I have never heard of Hawks and Grant doing a film called &amp;ldquo;Monkey Business&amp;rdquo;, which I will certainly need to check out.) The &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P136&#34;&gt;P136&lt;/a&gt; genre property is another that can make movie query results more interesting.&lt;/p&gt;
&lt;p&gt;Sandra also told me about two more properties that point at other video collections: &lt;code&gt;P10&lt;/code&gt; and &lt;code&gt;P724&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.wikidata.org/wiki/Property:P10&#34;&gt;P10&lt;/a&gt; points to video content stored on Wikimedia. For example, the &lt;a href=&#34;https://www.wikidata.org/wiki/Q829250&#34;&gt;Wikidata page&lt;/a&gt; for Dziga Vertov&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Man_with_a_Movie_Camera&#34;&gt;Man with a Movie Camera&lt;/a&gt;, which I knew was an important early Soviet silent but have never seen, includes this triple; the triple&amp;rsquo;s object &lt;a href=&#34;http://commons.wikimedia.org/wiki/Special:FilePath/Man%20With%20A%20Movie%20Camera%20%28Dziga%20Vertov%2C%201929%29.webm&#34;&gt;links to a version of the film that we can watch&lt;/a&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Q829250 wdt:P10 &amp;lt;http://commons.wikimedia.org/wiki/Special:FilePath/Man%20With%20A%20Movie%20Camera%20%28Dziga%20Vertov%2C%201929%29.webm&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://www.wikidata.org/wiki/Property:P724&#34;&gt;P724&lt;/a&gt; is a resource&amp;rsquo;s Internet Archive ID. This may be a film, or it may be an emulated video game or even software such as VisiCalc. The &lt;a href=&#34;https://www.wikidata.org/wiki/Q59317&#34;&gt;Wikidata page&lt;/a&gt; for the 1944 Frank Capra film &lt;a href=&#34;https://en.wikipedia.org/wiki/Arsenic_and_Old_Lace_(film)&#34;&gt;Arsenic and Old Lace&lt;/a&gt; has two &lt;code&gt;wdt:P724&lt;/code&gt; values: &amp;ldquo;1944-arsenic-and-old-lace-arsenico-por-compasion-frank-capra-vose&amp;rdquo; and &amp;ldquo;1944-arsenic-and-old-lace-este-mundo-e-um-hospicio-frank-capra-legendado&amp;rdquo;. Add either one to the stub &lt;code&gt;https://archive.org/details/&lt;/code&gt; and you&amp;rsquo;ll have a URL that lets you watch the whole movie.&lt;/p&gt;
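&lt;p&gt;To summarize how these three properties lead to something watchable: P10 values are already complete URLs, while P1651 and P724 values are IDs that need the appropriate stub prepended. Here is that logic as a small Python sketch of my own:&lt;/p&gt;

```python
# Sketch: resolve the video-related Wikidata property values
# discussed in this post into URLs you can open.
STUBS = {
    "P1651": "https://www.youtube.com/watch?v=",  # YouTube video ID
    "P724": "https://archive.org/details/",       # Internet Archive ID
}

def watchable_url(prop, value):
    """Turn a (property, value) pair into a watchable URL."""
    if prop == "P10":  # already a full Wikimedia Commons URL
        return value
    return STUBS[prop] + value

print(watchable_url("P724",
    "1944-arsenic-and-old-lace-arsenico-por-compasion-frank-capra-vose"))
# https://archive.org/details/1944-arsenic-and-old-lace-arsenico-por-compasion-frank-capra-vose
```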
&lt;p&gt;Because &lt;code&gt;P724&lt;/code&gt; gets applied to so many different media, when looking for movies it&amp;rsquo;s a good idea to have your query specify that you want an &lt;a href=&#34;http://www.wikidata.org/prop/direct/P31&#34;&gt;instance of&lt;/a&gt; &lt;a href=&#34;http://www.wikidata.org/entity/Q11424&#34;&gt;film&lt;/a&gt;. For example:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?title ?internetArchiveURL WHERE {
  ?film wdt:P31	 wd:Q11424 ;   # it&amp;#39;s a film 
        wdt:P724 ?internetArchiveID; 
        wdt:P57 ?director ;
        rdfs:label ?title . 
  ?director rdfs:label &amp;#34;Frank Capra&amp;#34;@en .
  FILTER(lang(?title) = &amp;#34;en&amp;#34;)
  BIND(URI(CONCAT(&amp;#39;https://archive.org/details/&amp;#39;,?internetArchiveID)) AS ?internetArchiveURL)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(When I &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3Ftitle%20%3FinternetArchiveURL%20WHERE%20%7B%0A%20%20%3Ffilm%20wdt%3AP31%09%20wd%3AQ11424%20%3B%20%20%20%23%20it%27s%20a%20film%20%0A%20%20%20%20%20%20%20%20wdt%3AP724%20%3FinternetArchiveID%3B%20%0A%20%20%20%20%20%20%20%20wdt%3AP57%20%3Fdirector%20%3B%0A%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3Ftitle%20.%20%0A%20%20%3Fdirector%20rdfs%3Alabel%20%22Frank%20Capra%22%40en%20.%0A%20%20FILTER%28lang%28%3Ftitle%29%20%3D%20%22en%22%29%0A%20%20BIND%28URI%28CONCAT%28%27https%3A%2F%2Farchive.org%2Fdetails%2F%27%2C%3FinternetArchiveID%29%29%20AS%20%3FinternetArchiveURL%29%0A%7D%0A&#34;&gt;ran that one&lt;/a&gt; it was interesting to see how many of the World War II propaganda films that Capra directed are available for viewing.)&lt;/p&gt;
&lt;p&gt;So the next time you&amp;rsquo;re looking for a film to watch, instead of Netflix or Apple TV, let some SPARQL queries of Wikidata point you to classic films that you can watch for free!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://x.com/bobdc/status/1794765215508333004&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/112508303003154160&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>SPARQL queries of the Billboard Hot 100</title>
      <link>https://www.bobdc.com/blog/hot100/</link>
      <pubDate>Sun, 21 Apr 2024 11:25:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/hot100/</guid>
      
      
      <description><div>Current and historical data!</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/hot100sparql.png&#34; alt=&#34;[Hot 100 and SPARQL logos]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;Wikipedia describes the &lt;a href=&#34;https://en.wikipedia.org/wiki/Billboard_Hot_100&#34;&gt;Billboard Hot 100&lt;/a&gt; as &amp;ldquo;the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), online streaming, and radio airplay in the U.S.&amp;rdquo; A song that ranks  highly there is a hit song (in the U.S.) by definition. The data goes back to the beginning of the chart&amp;rsquo;s history in 1958, when Rick Nelson&amp;rsquo;s &lt;a href=&#34;https://www.youtube.com/watch?v=R12H8QWnwvE&#34;&gt;Poor Little Fool&lt;/a&gt; was the number one song.&lt;/p&gt;
&lt;p&gt;I recently learned about &lt;a href=&#34;https://github.com/mhollingshead/billboard-hot-100&#34;&gt;billboard-hot-100&lt;/a&gt;, which is &amp;ldquo;a git repository of JSON files for every Billboard Hot 100 chart in history, updated daily&amp;rdquo;. Of course I thought it would be fun to query it with SPARQL, so I wrote something to convert the data to RDF. I did it as a github fork of the project that I called &lt;a href=&#34;https://github.com/bobdc/billboard-hot-100-rdf/&#34;&gt;billboard-hot-100-rdf&lt;/a&gt; in case anyone else wants to play with it. One nice advantage of doing it this way is that, because Billboard updates their chart on Tuesday mornings and the billboard-hot-100 repository scrapes that data every Wednesday, you can do SPARQL queries against the latest data on Thursday through Monday.&lt;/p&gt;
&lt;p&gt;Here is a sample of the JSON data from that billboard-hot-100 project. Instead of listing all the hits from that week, my excerpt only lists two:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[{
  &amp;#34;date&amp;#34;: &amp;#34;2024-04-06&amp;#34;,
  &amp;#34;data&amp;#34;: [
    {
      &amp;#34;song&amp;#34;: &amp;#34;Cruel Summer&amp;#34;,
      &amp;#34;artist&amp;#34;: &amp;#34;Taylor Swift&amp;#34;,
      &amp;#34;this_week&amp;#34;: 16,
      &amp;#34;last_week&amp;#34;: 10,
      &amp;#34;peak_position&amp;#34;: 1,
      &amp;#34;weeks_on_chart&amp;#34;: 47
    },
    {
      &amp;#34;song&amp;#34;: &amp;#34;Redrum&amp;#34;,
      &amp;#34;artist&amp;#34;: &amp;#34;21 Savage&amp;#34;,
      &amp;#34;this_week&amp;#34;: 30,
      &amp;#34;last_week&amp;#34;: 22,
      &amp;#34;peak_position&amp;#34;: 5,
      &amp;#34;weeks_on_chart&amp;#34;: 11
    }  ]
}]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;a href=&#34;https://github.com/bobdc/billboard-hot-100-rdf/blob/main/rdf/h100json2rdf.py&#34;&gt;h100json2rdf.py&lt;/a&gt; Python script in my fork of the project  (which is only about 60 lines including white space and comments)  converts the data above to this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt; .
@prefix schema: &amp;lt;http://schema.org/&amp;gt; .
@prefix dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; .
@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .

h1:TaylorSwift a h1:MusicalArtist ; 
   rdfs:label &amp;#34;Taylor Swift&amp;#34;@en .

h1:TaylorSwiftCruelSummer a schema:MusicRecording;
     schema:byArtist h1:TaylorSwift;
     dc:title &amp;#34;Cruel Summer&amp;#34;;
     h1:charted &amp;#34;2024-04-06&amp;#34;^^xsd:date {| 
        h1:position 16
|}.

h1:21Savage a h1:MusicalArtist ; 
   rdfs:label &amp;#34;21 Savage&amp;#34;@en .

h1:21SavageRedrum a schema:MusicRecording;
     schema:byArtist h1:21Savage;
     dc:title &amp;#34;Redrum&amp;#34;;
     h1:charted &amp;#34;2024-04-06&amp;#34;^^xsd:date {| 
        h1:position 30
|}.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;A few things to note about the conversion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;As always, I used classes and prefixes from existing schemas when I could and made up new ones where necessary.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When the input data said that a given artist had a given hit, I created instances for both the artist and the song. As we&amp;rsquo;ll see below, this makes queries about artists like &amp;ldquo;who had hits in the most decades&amp;rdquo; easier.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;URIs for the artists are made from their names with a little cleanup to make them proper URIs. URIs for the songs are similar but with the artist name and song title combined because sometimes completely different songs happen to have the same title and I wanted to distinguish them from each other. For example, as the charts tell us, &lt;a href=&#34;https://www.youtube.com/watch?v=l9ml3nyww80&#34;&gt;Bananarama&lt;/a&gt; had a hit with a song called &amp;ldquo;Cruel Summer&amp;rdquo; 40 years ago that is completely different from the &lt;a href=&#34;https://www.youtube.com/watch?v=GrKQvyXpNgc&#34;&gt;Taylor Swift&lt;/a&gt; song of the same name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To record that a given song charted on a given date at a particular position, I used RDF-star annotation syntax (the &lt;code&gt;{| ... |}&lt;/code&gt; blocks above) to make the position value (&lt;code&gt;h1:position&lt;/code&gt;) a property of the triple saying that the song charted on that date. To put it another way, the position value is a property of the graph edge about the date.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I didn&amp;rsquo;t convert all the JSON properties from the original because, as we&amp;rsquo;ll see, values like &lt;code&gt;weeks_on_chart&lt;/code&gt; are easy enough to query for with the data that I did convert.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
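&lt;p&gt;The URI construction described above can be sketched in a few lines of Python. This is only an illustration of the idea; the &lt;code&gt;uri_fragment&lt;/code&gt; helper here is hypothetical, and the actual cleanup in &lt;code&gt;h100json2rdf.py&lt;/code&gt; may differ:&lt;/p&gt;

```python
import re

H1 = "http://rdfdata.org/hot100#"

def uri_fragment(name):
    # Drop any character that is not safe in a URI local name.
    return re.sub(r"[^A-Za-z0-9]", "", name)

def artist_uri(artist):
    return H1 + uri_fragment(artist)

def song_uri(artist, title):
    # Combining artist name and title keeps different songs that
    # share a title (like the two "Cruel Summer"s) distinct.
    return H1 + uri_fragment(artist) + uri_fragment(title)

print(artist_uri("21 Savage"))                   # http://rdfdata.org/hot100#21Savage
print(song_uri("Taylor Swift", "Cruel Summer"))  # http://rdfdata.org/hot100#TaylorSwiftCruelSummer
```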
&lt;p&gt;I used my Python script to convert the project&amp;rsquo;s &lt;code&gt;all.json&lt;/code&gt; data file into Turtle RDF, loaded the Turtle file into the free version of Ontotext GraphDB, and was ready to start querying. (Other files in that repository hold data for individual weeks, which can make some queries go much faster.)&lt;/p&gt;
&lt;h2 id=&#34;querying-for-the-data-that-we-didnt-convert&#34;&gt;Querying for the data that we didn&amp;rsquo;t convert&lt;/h2&gt;
&lt;p&gt;Our first few queries will show why the Python script didn&amp;rsquo;t bring all the numbers into the Turtle data. Let&amp;rsquo;s query for the number of weeks that a recording had been on the chart:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;

PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
SELECT (COUNT(?chartPosition) AS ?weeksOnChart) WHERE {
?recording a schema:MusicRecording ; 
               dc:title &amp;#34;Cruel Summer&amp;#34; ; 
               schema:byArtist/rdfs:label &amp;#34;Bananarama&amp;#34;@en .
    
   &amp;lt;&amp;lt; ?recording h1:charted ?chartDate &amp;gt;&amp;gt; h1:position ?chartPosition .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result of this query is that Bananarama&amp;rsquo;s &amp;ldquo;Cruel Summer&amp;rdquo; was on the charts for 18 weeks. Comparing the &lt;a href=&#34;https://www.billboard.com/charts/hot-100/1984-11-17/&#34;&gt;November 17th&lt;/a&gt; and &lt;a href=&#34;https://www.billboard.com/charts/hot-100/1984-11-24/&#34;&gt;November 24th&lt;/a&gt; charts from 1984 confirms that November 17th was the 18th and last chart appearance of Bananarama&amp;rsquo;s &amp;ldquo;Cruel Summer&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Change the &lt;code&gt;SELECT&lt;/code&gt; line in that query to the following and you&amp;rsquo;ll learn that recording&amp;rsquo;s highest U.S. chart position:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT (MIN(?chartPosition) AS ?highestPosition) WHERE {
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The query result is 9, which is confirmed as the recording&amp;rsquo;s highest chart position on its &lt;a href=&#34;https://en.wikipedia.org/wiki/Cruel_Summer_(Bananarama_song)&#34;&gt;Wikipedia page&lt;/a&gt;.&lt;/p&gt;
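&lt;p&gt;The logic behind both aggregate queries is easy to restate outside of SPARQL: model a recording&amp;rsquo;s chart history as date/position pairs, count the pairs for weeks on chart, and take the minimum position for the peak. A small Python sketch with made-up values:&lt;/p&gt;

```python
# A recording's chart history as (date, position) pairs.
# These values are hypothetical, just to show the two aggregations.
chart_history = [
    ("1984-07-07", 48),
    ("1984-07-14", 35),
    ("1984-07-21", 21),
    ("1984-07-28", 19),
]

weeks_on_chart = len(chart_history)                   # COUNT(?chartPosition)
peak_position = min(pos for _, pos in chart_history)  # MIN(?chartPosition)

print(weeks_on_chart, peak_position)
```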
&lt;p&gt;How do we query for a recording&amp;rsquo;s position on the chart the week before a given appearance? We&amp;rsquo;ll ask about Dua Lipa&amp;rsquo;s &amp;ldquo;Houdini&amp;rdquo;. The following query uses a nested query to find the most recent date that the recording was on the chart; the outer query then asks about its position the week before that:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT ?dateLastWeek ?positionLastWeek WHERE {
   # 2. Find out the week before the latest chart 
   # appearance and the position from that week. 
   BIND (?latestChartDate - &amp;#34;P7D&amp;#34;^^xsd:duration AS ?dateLastWeek)
   &amp;lt;&amp;lt; ?recording h1:charted ?dateLastWeek &amp;gt;&amp;gt; h1:position ?positionLastWeek . 
   {
     # 1. Find the date of the latest chart appearance. 
     SELECT ?recording (MAX(?chartDate) AS ?latestChartDate) WHERE {
       ?recording a schema:MusicRecording ; 
                  dc:title &amp;#34;Houdini&amp;#34; ; 
                  schema:byArtist/rdfs:label &amp;#34;Dua Lipa&amp;#34;@en .
       &amp;lt;&amp;lt; ?recording h1:charted ?chartDate &amp;gt;&amp;gt; h1:position ?chartPosition .
      }
   GROUP BY ?recording
   } 
} 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result tells me that last week (as I write this), on the March 30th chart, this song was at position 29, which is confirmed &lt;a href=&#34;https://www.billboard.com/charts/hot-100/2024-03-30&#34;&gt;on the Billboard web site&lt;/a&gt;.&lt;/p&gt;
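&lt;p&gt;The two numbered steps in that query translate directly into ordinary date arithmetic: take the maximum chart date, subtract a seven-day duration, and look up the position stored for the resulting date. A Python sketch of the same steps (the March 30 position comes from the result above; the other values are made up):&lt;/p&gt;

```python
from datetime import date, timedelta

# (chart date -> position) for one recording.
history = {
    date(2024, 3, 23): 34,   # made up
    date(2024, 3, 30): 29,   # confirmed on the Billboard web site
    date(2024, 4, 6): 41,    # made up
}

# 1. Find the date of the latest chart appearance.
latest_chart_date = max(history)

# 2. Step back one week (the "P7D" duration) and get that week's position.
date_last_week = latest_chart_date - timedelta(days=7)
position_last_week = history.get(date_last_week)

print(date_last_week, position_last_week)
```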
&lt;h2 id=&#34;querying-for-new-things&#34;&gt;Querying for new things&lt;/h2&gt;
&lt;p&gt;Besides querying for data that the Python script didn&amp;rsquo;t bring over to the Turtle data, what else can we query for? For example, which artist had hits in the most decades? For the decade, this next query takes the first three digits of the chart date and then groups and sorts by that. Then, it lists everyone who has had hits in at least five decades:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 

SELECT (COUNT(DISTINCT ?decade) AS ?decades) ?artistName WHERE { 
?recording a schema:MusicRecording ; 
             schema:byArtist/rdfs:label ?artistName ;
             h1:charted ?chartDate . 
   BIND (SUBSTR(str(?chartDate),1,3) AS ?decade)
}
GROUP BY ?artistName
HAVING (?decades &amp;gt; 4)
ORDER BY DESC(?decades)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The results:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;?decades,?artistName
7,&amp;#34;Elvis Presley&amp;#34;@en
6,&amp;#34;Frank Sinatra&amp;#34;@en
6,&amp;#34;Cher&amp;#34;@en
6,&amp;#34;Paul McCartney&amp;#34;@en
6,&amp;#34;Michael Jackson&amp;#34;@en
5,&amp;#34;Chuck Berry&amp;#34;@en
5,&amp;#34;Andy Williams&amp;#34;@en
5,&amp;#34;The Isley Brothers&amp;#34;@en
5,&amp;#34;Brenda Lee&amp;#34;@en
5,&amp;#34;The Beatles&amp;#34;@en
5,&amp;#34;The Rolling Stones&amp;#34;@en
5,&amp;#34;Stevie Wonder&amp;#34;@en
5,&amp;#34;Fleetwood Mac&amp;#34;@en
5,&amp;#34;Eagles&amp;#34;@en
5,&amp;#34;Prince&amp;#34;@en
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In what decades did Little Richard have hits?&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs:   &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX h1:     &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;

SELECT DISTINCT ?decade WHERE {
  ?artist rdfs:label &amp;#34;Little Richard&amp;#34;@en . 
  ?s schema:byArtist ?artist ;
     h1:charted ?chartDate .
  BIND (CONCAT(SUBSTR(str(?chartDate),1,3),&amp;#39;*&amp;#39;) AS ?decade)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(The three-digit year values looked odd so I added asterisks to make them look more like years with wildcards.)&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;?decade
&amp;#34;195*&amp;#34;
&amp;#34;196*&amp;#34;
&amp;#34;197*&amp;#34;
&amp;#34;198*&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To be honest, everything I&amp;rsquo;ve done so far could be done with a relational database. I&amp;rsquo;ve been experimenting with ways to make this a real knowledge graph by adding additional data from Wikidata. I have some ideas for more interesting queries to make about the artists and their relationships to their hits.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1782074668947582999&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/112310006578986653&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Visualizing RDF</title>
      <link>https://www.bobdc.com/blog/visualizing-rdf/</link>
      <pubDate>Sun, 24 Mar 2024 04:22:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/visualizing-rdf/</guid>
      
      
<description><div>I see nodes and edges...</div><div>&lt;p&gt;I recently did a review of options for creating visual representations of RDF data. I didn&amp;rsquo;t just want a general visualization tool, but something that understood RDF well enough to represent class instances and literal values differently. I will emphasize &lt;em&gt;instances&lt;/em&gt; because several tools out there can read RDF schema or ontologies and create a visualization of classes and their relationships and potential properties, but I want to see instances with their property values.&lt;/p&gt;
&lt;p&gt;My favorite ended up being the &lt;a href=&#34;https://rdfshape.weso.es/&#34;&gt;RDF Shape&lt;/a&gt; tool from the University of Oviedo&amp;rsquo;s WESO group in northern Spain. I also liked &lt;a href=&#34;https://www.ldf.fi/service/rdf-grapher&#34;&gt;RDF Grapher&lt;/a&gt; from the Linked Data Finland project. Both let me create SVG files that I can edit with the &lt;a href=&#34;https://inkscape.org/&#34;&gt;Inkscape&lt;/a&gt; editor if I don&amp;rsquo;t like their algorithmic layout of a particular dataset&amp;rsquo;s RDF graph nodes. Before I go into detail about that and demonstrate some Inkscape editing, I wanted to describe some of the research I did to get there.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.w3.org/RDF/Validator/&#34;&gt;W3C RDF Validation Service&lt;/a&gt; can create visualizations that you may recognize from W3C publications. It has been available for at least sixteen years, according to the &amp;ldquo;Last modified&amp;rdquo; date at the bottom of the page. It lets you paste some RDF into a field or enter the URL of an RDF dataset in another field, and then after you set the &amp;ldquo;Triples and/or Graph&amp;rdquo; field to include a Graph in its output,  clicking the Parse button generates the image.&lt;/p&gt;
&lt;p&gt;To test the various graph generation tools I used the &lt;a href=&#34;https://www.learningsparql.com/2ndeditionexamples/ex012.ttl&#34;&gt;ex012.ttl&lt;/a&gt; sample data from my book &lt;a href=&#34;https://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt; and then added a few schema.org &lt;a href=&#34;https://schema.org/follows&#34;&gt;follows&lt;/a&gt; triples to connect up the three people in the sample data:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-turtle&#34; data-lang=&#34;turtle&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;d:&lt;span style=&#34;color:#f92672&#34;&gt;i0432&lt;/span&gt; schema:&lt;span style=&#34;color:#f92672&#34;&gt;follows&lt;/span&gt; d:&lt;span style=&#34;color:#f92672&#34;&gt;i9771&lt;/span&gt;. 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;d:&lt;span style=&#34;color:#f92672&#34;&gt;i8301&lt;/span&gt; schema:&lt;span style=&#34;color:#f92672&#34;&gt;follows&lt;/span&gt; d:&lt;span style=&#34;color:#f92672&#34;&gt;i0432&lt;/span&gt;, d:&lt;span style=&#34;color:#f92672&#34;&gt;i9771&lt;/span&gt;. 
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The W3C RDF Validation Service created this image from that data.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/w3cvalidator.svg&#34; border=&#34;0&#34;  alt=&#34;RDF graph image generated by W3C Validator&#34;/&gt;
&lt;p&gt;The text labels are small enough to be illegible. As we&amp;rsquo;ll see, if a tool can generate SVG, like the W3C RDF Validation Service  can, editing the image with an SVG editing tool  might make it narrower so that it displays the labels at a more readable size (&amp;ldquo;Scalable Vector Graphics!&amp;rdquo;). For this image, though, that&amp;rsquo;s just too much editing.&lt;/p&gt;
&lt;p&gt;Output this wide, with labels this hard to read, is pretty common with the W3C RDF Validation Service. Also, the input must be RDF/XML, which was another reason to look for newer alternatives.&lt;/p&gt;
&lt;p&gt;A &lt;a href=&#34;https://stackoverflow.com/questions/66720/are-there-any-tools-to-visualize-a-rdf-graph-please-include-a-screenshot&#34;&gt;Stack Overflow&lt;/a&gt; discussion provided a good starting place for research into alternatives. Some alternatives were more focused on visualizing schema and ontology classes, as I mentioned above, and others were general-purpose visualization tools that had an RDF plugin available that may or may not be up to date with the latest version of the visualization tool.&lt;/p&gt;
&lt;p&gt;This list is where I found out about WESO &lt;a href=&#34;https://rdfshape.weso.es/&#34;&gt;RDF Shape&lt;/a&gt;.  To learn more about that project, see its &lt;a href=&#34;https://rdfshape.weso.es/about&#34;&gt;About&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;Data analysis and visualization&amp;rdquo; link on the RDF Shape page leads to the &lt;a href=&#34;https://rdfshape.weso.es/dataInfo&#34;&gt;Data analysis&lt;/a&gt; form where you can paste some RDF in just about any serialization, click the blue Analyze button, and then click the Visualizations tab that appears with the result.&lt;/p&gt;
&lt;p&gt;RDF Shape did this with the data that I used to make the previous image:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/rdfshape-weso-es.svg&#34; border=&#34;0&#34;  alt=&#34;RDF graph image generated by RDF Shape&#34;/&gt;
&lt;p&gt;I don&amp;rsquo;t love the yellow. I could edit each individual square with the Inkscape editor and change their color, but because it&amp;rsquo;s SVG, it&amp;rsquo;s XML, which means that I could edit that directly with a text editor. I globally replaced the &lt;code&gt;polygon/@fill&lt;/code&gt; values of &amp;ldquo;#ffff00&amp;rdquo; with &amp;ldquo;#8aeaea&amp;rdquo; and then the graph looked like this:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/rdfshape-weso-es-lightblue.svg&#34; border=&#34;0&#34;  alt=&#34;RDF graph image generated by RDF Shape, after global replace of color value&#34;/&gt;
&lt;p&gt;If you prefer a certain style of font, rectangle fill colors, or oval and rectangle outline colors, a little XSLT or even Perl could turn the default RDF Shape SVG into whatever you like with similar replacements.&lt;/p&gt;
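&lt;p&gt;For example, the fill-color replacement described above needs nothing fancier than a regular expression over the SVG text. A sketch, using a trivial stand-in for one polygon&amp;rsquo;s attributes rather than RDF Shape&amp;rsquo;s real output:&lt;/p&gt;

```python
import re

# A stand-in for one polygon's attributes in the RDF Shape SVG.
svg_fragment = 'polygon fill="#ffff00" stroke="black"'

# Globally replace the yellow fill value with light blue.
recolored = re.sub(r'fill="#ffff00"', 'fill="#8aeaea"', svg_fragment)
print(recolored)
```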
&lt;p&gt;Using Inkscape for more hands-on editing, I moved a few shapes and arrows in that last image to make the image narrower so that it (and especially its text) can be displayed bigger, which makes the whole thing easier to read:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/rdfshape-weso-es-lightblue-edited.svg&#34; border=&#34;0&#34;  alt=&#34;RDF Shape image after some Inkscape&#34;/&gt;
&lt;p&gt;To learn how to do these kinds of edits with Inkscape, I started with their &lt;a href=&#34;https://inkscape.org/doc/tutorials/basic/tutorial-basic.html&#34;&gt;Basic tutorial&lt;/a&gt; and then skipped around in the sections of their &lt;a href=&#34;https://inkscape-manuals.readthedocs.io/en/latest/index.html&#34;&gt;Beginners&amp;rsquo; Guide&lt;/a&gt;. For editing the lines with the arrows, the section &lt;a href=&#34;https://inkscape-manuals.readthedocs.io/en/latest/editing-paths.html&#34;&gt;Editing Paths with the Node Tool&lt;/a&gt; was helpful.&lt;/p&gt;
&lt;p&gt;I doubt that I know 5% of what Inkscape can do. Instead of writing up the parts I had to learn to make the edits described above, I just made a two-minute demo video:&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/7RFkdIVCxc4&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;I mentioned that in addition to the WESO RDF Shape tool, I also liked the RDF Grapher tool. Here is the RDF Grapher version of the same data as an SVG image:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/rdf-grapher.svg&#34; border=&#34;0&#34;  alt=&#34;RDF Grapher version of same data&#34;/&gt;
&lt;p&gt;Overall it&amp;rsquo;s similar to the RDF Shape version, and you have similar options for editing its SVG XML directly (for example, those five lines of text at the bottom were easy to find in the SVG XML and delete) or using Inkscape like I did in the video.&lt;/p&gt;
&lt;p&gt;Have you found any RDF visualization tools that you really like?&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1771930841683304891&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/112151511341344755&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Using regular expressions to manipulate data in a SPARQL query</title>
      <link>https://www.bobdc.com/blog/regex/</link>
      <pubDate>Sun, 25 Feb 2024 10:58:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/regex/</guid>
      
      
      <description><div>A pure, standards-compliant SPARQL query.</div><div>&lt;p&gt;&lt;a href=&#39;https://xkcd.com/208/&#39;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/xkcdregex.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;xkcd frame&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I have often lamented that SPARQL&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#func-regex&#34;&gt;REGEX&lt;/a&gt; function returns only a boolean value. That makes it handy in &lt;code&gt;FILTER&lt;/code&gt; tests, where &lt;a href=&#34;https://en.wikipedia.org/wiki/Regular_expression&#34;&gt;regular expressions&lt;/a&gt; let you build more complex conditions about which results a query should or shouldn&amp;rsquo;t return. But I wished that, instead of just returning true or false, it would let me grab the pieces of a string that match a pattern and recombine them into new values, the way the regular expression support in most programming languages does.&lt;/p&gt;
&lt;p&gt;I only recently noticed that SPARQL&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#func-replace&#34;&gt;&lt;code&gt;REPLACE&lt;/code&gt;&lt;/a&gt; function, which comes right after &lt;code&gt;REGEX&lt;/code&gt; in the SPARQL query specification, supports regular expressions, so I can do this regex string manipulation in SPARQL after all.&lt;/p&gt;
&lt;p&gt;One of those other languages is JavaScript. In &lt;a href=&#34;https://www.bobdc.com/blog/arqjavascript/&#34;&gt;Calling your own JavaScript functions from SPARQL queries&lt;/a&gt; I showed how once you write a JavaScript function that does some regex string manipulation, you can then call that function from a SPARQL query being executed with &lt;a href=&#34;https://jena.apache.org/documentation/query/&#34;&gt;Jena ARQ&lt;/a&gt;. (Soon I&amp;rsquo;ll be showing how to do that with GraphDB on the &lt;a href=&#34;https://www.ontotext.com/blog/&#34;&gt;Ontotext blog&lt;/a&gt;.) The demo in my earlier blog entry used a regular expression in a JavaScript function to normalize some U.S. phone numbers.&lt;/p&gt;
&lt;p&gt;The SPARQL query below demonstrates why I didn&amp;rsquo;t need to call those JavaScript functions. Using SPARQL&amp;rsquo;s &lt;code&gt;REPLACE&lt;/code&gt; function and the same input data as that demo, I can normalize the same phone numbers using nothing but pure W3C-compliant SPARQL.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX v:    &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt;

SELECT ?name ?phoneNum ?fixedPhone
WHERE {
    ?s v:given-name ?name ;
  v:homeTel ?phoneNum .
  BIND (replace(?phoneNum,&amp;#34;.*(\\d\\d\\d).*(\\d\\d\\d).*(\\d\\d\\d\\d).*&amp;#34;,
                &amp;#34;$1-$2-$3&amp;#34;) AS ?fixedPhone)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The regular expression in the &lt;code&gt;replace()&lt;/code&gt; function call&amp;rsquo;s second argument looks for two three-digit sequences followed by a four-digit sequence, ignoring everything before, after, or in between. The third argument then recombines the three captured sequences, separated by hyphens.&lt;/p&gt;
&lt;p&gt;Here is the sample data from that earlier blog entry; note the different punctuation and spacing used with the four phone numbers:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix v: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:i9771 v:given-name &amp;#34;Cindy&amp;#34; ;
        v:homeTel &amp;#34;1 (203) 446-5478&amp;#34; .

d:i0432 v:given-name &amp;#34;Richard&amp;#34; ;
        v:homeTel &amp;#34;   (729)556-5135   &amp;#34; .

d:i8301 v:given-name &amp;#34;Craig&amp;#34; ;
        v:homeTel &amp;#34;9232765135&amp;#34; .

d:i8309 v:given-name &amp;#34;Leigh&amp;#34; ;
        v:homeTel &amp;#34;843-5544&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result after running the query above with this data shows the phone numbers from the data and the results of the  &lt;code&gt;replace()&lt;/code&gt; calls:&lt;/p&gt;
&lt;style type=&#34;text/css&#34;&gt;
 tr   { font-weight: bold; }
 td   { font-weight: normal ! important; text-align: left; }
&lt;/style&gt;
&lt;table&gt;
&lt;tr&gt;&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;phoneNum&lt;/th&gt;
&lt;th&gt;fixedPhone&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Craig&lt;/td&gt;
&lt;td&gt;9232765135&lt;/td&gt;
&lt;td&gt;923-276-5135&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Leigh&lt;/td&gt;
&lt;td&gt;843-5544&lt;/td&gt;
&lt;td&gt;843-5544&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Richard&lt;/td&gt;
&lt;td&gt;   (729)556-5135   &lt;/td&gt;
&lt;td&gt;729-556-5135&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cindy&lt;/td&gt;
&lt;td&gt;1 (203) 446-5478&lt;/td&gt;
&lt;td&gt;203-446-5478&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;As the SPARQL query spec tells us, this function corresponds to the XPath &lt;a href=&#34;https://www.w3.org/TR/xpath-functions/#func-replace&#34;&gt;&lt;code&gt;fn:replace&lt;/code&gt;&lt;/a&gt; function. That leads to more documentation, which points to a separate &lt;a href=&#34;https://www.w3.org/TR/xpath-functions/#regex-syntax&#34;&gt;Regular expression syntax&lt;/a&gt; section that lists available &lt;a href=&#34;https://www.w3.org/TR/xpath-functions/#flags&#34;&gt;flags&lt;/a&gt; such as &lt;code&gt;i&lt;/code&gt; for case-insensitive matching and &lt;code&gt;m&lt;/code&gt; for multiline matching.&lt;/p&gt;
&lt;p&gt;Those links ultimately lead to an &lt;a href=&#34;https://www.w3.org/TR/xmlschema-2/#nt-WildcardEsc&#34;&gt;escape character table&lt;/a&gt; in the XML Schema Part 2 specification. This table tells us the typical regular expression codes—for example, that &lt;code&gt;\s&lt;/code&gt; matches white space characters and &lt;code&gt;\d&lt;/code&gt; matches a numeric digit. Note that when I used the &lt;code&gt;\d&lt;/code&gt; codes in the SPARQL query above they&amp;rsquo;re in a quoted string, so the backslash itself needed escaping; that&amp;rsquo;s why you see two backslashes before each &lt;code&gt;d&lt;/code&gt; in my query&amp;rsquo;s regular expression.&lt;/p&gt;
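&lt;p&gt;As a cross-check, the same pattern behaves identically in other regex implementations. Here it is in Python, where a raw string means the &lt;code&gt;\d&lt;/code&gt; codes need no doubled backslashes (this just re-runs the query&amp;rsquo;s logic; it is not part of the original demo):&lt;/p&gt;

```python
import re

# Same pattern as the SPARQL query's second argument; the r"..." raw
# string avoids the backslash doubling that the quoted SPARQL string needs.
PATTERN = r".*(\d\d\d).*(\d\d\d).*(\d\d\d\d).*"

def fix_phone(phone):
    # Strings with too few digits don't match and come back
    # unchanged, just as with SPARQL's REPLACE.
    return re.sub(PATTERN, r"\1-\2-\3", phone)

print(fix_phone("1 (203) 446-5478"))  # 203-446-5478
print(fix_phone("9232765135"))        # 923-276-5135
print(fix_phone("843-5544"))          # 843-5544 (unchanged)
```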
&lt;p&gt;The &lt;code&gt;REPLACE&lt;/code&gt; function&amp;rsquo;s ability to find substrings and delete or rearrange them in RDF literal data should be very handy for data cleanup and enhancement. I&amp;rsquo;m sorry I didn&amp;rsquo;t notice it before!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1761785944796275069&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/111992997650896225&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Excerpt from &lt;a href=&#34;https://xkcd.com/208/&#34;&gt;xkcd comic&lt;/a&gt; by Randall Munroe, &lt;a href=&#34;https://creativecommons.org/licenses/by-nc/2.5/&#34;&gt;CC BY-NC 2.5 DEED&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Appreciating the SPARQL property path slash character more</title>
      <link>https://www.bobdc.com/blog/slashnote/</link>
      <pubDate>Sun, 21 Jan 2024 10:25:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/slashnote/</guid>
      
      
      <description><div>Querying for labels and more.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/slash.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;220&#34; alt=&#34;Slash with natural Les Paul&#34;/&gt;
&lt;p&gt;I&amp;rsquo;ve understood SPARQL&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#propertypaths&#34;&gt;property path&lt;/a&gt; features well enough to demo them in the &amp;ldquo;Searching Further in the Data&amp;rdquo; section of my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;. (See &lt;a href=&#34;http://www.learningsparql.com/2ndeditionexamples/&#34;&gt;example files&lt;/a&gt; ex074 - ex085.) To be honest, I have very rarely used them in actual queries that I&amp;rsquo;ve written. I&amp;rsquo;ve only just realized how the property path slash operator can help with a pattern that I have used in a large percentage of my queries. It makes these queries more concise and removes at least one variable that would not have been in my &lt;code&gt;SELECT&lt;/code&gt; statement anyway.&lt;/p&gt;
&lt;p&gt;As an example, here is some very simple data about three people and who follows who on social media:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix schema: &amp;lt;http://schema.org/&amp;gt; .
@prefix d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:i0432 d:name &amp;#34;Richard Mutt&amp;#34; . 
d:i9771 d:name &amp;#34;Cindy Marshall&amp;#34; . 
d:i8301 d:name &amp;#34;Craig Ellis&amp;#34; . 

d:i0432 schema:follows d:i9771, d:i8301 . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If I want to list who Richard follows, I want their actual names, not their URIs. This would be an obvious query to do that:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d:      &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt; 

SELECT ?name WHERE {
  
  ?follower d:name &amp;#34;Richard Mutt&amp;#34; ;
            schema:follows ?person .
  
  ?person d:name ?name .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It finds the URIs of the people that Richard follows, stores them in the &lt;code&gt;?person&lt;/code&gt; variable, and then finds the &lt;code&gt;d:name&lt;/code&gt; value of each of those people. Having a query find resources that meet a certain condition and then using another triple pattern to get the human-readable names of those resources (and then using those names in the &lt;code&gt;SELECT&lt;/code&gt; statement) is extremely common in SPARQL.&lt;/p&gt;
&lt;p&gt;The property path slash character lets me do the same thing with no need for the &lt;code&gt;?person&lt;/code&gt; variable in the previous query. This next query asks, for each resource that Richard follows, what their name is:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d:      &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt; 

SELECT ?name WHERE {
  ?follower d:name &amp;#34;Richard Mutt&amp;#34; ;
            # For each followed resource, what is its name?
            schema:follows/d:name ?name . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In graph terms, we store the URI of Richard Mutt&amp;rsquo;s node in the &lt;code&gt;?follower&lt;/code&gt; variable, then traverse &lt;code&gt;schema:follows&lt;/code&gt; graph edges to any nodes that then have a &lt;code&gt;d:name&lt;/code&gt; edge, and then we store each value that the &lt;code&gt;d:name&lt;/code&gt; edge leads to in the &lt;code&gt;?name&lt;/code&gt; variable.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t think that it&amp;rsquo;s intuitively very readable, which is why I added the comment in the query, but perhaps as I use this more I will get used to it. (Note also that the comment doesn&amp;rsquo;t ask &amp;ldquo;What is the name of each followed resource?&amp;rdquo;; I wanted it to reflect the syntax it describes a little more closely.)&lt;/p&gt;
&lt;p&gt;This is such a common pattern that I wanted to show some examples from more real-life contexts. The following query asks Wikidata for the names of the members of Daft Punk. It does this by storing the URI representing each member of the group in the &lt;code&gt;?member&lt;/code&gt; variable, and it then asks for the &lt;code&gt;rdfs:label&lt;/code&gt; value of each, filtered to only show the English representation. (You can &lt;a href=&#34;https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0A%0ASELECT%20%3Fname%20WHERE%20%7B%0A%0A%20%20wd%3AQ185828%20wdt%3AP527%20%3Fmember%20.%20%0A%20%20%3Fmember%20rdfs%3Alabel%20%3Fname%20.%20%0A%20%20FILTER%28lang%28%3Fname%29%20%3D%20%22en%22%29%0A%7D%0A%0A&#34;&gt;execute this query with the Wikidata Query Service&lt;/a&gt; yourself.)&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt;

SELECT ?name WHERE {
  wd:Q185828 wdt:P527 ?member . 
  ?member rdfs:label ?name . 
  FILTER(lang(?name) = &amp;#34;en&amp;#34;)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;But, we don&amp;rsquo;t need that &lt;code&gt;?member&lt;/code&gt; variable and second triple pattern! We can just do this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt;

SELECT ?name WHERE {
# For each member of Daft Punk, what is their name? 
  wd:Q185828 wdt:P527/rdfs:label ?name . 
  FILTER(lang(?name) = &amp;#34;en&amp;#34;)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0A%0ASELECT%20%3Fname%20WHERE%20%7B%0A%20%20wd%3AQ185828%20wdt%3AP527%2Frdfs%3Alabel%20%3Fname%20.%20%0A%20%20FILTER%28lang%28%3Fname%29%20%3D%20%22en%22%29%0A%7D%0A%0A&#34;&gt;Run this second query&lt;/a&gt; and you will see the same results as the query before it.&lt;/p&gt;
&lt;p&gt;I could do this with something besides names, such as their &lt;a href=&#34;https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0A%0ASELECT%20%2a%20WHERE%20%7B%0A%20%20wd%3AQ185828%20wdt%3AP527%2Fwdt%3AP569%20%3FbirthDate%20.%20%0A%7D%0A%0A&#34;&gt;birth dates&lt;/a&gt;, but a list of dates with no context about what resources they describe isn&amp;rsquo;t very helpful. (Using it for names also just happens to build on a theme of recent entries in my blog, &lt;a href=&#34;../rdflabels&#34;&gt;Human-readable names in RDF&lt;/a&gt; and &lt;a href=&#34;../wikibaselabel&#34;&gt;Querying for labels&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;As another example, I was going to create a query for the Rhizome Artbase SPARQL endpoint that I wrote about in &lt;a href=&#34;../snowmanartbasept1/&#34;&gt;Generating websites with SPARQL and Snowman, part 1&lt;/a&gt;. Then, I realized that I could use a query that was already in that blog entry, which you can &lt;a href=&#34;https://query.artbase.rhizome.org/#PREFIX%20rt%3A%20%3Chttps%3A%2F%2Fartbase.rhizome.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20DISTINCT%20%3FartistName%20WHERE%20%7B%0A%20%20%3Fartwork%20rt%3AP29%20%3Fartist%20.%20%0A%20%20%3Fartist%20rdfs%3Alabel%20%3FartistName%20.%0A%7D%0AORDER%20BY%20%28%3FartistName%29%0ALIMIT%20250&#34;&gt;run yourself&lt;/a&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rt: &amp;lt;https://artbase.rhizome.org/prop/direct/&amp;gt;

SELECT DISTINCT ?artistName WHERE {
  ?artwork rt:P29 ?artist . 
  ?artist rdfs:label ?artistName .
}
ORDER BY (?artistName)
LIMIT 250
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This time, we&amp;rsquo;ll remove the &lt;code&gt;?artist&lt;/code&gt; variable from the end of the first triple pattern and the beginning of the second and create a property path out of &lt;code&gt;rt:P29&lt;/code&gt; and &lt;code&gt;rdfs:label&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rt: &amp;lt;https://artbase.rhizome.org/prop/direct/&amp;gt;

SELECT DISTINCT ?artistName WHERE {
  ?artwork rt:P29/rdfs:label ?artistName .
}
ORDER BY (?artistName)
LIMIT 250
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.artbase.rhizome.org/#PREFIX%20rt%3A%20%3Chttps%3A%2F%2Fartbase.rhizome.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3FartistName%20WHERE%20%7B%0A%20%20%3Fartwork%20rt%3AP29%2Frdfs%3Alabel%20%3FartistName%20.%0A%7D%0AORDER%20BY%20%28%3FartistName%29%0ALIMIT%20250&#34;&gt;Run this one&lt;/a&gt; and you&amp;rsquo;ll see the same result as the previous query.&lt;/p&gt;
&lt;p&gt;Going the other direction, the next query uses plain triple patterns to list the names of the artworks attributed to a single artist:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rt: &amp;lt;https://artbase.rhizome.org/prop/direct/&amp;gt;

SELECT * WHERE {
  ?artist rdfs:label &amp;#34;Jessica Gomula&amp;#34;@en . 
  ?artwork rt:P29 ?artist .
  ?artwork rdfs:label ?name . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Has anyone else found a particular property path pattern to be worth using in a high percentage of their SPARQL queries?&lt;/p&gt;
&lt;!--
This works at https://query.artbase.rhizome.org/ but is a bit too complicated to demonstrate the point here.

The issue is that instead of &#34;artist created artwork&#34; triples they have &#34;artwork created by artist&#34; triples. 

PREFIX rt: &lt;https://artbase.rhizome.org/prop/direct/&gt;

SELECT ?artwork WHERE {
  ?artist rdfs:label &#34;Jessica Gomula&#34;@en . 
  ?artist ^rt:P29/rdfs:label ?artwork .
}
--&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1749093134745633154&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/111794672776689311&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://creativecommons.org/licenses/by-sa/2.0/&#34;&gt;CC BY-SA 2.0&lt;/a&gt; &lt;a href=&#34;https://flickr.com/photos/dgoomany/7183123425&#34;&gt;photo&lt;/a&gt; by &lt;a href=&#34;https://flickr.com/photos/dgoomany/&#34;&gt;Dineshraj Goomany&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Triples about existing triples</title>
      <link>https://www.bobdc.com/blog/etriplesabout/</link>
      <pubDate>Sun, 17 Dec 2023 10:35:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/etriplesabout/</guid>
      
      
      <description><div>The easy way and the hard way.</div><div>&lt;img id=&#34;idm45504699000944&#34; src=&#34;https://www.bobdc.com/img/main/rdfrdf.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;160&#34; alt=&#34;triple within a triple&#34;/&gt;
&lt;p&gt;Several years ago in the blog post &lt;a href=&#34;../rdf-and-sparql&#34;&gt;RDF* and SPARQL*&lt;/a&gt; I described how I had played with implementations of the new reification syntax that Olaf Hartig and Bryan Thompson proposed in their paper &lt;a href=&#34;https://arxiv.org/pdf/1406.3399.pdf&#34;&gt;Foundations of an Alternative Approach to Reification in RDF&lt;/a&gt;. I found the new syntax to be straightforward and useful. As you can see from the recent W3C Community Group Report &lt;a href=&#34;https://w3c.github.io/rdf-star/cg-spec&#34;&gt;RDF-star and SPARQL-star&lt;/a&gt;, this syntax has progressed&mdash;with a more search-engine-friendly spelling of the spec&amp;rsquo;s name&mdash;closer to W3C standardization. (You&amp;rsquo;ll also see me listed as an author of that specification; I merely submitted a pull request that revised the tutorial from an earlier draft, so I was honored to be co-credited on that document.)&lt;/p&gt;
&lt;p&gt;Because of the advancing specification, the wider implementation, and some potential syntax trickiness for situations that I would consider to be edge cases, I wanted to first review the current syntax that I feel will be the most popular and then review the potentially tricky part that I think most people can ignore. (I realized that the second part of my subtitle of &amp;ldquo;The easy way and the hard way&amp;rdquo; could imply the &lt;a href=&#34;https://www.w3.org/wiki/RdfReification&#34;&gt;original reification syntax&lt;/a&gt; from years ago, but I think we can all put that behind us.)&lt;/p&gt;
&lt;h2 id=&#34;the-simple-way-annotation-syntax&#34;&gt;The simple way: annotation syntax&lt;/h2&gt;
&lt;p&gt;The simple way is called annotation syntax, which as far as I know did not exist yet when I did my earlier experiments with RDF-Star and SPARQL-Star. Using the  &lt;a href=&#34;https://w3c.github.io/rdf-star/cg-spec/2021-12-17.html#turtle-star&#34;&gt;Turtle-Star&lt;/a&gt; syntax, if you have a triple that expresses a statement and you want to record other triples about that triple in annotation syntax, you put them after it inside of &lt;code&gt;{|&lt;/code&gt; and &lt;code&gt;|}&lt;/code&gt; delimiters.&lt;/p&gt;
&lt;p&gt;Here is the example from that earlier blog entry expressed in annotation syntax. It has three triples that I got from Olaf&amp;rsquo;s slides that the blog entry linked to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;One triple saying that (Stanley) Kubrick was influenced by (Orson) Welles.&lt;/li&gt;
&lt;li&gt;Another saying that triple 1 has a significance of 0.8.&lt;/li&gt;
&lt;li&gt;A third one saying that triple 1 has its source at a URL at &lt;code&gt;nofilmschool.com&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; .

d:Kubrick d:influencedBy d:Welles {| 
   d:significance 0.8 ;
   d:source &amp;lt;https://nofilmschool.com/2013/08/films-directors-that-influenced-stanley-kubrick&amp;gt;
|} .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Using Apache Jena arq or the free version of Ontotext&amp;rsquo;s GraphDB, a &lt;code&gt;SELECT * WHERE {?s ?p ?o}&lt;/code&gt; query to get all the triples in that block of Turtle-Star retrieves this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;?s                                      ?p             ?o
--------------------------------------- -------------- ------------------
d:Kubrick                               d:influencedBy d:Welles .
&amp;lt;&amp;lt; d:Kubrick d:influencedBy d:Welles &amp;gt;&amp;gt; d:significance &amp;#34;0.8&amp;#34;^^xsd:decimal . 
&amp;lt;&amp;lt; d:Kubrick d:influencedBy d:Welles &amp;gt;&amp;gt; d:source https://nofilmschool.com/2013/08/films-directors-that-influenced-stanley-kubrick . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It&amp;rsquo;s the three triples from the numbered list above.&lt;/p&gt;
&lt;p&gt;To understand better what this syntax adds, here is the sample data from my earlier blog entry on this topic:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; .
&amp;lt;&amp;lt;d:Kubrick d:influencedBy d:Welles&amp;gt;&amp;gt; d:significance 0.8 ;
      d:source &amp;lt;https://nofilmschool.com/2013/08/films-directors-that-influenced-stanley-kubrick&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The same query on this data will show the second and third result rows above but not the first one. In other words, this data doesn&amp;rsquo;t actually say that Kubrick was influenced by Welles; it only has metadata about this statement.&lt;/p&gt;
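A quick way to see the difference is a query for the base statement itself, sketched here with the same prefix as the data. It finds a result against the annotation-syntax data but nothing against the version that only quotes the triple:
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; 

SELECT ?influence WHERE {
  d:Kubrick d:influencedBy ?influence
}
&lt;/code&gt;&lt;/pre&gt;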
&lt;p&gt;When you use annotation syntax in a SPARQL query, you&amp;rsquo;re using SPARQL-Star. To let me make my next SPARQL-Star query a little more interesting, I added the following data to the triples above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;
d:Scorsese d:influencedBy d:Rosselini {| 
   d:significance 0.9 ;
   d:source &amp;lt;https://en.wikipedia.org/wiki/Martin_Scorsese&amp;gt;
|} .

d:Tarantino d:influencedBy d:Scorsese .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For which director influence triples do we have annotations about the significance of that influence?&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; 

SELECT ?director
WHERE {
  
  ?director d:influencedBy ?o {|
      d:significance ?significanceScore
  |} .

}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;
?director
-----------
d:Scorsese
d:Kubrick
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It&amp;rsquo;s all pretty simple, until we get to&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;quoted-and-asserted-triples&#34;&gt;Quoted and asserted triples&lt;/h2&gt;
&lt;p&gt;The original proposal that I mentioned in the first paragraph above did not mention the concepts of quoted or asserted triples until its authors later added a &amp;ldquo;This document has become obsolete&amp;rdquo; paragraph at the top. In the &lt;a href=&#34;https://www.w3.org/2021/12/rdf-star.html&#34;&gt;latest version of the specification&lt;/a&gt;, the first subsection of the &lt;a href=&#34;https://www.w3.org/2021/12/rdf-star.html#concepts&#34;&gt;Concepts and Abstract Syntax&lt;/a&gt; section is titled &lt;a href=&#34;https://www.w3.org/2021/12/rdf-star.html#quoted-triples&#34;&gt;Quoted and Asserted Triples&lt;/a&gt; and includes this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A quoted triple is a triple used as the subject or object of another triple. Quoted triples can also be called &amp;ldquo;embedded triples&amp;rdquo;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;in RDF 1.1, an asserted triple is an element of the set of triples that make up an RDF graph. RDF-star does not change this except that an RDF-star triple can contain quoted triples. A triple can be used as an asserted triple, a quoted triple, or both, in a given graph.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This tells me that the regular triples that we&amp;rsquo;ve been using all along are now known as asserted triples, and the new kind—the kind that can be used as the subject or object of another triple—are known as quoted or embedded triples.  (I did enjoy this quote from the Community Group Report after it used a &lt;a href=&#34;https://en.wikipedia.org/wiki/Lisp_(programming_language)&#34;&gt;Lisp&lt;/a&gt; analogy to explain the difference between asserted and quoted triples: &amp;ldquo;Obviously this way of thinking is helpful only if you understand how Lisp works&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;Here is an example. The following Turtle translated to plain English says &amp;ldquo;Sam said that the earth is flat&amp;rdquo;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:sam d:said &amp;lt;&amp;lt; d:earth d:shape &amp;#34;flat&amp;#34; &amp;gt;&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It does this with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An asserted triple that tells us that Sam said something.&lt;/li&gt;
&lt;li&gt;A quoted triple that tells us what he said: that the earth is flat.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That second one is a quoted triple because it&amp;rsquo;s used as the object of the first triple, and the &lt;code&gt;&amp;lt;&amp;lt;&lt;/code&gt; &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; delimiters show us that it&amp;rsquo;s a quoted triple.&lt;/p&gt;
&lt;p&gt;If I do a &lt;code&gt;SELECT * WHERE {?s ?p ?o}&lt;/code&gt; query on this data to get all of that example&amp;rsquo;s triples, this is all I will see:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--------------------------------------------------------
| s       | p        | o                                |
========================================================
| d:sam   | d:said   | &amp;lt;&amp;lt; d:earth d:shape &amp;#34;flat&amp;#34; &amp;gt;&amp;gt;     |
--------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;What if I do a query asking for triples about the earth&amp;rsquo;s shape, like this?&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt;

SELECT *
WHERE {
  d:earth d:shape ?earthShape
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I won&amp;rsquo;t get any response. That data has no &lt;em&gt;asserted&lt;/em&gt; triples about the earth&amp;rsquo;s shape.&lt;/p&gt;
&lt;p&gt;If I wanted this earth-is-flat triple to be both an asserted triple and a quoted triple, I can record it as both:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:sam d:said &amp;lt;&amp;lt; d:earth d:shape &amp;#34;flat&amp;#34; &amp;gt;&amp;gt; .
d:earth d:shape &amp;#34;flat&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://www.w3.org/2021/12/rdf-star.html#example-7&#34;&gt;Example 7&lt;/a&gt; in the Community Group Report also demonstrates this.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t like this redundancy because maintaining a thing and a separate copy of the thing is usually a bad idea. If you edit one, then maybe you do or don&amp;rsquo;t need to make the same edit to the other, and maintenance gets messy. That&amp;rsquo;s why it was nice to see that the first section of the Community Group Report&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/2021/12/rdf-star.html#concrete-syntaxes&#34;&gt;Concrete Syntaxes&lt;/a&gt; section is &lt;a href=&#34;https://www.w3.org/2021/12/rdf-star.html#annotation-syntax&#34;&gt;Annotation Syntax&lt;/a&gt;, which I described above as the simpler way to just have triples about triples without some of those triples having a special status that prevents them from showing up as the result of an &lt;code&gt;?s ?p ?o&lt;/code&gt; query.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m sure that having this separate status be a part of the architecture will enable some finer-grained modeling. To just have triples about triples (especially to express data about edges between graph nodes, which was a key inspiration for all of this), I&amp;rsquo;m happy with the annotation syntax for now.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1736412063793131589&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/111596532783340382&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/turtle">turtle</category>
      
    </item>
    
    <item>
      <title>Querying for labels</title>
      <link>https://www.bobdc.com/blog/wikibaselabel/</link>
      <pubDate>Sun, 19 Nov 2023 11:20:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/wikibaselabel/</guid>
      
      
      <description><div>The normal way and the wikibase:label service way</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/tomatoPlantsWithSPARQLLogo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;250&#34; alt=&#34;labeled tomato plants with SPARQL logo&#34;/&gt;
&lt;p&gt;In my &lt;a href=&#34;../rdflabels&#34;&gt;last blog entry&lt;/a&gt; I discussed various ways that different RDF datasets assign human-readable labels to resources, with the &lt;code&gt;rdfs:label&lt;/code&gt; property being at the center of them all. I mentioned how schema.org doesn&amp;rsquo;t use &lt;code&gt;rdfs:label&lt;/code&gt; but its own equivalent of that, &lt;code&gt;schema:name&lt;/code&gt;, which its schema declares as a subproperty of &lt;code&gt;rdfs:label&lt;/code&gt;. Since I wrote that, &lt;a href=&#34;https://twitter.com/FanLi_RnD&#34;&gt;Fan Li&lt;/a&gt; &lt;a href=&#34;https://twitter.com/FanLi_RnD/status/1718687939352236107&#34;&gt;pointed out&lt;/a&gt; that Facebook&amp;rsquo;s &lt;a href=&#34;https://ogp.me/&#34;&gt;Open Graph protocol&lt;/a&gt; also has its own equivalent: &lt;code&gt;og:title&lt;/code&gt;, which you can see used in the HTML source of &lt;a href=&#34;https://www.imdb.com/title/tt22041854/?ref_=ttls_li_tt&#34;&gt;IMDB&lt;/a&gt;, &lt;a href=&#34;https://www.instagram.com/bobdcofficial/&#34;&gt;Instagram&lt;/a&gt;, and &lt;a href=&#34;https://www.yelp.com/biz/peter-changs-china-grill-charlottesville&#34;&gt;yelp&lt;/a&gt;. (I tried pointing each of those three links to the view-source version of the pages, and that didn&amp;rsquo;t work, so you&amp;rsquo;ll have to take the extra step with each to view their source and see each one&amp;rsquo;s &lt;code&gt;og:title&lt;/code&gt; value.) This also gets defined as a subproperty of &lt;code&gt;rdfs:label&lt;/code&gt; in the &lt;a href=&#34;https://ogp.me/ns/ogp.me.ttl&#34;&gt;OGP schema&lt;/a&gt;, so a serious RDFS application could parse that schema and then treat &lt;code&gt;og:title&lt;/code&gt; values as &lt;code&gt;rdfs:label&lt;/code&gt; values.&lt;/p&gt;
&lt;h1 id=&#34;treating-those-rdfslabel-variations-as-rdfslabel-values&#34;&gt;Treating those rdfs:label variations as rdfs:label values&lt;/h1&gt;
&lt;p&gt;Querying for &lt;code&gt;rdfs:label&lt;/code&gt; values is simple enough. To demonstrate how a query for &lt;code&gt;rdfs:label&lt;/code&gt; values will retrieve &lt;code&gt;og:title&lt;/code&gt; and &lt;code&gt;schema:name&lt;/code&gt; values when a query engine that can do inferencing has access to the Open Graph Protocol and schema.org schemas, I added some of those values to the following document with comments about where I found each. (Where I found them they were not in Turtle syntax like they are here, but they were in machine-readable formats that could easily be converted to Turtle.)&lt;/p&gt;
&lt;p&gt;Sample data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix og: &amp;lt;http://ogp.me/ns#&amp;gt; .
@prefix schema: &amp;lt;https://schema.org/&amp;gt; .

# og:title examples

&amp;lt;https://www.imdb.com/title/tt22041854/?ref_=ttls_li_tt&amp;gt;  
  og:title &amp;#34;Priscilla (2023) ⭐ 6.9 | Biography, Drama, Music&amp;#34; . 

&amp;lt;https://www.instagram.com/bobdcofficial/&amp;gt; 
  og:title &amp;#34; (&amp;amp;#064;bobdcofficial) &amp;amp;#x2022; Instagram photos and videos&amp;#34; . 

&amp;lt;https://www.yelp.com/biz/peter-changs-china-grill-charlottesville&amp;gt; 
  og:title &amp;#34;Peter Chang&amp;#39;s China Grill - Charlottesville, VA&amp;#34; . 

# schema:name examples

## (added by Hugo as a default with no special configuration from me)
&amp;lt;https://www.bobdc.com/blog/rdflabels/&amp;gt; 
  schema:name &amp;#34;Human-readable names in RDF&amp;#34; . 

&amp;lt;https://www.newyorker.com/best-books-2023&amp;gt; 
  schema:name &amp;#34;The Best Books We Read This Week&amp;#34; . 

&amp;lt;https://www.landsend.com/products/mens-super-t-long-sleeve-t-shirt/id_130670&amp;gt; 
  schema:name &amp;#34;Men&amp;#39;s Super-T Long Sleeve T-Shirt&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I downloaded the &lt;a href=&#34;https://schema.org/docs/developers.html&#34;&gt;schema.org&lt;/a&gt; and &lt;a href=&#34;https://ogp.me/&#34;&gt;OGP&lt;/a&gt; schema files and combined them into a single schema file:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;cat ogp.me.ttl schemaorg-current-https.ttl &amp;gt; comboschema.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Then, as I described in &lt;a href=&#34;https://www.bobdc.com/blog/jenagems/&#34;&gt;Hidden gems included with Jena’s command line utilities&lt;/a&gt;, I used the Jena &lt;code&gt;riot&lt;/code&gt; tool to do RDFS inferencing with the data above and the combined schemas. It produced a lot of triples, so I used &lt;code&gt;grep&lt;/code&gt; to only show the ones that mentioned the &lt;code&gt;rdfs:label&lt;/code&gt; value:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;riot --rdfs comboschema.ttl labeldata.ttl | grep &amp;#34;#label&amp;#34; 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It produced these results:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;https://www.imdb.com/title/tt22041854/?ref_=ttls_li_tt&amp;gt; &amp;lt;http://www.w3.org/2000/01/rdf-schema#label&amp;gt; &amp;#34;Priscilla (2023) ⭐ 6.9 | Biography, Drama, Music&amp;#34; .
&amp;lt;https://www.instagram.com/bobdcofficial/&amp;gt; &amp;lt;http://www.w3.org/2000/01/rdf-schema#label&amp;gt; &amp;#34; (&amp;amp;#064;bobdcofficial) &amp;amp;#x2022; Instagram photos and videos&amp;#34; .
&amp;lt;https://www.yelp.com/biz/peter-changs-china-grill-charlottesville&amp;gt; &amp;lt;http://www.w3.org/2000/01/rdf-schema#label&amp;gt; &amp;#34;Peter Chang&amp;#39;s China Grill - Charlottesville, VA&amp;#34; .
&amp;lt;https://www.bobdc.com/blog/rdflabels/&amp;gt; &amp;lt;http://www.w3.org/2000/01/rdf-schema#label&amp;gt; &amp;#34;Human-readable names in RDF&amp;#34; .
&amp;lt;https://www.newyorker.com/best-books-2023&amp;gt; &amp;lt;http://www.w3.org/2000/01/rdf-schema#label&amp;gt; &amp;#34;The Best Books We Read This Week&amp;#34; .
&amp;lt;https://www.landsend.com/products/mens-super-t-long-sleeve-t-shirt/id_130670&amp;gt; &amp;lt;http://www.w3.org/2000/01/rdf-schema#label&amp;gt; &amp;#34;Men&amp;#39;s Super-T Long Sleeve T-Shirt&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;So, asking for the &lt;code&gt;rdfs:label&lt;/code&gt; values when the schemas were available retrieved the &lt;code&gt;schema:name&lt;/code&gt; and &lt;code&gt;og:title&lt;/code&gt; values because they were subproperties of &lt;code&gt;rdfs:label&lt;/code&gt; and because I used a query engine that could do inferencing. (When I created a repo that would do RDFS inferencing with the &lt;a href=&#34;https://www.ontotext.com/products/graphdb&#34;&gt;free version of GraphDB&lt;/a&gt;, the same thing happened. Standards!)&lt;/p&gt;
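Against an endpoint that applies the same RDFS inferencing (a sketch; a GraphDB repository configured with an RDFS ruleset behaved this way for me), a plain &lt;code&gt;rdfs:label&lt;/code&gt; query returns those inferred labels directly:
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT ?resource ?label WHERE {
  ?resource rdfs:label ?label .
}
&lt;/code&gt;&lt;/pre&gt;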
&lt;h1 id=&#34;some-extra-help-from-the-wikidata-query-service&#34;&gt;Some extra help from the Wikidata Query Service&lt;/h1&gt;
&lt;p&gt;Querying for an &lt;code&gt;rdfs:label&lt;/code&gt; value in Wikidata can be simple enough:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX wd:   &amp;lt;http://www.wikidata.org/entity/&amp;gt;

SELECT * WHERE {
   wd:Q144 rdfs:label ?name
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%20wd%3AQ144%20rdfs%3Alabel%20%3Fname%0A%20%20%20%20%20%20%20%20%20%20%20%7D&#34;&gt;Doing this in Wikidata&lt;/a&gt;, though, gets about 300 results (and the number has gone up since I first drafted this blog entry) because Wikidata knows the word for &amp;ldquo;dog&amp;rdquo; in so many languages. We could &lt;code&gt;FILTER&lt;/code&gt; it down to one or just a few languages like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX wd:   &amp;lt;http://www.wikidata.org/entity/&amp;gt;

SELECT ?label WHERE {
   wd:Q144 rdfs:label ?label
   FILTER (lang(?label) IN (&amp;#34;en&amp;#34;,&amp;#34;es&amp;#34;))
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Wikidata has a &lt;a href=&#34;https://en.wikibooks.org/wiki/SPARQL/SERVICE_-_Label&#34;&gt;special service&lt;/a&gt; to make this easier. To demonstrate it, let&amp;rsquo;s say I&amp;rsquo;m wondering about the topics of the Wikiquote pages &lt;a href=&#34;https://en.wikiquote.org/wiki/Dogs&#34;&gt;https://en.wikiquote.org/wiki/Dogs&lt;/a&gt; and &lt;a href=&#34;https://en.wikiquote.org/wiki/Cats&#34;&gt;https://en.wikiquote.org/wiki/Cats&lt;/a&gt; (although it&amp;rsquo;s pretty clear from the URLs). The following query, which you can try &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3Ffoo%20%3Fbar%0AWHERE%20%7B%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FDogs%3E%20schema%3Aabout%20%3Ffoo%20%7D%0A%20%20UNION%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FCats%3E%20schema%3Aabout%20%3Fbar%20%7D%0A%7D%0A&#34;&gt;on the Wikidata Query Service&lt;/a&gt;, will show me a &lt;code&gt;?foo&lt;/code&gt; value of &lt;code&gt;wd:Q144&lt;/code&gt; and a &lt;code&gt;?bar&lt;/code&gt; value of &lt;code&gt;wd:Q146&lt;/code&gt;, which are not very informative:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?foo ?bar
WHERE {
  { &amp;lt;https://en.wikiquote.org/wiki/Dogs&amp;gt; schema:about ?foo }
  UNION
  { &amp;lt;https://en.wikiquote.org/wiki/Cats&amp;gt; schema:about ?bar }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I could ask for &lt;code&gt;rdfs:label&lt;/code&gt; values of &lt;code&gt;?foo&lt;/code&gt; and &lt;code&gt;?bar&lt;/code&gt;, but instead I&amp;rsquo;ll use the &lt;code&gt;wikibase:label&lt;/code&gt; service built into the Wikidata Query Service. This not only looks up the labels but even creates variables for them by adding &amp;ldquo;Label&amp;rdquo; to the names of the variables representing the resources that I&amp;rsquo;m querying about:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?fooLabel ?barLabel
WHERE {
  { &amp;lt;https://en.wikiquote.org/wiki/Dogs&amp;gt; schema:about ?foo }
  UNION
  { &amp;lt;https://en.wikiquote.org/wiki/Cats&amp;gt; schema:about ?bar }
  SERVICE wikibase:label { bd:serviceParam wikibase:language &amp;#34;[AUTO_LANGUAGE]&amp;#34; } 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3FfooLabel%20%3FbarLabel%0AWHERE%20%7B%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FDogs%3E%20schema%3Aabout%20%3Ffoo%20%7D%0A%20%20UNION%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FCats%3E%20schema%3Aabout%20%3Fbar%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%22%20%7D%20%0A%7D%0A&#34;&gt;Running that query&lt;/a&gt; gives us the following results:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;fooLabel    barLabel
--------    --------
dog         house cat
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I could name a specific language if I wanted;  &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3FfooLabel%20%3FbarLabel%0AWHERE%20%7B%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FDogs%3E%20schema%3Aabout%20%3Ffoo%20%7D%0A%20%20UNION%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FCats%3E%20schema%3Aabout%20%3Fbar%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22de%22%20%7D%20%0A%7D%0A&#34;&gt;running the next one&lt;/a&gt; shows a &lt;code&gt;?fooLabel&lt;/code&gt; value of &amp;ldquo;Hund&amp;rdquo; and a &lt;code&gt;?barLabel&lt;/code&gt; value of &amp;ldquo;Hauskatze&amp;rdquo;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?fooLabel ?barLabel
WHERE {
  { &amp;lt;https://en.wikiquote.org/wiki/Dogs&amp;gt; schema:about ?foo }
  UNION
  { &amp;lt;https://en.wikiquote.org/wiki/Cats&amp;gt; schema:about ?bar }
  SERVICE wikibase:label { bd:serviceParam wikibase:language &amp;#34;de&amp;#34; } 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;A neat Wikidata Query Service trick that I only recently learned about is how the web interface lets you reset the default language. If I click on &amp;ldquo;English&amp;rdquo; in the upper right of the &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3FfooLabel%20%3FbarLabel%0AWHERE%20%7B%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FDogs%3E%20schema%3Aabout%20%3Ffoo%20%7D%0A%20%20UNION%0A%20%20%7B%20%3Chttps%3A%2F%2Fen.wikiquote.org%2Fwiki%2FCats%3E%20schema%3Aabout%20%3Fbar%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22de%22%20%7D%20%0A%7D%0A&#34;&gt;query screen&lt;/a&gt; I get a drop-down, searchable list of languages. If I pick &amp;ldquo;español&amp;rdquo; from this list, the query screen&amp;rsquo;s &amp;ldquo;Examples&amp;rdquo; button gets renamed as &amp;ldquo;Ejemplos&amp;rdquo;, &amp;ldquo;Help&amp;rdquo; becomes &amp;ldquo;Ayuda&amp;rdquo;, and so forth with the rest of the UI. When I run the  &lt;code&gt;[AUTO_LANGUAGE]&lt;/code&gt; query from above after doing this,  it shows a &lt;code&gt;?fooLabel&lt;/code&gt; value of &amp;ldquo;perro&amp;rdquo; and a &lt;code&gt;?barLabel&lt;/code&gt; value of &amp;ldquo;gato doméstico&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;With a made-up language code of &amp;ldquo;xyz&amp;rdquo; that it doesn&amp;rsquo;t recognize, it gives me the Q names from the &lt;code&gt;?foo&lt;/code&gt; and &lt;code&gt;?bar&lt;/code&gt; values as &lt;code&gt;?fooLabel&lt;/code&gt; and &lt;code&gt;?barLabel&lt;/code&gt; values:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;fooLabel  barLabel
--------  --------
Q144      Q146
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;wikibase:label&lt;/code&gt; service is not standard SPARQL, but with the tremendous amount of multi-lingual data available in Wikidata, it adds a lot of convenience that can trim down the length of your Wikidata queries.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1726278213205463105&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/111438192229406571&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://creativecommons.org/licenses/by/2.0/&#34;&gt;CC BY 2.0&lt;/a&gt; &lt;a href=&#34;https://www.flickr.com/photos/krossbow/52186070666/&#34;&gt;photo&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/photos/krossbow/&#34;&gt;F Delventhal&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Human-readable names in RDF</title>
      <link>https://www.bobdc.com/blog/rdflabels/</link>
      <pubDate>Sun, 29 Oct 2023 11:04:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdflabels/</guid>
      
      
      <description><div>Sometimes simple, sometimes not.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/tomatoPlants.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;250&#34; alt=&#34;labeled tomato plants&#34;/&gt;
&lt;h1 id=&#34;rdfslabel&#34;&gt;rdfs:label&lt;/h1&gt;
&lt;p&gt;First, reviewing some basics before I discuss the edge cases: resources in RDF are represented by URIs, and the spelling of a given URI often provides no clues about what the URI represents. For example, you wouldn&amp;rsquo;t know from looking at &lt;code&gt;http://www.wikidata.org/entity/Q144&lt;/code&gt; that it represents &amp;ldquo;dog&amp;rdquo; as a Wikipedia topic. (We&amp;rsquo;ll see below that this is for a good reason.)&lt;/p&gt;
&lt;p&gt;Subject-predicate-object triples use predicate-object pairs to describe the resources represented as URIs by each subject. (We sometimes forget that RDF stands for &amp;ldquo;Resource Description Framework&amp;rdquo;.) The most popular predicate is the one that gives us a human-readable name to tell us what resource the URI represents: &lt;code&gt;rdfs:label&lt;/code&gt;. People typically use it to assign an identifying name to a resource.&lt;/p&gt;
&lt;p&gt;You can optionally add a language tag to indicate the spoken language of the label value. Assigning multiple terms in different languages to the same resource makes it easier to build &lt;a href=&#34;../using-sparql-queries-from-nati#multilanguage&#34;&gt;multi-lingual&lt;/a&gt; applications.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt; . 

wd:Q144 rdfs:label &amp;#34;dog&amp;#34;@en . 
wd:Q144 rdfs:label &amp;#34;perro&amp;#34;@es . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This also reminds us why it&amp;rsquo;s a bad practice to include descriptive text as part of the URI: including &amp;ldquo;dog&amp;rdquo; in the URI &lt;code&gt;http://www.wikidata.org/entity/Q144&lt;/code&gt; would only help people who know English, and including &amp;ldquo;perro&amp;rdquo; would only help people who know Spanish.&lt;/p&gt;
&lt;h1 id=&#34;schemaname&#34;&gt;schema:name&lt;/h1&gt;
&lt;p&gt;While most schemas and ontologies are built around a specific domain such as a business sector or an academic discipline, the very successful &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt; is much broader, covering many aspects of ordinary life and commerce. Unlike most other vocabularies, schema.org does not use &lt;code&gt;rdfs:label&lt;/code&gt; for names, but its own &lt;a href=&#34;https://schema.org/name&#34;&gt;&lt;code&gt;schema:name&lt;/code&gt;&lt;/a&gt; property instead. The discussion &lt;a href=&#34;https://github.com/schemaorg/schemaorg/issues/1762&#34;&gt;What is the difference between schema:name and rdfs:label?&lt;/a&gt; on a schema.org development issue page explains why: many processors that can read schema.org data from a web page won&amp;rsquo;t know about RDF and won&amp;rsquo;t recognize &lt;code&gt;rdfs:label&lt;/code&gt;. As part of that discussion, &lt;a href=&#34;https://github.com/danbri&#34;&gt;Dan Brickley&lt;/a&gt; mentions adding a subPropertyOf assertion to the definition of &lt;code&gt;schema:name&lt;/code&gt;, which we see right in the property&amp;rsquo;s definition in the RDFS schema that you can download from the schema.org &lt;a href=&#34;https://schema.org/docs/developers.html&#34;&gt;Developers&lt;/a&gt; page:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;schema:name a rdf:Property ;
    rdfs:label &amp;#34;name&amp;#34; ;
    rdfs:comment &amp;#34;The name of the item.&amp;#34; ;
    rdfs:subPropertyOf rdfs:label ;
    owl:equivalentProperty dcterms:title ;
    schema:domainIncludes schema:Thing ;
    schema:rangeIncludes schema:Text .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This is a perfect response to RDF geeks who complain that schema.org should have used &lt;code&gt;rdfs:label&lt;/code&gt; instead of making up its own &lt;code&gt;schema:name&lt;/code&gt; property—for a system that can parse full RDF and do even minimal inferencing, a &lt;code&gt;schema:name&lt;/code&gt; value counts as an &lt;code&gt;rdfs:label&lt;/code&gt; value. It says so right on the fourth line of the above excerpt.&lt;/p&gt;
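&lt;p&gt;As a minimal sketch of that payoff (assuming a SPARQL engine with RDFS inferencing enabled and the schema.org RDFS schema loaded), a query that asks only for &lt;code&gt;rdfs:label&lt;/code&gt; values will also match &lt;code&gt;schema:name&lt;/code&gt; values:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs:   &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX schema: &amp;lt;https://schema.org/&amp;gt;

# With rdfs:subPropertyOf inferencing on, data that only says
# { ?x schema:name &amp;#34;some name&amp;#34; } satisfies this pattern too.
SELECT ?resource ?label
WHERE { ?resource rdfs:label ?label }
&lt;/code&gt;&lt;/pre&gt;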
&lt;h1 id=&#34;a-brief-detour-dctitle-and-skospreflabel&#34;&gt;A brief detour: dc:title and skos:prefLabel&lt;/h1&gt;
&lt;p&gt;The schema excerpt above also includes an assertion that &lt;code&gt;schema:name&lt;/code&gt; is an &lt;code&gt;owl:equivalentProperty&lt;/code&gt; to the &lt;a href=&#34;https://www.dublincore.org/&#34;&gt;Dublin Core&lt;/a&gt; &lt;code&gt;dcterms:title&lt;/code&gt; property. The Dublin Core vocabulary is almost as old as the web itself, predating schema.org by sixteen years. That vocabulary&amp;rsquo;s specification describes both &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#title&#34;&gt;&lt;code&gt;dcterms:title&lt;/code&gt;&lt;/a&gt; and the property of which it is a subproperty, &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/elements11/title/&#34;&gt;&lt;code&gt;dc:title&lt;/code&gt;&lt;/a&gt;,  as &amp;ldquo;A name given to the resource&amp;rdquo;, which supports Dan&amp;rsquo;s note that the &lt;code&gt;schema:name&lt;/code&gt; property means the same thing as &lt;code&gt;dc:title&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I think of the Dublin Core terms as slightly narrower than that. The &lt;a href=&#34;https://en.wikipedia.org/wiki/Dublin_Core&#34;&gt;Wikipedia page&lt;/a&gt; for Dublin Core describes it as a set of &amp;ldquo;metadata items for describing digital or physical resources&amp;rdquo;, which aligns it with &lt;code&gt;rdfs:label&lt;/code&gt;, but Dublin Core was first developed in response to the rapidly expanding ideas of what constituted &amp;ldquo;publishing&amp;rdquo; in the early days of the web, so I&amp;rsquo;ve always thought of it as by and for the publishing industry. (I once took part in a standards group that developed standards more specifically for the magazine industry, and when they needed separate properties for a given issue&amp;rsquo;s publication date and newsstand date, making each a subproperty of &lt;code&gt;dcterms:date&lt;/code&gt; was a perfect use case for RDFS subproperties.) I suppose the word &amp;ldquo;title&amp;rdquo; also makes me think of a label for a book, musical album, or other published work.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.bobdc.com/blog/skosibm/&#34;&gt;SKOS&lt;/a&gt; &lt;a href=&#34;https://www.w3.org/TR/skos-reference/#L1304&#34;&gt;&lt;code&gt;skos:prefLabel&lt;/code&gt;&lt;/a&gt; property, which names something&amp;rsquo;s preferred label (as opposed to alternative or hidden labels, which are additional SKOS properties), may seem equivalent to &lt;code&gt;rdfs:label&lt;/code&gt;. I don&amp;rsquo;t think of it as suitable for just any existing or imaginary resource, the way &lt;code&gt;rdfs:label&lt;/code&gt; is, but instead for naming concepts within the taxonomies and thesauruses that SKOS was designed to help manage. &lt;a href=&#34;https://www.w3.org/2009/08/skos-reference/skos.html#prefLabel&#34;&gt;The SKOS specification&lt;/a&gt; does say that it&amp;rsquo;s a subproperty of &lt;code&gt;rdfs:label&lt;/code&gt;, so this supports the idea that it&amp;rsquo;s a specialized version of that, but the &lt;a href=&#34;https://www.w3.org/2009/08/skos-reference/skos.rdf&#34;&gt;actual SKOS schema&lt;/a&gt; shows that &lt;code&gt;skos:prefLabel&lt;/code&gt; does not have an &lt;code&gt;rdfs:domain&lt;/code&gt; of &lt;code&gt;skos:Concept&lt;/code&gt; (that is, it&amp;rsquo;s not defined as being used only for describing labels of Concepts) as I had expected. Still, it was defined as part of SKOS, and SKOS is about managing vocabulary terms and their relationships and other metadata, with concepts being the central organizing unit for managing these terms and their metadata.&lt;/p&gt;
&lt;p&gt;One person&amp;rsquo;s SKOS taxonomy might be another person&amp;rsquo;s  hierarchical class structure; converting one to the other with a SPARQL CONSTRUCT query has helped many people take advantage of available data that otherwise wasn&amp;rsquo;t a perfect fit for their system. This typically means converting between concept &lt;code&gt;skos:prefLabel&lt;/code&gt; values and class &lt;code&gt;rdfs:label&lt;/code&gt; values.&lt;/p&gt;
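&lt;p&gt;A sketch of such a conversion (the choice of &lt;code&gt;owl:Class&lt;/code&gt; as the target class type is an assumption for illustration; adjust it to your own model):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX owl:  &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt;

# Each SKOS concept becomes a class whose rdfs:label value
# comes from the concept&amp;#39;s skos:prefLabel value.
CONSTRUCT {
  ?concept a owl:Class ;
           rdfs:label ?label .
}
WHERE {
  ?concept a skos:Concept ;
           skos:prefLabel ?label .
}
&lt;/code&gt;&lt;/pre&gt;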
&lt;p&gt;How do we query for all these types of labels? Generally, the same way we query for any other RDF values, but in my next blog entry I&amp;rsquo;ll talk about a built-in special service in Wikidata that lets you replace several lines of label-retrieving SPARQL code with a single line.&lt;/p&gt;
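&lt;p&gt;When inferencing isn&amp;rsquo;t available, one sketch of a workaround is a SPARQL 1.1 property path that lists the label properties as alternatives:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs:   &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX schema: &amp;lt;https://schema.org/&amp;gt;
PREFIX skos:   &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;

# Match any of the three label properties without inferencing.
SELECT ?resource ?name
WHERE { ?resource rdfs:label|schema:name|skos:prefLabel ?name }
&lt;/code&gt;&lt;/pre&gt;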
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1718648951870501221&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/111318983999801365&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://creativecommons.org/licenses/by/2.0/&#34;&gt;CC BY 2.0&lt;/a&gt; &lt;a href=&#34;https://www.flickr.com/photos/krossbow/52186070666/&#34;&gt;photo&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/photos/krossbow/&#34;&gt;F Delventhal&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>My brief tenor banjo career</title>
      <link>https://www.bobdc.com/blog/tenorbanjo/</link>
      <pubDate>Sun, 24 Sep 2023 10:50:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/tenorbanjo/</guid>
      
      
      <description><div>Brief but symphonic.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/wsobanjo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;250&#34; alt=&#34;banjo with the WSO&#34;/&gt;
&lt;p&gt;During the first pandemic summer I asked my wife for a cheap &lt;a href=&#34;https://en.wikipedia.org/wiki/Banjo#Tenor_banjo&#34;&gt;tenor banjo&lt;/a&gt; for my birthday. These are tuned like a viola and smaller than the traditional five-string banjos  used for bluegrass. Instead of fingerpicking patterns with the right hand like bluegrass banjos are famous for, tenor banjo players strum chords with a pick for volume as a rhythm instrument. When you hear a banjo in old-timey jazz, that&amp;rsquo;s a tenor. (Around the time the ground was being laid for bebop, &lt;a href=&#34;../cmdlineowl&#34;&gt;Charlie Christian&lt;/a&gt; showed that an amplified electric guitar could do a lot more than a banjo, so banjos faded from use in jazz groups outside of trad and Dixieland circles.) It was also fun knowing that the chords that I learned for the tenor banjo could work on the viola; composer Jessie Montgomery&amp;rsquo;s wonderful piece &lt;a href=&#34;https://www.youtube.com/watch?v=-ZmVRWjpNxw&#34;&gt;Strum&lt;/a&gt; was also an inspiration toward thinking about plucking chords on this otherwise bowed instrument.&lt;/p&gt;
&lt;p&gt;When the &lt;a href=&#34;https://waynesborosymphonyorchestra.org/&#34;&gt;Waynesboro Symphony Orchestra&lt;/a&gt; was rehearsing William Grant Still&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Afro-American_Symphony&#34;&gt;Afro-American Symphony&lt;/a&gt; in February of last year, after the third movement the conductor &lt;a href=&#34;https://peterwilsonmusician.com/&#34;&gt;Peter Wilson&lt;/a&gt; told us that the actual performance would include a harp, a xylophone, and a banjo. On the next break I went up to him and said “that would be a tenor banjo, right? Because of the old time jazz effect? If so, I have one.” (“Rhapsody in Blue” by Still&amp;rsquo;s friend and arranging student George Gershwin also includes a tenor banjo.) He said “bring it!”&lt;/p&gt;
&lt;p&gt;Here is that movement as we performed it at Waynesboro&amp;rsquo;s First Presbyterian Church. To make it easier for the audience to hear the banjo, Peter had me stand practically right in front of him for this movement. (You will see links in YouTube to the other movements, where I am in the back row of the violas.)&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; style=&#34;display: block; margin: 20pt auto 20pt auto;&#34; src=&#34;https://www.youtube.com/embed/fekt-ErKJ2I&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;h2 id=&#34;tenor-banjo-chord-cheat-sheet&#34;&gt;Tenor Banjo Chord Cheat Sheet&lt;/h2&gt;
&lt;p&gt;To get to know the instrument when I first got it I went through several books and many Internet charts of chords to learn. I eventually scrawled a single-page chart of the chord forms that I thought were most useful. I wanted to share that with others, but I didn&amp;rsquo;t want to share the actual scrawl, so I made a neater PDF version of my &lt;a href=&#34;https://www.bobdc.com/miscfiles/tenorBanjoChords.pdf&#34;&gt;Tenor Banjo Chord Cheat Sheet&lt;/a&gt;. The first three rows show a few forms of major, minor, and (dominant) seventh chords, which should be enough for people doing most simple songs. The last three rows show half-diminished, sharp 5, and diminished chords for people playing more typical jazz standards.&lt;/p&gt;
&lt;p&gt;The chart shows all chords in the open position so that at least one string is played without any left-hand fingers pressing it down, but they all work as bar chords if you play them higher on the neck and press your left first finger across the neck where you see open strings in each chord diagram. I also wrote &amp;ldquo;R&amp;rdquo; in each chord diagram under any strings that are playing the root of the chord. That way, if I need (for example) a B flat minor chord, which is not shown on the chart, I can just find the note B flat somewhere on the neck and then pick the minor chord shape from the cheat sheet that is built around the chord&amp;rsquo;s root being on that string.&lt;/p&gt;
&lt;p&gt;I hope this chart can help someone else who decides to try this fun instrument with an important role in early jazz history. My one live gig with it was certainly an interesting experience.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1705964565853344218&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/111120788859709858&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Nicer date and time handling in SPARQL 1.2</title>
      <link>https://www.bobdc.com/blog/sparql12time/</link>
      <pubDate>Fri, 18 Aug 2023 10:25:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql12time/</guid>
      
      
      <description><div>Add, subtract, and ADJUST() dates and times. </div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/calendarsparql.png&#34; alt=&#34;[SPARQL logo and calendar]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34; width=&#34;200&#34;/&gt;
&lt;p&gt;SPARQL 1.1 has been with us for about ten years. Work on SPARQL 1.2 is &lt;a href=&#34;https://github.com/w3c/sparql-dev&#34;&gt;currently underway&lt;/a&gt;, and one nice set of improvements will let us do much more with date and time values.&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t realize how minimal SPARQL 1.1&amp;rsquo;s ability to handle these was until I saw the introductory material in the &lt;a href=&#34;https://github.com/w3c/sparql-dev/blob/main/SEP/SEP-0002/sep-0002.md&#34;&gt;Add Support Durations, Dates, and Times&lt;/a&gt; issue recently added to the SPARQL 1.2 development discussion. I had never noticed how the SPARQL 1.1 Recommendation &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#otherTermConstraints&#34;&gt;explicitly says&lt;/a&gt; that it supports the &lt;code&gt;xsd:dateTime&lt;/code&gt; data type without mentioning &lt;code&gt;xsd:date&lt;/code&gt; or any of the other related date and time data types.&lt;/p&gt;
&lt;p&gt;With this support, SPARQL 1.1 lets you &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#expressions&#34;&gt;compare &lt;code&gt;xsd:dateTime&lt;/code&gt; values&lt;/a&gt; so that you can filter for events that occur before or after a particular point in time (see &lt;a href=&#34;http://www.learningsparql.com/2ndeditionexamples/ex230.rq&#34;&gt;this example&lt;/a&gt; from my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;) or between two points in time. SPARQL 1.2 will add many related options, including the ability to do date arithmetic.&lt;/p&gt;
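&lt;p&gt;For example, a filter like the following (a sketch; the &lt;code&gt;d:when&lt;/code&gt; property is made up for illustration) keeps only events after a given moment:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
PREFIX d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt;

# Keep events whose timestamp is after the start of 2023.
SELECT ?event
WHERE {
  ?event d:when ?dateTime .
  FILTER (?dateTime &amp;gt; &amp;#34;2023-01-01T00:00:00&amp;#34;^^xsd:dateTime)
}
&lt;/code&gt;&lt;/pre&gt;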
&lt;p&gt;My experiments with subtracting one date or date-time value from another had different levels of success with different SPARQL processors because some of these processors have added degrees of support just because it was useful to have, even though it wasn&amp;rsquo;t explicitly required by the standard. One example is &lt;a href=&#34;https://graphdb.ontotext.com/documentation/10.3/time-functions.html&#34;&gt;Ontotext&amp;rsquo;s added support for date and time manipulation&lt;/a&gt;. Another is the way that Wikidata&amp;rsquo;s SPARQL endpoint can subtract &lt;code&gt;xsd:date&lt;/code&gt; values, although it &lt;a href=&#34;https://query.wikidata.org/#PREFIX%20xsd%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E%0A%0ASELECT%20%2a%20WHERE%20%7B%0A%20%20BIND%20%28%28%222023-10-15%22%5E%5Exsd%3Adate%20-%20%222023-10-12%22%5E%5Exsd%3Adate%29%20AS%20%24dateTest%29%0A%7D%0A&#34;&gt;returns a decimal number&lt;/a&gt;; with Ontotext and the proposed 1.2 standard it returns a &lt;a href=&#34;https://www.w3.org/TR/xmlschema-2/#duration&#34;&gt;duration&lt;/a&gt; value. Having more extensive support for working with dates and times right in the SPARQL 1.2 standard will ensure that all the SPARQL processors support it and that they all use a consistent syntax and return consistent data types. This is why we use standards!&lt;/p&gt;
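&lt;p&gt;The subtraction itself can be sketched with the same dates as the linked Wikidata query:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;

# Wikidata currently binds ?diff to a decimal day count; under
# the proposed 1.2 behavior it is a duration (here, three days).
SELECT ?diff
WHERE {
  BIND ((&amp;#34;2023-10-15&amp;#34;^^xsd:date - &amp;#34;2023-10-12&amp;#34;^^xsd:date) AS ?diff)
}
&lt;/code&gt;&lt;/pre&gt;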
&lt;p&gt;The proposed 1.2 additions will also let us add and subtract durations from &lt;code&gt;xsd:time&lt;/code&gt; and &lt;code&gt;xsd:dateTime&lt;/code&gt; values. And, it will give us a new function that builds on these: &lt;code&gt;ADJUST()&lt;/code&gt;, which adjusts &lt;code&gt;xsd:dateTime&lt;/code&gt; values based on their time zones.&lt;/p&gt;
&lt;p&gt;The latest release of &lt;a href=&#34;https://jena.apache.org/&#34;&gt;Apache Jena&lt;/a&gt; supports this new function, so I tried it out. The following Turtle data shows the start and end time of a meeting that takes place in the New York City time zone:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix t:   &amp;lt;http://purl.org/tio/ns#&amp;gt; . 

d:meeting1 t:starts &amp;#34;2023-10-14T12:30:00-05:00&amp;#34;^^xsd:dateTime ;
           t:ends   &amp;#34;2023-10-14T15:00:00-05:00&amp;#34;^^xsd:dateTime . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The query below uses the &lt;code&gt;ADJUST()&lt;/code&gt; function to calculate the Los Angeles start and end time of the same meeting. The time zone of the New York meeting was indicated with &lt;code&gt;-05:00&lt;/code&gt;, so the LA equivalent is &lt;code&gt;-08:00&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX t: &amp;lt;http://purl.org/tio/ns#&amp;gt; 
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;

SELECT ?NYCStartTime ?LAStartTime
WHERE
{
  ?mtg t:starts ?NYCStartTime . 
  BIND (ADJUST(?NYCStartTime, xsd:dayTimeDuration(&amp;#34;-PT8H&amp;#34;)) AS ?LAStartTime)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-----------------------------------------------------------------------------------------
| NYCStartTime                              | LAStartTime                               |
=========================================================================================
| &amp;#34;2023-10-14T12:30:00-05:00&amp;#34;^^xsd:dateTime | &amp;#34;2023-10-14T09:30:00-08:00&amp;#34;^^xsd:dateTime |
-----------------------------------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Read the  &lt;a href=&#34;https://github.com/w3c/sparql-dev/blob/main/SEP/SEP-0002/sep-0002.md&#34;&gt;Add Support Durations, Dates, and Times&lt;/a&gt; issue mentioned above for more details about the expanded support for manipulation of date and time data. You&amp;rsquo;ll see some great new things that we&amp;rsquo;ll be able to do with a lot of existing data, especially with all those date values in Wikidata.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to  &lt;a href=&#34;https://twitter.com/bobdc/status/1692545508277993594&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/110911115060935960&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Passing your own data to use in Wikidata visualizations</title>
      <link>https://www.bobdc.com/blog/your-values-wikidata/</link>
      <pubDate>Sun, 23 Jul 2023 11:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/your-values-wikidata/</guid>
      
      
<description><div>A VALUES-based approach.</div><div>&lt;p&gt;I&amp;rsquo;ve had a decent understanding of what the &lt;code&gt;VALUES&lt;/code&gt; keyword can do for a while (see &lt;a href=&#34;https://www.bobdc.com/blog/sparql-11s-new-values-keyword/&#34;&gt;SPARQL 1.1&amp;rsquo;s new VALUES keyword&lt;/a&gt; and &amp;ldquo;Creating Tables of Values in your Queries&amp;rdquo; in my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;) but lately I&amp;rsquo;ve gained a greater appreciation of ways to use it. For example, &lt;a href=&#34;https://www.bobdc.com/blog/spacy/&#34;&gt;last month&lt;/a&gt; I used it to map codes assigned by an entity recognition tool to schema.org classes. This month I found a nice way to use it to control one of Wikidata&amp;rsquo;s &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Wikidata_Query_Help/Result_Views&#34;&gt;many cool data visualization possibilities&lt;/a&gt;. By sending the Wikidata query service some data in a &lt;code&gt;VALUES&lt;/code&gt; clause in my query, I don&amp;rsquo;t have to rely completely on what&amp;rsquo;s in Wikidata to drive the visualization.&lt;/p&gt;
&lt;p&gt;Wikidata&amp;rsquo;s &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Wikidata_Query_Help/Result_Views#Timeline&#34;&gt;timeline visualization&lt;/a&gt; lets you view a chart of events displayed in the order in which they happened&amp;ndash;for instance, the launch date of space probes, which Wikidata&amp;rsquo;s &lt;a href=&#34;https://query.wikidata.org/#%23defaultView%3ATimeline%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Flaunchdate%20%28SAMPLE%28%3Fimage%29%20AS%20%3Fimage%29%0AWHERE%0A%7B%0A%09%3Fitem%20wdt%3AP31%20wd%3AQ26529%20.%0A%20%20%20%20%3Fitem%20wdt%3AP619%20%3Flaunchdate%20.%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%20%20%20%20OPTIONAL%20%7B%20%3Fitem%20wdt%3AP18%20%3Fimage%20%7D%0A%7D%0AGROUP%20BY%20%3Fitem%20%3FitemLabel%20%3Flaunchdate&#34;&gt;sample timeline query&lt;/a&gt; asks for. Requesting two dates in your query result adds bars to the display to visually show the elapsed time between those dates.&lt;/p&gt;
&lt;p&gt;As their demonstration shows, this is pretty simple if you can come up with a query for the things you want displayed on the chart. But what if the entities you want to see there don&amp;rsquo;t have anything in common that you can query for?&lt;/p&gt;
&lt;p&gt;In my case, I wanted to create a timeline of Shakespeare and certain famous people who lived in his time. I wanted to see two composers, two scientists, and the &amp;ldquo;statesman&amp;rdquo; Walter Raleigh, but I knew of no single query that would return these six people. (I read books on music and science that mention one or the other, so I thought it would be nice to see just how contemporary they were.)&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;VALUES&lt;/code&gt; query made it easy. I used Wikipedia and Wikidata to find the identifier for each person (for example, for one of them, I clicked &lt;a href=&#34;https://www.wikidata.org/wiki/Q179277&#34;&gt;Wikidata item&lt;/a&gt; on &lt;a href=&#34;https://en.wikipedia.org/wiki/Giovanni_Pierluigi_da_Palestrina&#34;&gt;Palestrina&amp;rsquo;s Wikipedia page&lt;/a&gt; and saw from its URL that his identifier is Q179277) and added them to the &lt;code&gt;VALUES&lt;/code&gt; list with a &lt;code&gt;wd:&lt;/code&gt; prefix. The comment at the top of the query tells Wikidata that I want to see the Timeline visualization:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;#defaultView:Timeline
SELECT ?name ?dateOfBirth ?dateOfDeath 
WHERE {
  VALUES ?person { wd:Q692 wd:Q179277 wd:Q53068 wd:Q307 
                   wd:Q9191 wd:Q189144 }
  ?person wdt:P569 ?dateOfBirth ;
          rdfs:label ?name ; 
          wdt:P570 ?dateOfDeath .
  FILTER ( lang(?name) = &amp;#34;en&amp;#34; )
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can &lt;a href=&#34;https://query.wikidata.org/#%23defaultView%3ATimeline%0ASELECT%20%3Fname%20%3FdateOfBirth%20%3FdateOfDeath%20%0AWHERE%20%7B%0A%20%20VALUES%20%3Fperson%20%7B%20wd%3AQ692%20wd%3AQ179277%20wd%3AQ53068%20wd%3AQ307%20%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wd%3AQ9191%20wd%3AQ189144%20%7D%0A%0A%20%20%3Fperson%20wdt%3AP569%20%3FdateOfBirth%20%3B%0A%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3Fname%20%3B%20%0A%20%20%20%20%20%20%20%20%20%20wdt%3AP570%20%3FdateOfDeath%20.%0A%20%20FILTER%20%28%20lang%28%3Fname%29%20%3D%20%22en%22%20%29%0A%7D%0A&#34;&gt;try it out yourself&lt;/a&gt;. Here is the result:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/shakespeareContemporaries.png&#34; border=&#34;0&#34; align=&#34;center&#34; alt=&#34;Timeline of Shakespeare contemporaries&#34;/&gt;
&lt;p&gt;As I showed in last month&amp;rsquo;s blog entry, a &lt;code&gt;VALUES&lt;/code&gt; clause can hold two-dimensional sets of data in addition to simple lists like my query about Shakespeare&amp;rsquo;s contemporaries. This enables even more possibilities when using your own data with Wikidata&amp;rsquo;s wide choice of visualization tools.&lt;/p&gt;
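&lt;p&gt;If you build these queries from a script instead of by hand, a &lt;code&gt;VALUES&lt;/code&gt; clause is easy to assemble from a plain Python list. Here is a minimal sketch of that idea; the helper function names are my own invention, not part of any SPARQL library:&lt;/p&gt;

```python
# Assemble SPARQL VALUES clauses from Python lists. The helper names
# (values_clause, values_table) are my own, not from any library.

def values_clause(var, qids):
    """Return a one-variable VALUES clause binding var to wd: items."""
    items = " ".join("wd:" + q for q in qids)
    return "VALUES ?%s { %s }" % (var, items)

def values_table(variables, rows):
    """Return a two-dimensional VALUES clause; each row is a list."""
    header = " ".join("?" + v for v in variables)
    body = " ".join("(%s)" % " ".join(row) for row in rows)
    return "VALUES (%s) { %s }" % (header, body)

# The six Q-ids from the Shakespeare-contemporaries query:
people = ["Q692", "Q179277", "Q53068", "Q307", "Q9191", "Q189144"]
clause = values_clause("person", people)
print(clause)
```

&lt;p&gt;The generated clause can then be pasted into the rest of the query before sending it to the query service.&lt;/p&gt;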
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1683135977139609600&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/110764092135012496&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Entity recognition from within a SPARQL query</title>
      <link>https://www.bobdc.com/blog/spacy/</link>
      <pubDate>Sun, 25 Jun 2023 10:53:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/spacy/</guid>
      
      
      <description><div>Using my new employer&#39;s excellent free product.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/ontotextspacy.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Ontotext and spacy logos&#34; width=&#34;180&#34;/&gt;
&lt;p&gt;I &lt;a href=&#34;https://twitter.com/bobdc/status/1663563990302359562&#34;&gt;recently announced&lt;/a&gt; that I have joined &lt;a href=&#34;https://www.ontotext.com/&#34;&gt;Ontotext&lt;/a&gt; as a full-time Senior Tech Writer. I have admired their free &lt;a href=&#34;https://www.ontotext.com/products/graphdb/&#34;&gt;GraphDB&lt;/a&gt; triplestore for a long time (for example, I &lt;a href=&#34;https://www.bobdc.com/blog/geosparqlgraphdb/&#34;&gt;wrote about&lt;/a&gt; how well it supports the GeoSPARQL geospatial extension in October of 2020) and I am now learning about all the great capabilities of their &lt;a href=&#34;https://www.ontotext.com/products/&#34;&gt;commercial products&lt;/a&gt;, such as the scalability of GraphDB Enterprise.&lt;/p&gt;
&lt;p&gt;As always, though, in this blog I will focus on free RDF-related software, so this month I will write about a cool feature of GraphDB Free that I learned about just last week: its use of the &lt;a href=&#34;https://spacy.io&#34;&gt;spaCy&lt;/a&gt; library to let you do text analysis and entity recognition from within a SPARQL query.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://graphdb.ontotext.com/documentation/10.2/text-mining-plugin.html&#34;&gt;Text Mining Plugin&lt;/a&gt; page of the GraphDB documentation describes text mining protocols that it supports: spaCy, GATE Cloud, and Ontotext&amp;rsquo;s Tag API. The spaCy section of that page shows the two lines necessary to create and then run a spaCy client with  &lt;code&gt;docker&lt;/code&gt;, and then it shows a SPARQL &lt;code&gt;INSERT DATA&lt;/code&gt; command that establishes a connection from GraphDB to the spaCy client. Once that&amp;rsquo;s done you&amp;rsquo;re ready to run queries that tell spaCy to analyze content that you pass to it.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://graphdb.ontotext.com/documentation/10.2/text-mining-plugin.html#find-spacy-entities-through-graphdb&#34;&gt;Find spaCy entities through GraphDB&lt;/a&gt; section that follows shows a query that passes a paragraph of text about Dyson Vacuum Cleaners to spaCy and returns several columns of information about how spaCy annotated it to indicate the entities that it found. Beneath that on the &lt;a href=&#34;https://graphdb.ontotext.com/documentation/10.2/text-mining-plugin.html&#34;&gt;Text Mining Plugin&lt;/a&gt; page you can see the results: it identifies &amp;ldquo;Dyson Ltd.&amp;rdquo; as an organization, James Dyson as a person, Singapore as a geopolitical entity, and more. (While that documentation shows six of the returned rows, I got twelve when I ran it.)&lt;/p&gt;
&lt;p&gt;That query was a &lt;code&gt;SELECT&lt;/code&gt; query. I wanted to run a &lt;code&gt;CONSTRUCT&lt;/code&gt; query that would create new triples about some of the identified things. If it recognized people, places, and organizations, I wanted it to create triples making those instances &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt; classes. Revising the &lt;code&gt;SELECT&lt;/code&gt; query mentioned above, I ended up with this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# getting triples from endpoint with this query: 
# curl -H &amp;#34;Accept: text/turtle&amp;#34; --data-urlencode \
# &amp;#34;query@spacytest.rq&amp;#34; http://bob-inspiron:7200/repositories/my_repo

PREFIX txtm:      &amp;lt;http://www.ontotext.com/textmining#&amp;gt;
PREFIX txtm-inst: &amp;lt;http://www.ontotext.com/textmining/instance#&amp;gt;
PREFIX s:         &amp;lt;https://schema.org/&amp;gt;
PREFIX rdfs:      &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX dc:        &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;

# Don&amp;#39;t forget to start the spaCy server and run the INSERT query
# that establishes the connection to it before running this query. 

CONSTRUCT {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText .

        ?entityID a ?soClassname ; 
                  rdfs:label ?annotationText . 
        # The annotation has a related resource: this
        # new resource being declared. 
        ?annotation dc:relation ?entityID .
}
WHERE {
  ?searchDocument a txtm-inst:localSpacy;
     txtm:text &amp;#39;&amp;#39;&amp;#39;Dyson Ltd. plans to hire 450 people globally, with
     more than half the recruits in its headquarters in Singapore.
     The company best known for its vacuum cleaners and hand dryers will
     add 250 engineers in the city-state. This comes short before the founder
     James Dyson announced he is moving back to the UK after moving residency
     to Singapore. Dyson, a prominent Brexit supporter who is worth US$29
     billion, faced criticism from British lawmakers for relocating his
     company&amp;#39;&amp;#39;&amp;#39; .

    GRAPH txtm-inst:localSpacy {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText ;
                    txtm:annotationKey ?annotationKey;
                    txtm:annotationType ?annotationType ;
    }
    VALUES (?annotationType ?soClassname) {
      (&amp;#34;ORG&amp;#34;    s:Organization) 
      (&amp;#34;GPE&amp;#34;    s:AdministrativeArea)
      (&amp;#34;PERSON&amp;#34; s:Person)
    }

    # Create a URI to use as the subject of each newly
    # recognized entity being declared as a schema.org class. 
    BIND(UUID() AS ?entityID)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause grabs the information generated by spaCy like the &lt;code&gt;WHERE&lt;/code&gt; clause in the original &lt;code&gt;SELECT&lt;/code&gt; query in the GraphDB documentation does. It also uses SPARQL&amp;rsquo;s &lt;a href=&#34;https://www.bobdc.com/blog/sparql-11s-new-values-keyword/&#34;&gt;VALUES&lt;/a&gt; clause to map spaCy annotation types to schema.org classes. (With more input text, I&amp;rsquo;m sure spaCy would recognize more types of entities, so you could easily extend this &lt;code&gt;VALUES&lt;/code&gt; list to accommodate those.) Then instead of a &lt;code&gt;SELECT&lt;/code&gt; clause, I have a &lt;code&gt;CONSTRUCT&lt;/code&gt; to create triples saying that the recognized entities are instances of the appropriate classes.&lt;/p&gt;
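&lt;p&gt;The same type-to-class mapping is easy to sketch outside of SPARQL. This Python sketch mirrors what the &lt;code&gt;VALUES&lt;/code&gt; clause and &lt;code&gt;BIND(UUID())&lt;/code&gt; do in the query above; the annotation rows are made-up samples in the shape (text, type), not output from any particular tool:&lt;/p&gt;

```python
# Map entity-annotation type codes to schema.org class names, mirroring
# the VALUES clause and BIND(UUID()) in the CONSTRUCT query above. The
# annotation rows here are made-up samples in the shape (text, type).
import uuid

TYPE_TO_CLASS = {
    "ORG": "s:Organization",
    "GPE": "s:AdministrativeArea",
    "PERSON": "s:Person",
}

def triples_for(annotations):
    """Yield (subject, predicate, object) triples for mapped types."""
    for text, ann_type in annotations:
        cls = TYPE_TO_CLASS.get(ann_type)
        if cls is None:
            continue  # annotation types with no mapping are skipped
        entity_id = "urn:uuid:" + str(uuid.uuid4())  # like BIND(UUID())
        yield (entity_id, "a", cls)
        yield (entity_id, "rdfs:label", text)

sample = [("Dyson Ltd.", "ORG"), ("Singapore", "GPE"),
          ("James Dyson", "PERSON"), ("450", "CARDINAL")]
result = list(triples_for(sample))
```

&lt;p&gt;As in the query, unmapped annotation types simply drop out, and extending the mapping is just a matter of adding dictionary entries.&lt;/p&gt;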
&lt;p&gt;This is only a beginning. For example, spaCy recognizes Singapore as a geopolitical entity in two different places, but it doesn&amp;rsquo;t know that the two identified entities are the same thing, so my query creates a separate &lt;code&gt;s:AdministrativeArea&lt;/code&gt; instance for each. There are tools that could be used further down the pipeline to straighten this out and maybe connect it to &lt;code&gt;http://www.wikidata.org/entity/Q334&lt;/code&gt;, the Wikidata identifier for Singapore; because this &lt;code&gt;CONSTRUCT&lt;/code&gt; query creates triples instead of a table of results, it will be much easier to pass the result of its work down a pipeline to other tools that can do further enhancements.&lt;/p&gt;
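&lt;p&gt;A first pipeline step for that cleanup could be as simple as merging annotations that share the same text and type. This is a naive sketch of my own; real entity linking would also normalize labels and look each entity up in an external source such as Wikidata:&lt;/p&gt;

```python
# Merge entity annotations that share the same text and type so that,
# for example, two "Singapore" mentions become one entity. A naive
# sketch: real entity linking would also normalize labels and consult
# an external source such as Wikidata.

def merge_entities(annotations):
    """Return a dict mapping each distinct (text, type) to its count."""
    merged = {}
    for text, ann_type in annotations:
        key = (text, ann_type)
        merged[key] = merged.get(key, 0) + 1
    return merged

counts = merge_entities([("Singapore", "GPE"), ("Singapore", "GPE"),
                         ("James Dyson", "PERSON")])
# The two Singapore annotations collapse into a single entity.
```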
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1672984947034845185&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/110605478030184648&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Getting ChatGPT to turn a flat vocabulary list into a hierarchical taxonomy</title>
      <link>https://www.bobdc.com/blog/chatgpttaxonomy/</link>
      <pubDate>Sat, 20 May 2023 11:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/chatgpttaxonomy/</guid>
      
      
      <description><div>ChatGPT-3, Chat GPT-4.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/labels.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SKOS diagram&#34; width=&#34;320&#34;/&gt;
&lt;p&gt;I was catching up with my old friend &lt;a href=&#34;https://www.linkedin.com/in/paulprescod/&#34;&gt;Paul Prescod&lt;/a&gt; the other day. We have not only known each other since the early days of XML, but actually before that: &amp;ldquo;since XML was a &lt;a href=&#34;https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language&#34;&gt;four-letter word&lt;/a&gt;&amp;rdquo;, to quote Paul.&lt;/p&gt;
&lt;p&gt;One current popular topic we discussed is where LLM tools such as ChatGPT can add value in the data pipelines that we have worked with. We&amp;rsquo;ve all seen blog posts where people got ChatGPT to create code in their favorite languages; Paul and I, as always, were focused on how it could improve content and content metadata. I&amp;rsquo;ve often said that the point of metadata is to add value to content, so automating the creation of useful metadata is automating the addition of value to content.&lt;/p&gt;
&lt;p&gt;Automating the assignment of keyword terms from a controlled vocabulary to content, in order to improve content findability, has been a classic goal for decades. While talking to Paul, I wondered whether the controlled vocabulary itself could be improved by ChatGPT, specifically by turning flat lists into hierarchies.&lt;/p&gt;
&lt;p&gt;How does this add value? Imagine that Sidney at the hypothetical Snee Company stores a picture of &lt;a href=&#34;https://en.wikipedia.org/wiki/Lassie&#34;&gt;Lassie&lt;/a&gt; tagged as &amp;ldquo;Collie&amp;rdquo; in a CMS, and that term in the CMS&amp;rsquo;s taxonomy has a link to the broader term &amp;ldquo;Dog&amp;rdquo;. Taylor, another Snee employee, is writing an article about hints for taking your pets on vacation and searches the CMS for dog pictures. Sidney didn&amp;rsquo;t tag the Lassie picture as &amp;ldquo;Dog&amp;rdquo;, but the taxonomy-aware search engine knows that it shows a collie, which therefore makes it a dog, and returns that picture to Taylor. Taylor found a good picture for the article and has benefited from the value added by this piece of metadata.&lt;/p&gt;
&lt;p&gt;I thought I&amp;rsquo;d create a controlled vocabulary of animal species and broader terms as a simple flat list and see how well ChatGPT-3 could impose some hierarchical structure on this by adding links such as the Collie-to-Dog one described above. Of course, I would have it use the RDF-based &lt;a href=&#34;https://www.bobdc.com/blog/skosibm/&#34;&gt;SKOS&lt;/a&gt; standard for controlled vocabularies, taxonomies, and thesauri.&lt;/p&gt;
&lt;p&gt;A web search showed that Kurt Cagle, another old friend from XML&amp;rsquo;s early days, had given me a nice head start in his recent posting &lt;a href=&#34;https://thecaglereport.com/2023/03/16/nine-chatgpt-tricks-for-knowledge-graph-workers/&#34;&gt;Nine ChatGPT Tricks for Knowledge Graph Workers&lt;/a&gt;. His list of animals (see &amp;ldquo;Example 8. Taxonomy Construction&amp;rdquo; in that article) sorted and indented the terms to show their hierarchy. I wanted ChatGPT to do that work, so I made a copy of Kurt&amp;rsquo;s list, removed all the leading spaces, and shuffled the lines into a random order. (Cool Linux command line tool I found for that: &lt;a href=&#34;https://man7.org/linux/man-pages/man1/shuf.1.html&#34;&gt;&lt;code&gt;shuf&lt;/code&gt;&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;I then wrote a one-off Perl script that converted the list to SKOS Turtle RDF. All it said was that these were concepts with these labels. Here are the first 15 lines; the remainder follows the pattern shown:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; . 
@prefix d:    &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:c1 a skos:Concept ;
     skos:prefLabel &amp;#34;Tigers&amp;#34; .

d:c2 a skos:Concept ;
     skos:prefLabel &amp;#34;Bears&amp;#34; .

d:c3 a skos:Concept ;
     skos:prefLabel &amp;#34;Mammals&amp;#34; .

d:c4 a skos:Concept ;
     skos:prefLabel &amp;#34;Primates&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I wanted to see if ChatGPT could add triples such as &lt;code&gt;d:c4 skos:broader d:c3&lt;/code&gt;. To summarize my result, the free ChatGPT-3 did OK, but not great; when Paul later tried it with ChatGPT-4, for which he is paying for a subscription, that did much better. Important ChatGPT lesson here: you get what you pay for.&lt;/p&gt;
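&lt;p&gt;The one-off conversion script mentioned above was in Perl, but the job is simple enough that a rough Python equivalent fits in a few lines; the &lt;code&gt;@prefix&lt;/code&gt; declarations for &lt;code&gt;skos:&lt;/code&gt; and &lt;code&gt;d:&lt;/code&gt; are assumed to be prepended separately:&lt;/p&gt;

```python
# Convert a flat list of vocabulary terms into SKOS Turtle concept
# declarations, mirroring the output excerpted above. A rough Python
# stand-in for the one-off Perl script; the @prefix declarations for
# skos: and d: are assumed to be prepended separately.

def terms_to_skos(terms):
    """Return Turtle declaring each term as a skos:Concept."""
    stanzas = []
    for i, label in enumerate(terms, start=1):
        stanzas.append('d:c%d a skos:Concept ;\n     skos:prefLabel "%s" .'
                       % (i, label))
    return "\n\n".join(stanzas)

print(terms_to_skos(["Tigers", "Bears", "Mammals"]))
```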
&lt;p&gt;Here is the prompt that I gave to ChatGPT-3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Take the following set of Turtle RDF triples and add more triples that use skos:broader as their predicate. The new triples should show the hierarchical relationship of the existing terms.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Below that prompt I pasted the data that is excerpted above; you can see the whole thing at &lt;a href=&#34;https://bobdc.com/miscfiles/simpleSKOS.ttl&#34;&gt;https://bobdc.com/miscfiles/simpleSKOS.ttl&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The system responded with my original RDF triples and the new ones that it generated based on what I asked for. The prefix declarations at the top were missing their angle brackets, so I added those to make it parse properly. I then wrote the following SPARQL query to process the returned RDF and show me a report that was more intuitive to read than statements like  &lt;code&gt;d:c4 skos:broader d:c3&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; 
prefix d:    &amp;lt;http://learningsparql.com/ns/data#&amp;gt;

SELECT ?narrowerLabel ?broaderLabel WHERE {
  ?narrower skos:prefLabel ?narrowerLabel ;
            skos:broader ?broader .
  ?broader skos:prefLabel ?broaderLabel . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is the resulting report:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;----------------------------------
| narrowerLabel  | broaderLabel  |
==================================
| &amp;#34;Mammals&amp;#34;      | &amp;#34;Bears&amp;#34;       |
| &amp;#34;Mammals&amp;#34;      | &amp;#34;Tigers&amp;#34;      |
| &amp;#34;Coyotes&amp;#34;      | &amp;#34;Canines&amp;#34;     |
| &amp;#34;Felines&amp;#34;      | &amp;#34;Animals&amp;#34;     |
| &amp;#34;Vertebrates&amp;#34;  | &amp;#34;Animals&amp;#34;     |
| &amp;#34;Ursines&amp;#34;      | &amp;#34;Bears&amp;#34;       |
| &amp;#34;Wolves&amp;#34;       | &amp;#34;Canines&amp;#34;     |
| &amp;#34;Animals&amp;#34;      | &amp;#34;Vertebrates&amp;#34; |
| &amp;#34;Primates&amp;#34;     | &amp;#34;Mammals&amp;#34;     |
| &amp;#34;Chimpanzees&amp;#34;  | &amp;#34;Primates&amp;#34;    |
| &amp;#34;Canines&amp;#34;      | &amp;#34;Mammals&amp;#34;     |
| &amp;#34;Lions&amp;#34;        | &amp;#34;Bears&amp;#34;       |
| &amp;#34;Apes&amp;#34;         | &amp;#34;Primates&amp;#34;    |
| &amp;#34;Carnivores&amp;#34;   | &amp;#34;Animals&amp;#34;     |
| &amp;#34;Insectivores&amp;#34; | &amp;#34;Carnivores&amp;#34;  |
| &amp;#34;Badgers&amp;#34;      | &amp;#34;Carnivores&amp;#34;  |
| &amp;#34;Panthers&amp;#34;     | &amp;#34;Felines&amp;#34;     |
| &amp;#34;Humans&amp;#34;       | &amp;#34;Felines&amp;#34;     |
----------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It usually made sensible connections, but a few are completely wrong, and as you can see, many of the connections are backward, like the first two.&lt;/p&gt;
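&lt;p&gt;Backward links like these are easy to spot by eye in a small vocabulary, but one mechanical check helps at any scale: no concept should be transitively broader than itself. (The report above actually has such a cycle: Vertebrates is listed as broader than Animals and Animals as broader than Vertebrates.) A minimal Python sketch of that check over (narrower, broader) label pairs:&lt;/p&gt;

```python
# Sanity-check generated skos:broader pairs: no concept should be
# transitively broader than itself. Pairs are (narrower, broader)
# label strings like the rows in the report above.

def find_cycles(pairs):
    """Return the labels that can reach themselves via broader links."""
    broader = {}
    for narrower, b in pairs:
        broader.setdefault(narrower, set()).add(b)
    cycles = set()
    for start in broader:
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in broader.get(node, ()):
                if nxt == start:
                    cycles.add(start)
                elif nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return cycles

pairs = [("Animals", "Vertebrates"), ("Vertebrates", "Animals"),
         ("Canines", "Mammals")]
print(find_cycles(pairs))  # both members of the cycle are flagged
```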
&lt;p&gt;Paul had much better luck doing the exact same thing with ChatGPT-4. He also did a lot of prompt refinement to get the system to explain why it did what it did. He has promised me that he will be writing that up soon, and when he does I will link to his writeup from here. It&amp;rsquo;s an interesting start to an answer for one of the important questions of 2023: &amp;ldquo;What useful work can I get Large Language Models to do for me?&amp;rdquo;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1659944436501979137&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/110401725068040843&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">ai-and-machine-learning</category>
      
    </item>
    
    <item>
      <title>More advice about software documentation</title>
      <link>https://www.bobdc.com/blog/techwritingadvice/</link>
      <pubDate>Sun, 14 May 2023 12:10:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/techwritingadvice/</guid>
      
      
      <description><div>Especially documenting APIs.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/apipic.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;API diagram&#34; width=&#34;320&#34;/&gt;
&lt;p&gt;Early last year, in the blog entry &lt;a href=&#34;../ieee-podcast&#34;&gt;Doing a podcast interview about technical writing&lt;/a&gt;, I described an interview I did for the IEEE Software Engineering Radio podcast. Listening to it again this week I saw that I covered a lot of good ground. Since then I have thought of a few other points I wish I&amp;rsquo;d mentioned, so here they are in another bulleted list. Because of some recent experience I had enough thoughts about documenting APIs that I gave that discussion its own section below.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Documentation (in particular, the User Guide) should reflect a company’s official vision for the product, which is something that a team of people at the company typically worked pretty hard on. Coordinate with them and their work. For each thing that the vision promises about the product, it should be easy to find information about how to do that thing in the documentation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Good documentation is a form of marketing literature and good marketing literature is a form of documentation. Documentation should convince the reader that the product will help them get useful work done and marketing literature should educate the reader about how the product gets used.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;../jupytersparql/&#34;&gt;Jupyter notebooks&lt;/a&gt; are good for documentation–but just for tutorials, because they walk through a series of specific steps for a particular scenario and show the results. When you have tutorials, you still need a User Guide to explain big-picture topics and a Reference Guide to explain every detail of the product. Notebooks are so focused on specific scenarios that they’re not good for those purposes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Something I hate, and I’ve seen entire books of it: developers who think that the long, heavily commented programs they put in the documentation are the best way for others to learn. (Ooh, a “complete application!”) Examples should be short and self-contained so that they are easier to apply to other contexts. In a book, explanations of code samples should be in a readable proportionally-spaced font such as Times Roman or Helvetica. This is part of the appeal of Jupyter notebooks, which let you mix executable code with nicely-formatted prose text.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While I stand by my description of the basic categories of documentation, the term “User Guide” seems to have gone out of fashion. Instead of a five-section User Guide as a top-level document in a product’s documentation collection, nowadays a software company is more likely to present each of those five sections as a top-level document in its own right. The good news: what was formerly a third-level heading in that content becomes a second-level heading, so it’s easier to display descriptions of more sections in an expanded table of contents. The bad news: the list of top-level documents can more easily get too long and therefore difficult to quickly evaluate when looking for something.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;documenting-apis&#34;&gt;Documenting APIs&lt;/h2&gt;
&lt;p&gt;The core of a software product may be an API, or &lt;a href=&#34;https://en.wikipedia.org/wiki/API&#34;&gt;Application Programming Interface&lt;/a&gt;. Customers doing their work in a particular programming language are given libraries or access to a server where they can call the functions that make up the API. If you buy a robot with a Python API, the product includes Python libraries so that you might make calls like &lt;code&gt;head.turn(left,30)&lt;/code&gt; if you want to turn the robot&amp;rsquo;s head 30 degrees to the left. You don&amp;rsquo;t have to worry about the robot&amp;rsquo;s internal electronics because the vendor has provided you with a Python-based interface at a higher level of abstraction that lets you simply tell the robot what to do.&lt;/p&gt;
&lt;p&gt;Usually, the developers who actually implemented the API functions like &lt;code&gt;head.turn(direction,degrees)&lt;/code&gt; included comments with their code that describe more about these functions and what you can do with them. If their comments follow the right formatting conventions, an automated program such as &lt;a href=&#34;https://swagger.io/&#34;&gt;Swagger&lt;/a&gt;, &lt;a href=&#34;https://www.sphinx-doc.org/en/master/tutorial/automatic-doc-generation.html&#34;&gt;Sphinx&lt;/a&gt;, &lt;a href=&#34;https://docs.python.org/3/library/pydoc.html&#34;&gt;PyDoc&lt;/a&gt;, or &lt;a href=&#34;https://www.doxygen.nl/&#34;&gt;Doxygen&lt;/a&gt; can extract those comments (known as &lt;a href=&#34;https://en.wikipedia.org/wiki/Docstring&#34;&gt;docstrings&lt;/a&gt;) and package them in HTML documents that make it easy for API users to look up the information they need. The conversion package will include CSS modules and other means to customize the content&amp;rsquo;s look and feel.&lt;/p&gt;
&lt;p&gt;From the user&amp;rsquo;s perspective, the API documentation is the formatted list of the functions or methods that they can call along with a description of each one&amp;rsquo;s purpose, the role and types of parameters to pass to it, what it returns, and maybe an example.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve heard people refer to this as automated creation of API documentation, but the automated generation of the HTML doesn&amp;rsquo;t mean that the actual &lt;em&gt;writing&lt;/em&gt; of the documentation is automated. For example, the developer who coded the &lt;code&gt;head.turn()&lt;/code&gt; function might just give the two parameter names of &lt;code&gt;direction&lt;/code&gt; and &lt;code&gt;degrees&lt;/code&gt; and leave it at that. This can lead to many questions: how do you specify the direction? What are the choices? Are they constants or quoted strings? Does the number of degrees have to be a whole number?&lt;/p&gt;
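&lt;p&gt;A docstring that answers those questions might look like the following. The robot API here is invented for illustration; the point is what a tech writer adds: the allowed values, the units, the range, and an example call.&lt;/p&gt;

```python
# An expanded docstring for the hypothetical head.turn() function.
# The robot API is invented for illustration; the additions are the
# kind a tech writer makes: choices, units, ranges, and an example.

def turn(direction, degrees):
    """Turn the robot's head.

    Args:
        direction: either "left" or "right" (a quoted string,
            not a constant).
        degrees: how far to turn, as a whole number of degrees
            from 0 to 90 relative to the current position.

    Example:
        head.turn("left", 30)   # look 30 degrees to the left
    """
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    return (direction, degrees)
```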
&lt;p&gt;To improve these descriptions, a tech writer working on API documentation functions as an editor, but much more than a copy editor correcting spelling and punctuation. These tech writers are also reporters, interviewing the developers about who would use a given function, for what purpose, in what situations. You often ask those same questions about each parameter passed to the function: why is each one there? What powers does it give to the developer calling the function? API documentation seems very different from marketing literature, but to return to my point above about aligning them, even API documentation should make it clear that each feature provides useful value to the user that fits in with the bigger picture of what the marketing department promises about the product.&lt;/p&gt;
&lt;p&gt;Just providing the type of each parameter might not be enough. For example, for a number, the documentation should indicate what would be a typical high value and a typical low value and what effect these would have. I was once revising the description of a parameter whose docstring merely said that it was a decimal number and then found out from the responsible developer that it represented a percentage, so that a value of .5 meant 50%. The fact that 1 was the highest possible value that you could pass was a surprise to me, and I made sure to include that in its documentation.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not unusual for tech writers to ask the developer these questions and then go and fix the docstrings in the source code themselves–coordinating, of course, with the developers in charge of maintaining that code and often going through the same &lt;a href=&#34;https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests&#34;&gt;GitHub pull request&lt;/a&gt; steps that modifications to the executable code go through. These tech writers must be familiar with the syntax of the programming language being used, because you don’t want to break anything; accidentally deleting the wrong comma can prevent the code from compiling at product build time.&lt;/p&gt;
&lt;p&gt;If the API is a significant part of the product, tech writers should know the language that calls the API well enough to write programs that try out these functions themselves. This can often help them to answer their questions about the functions that need more documentation, thereby reducing the need to pester the developers who wrote these functions in the first place.&lt;/p&gt;
&lt;p&gt;Ideally, tech writers will enjoy writing programs that use the API to make the product do interesting things. If so, they are in the right job!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://flickr.com/photos/thesmith/4574969567/&#34;&gt;API diagram picture&lt;/a&gt; by &lt;a href=&#34;https://flickr.com/photos/thesmith/&#34;&gt;Ben Smith&lt;/a&gt;, &lt;a href=&#34;http://creativecommons.org/licenses/by-nc/2.0/&#34;&gt;Creative Commons CC BY-NC 2.0&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1657790800934076416&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/110368074996491025&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>Introducing RDF and related standards</title>
      <link>https://www.bobdc.com/blog/rdfintroseries/</link>
      <pubDate>Sun, 30 Apr 2023 11:01:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdfintroseries/</guid>
      
      
      <description><div>The series.</div><div>&lt;p&gt;&lt;a href=&#34;../writing/rdfstandards/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/rdflogo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;RDF logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A co-worker recently told me that he was considering using OWL for something but didn&amp;rsquo;t want to deal with all that XML. It was disappointing to hear this in the year 2023, but I guess those early images of RDF/XML being used to implement OWL restriction classes really scared some people off.&lt;/p&gt;
&lt;p&gt;There have been other occasions where I wanted to suggest that someone could learn the basics of RDF, OWL, and the related standards by reading the introductions I did for each as blog entries in 2021, but I hate to send these people five URLs or tell them to go to the list of &lt;a href=&#34;../../categories/2021/&#34;&gt;2021 entries&lt;/a&gt; and read the June through October entries in the reverse of the order shown.&lt;/p&gt;
&lt;p&gt;So now there&amp;rsquo;s one URL for a table of contents page at &lt;strong&gt;&lt;a href=&#34;../../articles/rdfstandards&#34;&gt;A brief introduction to RDF, related standards, and what they can do for you&lt;/a&gt;&lt;/strong&gt;. It lists the relevant entries, in order, with a brief description of each. I hope that it&amp;rsquo;s useful both for people who want to ramp up quickly on these topics and for anyone who wants to point them to a simple, brief—if a bit opinionated—introduction. (I didn&amp;rsquo;t mean for this table of contents page to be displayed with the full blog entry theming, but to take advantage of the CSS and so forth that I&amp;rsquo;m using with the &lt;a href=&#34;../changing-my-blogs-domain-name/&#34;&gt;Hugo&lt;/a&gt; blog engine, I just used the existing template rather than creating a new one for it.)&lt;/p&gt;
&lt;p&gt;I wrote a similar article many years ago titled &lt;a href=&#34;http://www.snee.com/rdf/semweboverview.html&#34;&gt;RDF, The Semantic Web, and Linked Data&lt;/a&gt; that did a broader summary with references to many more blog entries. Much of that, such as its references to XMP and RDFa, is quite dated now. That article might be interesting for its historical perspective from 2009, but the brevity of this new summary should make it more helpful overall.&lt;/p&gt;
 &lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1652693229060407302&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/110288426502118230&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>Normalizing company names (and more) with SPARQL and Wikidata</title>
      <link>https://www.bobdc.com/blog/wikidatanormalizing/</link>
      <pubDate>Sun, 26 Mar 2023 11:16:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/wikidatanormalizing/</guid>
      
      
      <description><div>As a service!</div><div>&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Big_Blue_(mascot)&#34;&gt;&lt;img id=&#34;id143548&#34; src=&#34;https://www.bobdc.com/img/main/bigblue.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[ODU mascot]&#34; width=&#34;160&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Several years ago I wrote &lt;a href=&#34;https://www.bobdc.com/blog/normalizing-company-names-with/&#34;&gt;Normalizing company names with SPARQL and DBpedia&lt;/a&gt; to describe how SPARQL queries to DBpedia let you take advantage of the Wikipedia logic that redirects the URL  &lt;a href=&#34;http://en.wikipedia.org/wiki/Big_Blue&#34;&gt;http://en.wikipedia.org/wiki/Big_Blue&lt;/a&gt; to &lt;a href=&#34;https://en.wikipedia.org/wiki/IBM&#34;&gt;https://en.wikipedia.org/wiki/IBM&lt;/a&gt; and &lt;a href=&#34;http://en.wikipedia.org/wiki/Bobby_Kennedy&#34;&gt;http://en.wikipedia.org/wiki/Bobby_Kennedy&lt;/a&gt; to &lt;a href=&#34;http://en.wikipedia.org/wiki/Robert_F._Kennedy&#34;&gt;http://en.wikipedia.org/wiki/Robert_F._Kennedy&lt;/a&gt;. This lets SPARQL queries normalize names—a useful task to perform for data cleanup.&lt;/p&gt;
&lt;p&gt;This time I did it with Wikidata. As with the DBpedia version, I used a &lt;code&gt;SERVICE&lt;/code&gt; call to Wikidata so that a query running somewhere besides Wikidata can take advantage of this. I also showed how to make it work for countries as well as companies, and minor changes should make it work for most other classes.&lt;/p&gt;
&lt;p&gt;Because my new version uses alternative names instead of redirect data, the name of the company that it returns is the name on the Wikipedia page, not the official name of the company. So, for example, instead of &amp;ldquo;Eastman Kodak Company&amp;rdquo; it will return &amp;ldquo;Kodak&amp;rdquo; and instead of &amp;ldquo;Apple, Inc.&amp;rdquo; it will return &amp;ldquo;Apple&amp;rdquo;.  Still, the normalization down to a single name is useful.&lt;/p&gt;
&lt;p&gt;Here is the query I entered into the Wikidata Query Service to send to its endpoint. You can also see and execute the query with &lt;a href=&#34;https://query.wikidata.org/#PREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX%20wd%3A%20%20%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0ASELECT%20DISTINCT%20%3FwdName%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Find%20the%20Wikidataname%0AWHERE%20%7B%0A%20%20BIND%28%22Big%20Blue%22%40en%20AS%20%3FenglishName%29%20%20%20%20%23%20Search%20based%20on%20a%20nickname.%0A%20%20%23BIND%28%22IBM%22%40en%20AS%20%3FenglishName%29%20%20%20%20%20%20%20%20%20%23%20Search%20based%20on%20Wikipedia%20name.%0A%20%20%23BIND%28%22Apple%20Inc.%22%40en%20AS%20%3FenglishName%29%20%20%23%20Search%20based%20on%20official%20name.%0A%20%20%0A%20%20BIND%20%28wd%3AQ4830453%20as%20%3FentityClass%29%20%20%20%20%20%20%23%20Business%20is%20the%20entity%20class...%0A%20%20%0A%20%20%3Fentity%20wdt%3AP31%20%3FentityClass%20.%20%20%20%20%20%20%20%20%20%20%23%20that%20we%20want%20to%20search.%0A%20%20%0A%20%20%23%20Union%20of%20two%20sets%20of%20triples%3A%20entities%20that%20have%20the%20input%20name%20as%0A%20%20%23%20an%20alternative%20name%20and%20those%20that%20have%20it%20as%20their%20official%20name.%20%0A%20%20%7B%20%20%20%20%20%20%20%0A%20%20%20%20%3Fentity%20skos%3AaltLabel%20%3FenglishName%20%3B%20%0A%20%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FofficialName%20.%0A%20%20%20%20FILTER%20%28%20lang%28%3FofficialName%29%20%3D%20%22en%22%20%29%0A%20%20%7D%0A%20%20UNION%0A%20%20%7B%20%3Fentity%20rdfs%3Alabel%20%3FenglishName%20.%20%7D%0A%20%20%0A%20%20%23%20if%20there%20was%20an%20officialName%20to%20go%20with%20the%20input%20name%20as%20an%20alternative%0A%20%20%23%20name%2C%20use%20that%20as%20the%20Wikidata%20name%2C%20otherwise%20use%20the%20input%20name%20as%20%0A%20%20%23%20the%20Wikidata%20name%20if%20it%20was%20used%20as%20the%20rdfs%3Alabel%20name%20for%20that%20entity.%20%0A%20%20BIND%28STR%28COALESCE%28%3FofficialName%2C%3FenglishName%29%29%20AS%20%3FwdName%29%0A%7D%0A&#34;&gt;this query link&lt;/a&gt;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX wd:   &amp;lt;http://www.wikidata.org/entity/&amp;gt;
PREFIX wdt:  &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;

SELECT DISTINCT ?wdName                   # Find the Wikidata name
WHERE {
  # A few test cases to choose from
  BIND(&amp;#34;Big Blue&amp;#34;@en AS ?englishName)    # Search based on a nickname.
  #BIND(&amp;#34;IBM&amp;#34;@en AS ?englishName)         # Search based on Wikipedia name.
  #BIND(&amp;#34;Apple Inc.&amp;#34;@en AS ?englishName)  # Search based on official name.

  BIND (wd:Q4830453 as ?entityClass)      # Business is the entity class...
  ?entity wdt:P31 ?entityClass .          # that we want to search.
  
  # Union of two sets of triples: entities that have the input name as
  # an alternative name and those that have it as their official name. 
  {       
    ?entity skos:altLabel ?englishName ; 
            rdfs:label ?officialName .
    FILTER ( lang(?officialName) = &amp;#34;en&amp;#34; )
  }
  UNION
  { ?entity rdfs:label ?englishName . }
  
  # Get the official name if it was bound, otherwise the
  # English name if part 2 of the UNION found it. 
  BIND(STR(COALESCE(?officialName,?englishName)) AS ?wdName)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I won&amp;rsquo;t discuss the query much here because it&amp;rsquo;s heavily commented. You can also read about the query logic in my earlier article on doing this with DBpedia (especially the use of the under-appreciated &lt;code&gt;COALESCE&lt;/code&gt; function). I will say that, although Wikidata sometimes uses its own vocabulary to express ideas that could have used basic parts of the standard (for example, using &lt;code&gt;wdt:P31&lt;/code&gt; instead of &lt;code&gt;rdf:type&lt;/code&gt; to show class membership), it&amp;rsquo;s nice to see the RDFS and SKOS vocabularies used as part of the data I was retrieving.&lt;/p&gt;
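SPARQL's COALESCE returns the first of its arguments that is bound and error-free. A rough stdlib Python analogue (a sketch for illustration only, not part of the query above) makes that fallback behavior concrete:

```python
def coalesce(*values):
    """Return the first value that is not None, mimicking SPARQL's
    COALESCE, which returns its first bound, error-free argument."""
    for value in values:
        if value is not None:
            return value
    return None

# When the UNION's first branch matched, ?officialName is bound and wins;
# otherwise we fall back to the input name, as the query's final BIND does.
print(coalesce("IBM", "Big Blue"))   # prints IBM
print(coalesce(None, "Big Blue"))    # prints Big Blue
```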
&lt;h2 id=&#34;as-a-service&#34;&gt;As a SERVICE&amp;hellip;&lt;/h2&gt;
&lt;p&gt;You can run the query below with any SPARQL processor that has access to the Internet so that it can make the &lt;code&gt;SERVICE&lt;/code&gt; call. I ran it locally with &lt;a href=&#34;https://jena.apache.org/documentation/query/&#34;&gt;&lt;code&gt;arq&lt;/code&gt;&lt;/a&gt; in a query that demonstrates batch processing of names to normalize.&lt;/p&gt;
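A SPARQL processor is not the only kind of client that can reach such an endpoint; any HTTP library can submit a query. This stdlib Python sketch (illustrative only: it builds a request object without sending it, and the tiny query is a stand-in rather than the normalization query from this post) shows the general shape of a SPARQL-over-HTTP GET request:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request for a SPARQL endpoint,
    asking for JSON-formatted results."""
    params = urlencode({"query": query, "format": "json"})
    return Request(
        endpoint + "?" + params,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "sparql-sketch/0.1",  # polite clients identify themselves
        },
    )

# A stand-in query; a real one would be the normalization query shown below.
req = build_sparql_request(
    "https://query.wikidata.org/sparql",
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1",
)
print(req.full_url)
```

Passing the request to urllib.request.urlopen would then return the JSON result set.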
&lt;p&gt;This query is more flexible than the one above, letting you disambiguate terms from different classes—in my example, both company and country names. (I tried it with people as well, but there are just too many famous people who share their name with someone else such as the various &lt;a href=&#34;https://en.wikipedia.org/wiki/Michael_Jordan_(disambiguation)&#34;&gt;Michael Jordans&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/David_Thomas&#34;&gt;Dave Thomases&lt;/a&gt;.) You just need to include the class URI in the input.&lt;/p&gt;
&lt;p&gt;Here is the sample input, listing some entities by their Wikidata names and some by alternative names that they are known for.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix s:  &amp;lt;http://learningsparql.com/ns/sample/&amp;gt; .
@prefix wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt; .

s:company1 a wd:Q4830453;
           s:name &amp;#34;Kodak&amp;#34; .

s:company2 a wd:Q4830453;
           s:name &amp;#34;Big Blue&amp;#34; .

s:company3 a wd:Q4830453;
           s:name &amp;#34;Coca Cola&amp;#34; .

s:country1 a wd:Q6256;
        s:name &amp;#34;The UK&amp;#34; .
        
s:country2 a wd:Q6256;
        s:name &amp;#34;Nigeria&amp;#34; .
        
s:country3 a wd:Q6256;
        s:name &amp;#34;U.S.&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The query is a &lt;code&gt;CONSTRUCT&lt;/code&gt; query, so it will return triples: the original data, the Wikidata name, and the Wikidata qname so that something further down the processing pipeline can retrieve more data from Wikidata about the entity.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX wd:   &amp;lt;http://www.wikidata.org/entity/&amp;gt;
PREFIX wdt:  &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX s:    &amp;lt;http://learningsparql.com/ns/sample/&amp;gt;

CONSTRUCT {
  ?entity rdfs:label ?wdName ;
          s:name ?name ; 
          s:wikidataURI ?wikidataEntity . 
}
WHERE {
  ?entity a ?entityClass;
          s:name ?name . 
  
  BIND(STRLANG(?name,&amp;#34;en&amp;#34;) AS ?englishName)
  
  SERVICE &amp;lt;https://query.wikidata.org/sparql&amp;gt; 
  # Look for something with that name and entity class. 
  {
    ?wikidataEntity wdt:P31 ?entityClass . 
    {       
      ?wikidataEntity skos:altLabel ?englishName ; 
         rdfs:label ?officialName .
      FILTER ( lang(?officialName) = &amp;#34;en&amp;#34; )
    }
    UNION
    { ?wikidataEntity rdfs:label ?englishName .
      FILTER ( lang(?englishName) = &amp;#34;en&amp;#34; )
    }
  }
  BIND(STR(COALESCE(?officialName,?englishName)) AS ?wdName)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;SERVICE&lt;/code&gt; clause hands part of the work off to Wikidata, and the rest executes locally with my copy of &lt;code&gt;arq&lt;/code&gt;. The syntax is otherwise very close to the more heavily commented example above.&lt;/p&gt;
&lt;p&gt;Here are the results I get:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix s:    &amp;lt;http://learningsparql.com/ns/sample/&amp;gt; .
@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; .
@prefix wd:   &amp;lt;http://www.wikidata.org/entity/&amp;gt; .
@prefix wdt:  &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt; .

s:company1  rdfs:label  &amp;#34;Kodak&amp;#34; ;
        s:name         &amp;#34;Kodak&amp;#34; ;
        s:wikidataURI  wd:Q486269 .

s:country2  rdfs:label  &amp;#34;Nigeria&amp;#34; ;
        s:name         &amp;#34;Nigeria&amp;#34; ;
        s:wikidataURI  wd:Q1033 .

s:company2  rdfs:label  &amp;#34;IBM&amp;#34; ;
        s:name         &amp;#34;Big Blue&amp;#34; ;
        s:wikidataURI  wd:Q37156 .

s:country3  rdfs:label  &amp;#34;United States of America&amp;#34; ;
        s:name         &amp;#34;U.S.&amp;#34; ;
        s:wikidataURI  wd:Q30 .

s:country1  rdfs:label  &amp;#34;United Kingdom&amp;#34; ;
        s:name         &amp;#34;The UK&amp;#34; ;
        s:wikidataURI  wd:Q145 .

s:company3  rdfs:label  &amp;#34;The Coca-Cola Company&amp;#34; , &amp;#34;Coca-Cola Hellenic&amp;#34; ;
        s:name         &amp;#34;Coca Cola&amp;#34; ;
        s:wikidataURI  wd:Q3295867 , wd:Q1104910 .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The fact that the &amp;ldquo;Coca Cola&amp;rdquo; entry returns two companies shows that this approach may not completely normalize a given name. To identify which output entities have more than one Wikidata name, and therefore need some review, we can run this query on the output of the CONSTRUCT query above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX  s:  &amp;lt;http://learningsparql.com/ns/sample/&amp;gt;

SELECT ?name (COUNT(?uri) AS ?uriCount)
WHERE {
  ?s s:name ?name ;
     s:wikidataURI ?uri 
}
GROUP BY ?name
HAVING (?uriCount &amp;gt; 1)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or, you could try to find some logic related to how Wikidata models companies as a way to pick just one of the companies—or countries, because this issue comes up with them as well. The nice thing is that the query can work with different classes of entities, so it provides a foundation to build on.&lt;/p&gt;
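The same review step could also run outside of SPARQL. A stdlib Python sketch (the name and URI pairs are copied from the sample results above) that flags names bound to more than one Wikidata entity:

```python
from collections import defaultdict

# (name, wikidataURI) pairs copied from the CONSTRUCT query's results above.
pairs = [
    ("Kodak", "wd:Q486269"),
    ("Big Blue", "wd:Q37156"),
    ("Coca Cola", "wd:Q3295867"),
    ("Coca Cola", "wd:Q1104910"),
]

name_to_uris = defaultdict(set)
for name, uri in pairs:
    name_to_uris[name].add(uri)

# Names mapped to more than one entity need human review, just as
# the HAVING clause in the SPARQL version flags them.
needs_review = sorted(n for n, uris in name_to_uris.items() if len(uris) > 1)
print(needs_review)  # prints ['Coca Cola']
```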
 &lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1640013052853493761&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/110090304693753766&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>Using the AWS Graph Explorer with Fuseki and local datasets</title>
      <link>https://www.bobdc.com/blog/graphexplorerandfuseki/</link>
      <pubDate>Sun, 26 Feb 2023 11:03:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/graphexplorerandfuseki/</guid>
      
      
      <description><div>An open source visual graph navigator.</div><div>&lt;p&gt;When I first heard about the &lt;a href=&#34;https://github.com/aws/graph-explorer&#34;&gt;AWS Graph Explorer&lt;/a&gt; I assumed that it was a cloud-based tool for use with Neptune, the AWS cloud-based triplestore. After I read Fan Li&amp;rsquo;s &lt;a href=&#34;https://apex974.com/articles/aws-graph-explorer-first-impression&#34;&gt;First Impressions of the AWS Graph Explorer&lt;/a&gt; I realized that you can install this open source tool locally and point it at any SPARQL endpoint you want, so I cranked up &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Jena Fuseki&lt;/a&gt; on my laptop, loaded some data into it, and installed the Graph Explorer.&lt;/p&gt;
&lt;p&gt;The first three of the &lt;a href=&#34;https://github.com/aws/graph-explorer#steps-to-install-graph-explorer&#34;&gt;Steps to install Graph Explorer&lt;/a&gt; on the project&amp;rsquo;s &lt;a href=&#34;https://github.com/aws/graph-explorer&#34;&gt;git repo&lt;/a&gt; readme page were all I needed. For step two, where it says &lt;code&gt;{hostname-or-ip-address}&lt;/code&gt;, I just put &lt;code&gt;localhost&lt;/code&gt;, which is also what Fan did, and that worked fine.&lt;/p&gt;
&lt;p&gt;After I did step three&amp;rsquo;s &lt;code&gt;docker run&lt;/code&gt; command  to run the Docker container, I didn&amp;rsquo;t need to do the remaining steps on the list, which were for running this on an EC2 instance. I sent a browser to &lt;code&gt;https://localhost:5173/&lt;/code&gt; and that got redirected to &lt;code&gt;https://localhost:5173/#/connections&lt;/code&gt;, which is where you create and manage connections to the data sources with the data you want to visualize.&lt;/p&gt;
&lt;p&gt;For some local data to explore I loaded the W3C&amp;rsquo;s  &lt;a href=&#34;https://www.w3.org/TR/vcard-rdf/&#34;&gt;vcard&lt;/a&gt; business card ontology into Fuseki. (The ontology file is available at its namespace URI, &lt;a href=&#34;http://www.w3.org/2006/vcard/ns&#34;&gt;http://www.w3.org/2006/vcard/ns&lt;/a&gt;. It&amp;rsquo;s always nice when a namespace URI is a URL pointing to the ontology itself.) I also made a little file with four instances of the ontology&amp;rsquo;s &lt;code&gt;Individual&lt;/code&gt; class (the following plus three variations) and loaded that into Fuseki to see how the Graph Explorer showed them.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;ex:r4 a v:Individual ;
      v:given-name &amp;#34;Dana&amp;#34; ;
      v:family-name &amp;#34;Williams&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The next step was to connect the Graph Explorer to the dataset. I clicked the plus sign on the connections screen mentioned above to display the &amp;ldquo;Add New Connection&amp;rdquo; dialog box. I only had to fill out two things there:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I changed the default value for &amp;ldquo;Graph Type&amp;rdquo; from &amp;ldquo;PG (Property Graph)&amp;rdquo; (which supports the &lt;a href=&#34;https://tinkerpop.apache.org/&#34;&gt;Apache Tinkerpop&lt;/a&gt; variety) to &amp;ldquo;RDF (Resource Description Framework)&amp;rdquo;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For the endpoint, I learned from Fan that the &amp;ldquo;Public or Proxy Endpoint&amp;rdquo; field assumes that your SPARQL endpoint URL ends with &lt;code&gt;/sparql&lt;/code&gt;, so you shouldn&amp;rsquo;t include that part when entering the URL. The endpoint URL for this vcard dataset on Fuseki was &lt;code&gt;http://localhost:3030/vcard/sparql&lt;/code&gt;, so I entered &lt;code&gt;http://localhost:3030/vcard/&lt;/code&gt; in that field:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/graphExplorerAddConnect.png&#34; class=&#34;centered&#34; alt=&#34;Add Connection dialog box of AWS Graph Explorer&#34;/&gt;&lt;/p&gt;
&lt;p&gt;After clicking &amp;ldquo;Add Connection&amp;rdquo;, the next screen will either have a message &amp;ldquo;Connection successfully synchronized&amp;rdquo; on the right or an error message. My error messages were due to accidentally picking PG as the graph type and using the full endpoint URL instead of omitting the &lt;code&gt;/sparql&lt;/code&gt; part.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/graphExplorerConnectionsScreen.png&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/graphExplorerConnectionsScreen.png&#34; class=&#34;centered&#34; width=&#34;500&#34; alt=&#34;Graph Explorer Connections screen&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you use this screen&amp;rsquo;s plus sign to add additional connections, the panel on the left will list them with an &amp;ldquo;Active&amp;rdquo; toggle to the right of each one&amp;rsquo;s name to select it. Clicking the circular arrows in the upper-right next to &amp;ldquo;Last Synchronization&amp;rdquo; will update the data that the Explorer is using from the data source. This was useful for me when I loaded the vcard ontology into Fuseki, created a Graph Explorer connection to view it, and then added the sample instance data into that Fuseki dataset and wanted Graph Explorer to use that new instance data as well as the ontology data.&lt;/p&gt;
&lt;p&gt;Once you have a connection up and running, the Explorer&amp;rsquo;s Graph View tells you &amp;ldquo;To get started, click into the search bar to browse graph data. Click + to add to Graph View.&amp;rdquo; I searched for &amp;ldquo;Dana&amp;rdquo; in the search bar, got the results shown on the left here, and clicked that to see the additional details on the right:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/graphExplorerSearchResults.png&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/graphExplorerSearchResults.png&#34; class=&#34;centered&#34; width=&#34;400&#34; alt=&#34;Search Results&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Clicking &amp;ldquo;Add Selected&amp;rdquo; in the lower right of that dialog box put this instance on the Graph View with data underneath it in the Table View. Double-clicking the instance there expanded the graph a bit:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/graphExplorerGraphView.png&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/graphExplorerGraphView.png&#34; class=&#34;centered&#34; width=&#34;400&#34; alt=&#34;Instance and its class in Graph View&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve barely scratched the surface here. vcard is a fairly rich ontology, so exploring its structure was also fun. With a name like &amp;ldquo;AWS Graph Explorer&amp;rdquo; they are obviously pushing it for use with cloud-based datasets, but I was happy to see how easily it works with small local setups as well. To learn more before you try it out, don&amp;rsquo;t miss Fan Li&amp;rsquo;s description of his experiments with this tool.&lt;/p&gt;
 &lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1629877129092407300&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/109931923333176170&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>SPARQL and OWL on the command line—of my phone!</title>
      <link>https://www.bobdc.com/blog/rdflibonphone/</link>
      <pubDate>Sun, 22 Jan 2023 11:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdflibonphone/</guid>
      
      
      <description><div>Termux and rdflib on my Android phone.</div><div>&lt;p&gt;I recently wondered &amp;ldquo;could I run a Python script that includes the &lt;a href=&#34;https://rdflib.readthedocs.io/en/stable/&#34;&gt;rdflib&lt;/a&gt; library on my Samsung Android phone?&amp;rdquo; Five minutes later, I was doing it, and about three of those minutes were spent installing Python.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://play.google.com/store/apps/details?id=com.termux&amp;amp;hl=en_US&amp;amp;gl=US&#34;&gt;termux&lt;/a&gt; terminal emulator lets you treat your Android phone as a regular Linux machine. (I even have a Bluetooth keyboard for when I&amp;rsquo;m getting super geeky with termux on my phone by running Emacs or command line git.) From the termux command line, I installed Python and rdflib the same way I would on any machine:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;pkg install python
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;pip install rdflib
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I pasted the Python script in &amp;ldquo;A tiny example&amp;rdquo; from the &lt;a href=&#34;https://rdflib.readthedocs.io/en/stable/gettingstarted.html&#34;&gt;rdflib getting started page&lt;/a&gt; into a text file and ran it with no problem five minutes after wondering if all this would work.&lt;/p&gt;
&lt;p&gt;To point my phone&amp;rsquo;s Python scripts at the interesting place that &lt;code&gt;pkg install&lt;/code&gt; put my Python executable, I did have to make this the first line of each script:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;#!/data/data/com.termux/files/usr/bin/python
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To try a SPARQL query, I took the first example on the rdflib documentation&amp;rsquo;s &lt;a href=&#34;https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html&#34;&gt;Querying with SPARQL&lt;/a&gt; page, substituted the URL of my own ancient FOAF file &lt;a href=&#34;http://snee.com/bob/foaf.rdf&#34;&gt;http://snee.com/bob/foaf.rdf&lt;/a&gt; as the parameter for the demo script&amp;rsquo;s &lt;code&gt;g.parse()&lt;/code&gt; call, and the Python script ran fine with the expected output of the script&amp;rsquo;s SPARQL query.&lt;/p&gt;
&lt;p&gt;Then I got ambitious and tried some OWL inferencing. After I did the &lt;code&gt;pip install owlrl&lt;/code&gt; command shown at the top of the &lt;a href=&#34;https://pypi.org/project/owlrl/&#34;&gt;owlrl home page&lt;/a&gt; in termux, everything from my blog entry &lt;a href=&#34;../cmdlineowl/&#34;&gt;My command line OWL processor&lt;/a&gt; worked fine. I took the script shown at the end of that blog entry and only had to make one change (not because of the phone but, I&amp;rsquo;m guessing, because the library has evolved a bit): I removed &lt;code&gt;.decode()&lt;/code&gt; from the last line.&lt;/p&gt;
&lt;p&gt;To review the goal of the &amp;ldquo;command line OWL processor&amp;rdquo; demo, it started with Turtle data about musicians, their instruments, and the states that they were from, such as:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;d:m2 rdfs:label &amp;#34;Charlie Christian&amp;#34; ;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     dm:plays d:Guitar ;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     dm:stateOfBirth d:TX .
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;d:m4 rdfs:label &amp;#34;Kim Gordon&amp;#34; ;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     dm:plays d:Bass ;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     dm:stateOfBirth d:NY .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It also declared three OWL restriction classes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;dm:Guitarist&lt;/code&gt; as resources that have a &lt;code&gt;d:Guitar&lt;/code&gt; value for their &lt;code&gt;dm:plays&lt;/code&gt; property.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;dm:Texan&lt;/code&gt; as resources that have a value of &lt;code&gt;d:TX&lt;/code&gt; for their &lt;code&gt;dm:stateOfBirth&lt;/code&gt; property.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;dm:TexasGuitarPlayer&lt;/code&gt; as the intersection of the &lt;code&gt;dm:Guitarist&lt;/code&gt; and &lt;code&gt;dm:Texan&lt;/code&gt; classes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
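The three restriction classes lend themselves to a plain set-logic emulation. This stdlib Python sketch (toy dictionaries standing in for the Turtle data; it is not how the owlrl library works internally) computes the same memberships:

```python
# Toy facts mirroring the Turtle sample: who plays what, and where they were born.
plays = {"m2": "Guitar", "m4": "Bass"}
born = {"m2": "TX", "m4": "NY"}

def classes_of(musician):
    """Infer restriction-class membership: Guitarist and Texan follow
    from property values, and TexasGuitarPlayer is the intersection
    of those two classes."""
    inferred = set()
    if plays.get(musician) == "Guitar":
        inferred.add("Guitarist")
    if born.get(musician) == "TX":
        inferred.add("Texan")
    if {"Guitarist", "Texan"}.issubset(inferred):
        inferred.add("TexasGuitarPlayer")
    return inferred

print(sorted(classes_of("m2")))  # prints ['Guitarist', 'Texan', 'TexasGuitarPlayer']
print(sorted(classes_of("m4")))  # prints []
```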
&lt;p&gt;The sample data did not identify any of the musicians as instances of any classes; finding this out required OWL inferencing, and the &lt;code&gt;owlrl&lt;/code&gt; library made this possible. Below you can see the script being invoked:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/pythonrdfonphone1.jpg&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/pythonrdfonphone1.jpg&#34; class=&#34;centered&#34; width=&#34;400&#34; alt=&#34;Running a Python OWL script on Termux&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In this excerpt from its output, you can see that the OWL inferencing happened, and that resource &lt;code&gt;http://learningsparql.com/ns/m2&lt;/code&gt; (Charlie Christian) is an instance of all three restriction classes:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/pythonrdfonphone2.jpg&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/pythonrdfonphone2.jpg&#34; class=&#34;centered&#34; width=&#34;400&#34; alt=&#34;Output of command from previous image&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In &lt;a href=&#34;../sqlite/&#34;&gt;Converting sqlite browser cookies to Turtle and querying them with SPARQL&lt;/a&gt; I wrote that most of your computing devices probably have some SQLite data on them, and I showed that converting this data to RDF is pretty easy.  &lt;code&gt;sqlite3&lt;/code&gt; was as easy to install with termux as the other packages above; using it to reach the SQLite data on my phone was another matter. As you might guess from where &lt;code&gt;pkg install&lt;/code&gt; put my Python executable, termux has its own storage section on my phone. From there I can access music files, downloads, and other files on my phone from the termux command line, but I &lt;a href=&#34;https://mas.to/@bobdc/109580313975700412&#34;&gt;couldn&amp;rsquo;t find&lt;/a&gt; any SQLite data in my phone&amp;rsquo;s termux storage area. &lt;a href=&#34;https://en.wikipedia.org/wiki/Rooting_(Android)&#34;&gt;Rooting&lt;/a&gt; my phone would give me access to more, so that&amp;rsquo;s something to consider.&lt;/p&gt;
&lt;p&gt;Still, as the first two examples above show, a Python script with rdflib can retrieve data from the Internet and run SPARQL queries on that, so that provides some interesting possibilities. The most pleasant surprise for me about all this was just how easy it was to use this set of tools on my phone.&lt;/p&gt;
 &lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1617197043352485889&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/109733799937066871&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2023">2023</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
    </item>
    
    <item>
      <title>Web3 and Web 3.0 at OriginTrail</title>
      <link>https://www.bobdc.com/blog/origintrail/</link>
      <pubDate>Sun, 18 Dec 2022 13:17:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/origintrail/</guid>
      
      
      <description><div>An interview with CTO and co-founder Branimir Rakić</div><div>&lt;p&gt;&lt;a href=&#34;https://origintrail.io/&#34;&gt;OriginTrail&lt;/a&gt; is doing one of the most interesting combinations of blockchain technology and RDF that I have seen. In November I spoke with CTO and co-founder &lt;a href=&#34;https://twitter.com/BranaRakic&#34;&gt;Branimir Rakić&lt;/a&gt;.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;a href=&#34;https://origintrail.io/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/origintrailLogo.png&#34; alt=&#34;OriginTrail logo&#34; border=&#34;0&#34; width=&#34;220&#34; align=&#34;right&#34;  style=&#34;margin: 0px 30px 20px 40px;&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Tell me about OriginTrail.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;OriginTrail is both an ecosystem and technology stack. Its mission is to grow an open, permissionless system for discovering, verifying and querying valuable assets, be they physical or digital. It merges the benefits of two technologies—blockchains and semantic tech (both named Web3, at different times), hence forming a Decentralized Knowledge Graph, or DKG for short.&lt;/p&gt;
&lt;p&gt;With this &amp;ldquo;merge&amp;rdquo; of technologies OriginTrail enables innovative applications that transition from &amp;ldquo;managing data&amp;rdquo; to managing assets, with associated tools such as data marketplaces, knowledge tokens, and user-tailored search. It operates on a network of hundreds of nodes run by individuals and companies around the world (including the British Standards Institution, US retailers, and Swiss Railways), based on open source tech and standards such as W3C RDF/SPARQL and emerging Decentralized Identifiers and Verifiable Credentials. OriginTrail DKG can be seen as &amp;ldquo;middleware&amp;rdquo;, connecting different (often legacy) systems in a novel &amp;ldquo;Semantic Web3&amp;rdquo; network.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;How would an interested user get started using this?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One of the best ways to start is to explore the &lt;a href=&#34;https://docs.origintrail.io/&#34;&gt;official documentation&lt;/a&gt;. With a pending update of the network to version 6 (in December), we&amp;rsquo;re also about to release an updated version of the documentation with example tutorials so that would be a great starting point.&lt;/p&gt;
&lt;p&gt;Naturally, knowing about graphs and SPARQL would also be a good start.&lt;/p&gt;
&lt;p&gt;You can also develop graph-native Web3 applications, interfacing with assets on the DKG using the OriginTrail SDK. There are currently two SDKs available; one is available on the &lt;a href=&#34;https://cloudmarketplace.oracle.com/marketplace/en_US/listing/123693746&#34;&gt;Oracle Cloud Marketplace&lt;/a&gt; and another one on &lt;a href=&#34;https://marketplace.digitalocean.com/apps/origintrail-node&#34;&gt;DigitalOcean&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;I guess what I mean is, what would a brand new user set about creating as a first step of using OriginTrail?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Broadly speaking, a user can think of the OriginTrail DKG as a global decentralized graph &amp;ldquo;database&amp;rdquo; to which one can publish knowledge assets and from which one can query them. Both of these can be done using the DKG libraries (such as dkg.js) or public web interfaces (OriginTrail’s Project &amp;ldquo;Magnify&amp;rdquo;, currently in private beta).&lt;/p&gt;
&lt;p&gt;Writing (or &amp;ldquo;publishing&amp;rdquo;) would, as a first step, entail prep work on the information to be published (triple generation) and then publishing the resulting triples as knowledge assets in DKG records.&lt;/p&gt;
&lt;p&gt;For querying the DKG, one could explore the existing knowledge assets (for example, via the Project Magnify interface) and run SPARQL queries on them.&lt;/p&gt;
&lt;p&gt;Apart from being a user, since OriginTrail is a permissionless decentralized system, one can also become a &amp;ldquo;system operator&amp;rdquo; by running an OriginTrail Network node and hosting the DKG state. For hosting the state, nodes collect publishing fees in the form of TRAC tokens.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;I found &lt;a href=&#34;https://www.reddit.com/r/CryptoCurrency/comments/qykbb3/can_someone_tell_me_why_origintrail_trac_shouldnt/&#34;&gt;this&lt;/a&gt; description on reddit; would you consider it accurate?&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&amp;ldquo;OriginTrail allows anyone to store knowledge assets on its decentralized network of nodes by paying a fee. Those assets can then be queried, verified and made valuable because of the relationships that can be represented in the knowledge graph and also because of the interoperable nature of the platform.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That is a pretty good description.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Is RDF a typical format for publishing this data?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Starting with the latest version 6, yes. This is about to reach production (release on the OriginTrail DKG main network) in December.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Is there SPARQL access to the published data?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Yes. There are two ways to query the data with SPARQL. One is through a SPARQL service (one is provided in Project Magnify) which provides a gateway into the DKG. The other would be to run your own gateway by running an OriginTrail Node.&lt;/p&gt;
&lt;p&gt;On top of SPARQL access, one can verify the integrity of each triple in the graph by associating it with the issuer&amp;rsquo;s public key (associated with its blockchain account) and Merkle proofs.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;There is some way to plug your own triplestore into a node, right?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Yes, absolutely. The node connects to a triple store and is decoupled from it. It currently supports Apache Jena (Fuseki), Blazegraph and GraphDB, with plans to extend direct support for others. Essentially, you can consider the node as a &amp;ldquo;modem&amp;rdquo; for your triple store that connects it with other nodes and uses blockchains for verification and transactions.&lt;/p&gt;
&lt;p&gt;Nodes come in two flavors—full and light nodes, where light nodes do not have a triple store of their own and do not participate in running the system, but can still perform operations on it such as publish and query.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;If I&amp;rsquo;m going to publish data on one of these nodes and sell access to it, what are the potential mechanisms for my customers to pay for this data?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The payment mechanisms come in several flavors—paying with TRAC tokens, or paying with &amp;ldquo;Knowledge Tokens&amp;rdquo; (kTokens) which you can create on your own.&lt;/p&gt;
&lt;p&gt;This enables you (as Bob) to create e.g. 1000 Bob tokens, which you can sell via the blockchain as &amp;ldquo;pay as you go&amp;rdquo; access tokens for your data. This enables interesting novelties such as the application of market mechanisms for price discovery on your data.&lt;/p&gt;
&lt;p&gt;To briefly elaborate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The data you are selling would be private (kept by you, in a triple store of your choice, connected to a DKG node of your choice).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Metadata about it would be published on the DKG, to make it discoverable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Depending on how you decide to implement payments, you could opt for one of the above options.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When a buyer discovers your data, they initiate the purchase via OriginTrail smart contracts by locking the right amount of tokens (escrow fashion).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Your node verifies the initiation of the transaction (tokens in escrow) and packages the data for consumption and verification by the buyer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data is swapped for tokens. Using a &amp;ldquo;Proof of Misbehavior&amp;rdquo; system,  tokens will only be spent if the original data has been transmitted.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;OriginTrail&amp;rsquo;s website mentions the use of the W3C &lt;a href=&#34;https://www.w3.org/TR/did-core/&#34;&gt;Decentralized Identifiers&lt;/a&gt; (DIDs) specification. What does this provide to your technology?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Decentralized identifiers are the key piece of tech enabling the blockchain side of things, and the core component of UALs (Universal Asset Locators—URLs in Web3, with resources being assets). With UALs, DIDs enable a standard for provisioning ownable identifiers without a need for a central authority and without dependency on a specific technology. OriginTrail is designed to be blockchain agnostic and, via this standard, can reference any object on any decentralized network (including the DKG itself).&lt;/p&gt;
&lt;p&gt;With DIDs one can identify and interact with data issuers, verify integrity of data and fully control their identifiers.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;It sounds like this is helping to tie the blockchain technology and the W3C standards-based technologies together.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Indeed it does, and it&amp;rsquo;s one of the recommendations getting the most traction, together with W3C &lt;a href=&#34;https://www.w3.org/2017/vc/WG/&#34;&gt;Verifiable Credentials&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;That is great to hear. There are a lot of W3C Recommendations that are nice in theory but not being applied anywhere.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What kind of OriginTrail customers are using it for what kinds of applications?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;OriginTrail has been used quite a bit by enterprises. The Swiss Railway company uses it to track rail parts and maintenance events. Several food and beverage producers (whiskey, poultry, beef, etc.) use it to show ingredient provenance information to their consumers. The British Standards Institution (BSI) issues verifiable certificates for their trainings on the DKG, and US retailers such as Walmart, Target, and The Home Depot use it to exchange factory audit reports among each other in a privacy-preserving fashion. The World Federation of Hemophilia NGO uses it to track donated vaccines and medicine.&lt;/p&gt;
&lt;p&gt;Most of these applications built on top of OriginTrail aggregate information from different sources (for example, rail companies, food supply chain companies, and factories) and perform various graph traversal queries to obtain product histories and discover associated events. Many of them also use OriginTrail together with &lt;a href=&#34;https://www.gs1.org/standards/epcis&#34;&gt;GS1 EPCIS and CBV&lt;/a&gt; data models; GS1 is to the supply chain world what W3C is to the Web.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Supply chain applications seem to be a theme there. Are any of them using RDF?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Most of the applications mentioned are either already fully RDF-based or being migrated to RDF. Specifically, the ones using GS1 standardization are benefiting from RDF as it enables a great extension to the descriptive capabilities of those standards. The EPCIS 2.0 standard, which came out recently and we helped co-create through the GS1 Working group, makes this easy, as it&amp;rsquo;s created with RDF compatibility in mind. RDF and SPARQL are an important component of making these implementations easily extendable.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Is there anything else you&amp;rsquo;d like to add?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Just to reiterate that we are about to launch the latest OriginTrail version (V6) in a couple of weeks&amp;rsquo; time and are excited to showcase to the wider audience the new capabilities unlocked by incorporating RDF/SPARQL into the tech stack. The great thing about OriginTrail is that it has a vibrant community of technologists and enthusiasts who help create content in and around the DKG. It&amp;rsquo;s a truly global community with lots of resources, so I encourage everyone who is interested in finding out more to join our &lt;a href=&#34;https://discord.gg/cCRPzzmnNT&#34;&gt;Discord&lt;/a&gt; and check out the community-created resources that can be found on our &lt;a href=&#34;https://Linktr.ee/origintrail&#34;&gt;linktree&lt;/a&gt; site.&lt;/p&gt;
 &lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1604543874952634368&#34;&gt;my tweet&lt;/a&gt; (or even better, my &lt;a href=&#34;https://mas.to/@bobdc/109536090380504228&#34;&gt;Mastodon message&lt;/a&gt;) announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
    </item>
    
    <item>
      <title>SPARQL queries of git repository data</title>
      <link>https://www.bobdc.com/blog/sparqlgit/</link>
      <pubDate>Sun, 20 Nov 2022 10:58:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparqlgit/</guid>
      
      
      <description><div>If we&#39;re going to think of git data as a graph...</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/sparqlAndGitLogos.png&#34; alt=&#34;SPARQL and Git logos&#34; border=&#34;0&#34; width=&#34;220&#34; align=&#34;right&#34;  style=&#34;margin: 0px 30px 20px 40px;&#34; /&gt;
&lt;p&gt;&lt;a href=&#34;https://twitter.com/thejustman&#34;&gt;Justin Dowdy&lt;/a&gt; recently created an open source project to convert the metadata in a git repository to RDF, and I&amp;rsquo;ve been having some fun with it. Before getting into the details, as a brief demo I&amp;rsquo;ll start with a sample SPARQL query that I did to list all of the 2019 commits in my &lt;a href=&#34;https://github.com/bobdc/misc&#34;&gt;misc&lt;/a&gt; github repo:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt; 
PREFIX wd:      &amp;lt;http://www.wikidata.org/entity/&amp;gt; 
PREFIX x:       &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
PREFIX gist:    &amp;lt;https://ontologies.semanticarts.com/gist/&amp;gt; 

SELECT ?title ?dateTime WHERE {
  ?commit a wd:Q20058545 ;  # it&amp;#39;s an instance of the commit class
            dcterms:subject ?subject ;
            gist:atDateTime ?dateTime . 
  ?subject  dcterms:title ?title .
  FILTER (?dateTime &amp;gt;= &amp;#34;2019-01-01T00:00:00&amp;#34;^^x:dateTime &amp;amp;&amp;amp; 
          ?dateTime &amp;lt; &amp;#34;2020-01-01T00:00:00&amp;#34;^^x:dateTime)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It produced this result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;title                                     dateTime
-----                                     --------
adding sqlite rdf files                   2019-07-13T16:19:39-04:00
added tableList.scr                       2019-07-13T16:21:39-04:00
adding readme                             2019-07-28T12:00:55-04:00
added files to go with 2019-10 blog entry 2019-10-20T16:46:07-04:00
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Justin&amp;rsquo;s software that makes this all possible is at &lt;a href=&#34;https://github.com/justin2004/git_to_rdf&#34;&gt;https://github.com/justin2004/git_to_rdf&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once I installed that software and created a &lt;code&gt;/home/bob/temp/rdf&lt;/code&gt; directory, the following variation on the command line from Justin&amp;rsquo;s github page read my local copy of the &lt;code&gt;misc&lt;/code&gt; repo and put 35,353 triples about it in two files in &lt;code&gt;/mnt/temp/rdf&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;/home/bob/git/git_to_rdf/git_to_rdf.sh \
  --repository /mnt/git/misc  --output /mnt/temp/rdf
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(Referencing &lt;code&gt;/home/bob/temp/rdf&lt;/code&gt; as &lt;code&gt;/mnt/temp/rdf&lt;/code&gt; is a Docker thing that I don&amp;rsquo;t completely understand myself. Justin said that he is working to simplify that.) I loaded the new triples into &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Jena Fuseki&lt;/a&gt; and tried a few of my &lt;a href=&#34;../exploringadataset/&#34;&gt;Queries to explore a dataset&lt;/a&gt; that I typically use, which is how I found out that it had 35K triples.&lt;/p&gt;
&lt;p&gt;To really understand the possibilities, read Justin&amp;rsquo;s blog entry &lt;a href=&#34;https://github.com/justin2004/weblog/tree/master/git_repo_as_rdf&#34;&gt;Git Repositories as RDF Graphs&lt;/a&gt;. I especially like how it explained that he didn&amp;rsquo;t necessarily have to make &amp;ldquo;thoughtful&amp;rdquo; RDF (well-modeled RDF that takes advantage of standard vocabularies) and why and how he did so. His blog entry also includes a nice diagram of his data model, generated with &lt;a href=&#34;https://www.oxfordsemantic.tech/product&#34;&gt;RDFox&lt;/a&gt;, that you&amp;rsquo;ll want to keep handy while you develop any queries for your own git repo data converted to RDF.&lt;/p&gt;
&lt;p&gt;Several of his sample queries will be especially useful for querying git repos that have commits from multiple people. He demonstrates these with RDF generated from the repo for the cURL utility that I have written about here &lt;a href=&#34;https://www.google.com/search?q=curl&amp;amp;as_sitesearch=bobdc.com&#34;&gt;many times&lt;/a&gt;. My &lt;code&gt;misc&lt;/code&gt; repo that I used to generate RDF only has commits from me, so these sample queries were less useful to me, but they still provided a good model for how to get at certain kinds of repo information.&lt;/p&gt;
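&lt;p&gt;For a repo with several contributors, a query along the following lines could rank committers by number of commits. (This is only a sketch: the author property shown here is a made-up placeholder, so check the data model diagram in Justin&amp;rsquo;s blog entry for the actual predicate to use.)&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt; 

SELECT ?author (COUNT(?commit) AS ?commitCount) WHERE {
  ?commit a wd:Q20058545 ;   # it&amp;#39;s a commit
          &amp;lt;http://example.com/author&amp;gt; ?author .  # placeholder predicate
}
GROUP BY ?author
ORDER BY DESC(?commitCount)
&lt;/code&gt;&lt;/pre&gt;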
&lt;p&gt;To build on what he wrote there I wanted to create at least one more query that was different from his examples, so I created this one to find  the commits that used blocks of text with the word &amp;ldquo;music&amp;rdquo; in them:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX wd:      &amp;lt;http://www.wikidata.org/entity/&amp;gt; 
PREFIX gist:    &amp;lt;https://ontologies.semanticarts.com/gist/&amp;gt; 
PREFIX dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt;

SELECT DISTINCT ?commitTitle ?commitTime ?filename ?textLine  WHERE {
  
  ?commit a wd:Q20058545 ; # it&amp;#39;s a commit
          gist:hasPart ?part ;
          dcterms:subject ?commitSubject ;
          gist:atDateTime ?commitTime . 
  
  ?commitSubject dcterms:title ?commitTitle .
  
  ?part gist:produces  ?contiguousLines .
  
  ?contiguousLines gist:occursIn ?file ; 
                   &amp;lt;http://example.com/containedTextContainer&amp;gt; ?textContainer . 
  
  ?file gist:name ?filename .
    ?textContainer ?line ?textLine .
  
  FILTER(contains(?textLine,&amp;#34;music&amp;#34;))
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;And here is the result:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/sparqlGitQueryResult.png&#34; alt=&#34;query result&#34; border=&#34;0&#34;/&gt;
&lt;p&gt;This combination of the world&amp;rsquo;s most popular version control system and the ability to manipulate metadata about what it contains could provide the basis for a Content Management System in the broader original sense of the term: something to manage the storage and workflow of multiple kinds of content for multiple kinds of publication media. (In recent years the term&amp;rsquo;s meaning has narrowed to mean &amp;ldquo;platform to help automate web publishing&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s just one of the possibilities. Read Justin&amp;rsquo;s blog entry and see what ideas it gives you!&lt;/p&gt;
 &lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1594361640987901952&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/git">git</category>
      
    </item>
    
    <item>
      <title>Your own free, publicly available SPARQL endpoint</title>
      <link>https://www.bobdc.com/blog/ec2fuseki/</link>
      <pubDate>Sun, 23 Oct 2022 11:59:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/ec2fuseki/</guid>
      
      
      <description><div>Free as in tier.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/sparqlAndEc2Logos.png&#34; alt=&#34;SPARQL and EC2 logos&#34; border=&#34;0&#34; width=&#34;220&#34; align=&#34;right&#34;  style=&#34;margin: 0px 30px 20px 40px;&#34; /&gt;
&lt;p&gt;There are a few tutorials out there about how to start up your own free-tier Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance and then run your own publicly available web server. I&amp;rsquo;ve planned for a while to try this with a &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Jena Fuseki&lt;/a&gt; triplestore and SPARQL endpoint, but I postponed it because I thought it might be complicated. It turned out to be pretty easy.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://medium.com/@KerrySheldon/ec2-exercise-1-1-host-a-static-webpage-9732b91c78ef&#34;&gt;EC2 Exercise 1.1: Host a Static Webpage&lt;/a&gt; by Kerry Sheldon is an example of one of the tutorials described above, and it was a good starting point for putting up an Apache web server. Because AWS now has a  &amp;ldquo;new launch experience&amp;rdquo; I couldn&amp;rsquo;t follow her 2018 instructions exactly, but my first few instructions below are based on hers.&lt;/p&gt;
&lt;h2 id=&#34;tell-aws-you-want-to-launch-an-instance&#34;&gt;Tell AWS you want to launch an instance&lt;/h2&gt;
&lt;p&gt;If you don&amp;rsquo;t have an AWS account, create one. Then log in and pick EC2 on the &lt;a href=&#34;https://us-east-1.console.aws.amazon.com/console/&#34;&gt;AWS Console&lt;/a&gt; and &amp;ldquo;Launch Instance&amp;rdquo; from the orange &amp;ldquo;Launch Instance&amp;rdquo; button&amp;rsquo;s dropdown menu.&lt;/p&gt;
&lt;h2 id=&#34;configure-and-launch-the-instance&#34;&gt;Configure and launch the instance&lt;/h2&gt;
&lt;p&gt;The older version of this &amp;ldquo;experience&amp;rdquo; was more of a wizard leading you through various small screens to fill out. The current version has one big screen where you fill out these details:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Add something to the &amp;ldquo;Name&amp;rdquo; field like &amp;ldquo;Fuseki server&amp;rdquo;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pick from the &amp;ldquo;Application and OS Images&amp;rdquo; selection. This includes a field where you can search from many choices or, under that, you can pick one of the Quick Start choices. I clicked the blue Amazon Linux AWS Quick Start category and then, under that, picked the first choice: &amp;ldquo;Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type Free tier eligible&amp;rdquo;. Scrolling down that list you can see more machine-learning-oriented images with additional features such as GPUs and PyTorch. This is one of those places where you have to be careful to pick something that will cost you little or nothing, and it&amp;rsquo;s up to you to keep track of that. (After all my experiments with this project so far, as I write the first draft of this blog entry the AWS &lt;a href=&#34;https://us-east-1.console.aws.amazon.com/billing&#34;&gt;billing management&lt;/a&gt; screen says that I currently owe them 20 cents.) I went with the first free tier choice mentioned above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Under that is the &amp;ldquo;Instance type&amp;rdquo;. I selected the first choice there, &amp;ldquo;t2.micro&amp;rdquo; which is also Free tier eligible. Again, it&amp;rsquo;s up to you to make the choice that will cost you little or nothing, and some of the choices can be expensive.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Under that, create or select  a Key Pair—a public and private key combination that will let you log in to your new instance from your local machine. If you are an AWS user and have an existing one you can pick it from the dropdown list there. If you don&amp;rsquo;t have one, click &amp;ldquo;Create new key pair&amp;rdquo;, give it a name such as fuseki-key-pair, leave the other settings at their default, and click the orange &amp;ldquo;Create key pair&amp;rdquo; button. It will create one with a name like &lt;code&gt;fuseki-key-pair.pem&lt;/code&gt; that your browser downloads. Save that (a typical destination would be the &lt;code&gt;.ssh&lt;/code&gt; subdirectory of your home directory) and remember where you saved it for later.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Moving down the instance configuration page, the next box to fill out is &amp;ldquo;Network settings&amp;rdquo;. &amp;ldquo;Allow SSH traffic from Anywhere&amp;rdquo; is checked as a default, meaning that anyone can use the &lt;code&gt;ssh&lt;/code&gt; utility for shell access to your instance from anywhere on the Internet. (Shell access will also need the file that you downloaded in the previous step, so that&amp;rsquo;s a somewhat decent level of security. As with the potential costs, it&amp;rsquo;s up to you to research other configurations if that&amp;rsquo;s what you need.) Add checks to the &amp;ldquo;Allow HTTPS traffic&amp;rdquo; and &amp;ldquo;Allow HTTP traffic&amp;rdquo; checkboxes so that browsers and other tools can send HTTP requests to your web server or Fuseki SPARQL endpoint.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Scroll around to see the other things you can set, leave them at their default for this exercise, and click the orange &amp;ldquo;Launch instance&amp;rdquo; button. After a few seconds you should see a screen that says &amp;ldquo;Success&amp;rdquo; with an orange &amp;ldquo;View all instances&amp;rdquo; button in the lower right. Click that to display the Instances list.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;review-your-running-instance-and-start-a-terminal-session-with-it&#34;&gt;Review your running instance and start a terminal session with it&lt;/h2&gt;
&lt;p&gt;Sometimes, when doing this, I didn&amp;rsquo;t see my new instance right away. If this happens to you, wait a minute, reload your browser, and you should eventually see it. The instances list will show that the &amp;ldquo;Instance state&amp;rdquo; of your new instance is already &amp;ldquo;Running&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Click the checkbox to the left of your instance on the instances list. From the &amp;ldquo;Instance state&amp;rdquo; dropdown at the top you will see that this is the place to Stop, Start, and Terminate the instance, along with a few other options.&lt;/p&gt;
&lt;p&gt;The tabs below the instance list let you do further configuration of the checked instance. The Security tab shows &amp;ldquo;Inbound rules&amp;rdquo; that allow inbound traffic on port 22 for SSH, 80 for HTTP, and 443 for HTTPS.&lt;/p&gt;
&lt;p&gt;Fuseki uses port 3030 as a default, so add a rule for that: on the Security tab under &amp;ldquo;Security groups&amp;rdquo; click the Security group name of &lt;code&gt;sg-long-hex-number&lt;/code&gt; and then under Inbound rules click &amp;ldquo;Edit Inbound Rules&amp;rdquo;. Click &amp;ldquo;Add rule&amp;rdquo; to create a new one with a &amp;ldquo;Port range&amp;rdquo; of 3030. Set the sixth column to 0.0.0.0/0 like the others by picking &amp;ldquo;Anywhere-IPv4&amp;rdquo; from the fifth column&amp;rsquo;s dropdown. Leave the Type value at &amp;ldquo;Custom TCP&amp;rdquo; and click &amp;ldquo;Save rules&amp;rdquo; at the bottom.&lt;/p&gt;
&lt;p&gt;Now your instance is all set up. Pick &amp;ldquo;Instances&amp;rdquo; under &amp;ldquo;Instances&amp;rdquo; (yes, a bit confusing) on the left to return to your Instances list, go back to the Details tab to the left of your new instance&amp;rsquo;s Security tab, and copy the Public IPv4 address into your clipboard. I will use 12.345.678.90 in my examples below, so substitute yours for that. There are ways to map these IP addresses to registered domain names, but for this exercise, that address will be your server&amp;rsquo;s name when you use &lt;code&gt;ssh&lt;/code&gt; or a web browser to do anything with it.&lt;/p&gt;
&lt;p&gt;Before you log in to your new machine you will need to reset the permissions on the &lt;code&gt;pem&lt;/code&gt; file that you downloaded earlier to something acceptable to your &lt;code&gt;ssh&lt;/code&gt; utility, because the default permissions after downloading are too permissive. Enter the following, adjusting the path as necessary for the file you downloaded:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;chmod &lt;span style=&#34;color:#ae81ff&#34;&gt;400&lt;/span&gt; ~/.ssh/fuseki-key-pair.pem
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;(Windows users will have &lt;a href=&#34;https://superuser.com/questions/106181/equivalent-of-chmod-to-change-file-permissions-in-windows&#34;&gt;some other command&lt;/a&gt; to use instead of &lt;code&gt;chmod&lt;/code&gt;, and also may be using  &lt;a href=&#34;http://www.putty.org&#34;&gt;PuTTY&lt;/a&gt; instead of &lt;code&gt;ssh&lt;/code&gt;. I&amp;rsquo;m not sure of the exact Windows syntax to do these tasks, but they shouldn&amp;rsquo;t be difficult to find out.)&lt;/p&gt;
&lt;p&gt;In a shell window on your local computer, enter the following command, substituting the IPv4 address that you copied above and pointing the &lt;code&gt;-i&lt;/code&gt; parameter to the file that you downloaded earlier:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ssh -i ~/.ssh/fuseki-key-pair.pem ec2-user@12.345.678.90
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A prompt will ask if you are sure you want to continue, so answer yes, and then you&amp;rsquo;ll be logged in to your new instance as it waits for you to tell it what to do:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt; 
       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
[ec2-user@ip-987-65-4-321 ~]$ 
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;download-and-unzip-the-jena-software&#34;&gt;Download and unzip the Jena software&lt;/h2&gt;
&lt;p&gt;You will need the software for the Fuseki server itself and also the Jena tools that let you load data into that server and work with that data. (I described some of those tools in the &lt;a href=&#34;../jenagems/#fusekiDatasets&#34;&gt;Working with Fuseki datasets from the command line&lt;/a&gt; section of my blog post &lt;a href=&#34;../jenagems&#34;&gt;Hidden gems included with Jena’s command line utilities&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;After visiting the &lt;a href=&#34;https://jena.apache.org/download/index.cgi&#34;&gt;Jena download page&lt;/a&gt; to find the URLs of these distribution files I executed these commands at the EC2 prompt to retrieve the files to the current directory:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.6.1.zip
wget https://dlcdn.apache.org/jena/binaries/apache-jena-4.6.1.zip 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(If this posting that you are reading is more than a few months old, you&amp;rsquo;ll want to check the download page yourself to get more recent versions of these files.)&lt;/p&gt;
&lt;p&gt;Unzip the two files you downloaded, for example with &lt;code&gt;unzip apache-jena-fuseki-4.6.1.zip&lt;/code&gt; and &lt;code&gt;unzip apache-jena-4.6.1.zip&lt;/code&gt;. (For demo purposes, you can just do it from your new instance&amp;rsquo;s root directory. For a more serious production system you would want to create some directories to organize all this better.)&lt;/p&gt;
&lt;h2 id=&#34;install-java&#34;&gt;Install Java&lt;/h2&gt;
&lt;p&gt;Jena is a Java-based tool, and the default version of this EC2 instance doesn&amp;rsquo;t have Java, so you have to add it. I found the x64 RPM Package URL on &lt;a href=&#34;https://www.oracle.com/java/technologies/downloads/&#34;&gt;https://www.oracle.com/java/technologies/downloads/&lt;/a&gt;. The next two commands pull that package into the EC2 instance and then install it there:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;   wget https://download.oracle.com/java/19/latest/jdk-19_linux-x64_bin.rpm
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;   sudo yum localinstall jdk-19_linux-x64_bin.rpm 
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;try-the-fuseki-server&#34;&gt;Try the Fuseki server&lt;/h2&gt;
&lt;p&gt;Change into the directory with the Fuseki binary (created by unzipping above) and see if it responds to a simple command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cd apache-jena-fuseki-4.6.1/
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;./fuseki-server --help
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you see the help information, that means that you installed Fuseki and Java correctly.&lt;/p&gt;
&lt;p&gt;Now let&amp;rsquo;s start it up for real:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;./fuseki-server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Give it a few seconds until the status messages stop scrolling and then send a browser to port 3030 of the Public IPv4 address you saved earlier. Your URL will be something like http://12.345.678.90:3030/.&lt;/p&gt;
&lt;p&gt;You should see the main Apache Jena Fuseki management screen, with the message &amp;ldquo;No datasets created - add one&amp;rdquo;. Don&amp;rsquo;t bother to click on &amp;ldquo;add one&amp;rdquo;, because this server doesn&amp;rsquo;t have permission to write to your new instance&amp;rsquo;s disk storage, even if you had started &lt;code&gt;fuseki-server&lt;/code&gt; with its &lt;code&gt;--update&lt;/code&gt; switch.  We will load data using the Jena tools.&lt;/p&gt;
&lt;h2 id=&#34;create-an-empty-dataset-for-the-triples-that-you-will-load&#34;&gt;Create an empty dataset for the triples that you will load&lt;/h2&gt;
&lt;p&gt;In the shell window where you started up Fuseki, press ^C to shut it down, because the command line tools that you&amp;rsquo;re about to use don&amp;rsquo;t work with a server that is up and running. Make the root directory your default and, as a sample data set to load, get the data file I created for &lt;a href=&#34;../sparql-queries-of-beatles-reco/&#34;&gt;SPARQL queries of Beatles recording sessions&lt;/a&gt;. With this data loaded in Fuseki, people will be able to query its endpoint about who played what instruments on which Beatles recordings:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cd 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;wget https://bobdc.com/miscfiles/BeatlesMusicians.ttl
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To tell Fuseki which named dataset on the server should receive your data, you need to identify that dataset&amp;rsquo;s assembler file. Your new Fuseki instance has no datasets or assembler files, so how can we create them?&lt;/p&gt;
&lt;p&gt;As I explained in the introduction to  &lt;a href=&#34;../jenagems/#fusekiDatasets&#34;&gt;Working with Fuseki datasets from the command line&lt;/a&gt;, instead of learning the syntax of these files I found that I could just create one with the web interface to a Fuseki server running on my local machine, as long as I started it up with the &lt;code&gt;--update&lt;/code&gt; switch so that the web interface would have write permission. For that one, I called the dataset that I created dataset2, and Fuseki put the assembler file into &lt;code&gt;~/apache-jena-fuseki/run/configuration/dataset2.ttl&lt;/code&gt; on my local machine. I put a &lt;a href=&#34;https://bobdc.com/miscfiles/dataset2.ttl&#34;&gt;copy of that &lt;code&gt;dataset2.ttl&lt;/code&gt; file on my blog&amp;rsquo;s server&lt;/a&gt; so that I could &lt;code&gt;wget&lt;/code&gt; it to my EC2 instance. (I could have also &lt;code&gt;sftp&lt;/code&gt;&amp;rsquo;d it from my local machine to the EC2 instance, but this way it&amp;rsquo;s available to others who want to try the same thing.)&lt;/p&gt;
&lt;p&gt;From your EC2 shell&amp;rsquo;s root directory, execute the following to change into the directory where assembler files get stored, get a copy of the assembler file mentioned above, and rename it for the Beatles data:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cd apache-jena-fuseki-4.6.1/run/configuration
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;wget https://bobdc.com/miscfiles/dataset2.ttl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;mv dataset2.ttl beatlesSessions.ttl
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next, you need to edit it for your new dataset. The &lt;code&gt;vi&lt;/code&gt; and &lt;code&gt;nano&lt;/code&gt; editors are included with this Amazon Linux 2 image, but I need my emacs, so I installed it:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;sudo yum install emacs
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Open up &lt;code&gt;beatlesSessions.ttl&lt;/code&gt; with your editor. Near the bottom you will see some triples that look like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;:tdb_dataset_readwrite
	rdf:type       tdb2:DatasetTDB2 ;
        tdb2:location  &amp;#34;/home/bob/bin/apache-jena-fuseki/run/databases/dataset2&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(Isn&amp;rsquo;t it nice that the configuration file for this triplestore stores everything as triples?) Change that &lt;code&gt;tdb2:location&lt;/code&gt; value to &amp;ldquo;/home/ec2-user/apache-jena-fuseki-4.6.1/run/databases/dataset2/beatlesSessions&amp;rdquo;, do a global replace of &amp;ldquo;dataset2&amp;rdquo; with &amp;ldquo;BeatlesSessions&amp;rdquo; elsewhere in the file (including in the pathname that you entered in the previous step), save the file, and quit out of your editor.&lt;/p&gt;
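&lt;p&gt;If you&amp;rsquo;d rather not edit the file by hand, the same two changes can be scripted with &lt;code&gt;sed&lt;/code&gt;. This is just a sketch of the steps described above; it assumes GNU sed&amp;rsquo;s &lt;code&gt;-i&lt;/code&gt; in-place editing, and the stand-in file it creates reproduces only the triples shown, so that the commands can be tried anywhere:&lt;/p&gt;

```shell
# Stand-in for beatlesSessions.ttl holding just the triples shown above;
# on the EC2 instance, skip this step and work on the real file.
printf '%s\n' \
  ':tdb_dataset_readwrite' \
  '    rdf:type       tdb2:DatasetTDB2 ;' \
  '    tdb2:location  "/home/bob/bin/apache-jena-fuseki/run/databases/dataset2" .' \
  > beatlesSessions.ttl

# Step 1: point tdb2:location at the database directory on the EC2 instance
sed -i 's|/home/bob/bin/apache-jena-fuseki/run/databases/dataset2|/home/ec2-user/apache-jena-fuseki-4.6.1/run/databases/dataset2/beatlesSessions|' beatlesSessions.ttl

# Step 2: the global replace of dataset2 with BeatlesSessions
sed -i 's/dataset2/BeatlesSessions/g' beatlesSessions.ttl

# show the resulting location triple
grep 'tdb2:location' beatlesSessions.ttl
```

&lt;p&gt;Either way, the result is the same edited assembler file, so use whichever you find less error-prone.&lt;/p&gt;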
&lt;p&gt;Now that you&amp;rsquo;ve created this empty dataset for the server, let&amp;rsquo;s make sure that Fuseki recognizes it before we load any data. Change into the Fuseki directory and start up the Fuseki server again:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cd ~/apache-jena-fuseki-4.6.1/
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;./fuseki-server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After the startup status messages stop scrolling, send your browser to the same IP address you did before. You should see &lt;code&gt;/BeatlesSessions&lt;/code&gt; listed as an available dataset. If you like, you can click the &amp;ldquo;query&amp;rdquo; action and run the default query, which asks for ten triples. (Click the dark gray triangle to the right of the query to actually execute it.) It won&amp;rsquo;t get any data, but it shouldn&amp;rsquo;t show an error, either, so you know that the query engine works with this dataset.&lt;/p&gt;
&lt;h2 id=&#34;load-some-triples-into-the-new-dataset&#34;&gt;Load some triples into the new dataset&lt;/h2&gt;
&lt;p&gt;At the shell window, press ^C to end the server session and go back to the command prompt. With the following two commands, go back to the root directory and, before loading data with Jena&amp;rsquo;s &lt;code&gt;tdbloader&lt;/code&gt; tool, use the &lt;code&gt;riot&lt;/code&gt; tool to verify that the data file we&amp;rsquo;re about to load has no syntax problems; data load time is not a good time to find out about such problems:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cd
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;./apache-jena-4.6.1/bin/riot --validate BeatlesMusicians.ttl
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You shouldn&amp;rsquo;t see any error messages.&lt;/p&gt;
&lt;p&gt;Next, load that data into your new dataset by pointing the &lt;code&gt;tdb2.tdbloader&lt;/code&gt; command line tool at the data file and at the dataset&amp;rsquo;s assembler file (this is a single long command that I split up to show here, but pasting it as shown worked for me):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;   ./apache-jena-4.6.1/bin/tdb2.tdbloader --tdb &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;      ./apache-jena-fuseki-4.6.1/run/configuration/beatlesSessions.ttl &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;      BeatlesMusicians.ttl
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;(Read more about &lt;code&gt;riot&lt;/code&gt;, &lt;code&gt;tdbloader&lt;/code&gt;, and their companion utilities at  &lt;a href=&#34;../jenagems/#fusekiDatasets&#34;&gt;Working with Fuseki datasets from the command line&lt;/a&gt;. These will let you edit and perform other maintenance on the data loaded in Fuseki.)&lt;/p&gt;
&lt;h2 id=&#34;query-the-data&#34;&gt;Query the data&lt;/h2&gt;
&lt;p&gt;Start up the server again:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cd ~/apache-jena-fuseki-4.6.1/
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;./fuseki-server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run that default query again, and this time you should see ten triples about the Beatles&amp;rsquo; recording sessions.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s try a more interesting query. Paul was known as the bass player but sometimes added guitar solos. On which songs? Paste the following into that query screen and run it to find out:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX s:     &amp;lt;http://learningsparql.com/ns/schema/&amp;gt; 
PREFIX i:     &amp;lt;http://learningsparql.com/ns/instrument/&amp;gt; 
PREFIX rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 
PREFIX m:     &amp;lt;http://learningsparql.com/ns/musician/&amp;gt; 
SELECT ?title WHERE { 
  ?song a s:Song ;
  i:leadguitar m:PaulMcCartney .
  ?song rdfs:label ?title . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You will see a surprising number of songs where he played lead guitar. (Optional step: check out his &lt;a href=&#34;https://www.youtube.com/watch?v=sjb9AxDkwAQ#t=1m16s&#34;&gt;amazing solo&lt;/a&gt;  on &amp;ldquo;Good Morning Good Morning&amp;rdquo;. Be sure to wait for the last lick, after John sings &amp;ldquo;it&amp;rsquo;s time for tea and meet the wife&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;Remember, what you see in your browser is not a SPARQL endpoint, but the HTML interface to one. &lt;a href=&#34;../endpointandcurl/&#34;&gt;There&amp;rsquo;s an important difference&lt;/a&gt;. To really test this as a SPARQL endpoint, paste the query above into a file on your local machine (or any machine with web access) called &lt;code&gt;paulquery.rq&lt;/code&gt; and then enter the following at the machine&amp;rsquo;s command prompt, substituting the IPv4 address that you copied above into the URL:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;curl --data-urlencode &amp;#34;query@paulquery.rq&amp;#34; \
   http://12.345.678.90:3030/BeatlesSessions/sparql
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It should display a JSON version of the query results. (You can learn how to customize this behavior in my blog posting &lt;a href=&#34;../curling-sparql&#34;&gt;Curling SPARQL&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id=&#34;your-own-sparql-web-server&#34;&gt;Your own SPARQL web server&lt;/h2&gt;
&lt;p&gt;If it works with curl, it will work with all kinds of other tools, letting those applications take advantage of the data you provide over your new SPARQL endpoint.  A few more points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Be careful in the options you pick when setting this up, because some can get expensive. I copied this from one of the setup pages: &amp;ldquo;Free tier: In your first year includes 750 hours of t2.micro (or t3.micro in the Regions in which t2.micro is unavailable) instance usage on free tier AMIs per month, 30 GiB of EBS storage, 2 million IOs, 1 GB of snapshots, and 100 GB of bandwidth to the internet.&amp;rdquo; The &lt;a href=&#34;https://aws.amazon.com/ec2/instance-types/t2/&#34;&gt;Amazon EC2 T2 Instances&lt;/a&gt; page says that a t2 micro instance costs  $0.0116 per hour, which works out to about $1.95 per week. Of course, if you want to scale way up and host a ton of data on a faster instance, the more expensive options are available.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;That being said, forgetting about it for a year and then owing AWS a hundred bucks would be no fun. Remember to stop your instance when the time is right and to check the &lt;a href=&#34;https://us-east-1.console.aws.amazon.com/billing&#34;&gt;billing management&lt;/a&gt; screen every now and then.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The  &lt;a href=&#34;https://medium.com/@KerrySheldon/ec2-exercise-1-1-host-a-static-webpage-9732b91c78ef&#34;&gt;EC2 Exercise 1.1: Host a Static Webpage&lt;/a&gt; article mentioned above explains how to add a regular Apache web server to your EC2 instance so that you can host static web pages from your new EC2 instance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
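&lt;p&gt;(A quick check of the weekly cost arithmetic quoted above:)&lt;/p&gt;

```shell
# t2.micro hourly rate times 24 hours times 7 days = cost per week, in dollars
awk 'BEGIN { printf "%.4f\n", 0.0116 * 24 * 7 }'   # prints 1.9488
```

&lt;p&gt;Rounded, that matches the &amp;ldquo;about $1.95 per week&amp;rdquo; figure.&lt;/p&gt;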
&lt;p&gt;The most important thing is that you can use some robust open source software to create a SPARQL endpoint that costs practically nothing and is available to everyone on the Internet. That provides some big opportunities for standards-based data publishing.&lt;/p&gt;
 &lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1584217353314717697&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/fuseki">Fuseki</category>
      
    </item>
    
    <item>
      <title>More Picasso paintings in one year than all the Vermeer paintings?</title>
      <link>https://www.bobdc.com/blog/picassovermeer/</link>
      <pubDate>Sun, 25 Sep 2022 12:09:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/picassovermeer/</guid>
      
      
      <description><div>Answering an art history question with SPARQL.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/w1500-Vermeer-Lady-Letter.jpg&#34; alt=&#34;Woman Writing a Letter, with her Maid by Johannes Vermeer&#34; border=&#34;0&#34; width=&#34;240&#34; align=&#34;right&#34;  style=&#34;margin: 0px 30px 20px 40px;&#34; /&gt;
&lt;p&gt;Sometimes a question pops into my head that, although unrelated to computers, could likely be answered with a SPARQL query. I don&amp;rsquo;t necessarily know the query off the top of my head and have to work it out. I&amp;rsquo;m going to discuss an example of one that I worked out and the steps that I took, because I wanted to show how I navigated the Wikidata data model to get what I wanted.&lt;/p&gt;
&lt;p&gt;On a recent trip to Dublin my wife and I went to Dublin&amp;rsquo;s wonderful &lt;a href=&#34;https://www.nationalgallery.ie&#34;&gt;National Gallery of Ireland&lt;/a&gt;. Among other paintings we saw Vermeer&amp;rsquo;s &lt;a href=&#34;https://www.nationalgallery.ie/art-and-artists/highlights-collection/woman-writing-letter-her-maid-johannes-vermeer-1632-1675&#34;&gt;Woman Writing a Letter, with her Maid&lt;/a&gt; and Picasso&amp;rsquo;s &lt;a href=&#34;https://www.nationalgallery.ie/art-and-artists/highlights-collection/still-life-mandolin-pablo-picasso-1881-1973&#34;&gt;Still Life with a Mandolin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Seeing any Vermeer is a treat because there are so few of them around, and the way he depicts light makes for a huge difference between seeing a picture of the painting and seeing the real thing in front of you. (Remember, when you see these dumb discussions about AI-generated &amp;ldquo;paintings&amp;rdquo;: we can discuss &lt;a href=&#34;https://www.amazon.com/What-Art-Arthur-C-Danto/dp/0300205716/bobducharmeA/&#34;&gt;whether they&amp;rsquo;re art or not&lt;/a&gt;, but they&amp;rsquo;re not paintings if there is no paint. They&amp;rsquo;re PNG and JPG files. If you compare the image above with the Vermeer hanging on the wall at the National Gallery of Ireland you&amp;rsquo;ll see what a tremendous difference that can be.) The Picasso was also great to see live because it was from his more colorful late cubist period; while some of his related collages included bits of wall paper, for this one he painted wallpaper-like patterns onto the canvas.&lt;/p&gt;
&lt;!-- hspace and vspace not making any difference --&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/stillLifeWithAMandolin.jpg&#34; width=&#34;300&#34; alt=&#34;Still Life with a Mandolin&#34; border=&#34;0&#34; align=&#34;right&#34;  style=&#34;margin: 30px 30px 20px 40px;&#34;/&gt;
&lt;p&gt;We know that Picasso was very prolific for many decades. This led me to wonder: was there any single year of Picasso&amp;rsquo;s career where he produced more paintings than Vermeer produced in his whole life? (Judging, in both cases, by surviving paintings that we have record of.)&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/Johannes_Vermeer&#34;&gt;Wikipedia page for Vermeer&lt;/a&gt; tells us that &amp;ldquo;only 34 paintings are universally attributed to him today&amp;rdquo;, so I didn&amp;rsquo;t need SPARQL for that. The question for me to answer was this: were there any years where Picasso painted more than 34 paintings?&lt;/p&gt;
&lt;h2 id=&#34;what-triples-say-picasso-made-this-painting&#34;&gt;What triples say &amp;ldquo;Picasso made this painting&amp;rdquo;?&lt;/h2&gt;
&lt;p&gt;First I had to identify how Wikidata tells us that Picasso painted a given painting. I started with one of his most famous ones and clicked &lt;a href=&#34;https://www.wikidata.org/wiki/Special:EntityPage/Q175036&#34;&gt;Wikidata item&lt;/a&gt; on the left side of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Guernica_(Picasso)&#34;&gt;Guernica (Picasso)&lt;/a&gt; Wikipedia page. This showed me that Q175036 is the Wikidata identifier for this painting. I knew that the Wikidata triples with subjects that build on this ID would provide some good clues about developing a query that could count up his paintings per year.&lt;/p&gt;
&lt;h3 id=&#34;what-triples-say-its-a-painting&#34;&gt;What triples say &amp;ldquo;It&amp;rsquo;s a painting&amp;rdquo;?&lt;/h3&gt;
&lt;p&gt;I didn&amp;rsquo;t want to count up all his artworks per year, but just his paintings, so I entered the following query and &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20wd%3AQ175036%20wdt%3AP31%20%3Fclass%20.%0A%20%20%3Fclass%20rdfs%3Alabel%20%3Fname%20.%0A%20%20FILTER%20%28lang%28%3Fname%29%20%3D%20%22en%22%29%0A%7D%0A&#34;&gt;executed it&lt;/a&gt; to see what class Guernica was an instance of. (Note that instead of using &lt;code&gt;rdf:type&lt;/code&gt; or &lt;code&gt;a&lt;/code&gt; as a property meaning &amp;ldquo;is an instance of&amp;rdquo;, Wikidata uses &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P31&#34;&gt;wdt:P31&lt;/a&gt;. Being reminded of this was part of my navigation around the Wikidata data model that I mentioned above.)&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
  wd:Q175036 wdt:P31 ?class .
  ?class rdfs:label ?name .
  FILTER (lang(?name) = &amp;#34;en&amp;#34;)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This showed that it is an instance of &lt;a href=&#34;https://www.wikidata.org/wiki/Q3305213&#34;&gt;wd:Q3305213&lt;/a&gt;, or &amp;ldquo;painting&amp;rdquo;.&lt;/p&gt;
&lt;h2 id=&#34;what-triples-say-its-by-picasso&#34;&gt;What triples say &amp;ldquo;It&amp;rsquo;s by Picasso&amp;rdquo;?&lt;/h2&gt;
&lt;p&gt;I went to the &lt;a href=&#34;https://en.wikipedia.org/wiki/Pablo_Picasso&#34;&gt;Wikipedia page for Picasso&lt;/a&gt;, picked &lt;a href=&#34;https://www.wikidata.org/wiki/Special:EntityPage/Q5593&#34;&gt;Wikidata item&lt;/a&gt;, and saw that Picasso&amp;rsquo;s Wikidata identifier is Q5593.&lt;/p&gt;
&lt;p&gt;Next, I did a &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20wd%3AQ175036%20%3Fp%20%3Fo%20%0A%7D%20%0A&#34;&gt;very simple query&lt;/a&gt; for all the data about the painting Guernica:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
  wd:Q175036 ?p ?o 
} 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result of this query included  &amp;ldquo;wdt:P170 wd:Q5593&amp;rdquo;.  If &lt;code&gt;wd:Q5593&lt;/code&gt; is Picasso, what is &lt;code&gt;wdt:P170&lt;/code&gt;? This is easy enough to find out when executing the query with the Wikidata SPARQL endpoint HTML form: I just clicked on this name in the query result and it &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P170&#34;&gt;showed me&lt;/a&gt; that &lt;code&gt;wdt:P170&lt;/code&gt; means &amp;ldquo;creator&amp;rdquo;.&lt;/p&gt;
&lt;h2 id=&#34;what-triples-say-what-year-a-painting-was-created&#34;&gt;What triples say what year a painting was created?&lt;/h2&gt;
&lt;p&gt;The Wikipedia page for Guernica says that it was created in 1937. The earlier result of asking for all the triples about the painting showed that it has a &lt;code&gt;wdt:P571&lt;/code&gt; value of &amp;ldquo;1 January 1937&amp;rdquo;, where &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P571&#34;&gt;wdt:P571&lt;/a&gt; means &amp;ldquo;inception.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;what-paintings-in-what-years&#34;&gt;What paintings in what years?&lt;/h2&gt;
&lt;p&gt;Next, I used &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%3Fpainting%20wdt%3AP31%20wd%3AQ3305213%20%3B%20%23%20it%27s%20a%20painting%0A%20%20wdt%3AP170%09%20wd%3AQ5593%20%3B%20%20%20%20%20%20%20%20%20%20%20%23%20by%20Picasso%0A%20%20rdfs%3Alabel%20%3Ftitle%20%3B%0A%20%20wdt%3AP571%20%3FinceptionDate%20.%0A%20%20FILTER%20%28lang%28%3Ftitle%29%20%3D%20%22en%22%29%0A%7D%20%0A&#34;&gt;this query&lt;/a&gt; to list all the paintings by Picasso and the dates they were created:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
  ?painting wdt:P31 wd:Q3305213 ; # it&amp;#39;s a painting
  wdt:P170 wd:Q5593 ;             # by Picasso
  rdfs:label ?title ;
  wdt:P571 ?inceptionDate .
  FILTER (lang(?title) = &amp;#34;en&amp;#34;)
} 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This listed them, but the Wikidata endpoint interface was displaying dates like 1913-01-01 as &amp;ldquo;1 January 1913&amp;rdquo; (with a suspiciously large number of them having that &amp;ldquo;1 January&amp;rdquo;, so that may be a default when the month and day were unavailable). I wanted just the year, since I was going to count total paintings per year. I eventually realized that the date values were in &lt;a href=&#34;https://en.wikipedia.org/wiki/ISO_8601&#34;&gt;ISO 8601&lt;/a&gt; format, so I tried pulling out the year values with &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%3Fpainting%20wdt%3AP31%20wd%3AQ3305213%20%3B%20%23%20it%27s%20a%20painting%0A%20%20wdt%3AP170%09%20wd%3AQ5593%20%3B%20%20%20%20%20%20%20%20%20%20%20%23%20by%20Picasso%0A%20%20rdfs%3Alabel%20%3Ftitle%20%3B%0A%20%20wdt%3AP571%20%3FinceptionDate%20.%0A%20%20BIND%28substr%28%3FinceptionDate%2C1%2C4%29%20AS%20%3Fyear%29%0A%20%20FILTER%20%28lang%28%3Ftitle%29%20%3D%20%22en%22%29%0A%7D%20%0A&#34;&gt;this query&lt;/a&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
  ?painting wdt:P31 wd:Q3305213 ; # it&amp;#39;s a painting
  wdt:P170 wd:Q5593 ;             # by Picasso
  rdfs:label ?title ;
  wdt:P571 ?inceptionDate .
  BIND(substr(?inceptionDate,1,4) AS ?year)
  FILTER (lang(?title) = &amp;#34;en&amp;#34;)
} 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The dates still looked inconsistent, so I stored that query in the file &lt;code&gt;pquery1.rq&lt;/code&gt; and used  &lt;a href=&#34;https://en.wikipedia.org/wiki/CURL&#34;&gt;curl&lt;/a&gt; to run the query from my shell command line so that I could see the raw result:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl --data-urlencode &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;query@pquery1.rq&amp;#34;&lt;/span&gt; https://query.wikidata.org/sparql
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That showed me that the dates weren&amp;rsquo;t just arranged in ISO 8601 format—they were actually typed as ISO dates, so I revised the query above to convert those to regular strings before pulling out the year value with &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20%3Fpainting%20wdt%3AP31%20wd%3AQ3305213%20%3B%20%23%20it%27s%20a%20painting%0A%20%20wdt%3AP170%09%20wd%3AQ5593%20%3B%20%20%20%20%20%20%20%20%20%20%20%23%20by%20Picasso%0A%20%20rdfs%3Alabel%20%3Ftitle%20%3B%0A%20%20wdt%3AP571%20%3FinceptionDate%20.%0A%20%20BIND%28substr%28str%28%3FinceptionDate%29%2C1%2C4%29%20AS%20%3Fyear%29%0A%20%20FILTER%20%28lang%28%3Ftitle%29%20%3D%20%22en%22%29%0A%7D%20%0A&#34;&gt;this query&lt;/a&gt;, and the &lt;code&gt;?year&lt;/code&gt; values came as the four-digit numbers I wanted to see:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT * WHERE {
  ?painting wdt:P31 wd:Q3305213 ; # it&amp;#39;s a painting
  wdt:P170 wd:Q5593 ;             # by Picasso
  rdfs:label ?title ;
  wdt:P571 ?inceptionDate .
  # added str() call to following
  BIND(substr(str(?inceptionDate),1,4) AS ?year)
  FILTER (lang(?title) = &amp;#34;en&amp;#34;)
} 
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;how-many-picasso-paintings-per-year&#34;&gt;How many Picasso paintings per year?&lt;/h2&gt;
&lt;p&gt;I wasn&amp;rsquo;t really interested in the painting titles or their month and day of inception. I had everything I needed to answer my original question: how many paintings did Picasso do each year?&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?year (COUNT(?painting) AS ?paintingsInYear) WHERE {
  ?painting wdt:P31 wd:Q3305213 ; # it&amp;#39;s a painting
  wdt:P170 wd:Q5593 ;             # by Picasso
  wdt:P571 ?inceptionDate .
  BIND(substr(str(?inceptionDate),1,4) AS ?year)
} 
GROUP BY ?year
ORDER BY DESC(?paintingsInYear)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here are the first few rows of the results:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;year    paintingsInYear
1901	52
1906	33
1908	31
1909	30
1905	25
1914	24
1903	23
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;So there&amp;rsquo;s the answer: we know of more Picasso paintings from 1901 than we know of Vermeer paintings from his whole life, and in 1906 Picasso came close to the Vermeer total. The first decade of the twentieth century was a very busy time for Picasso. (I then found a website showing his paintings by year; the &lt;a href=&#34;https://www.pablo-ruiz-picasso.net/year-1901.php&#34;&gt;1901&lt;/a&gt; page is interesting.)&lt;/p&gt;
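&lt;p&gt;The query&amp;rsquo;s extract-the-year-and-count pattern is the same thing you&amp;rsquo;d do with ordinary text tools. Here&amp;rsquo;s a sketch on a few made-up ISO 8601 inception dates (the real numbers, of course, come from the Wikidata query above):&lt;/p&gt;

```shell
# take the first four characters of each date (the year), then count
# occurrences per year, most frequent first -- the shell analogue of the
# BIND/GROUP BY/ORDER BY combination in the SPARQL query
printf '%s\n' 1901-01-01 1901-06-15 1906-01-01 |
  cut -c1-4 | sort | uniq -c | sort -rn
```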
&lt;p&gt;The eye icon dropdown &amp;ldquo;Display result as&amp;rdquo;  menu on the left side of the Wikidata Query Service page offers other ways to visualize the data. I changed the &lt;code&gt;ORDER BY&lt;/code&gt; line in the last query to sort by the &lt;code&gt;?year&lt;/code&gt; value, &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3Fyear%20%28COUNT%28%3Fpainting%29%20AS%20%3FpaintingsInYear%29%20WHERE%20%7B%0A%20%20%3Fpainting%20wdt%3AP31%20wd%3AQ3305213%20%3B%20%23%20it%27s%20a%20painting%0A%20%20wdt%3AP170%09%20wd%3AQ5593%20%3B%20%20%20%20%20%20%20%20%20%20%20%23%20by%20Picasso%0A%20%20wdt%3AP571%20%3FinceptionDate%20.%0A%20%20BIND%28substr%28str%28%3FinceptionDate%29%2C1%2C4%29%20AS%20%3Fyear%29%0A%7D%20%0AGROUP%20BY%20%3Fyear%0AORDER%20BY%20%3Fyear%0A&#34;&gt;ran the query&lt;/a&gt;, and then picked &amp;ldquo;line chart&amp;rdquo; from the dropdown and got this graph of the number of Picasso&amp;rsquo;s paintings per year:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/picassoPaintingsPerYear.png&#34; alt=&#34;Picasso paintings per year&#34; border=&#34;0&#34; /&gt;
&lt;p&gt;This makes it even clearer how busy he was in the first decade of that century.&lt;/p&gt;
&lt;p&gt;There are other display types, and of course, many other painters. There is a lot more fun to be had here!&lt;/p&gt;
&lt;p&gt;The most difficult part of creating such a query is the cryptic nature of the entity and property IDs: a single letter followed by a few digits. If the resources and properties used more readable names such as &amp;ldquo;Guernica (painting)&amp;rdquo; and &amp;ldquo;creator&amp;rdquo; instead, queries would be more intuitive and easier to write—for those of us who speak English. But Wikidata is designed to be usable by everyone in the world, not just English speakers, and that&amp;rsquo;s a good thing. I won&amp;rsquo;t complain.&lt;/p&gt;
&lt;p&gt;One more note: I included a &lt;a href=&#34;https://www.bobdc.com/tags/digital-humanities/&#34;&gt;digital-humanities&lt;/a&gt; tag with this post because it&amp;rsquo;s about using technology to answer an art history question. The field is often about accumulating data from different sources so that people can identify new patterns, and as Wikidata accumulates more and more data, there are more and more great things we can do with this wonderful source.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1574071859666190337&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
      <category domain="https://www.bobdc.com//categories/digital-humanities">digital-humanities</category>
      
    </item>
    
    <item>
      <title>Learn RDF in Y minutes</title>
      <link>https://www.bobdc.com/blog/learnrdfinyminutes/</link>
      <pubDate>Sun, 28 Aug 2022 17:10:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/learnrdfinyminutes/</guid>
      
      
      <description><div>Where X = RDF</div><div>&lt;p&gt;I have always loved the website &lt;a href=&#34;https://learnxinyminutes.com/&#34;&gt;Learn X in Y minutes&lt;/a&gt;, which provides short crash courses in several dozen programming languages plus additional topics such as &lt;a href=&#34;https://learnxinyminutes.com/docs/set-theory/&#34;&gt;set theory&lt;/a&gt; and &lt;a href=&#34;https://learnxinyminutes.com/docs/git/&#34;&gt;git&lt;/a&gt;. Its home page tells us &amp;ldquo;Take a whirlwind tour of your next favorite language&amp;rdquo;; I&amp;rsquo;ll bet it&amp;rsquo;s especially popular with applicants on their way to job interviews where languages that are new to them are in the job description.&lt;/p&gt;
&lt;p&gt;I have been planning to add a SPARQL page, but I still haven&amp;rsquo;t. Four years ago they didn&amp;rsquo;t even have an SQL page, so as groundwork for a future SPARQL page I converted the SQL quick reference from an &lt;a href=&#34;https://www.bobdc.com/blog/my-sql-quick-reference/&#34;&gt;old blog entry&lt;/a&gt; of mine into a &lt;a href=&#34;https://learnxinyminutes.com/docs/sql/&#34;&gt;Learn SQL in Y minutes&lt;/a&gt; page for them. That has since been translated into Spanish, Italian, Russian, Turkish, and Chinese.&lt;/p&gt;
&lt;p&gt;More groundwork: I have just created a &lt;a href=&#34;https://learnxinyminutes.com/docs/rdf/&#34;&gt;Learn RDF in Y minutes&lt;/a&gt; page that shows some Turtle syntax and a few basics of RDFS. The &amp;ldquo;Further Reading&amp;rdquo; section at the end points to my &lt;a href=&#34;https://www.bobdc.com/blog/whatisrdf/&#34;&gt;What is RDF?&lt;/a&gt; and &lt;a href=&#34;https://www.bobdc.com/blog/whatisrdfs/&#34;&gt;What is RDFS?&lt;/a&gt; blog entries, which are more detailed introductions, but I hope that this taste of RDF&amp;rsquo;s value on the Learn X in Y minutes site helps to spread the word of RDF&amp;rsquo;s potential value to a broader audience.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1564002259037573120&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>SPARQL and Instacart&#39;s Knowledge Graph</title>
      <link>https://www.bobdc.com/blog/instacartsparql/</link>
      <pubDate>Sun, 31 Jul 2022 13:05:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/instacartsparql/</guid>
      
      
      <description><div>Managing data quality.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/sparqlAndInstacartLogos.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34;  alt=&#34;SPARQL and Instacart logos&#34;  /&gt;
&lt;p&gt;Two recent articles describe a fascinating use of SPARQL to improve data quality in a knowledge graph at the successful grocery delivery service &lt;a href=&#34;https://www.instacart.com/&#34;&gt;Instacart&lt;/a&gt;. &lt;a href=&#34;https://www2022.thewebconf.org/PaperFiles/28.pdf&#34;&gt;On Reliability Scores for Knowledge Graphs&lt;/a&gt; (pdf) is a short paper submitted to the &lt;a href=&#34;https://www2022.thewebconf.org/&#34;&gt;2022 ACM Web Conference&lt;/a&gt; in Lyon, and a longer piece on Instacart&amp;rsquo;s &lt;a href=&#34;https://tech.instacart.com/&#34;&gt;tech blog&lt;/a&gt; is titled &lt;a href=&#34;https://tech.instacart.com/red-means-stop-green-means-go-a-look-into-quality-assessment-in-instacarts-knowledge-graph-9ceeb3f1be24&#34;&gt;Red Means Stop. Green Means Go: A Look into Quality Assessment in Instacart’s Knowledge Graph&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The abstract from the Web Conference paper gives an overview of the goal:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Instacart KG is a central data store which contains facts regarding grocery products, ranging from taxonomic classifications to product nutritional information. With a view towards providing reliable and complete information for downstream applications, we propose an automated system for providing these facts with a score based on their reliability. This system passes data through a series of contextualized unit tests; the outcome of these tests are aggregated in order to provide a fact with a discrete score: reliable, questionable, or unreliable. These unit tests are written with explainability, scalability, and correctability in mind.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They &amp;ldquo;propose an automated system&amp;rdquo; that the tech blog piece shows is successfully in production. To quote more from the Web Conference  paper&amp;rsquo;s introduction:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Instacart KG contains information regarding products, recipes, and various product attributes, together with millions of contextual facts regarding these entities&amp;hellip; Due to their large scale it is infeasible to curate such graphs by hand. Because of this, automated quality control mechanisms are important to ensure KGs contain valid information. Often KGs are created through a series of automated ETL processes which analyze both structured and unstructured data from a variety of sources to generate facts for the graph. This automation, combined with questionable source data, can cause KGs to acquire noise in the form of incorrect statements during their build processes. This noise can present itself in a variety of ways: incorrect product attributes can lead to negative storefront interactions, and noisy training sets can lead to less precise machine learning models. This has led to much work regarding quality assessment, error detection, and error correction in knowledge graphs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As the tech blog put it, this system &amp;ldquo;helps us preemptively discover and flag flaws in our data which can then be corrected at the source [and] acts as a basic guardrail which prevents noisy and unreliable data from being published and corrupting downstream processes&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;They store their knowledge graph data as RDF triples in AWS Neptune. They evaluate and record the quality of facts with the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use SPARQL to retrieve a set of data such as nutritional information.&lt;/li&gt;
&lt;li&gt;Run the retrieved data through a series of Python unit tests designed for that dataset and log the results.&lt;/li&gt;
&lt;li&gt;Tag facts as being either reliable, questionable, or unreliable.&lt;/li&gt;
&lt;li&gt;Use SPARQL Update to record the results in the named graphs ReliableKG, QuestionableKG, and UnreliableKG.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With this system in place, downstream applications within the company can use more reliable data or just more data as appropriate for their needs. According to the Web Conference paper, &amp;ldquo;It is trivially easy to restrict a KG query to only select data which is at or above a certain reliability score&amp;rdquo;.&lt;/p&gt;
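&lt;p&gt;Neither piece includes the scoring code itself, but the aggregation they describe is straightforward to sketch. The Python below is my own guess at the shape of such a scorer; the test names and the one-failure threshold are invented, not Instacart&amp;rsquo;s actual rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def score_fact(test_results):
    # test_results maps a unit-test name to True (passed) or False (failed)
    failures = sum(1 for passed in test_results.values() if not passed)
    if failures == 0:
        return &#34;reliable&#34;        # step 4 would record this in ReliableKG
    if failures == 1:
        return &#34;questionable&#34;    # ...in QuestionableKG
    return &#34;unreliable&#34;          # ...in UnreliableKG

# Hypothetical step 2 results for one nutrition fact
print(score_fact({&#34;protein_within_range&#34;: True,
                  &#34;sugar_carb_ratio_plausible&#34;: False}))   # questionable
&lt;/code&gt;&lt;/pre&gt;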
&lt;p&gt;The tests in step 2 might flag something that is marked as Vegan but not Vegetarian so that someone can check whether it really is Vegan and set its Vegetarian value to True if so. The Web Conference paper includes other examples of how different classes of tests, such as identification of outliers (for example, a dessert with an abnormally large amount of protein per serving, or items with an unreasonable sugar-carbohydrate ratio), led to better data quality. Because of this paper&amp;rsquo;s academic orientation, it also includes an &amp;ldquo;Impact Analysis&amp;rdquo; section about how they quantified the improvements to data quality, as well as references to previous academic work on data quality.&lt;/p&gt;
&lt;p&gt;According to the tech blog, another benefit of their pipeline is that the system can pass the logs &amp;ldquo;to upstream data providers to make it easier to find and correct data inaccuracies at the source&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Another part of their knowledge graph that provides metadata about the products is a taxonomy that, according to the article&amp;rsquo;s author Thomas Grubb, is represented in RDF as &lt;code&gt;rdfs:Class&lt;/code&gt; instances with &lt;code&gt;rdfs:subClassOf&lt;/code&gt; relationships. This taxonomy drives some of the rules used to identify data problems. It also provides input to machine learning steps that help to identify new relationships about items; this process uses word embeddings (which I described in &lt;a href=&#34;../docembeddings&#34;&gt;Document analysis with machine learning&lt;/a&gt;) and the k-Nearest Neighbors algorithm to identify taxonomy classifications based on product names.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s great to see SPARQL play such an important role in a powerful, useful system that takes advantage of several other interesting technologies. I especially like seeing their use of SPARQL Update—between the &amp;ldquo;QL&amp;rdquo; in &amp;ldquo;SPARQL&amp;rdquo; and the way that Wikidata is driving much of SPARQL&amp;rsquo;s current popularity, many people don&amp;rsquo;t realize that SPARQL is not a read-only technology. I also loved seeing a well-known brand name use and publicize SPARQL&amp;rsquo;s power, as you can see in this tweet:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#39;https://twitter.com/strafstrudel/status/1525144465299718144&#39;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/instacartSPARQLTweet.png&#34; class=&#34;centered&#34; alt=&#34;Instacart SPARQL Tweet&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1553763479605186563&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/knowledge-graphs">knowledge-graphs</category>
      
    </item>
    
    <item>
      <title>Generating websites with SPARQL and Snowman, part 2</title>
      <link>https://www.bobdc.com/blog/snowmanartbasept2/</link>
      <pubDate>Sun, 19 Jun 2022 13:05:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/snowmanartbasept2/</guid>
      
      
      <description><div>With Rhizome&#39;s excellent ArtBase SPARQL endpoint. </div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/chatonskyoneplusoneplus.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;300&#34; alt=&#34;1+1+1+1+1+1+1+1+1+1+1+1 by Grégory Chatonsky&#34;  /&gt;
&lt;p&gt;In &lt;a href=&#34;../snowmanartbasept1&#34;&gt;part one&lt;/a&gt; of this two-part series, we saw how the open source &lt;a href=&#34;https://github.com/glaciers-in-archives/snowman/releases/tag/0.1.0&#34;&gt;Snowman&lt;/a&gt; static web site generator can generate websites with data from a SPARQL endpoint. I showed how I created a sample website project with its &lt;code&gt;snowman new&lt;/code&gt; command and then reconfigured the project to retrieve a list of artists from the  &lt;a href=&#34;https://artbase.rhizome.org/wiki/Main_Page&#34;&gt;Rhizome&lt;/a&gt; ArtBase endpoint, a repository of data about digital artworks since 1999. Here in part two I will build on that to add lists of artists&amp;rsquo; works with links to Rhizome pages about them.&lt;/p&gt;
&lt;h1 id=&#34;add-lists-of-artists-works-with-links-to-more-information&#34;&gt;Add lists of artists&amp;rsquo; works with links to more information&lt;/h1&gt;
&lt;p&gt;To add these lists of the artists&amp;rsquo; works under their names, I started by removing the last two lines of the project&amp;rsquo;s &lt;code&gt;views.yaml&lt;/code&gt; file (which was generated by the original &lt;code&gt;snowman new&lt;/code&gt; command) and the &lt;code&gt;templates/static.html&lt;/code&gt; file that the  last line pointed to because I didn&amp;rsquo;t need that additional view:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;views:
  - output: &amp;quot;index.html&amp;quot;
    query: &amp;quot;index.rq&amp;quot;
    template: &amp;quot;index.html&amp;quot;
  - output: &amp;quot;static/index.html&amp;quot;
    template: &amp;quot;static.html&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Snowman &lt;a href=&#34;https://github.com/glaciers-in-archives/snowman#readme&#34;&gt;github readme&lt;/a&gt; file tells you more about views.&lt;/p&gt;
&lt;p&gt;A lot of incremental development in Snowman consists of adding to a query such as the &lt;code&gt;queries/index.rq&lt;/code&gt; one that I started editing in part one and then editing the corresponding display template to take advantage of the new parts of the query. I gradually worked the query in &lt;code&gt;queries/index.rq&lt;/code&gt; up to the following. It asks for artist names and their works that have &amp;ldquo;Flash&amp;rdquo; in their list of tags:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rt: &amp;lt;https://artbase.rhizome.org/prop/direct/&amp;gt;
SELECT DISTINCT ?artistName ?artist ?searchTag WHERE {
   BIND(&amp;quot;Flash&amp;quot; AS ?searchTag)
   ?artwork rt:P29 ?artist . 
   ?artist rdfs:label ?artistName .
   ?artwork rt:P48 ?artbaseLegacyTags .
   # Compare lower-case versions of both to make it case-insensitive
   FILTER CONTAINS(LCASE(?artbaseLegacyTags),LCASE(?searchTag))
}
ORDER BY (?artistName)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A few notes about this query:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An artwork&amp;rsquo;s &lt;code&gt;rt:P48&lt;/code&gt; value is a comma-delimited list of tags that have been assigned to it. (Some artworks in the dataset do not have tags assigned, so because this triple pattern is not optional, this query would not retrieve any of those.)&lt;/li&gt;
&lt;li&gt;I learned about which properties (such as &lt;code&gt;rt:P48&lt;/code&gt;) do what mostly through exploratory queries and guesswork. Visiting URLs like &lt;a href=&#34;https://artbase.rhizome.org/wiki/Property:P48&#34;&gt;https://artbase.rhizome.org/wiki/Property:P48&lt;/a&gt; would then show me how good my guesses were.&lt;/li&gt;
&lt;li&gt;I could have just put &lt;code&gt;&amp;quot;Flash&amp;quot;&lt;/code&gt; as the second parameter to &lt;code&gt;CONTAINS()&lt;/code&gt; in the &lt;code&gt;FILTER&lt;/code&gt; line instead of storing it in a &lt;code&gt;?searchTag&lt;/code&gt; variable and referencing that. Storing it in a variable at the top of the query made it easier to change to other values to look for other kinds of works, as we&amp;rsquo;ll see below.&lt;/li&gt;
&lt;/ul&gt;
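&lt;p&gt;The case-insensitive matching in the &lt;code&gt;FILTER&lt;/code&gt; line is the same trick you would use in any language: lower-case both sides before comparing. A Python equivalent, with invented tag strings, looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def has_tag(tag_list, search_tag):
    # Compare lower-case versions of both to make it case-insensitive,
    # like FILTER CONTAINS(LCASE(?artbaseLegacyTags), LCASE(?searchTag))
    return search_tag.lower() in tag_list.lower()

print(has_tag(&#34;Flash,net.art,animation&#34;, &#34;flash&#34;))   # True
print(has_tag(&#34;video,sound&#34;, &#34;Flash&#34;))               # False
&lt;/code&gt;&lt;/pre&gt;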
&lt;p&gt;Following up on the query revision above, the new version of the &lt;code&gt;template/index.html&lt;/code&gt; display template shown below has three new things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A slightly different title.&lt;/li&gt;
&lt;li&gt;The artist name in an &lt;code&gt;h2&lt;/code&gt; subhead element with the search value (for example, &amp;ldquo;Flash&amp;rdquo;) appended.&lt;/li&gt;
&lt;li&gt;A Snowman &lt;code&gt;include&lt;/code&gt; function to insert more content. It names an HTML template to format the inserted content, a query to generate values for the new template, and a parameter to pass to the query: the &lt;code&gt;?artist&lt;/code&gt; value (a URL) retrieved by the main query above.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;{{ template &amp;#34;base&amp;#34; . }}
{{ define &amp;#34;title&amp;#34; }}Rhizome Artbase Artists and Works {{ end }}

{{ define &amp;#34;content&amp;#34; }}
&amp;lt;h1&amp;gt;Rhizome Artbase Artists and Works&amp;lt;/h1&amp;gt;
&amp;lt;ul&amp;gt;
    {{ range . }}
&amp;lt;h2&amp;gt;Artist: {{ .artistName }} ({{ .searchTag }} and other works)&amp;lt;/h2&amp;gt;
{{ include &amp;#34;artistsWorks.html&amp;#34; (query &amp;#34;artistsWorks.rq&amp;#34; .artist.String) }}
{{ end }}
&amp;lt;/ul&amp;gt;
{{ end }}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Below is the &lt;code&gt;queries/artistsWorks.rq&lt;/code&gt; query referenced by the &lt;code&gt;include&lt;/code&gt; function above. The &lt;code&gt;artist&lt;/code&gt; value passed to it by the template above is plugged in using the &lt;code&gt;&amp;lt;{{.}}&amp;gt;&lt;/code&gt; construct, which I believe is Go template syntax. (I tried to learn more about it, but it&amp;rsquo;s difficult to do web searches for strings like that. You can see another demonstration of it in Snowman&amp;rsquo;s &lt;code&gt;inline-queries&lt;/code&gt; example project.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rt: &amp;lt;https://artbase.rhizome.org/prop/direct/&amp;gt;
SELECT DISTINCT ?workTitle ?creationDate ?artworkPage ?artbaseLegacyTags WHERE {
  # r:Q676 is the artist Andy Cox if I need to sub it in next line for testing
  ?artwork rt:P29  &amp;lt;{{.}}&amp;gt; ;    # artwork by artist
           rdfs:label ?workTitle;
           rt:P26 ?creationDateTime .
  OPTIONAL { ?artwork rt:P48 ?artbaseLegacyTags . }
  # Don&#39;t need full ISO date value; just yyyy-mm-dd
  BIND(SUBSTR(str(?creationDateTime),1,10) AS ?creationDate)
  ?artworkPage schema:about ?artwork;
               schema:isPartOf &amp;lt;https://artbase.rhizome.org/&amp;gt;.
}
ORDER BY (?workTitle)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I left the qname for one of the artists in a comment near the top of the query because I sometimes replaced the &lt;code&gt;&amp;lt;{{.}}&amp;gt;&lt;/code&gt; with that qname when working out other parts of the query logic.&lt;/p&gt;
&lt;p&gt;Remember that the &lt;code&gt;include&lt;/code&gt; function mentioned both this new &lt;code&gt;queries/artistsWorks.rq&lt;/code&gt; file and the template file that goes with it to format the result: &lt;code&gt;template/artistsWorks.html&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;table&amp;gt;
  &amp;lt;tr&amp;gt;&amp;lt;th width=&amp;quot;200&amp;quot;&amp;gt;title&amp;lt;/th&amp;gt;&amp;lt;th width=&amp;quot;100&amp;quot;&amp;gt;creation date&amp;lt;/th&amp;gt;&amp;lt;th&amp;gt;tags&amp;lt;/th&amp;gt;&amp;lt;/tr&amp;gt;
    {{ range . }}
    &amp;lt;tr&amp;gt;
      &amp;lt;td&amp;gt;&amp;lt;a href=&#39;{{ .artworkPage }}&#39;&amp;gt;{{ .workTitle}}&amp;lt;/a&amp;gt;&amp;lt;/td&amp;gt;
      &amp;lt;td&amp;gt;{{ .creationDate}}&amp;lt;/td&amp;gt;
      &amp;lt;td&amp;gt;{{ .artbaseLegacyTags }}&amp;lt;/td&amp;gt;
    &amp;lt;/tr&amp;gt;
        {{ end }}
&amp;lt;/table&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a table for each artist&amp;rsquo;s works with a row for each one. The first cell of each row uses the URL stored in the &lt;code&gt;?artworkPage&lt;/code&gt; value retrieved by the &lt;code&gt;artistsWorks.rq&lt;/code&gt; query to create a link to that page—for example, to &lt;a href=&#34;https://artbase.rhizome.org/wiki/Q3050&#34;&gt;this page&lt;/a&gt; for one of the retrieved works.&lt;/p&gt;
&lt;p&gt;Once the additions and modifications have been made to the files described so far, a &lt;code&gt;snowman build&lt;/code&gt; creates a new &lt;code&gt;site/index.html&lt;/code&gt; file with the work lists under each artist&amp;rsquo;s name.&lt;/p&gt;
&lt;h1 id=&#34;looking-more-stylish&#34;&gt;Looking more stylish&lt;/h1&gt;
&lt;p&gt;I added some simple CSS, but first, in the &lt;code&gt;templates/layouts/default.html&lt;/code&gt; file in the project, I took the / out of the following line so that the generated &lt;code&gt;index.html&lt;/code&gt; file would look for &lt;code&gt;style.css&lt;/code&gt; in the same directory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;stylesheet&amp;quot; href=&amp;quot;/style.css&amp;quot;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There was already a &lt;code&gt;style.css&lt;/code&gt; file in the project&amp;rsquo;s &lt;code&gt;static&lt;/code&gt; directory. I replaced its contents with the following minimal CSS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;* { font-family: arial,helvetica; font-size:12pt; }

body {
    margin: .25in .5in .25in .5in; /* t,r,b,l */
    font-family: arial,helvetica; 
}

th {
    text-align: left;
    background: lightgray;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another &lt;code&gt;snowman build&lt;/code&gt; then created a version of the page that looks like the one that I previewed in part one of this series:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/snowmanArtbasePreview.png&#34; class=&#34;centered&#34;  alt=&#34;Preview of Snowman ArtBase project&#34;/&gt;
&lt;h1 id=&#34;query-for-3d-works&#34;&gt;Query for 3D works&lt;/h1&gt;
&lt;p&gt;I mentioned how I stored the string &amp;ldquo;Flash&amp;rdquo; in the &lt;code&gt;?searchTag&lt;/code&gt; variable of the &lt;code&gt;queries/index.rq&lt;/code&gt; query to make it easier to have this query search for artwork tagged with other values. After changing this variable&amp;rsquo;s value to &amp;ldquo;3D&amp;rdquo; and doing another build, the top of the &lt;code&gt;index.html&lt;/code&gt; file looked like this:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/snowmanArtbasePreview2.png&#34; class=&#34;centered&#34;  alt=&#34;Snowman ArtBase project listing 3D works&#34;/&gt;
&lt;p&gt;The image at the top of this blog entry is from &lt;a href=&#34;https://artbase.rhizome.org/wiki/Q3273&#34;&gt;1+1+1+1+1+1+1+1+1+1+1+1&lt;/a&gt; by &lt;a href=&#34;https://artbase.rhizome.org/wiki/Q1128&#34;&gt;Grégory Chatonsky&lt;/a&gt;, one of the works tagged as 3D.&lt;/p&gt;
&lt;p&gt;The search for the keyword is just a simple substring search of the CSV list. If a work had been tagged with &amp;ldquo;&lt;a href=&#34;https://www.youtube.com/watch?v=jPNVOxZ7Ius#t=2m50s&#34;&gt;3DogNight&lt;/a&gt;&amp;rdquo;, that also would have been retrieved in the search for &amp;ldquo;3D&amp;rdquo;. For a more serious search of tags, I would make a copy of the keyword list that was not only all lower-case but also had spaces removed and began and ended with commas &amp;ldquo;,like,this,&amp;rdquo;. Then, a search of that for a version of &lt;code&gt;?searchTag&lt;/code&gt; enclosed by commas such as &amp;ldquo;,3d,&amp;rdquo; would be more accurate.&lt;/p&gt;
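&lt;p&gt;That comma-delimited matching is simple to implement. Here is a Python sketch of the normalization just described; the tag strings are my own examples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def has_exact_tag(tag_list, search_tag):
    # Lower-case, remove spaces, and wrap in commas &#34;,like,this,&#34;
    normalized = &#34;,&#34; + tag_list.lower().replace(&#34; &#34;, &#34;&#34;) + &#34;,&#34;
    # A comma-enclosed search value such as &#34;,3d,&#34; then matches whole tags only
    return (&#34;,&#34; + search_tag.lower() + &#34;,&#34;) in normalized

print(has_exact_tag(&#34;Flash, 3D, net.art&#34;, &#34;3d&#34;))   # True
print(has_exact_tag(&#34;3DogNight, video&#34;, &#34;3d&#34;))     # False
&lt;/code&gt;&lt;/pre&gt;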
&lt;h1 id=&#34;possible-next-steps&#34;&gt;Possible next steps&lt;/h1&gt;
&lt;p&gt;I zipped up my &lt;a href=&#34;http://www.bobdc.com/miscfiles/artbase.zip&#34;&gt;artbase snowman project&lt;/a&gt; and made it available so that you can unzip it in your own Snowman &lt;code&gt;examples&lt;/code&gt; directory and try it out. The &amp;ldquo;Getting started with Snowman&amp;rdquo; section of the &lt;a href=&#34;https://github.com/glaciers-in-archives/snowman/releases/tag/0.1.0&#34;&gt;Snowman&lt;/a&gt; home page has brief descriptions of other projects included in that &lt;code&gt;examples&lt;/code&gt; directory as part of the Snowman distribution. Each of those demonstrates other features that you can incorporate into your own Snowman website projects. One included example that is not listed there is &lt;code&gt;nested-lists-with-single-query&lt;/code&gt;, which shows a way to &amp;ldquo;render nested lists without the need of multiple queries or views&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;You can apply these features to your own copy of my &lt;code&gt;artbase&lt;/code&gt; project, or you can apply them to your own new Snowman projects that you create with &lt;code&gt;snowman new&lt;/code&gt;. Let me know how it turns out!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1538571776636198915&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Generating websites with SPARQL and Snowman, part 1</title>
      <link>https://www.bobdc.com/blog/snowmanartbasept1/</link>
      <pubDate>Sun, 22 May 2022 11:20:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/snowmanartbasept1/</guid>
      
      
      <description><div>With Rhizome&#39;s excellent ArtBase SPARQL endpoint. </div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/colorFieldTelevision.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;400&#34; alt=&#34;Color Field Television by Andrew Venell&#34; title=&#34;Color Field Television by Andrew Venell&#34;  /&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/glaciers-in-archives/snowman/releases/tag/0.1.0&#34;&gt;Snowman&lt;/a&gt; is an open-source project that generates static web sites from data served up by SPARQL endpoints. The history of the web is full of sites generated from relational database back ends, so it&amp;rsquo;s nice to see this significant step toward doing it with RDF data.&lt;/p&gt;
&lt;p&gt;Snowman is written in the Go programming language. The Hugo tool that I use to &lt;a href=&#34;../changing-my-blogs-domain-name/&#34;&gt;generate this website&lt;/a&gt; is also written using Go, and as with Hugo, no knowledge of Go is required to use Snowman. (If you do learn some Go, it&amp;rsquo;s &lt;a href=&#34;../rdf2modsxml/&#34;&gt;pretty cool&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;I built a website around Rhizome&amp;rsquo;s ArtBase project to get to know Snowman better. &lt;a href=&#34;https://artbase.rhizome.org/wiki/Main_Page&#34;&gt;Rhizome&lt;/a&gt;, as their home page describes, &amp;ldquo;is an archive of born-digital artworks from 1999 to the present day&amp;rdquo; affiliated with &lt;a href=&#34;https://www.newmuseum.org/&#34;&gt;The New Museum&lt;/a&gt; in New York City. When you think of museum art preservation work, you usually think of preservationists dealing with fading paint colors and cracks in artwork; the Rhizome project is doing the difficult work of maintaining an infrastructure to present older computer-based art that often relies on obsolete technology such as &lt;a href=&#34;https://www.adobe.com/products/flashplayer/end-of-life.html&#34;&gt;Flash&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And, Rhizome makes the data about their collection available as a SPARQL endpoint! &lt;a href=&#34;https://artbase.rhizome.org/wiki/Query&#34;&gt;It has good documentation&lt;/a&gt; that links to their &lt;a href=&#34;https://query.artbase.rhizome.org/&#34;&gt;endpoint&amp;rsquo;s HTML interface&lt;/a&gt; in addition to describing it. It does not mention the actual endpoint URL, which is &lt;code&gt;https://query.artbase.rhizome.org/proxy/wdqs/bigdata/namespace/wdq/sparql&lt;/code&gt;, but the HTML interface does something related that is very handy: after you run a query on the HTML front end, the &amp;ldquo;&amp;lt;/&amp;gt; Code&amp;rdquo; link in the upper-right of the results displays an escaped version of the query with the actual endpoint URL. You can pass this to &lt;a href=&#34;./curling-sparql&#34;&gt;curl&lt;/a&gt; or other tools to build applications around this data.&lt;/p&gt;
&lt;p&gt;In Snowman, a given website is built around a specified endpoint, and the set of files used to create that website are known as a project. The &lt;a href=&#34;https://github.com/glaciers-in-archives/snowman#readme&#34;&gt;github readme file&lt;/a&gt; does a nice job of explaining all the pieces of a project and how they fit together. This includes a writeup of the &lt;code&gt;snowman new&lt;/code&gt; command, which generates a skeleton project that you can modify to use your own data and presentation. (The readme also describes the straightforward process for installing and building Snowman.) In this two-part series I will walk through the steps I used to create a web page listing artists and their works where the work had been tagged in the data with a particular keyword such as &amp;ldquo;Flash&amp;rdquo; or &amp;ldquo;3D&amp;rdquo;.&lt;/p&gt;
&lt;h1 id=&#34;create-load-and-view-a-sample-project&#34;&gt;Create, load, and view a sample project&lt;/h1&gt;
&lt;p&gt;The Snowman project includes an &lt;code&gt;examples&lt;/code&gt; directory with several projects that you can explore to learn more about Snowman&amp;rsquo;s features. The following command from within that directory created an &lt;code&gt;artbase&lt;/code&gt; project as a sibling of the other examples. (As you may have guessed from the &lt;code&gt;../&lt;/code&gt; part, after you build Snowman the &lt;code&gt;snowman&lt;/code&gt; binary is in the parent directory of &lt;code&gt;examples&lt;/code&gt;.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; ../snowman new --directory=&amp;quot;artbase&amp;quot;     # directory should not already exist
 Your project has been created in: artbase
 You can now run:
 cd artbase
 snowman build
 snowman server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &amp;ldquo;you can now run&amp;rdquo; commands that it suggests assume that &lt;code&gt;snowman&lt;/code&gt; is in your path. If not, point to it like I did in the &lt;code&gt;../&lt;/code&gt; call to it above.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;snowman server&lt;/code&gt; command suggested by the &lt;code&gt;snowman new&lt;/code&gt; output started up a server at http://127.0.0.1:8000/, where I could see the results of the &lt;code&gt;site/index.html&lt;/code&gt; file generated by &lt;code&gt;snowman build&lt;/code&gt;. Instead of running the server, you can just load the &lt;code&gt;site/index.html&lt;/code&gt; file created by &lt;code&gt;snowman build&lt;/code&gt; directly into your browser, which was what I did for most of my development. The advantage of the server is the ability to use features like AJAX requests and fancier JavaScript things that won&amp;rsquo;t work with files loaded using &lt;code&gt;file://&lt;/code&gt; URLs.&lt;/p&gt;
&lt;h1 id=&#34;point-the-project-to-the-artbase-endpoint-instead-of-the-default-one&#34;&gt;Point the project to the ArtBase endpoint instead of the default one&lt;/h1&gt;
&lt;p&gt;The default &lt;code&gt;site/index.html&lt;/code&gt; file created by &lt;code&gt;snowman build&lt;/code&gt; (not to be confused with the &lt;code&gt;site/static/index.html&lt;/code&gt; file that it also creates) tells us that the endpoint to query is specified in the &lt;code&gt;snowman.yaml&lt;/code&gt; file, so in that file I changed the endpoint URL from &lt;a href=&#34;https://query.wikidata.org/sparql&#34;&gt;https://query.wikidata.org/sparql&lt;/a&gt; to &lt;a href=&#34;https://query.artbase.rhizome.org/proxy/wdqs/bigdata/namespace/wdq/sparql&#34;&gt;https://query.artbase.rhizome.org/proxy/wdqs/bigdata/namespace/wdq/sparql&lt;/a&gt;. The default query created by &lt;code&gt;snowman new&lt;/code&gt; in &lt;code&gt;queries/index.rq&lt;/code&gt; just asks for any ten triples, so it should work with any endpoint. After revising the endpoint URL I did another &lt;code&gt;snowman build&lt;/code&gt;, reloaded the browser, and saw ten triples from the ArtBase project instead of from Wikidata. (Sometimes I saw the same triples that I saw with the Wikidata endpoint: triples defining the &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt; ontology that they both use. In the next step we will definitely see ArtBase triples in the result.)&lt;/p&gt;
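&lt;p&gt;That default query is the classic &amp;ldquo;show me anything&amp;rdquo; SPARQL query. The exact text that &lt;code&gt;snowman new&lt;/code&gt; puts in &lt;code&gt;queries/index.rq&lt;/code&gt; may differ slightly, but it amounts to something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object .
}
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;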
&lt;h1 id=&#34;query-for-artist-names-instead-of-random-triples&#34;&gt;Query for artist names instead of random triples&lt;/h1&gt;
&lt;p&gt;After developing the query below in the ArtBase endpoint&amp;rsquo;s  &lt;a href=&#34;https://query.artbase.rhizome.org/&#34;&gt;HTML interface&lt;/a&gt;, I changed the default query in &lt;code&gt;queries/index.rq&lt;/code&gt; to this query so that my Snowman project would ask the ArtBase endpoint for artists who had any artworks, in alphabetical order:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rt:   &amp;lt;https://artbase.rhizome.org/prop/direct/&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
SELECT DISTINCT ?artistName WHERE {
  ?artwork rt:P29 ?artist . 
  ?artist rdfs:label ?artistName .
}
ORDER BY (?artistName)
LIMIT 250
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last time I checked there were 1,268 artists in the dataset, so the &lt;code&gt;LIMIT&lt;/code&gt; line helped to speed the edit-reload cycle. A later version of this query will filter based on artwork tags as another way to limit the number of displayed artists.&lt;/p&gt;
&lt;h1 id=&#34;adjust-the-display-template-to-use-data-from-the-revised-query&#34;&gt;Adjust the display template to use data from the revised query&lt;/h1&gt;
&lt;p&gt;You can see above that the revised &lt;code&gt;queries/index.rq&lt;/code&gt; query binds values to an &lt;code&gt;?artistName&lt;/code&gt; variable, so I replaced the contents of the default &lt;code&gt;templates/index.html&lt;/code&gt; file (which had a lot of other markup in it to demo various Snowman features) with the following so that the &lt;code&gt;?artistName&lt;/code&gt; values would get inserted where I wanted them. The Go template &lt;code&gt;range&lt;/code&gt; keyword iterates through a list passed to it; in the following that will create a new &lt;code&gt;li&lt;/code&gt; element inside the &lt;code&gt;ul&lt;/code&gt; element for each &lt;code&gt;?artistName&lt;/code&gt; value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{{ template &amp;quot;base&amp;quot; . }}
{{ define &amp;quot;title&amp;quot; }}Rhizome Artbase Artists{{ end }}

{{ define &amp;quot;content&amp;quot; }}
&amp;lt;h1&amp;gt;Rhizome Artbase Artists&amp;lt;/h1&amp;gt;
&amp;lt;ul&amp;gt;
    {{ range . }}
&amp;lt;li&amp;gt;{{ .artistName }}&amp;lt;/li&amp;gt;
    {{ end }}
&amp;lt;/ul&amp;gt;
{{ end }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After another rebuild and reload, the browser showed a bulleted list of the first 250 artist names.&lt;/p&gt;
&lt;p&gt;In part two, we&amp;rsquo;ll see how to add a query that lists the work of artists for whom at least one artwork has a particular tag, such as Flash, and I&amp;rsquo;ll add CSS. Below is a screenshot of the eventual end result:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/snowmanArtbasePreview.png&#34; class=&#34;centered&#34;  alt=&#34;Preview of Snowman ArtBase project&#34;/&gt;
&lt;p&gt;Each work title on the left is a link to a Rhizome page about the work so that you can see it along with a description and other metadata. &lt;a href=&#34;https://artbase.rhizome.org/wiki/Q2564&#34;&gt;Kriegspiel&lt;/a&gt; is one example from the illustration above. The &lt;a href=&#34;https://artbase.rhizome.org/wiki/Q2513&#34;&gt;Color Field Television&lt;/a&gt; image by &lt;a href=&#34;https://artbase.rhizome.org/wiki/Q1122&#34;&gt;Andrew Venell&lt;/a&gt; at the top of this blog entry is another artwork that the generated report links to.&lt;/p&gt;
&lt;p&gt;The ability to do a hierarchical display of the returned result like this is a nice contribution to the world of RDF development, because SPARQL queries normally return either a flat table or triples. (You can still see some repetition in the screenshot because the dataset stores the tags as delimited lists and some works have more than one such list.) Watch this space to see what I did to the Snowman project files to get this result!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1528398147864735751&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Queries to explore a dataset</title>
      <link>https://www.bobdc.com/blog/exploringadataset/</link>
      <pubDate>Sat, 30 Apr 2022 08:09:06 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/exploringadataset/</guid>
      
      
      <description><div>Even a schemaless one. </div><div>&lt;!-- image from https://www.loc.gov/pictures/resource/cph.3a03809/ --&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/explorerMacMillan.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;300&#34;  /&gt;
&lt;p&gt;I recently worked on a project where we had a huge amount of RDF and no clue what was in there apart from what we saw by looking at random triples. I developed a few SPARQL queries to give us a better idea of the dataset&amp;rsquo;s content and structure, and these queries are generic enough that they could be useful to other people.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve written about other exploratory queries before. In &lt;a href=&#34;https://www.bobdc.com/blog/exploring-a-sparql-endpoint/&#34;&gt;Exploring a SPARQL Endpoint&lt;/a&gt; I wrote about queries that look for common vocabularies that might be in use at a particular endpoint, and how getting a few clues led me to additional related queries. That blog post also mentioned the “Exploring the Data” section of my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;, which has other generally useful queries.&lt;/p&gt;
&lt;p&gt;You can see those listed in the book&amp;rsquo;s &lt;a href=&#34;http://www.learningsparql.com/toc.html&#34;&gt;table of contents&lt;/a&gt;; they often assume that some sort of schema or ontology is in use. A great thing about SPARQL and RDF, though, is that with no knowledge of a schema or any other clues about a dataset&amp;rsquo;s contents, simple queries can still let you explore that dataset to see what&amp;rsquo;s there. Today&amp;rsquo;s exploratory queries were not included among those that I described above.&lt;/p&gt;
&lt;p&gt;Example output for each query uses the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/BeatlesMusicians.ttl&#34;&gt;Beatles Musicians&lt;/a&gt; dataset that I described at &lt;a href=&#34;https://www.bobdc.com/blog/sparql-queries-of-beatles-reco/&#34;&gt;SPARQL queries of Beatles recording sessions&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;how-many-triples-does-this-dataset-have-in-all&#34;&gt;How many triples does this dataset have in all?&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;SELECT (COUNT(*) AS ?tripleCount) WHERE {
   ?s ?p ?o
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Definitely a hall of fame, classic query. Here is the result for the Beatles musician data after performing the query with the Jena &lt;a href=&#34;https://jena.apache.org/documentation/tools/index.html&#34;&gt;arq&lt;/a&gt; command line query engine:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------
| tripleCount |
===============
| 4089        |
---------------
&lt;/code&gt;&lt;/pre&gt;
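&lt;p&gt;For reference, a result like the one above comes from an &lt;code&gt;arq&lt;/code&gt; invocation along these lines (the query file name here is my own choice):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;arq --data BeatlesMusicians.ttl --query tripleCount.rq
&lt;/code&gt;&lt;/pre&gt;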
&lt;h1 id=&#34;show-all-the-types-being-used&#34;&gt;Show all the types being used&lt;/h1&gt;
&lt;p&gt;Never mind whether any types were declared; how many types are used? List them, but don&amp;rsquo;t repeat any.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type WHERE {
   ?s a ?type
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result with the Beatles musician data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;----------------------------------------------------
| type                                             |
====================================================
| &amp;lt;http://learningsparql.com/ns/schema/Song&amp;gt;       |
| &amp;lt;http://learningsparql.com/ns/schema/Musician&amp;gt;   |
| &amp;lt;http://learningsparql.com/ns/schema/Instrument&amp;gt; |
----------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&#34;count-instances-per-type&#34;&gt;Count instances per type&lt;/h1&gt;
&lt;p&gt;Of the types that the previous query found being used, how many instances of each are there? This is useful when you are prioritizing what you&amp;rsquo;re going to do with the data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT  ?type (COUNT (?s) AS ?instanceCount) 
WHERE {
   ?s a ?type . 
}
GROUP BY  ?type
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;--------------------------------------------------------------------
| type                                             | instanceCount |
====================================================================
| &amp;lt;http://learningsparql.com/ns/schema/Instrument&amp;gt; | 180           |
| &amp;lt;http://learningsparql.com/ns/schema/Song&amp;gt;       | 293           |
| &amp;lt;http://learningsparql.com/ns/schema/Musician&amp;gt;   | 238           |
--------------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&#34;count-the-properties-that-each-type-uses&#34;&gt;Count the properties that each type uses&lt;/h1&gt;
&lt;p&gt;Of the types that were found above, how many different properties does each use?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type (COUNT(DISTINCT ?p) AS ?c)
WHERE {
   ?s a ?type . 
   ?s ?p ?o . 
}
GROUP BY ?type
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Number of properties used in the Beatles data, by type:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;----------------------------------------------------------
| type                                             | c   |
==========================================================
| &amp;lt;http://learningsparql.com/ns/schema/Instrument&amp;gt; | 2   |
| &amp;lt;http://learningsparql.com/ns/schema/Song&amp;gt;       | 182 |
| &amp;lt;http://learningsparql.com/ns/schema/Musician&amp;gt;   | 2   |
----------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next query will show us why the &lt;code&gt;Song&lt;/code&gt; class uses so many properties.&lt;/p&gt;
&lt;h1 id=&#34;list-properties-per-type&#34;&gt;List properties per type&lt;/h1&gt;
&lt;p&gt;What are these properties that each type uses? This is also useful for prioritization. Note the similarities with and differences from the previous query.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type ?property
WHERE {
   ?s a ?type .
   ?s ?property ?o .
}
ORDER BY ?type ?property
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following is an excerpt from the middle of this query&amp;rsquo;s result, with &lt;code&gt;&amp;lt;http://learningsparql.com/ns/schema/Song&amp;gt;&lt;/code&gt; reduced to &lt;code&gt;s:Song&lt;/code&gt; to make it all fit better here. This sample shows that all the different instruments, with all their different spellings, were properties of each song. (Read more about how that worked in my  &lt;a href=&#34;https://www.bobdc.com/blog/sparql-queries-of-beatles-reco/&#34;&gt;SPARQL queries of Beatles recording sessions&lt;/a&gt; blog post.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| s:Song | &amp;lt;http://learningsparql.com/ns/instrument/guiro&amp;gt;
| s:Song | &amp;lt;http://learningsparql.com/ns/instrument/guitar&amp;gt;
| s:Song | &amp;lt;http://learningsparql.com/ns/instrument/handbell&amp;gt;
| s:Song | &amp;lt;http://learningsparql.com/ns/instrument/handclaps&amp;gt;
| s:Song | &amp;lt;http://learningsparql.com/ns/instrument/harmonica&amp;gt;
| s:Song | &amp;lt;http://learningsparql.com/ns/instrument/harmonium&amp;gt;
| s:Song | &amp;lt;http://learningsparql.com/ns/instrument/harmonyvocals&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&#34;have-a-query-create-a-schema-for-this-schemaless-data&#34;&gt;Have a query create a schema for this schemaless data&lt;/h1&gt;
&lt;p&gt;Consider that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The dataset has no schema but we found types being used&lt;/li&gt;
&lt;li&gt;We found properties associated with these types&lt;/li&gt;
&lt;li&gt;Schemas are themselves datasets of triples&lt;/li&gt;
&lt;li&gt;SPARQL lets you create triples&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This all adds up to the ability to create a schema where there isn&amp;rsquo;t any. In fact, we can do it with a slight variation on the last query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; 

CONSTRUCT {
   ?type a rdfs:Class .
   ?property a rdf:Property .
}
WHERE {
  ?s a ?type .
  ?s ?property ?o .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note how the &lt;code&gt;WHERE&lt;/code&gt; clause of this query is identical to the one from the preceding &lt;code&gt;SELECT&lt;/code&gt; query. Here is an excerpt of what it created with the Beatles session data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s:Instrument  rdf:type  rdfs:Class .
s:Song  rdf:type  rdfs:Class .
s:Musician  rdf:type  rdfs:Class .
i:recorder  rdf:type  rdf:Property .
i:celesta  rdf:type  rdf:Property .
i:tabla  rdf:type  rdf:Property .
i:tenorsaxophone  rdf:type  rdf:Property .
rdfs:label  rdf:type  rdf:Property .
i:harmonica  rdf:type  rdf:Property .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We could go a little further by having the schema use the &lt;code&gt;rdfs:domain&lt;/code&gt; and &lt;code&gt;rdfs:range&lt;/code&gt; properties to associate the declared properties with the classes that the query found them with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;

CONSTRUCT {
  ?type a rdfs:Class .
  ?property a rdf:Property .
  ?property rdfs:domain ?type .
  ?property rdfs:range ?otype . 
}
WHERE {
  ?s a ?type  .
  ?s ?property ?o .
  OPTIONAL { ?o a ?otype }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Along with the schema triples you see above, this new version adds triples like these:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;i:banjo  rdf:type    rdf:Property ;
        rdfs:domain  s:Song ;
        rdfs:range   s:Musician .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It also gives the &lt;code&gt;rdfs:label&lt;/code&gt; property &lt;code&gt;rdfs:domain&lt;/code&gt; values of &lt;code&gt;s:Instrument&lt;/code&gt;, &lt;code&gt;s:Musician&lt;/code&gt;, and &lt;code&gt;s:Song&lt;/code&gt;, which isn&amp;rsquo;t quite right; as the RDFS spec &lt;a href=&#34;https://www.w3.org/TR/rdf-schema/#ch_label&#34;&gt;tells us&lt;/a&gt;, &amp;ldquo;[t]he &lt;code&gt;rdfs:domain&lt;/code&gt; of &lt;code&gt;rdfs:label&lt;/code&gt; is &lt;code&gt;rdfs:Resource&lt;/code&gt;&amp;rdquo;. The spec also &lt;a href=&#34;https://www.w3.org/TR/rdf-schema/#ch_domain&#34;&gt;tells us&lt;/a&gt; that &amp;ldquo;the resources denoted by subjects of triples with predicate P are instances of all the classes stated by the &lt;code&gt;rdfs:domain&lt;/code&gt; properties&amp;rdquo;, which in the case of my example means that every instance with an &lt;code&gt;rdfs:label&lt;/code&gt; property is an instrument, a musician, and a song.&lt;/p&gt;
&lt;p&gt;We clearly don&amp;rsquo;t want to say that, but if you are creating a schema for a dataset that lacks one, &lt;code&gt;CONSTRUCT&lt;/code&gt; queries like this can give you a big head start. Just run one or the other with the dataset and then edit the schema that it creates as you see fit.&lt;/p&gt;
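&lt;p&gt;One such edit can happen in the query itself. This variation (my own tweak, not from the original) filters out &lt;code&gt;rdfs:label&lt;/code&gt; so that no misleading domain and range triples get generated for it in the first place:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;

CONSTRUCT {
  ?type a rdfs:Class .
  ?property a rdf:Property .
  ?property rdfs:domain ?type .
  ?property rdfs:range ?otype .
}
WHERE {
  ?s a ?type .
  ?s ?property ?o .
  OPTIONAL { ?o a ?otype }
  FILTER (?property != rdfs:label)  # its real domain is rdfs:Resource
}
&lt;/code&gt;&lt;/pre&gt;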
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1520446102658592768&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;hr /&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/digital-humanities">digital-humanities</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/sql">SQL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Doing a podcast interview about technical writing</title>
      <link>https://www.bobdc.com/blog/ieee-podcast/</link>
      <pubDate>Sun, 06 Mar 2022 12:10:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/ieee-podcast/</guid>
      
      
      <description><div>History, tools, and more.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/se-radio-logo.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;350&#34;  /&gt;
&lt;p&gt;After listening to hundreds of podcast interviews over the years I finally got to be the subject of one myself. Nikhil Krishna interviewed me for the &lt;a href=&#34;https://www.se-radio.net/&#34;&gt;Software Engineering Radio&lt;/a&gt; podcast, which is sponsored by the &lt;a href=&#34;https://www.ieee.org&#34;&gt;IEEE&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was titled &lt;a href=&#34;https://www.se-radio.net/2022/03/episode-501-bob-ducharme-on-creating-technical-documentation-for-software-projects/&#34;&gt;Bob DuCharme on Creating Technical Documentation for Software Projects&lt;/a&gt;. I&amp;rsquo;m going to quote the episode page&amp;rsquo;s list of topics we discussed, but to practice one of the things I preached, I will convert that page&amp;rsquo;s description to a bulleted list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The difference between different types of documentation and the audiences they target&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The importance of using proper grammar and clarity in writing good documentation that people want to read&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Other forms of documentation (images, video and audio)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Challenges of maintaining and updating documentation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Keeping documentation in sync with products&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Toolchains for building documentation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;History of software documentation tooling and standards&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another important topic we covered was working with other people in a tech organization such as developers and marketing people.&lt;/p&gt;
&lt;p&gt;After my discussion of XML&amp;rsquo;s role in the history of technical documentation in the interview (basically, a retelling of &lt;a href=&#34;https://www.bobdc.com/blog/a-brief-opinionated-history-of/&#34;&gt;this history of XML&lt;/a&gt; that I wrote several years ago) I was happy to see that the Software Engineering Radio podcast does offer an &lt;a href=&#34;https://seradio.libsyn.com/rss&#34;&gt;RSS feed&lt;/a&gt; for people to track the podcast guests and topics. You can find other blog entries that I&amp;rsquo;ve written on tech writing topics in the category &lt;a href=&#34;https://www.bobdc.com/categories/documenting-software/&#34;&gt;documenting software&lt;/a&gt; in this blog. The podcast episode page has links to additional relevant material.&lt;/p&gt;
&lt;p&gt;One thing I regretted forgetting to mention in the interview, when we were discussing writing style, was George Orwell&amp;rsquo;s classic essay &lt;a href=&#34;https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/politics-and-the-english-language/&#34;&gt;Politics and the English Language&lt;/a&gt;. I had meant to recommend that everyone read it but pretend that the title is &amp;ldquo;Technology and the English Language&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;So if you&amp;rsquo;re interested in doing technical writing or just being involved with technical writing tasks, you might find this podcast episode useful.&lt;/p&gt;
&lt;p&gt;&lt;iframe loading=&#34;lazy&#34; style=&#34;border: none;&#34; src=&#34;//html5-player.libsyn.com/embed/episode/id/22301357/height/90/theme/custom/autoplay/no/autonext/no/thumbnail/yes/preload/no/no_addthis/no/direction/backward/render-playlist/no/custom-color/337598/&#34; width=&#34;100%&#34; height=&#34;90&#34; scrolling=&#34;no&#34; allowfullscreen=&#34;allowfullscreen&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1500524882383228931&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>Taking some RDF beyond what it could do in a relational database</title>
      <link>https://www.bobdc.com/blog/dhconfrdfpart2/</link>
      <pubDate>Sun, 27 Feb 2022 11:02:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/dhconfrdfpart2/</guid>
      
      
      <description><div>Part 2 of 2.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/IndexOfDigitalHumanitiesConferencesHomePage.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34;  /&gt;
&lt;p&gt;In my &lt;a href=&#34;../dhconfrdfpart1/&#34;&gt;last posting&lt;/a&gt; I described Carnegie Mellon University&amp;rsquo;s &lt;a href=&#34;https://dh-abstracts.library.cmu.edu/&#34;&gt;Index of Digital Humanities Conferences&lt;/a&gt; project, which makes over 60 years of Digital Humanities research abstracts and relevant metadata available on both the project&amp;rsquo;s website and as a file of zipped CSV that they update often. I also described how I developed scripts to convert all that CSV to some pretty nice RDF and made the scripts available on github. I finished with a promise to follow up by showing some of the things we can do with RDF versions of this data that we can&amp;rsquo;t do (or at least, can&amp;rsquo;t do nearly as easily) with the relational version. And here we are.&lt;/p&gt;
&lt;h2 id=&#34;easier-addition-of-new-properties-that-only-apply-to-a-few-instances-of-some-classes&#34;&gt;Easier addition of new properties that only apply to a few instances of some classes&lt;/h2&gt;
&lt;p&gt;What if you want to store additional data about the abstracts, conferences, or authors? For example, what if you want to store the hashtags associated with the conferences? The Chesapeake Digital Humanities Consortium 2020 conference (&lt;code&gt;&amp;lt;http://rdfdata.org/dha/conference/i170&amp;gt;&lt;/code&gt; in my RDF data) has a &lt;code&gt;dha:url&lt;/code&gt; value of &lt;a href=&#34;https://chesapeakedh.github.io/conference-2020&#34;&gt;https://chesapeakedh.github.io/conference-2020&lt;/a&gt;. That&amp;rsquo;s the conference home page, and if I go there I see that the conference hashtag is #CDHC20. When I&amp;rsquo;m at a conference (or not there and wishing that I was) Twitter searches for the conference&amp;rsquo;s hashtag can tell me interesting things that are going on or about to go on. This means that a Twitter hashtag is a hook to additional information about the conference, as you can see with a &lt;a href=&#34;https://twitter.com/hashtag/cdhc20?src=hash&#34;&gt;search on #CDHC20&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say that you could only find hashtags for 15% of the conferences. If you were storing the full dataset in relational tables, is it worth adding a new column to the &lt;code&gt;conferences&lt;/code&gt; table to store this value that will be blank for 85% of the rows? In this particular case, it&amp;rsquo;s not even up to me. I would have to convince the team at Carnegie Mellon to add this column to their &lt;code&gt;conferences&lt;/code&gt; table and populate it.&lt;/p&gt;
&lt;p&gt;With RDF, I don&amp;rsquo;t have to worry about any of this. I can create the data when I have it as more triples like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://rdfdata.org/dha/conference/i170&amp;gt; dha:hashtag &amp;#34;#CDHC20&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(RDF geek note: Instead of storing the hash tag as a literal string value I was tempted to do it as the URL for the Twitter search because resource URIs as objects can then link to other resources. I left it as a string value because the same hashtag might be used with other social media such as &lt;a href=&#34;https://www.instagram.com/explore/tags/cdhc20/&#34;&gt;Instagram&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id=&#34;linking-to-other-data-sets-out-there-linked-data&#34;&gt;Linking to other data sets out there (Linked Data!)&lt;/h2&gt;
&lt;p&gt;I can also add triples of data that enrich the metadata stored with the project. For example, the RDF I created shows that seven works have a keyword value of &lt;code&gt;http://rdfdata.org/dha/keyword/i6995&lt;/code&gt;, which has the label &amp;ldquo;TEI&amp;rdquo;. Wikipedia tells us that the &lt;a href=&#34;https://en.wikipedia.org/wiki/Text_Encoding_Initiative&#34;&gt;Text Encoding Initiative&lt;/a&gt; is &amp;ldquo;a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s&amp;rdquo;. They&amp;rsquo;ve been putting classic works of literature, along with copious metadata, into XML ever since XML was a &lt;a href=&#34;https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language&#34;&gt;four-letter word&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If the Text Encoding Initiative has a Wikipedia page, then it also has &lt;a href=&#34;https://www.wikidata.org/wiki/Q780920&#34;&gt;triples in Wikidata&lt;/a&gt;. These show the project&amp;rsquo;s Twitter handle, its Library of Congress authority ID, its home page, and much more. Just as I added the hashtag value for the Chesapeake Digital Humanities conference above with a triple, I can add another triple that connects the Index of Digital Humanities Conferences URI for TEI to all that great information about it in Wikidata:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://rdfdata.org/dha/keyword/i6995&amp;gt; dha:wikidata &amp;lt;http://www.wikidata.org/entity/Q780920&amp;gt; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This makes the available metadata about the seven Digital Humanities Conferences works tagged this way much richer.&lt;/p&gt;
&lt;h2 id=&#34;easy-federation-and-integration-of-new-data&#34;&gt;Easy federation and integration of new data&lt;/h2&gt;
&lt;p&gt;This goal blurs a bit with &amp;ldquo;Linking to other data sets out there&amp;rdquo; described above, because if you can link to a dataset with a SPARQL endpoint such as Wikidata then you can send it a CONSTRUCT query and retrieve data from it to store with your local data. The &amp;ldquo;Using standards instead of ad-hoc namespaces&amp;rdquo; section of part one of this blog entry was another step toward this kind of integration, because much of the point of using shared vocabularies is the ability to connect your data to other datasets that use the same vocabularies.&lt;/p&gt;
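&lt;p&gt;As a sketch of what that kind of retrieval can look like, the following query (my own illustration; the &lt;code&gt;dha:homepage&lt;/code&gt; property is hypothetical) would run against a local triple store that supports federated queries, follow the &lt;code&gt;dha:wikidata&lt;/code&gt; links described below, and ask the Wikidata endpoint for each linked entity&amp;rsquo;s official website:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dha: &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt;
PREFIX wdt: &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;

CONSTRUCT { ?keyword dha:homepage ?homepage . }
WHERE {
  ?keyword dha:wikidata ?wikidataEntity .
  SERVICE &amp;lt;https://query.wikidata.org/sparql&amp;gt; {
    ?wikidataEntity wdt:P856 ?homepage .  # P856 = official website
  }
}
&lt;/code&gt;&lt;/pre&gt;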
&lt;p&gt;Other data sources offer interesting potential connections to the Digital Humanities conference data. One is the &lt;a href=&#34;https://en.wikipedia.org/wiki/Virtual_International_Authority_File&#34;&gt;Virtual International Authority File&lt;/a&gt;, or &lt;a href=&#34;https://viaf.org/&#34;&gt;VIAF&lt;/a&gt;. This has some fairly official data about authors and their works that you can retrieve in RDF. Author names may not always be completely unique, but looking at this data I realized that many authors are self-disambiguating: if your name is &amp;ldquo;John Smith&amp;rdquo;, you know that many other authors share that name, and your middle name is Francis, you may choose to use &amp;ldquo;John Francis Smith&amp;rdquo; or some variation such as &amp;ldquo;J. Frank Smith&amp;rdquo; or &amp;ldquo;Jack F. Smith&amp;rdquo; as your author name to make it easier for people to find the work that you wrote.&lt;/p&gt;
&lt;p&gt;The RDF that my script generated from the Carnegie Mellon data included this in &lt;code&gt;appellations.ttl&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://rdfdata.org/dha/appellation/i13&amp;gt;
        rdf:type        dha:Appellation ;
        dha:id          &amp;#34;13&amp;#34; ;
        dha:first_name  &amp;#34;A. Charles&amp;#34; ;
        dha:last_name   &amp;#34;Muller&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;VIAF has A. Charles Muller at &lt;a href=&#34;https://viaf.org/viaf/117299466/#Muller,_A._Charles,_1953-&#34;&gt;https://viaf.org/viaf/117299466/#Muller,_A._Charles,_1953-&lt;/a&gt;, with 117299466 being their database&amp;rsquo;s unique identifier for this author. We can use that identifier to create the URL &lt;a href=&#34;https://viaf.org/viaf/117299466/rdf.xml&#34;&gt;https://viaf.org/viaf/117299466/rdf.xml&lt;/a&gt; and then download 111 triples about him. We can also download &lt;a href=&#34;http://viaf.org/viaf/data/&#34;&gt;various versions of the entire VIAF dataset&lt;/a&gt;, but that is too many gigabytes for me to do some quick experiments with. If it was loaded into a triple store, a SPARQL query that concatenates the &lt;code&gt;dha:first_name&lt;/code&gt; and &lt;code&gt;dha:last_name&lt;/code&gt; values above could help to automate the connection of conference paper authors to VIAF records.&lt;/p&gt;
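&lt;p&gt;A sketch of such a query, assuming that both datasets are loaded into the same triple store and that the VIAF triples name each person with &lt;code&gt;schema:name&lt;/code&gt; (that property choice is my assumption; check the actual VIAF RDF):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dha:    &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;

SELECT ?appellation ?viafEntity WHERE {
  ?appellation dha:first_name ?first ;
               dha:last_name  ?last .
  ?viafEntity schema:name ?viafName .
  # STR() drops any language tag before comparing the concatenated name
  FILTER (STR(?viafName) = CONCAT(?first, &amp;quot; &amp;quot;, ?last))
}
&lt;/code&gt;&lt;/pre&gt;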
&lt;h2 id=&#34;inferencing-finding-new-facts-and-connections&#34;&gt;Inferencing: finding new facts and connections&lt;/h2&gt;
&lt;p&gt;Authors of the conference papers made up their own keywords to assign to their works instead of selecting from a curated taxonomy, so it&amp;rsquo;s one big flat list. I did a little curation myself to give the list some hierarchy, making it easier to find relationships between related papers.&lt;/p&gt;
&lt;p&gt;There were over two dozen keywords that were some variation on &amp;ldquo;TEI&amp;rdquo; or &amp;ldquo;Text Encoding Initiative&amp;rdquo;. In my github project&amp;rsquo;s &lt;code&gt;newrdf&lt;/code&gt; directory I added some triples to the SKOS scheme called &lt;code&gt;keywordScheme&lt;/code&gt; that I described in part one. The &lt;code&gt;modelTriples.ttl&lt;/code&gt; file in that directory begins like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix rdf:   &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix skos:  &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; .
@prefix dha:   &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt; .
@prefix dhak:  &amp;lt;http://rdfdata.org/dha/keyword/&amp;gt; .

dhak:r10001 a               skos:Concept ;
            skos:inScheme   dha:keywordScheme ;
            skos:prefLabel  &amp;#34;Text Encoding Initiative (TEI)&amp;#34; .

dhak:i1100 skos:broader dhak:r10001 . # generated tei
dhak:i2639 skos:broader dhak:r10001 . # tei and structural markup
dhak:i2641 skos:broader dhak:r10001 . # tei encoding
dhak:i2642 skos:broader dhak:r10001 . # tei markup
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;First, it defines a new SKOS concept called &amp;ldquo;Text Encoding Initiative (TEI)&amp;rdquo;. The triples that follow it say that each of the relevant SKOS concepts generated from the Carnegie Mellon CSV by my automated conversion has this new one as its &lt;code&gt;skos:broader&lt;/code&gt; value, just as &amp;ldquo;dachshund&amp;rdquo; in an animal taxonomy might have a broader value of &amp;ldquo;dog&amp;rdquo; to group together the different breeds. After the &lt;code&gt;dhak:i2642&lt;/code&gt; triple shown above there are 22 more about other TEI-related keywords. (I was tempted to automate the creation of all of these by looking for a substring of &amp;ldquo;tei&amp;rdquo; in the generated keyword concepts, but existing keywords like &amp;ldquo;Wittgenstein&amp;rdquo; and &amp;ldquo;Frankenstein&amp;rdquo; showed me that this was a bad idea.)&lt;/p&gt;
&lt;p&gt;The git repository where I stored all the files for this conversion project has a &lt;a href=&#34;https://github.com/bobdc/dhconf2rdf&#34;&gt;readme&lt;/a&gt; file that shows some queries demonstrating the value added by this additional data modeling of the otherwise flat keyword list. A SPARQL query for all the works tagged &amp;ldquo;tei&amp;rdquo; retrieves a list of 90 of them. A query for all works tagged with something in the taxonomic subtree of &amp;ldquo;Text Encoding Initiative (TEI)&amp;rdquo; finds 132, so adding a little bit of semantics in the form of explicit relationships between related topics made it possible to find more papers about the TEI. A third query in the readme counts how many TEI-related papers were submitted each year for results that could be turned into a chart of the TEI&amp;rsquo;s popularity at these conferences over time:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;1996: 5
1997: 1
1998: 2
2001: 6
2004: 2
2013: 12
2014: 19
2015: 14
2016: 18
2017: 12
2018: 7
2019: 24
2020: 8
2021: 2
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &amp;ldquo;inferencing&amp;rdquo; here is the deduction, based on the little bit of modeling that I did, of connections that were not otherwise explicit between resources described by the dataset.&lt;/p&gt;
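&lt;p&gt;The subtree query can be sketched with a SPARQL 1.1 property path; this is a minimal version built from the triples shown above (the query in the readme may differ in its details):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX dha:    &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt;
PREFIX dhak:   &amp;lt;http://rdfdata.org/dha/keyword/&amp;gt;
PREFIX skos:   &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;

SELECT ?title WHERE {
  # zero or more skos:broader steps up to the new TEI concept
  ?keyword skos:broader* dhak:r10001 .
  ?work schema:keywords ?keyword ;
        dha:title ?title .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Because &lt;code&gt;skos:broader*&lt;/code&gt; matches paths of any length, a query like this keeps working if more levels of hierarchy get added later.&lt;/p&gt;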
&lt;p&gt;The triples in &lt;code&gt;modelTriples.ttl&lt;/code&gt; that enable this, like the RDF triples about conference hash tags, demonstrate how RDF can add value to a dataset that is outside of the control of the person doing the adding. As long as the &lt;code&gt;id&lt;/code&gt; values in the original database keep identifying the same things, we can turn them into URIs that let us connect new kinds of data to the original dataset. It&amp;rsquo;s another great example of the new possibilities that become available when you use RDF to store your data.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1497967289299259394&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/digital-humanities">digital-humanities</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/sql">SQL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Converting Digital Humanities paper and conference metadata to RDF</title>
      <link>https://www.bobdc.com/blog/dhconfrdfpart1/</link>
      <pubDate>Sun, 30 Jan 2022 11:35:06 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/dhconfrdfpart1/</guid>
      
      
      <description><div>How and why.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/IndexOfDigitalHumanitiesConferencesHomePage.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34;  /&gt;
&lt;p&gt;I think that RDF has been very helpful in the field of &lt;a href=&#34;https://en.wikipedia.org/wiki/Digital_humanities&#34;&gt;Digital Humanities&lt;/a&gt; for two reasons: first, because so much of that work involves gaining insight from adding new data sources to a given collection, and second, because a large part of this data is metadata about manuscripts and other artifacts. RDF&amp;rsquo;s flexibility supports both of these very well, and several standard schemas and ontologies have matured in the Digital Humanities community to help coordinate the different data sets.&lt;/p&gt;
&lt;p&gt;Unrelated to RDF, in late 2020 a project at Carnegie Mellon University released &lt;a href=&#34;https://dh-abstracts.library.cmu.edu/&#34;&gt;The Index of Digital Humanities Conferences&lt;/a&gt;. As the project&amp;rsquo;s home page tells us, &amp;ldquo;Browse 7,296 presentations from 500 digital humanities conferences spanning 61 years, featuring 8,651 different authors hailing from 1,853 institutions and 86 countries&amp;rdquo;. These numbers have gone up since the original release of the project. The &lt;a href=&#34;https://dh-abstracts.library.cmu.edu/pages/about/&#34;&gt;About&lt;/a&gt; page and Scott Weingart&amp;rsquo;s &lt;a href=&#34;http://scottbot.net/dh-conf-index/&#34;&gt;blog post&lt;/a&gt; about the project give more good background.&lt;/p&gt;
&lt;p&gt;The presentation abstracts, along with the connections to their presenters and their affiliations, are a gold mine for Digital Humanities research. One of the project&amp;rsquo;s main menus is &lt;a href=&#34;https://dh-abstracts.library.cmu.edu/downloads&#34;&gt;Downloads&lt;/a&gt;, which lets you download all the data used for the project. The &amp;ldquo;Last updated&amp;rdquo; message on that page gives me the impression that they update it several times a week, if not every day. The &amp;ldquo;Full Data&amp;rdquo; zip file that you can download from there has CSV files of all the tables in the project&amp;rsquo;s database.&lt;/p&gt;
&lt;p&gt;According to the project&amp;rsquo;s &lt;a href=&#34;https://dh-abstracts.library.cmu.edu/pages/colophon/&#34;&gt;Colophon&lt;/a&gt;, they store their data  in PostgreSQL and built the interface with Django. I can&amp;rsquo;t blame them for storing the data as a relational database instead of RDF, precisely because tools like Django and Ruby on Rails make it so easy to generate nice websites from relational data.&lt;/p&gt;
&lt;p&gt;Of course, though, I converted it all to RDF, so I&amp;rsquo;m going to describe here how I converted it—or rather, how I built a process to convert it, because I wanted an automated system that could easily be re-run when the CSV data to download gets updated. My next posting will describe the cool new things I could do with the data once it was in RDF, because &amp;ldquo;why bother&amp;rdquo; is an important question for any such project. Here&amp;rsquo;s a preview to whet your appetite:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Easier addition of new properties that only apply to a few instances of some classes&lt;/li&gt;
&lt;li&gt;Linking to other data sets out there (Linked Data!)&lt;/li&gt;
&lt;li&gt;Easy federation and integration of new data&lt;/li&gt;
&lt;li&gt;Inferencing: finding new facts and connections&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I put everything necessary to do the conversion and enhancements on &lt;a href=&#34;https://github.com/bobdc/dhconf2rdf&#34;&gt;github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I could have loaded the CSV files into a locally running relational database and then used &lt;a href=&#34;https://www.bobdc.com/tags/d2rq/&#34;&gt;D2RQ&lt;/a&gt; as an intermediary layer to treat the relational data as triples. When the Index of Digital Humanities Conferences releases an updated version of their data, though, clearing out the relational data tables and then reloading the updated tables would have been a lot more trouble than just running the short scripts that I wrote, especially if the structure of any of those tables had evolved. And, part of the fun of the conversion was moving beyond the original model to take advantage of relevant standards for easier connection to other projects.&lt;/p&gt;
&lt;h1 id=&#34;converting-the-data&#34;&gt;Converting the data&lt;/h1&gt;
&lt;p&gt;I wanted the ability to re-run my set of scripts and queries to accommodate updated versions of the data, and &amp;ldquo;updated versions&amp;rdquo; could mean two things: some tables of data might have new or revised rows, but there might also be entirely new tables and columns. If the data models evolve, I want my output triples to reflect this evolution. (This has already paid off. When I first wrote up my notes on this conversion, the Index of Digital Humanities Conferences project had 22 tables; now it has 23, and I did not need to revise any of my scripts to include the new table&amp;rsquo;s data.)&lt;/p&gt;
&lt;p&gt;With three of the tables loaded into spreadsheets we can see how one table defines the connections between data in the other two the relational way:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/ThreeDHConfTables.png&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/ThreeDHConfTables.png&#34; class=&#34;centered&#34; width=&#34;600&#34; alt=&#34;Three DH Conference tables as spreadsheets&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;works_keywords.csv&lt;/code&gt; table currently has 13,730 rows. As you can see above, rows 2 and 3 of that spreadsheet tell us that the keywords with IDs 889 (&amp;ldquo;&lt;a href=&#34;https://www.loc.gov/ead/&#34;&gt;ead&lt;/a&gt;&amp;rdquo;) and 2439 (&amp;ldquo;sgml-encoding&amp;rdquo;) have been assigned to work 103, &amp;ldquo;What&amp;rsquo;s Interesting for Humanities Computing About Whitman&amp;rsquo;s Poetry Manuscripts?&amp;rdquo; This database has nine tables whose sole job is recording relationships between other tables, as &lt;code&gt;works_keywords&lt;/code&gt; does for the &lt;code&gt;works&lt;/code&gt; and &lt;code&gt;keywords&lt;/code&gt; tables. (As you&amp;rsquo;ll see, RDF does a better job of expressing such relationships.)&lt;/p&gt;
&lt;p&gt;I used the open source &lt;a href=&#34;https://www.bobdc.com/blog/tarql/&#34;&gt;tarql&lt;/a&gt; tool to convert all the tables to RDF. Here are some excerpts from the initial conversion:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# from keywords.ttl
&amp;lt;http://rdfdata.org/dha/keyword/i889&amp;gt;
        rdf:type   dha:Keyword ;
        dha:id     &amp;#34;889&amp;#34; ;
        dha:title  &amp;#34;ead&amp;#34; .

# from works.ttl
&amp;lt;http://rdfdata.org/dha/work/i103&amp;gt;
        rdf:type        dha:Work ;
        dha:id          &amp;#34;103&amp;#34; ;
        dha:conference  &amp;lt;http://rdfdata.org/dha/conference/i2&amp;gt; ;
        dha:title       &amp;#34;What&amp;#39;s Interesting for Humanities Computing About Whitman&amp;#39;s Poetry Manuscripts?&amp;#34; ;
        dha:work_type   &amp;#34;3&amp;#34; .

# from works_keywords.ttl
&amp;lt;http://rdfdata.org/dha/works_keywords/i1&amp;gt;
        rdf:type     dha:works_keywords ;
        dha:id       &amp;#34;1&amp;#34; ;
        dha:work     &amp;lt;http://rdfdata.org/dha/work/i103&amp;gt; ;
        dha:keyword  &amp;lt;http://rdfdata.org/dha/keyword/i889&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To convert whatever CSV files happened to be in the downloaded zip file, my &lt;code&gt;makeQueries.pl&lt;/code&gt; perl script reads all of the CSV files that it finds in the &lt;code&gt;dh_conferences_data&lt;/code&gt; subdirectory and:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If a file has no underscore in its name and is therefore not a list of relationships, the perl script uses a proper-cased singular version of the file&amp;rsquo;s name as a class name for the data it contains—for example, &amp;ldquo;Work&amp;rdquo; for the data in &lt;code&gt;works.csv&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Creates the query that will drive tarql&amp;rsquo;s conversion of the CSV file. &lt;code&gt;makeQueries.pl&lt;/code&gt; reads the property names from the CSV&amp;rsquo;s first line and uses them to create a SPARQL CONSTRUCT query that creates an instance of the class whose name it identified in the previous step. Each data row&amp;rsquo;s ID value (with an &amp;ldquo;i&amp;rdquo; prefix added) is used as the local name of the URI that represents that row&amp;rsquo;s resource. This gives the first work listed (&amp;ldquo;Writing about It: Documentation and Humanities Computing&amp;rdquo;) a URI of &lt;code&gt;http://rdfdata.org/dha/work/i1&lt;/code&gt;, and the 103rd one, which is shown above, a URI of  &lt;code&gt;http://rdfdata.org/dha/work/i103&lt;/code&gt; .&lt;/li&gt;
&lt;li&gt;Writes the query to the &lt;code&gt;dh_conferences_sparql&lt;/code&gt; subdirectory with the same filename as the input CSV file and an extension of &lt;code&gt;rq&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writes a line to standard out that tells tarql to read this new SPARQL query file, run it, and put the output in the &lt;code&gt;dh_conferences_rdf&lt;/code&gt; subdirectory in a file with the same name as the query and an extension of &lt;code&gt;ttl&lt;/code&gt;. The directions with the script say to redirect its output of all of these tarql calls to a shell script, so when the perl script is done you can run that shell script to do the actual conversion of all that CSV to RDF.&lt;/li&gt;
&lt;/ol&gt;
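&lt;p&gt;For example, the generated query for &lt;code&gt;works.csv&lt;/code&gt; would look roughly like the following. (This is a reconstruction based on the output triples shown above, not the script&amp;rsquo;s verbatim output; tarql binds a variable to each column, named after the CSV header.)&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
PREFIX dha: &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt;

CONSTRUCT {
  ?workURI rdf:type       dha:Work ;
           dha:id         ?id ;
           dha:conference ?conferenceURI ;
           dha:title      ?title .
}
WHERE {
  # ?id, ?conference, and ?title come from the CSV header row
  BIND(URI(CONCAT(&amp;#34;http://rdfdata.org/dha/work/i&amp;#34;, ?id)) AS ?workURI)
  BIND(URI(CONCAT(&amp;#34;http://rdfdata.org/dha/conference/i&amp;#34;, ?conference)) AS ?conferenceURI)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Running tarql with a query like this against the CSV file produces the Turtle shown earlier.&lt;/p&gt;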
&lt;p&gt;The &lt;code&gt;makeQueries.pl&lt;/code&gt; perl script also has an array of &lt;code&gt;foreignKeyFields&lt;/code&gt; so that it knows that when a line from one CSV file is referencing an instance of data in another, it should reference it with a URI. (Knowledge graphs!) So, for example, a value of &amp;ldquo;1&amp;rdquo; for a work&amp;rsquo;s conference (The Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, in Glasgow) is turned into the appropriate URI so that the triple about the &amp;ldquo;Writing about it&amp;rdquo; paper&amp;rsquo;s conference is this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://rdfdata.org/dha/work/i1&amp;gt; dha:conference &amp;lt;http://rdfdata.org/dha/conference/i1&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If the data model of the relational input data ever adds a new column of foreign key references, the perl script will need a slight adjustment to add that column to the &lt;code&gt;foreignKeyFields&lt;/code&gt; array.&lt;/p&gt;
&lt;h1 id=&#34;making-the-rdf-better-than-the-relational-data&#34;&gt;Making the RDF better than the relational data&lt;/h1&gt;
&lt;p&gt;Once you have data as triples—&lt;em&gt;any&lt;/em&gt; triples—you can use SPARQL CONSTRUCT queries to improve that data.&lt;/p&gt;
&lt;h2 id=&#34;using-standards-instead-of-ad-hoc-namespaces&#34;&gt;Using standards instead of ad-hoc namespaces&lt;/h2&gt;
&lt;p&gt;My conversion script puts a lot of resources in namespaces built around my domain name &lt;code&gt;rdfdata.org&lt;/code&gt;. When possible, I&amp;rsquo;d rather that they use standard namespaces. For example, the script above created this in &lt;code&gt;keywords.ttl&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://rdfdata.org/dha/keyword/i2641&amp;gt;
        rdf:type   dha:Keyword ;
        dha:id     &amp;#34;2641&amp;#34; ;
        dha:title  &amp;#34;tei encoding&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If we&amp;rsquo;re using keywords to assign subjects to works, I&amp;rsquo;d rather store information about those keywords using the &lt;a href=&#34;https://www.bobdc.com/categories/skos/&#34;&gt;SKOS&lt;/a&gt; standard, so my &lt;code&gt;keywords2skos.rq&lt;/code&gt; SPARQL query turns the above into this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://rdfdata.org/dha/keyword/i2641&amp;gt;
        rdf:type        skos:Concept ;
        skos:inScheme   dha:keywordScheme ;
        skos:prefLabel  &amp;#34;tei encoding&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Note that it&amp;rsquo;s not actually &lt;em&gt;converting&lt;/em&gt; the &lt;code&gt;http://rdfdata.org/dha/keyword/i2641&lt;/code&gt; resource, but just adding new triples about it in the SKOS namespace. These triples are stored separately from the original, so we don&amp;rsquo;t have to load originals into a triplestore when we use this data in an application.&lt;/p&gt;
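&lt;p&gt;The CONSTRUCT query that does this can be as simple as the following sketch, which is consistent with the input and output triples shown above (the actual &lt;code&gt;keywords2skos.rq&lt;/code&gt; may differ in its details):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX dha:  &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt;

CONSTRUCT {
  ?keyword a skos:Concept ;
           skos:inScheme  dha:keywordScheme ;
           skos:prefLabel ?label .
}
WHERE {
  # every converted keyword resource becomes a SKOS concept
  ?keyword a dha:Keyword ;
           dha:title ?label .
}
&lt;/code&gt;&lt;/pre&gt;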
&lt;p&gt;The conference and abstract data also assigned topics to the various papers, so I did a similar conversion with them, storing them in the SKOS &lt;code&gt;dha:topicScheme&lt;/code&gt; scheme instead of the &lt;code&gt;dha:keywordScheme&lt;/code&gt; one shown above that I used for keywords.&lt;/p&gt;
&lt;p&gt;If I were creating a serious production application, I could take this further. For example, instead of using the property &lt;code&gt;http://rdfdata.org/dha/ns/dh-abstracts/title&lt;/code&gt; to reference the abstracts&amp;rsquo; titles, I could use &lt;code&gt;http://purl.org/dc/elements/1.1/title&lt;/code&gt;, and there is probably an ontology for conferences out there that has defined some of these other properties. (The &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt; &lt;a href=&#34;https://schema.org/Event&#34;&gt;&lt;code&gt;Event&lt;/code&gt;&lt;/a&gt; class looks like it could cover a lot of the latter.)&lt;/p&gt;
&lt;h2 id=&#34;improving-the-links-between-resources&#34;&gt;Improving the links between resources&lt;/h2&gt;
&lt;p&gt;As we saw above, the &lt;code&gt;works_keywords.ttl&lt;/code&gt; RDF file that this process creates from the &lt;code&gt;works_keywords.csv&lt;/code&gt; data ends up with triples like this, which tells us that &lt;code&gt;works_keywords&lt;/code&gt; row &lt;code&gt;i1&lt;/code&gt; represents a link from work &lt;code&gt;i103&lt;/code&gt; to keyword &lt;code&gt;i889&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://rdfdata.org/dha/works_keywords/i1&amp;gt;
        rdf:type     dha:works_keywords ;
        dha:id       &amp;#34;1&amp;#34; ;
        dha:work     &amp;lt;http://rdfdata.org/dha/work/i103&amp;gt; ;
        dha:keyword  &amp;lt;http://rdfdata.org/dha/keyword/i889&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;RDF lets us do better than this relational database style out-of-line linking. Instead of a &amp;ldquo;link&amp;rdquo; resource that references the two linked resources, why not just say in the data about work &lt;code&gt;i103&lt;/code&gt; that it has a keyword of resource &lt;code&gt;i889&lt;/code&gt;? The &lt;code&gt;createWorkKeywordTriples.rq&lt;/code&gt; query does just that, reading the above triples and creating a new &lt;code&gt;workKeywordTriples.ttl&lt;/code&gt; file in the &lt;code&gt;newrdf&lt;/code&gt; subdirectory that has triples like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dhaw:i103 schema:keywords dhak:i889 .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once I&amp;rsquo;ve done that, I don&amp;rsquo;t even need the triples in the &lt;code&gt;works_keywords.ttl&lt;/code&gt; file. They&amp;rsquo;re just an artifact of the data&amp;rsquo;s relational heritage. I also used the schema.org standard&amp;rsquo;s property &lt;code&gt;schema:keywords&lt;/code&gt; to show that a given keyword was assigned to a given work. If I&amp;rsquo;m going to connect keywords to a work the RDF way, I may as well use a property from a well-known standard to do it!&lt;/p&gt;
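&lt;p&gt;A query that collapses those link resources can be sketched like this, built from the &lt;code&gt;works_keywords&lt;/code&gt; triples shown above (the actual &lt;code&gt;createWorkKeywordTriples.rq&lt;/code&gt; may differ in its details):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX dha:    &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;

CONSTRUCT {
  ?work schema:keywords ?keyword .
}
WHERE {
  # each link resource points at exactly one work and one keyword
  ?link a dha:works_keywords ;
        dha:work    ?work ;
        dha:keyword ?keyword .
}
&lt;/code&gt;&lt;/pre&gt;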
&lt;p&gt;A &lt;code&gt;createWorkTopicTriples.rq&lt;/code&gt; SPARQL CONSTRUCT query does the same thing with the topic assignments that &lt;code&gt;createWorkKeywordTriples.rq&lt;/code&gt; did with the keyword assignments.&lt;/p&gt;
&lt;h1 id=&#34;what-have-we-got&#34;&gt;What have we got?&lt;/h1&gt;
&lt;p&gt;Once we have made these improvements, we can run the following query to ask about the title, conference year, and any keywords associated with any works that mention Whitman in their title:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX dha:    &amp;lt;http://rdfdata.org/dha/ns/dh-abstracts/&amp;gt; 
PREFIX schema: &amp;lt;http://schema.org/&amp;gt; 

SELECT ?title ?conferenceYear ?keyword WHERE {
 ?work dha:title ?title ;
       dha:conference ?conferenceID ;
       schema:keywords ?keywordID . 
  ?keywordID dha:title ?keyword . 
  FILTER (CONTAINS(?title,&amp;#34;Whitman&amp;#34;))
  ?conferenceID dha:year ?conferenceYear . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;There is only one work, but because it has two different keywords assigned to it, the result shows up as two rows:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/dhquery1results.png&#34; class=&#34;centered&#34;  alt=&#34;Results of first sample query&#34;/&gt;
&lt;h1 id=&#34;next-steps&#34;&gt;Next steps&lt;/h1&gt;
&lt;p&gt;The github repository&amp;rsquo;s &lt;a href=&#34;https://github.com/bobdc/dhconf2rdf#readme&#34;&gt;readme file&lt;/a&gt;  has a step-by-step enumeration of which scripts to run when, with less discussion than you&amp;rsquo;ve seen here. It also provides a preview of  some of the things I&amp;rsquo;ll talk about &lt;a href=&#34;../dhconfrdfpart2&#34;&gt;next time&lt;/a&gt; when I demonstrate some of the things we can do with RDF versions of this data that we can&amp;rsquo;t do (or at least, can&amp;rsquo;t do nearly as easily) with the relational version.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1487829265320132617&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2022">2022</category>
      
      <category domain="https://www.bobdc.com//categories/digital-humanities">digital-humanities</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/sql">SQL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>17 years of my web bookmarks, with metadata</title>
      <link>https://www.bobdc.com/blog/bookmarks/</link>
      <pubDate>Fri, 31 Dec 2021 12:58:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/bookmarks/</guid>
      
      
      <description><div> Featuring &#34;75 Bleeding-Edge Search Engines To Beat Google&#34;, and more!</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/bookmark.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; /&gt;
&lt;p&gt;Much of the original point of the web was not just linking from one page to another but also saving and managing links, ideally with some metadata. Because of this, all browsers give you some way to save a link to a web page as a bookmark, and they typically let you sort these into a hierarchical arrangement of folders.&lt;/p&gt;
&lt;p&gt;Third-party apps have cropped up with various strategies for improving on the built-in bookmark management offered by browsers. I have used &lt;a href=&#34;https://www.diigo.com/&#34;&gt;diigo&lt;/a&gt; since 2004  and &lt;a href=&#34;https://en.wikipedia.org/wiki/Delicious_(website)&#34;&gt;del.icio.us&lt;/a&gt; before that, and a recent review of &lt;a href=&#34;https://www.diigo.com/user/bobducharme&#34;&gt;my 71 pages of bookmarks&lt;/a&gt; was like a tour of my own mind for 17 years. (I seem to remember migrating the del.icio.us bookmarks when I made the transition but only see a few dozen of them showing up together in the early days of my using diigo.)&lt;/p&gt;
&lt;p&gt;The ability to &lt;a href=&#34;https://www.diigo.com/user/bobducharme/tags&#34;&gt;tag&lt;/a&gt; diigo bookmarks makes it easier for me to link to batches of them from here. For example, 63 of my early links are about the very concept of &lt;a href=&#34;https://www.diigo.com/user/bobducharme?query=%23linking&#34;&gt;linking&lt;/a&gt; and the related standards and implementations that were evolving at the time. These links about linking include links to entries from my first blog, &lt;a href=&#34;http://www.snee.com/xml/tal.html&#34;&gt;Thinking About Linking&lt;/a&gt;, which I had on oreilly.com.&lt;/p&gt;
&lt;p&gt;Reviewing the full bookmark collection showed some interesting patterns.&lt;/p&gt;
&lt;h1 id=&#34;too-much-fun-to-not-share-right-away&#34;&gt;Too much fun to not share right away&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;http://museumofbadart.org/&#34;&gt;Museum of Bad Art (MOBA)&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;From 2008: &lt;a href=&#34;https://www.cmswire.com/cms/search/-75-bleedingedge-search-engines-to-beat-google-002861.php&#34;&gt;75 Bleeding-Edge Search Engines To Beat Google&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://i.imgur.com/qc4rHwf.gifv&#34;&gt;Christopher Loopin: recursive Winnie the Pooh&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://infomesh.net/2002/swhaiku/&#34;&gt;The Semantic Web&amp;hellip; in Haiku&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;link-rot-or-not&#34;&gt;Link rot (or not)&lt;/h1&gt;
&lt;p&gt;Big respect to the organizations whose URLs still point to the same content they pointed to way back when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The New York Times, like with this &lt;a href=&#34;https://open.blogs.nytimes.com/2007/10/23/messing-around-with-metadata/&#34;&gt;Messing Around With Metadata&lt;/a&gt; piece that I bookmarked in 2007.&lt;/li&gt;
&lt;li&gt;The Economist, like with this &lt;a href=&#34;https://www.economist.com/technology-quarterly/2008/04/09/start-making-sense&#34;&gt;Start making sense&lt;/a&gt; piece about the semantic web from 2008.&lt;/li&gt;
&lt;li&gt;Lifehacker, like with this &lt;a href=&#34;https://lifehacker.com/completely-remove-programs-with-revo-uninstaller-282337&#34;&gt;review of a Windows uninstaller&lt;/a&gt; that I bookmarked in 2007. (One name that came up a lot in my bookmark inventory was &lt;a href=&#34;https://ginatrapani.org/&#34;&gt;Gina Trapani&lt;/a&gt;. She wrote many, many pieces at Lifehacker that I found useful enough to bookmark.)&lt;/li&gt;
&lt;li&gt;I was going to give special kudos to technology news company &lt;a href=&#34;https://gigaom.com/&#34;&gt;GigaOM&lt;/a&gt;, because when I started reviewing these links I found that after GigaOM acquired the media news publisher &lt;a href=&#34;https://en.wikipedia.org/wiki/PaidContent&#34;&gt;PaidContent&lt;/a&gt; they redirected PaidContent article URLs to gigaom.com URLs for the same articles, but these links have &lt;a href=&#34;http://www.paidcontent.org/entry/419-the-inside-word/&#34;&gt;rotted away&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The plain domain name paidcontent.org does redirect to gigaom.com, which reminds me of another interesting pattern I saw: expired domain names, including those with specific technical names, were often taken over by Japanese or Chinese sites that seem to have nothing to do with the original content. One example is &lt;a href=&#34;http://medianmusic.com&#34;&gt;medianmusic.com&lt;/a&gt;, which was selling groceries in Chinese when I first checked during my bookmark review and now shows Chinese content that I don&amp;rsquo;t understand well enough to generalize about.&lt;/p&gt;
&lt;p&gt;Some other companies get neither the A+ that the companies above earned for link maintenance nor a failing grade. Two examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;O&amp;rsquo;Reilly. The URLs for the tons of content that was once on oreillynet.com (like to this &lt;a href=&#34;http://www.oreillynet.com/pub/a/oreilly/digitalmedia/2005/11/16/what-is-screencasting.html&#34;&gt;What is screencasting&lt;/a&gt; piece) now just redirect to the oreilly.com home page. It is nice to see that O&amp;rsquo;Reilly Radar pieces like &lt;a href=&#34;http://radar.oreilly.com/archives/2007/09/recent-conversa.html&#34;&gt;Recent conversations about online documentation&lt;/a&gt; and O&amp;rsquo;Reilly Tools of Change for Publishing pieces like &lt;a href=&#34;http://toc.oreilly.com/2008/07/ala-2008-librarians-and-patron.html&#34;&gt;ALA 2008: Librarians and Patrons Want More Openness&lt;/a&gt; still lead to the original articles.&lt;/li&gt;
&lt;li&gt;Publishing industry newsletter The Gilbane Advisor, which still has the 2009 Bill Trippe piece &lt;a href=&#34;https://gilbane.com/2009/07/random_house_creating_a_21st_century_publishing_framework&#34;&gt;Random House: Creating a 21st Century Publishing Framework&lt;/a&gt; but not several of the other pieces I had bookmarked.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Failing grades:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IBM. While reviewing my diigo links I made note of five or six IBM developerWorks articles that were still there ten or so years after being published, but as I write this, none of them are there anymore. I&amp;rsquo;ll give them credit for one thing, though: they paid us to write those articles! After looking over my contract for one of them I recently republished it here: &lt;a href=&#34;../skosibm&#34;&gt;Taxonomy management with SKOS&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Taxonomy management tool vendor Synaptica. They had many bookmarkable articles about taxonomy management on their synapticacentral.com site, but all the ones I looked for are no longer there.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, you could add to all three lists above; I&amp;rsquo;m just basing them on my own bookmark review. I purged many of them from the collection before I started taking notes for this blog post.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://archive.org/web/&#34;&gt;Wayback Machine&lt;/a&gt; is like a versioning system for the web that lets you see how just about any web page looked at earlier points in the web&amp;rsquo;s history. It has been so valuable that I just donated $10 to them while reviewing one of my links that use it. I have replaced several of my formerly dead diigo bookmarks with links to Wayback Machine versions, like &lt;a href=&#34;http://web.archive.org/web/20120614115555/http://www.americastestkitchenfeed.com/do-it-yourself/2011/07/how-to-make-candied-ginger/&#34;&gt;this recipe for making candied ginger from fresh ginger&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;things-i-thought-were-going-to-be-a-bigger-deal-than-they-turned-out-to-be&#34;&gt;Things I thought were going to be a bigger deal than they turned out to be&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I already mentioned that I used to be especially interested in linking—not just the ability to jump from web page to relevant web page, but standards-based architectures being built around these ideas and the ability for applications to take advantage of these architectures. &lt;a href=&#34;https://www.xml.com/pub/a/2002/03/13/xlink.html&#34;&gt;Twenty years ago&lt;/a&gt; I realized that it wasn&amp;rsquo;t going to play out that way, although of course various JavaScript libraries and related tools have let people create more sophisticated link implementations. Standardized metadata built in to the links? Not so much.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A dozen links tagged &lt;a href=&#34;https://www.diigo.com/user/bobducharme?query=%23rdfa&#34;&gt;RDFa&lt;/a&gt;. With millions (billions?) of HTML pages now using JSON-LD to embed triples, I won&amp;rsquo;t complain about RDFa&amp;rsquo;s failure: the goal of embedding machine-readable triples in HTML pages was achieved anyway.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.diigo.com/user/bobducharme?query=%23HTML5&#34;&gt;HTML5&lt;/a&gt; and the bitter process of its development.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.diigo.com/user/bobducharme?query=%23chatbots&#34;&gt;Chatbots&lt;/a&gt;. My two most recent bookmarks with this tag are to &lt;a href=&#34;https://chatbotsmagazine.com/&#34;&gt;Chatbots Magazine&lt;/a&gt;, whose newest article is over two years old, and a 2018 piece on chatbotslife.com titled &lt;a href=&#34;https://chatbotslife.com/chatbots-what-happened-dcc3f91a512c&#34;&gt;Chatbots: What Happened?&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Google+. I can&amp;rsquo;t even link to my bookmarks there because I deleted them all during my purge.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of my bookmarks showed the rise in popularity of things that continue to be popular, such as cloud computing, Twitter, and electronic book technologies.&lt;/p&gt;
&lt;h1 id=&#34;miscellaneous-observations&#34;&gt;Miscellaneous observations&lt;/h1&gt;
&lt;p&gt;I had many bookmarks for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tasks that were difficult to do in Linux 13 years ago but are easier now.&lt;/li&gt;
&lt;li&gt;Windows utilities that would now be outdated even if I still used Windows.&lt;/li&gt;
&lt;li&gt;Things I no longer need to bookmark because a web search to find them is faster than finding the bookmark (for example, a web form that escapes and unescapes URLs).&lt;/li&gt;
&lt;li&gt;Things in the category of “I should read this but don’t feel like it; I will tag it so that I can come back to it if I ever regret not reading it”—especially in the field of machine learning.&lt;/li&gt;
&lt;li&gt;How did I find the book image shown at the beginning of this blog post? After I wrote my first draft, I searched my diigo bookmarks for &lt;a href=&#34;https://www.diigo.com/user/bobducharme?query=clipart&#34;&gt;clipart&lt;/a&gt; and found &lt;a href=&#34;https://openclipart.org/&#34;&gt;Openclipart&lt;/a&gt;, which I had tagged as &lt;a href=&#34;https://www.diigo.com/user/bobducharme/?query=%23opensource&#34;&gt;opensource&lt;/a&gt; and &lt;a href=&#34;https://www.diigo.com/user/bobducharme/?query=%23clipart&#34;&gt;clipart&lt;/a&gt;. I guess I use my diigo bookmarks more than I realize.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;adding-this-data-to-a-personal-knowledge-graph&#34;&gt;Adding this data to a personal knowledge graph&lt;/h1&gt;
&lt;p&gt;The idea of a &lt;a href=&#34;https://twitter.com/search?q=%22personal%20knowledge%20graph%22&amp;amp;f=live&#34;&gt;personal knowledge graph&lt;/a&gt; is hot lately. A curated set of over 1,700 favorite bookmarks sounds like an excellent addition to one. You can export diigo bookmarks to CSV, so I did that and used &lt;a href=&#34;../tarql&#34;&gt;tarql&lt;/a&gt; to convert all of my links and their associated metadata to 7,641 triples.&lt;/p&gt;
&lt;p&gt;In diigo you can assign multiple tags to your bookmarks; I apparently assigned four different tags to 21 of them. When you do this, diigo outputs a given bookmark&amp;rsquo;s multiple tags as a single comma-separated list in the CSV output, so that the &amp;ldquo;tags&amp;rdquo; value for my bookmark for &lt;a href=&#34;https://devio.wordpress.com/2012/02/19/user-interface-design/&#34;&gt;this cartoon about user interfaces&lt;/a&gt; is &amp;ldquo;Apple,Google,Comic,userInterface&amp;rdquo;. Luckily, tarql supports Jena’s &lt;code&gt;apf:strSplit&lt;/code&gt; function, making it easy to split that list and create four different &lt;code&gt;ex:tag&lt;/code&gt; triples for that bookmark. (That &lt;code&gt;ex:&lt;/code&gt; namespace was just for the quick and dirty test. For a real application I would use &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#subject&#34;&gt;dc:subject&lt;/a&gt; for the tags.) After I added this function to my conversion query, the conversion created 583 more triples than before.&lt;/p&gt;
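&lt;p&gt;The splitting step can be sketched in a few lines of Python. This is just a stand-in for what &lt;code&gt;apf:strSplit&lt;/code&gt; does inside the tarql query, not the query itself, and the triples here are plain tuples rather than RDF:&lt;/p&gt;

```python
# Sketch of the tag-splitting step: one comma-separated diigo "tags"
# value becomes one tag triple per tag. ex:tag is the same
# quick-and-dirty test predicate mentioned above.
def tag_triples(bookmark_uri, tags_field):
    """Return one (subject, predicate, object) triple per tag."""
    return [(bookmark_uri, "ex:tag", tag.strip())
            for tag in tags_field.split(",")]

for triple in tag_triples(
        "https://devio.wordpress.com/2012/02/19/user-interface-design/",
        "Apple,Google,Comic,userInterface"):
    print(triple)
```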
&lt;p&gt;How did I find out that I had assigned four different tags to 21 bookmarks? With a SPARQL query after doing the conversion, of course. With this data in RDF I can look for patterns and connect those tags to keywords in a taxonomy if I want. I can also connect up the data to other datasets. For example, the query that drives tarql could convert tags to URIs from standardized subject collections. I had tagged two bookmarks as &lt;a href=&#34;https://www.diigo.com/user/bobducharme?query=%23F1&#34;&gt;F1&lt;/a&gt;; these could be converted to the URI &lt;code&gt;&amp;lt;http://cv.iptc.org/newscodes/subjectcode/15039001&amp;gt;&lt;/code&gt;, the &lt;a href=&#34;https://iptc.org/&#34;&gt;IPTC&lt;/a&gt; subject code for Formula One racing, for easier connection to other content out there. There are all kinds of possibilities.&lt;/p&gt;
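&lt;p&gt;The grouping logic of that counting query can be sketched in plain Python (the bookmark IDs below are hypothetical; the real version is a SPARQL query that groups the &lt;code&gt;ex:tag&lt;/code&gt; triples by bookmark and filters on the count):&lt;/p&gt;

```python
# Sketch of the "which bookmarks have exactly n tags?" check, the way
# a SPARQL GROUP BY ?bookmark / HAVING (COUNT(?tag) = n) query would
# do it. The (bookmark, tag) pairs below are hypothetical.
from collections import Counter

def bookmarks_with_n_tags(pairs, n):
    counts = Counter(bookmark for bookmark, _tag in pairs)
    return sorted(b for b, c in counts.items() if c == n)

pairs = [("b1", "Apple"), ("b1", "Google"), ("b1", "Comic"),
         ("b1", "userInterface"), ("b2", "sparql"), ("b2", "rdf")]
print(bookmarks_with_n_tags(pairs, 4))  # prints ['b1']
```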
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1476978518504480773&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">sparql</category>
      
    </item>
    
    <item>
      <title>My command line OWL processor</title>
      <link>https://www.bobdc.com/blog/cmdlineowl/</link>
      <pubDate>Sun, 21 Nov 2021 12:30:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/cmdlineowl/</guid>
      
      
<description><div>With most of the credit going to Ivan Herman.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/CharlieChristian.jpg&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;260pt&#34; alt=&#34;Charlie Christian&#34;/&gt;
&lt;p&gt;I recently &lt;a href=&#34;https://twitter.com/bobdc/status/1446514759562666005&#34;&gt;asked on Twitter&lt;/a&gt; about the availability of command line OWL processors. I got some leads, but most would have required a little coding or integration work on my part. I decided that a &lt;a href=&#34;https://www.bobdc.com/blog/driving-hadoop-data-integratio/&#34;&gt;small project&lt;/a&gt; that I did with the OWL-RL Python library a few years ago gave me a head start on just creating my own OWL command line processor in Python. It was pretty easy.&lt;/p&gt;
&lt;p&gt;My goal was something that would read RDF files, do inferencing, and output any triples created by the inferencing. The heavy lifting is done by the &lt;a href=&#34;https://pypi.org/project/owlrl/&#34;&gt;OWL-RL library&lt;/a&gt;, which builds on the classic &lt;a href=&#34;https://github.com/RDFLib/rdflib&#34;&gt;RDFLib&lt;/a&gt; Python library. The OWL-RL library was originally written by &lt;a href=&#34;http://www.ivan-herman.net/&#34;&gt;Ivan Herman&lt;/a&gt; and is now maintained by &lt;a href=&#34;https://pypi.org/user/ashleysommer/&#34;&gt;Ashley Sommer&lt;/a&gt; and &lt;a href=&#34;https://pypi.org/user/ncar/&#34;&gt;Nicholas Car&lt;/a&gt;. (As you would guess from its name, this library implements the &lt;a href=&#34;https://www.w3.org/TR/owl2-profiles/#OWL_2_RL&#34;&gt;rule-based OWL profile&lt;/a&gt; known as OWL RL.) My script is short and simple enough that instead of putting it on github I&amp;rsquo;ve just pasted it below.&lt;/p&gt;
&lt;h2 id=&#34;testing-it&#34;&gt;Testing it&lt;/h2&gt;
&lt;p&gt;In my recent blog posting &lt;a href=&#34;../dontneedowl/&#34;&gt;You probably don&amp;rsquo;t need OWL&lt;/a&gt;, I wrote about an inferencing use case:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, in &lt;a href=&#34;../trying-out-blazegraph/&#34;&gt;Trying Out Blazegraph&lt;/a&gt; (which only supports bits of OWL), I showed a dataset that had triples about various chairs and desks being located in various rooms, as well as triples about which rooms were in which buildings, but nothing about which furniture was in which buildings (or for that matter, what counted as furniture). I then used the RDFS &lt;code&gt;rdfs:subClassOf&lt;/code&gt; property to declare that &lt;code&gt;dm:Chair&lt;/code&gt; and &lt;code&gt;dm:Desk&lt;/code&gt; were subclasses of  &lt;code&gt;dm:Furniture&lt;/code&gt;, and I also declared that my &lt;code&gt;dm:locatedIn&lt;/code&gt; property was an &lt;code&gt;owl:TransitiveProperty&lt;/code&gt;. With these additional modeling triples, a SPARQL query to an OWL processor that understood &lt;code&gt;rdfs:subClassOf&lt;/code&gt; and &lt;code&gt;owl:TransitiveProperty&lt;/code&gt; could then list which furniture was in which building. This little bit of OWL actually added some semantics to the model as well, because it tells us—and OWL processors—a little about the &amp;ldquo;meaning&amp;rdquo; of &lt;code&gt;dm:locatedIn&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To try this example with my new command line processor, I didn&amp;rsquo;t even need to use SPARQL. I just stored the &amp;ldquo;Trying Out Blazegraph&amp;rdquo; sample data in a file called &lt;code&gt;chairsAndTables.ttl&lt;/code&gt; and fed it to my script like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;owl-rl-inferencing.py chairsAndTables.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here are the first three triples of the output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://learningsparql.com/ns/data#chair15&amp;gt; a ns2:Furniture, ns1:Thing ;
    ns2:locatedIn &amp;lt;http://learningsparql.com/ns/data#building100&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It inferred that chair 15 is an instance of the &lt;code&gt;Furniture&lt;/code&gt; class (and of the &lt;code&gt;Thing&lt;/code&gt; class) and that it&amp;rsquo;s in building 100. It also output triples about what buildings all the other chairs and tables were in, so I counted this as a successful test.&lt;/p&gt;
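&lt;p&gt;What the reasoner does with &lt;code&gt;owl:TransitiveProperty&lt;/code&gt; amounts to computing a transitive closure, which can be sketched in plain Python (the pairs below are hypothetical stand-ins mirroring the chairs-and-buildings data; a real OWL RL engine applies many more rules than this one):&lt;/p&gt;

```python
# Sketch of the owl:TransitiveProperty rule: keep joining locatedIn
# pairs until no new pairs appear.
def transitive_closure(pairs):
    closure = set(pairs)
    while True:
        new = {(a, c) for a, b in closure
                      for b2, c in closure if b == b2}
        if new.issubset(closure):
            return closure
        closure.update(new)

asserted = {("chair15", "room101"), ("room101", "building100")}
print(transitive_closure(asserted) - asserted)  # prints {('chair15', 'building100')}
```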
&lt;p&gt;For another test, I was especially happy to see the script do the inferencing I expected from one particular example in my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;. Example dataset &lt;a href=&#34;http://www.learningsparql.com/2ndeditionexamples/ex424.ttl&#34;&gt;&lt;code&gt;ex424.ttl&lt;/code&gt;&lt;/a&gt; lists the name, instrument played, and birth state of six musicians without saying that any is a member of any class. Here are two examples:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;d:m2 rdfs:label &amp;#34;Charlie Christian&amp;#34; ;
     dm:plays d:Guitar ;
     dm:stateOfBirth d:TX .

d:m4 rdfs:label &amp;#34;Kim Gordon&amp;#34; ;
     dm:plays d:Bass ;
     dm:stateOfBirth d:NY .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It also includes the following &lt;a href=&#34;https://www.cs.vu.nl/~guus/public/owl-restrictions/&#34;&gt;restriction class&lt;/a&gt; definitions, which specify conditions that qualify an instance as a member of the classes &lt;code&gt;Guitarist&lt;/code&gt;, &lt;code&gt;Texan&lt;/code&gt;, and &lt;code&gt;TexasGuitarPlayer&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dm:Guitarist
   owl:equivalentClass
           [ rdf:type owl:Restriction ;
             owl:hasValue d:Guitar ;
             owl:onProperty dm:plays
           ] .

dm:Texan
   owl:equivalentClass
           [ rdf:type owl:Restriction ;
             owl:hasValue d:TX ;
             owl:onProperty dm:stateOfBirth
           ] .

dm:TexasGuitarPlayer
   owl:equivalentClass
        [ rdf:type owl:Class ;
          owl:intersectionOf (dm:Texan dm:Guitarist)
        ] .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To test my script&amp;rsquo;s ability to read different serializations, I split up &lt;code&gt;ex424.ttl&lt;/code&gt; into &lt;code&gt;ex424a.ttl&lt;/code&gt;, &lt;code&gt;ex424b.nt&lt;/code&gt;, and &lt;code&gt;ex424c.rdf&lt;/code&gt; before feeding them to the script like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;owl-rl-inferencing.py ex424a.ttl ex424b.nt ex424c.rdf 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The output included the following triples, so we know that it inferred that Charlie Christian was an instance of all three classes:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://learningsparql.com/ns/data#m2&amp;gt; a
        &amp;lt;http://learningsparql.com/ns/demo#Guitarist&amp;gt;,
        &amp;lt;http://learningsparql.com/ns/demo#Texan&amp;gt;,
        &amp;lt;http://learningsparql.com/ns/demo#TexasGuitarPlayer&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It did not infer that resource &lt;code&gt;m4&lt;/code&gt;, New York bassist Kim Gordon, was in any of those classes. It did infer that Texas piano player Red Garland was a &lt;code&gt;Texan&lt;/code&gt;, but not a &lt;code&gt;Guitarist&lt;/code&gt; or a &lt;code&gt;TexasGuitarPlayer&lt;/code&gt;, and it inferred that native Californian Bonnie Raitt was a &lt;code&gt;Guitarist&lt;/code&gt; but not a member of the other two classes.&lt;/p&gt;
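&lt;p&gt;The logic behind those inferences can be sketched in plain Python (the dictionaries below are hypothetical stand-ins for the RDF data; a real OWL processor derives all of this from the restriction definitions themselves):&lt;/p&gt;

```python
# Sketch of the restriction-class inferences: each owl:hasValue
# restriction checks one property value, and owl:intersectionOf
# requires membership in every listed class.
musicians = {
    "m2": {"plays": "Guitar", "stateOfBirth": "TX"},   # Charlie Christian
    "m4": {"plays": "Bass", "stateOfBirth": "NY"},     # Kim Gordon
}

def classes_for(musician):
    inferred = set()
    if musician["plays"] == "Guitar":        # hasValue d:Guitar on dm:plays
        inferred.add("Guitarist")
    if musician["stateOfBirth"] == "TX":     # hasValue d:TX on dm:stateOfBirth
        inferred.add("Texan")
    if {"Guitarist", "Texan"}.issubset(inferred):  # intersectionOf
        inferred.add("TexasGuitarPlayer")
    return inferred

print(classes_for(musicians["m2"]))
```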
&lt;h2 id=&#34;combining-this-with-other-tools&#34;&gt;Combining this with other tools&lt;/h2&gt;
&lt;p&gt;The inferred triples may need some management after they&amp;rsquo;re materialized. If chair 15 gets moved from room 101 in building 100 to room 201 in building 200, we don&amp;rsquo;t want that inferred triple about it being in building 100 hanging around any more. Named graphs can help here, as I described in &lt;a href=&#34;../materializing&#34;&gt;Living in a materialized world: Managing inferenced triples with named graphs&lt;/a&gt;. That post shows how RDFLib lets you &lt;a href=&#34;../pipelining-sparql-queries-in-m/&#34;&gt;pipeline&lt;/a&gt; a series of queries and updates, letting you combine simple and complex operations into sophisticated applications. The ability to do OWL inferencing can contribute a lot to these pipelines.&lt;/p&gt;
&lt;p&gt;Without taking advantage of RDFLib&amp;rsquo;s pipelining ability at the Python code level, you can do some pipelining right from your operating system command line, sending the output of my &lt;code&gt;owl-rl-inferencing.py&lt;/code&gt; script to an Apache Jena tool such as &lt;a href=&#34;../jenagems/#riot&#34;&gt;riot&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Either way, I hope the script is useful to someone. Let me know!&lt;/p&gt;
&lt;h2 id=&#34;the-code&#34;&gt;The code&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#!/usr/bin/env python3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# owl-rl-inferencing.py: read RDF files provided as command line&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# arguments, do OWL RL inferencing, and output any new triples&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# resulting from that.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; sys
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; rdflib
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; owlrl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; len(sys&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;argv) &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;&lt;/span&gt;  &lt;span style=&#34;color:#ae81ff&#34;&gt;2&lt;/span&gt;:  &lt;span style=&#34;color:#75715e&#34;&gt;# print directions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Read RDF files, perform inferencing, and output the new triples.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Enter one or more .ttl, .nt, and .rdf filenames as arguments.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    sys&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;exit()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;inputGraph &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; rdflib&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;Graph()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;graphToExpand &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; rdflib&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;Graph()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Read the files. arg 0 is the script name, so don&amp;#39;t parse that as RDF.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; filename &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; sys&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;argv[&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;:]:   
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; filename&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;endswith(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;.ttl&amp;#34;&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;       inputGraph&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;parse(filename, format&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;turtle&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;elif&lt;/span&gt; filename&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;endswith(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;.nt&amp;#34;&lt;/span&gt;):       
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;       inputGraph&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;parse(filename, format&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;nt&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;elif&lt;/span&gt; filename&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;endswith(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;.rdf&amp;#34;&lt;/span&gt;):       
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;       inputGraph&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;parse(filename, format&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;xml&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;else&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;# Filename &amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; filename &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34; doesn&amp;#39;t end with .ttl, .nt, or .rdf.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Copy the input graph so that we can diff to identify new triples later.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; s, p, o &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; inputGraph:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    graphToExpand&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;add((s,p,o))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Do the inferencing. See&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# https://owl-rl.readthedocs.io/en/latest/stubs/owlrl.DeductiveClosure.html#owlrl.DeductiveClosure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# for other owlrl.* choices.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;owlrl&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;DeductiveClosure(owlrl&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;OWLRL_Semantics)&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;expand(graphToExpand)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;newTriples &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; graphToExpand &lt;span style=&#34;color:#f92672&#34;&gt;-&lt;/span&gt; inputGraph  &lt;span style=&#34;color:#75715e&#34;&gt;# How cool is that? &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Output Turtle comments reporting on graph sizes&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(&lt;span style=&#34;color:#e6db74&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;# inputGraph: &lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;len(inputGraph)&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt; triples&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(&lt;span style=&#34;color:#e6db74&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;# graphToExpand: &lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;len(graphToExpand)&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt; triples&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(&lt;span style=&#34;color:#e6db74&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;# newTriples: &lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;len(newTriples)&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt; triples&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Output the new triples (decode() is to omit &amp;#34;b&amp;#39;&amp;#39; &amp;#34; in output)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(newTriples&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;serialize(format&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;turtle&amp;#39;&lt;/span&gt;)&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;decode())
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1462476624406929408&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>You probably don&#39;t need OWL</title>
      <link>https://www.bobdc.com/blog/dontneedowl/</link>
      <pubDate>Sun, 17 Oct 2021 11:50:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/dontneedowl/</guid>
      
      
      <description><div>And if you do there&#39;s a simple way to prove it.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/w3cOwlLogo.png&#34; style=&#34;margin: 0px 30px 20px 80px;&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;260pt&#34; /&gt;
&lt;p&gt;During the course of my recent blog posts &lt;a href=&#34;../whatisrdf/&#34;&gt;What is RDF?&lt;/a&gt;, &lt;a href=&#34;../whatisrdfs/&#34;&gt;What is RDFS?&lt;/a&gt;, &lt;a href=&#34;../whatisrdfspart2/&#34;&gt;What else can I do with RDFS?&lt;/a&gt;, and &lt;a href=&#34;../skosibm/&#34;&gt;Taxonomy management with SKOS&lt;/a&gt;, some readers wondered if I would do a &amp;ldquo;What is OWL?&amp;rdquo; followup. I recommended to one inquirer that he read pages 39-41 and 263-269 of &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt;; I think that provides a pretty good introduction to OWL&amp;rsquo;s history and how to do some of the set-based logic that was an important part of its original intent.&lt;/p&gt;
&lt;p&gt;A recent blog entry by Irene Polikoff, a founder of my former employer TopQuadrant, has also inspired &lt;a href=&#34;https://twitter.com/jindrichmynarz/status/1401918042532110349&#34;&gt;a lot of conversation&lt;/a&gt; about when people should or shouldn&amp;rsquo;t use OWL. Her entry&amp;rsquo;s title is pretty categorical: &lt;a href=&#34;https://web.archive.org/web/20220702180448/https://www.topquadrant.com/owl-blog/&#34;&gt;Why I Don’t Use OWL Anymore&lt;/a&gt;. I think that bits of OWL can be more useful than she does, but still less useful than many people do. I&amp;rsquo;ll get to some examples below.&lt;/p&gt;
&lt;h1 id=&#34;data-modeling-use-rdfs&#34;&gt;Data modeling? Use RDFS&lt;/h1&gt;
&lt;p&gt;At its simplest level, data modeling is the identification and enumeration of the pieces of information that you want to keep track of and the relationships between them. A standards-based, machine-readable version of this enumeration is very valuable to application development. As I wrote in  &lt;a href=&#34;../whatisrdfs/&#34;&gt;What is RDFS?&lt;/a&gt; and &lt;a href=&#34;../whatisrdfspart2/&#34;&gt;What else can I do with RDFS?&lt;/a&gt;, RDFS can do that pretty well. It does an especially good job for &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt;, one of the great success stories of RDF-based technology, &lt;a href=&#34;../whatisrdfs#schemaorg&#34;&gt;as I described&lt;/a&gt; in the first of those two pieces.
You can go beyond RDFS to add information about your data&amp;rsquo;s structures and potential relationships in even more detail, but as we&amp;rsquo;ll see, machine-readable descriptions of this information won&amp;rsquo;t do you much good unless you have tools that will read these descriptions and use them to contribute value to your applications.&lt;/p&gt;
&lt;h1 id=&#34;defining-constraints-on-that-data-model-use-shacl&#34;&gt;Defining constraints on that data model? Use SHACL&lt;/h1&gt;
&lt;p&gt;OWL can go beyond RDFS to describe additional details about your classes and properties, but it can only rarely describe what counts as a valid instance of a class and what doesn&amp;rsquo;t. This has been a fundamental need of data processing for as long as people have been using data on computers: developers who write applications that use data don&amp;rsquo;t want to write lots of code to make sure that the data they read is what they&amp;rsquo;re really expecting. They want to assume that the processes that created that data already did this validation. SQL&amp;rsquo;s CREATE TABLE statements let you specify data types of and dependencies between table columns, not to mention which are required and which are optional; DTDs and later forms of schema do the same for XML.&lt;/p&gt;
&lt;p&gt;RDF never really had this until the W3C standard SHACL, as I described in &lt;a href=&#34;../validating-rdf-data-with-shacl/&#34;&gt;Validating RDF data with SHACL&lt;/a&gt;. Irene&amp;rsquo;s followup to her blog entry mentioned above is titled &lt;a href=&#34;https://www.topquadrant.com/shacl-blog/&#34;&gt;Why I Use SHACL For Defining Ontology Models&lt;/a&gt;, and it explains many of the advantages that SHACL brings. (She does write &amp;ldquo;I no longer used RDFS/OWL (besides declaring classes and subclasses)&amp;rdquo;, so she hasn&amp;rsquo;t completely replaced her usage of RDFS.)&lt;/p&gt;
&lt;h1 id=&#34;controlled-vocabulary-use-skos&#34;&gt;Controlled vocabulary? Use SKOS&lt;/h1&gt;
&lt;p&gt;Last month in  &lt;a href=&#34;../skosibm/&#34;&gt;Taxonomy management with SKOS&lt;/a&gt; I described how taxonomies and thesauri are controlled vocabularies that typically let you store metadata about the vocabulary terms, including their relationships to each other. You could picture a taxonomy or thesaurus as a potentially large collection of terms arranged in a tree in which lower levels of the tree describe subsets of the higher levels. If we want to represent this all in RDF, should we do it as OWL classes? I say: no. This is not a nail for that hammer.&lt;/p&gt;
&lt;p&gt;First of all, the lower levels of a taxonomy tree do not represent subsets of the higher levels. The tree&amp;rsquo;s nodes represent terms, not sets of things, and lower levels of the tree show more specific terms: for example, &amp;ldquo;collie&amp;rdquo; and &amp;ldquo;bulldog&amp;rdquo; as more specific versions of &amp;ldquo;dog&amp;rdquo; and &amp;ldquo;dog&amp;rdquo; as a more specific version of &amp;ldquo;mammal&amp;rdquo;. Heather Hedden, author of &lt;a href=&#34;https://www.hedden-information.com/accidental-taxonomist/&#34;&gt;the leading introduction to taxonomy development&lt;/a&gt;, summed it up nicely in her blog post &lt;a href=&#34;https://accidental-taxonomist.blogspot.com/2020/12/differing-definitions-of-ontologies.html&#34;&gt;Differing Definitions of Ontologies&lt;/a&gt;: &amp;ldquo;ontology structures are meant to model data, not to organize taxonomy concepts that could be either generic (common nouns) or named entities (proper nouns)&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;In a taxonomy, &amp;ldquo;Person broader than Employee&amp;rdquo; means that a book or other form of media about employees is also a work about persons. In an ontology, &amp;ldquo;Employee is a subclass of Person&amp;rdquo; lets you distinguish between properties that apply to all persons (family name, given name) and properties that apply to employees but not to persons in general (hire date, salary).&lt;/p&gt;
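&lt;p&gt;The difference is easy to see in Turtle. In this minimal sketch (the &lt;code&gt;v:&lt;/code&gt; namespace and URIs are invented for illustration), the two statements look superficially similar but mean very different things:&lt;/p&gt;
&lt;pre&gt;
@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix v:    &amp;lt;http://www.example.com/vocab/&gt; .

# Taxonomy view: the concept &#34;employee&#34; is a narrower term than &#34;person&#34;.
v:employee skos:broader v:person .

# Ontology view: every instance of the class Employee is also a Person.
v:Employee rdfs:subClassOf v:Person .
&lt;/pre&gt;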
&lt;p&gt;SKOS is itself an OWL ontology that defines a data model for storing controlled vocabularies and their metadata. It has commercial and open source support among popular vocabulary management tools. (Pinterest &lt;a href=&#34;https://arxiv.org/pdf/1907.02106.pdf&#34;&gt;developed their own&lt;/a&gt; ontology for taxonomy management, but it draws on SKOS.) SKOS is a W3C standard that is specialized for this particular job. SKOS vocabularies and OWL ontologies can use each other as input; a straightforward SPARQL query can often create one from the other, but keep their different purposes in mind. The traction that SKOS-based tools have achieved over the years is a powerful argument to use this standard for vocabulary management.&lt;/p&gt;
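&lt;p&gt;For example, here is a sketch of a SPARQL query that derives SKOS concepts from an OWL class hierarchy. (This is only a starting point; a real conversion would also need decisions about labels, concept schemes, and top concepts.)&lt;/p&gt;
&lt;pre&gt;
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX owl:  &amp;lt;http://www.w3.org/2002/07/owl#&gt;

# Turn each class into a concept and each subclass link into a broader link.
CONSTRUCT {
  ?class a skos:Concept ;
         skos:broader ?superclass .
}
WHERE {
  ?class a owl:Class ;
         rdfs:subClassOf ?superclass .
}
&lt;/pre&gt;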
&lt;h1 id=&#34;but-if-you-really-need-owl&#34;&gt;But if you really need OWL&amp;hellip;&lt;/h1&gt;
&lt;p&gt;If you really need OWL, prove it! Do something with your data and an OWL processor that would have been noticeably more difficult without that processor. This will demonstrate what value OWL brings to your data.&lt;/p&gt;
&lt;p&gt;For example, in &lt;a href=&#34;../trying-out-blazegraph/&#34;&gt;Trying Out Blazegraph&lt;/a&gt; (which only supports bits of OWL), I showed a dataset that had triples about various chairs and desks being located in various rooms, as well as triples about which rooms were in which buildings, but nothing about which furniture was in which buildings (or for that matter, what counted as furniture). I then used the RDFS &lt;code&gt;rdfs:subClassOf&lt;/code&gt; property to declare that &lt;code&gt;dm:Chair&lt;/code&gt; and &lt;code&gt;dm:Desk&lt;/code&gt; were subclasses of  &lt;code&gt;dm:Furniture&lt;/code&gt;, and I also declared that my &lt;code&gt;dm:locatedIn&lt;/code&gt; property was an &lt;code&gt;owl:TransitiveProperty&lt;/code&gt;. With these additional modeling triples, a SPARQL query to an OWL processor that understood &lt;code&gt;rdfs:subClassOf&lt;/code&gt; and &lt;code&gt;owl:TransitiveProperty&lt;/code&gt; could then list which furniture was in which building. This little bit of OWL actually added some semantics to the model as well, because it tells us—and OWL processors—a little about the &amp;ldquo;meaning&amp;rdquo; of &lt;code&gt;dm:locatedIn&lt;/code&gt;.&lt;/p&gt;
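&lt;p&gt;A sketch of that approach looks like this. (The URIs here are illustrative reconstructions, not the exact ones from that post.)&lt;/p&gt;
&lt;pre&gt;
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix owl:  &amp;lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix dm:   &amp;lt;http://www.example.com/dm/&gt; .

# A little modeling: two subclasses and one transitive property.
dm:Chair rdfs:subClassOf dm:Furniture .
dm:Desk  rdfs:subClassOf dm:Furniture .
dm:locatedIn a owl:TransitiveProperty .

# Some instance data.
dm:chair3 a dm:Chair ;
          dm:locatedIn dm:room12 .
dm:room12 dm:locatedIn dm:building2 .
dm:building2 a dm:Building .

# An inferencing processor can now answer &#34;which dm:Furniture instances
# are dm:locatedIn dm:building2?&#34; even though no triple says so directly.
&lt;/pre&gt;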
&lt;p&gt;That was pretty easy. I think it&amp;rsquo;s a good general rule that if you want to demonstrate the value of a certain technology, show something that you can do with it that would have been a lot more trouble, if not impossible, without it. A query about data that is relevant to many different businesses, such as employee or facility data, is a great way to do this. (I always thought that Protégé&amp;rsquo;s famed &lt;a href=&#34;https://protegeproject.github.io/protege/getting-started/#open-the-pizza-ontology&#34;&gt;pizza ontology&lt;/a&gt; was a little too cutesy of a demonstration domain—of course everyone likes pizza, but why not use a domain where there is an actual chance that people would use an ontology to manage the relevant data?)&lt;/p&gt;
&lt;p&gt;The most visible pushback that I saw to Irene&amp;rsquo;s blog posts about not using OWL was &lt;a href=&#34;https://triply.cc/blog/2021-08-why-we-use-owl&#34;&gt;Why We Use OWL Every Day At Triply&lt;/a&gt; from the Amsterdam-based company. Their explanations of OWL&amp;rsquo;s value focused on its role as human-readable documentation of modeling intentions, which is certainly valuable, but they &lt;a href=&#34;https://twitter.com/bobdc/status/1435627413812219906&#34;&gt;did not point to&lt;/a&gt; any usage of OWL as machine-readable modeling instructions when I asked.&lt;/p&gt;
&lt;p&gt;I am not done playing with OWL, and I still dream of making the following pin and wearing it to a conference where at least some of the attendees will get the joke:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/itsAnOWLThing.png&#34; class=&#34;centered&#34; width=&#34;200&#34; alt=&#34;It&#39;s an owl:Thing&#34;/&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1449766266873466888&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/owl">OWL</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>Taxonomy management with SKOS</title>
      <link>https://www.bobdc.com/blog/skosibm/</link>
      <pubDate>Sun, 19 Sep 2021 11:47:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/skosibm/</guid>
      
      
      <description><div>Republishing an IBM developer works article.</div><div>&lt;p&gt;&lt;em&gt;In 2011, IBM developerWorks published an article that I wrote titled &amp;ldquo;Improve your taxonomy management using the W3C SKOS standard.&amp;rdquo; (They have always loved those &amp;ldquo;Get Better at This Thing&amp;rdquo; titles.) Several years later they took it (and a ton of other developerWorks content) down. I have republished it here as background for recent discussions about when OWL is appropriate to use and when it isn&amp;rsquo;t; more on that next month. I didn&amp;rsquo;t change anything but added a few comments in &lt;em&gt;&lt;strong&gt;bold italics&lt;/strong&gt;&lt;/em&gt; about my 2021 perspective on some of these issues. See also &lt;a href=&#34;../../categories/skos&#34;&gt;several other pieces&lt;/a&gt; that I&amp;rsquo;ve written about SKOS over the years.&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;sidebar&#34;&gt;
&lt;h1 id=&#34;controlled-vocabularies-taxonomies-and-thesauri-whats-the-difference&#34;&gt;Controlled vocabularies, taxonomies, and thesauri: What&amp;rsquo;s the difference?&lt;/h1&gt;
&lt;p&gt;A &lt;em&gt;controlled vocabulary&lt;/em&gt; is a list of terms that define the potential values for something—for example, the possible subjects of a set of news stories or the official two-letter abbreviations of the states of the United States. A &lt;em&gt;taxonomy&lt;/em&gt; is a controlled vocabulary arranged in a hierarchy to show relationships between terms. The list of possible subjects for a set of news stories is most likely this kind of controlled vocabulary, with &amp;ldquo;Acquisition&amp;rdquo; and &amp;ldquo;Executive hiring&amp;rdquo; as children of the hierarchy&amp;rsquo;s &amp;ldquo;Business news&amp;rdquo; node.&lt;/p&gt;
&lt;p&gt;These relationships are metadata that indicate, for example, that a story about an executive being hired is a type of business news story or that a dachshund in an animal taxonomy is a kind of dog. When a taxonomy-aware image search engine returns a picture tagged &amp;ldquo;dachshund&amp;rdquo; to someone searching for &amp;ldquo;dog&amp;rdquo; pictures, it takes advantage of this metadata to help the searcher get greater value from the image collection.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;thesaurus&lt;/em&gt; is typically a taxonomy with additional metadata about each term such as alternative terms (for example, &amp;ldquo;mutt&amp;rdquo; for &amp;ldquo;dog&amp;rdquo;) and pointers to related terms that might or might not be in the same hierarchy (for example, &amp;ldquo;doghouse&amp;rdquo; for &amp;ldquo;dog&amp;rdquo;). People who specialize in the creation and maintenance of thesauri are usually known as taxonomists, perhaps because the term &amp;ldquo;thesaurist&amp;rdquo; sounds too much like &amp;ldquo;thesaurus&amp;rdquo; or maybe because &amp;ldquo;thesaurus&amp;rdquo; reminds people from outside the metadata management field too much of books of synonym lists used as writing aids, such as &lt;em&gt;Roget&amp;rsquo;s Thesaurus&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Whether you manage a taxonomy to integrate business processes in an enterprise, to manage keywords assigned to content for more intelligent retrieval, or to manage the menus of a large web-based retail site, you might find that your taxonomy management tool stores data in a proprietary binary format that doesn&amp;rsquo;t migrate well to other tools. A standards-based way to represent this data can help you integrate vocabulary data from multiple sources while reducing your dependence on proprietary tools.&lt;/p&gt;
&lt;p&gt;The Simple Knowledge Organization System (SKOS) is a W3C standard that builds on the W3C&amp;rsquo;s RDF, RDFS, and OWL specifications to provide a standard model for representing controlled vocabularies. You can use SKOS for flat lists and also for more structured controlled vocabularies with additional metadata such as taxonomies and thesauri.&lt;/p&gt;
&lt;p&gt;Because SKOS is defined using the RDF model, it&amp;rsquo;s easy to read and create data in an XML format. &lt;em&gt;&lt;strong&gt;(Not so much encouraging RDF/XML here as namechecking a standard that readers unfamiliar with RDF would have heard of.)&lt;/strong&gt;&lt;/em&gt; Growing tool support for SKOS means that using it requires no knowledge of the related W3C standards, but the more you know, the more you can take advantage of the extensibility of SKOS to include customized metadata in your vocabularies that might not be part of the SKOS standard.&lt;/p&gt;
&lt;p&gt;As organizations ranging from The New York Times to NASA to the UN Food and Agriculture Organization make their subject listings available in SKOS, this standard also makes it easier to reuse well-known vocabularies and to create connections between your content and other content that uses the same vocabularies.&lt;/p&gt;
&lt;h1 id=&#34;terms-versus-concepts-and-labels&#34;&gt;Terms versus concepts and labels&lt;/h1&gt;
&lt;p&gt;Vocabulary management systems have always been structured to manage terms, along with relationships between terms and other metadata. SKOS takes a higher-level view of what you manage, which makes internationalization much easier. For example, an older system might store the term &amp;ldquo;dog&amp;rdquo; with a broader term of &amp;ldquo;mammal&amp;rdquo; and narrower terms of &amp;ldquo;dachshund&amp;rdquo; or &amp;ldquo;bulldog.&amp;rdquo; The term &amp;ldquo;mutt&amp;rdquo; would be a separate term, and &amp;ldquo;dog&amp;rdquo; would have what taxonomists call a use-for relationship to &amp;ldquo;mutt&amp;rdquo;—if someone assigning keywords to photographs wants to assign the word &amp;ldquo;mutt&amp;rdquo; to a picture of Lassie, the vocabulary application would direct them to use the word &amp;ldquo;dog&amp;rdquo; instead. The term &amp;ldquo;perro&amp;rdquo; could have a relationship &amp;ldquo;Spanish&amp;rdquo; to the term &amp;ldquo;dog,&amp;rdquo; and &amp;ldquo;chien&amp;rdquo; could have the relationship &amp;ldquo;French&amp;rdquo; to it, but a Spanish user wondering about the French term for &amp;ldquo;perro&amp;rdquo; might not be able to look this up without knowing that they&amp;rsquo;re connected by their relationship to the English term.&lt;/p&gt;
&lt;p&gt;Another disadvantage of this arrangement is that the terms &amp;ldquo;mutt&amp;rdquo; and &amp;ldquo;perro&amp;rdquo; are as separate from &amp;ldquo;dog&amp;rdquo; as the term &amp;ldquo;cat&amp;rdquo; or &amp;ldquo;gato&amp;rdquo; (a Spanish term). Even though &amp;ldquo;mutt,&amp;rdquo; &amp;ldquo;dog,&amp;rdquo; and &amp;ldquo;perro&amp;rdquo; refer to the same thing, their relationships must be explicitly specified. Figure 1 displays these relationships in a diagram; solid-line arrows represent a &amp;ldquo;broader than&amp;rdquo; relationship (mammal to cat and dog; dog to bulldog and dachshund), and dotted-line arrows are labeled for the Spanish (&amp;ldquo;perro&amp;rdquo;) or French (&amp;ldquo;chien&amp;rdquo;) equivalents for &amp;ldquo;dog,&amp;rdquo; alternate terms in Spanish (&amp;ldquo;chucho&amp;rdquo;) and English (&amp;ldquo;mutt&amp;rdquo;) for &amp;ldquo;dog,&amp;rdquo; plus the Spanish (&amp;ldquo;gato&amp;rdquo;) for &amp;ldquo;cat.&amp;rdquo;&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/labels.jpg&#34; width=&#34;400&#34; class=&#34;centered&#34;  alt=&#34;Sample label relationships in a pre-SKOS taxonomy&#34;/&gt;
&lt;center&gt;&lt;i&gt;Figure 1. Sample label relationships in a pre-SKOS taxonomy&lt;/i&gt;&lt;/center&gt;&lt;br/&gt;
&lt;p&gt;With SKOS, you manage concepts that have different kinds of labels, and each label might have a language associated with it. The most important label is the preferred label, and SKOS allows each concept to have only one of these in each language. A single concept could have an English preferred label of &amp;ldquo;dog,&amp;rdquo; a Spanish preferred label of &amp;ldquo;perro,&amp;rdquo; and a French preferred label of &amp;ldquo;chien.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Another kind of label is the alternative label, which SKOS-based software might use to represent labels that are being tracked but not recommended. For example, the concept with an English preferred label of &amp;ldquo;dog&amp;rdquo; might have an English alternative label of &amp;ldquo;mutt&amp;rdquo; and a Spanish alternative label of &amp;ldquo;chucho.&amp;rdquo; Instead of being separate terms that must have their relationships explicitly typed, &amp;ldquo;dog,&amp;rdquo; &amp;ldquo;perro,&amp;rdquo; &amp;ldquo;chien,&amp;rdquo; &amp;ldquo;mutt,&amp;rdquo; and &amp;ldquo;chucho&amp;rdquo; all refer to the same concept, providing different information about that concept depending on the needs of each application. Figure 2 illustrates the information from Figure 1 rearranged as SKOS concepts, with fewer arrows and clearer relationships between the terms. (As with the earlier figure, solid-line arrows represent a &amp;ldquo;broader than&amp;rdquo; relationship.) The actual identifiers for each concept, which might be hidden under the covers by a vocabulary management application, are URIs.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/concepts.jpg&#34; width=&#34;550&#34; class=&#34;centered&#34;  alt=&#34;Sample concepts relationship in SKOS&#34;/&gt;
&lt;center&gt;&lt;i&gt;Figure 2. Sample concepts relationship in SKOS&lt;/i&gt;&lt;/center&gt;&lt;br/&gt;
&lt;p&gt;When you compare the two diagrams, you can see that in Figure 1, &amp;ldquo;perro&amp;rdquo; and &amp;ldquo;mutt&amp;rdquo; were just additional terms that &amp;ldquo;dog&amp;rdquo; pointed to, like &amp;ldquo;bulldog&amp;rdquo; and &amp;ldquo;dachshund,&amp;rdquo; but in Figure 2 you can see that &amp;ldquo;perro&amp;rdquo; and &amp;ldquo;mutt&amp;rdquo; refer to the same concept while &amp;ldquo;bulldog&amp;rdquo; and &amp;ldquo;dachshund&amp;rdquo; are different concepts.&lt;/p&gt;
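&lt;p&gt;In SKOS Turtle, part of the Figure 2 arrangement might look like this sketch (the &lt;code&gt;ex:&lt;/code&gt; concept URIs are invented for illustration):&lt;/p&gt;
&lt;pre&gt;
@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix ex:   &amp;lt;http://www.example.com/animals/&gt; .

# One concept, many labels; skos:prefLabel allows one per language.
ex:c104 a skos:Concept ;
        skos:prefLabel &#34;dog&#34;@en, &#34;perro&#34;@es, &#34;chien&#34;@fr ;
        skos:altLabel  &#34;mutt&#34;@en, &#34;chucho&#34;@es ;
        skos:broader   ex:c100 .    # the &#34;mammal&#34; concept

# &#34;bulldog&#34; is a different concept with its own URI.
ex:c105 a skos:Concept ;
        skos:prefLabel &#34;bulldog&#34;@en ;
        skos:broader   ex:c104 .
&lt;/pre&gt;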
&lt;p&gt;Concepts can have many kinds of relationships in SKOS besides &amp;ldquo;broader than.&amp;rdquo; The concept with an English preferred label of &amp;ldquo;dog&amp;rdquo; might have a &amp;ldquo;related&amp;rdquo; relationship with a &amp;ldquo;doghouse&amp;rdquo; concept in a different taxonomy. Because SKOS uses unique URIs as concept identifiers instead of the labels themselves, you can define relationships between a given concept and any concept in any accessible SKOS vocabulary in the world, even if it&amp;rsquo;s maintained by NASA or The New York Times.&lt;/p&gt;
&lt;p&gt;The UN Food and Agriculture Organization&amp;rsquo;s &lt;a href=&#34;http://www.fao.org/agrovoc/releases&#34;&gt;AGROVOC&lt;/a&gt; thesaurus for food-related domains such as fishing and farming must serve a truly international audience. A single AGROVOC concept can have preferred labels in over a dozen languages and even more alternative labels because there is no limit to the number of alternative labels you can specify for a given concept from each language. SKOS uses concepts with label properties to make multi-lingual tracking of terms much easier than one of the older, term-based approaches to organizing thesaurus data would, and this in turn makes communication between people from different cultures about food issues much easier.&lt;/p&gt;
&lt;h1 id=&#34;more-metadata&#34;&gt;More metadata&lt;/h1&gt;
&lt;p&gt;Along with the preferred and alternative labels and relationships between concepts described above, SKOS lets you store a term&amp;rsquo;s definition, scope notes, history notes, and a variety of other properties about each concept. Because SKOS is defined using the W3C&amp;rsquo;s OWL standard for specifying ontologies, it&amp;rsquo;s very easy to define additional properties that are specific to your industry or business and apply them to the concepts in your vocabularies.&lt;/p&gt;
&lt;p&gt;These properties can come from other data and metadata standards, such as the Dublin Core vocabulary, the Market Data Definition Language developed for the financial industry, or the Metadata Object Description Schema developed by the Library of Congress. They can also be properties that are specific to your company&amp;rsquo;s system and that no one else uses because they&amp;rsquo;re part of the added value for how you manage your information. For example, a pharmaceutical company might define a new &amp;ldquo;requires&amp;rdquo; relationship in an animal taxonomy to point to concepts in another taxonomy&amp;rsquo;s data about veterinary vaccines.&lt;/p&gt;
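&lt;p&gt;Defining such a property takes only a few triples. In this hypothetical sketch, the &lt;code&gt;pharma:requires&lt;/code&gt; property and all three namespaces are invented for illustration:&lt;/p&gt;
&lt;pre&gt;
@prefix rdf:     &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs:    &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix pharma:  &amp;lt;http://www.example.com/pharma/&gt; .
@prefix animals: &amp;lt;http://www.example.com/animals/&gt; .
@prefix vacc:    &amp;lt;http://www.example.com/vaccines/&gt; .

# Declare the custom property...
pharma:requires a rdf:Property ;
    rdfs:label &#34;requires&#34; .

# ...and use it to connect concepts from two different taxonomies.
animals:dog pharma:requires vacc:rabiesVaccine .
&lt;/pre&gt;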
&lt;p&gt;SKOS-based tools for editing and managing your vocabularies should understand that extensibility is part of this standard. Additional properties from outside of the SKOS specification should be part of their interface as you work with that data, showing up on the forms and reports along with the standardized SKOS properties.&lt;/p&gt;
&lt;h1 id=&#34;more-granular-metadata-skos-xl&#34;&gt;More granular metadata: SKOS-XL&lt;/h1&gt;
&lt;p&gt;Although the OWL language used to specify SKOS has certain crucial differences from object-oriented approaches to data modeling, it has one important thing in common: You define a data model by declaring classes, subclasses, and properties (or, to use the object-oriented term, attributes) of those classes. The SKOS ontology defines a Concept class, and preferred labels, alternative labels, and relationships to other concepts are modeled as properties of that class.&lt;/p&gt;
&lt;p&gt;You can assign all the metadata you want to a given concept, but SKOS provides no way to assign metadata to a specific label. What if you want to store data that describes the source of the label &amp;ldquo;chucho,&amp;rdquo; or when it was last edited, or who edited it?&lt;/p&gt;
&lt;p&gt;To accommodate this situation, the W3C published the SKOS Extension for Labels (SKOS-XL) specification, in which the values for a concept&amp;rsquo;s preferred, alternative, and other labels are not strings but members of a new Label class defined by the extension specification. Being instances of a class, these labels can have all the metadata you want to assign to them, which gives you a lot more flexibility.&lt;/p&gt;
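&lt;p&gt;In Turtle, a SKOS-XL label looks something like this sketch (the &lt;code&gt;ex:&lt;/code&gt; URIs and the metadata values are invented for illustration):&lt;/p&gt;
&lt;pre&gt;
@prefix skosxl:  &amp;lt;http://www.w3.org/2008/05/skos-xl#&gt; .
@prefix dcterms: &amp;lt;http://purl.org/dc/terms/&gt; .
@prefix ex:      &amp;lt;http://www.example.com/animals/&gt; .

ex:c104 skosxl:altLabel ex:label88 .

# The label is a resource of its own, so it can carry metadata.
ex:label88 a skosxl:Label ;
    skosxl:literalForm &#34;chucho&#34;@es ;
    dcterms:source     &#34;regional usage survey&#34; ;
    dcterms:modified   &#34;2011-03-14&#34; .
&lt;/pre&gt;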
&lt;h1 id=&#34;easier-metadata-integration&#34;&gt;Easier metadata integration&lt;/h1&gt;
&lt;p&gt;Earlier I mentioned that because SKOS uses unique URIs as concept identifiers, you can define a relationship between a given concept and any other SKOS-based concept whose URI ID you know, whether it&amp;rsquo;s in the same taxonomy as a given concept or in a different taxonomy published on the web by a separate company. This ability is also great for a situation that falls between these two extremes: when different groups within the same enterprise have their own vocabularies to manage, integration of these vocabularies into a single, centrally managed vocabulary can do more harm than good, because maintenance becomes more complex as the vocabulary grows and because the data must be revised to reach compromises between the needs of different groups. The marketing department and the repairs department might mean different things when they use the term &amp;ldquo;customer,&amp;rdquo; and they might have good reasons for doing so; forcing them both to use the same definition can reduce the vocabulary&amp;rsquo;s value for both of them.&lt;/p&gt;
&lt;p&gt;With SKOS, you can define relationships between concepts from different vocabularies. Because of this, well-defined concept relationship metadata gives you the hooks to use vocabularies from different departments together without forcing you to revise and combine them all into a monolithic single vocabulary that doesn&amp;rsquo;t fully meet any group&amp;rsquo;s needs. The relationships can be standard SKOS relationships such as &amp;ldquo;related&amp;rdquo; or &amp;ldquo;broader&amp;rdquo; (for example, you might say that the marketing department&amp;rsquo;s concept of &amp;ldquo;customer&amp;rdquo; is broader than the repair department&amp;rsquo;s), but again, you can define your own customized relationships as well.&lt;/p&gt;
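&lt;p&gt;For the &amp;ldquo;customer&amp;rdquo; example, a single mapping triple can do the job. (The department namespaces here are invented for illustration; SKOS also offers &lt;code&gt;skos:broadMatch&lt;/code&gt;, &lt;code&gt;skos:closeMatch&lt;/code&gt;, and other properties for mapping across vocabularies.)&lt;/p&gt;
&lt;pre&gt;
@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix mkt:  &amp;lt;http://www.example.com/marketing/&gt; .
@prefix rep:  &amp;lt;http://www.example.com/repairs/&gt; .

# Each department keeps its own &#34;customer&#34; concept; one mapping triple
# says that the repairs department&#39;s concept is the narrower of the two.
mkt:customer skos:narrowMatch rep:customer .
&lt;/pre&gt;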
&lt;h1 id=&#34;skos-and-the-semantic-web&#34;&gt;SKOS and the Semantic Web&lt;/h1&gt;
&lt;p&gt;When becoming interested in semantic technology, many worry that before they build their first application, they must learn the RDF data model, the various syntaxes for expressing it, the SPARQL query language, and how to model data with RDF schema and OWL. When you use a SKOS-based vocabulary manager, you most likely fill out forms and use typical user interface widgets to manage your data with no need to learn the base W3C standards that underlie SKOS, but if you choose to learn a little about them, you can get more out of your data. For example, you can use the SPARQL query language to ask questions that might not be part of your vocabulary management package, and as mentioned above, you can define new properties and even classes to keep track of more customized metadata.&lt;/p&gt;
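&lt;p&gt;For example, this sketch of a SPARQL query asks which concepts have an English preferred label but no Spanish one yet, a question that a vocabulary manager&amp;rsquo;s built-in reports might not cover:&lt;/p&gt;
&lt;pre&gt;
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&gt;

SELECT ?concept ?enLabel
WHERE {
  ?concept a skos:Concept ;
           skos:prefLabel ?enLabel .
  FILTER (lang(?enLabel) = &#34;en&#34;)
  # Keep only concepts with no Spanish preferred label.
  FILTER NOT EXISTS {
    ?concept skos:prefLabel ?esLabel .
    FILTER (lang(?esLabel) = &#34;es&#34;)
  }
}
&lt;/pre&gt;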
&lt;p&gt;You can also connect your data to a wider variety of data out there, whether it uses the SKOS ontology or not. The ability of the RDF data model to connect independently created data is what makes the Semantic Web a web, and the ability to combine datasets is an important payoff of this ability. For example, by making their SKOS-based subject header index freely available on the web, The New York Times lets other publishers use these subject headers for their own content, giving those publishers connections to related New York Times articles. More importantly, for The New York Times, it drives more traffic to their articles tagged with those subject headers.&lt;/p&gt;
&lt;p&gt;After you&amp;rsquo;ve added some properties to your SKOS data and run a few SPARQL queries against it, you can think about defining new ontologies apart from SKOS (or finding other existing standard ontologies besides SKOS to extend) and take greater and greater advantage of Semantic Web technologies.&lt;/p&gt;
&lt;h1 id=&#34;tools&#34;&gt;Tools&lt;/h1&gt;
&lt;p&gt;Any RDF tool that can edit data guided by a particular ontology can load the SKOS OWL ontology and let you create SKOS concepts and populate their properties with the appropriate metadata. For management of vocabularies by staff with no RDF background, several tools are available:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;TopQuadrant&amp;rsquo;s Enterprise Vocabulary Net (EVN) is a commercial web-based collaborative system built around the SKOS data model for the management of controlled vocabularies across an enterprise. &lt;em&gt;&lt;strong&gt;This has since evolved into &lt;a href=&#34;https://www.topquadrant.com/products/topbraid-enterprise-data-governance/&#34;&gt;TopBraid EDG&lt;/a&gt;, which focuses on a broader set of Data Governance tasks. I was happy to see that all of the remaining tools in this list are still around ten years after I originally wrote this piece.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.poolparty.biz/&#34;&gt;PoolParty&lt;/a&gt; is a commercial thesaurus management and SKOS editor system that includes text mining and linked data capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;https://code.google.com/archive/p/skoseditor/&#34;&gt;SKOSed&lt;/a&gt; plug-in for the Protégé ontology editor lets you edit thesauri represented in SKOS. Both SKOSed and Protégé are open source.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://iqvoc.net/&#34;&gt;iQvoc&lt;/a&gt; is an open source tool for managing vocabularies that can import and export SKOS.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.vocabularyserver.com/&#34;&gt;TemaTres&lt;/a&gt; is an open source vocabulary manager that can output vocabulary data as SKOS files.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Import and export of SKOS by vocabulary management tools should eventually be as common as import and export of comma-separated values by spreadsheet programs. If you use a taxonomy management program that doesn&amp;rsquo;t support the standard, let its makers know that you want to see it.&lt;/p&gt;
&lt;p&gt;The RDF basis of SKOS also means that you can take advantage of RDF-aware application development tools and libraries to build SKOS editing systems yourself much more quickly than you can build a taxonomy management system where you had to define and implement all the data structures yourself.&lt;/p&gt;
&lt;h1 id=&#34;starting-small-and-scaling-up&#34;&gt;Starting small and scaling up&lt;/h1&gt;
&lt;p&gt;If you have one or more large, complex controlled vocabularies to manage, converting it all to use a new format can be a big, expensive job. Converting a subset to SKOS as a pilot project can be much easier, and if you convert a few different subsets and then eventually connect them by defining the appropriate concept relationships across vocabulary boundaries, you start to see the benefit of SKOS in your own organization. With the growing support of both free and commercial software for the standard, SKOS is definitely worth further investigation by anyone who manages vocabularies and is interested in the benefits of standardization.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1439621667832086531&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
    </item>
    
    <item>
      <title>What else can I do with RDFS?</title>
      <link>https://www.bobdc.com/blog/whatisrdfspart2/</link>
      <pubDate>Fri, 20 Aug 2021 11:01:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/whatisrdfspart2/</guid>
      
      
      <description><div>Schemas can be a little fancier and even more useful with no need for OWL.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/schemapic.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;180px&#34; /&gt;
&lt;p&gt;In my last blog entry, &lt;a href=&#34;../whatisrdfs&#34;&gt;What is RDFS?&lt;/a&gt;, I described how the RDF Schema language lets you define RDF vocabularies, with the definitions themselves being RDF triples. We saw how simple class and property name definitions in a schema can, as machine-readable documentation for a dataset&amp;rsquo;s structure, provide greater interoperability for data and applications built around the same domain. Today we&amp;rsquo;ll look at how RDF schemas can store additional kinds of valuable information to add to what we saw in the sample schemas last time, and then we&amp;rsquo;ll look at some of the cool things that RDF schemas let you do.&lt;/p&gt;
&lt;h1 id=&#34;more-data-modeling&#34;&gt;More data modeling&lt;/h1&gt;
&lt;p&gt;When we use RDFS to define class and property names we can also define relationships between them. The following expands on the schema from last time to define relations between classes, between properties, and between classes and properties:&lt;/p&gt;
&lt;pre&gt;
@prefix rdf:     &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs:    &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt; . 
@prefix vcard:   &amp;lt;http://www.w3.org/2006/vcard/ns#&gt; .
@prefix emp:     &amp;lt;http://www.snee.com/schema/employees/&gt; .
@prefix ex:      &amp;lt;http://www.snee.com/example/&gt; .
@prefix dcterms: &amp;lt;http://purl.org/dc/terms/&gt; . 

emp:Person rdf:type rdfs:Class ;
          rdfs:label &#34;person&#34; . 

emp:Employee a rdfs:Class ; 
            &lt;b&gt;rdfs:subClassOf emp:Person ;&lt;/b&gt;
            rdfs:label &#34;employee&#34; . 

vcard:given-name  rdf:type rdf:Property ;
                  &lt;b&gt;rdfs:domain emp:Person ;&lt;/b&gt;
                  rdfs:label &#34;given name&#34;.

vcard:family-name rdf:type rdf:Property ;
                  &lt;b&gt;rdfs:domain emp:Person ;&lt;/b&gt;
                  rdfs:label &#34;family name&#34; ;
                  rdfs:label &#34;apellido&#34;@es . 

emp:hireDate a rdf:Property ;
            &lt;b&gt;rdfs:domain  emp:Employee ;&lt;/b&gt;
            rdfs:label   &#34;hire date&#34; ;
            rdfs:comment &#34;The first day an employee was on the payroll.&#34;  ;
            &lt;b&gt;rdfs:subPropertyOf dcterms:date . &lt;/b&gt;

emp:reportsTo a rdf:Property ; 
             &lt;b&gt;rdfs:domain emp:Employee ;
             rdfs:range  emp:Employee ;&lt;/b&gt;
             rdfs:label  &#34;reports to&#34; .
&lt;/pre&gt;
&lt;p&gt;The first thing that this schema has that the earlier one didn&amp;rsquo;t is a triple saying that &lt;code&gt;emp:Employee&lt;/code&gt; is a subclass of &lt;code&gt;emp:Person&lt;/code&gt;. If an inferencing parser saw that employees &lt;code&gt;ex:id2&lt;/code&gt; (Heidi Smith) and &lt;code&gt;ex:id3&lt;/code&gt; (Jane Berger) are instances of the &lt;code&gt;emp:Employee&lt;/code&gt; class, it would know that they were also instances of the &lt;code&gt;emp:Person&lt;/code&gt; class.&lt;/p&gt;
&lt;p&gt;Now that we know how to declare classes and indicate which is a subclass of another, we can build class hierarchies. These will be familiar to people who have used most modern programming languages. However, few if any of these programming languages let you also build property hierarchies. The schema above declares the &lt;code&gt;emp:hireDate&lt;/code&gt; property to be a subproperty of the popular &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/&#34;&gt;Dublin Core&lt;/a&gt; vocabulary&amp;rsquo;s &lt;code&gt;dcterms:date&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;What does this buy you? For one thing, a tool that generates a user interface for this human resources data might not recognize the &lt;code&gt;emp:hireDate&lt;/code&gt; property, but if it does the inferencing to find out that this property is a specialized version of the standard &lt;code&gt;dcterms:date&lt;/code&gt; one, it might know that a &lt;a href=&#34;https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/date&#34;&gt;date widget&lt;/a&gt; would be more appropriate to represent this field on an editing form than a plain text box.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/dublin_core_terms.ttl&#34;&gt;Turtle version&lt;/a&gt; of the RDFS schema for the Dublin Core DCMI Metadata Terms vocabulary includes nine triples with the predicate and object &lt;code&gt;rdfs:subPropertyOf &amp;lt;http://purl.org/dc/elements/1.1/date&amp;gt;&lt;/code&gt;. These show us that properties such as &lt;code&gt;dcterms:available&lt;/code&gt;, &lt;code&gt;dcterms:created&lt;/code&gt;, and &lt;code&gt;dcterms:dateAccepted&lt;/code&gt; are dates. You might guess that from a property named &amp;ldquo;dateAccepted&amp;rdquo;, but you wouldn&amp;rsquo;t know this about a &amp;ldquo;created&amp;rdquo; property without this machine-readable way to describe the semantics of that property. (I rarely use the term &amp;ldquo;semantics&amp;rdquo;, but when I do use it, I mean it.)&lt;/p&gt;
&lt;p&gt;The next new thing to note in this schema, now that we&amp;rsquo;ve seen how to define relationships between classes, and between properties, is how this schema defines a relationship between a property and a class. The first &lt;code&gt;rdfs:domain&lt;/code&gt; triple above associates the &lt;code&gt;vcard:given-name&lt;/code&gt; property with the &lt;code&gt;emp:Person&lt;/code&gt; class. (Remember that if &lt;code&gt;emp:Employee&lt;/code&gt; is a subclass of  &lt;code&gt;emp:Person&lt;/code&gt;, then this property is now associated with &lt;code&gt;emp:Employee&lt;/code&gt; as well.) Is there anything wrong with associating a property defined in a standard vocabulary with my own thing that I&amp;rsquo;m defining in my own vocabulary? Absolutely not; it&amp;rsquo;s actually a good thing, because it provides a standards-based context for the thing I&amp;rsquo;m defining for my own application.&lt;/p&gt;
&lt;p&gt;As the &lt;a href=&#34;https://www.w3.org/TR/rdf-schema/#ch_domain&#34;&gt;W3C RDFS Recommendation&lt;/a&gt; tells us, &amp;ldquo;&lt;code&gt;rdfs:domain&lt;/code&gt; is an instance of &lt;code&gt;rdf:Property&lt;/code&gt; that is used to state that any resource that has a given property is an instance of one or more classes&amp;rdquo;. Given this, my schema is saying that if an RDF resource has a &lt;code&gt;vcard:given-name&lt;/code&gt; property, then we can infer that that resource is an instance of &lt;code&gt;emp:Person&lt;/code&gt;. (If this leads to an inference that the office dog is a person, I should re-evaluate my class hierarchy and which properties are associated with which classes.)&lt;/p&gt;
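&lt;p&gt;Here&amp;rsquo;s a sketch of that inference in Turtle, using the class and property names from this schema:&lt;/p&gt;
&lt;pre&gt;
@prefix rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf:   &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix emp:   &amp;lt;http://www.snee.com/schema/employees/&amp;gt; .
@prefix ex:    &amp;lt;http://www.snee.com/example/&amp;gt; .

# In the schema:
vcard:given-name rdfs:domain emp:Person .

# In the instance data:
ex:id1 vcard:given-name &#34;Francis&#34; .

# What an RDFS inferencing engine can then add:
ex:id1 rdf:type emp:Person .
&lt;/pre&gt;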
&lt;p&gt;Sometimes we forget that RDFS and OWL were invented to enable this kind of inferencing across data found on the web. They were not invented to help us define data structures, but as I&amp;rsquo;ve shown, RDFS is handy to at least document them. Continuing with my user-interface-generation example, a system generating an edit form for an Employee instance would know from this schema&amp;rsquo;s &lt;code&gt;rdfs:domain&lt;/code&gt; triples that this editing form should include &lt;code&gt;vcard:given-name&lt;/code&gt;, &lt;code&gt;vcard:family-name&lt;/code&gt;, &lt;code&gt;emp:hireDate&lt;/code&gt;, and &lt;code&gt;emp:reportsTo&lt;/code&gt; fields. (And, as I mentioned last time, it should know that the form would be easier to read if these fields were labeled with the properties&amp;rsquo; &lt;code&gt;rdfs:label&lt;/code&gt; values and not the actual property names.)&lt;/p&gt;
&lt;p&gt;Software developers who are used to defining class hierarchies may be a bit confused by the relationship between classes and properties in RDFS. In standard object-oriented modeling, when you define a class, you define the properties used by that class, and some may be inherited from superclasses. In RDFS, you define classes and properties separately and then associate them, if you like, with the &lt;code&gt;rdfs:domain&lt;/code&gt; property. (The fact that properties can have their own hierarchies is something else that can take object-oriented developers some time to get accustomed to.)&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;rdfs:range&lt;/code&gt; value defined for the &lt;code&gt;emp:reportsTo&lt;/code&gt; property is another way to define a relationship between a class and a property. According to the RDFS Recommendation, it &amp;ldquo;is used to state that the values of a property are instances of one or more classes&amp;rdquo;. We saw that if &lt;code&gt;emp:reportsTo&lt;/code&gt; has an &lt;code&gt;rdfs:domain&lt;/code&gt; of &lt;code&gt;emp:Employee&lt;/code&gt;, then &amp;ldquo;X reports to Y&amp;rdquo; means that X is an &lt;code&gt;emp:Employee&lt;/code&gt;; if &lt;code&gt;emp:reportsTo&lt;/code&gt; has an &lt;code&gt;rdfs:range&lt;/code&gt; of &lt;code&gt;emp:Employee&lt;/code&gt;, we can infer from the same statement that Y is an &lt;code&gt;emp:Employee&lt;/code&gt;—that is, that an employee reports to another employee. Even if we don&amp;rsquo;t plan on doing this kind of inferencing with &lt;code&gt;rdfs:range&lt;/code&gt;, it&amp;rsquo;s still useful to indicate what kind of values to expect for a given property. For example, the application generating a form to edit employee data could generate a drop-down list of employee names on the &amp;ldquo;reports to&amp;rdquo; part of the form instead of a plain text box.&lt;/p&gt;
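&lt;p&gt;Here&amp;rsquo;s a sketch of those two declarations and what they let us infer from a single &amp;ldquo;reports to&amp;rdquo; triple:&lt;/p&gt;
&lt;pre&gt;
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix emp:  &amp;lt;http://www.snee.com/schema/employees/&amp;gt; .
@prefix ex:   &amp;lt;http://www.snee.com/example/&amp;gt; .

# In the schema:
emp:reportsTo rdfs:domain emp:Employee ;
              rdfs:range  emp:Employee .

# In the instance data:
ex:id3 emp:reportsTo ex:id2 .

# Inferred from the domain: ex:id3 a emp:Employee .
# Inferred from the range:  ex:id2 a emp:Employee .
&lt;/pre&gt;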
&lt;h1 id=&#34;more-support-of-interesting-applications&#34;&gt;More support of interesting applications&lt;/h1&gt;
&lt;p&gt;I&amp;rsquo;ve written other blog entries about how I applied the ideas described above to various useful projects.&lt;/p&gt;
&lt;h2 id=&#34;drive-a-mobile-user-interface&#34;&gt;Drive a (mobile!) user interface&lt;/h2&gt;
&lt;p&gt;In &lt;a href=&#34;../using-sparql-queries-from-nati/&#34;&gt;Using SPARQL queries from native Android apps&lt;/a&gt; I describe how I used the MIT App Inventor toolkit to create a native Android app that lets the user pick a clothing product and a color and a size for that product before sending the selected information off to a web server. The choices of products, colors, and sizes are all stored in an RDFS model; screenshots from my phone show how the list of color choices expanded after I added a new one to the RDF schema that stored the model. This blog entry also describes how additions to the RDFS model (with no changes to the Android app) would enable support in the app for other spoken languages besides English.&lt;/p&gt;
&lt;h2 id=&#34;data-integration&#34;&gt;Data integration&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;http://www.bobdc.com/blog/driving-hadoop-data-integratio/&#34;&gt;Driving Hadoop data integration with standards-based models instead of code&lt;/a&gt; describes a data integration demo that combines data from Microsoft&amp;rsquo;s SQL Server Northwind sample database with data from Oracle&amp;rsquo;s sample HR database. These databases both describe human resources data but use different names (for example, &lt;code&gt;LastName&lt;/code&gt; and &lt;code&gt;last_name&lt;/code&gt;) for similar properties. Using Python and a SPARQL query, the demo collects data from the two sources and represents them using a common vocabulary. The system uses an RDFS model both to define this vocabulary and&amp;mdash;this part is crucial&amp;mdash;to define the mapping from the two data sources to this vocabulary using the &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; property mentioned above. After I expanded the RDFS model to cover more of the input, running the demo again integrated more of the source data; no changes to the Python scripts were necessary.&lt;/p&gt;
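&lt;p&gt;The heart of that mapping is a pair of &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; triples. Here&amp;rsquo;s a sketch, with hypothetical namespaces standing in for the two data sources and a hypothetical &lt;code&gt;emp:lastName&lt;/code&gt; property standing in for the common vocabulary:&lt;/p&gt;
&lt;pre&gt;
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix emp:  &amp;lt;http://www.snee.com/schema/employees/&amp;gt; .
# Hypothetical namespaces for the Northwind and Oracle HR data:
@prefix nw:   &amp;lt;http://www.snee.com/northwind/&amp;gt; .
@prefix ora:  &amp;lt;http://www.snee.com/oraclehr/&amp;gt; .

nw:LastName   rdfs:subPropertyOf emp:lastName .
ora:last_name rdfs:subPropertyOf emp:lastName .

# With inferencing, a query for emp:lastName values
# then finds data from both sources.
&lt;/pre&gt;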
&lt;p&gt;All the ideas I&amp;rsquo;ve described about this project so far are pretty simple. The novelty of the article was that I set it all up to happen on a &lt;a href=&#34;../hadoop&#34;&gt;Hadoop&lt;/a&gt; cluster distributed across multiple systems, because that was especially hot at the time.&lt;/p&gt;
&lt;p&gt;Because this article was written to accompany something I did for IBM Data Magazine, it doesn&amp;rsquo;t assume familiarity with RDF as much as other entries on my blog, so if you&amp;rsquo;re new to RDF that might be helpful.&lt;/p&gt;
&lt;h2 id=&#34;transform-data-with-partial-schemas&#34;&gt;Transform data with partial schemas&lt;/h2&gt;
&lt;p&gt;My more recent blog entry &lt;a href=&#34;../partialschemas&#34;&gt;Transforming data with inferencing and (partial!) schemas&lt;/a&gt; describes how, if you have a big mess of more data than you need, an RDF schema for the subset of that data that you actually want can be very useful. This is especially true when you use inferencing to transform the data. I&amp;rsquo;ll quote the whole first paragraph of that blog posting here:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I originally planned to title this “Partial schemas!” but as I assembled the example I realized that in addition to demonstrating the value of partial, incrementally-built schemas, the steps shown below also show how inferencing with schemas can implement transformations that are very useful in data integration. In the right situations this can be even better than SPARQL, because instead of using code—whether procedural or declarative—the transformation is driven by the data model itself.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here&amp;rsquo;s another paragraph from after the piece walks through the demo:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This idea of letting the data and its schema evolve in a more flexible manner is especially great for data integration projects. My example here started off with a (somewhat) big mess of RDF; if you&amp;rsquo;re working with more than one RDF dataset—maybe with some converted from other formats such as JSON or relational databases—then the use of RDFS to identify little subsets of those datasets &lt;em&gt;and to specify relationships between components of those subsets&lt;/em&gt; can help your knowledge graph and the applications that use it become useful a lot sooner.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Again, you&amp;rsquo;ll see many of the techniques outlined in today&amp;rsquo;s blog post put to good use in that project.&lt;/p&gt;
&lt;h2 id=&#34;a-bit-more-useful-background&#34;&gt;A bit more useful background&lt;/h2&gt;
&lt;p&gt;When using certain standards, it&amp;rsquo;s easy to assume that the standard itself is a long batch of technical jargon. The W3C RDF Schema Recommendation is not very long and actually quite readable, as I wrote in &lt;a href=&#34;http://www.bobdc.com/blog/rdfs-the-primary-document/&#34;&gt;RDFS: The primary document&lt;/a&gt;, so I recommend it. The &lt;a href=&#34;https://en.wikipedia.org/wiki/RDF_Schema&#34;&gt;RDF Schema Wikipedia page&lt;/a&gt; also nicely summarizes what RDFS offers and what kinds of things you can do with it.&lt;/p&gt;
&lt;p&gt;I have been referring to inferencing quite casually, although my &amp;ldquo;Data integration&amp;rdquo; and &amp;ldquo;Transform data with partial schemas&amp;rdquo; examples do go into more detail about actually executing that. You may also find &lt;a href=&#34;../materializing&#34;&gt;Living in a materialized world&lt;/a&gt; useful; this covers the potential role and mechanics of RDFS inferencing.&lt;/p&gt;
&lt;p&gt;And, &lt;a href=&#34;../jenagems&#34;&gt;Hidden gems included with Jena’s command line utilities&lt;/a&gt; describes how an open source multi-platform Apache Jena tool can perform RDFS inferencing for you.&lt;/p&gt;
&lt;p&gt;Let me know how you end up using RDFS! There is a lot of potential there that has gone untapped for too long.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1428737402185584645&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://creativecommons.org/licenses/by/2.0/&#34;&gt;CC BY 2.0&lt;/a&gt; &lt;a href=&#34;https://www.flickr.com/photos/h_duncan/50259072881/in/photolist-2jzdTnk-7HmJg6-7HqDcb-2kjuRxL-jpPNH9-6tX2xu-FLV2Gk-P8tcKE-6tSUk8-6tX3is-jvFuFV-joAz5Y-jpLVtu-4n7jAD-2cegzFj-6tX3g7-6tSUap-7HqDhy-6PBVjR-6tX3jA-6tSUg8-jrh79d-3VFfGk-98uTat-5Rf3kN-67kHPt-6tkLQB-5PQECt-jvv58E-2Bf8o5-6tX2uh-5CRngn-7nTViq-6tSU2X-aooPVM-6tX2Eo-8VHpaL-d4uS5u-6tSTNc-6tX2yQ-6tX2wC-6rJh81-ieejJg-6tSU5M-a9cwu-6T3cGR-6TY6HT-4JaYY9-6tSU1X-7HmHWr&#34;&gt;photo&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/photos/h_duncan/&#34;&gt;Howard Duncan&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>What is RDFS?</title>
      <link>https://www.bobdc.com/blog/whatisrdfs/</link>
      <pubDate>Sun, 25 Jul 2021 11:55:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/whatisrdfs/</guid>
      
      
      <description><div>And how much can a simple schema do for you?</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/schemapic.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedPicture&#34; width=&#34;180px&#34; /&gt;
&lt;p&gt;RDFS, or RDF Schema, is a &lt;a href=&#34;https://www.w3.org/TR/rdf-schema/&#34;&gt;W3C standard&lt;/a&gt; specialized vocabulary for describing RDF vocabularies and data models. Before I discuss it further, though, I&amp;rsquo;d like to explain why the use of standardized, specialized vocabularies (whether RDFS itself or a vocabulary that someone uses RDFS to describe) can be useful beyond the advantages of sharing a vocabulary with others for easier interoperability.&lt;/p&gt;
&lt;p&gt;Last month, in &lt;a href=&#34;../whatisrdf/&#34;&gt;What is RDF?&lt;/a&gt;, my example dataset included triples whose predicates came from the W3C standard &lt;a href=&#34;https://www.w3.org/TR/vcard-rdf/&#34;&gt;vCard business card&lt;/a&gt; ontology. It also included triples from a namespace that I had created myself with my own domain name. Certain kinds of RDF applications go through data and, when they find predicates that use a specialized vocabulary designed for such applications, they execute special tasks designed around that vocabulary. For example, GeoSPARQL applications that find predicates from the &lt;code&gt;http://www.opengis.net/def/function/geosparql/&lt;/code&gt; namespace can perform geospatial math that answers questions such as &amp;ldquo;what museums are within a mile of New York&amp;rsquo;s Museum of Modern Art?&amp;rdquo;, as I described in &lt;a href=&#34;../geosparqlgraphdb/&#34;&gt;GeoSPARQL queries on OSM Data in GraphDB&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The use of RDF does not require any schemas. However, the commercial and open source tools that can understand the RDFS vocabulary (by which I mean the RDFS vocabulary itself, not necessarily the ones you define with it) make it easier to build user interfaces for RDF-based applications, to integrate data from disparate datasets, and more. Before we get there, though, let&amp;rsquo;s look at an example of an RDF schema and some data that uses it.&lt;/p&gt;
&lt;p&gt;The following RDFS schema uses the Turtle syntax to describe a few classes and properties.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Employee schema version 1
# Pound sign lets you add comments to Turtle.
@prefix rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; . 
@prefix rdf:   &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; . 
@prefix vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix emp:   &amp;lt;http://www.snee.com/schema/employees/&amp;gt; .

emp:Person   rdf:type rdfs:Class .
emp:Employee a        rdfs:Class .

vcard:given-name  a rdf:Property .
vcard:family-name a rdf:Property .
emp:hireDate      a rdf:Property .
emp:reportsTo     a rdf:Property .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The first thing to note is that the schema itself consists of triples, here using the Turtle syntax to describe a few RDF structures. This means that you can use SPARQL and other RDF tools to work with the schema itself and with collections of schemas.&lt;/p&gt;
&lt;p&gt;The second thing to note is how simple a schema can be—in this case, just six triples saying &amp;ldquo;Here are some classes and properties to potentially use&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;rdf:type&lt;/code&gt; predicate means &amp;ldquo;is an instance of the following class&amp;rdquo;, so the first  triple above says that &lt;code&gt;emp:Person&lt;/code&gt; is itself a class. (Below we&amp;rsquo;ll see how to create instances of &lt;code&gt;emp:Person&lt;/code&gt;.) This schema&amp;rsquo;s next triple says that &lt;code&gt;emp:Employee&lt;/code&gt; is also a class. Instead of the &lt;code&gt;rdf:type&lt;/code&gt; predicate, that line uses the shortcut &amp;quot; a &amp;quot;. This means the same thing, but with a syntax that brings the triple closer to the English expression &amp;ldquo;&lt;code&gt;emp:Employee&lt;/code&gt; is a class&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;The remaining four triples in that example list some available properties. I copied two from the vCard vocabulary and made up two myself.&lt;/p&gt;
&lt;h2 id=&#34;using-the-schema&#34;&gt;Using the schema&lt;/h2&gt;
&lt;p&gt;The following instance data uses the classes and properties declared above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix emp:   &amp;lt;http://www.snee.com/schema/employees/&amp;gt; .
@prefix ex:    &amp;lt;http://www.snee.com/example/&amp;gt; .

ex:id1 a emp:Person ; 
       vcard:given-name  &amp;#34;Francis&amp;#34; ;
       vcard:family-name &amp;#34;Jones&amp;#34; .

ex:id2 a emp:Employee ;
       vcard:given-name  &amp;#34;Heidi&amp;#34; ;
       vcard:family-name &amp;#34;Smith&amp;#34; ;
       emp:hireDate      &amp;#34;2015-01-13&amp;#34; .

ex:id3 a emp:Employee ; 
       vcard:given-name  &amp;#34;Jane&amp;#34; ;
       vcard:family-name &amp;#34;Berger&amp;#34; ;
       emp:reportsTo     ex:id2 . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;These triples use another bit of Turtle syntax that I didn&amp;rsquo;t cover last month: a semicolon means &amp;ldquo;the next triple has the same subject as the last one&amp;rdquo;. For example, the first three lines after the prefix declarations in this sample data say that resource &lt;code&gt;sn:id1&lt;/code&gt; is an instance of the class Person, has a given name of Francis, and a family name of Jones.&lt;/p&gt;
&lt;p&gt;The schema above doesn&amp;rsquo;t say much, but it&amp;rsquo;s already at least as useful as a list of the columns in a relational table. Someone who has this schema and is working with this data knows what property names to use if they want to query the data, add to it, or delete from it. They also know what the potential classes are and can query for instances of those classes. All of these abilities are a big help if multiple people are going to create interoperable data and applications.&lt;/p&gt;
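&lt;p&gt;For example, someone who has this schema knows enough to write a SPARQL query like this sketch, which lists each &lt;code&gt;emp:Employee&lt;/code&gt; instance along with its hire date when one is present:&lt;/p&gt;
&lt;pre&gt;
PREFIX emp: &amp;lt;http://www.snee.com/schema/employees/&amp;gt;

SELECT ?employee ?hireDate
WHERE {
  ?employee a emp:Employee .
  OPTIONAL { ?employee emp:hireDate ?hireDate }
}
&lt;/pre&gt;
&lt;p&gt;(The &lt;code&gt;OPTIONAL&lt;/code&gt; keyword keeps employees such as &lt;code&gt;ex:id3&lt;/code&gt;, which has no &lt;code&gt;emp:hireDate&lt;/code&gt; value, in the results.)&lt;/p&gt;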
&lt;h2 id=&#34;adding-to-the-schema&#34;&gt;Adding to the schema&lt;/h2&gt;
&lt;p&gt;The next version of the same schema goes a little further by providing more information about the classes and properties:&lt;/p&gt;
&lt;pre&gt;
# Employee schema version 2
@prefix rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt; . 
@prefix rdf:   &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; . 
@prefix vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&gt; .
@prefix emp:   &amp;lt;http://www.snee.com/schema/employees/&gt; .

emp:Person rdf:type rdfs:Class ;
           &lt;b&gt;rdfs:label &#34;person&#34;&lt;/b&gt; . 

emp:Employee a rdfs:Class ; 
             &lt;b&gt;rdfs:label &#34;employee&#34; ;
             rdfs:comment &#34;A full-time, non-contractor employee.&#34; &lt;/b&gt;.

vcard:given-name  rdf:type rdf:Property ;
                  &lt;b&gt;rdfs:label &#34;given name&#34;&lt;/b&gt;.

vcard:family-name rdf:type rdf:Property ;
                  &lt;b&gt;rdfs:label &#34;family name&#34; ;
                   rdfs:label &#34;apellido&#34;@es &lt;/b&gt;. 

emp:hireDate a rdf:Property ;
             &lt;b&gt;rdfs:label   &#34;hire date&#34; ;
             rdfs:comment &#34;The first day an employee was on the payroll.&#34;&lt;/b&gt; .

emp:reportsTo a rdf:Property ; 
              &lt;b&gt;rdfs:label  &#34;reports to&#34;&lt;/b&gt; .
&lt;/pre&gt;
&lt;p&gt;This version includes &lt;code&gt;rdfs:comment&lt;/code&gt; and &lt;code&gt;rdfs:label&lt;/code&gt; properties. The former function as documentation for the things they&amp;rsquo;re describing. They should provide clarity as to exactly what the described resource means, like the &lt;code&gt;rdfs:comment&lt;/code&gt; value for the &lt;code&gt;emp:Employee&lt;/code&gt; resource: &amp;ldquo;A full-time, non-contractor employee.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;rdfs:label&lt;/code&gt; property provides a human-readable name for the resource being described. This is especially helpful for reports and applications that use this data. For example, if your application will display a form where people can edit data about employees, it would be difficult for these end users to read the form if it labeled its fields with actual property names such as &amp;ldquo;vcard:given-name&amp;rdquo; and  &amp;ldquo;emp:hireDate&amp;rdquo;. On the other hand, you shouldn&amp;rsquo;t hard-code more readable form field names like &amp;ldquo;hire date&amp;rdquo; and &amp;ldquo;family name&amp;rdquo; in your application code, either.&lt;/p&gt;
&lt;p&gt;For some real &lt;a href=&#34;https://martinfowler.com/bliki/ModelDrivenSoftwareDevelopment.html&#34;&gt;model-driven development&lt;/a&gt; you want to set it up so that as your model (as encoded by the schema) evolves the application automatically adapts to this evolution wherever possible. Providing display names as part of the model helps move your application toward this goal. An application that uses the revised version of my sample schema can use &lt;code&gt;rdfs:label&lt;/code&gt; values such as &amp;ldquo;family name&amp;rdquo; and &amp;ldquo;given name&amp;rdquo; to provide much more readable form field labels.&lt;/p&gt;
&lt;p&gt;RDF (and hence RDFS) also lets you add language tags to literal values. If you add multiple &lt;code&gt;rdfs:label&lt;/code&gt; values to an RDF resource and you tag each of these values according to its language, then the model-driven development described above can extend to the generation of forms in different languages for different users. In the second version of my schema the resource &lt;code&gt;vcard:family-name&lt;/code&gt; has labels in both English and Spanish. (A future version of the schema should have Spanish labels for the other classes and properties as well.) You can even include language codes for &lt;a href=&#34;http://learningsparql.com/2ndeditionexamples/ex037.ttl&#34;&gt;country-specific&lt;/a&gt; versions of terms so that a given form could be displayed in American English, British English, Castilian Spanish, Mexican Spanish, and more, all based on data in the schema.&lt;/p&gt;
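&lt;p&gt;A form generator can then ask the schema itself for a field label in the user&amp;rsquo;s language. Here&amp;rsquo;s a sketch of such a lookup in SPARQL:&lt;/p&gt;
&lt;pre&gt;
PREFIX rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt;

SELECT ?label
WHERE {
  vcard:family-name rdfs:label ?label .
  FILTER (lang(?label) = &#34;es&#34;)
}
&lt;/pre&gt;
&lt;p&gt;Run against the second version of the schema above, this should return the Spanish label &amp;ldquo;apellido&amp;rdquo;.&lt;/p&gt;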
&lt;p&gt;Remember that while I&amp;rsquo;m using &lt;code&gt;rdfs:label&lt;/code&gt; and &lt;code&gt;rdfs:comment&lt;/code&gt; values in an RDFS schema here, you can also use them in any RDF you like. For example:&lt;/p&gt;
&lt;pre&gt;
@prefix rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt; . 
@prefix vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&gt; .
@prefix emp:   &amp;lt;http://www.snee.com/schema/employees/&gt; .
@prefix ex:    &amp;lt;http://www.snee.com/example/&gt; .

ex:id3 a emp:Employee ;
       vcard:given-name &#34;Jane&#34; ;
       vcard:family-name &#34;Berger&#34; ;
       rdfs:label &#34;Jane Berger&#34; ;
       &lt;b&gt;rdfs:comment &#34;&#34;&#34;Jane has taken the sales department from being only her
                    and an assistant to the ten-person team we have today.&#34;&#34;&#34;&lt;/b&gt; .
&lt;/pre&gt;
&lt;p&gt;(This &lt;code&gt;rdfs:comment&lt;/code&gt; value here is shown as a &lt;a href=&#34;https://www.w3.org/TR/turtle/#grammar-production-STRING_LITERAL_LONG_QUOTE&#34;&gt;long literal&lt;/a&gt;, which encloses the values in triple quotation marks so that the value can include carriage returns.) Similarly, you can add language tags to any RDF literal values you want—not just RDFS schemas.&lt;/p&gt;
&lt;h1 id=&#34;schemaorg&#34;&gt;schema.org&lt;/h1&gt;
&lt;p&gt;In my next blog entry I&amp;rsquo;ll describe some fancier modeling that you can do with RDFS and how it can help applications such as data integration and even a mobile application. I&amp;rsquo;ll also mention (as I have &lt;a href=&#34;../partialschemas/&#34;&gt;before&lt;/a&gt;) how, in the debate over schema-driven software development versus schemaless development, the use of partial schemas can give you the best of both worlds. (&lt;a href=&#34;../whatisrdf/&#34;&gt;Last month&lt;/a&gt; I promised a few of those things for this blog entry, but here I wanted to emphasize the value of RDFS&amp;rsquo;s most basic constructs.)&lt;/p&gt;
&lt;p&gt;Meanwhile, take a look at the RDFS schema for &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt;. From the &lt;a href=&#34;https://schema.org/docs/developers.html#defs&#34;&gt;Vocabulary Definition Files&lt;/a&gt; section of the page &lt;a href=&#34;https://schema.org/docs/developers.html&#34;&gt;Schema.org for Developers&lt;/a&gt; you can pick which variation you want, in which serialization; I would pick the &lt;a href=&#34;https://schema.org/version/latest/schemaorg-current-https.ttl&#34;&gt;Turtle serialization&lt;/a&gt; to see how the schema demonstrates what I&amp;rsquo;ve been describing here.&lt;/p&gt;
&lt;p&gt;You should recognize a lot of the Turtle version of the schema.org schema, because it&amp;rsquo;s mostly declarations of classes and properties with &lt;code&gt;rdfs:label&lt;/code&gt;  values and descriptive &lt;code&gt;rdfs:comment&lt;/code&gt; values. Schema.org provides an excellent role model for RDFS development—all without any OWL! Fifteen years ago I had a &lt;a href=&#34;../rdfs-without-rdfowl&#34;&gt;difficult time finding&lt;/a&gt; an example of RDFS being used without any OWL mixed in, and I think Schema.org has been a real inspiration since then.&lt;/p&gt;
&lt;p&gt;From now on, when you see a given set of RDF terms being used, ask &amp;ldquo;where can I find a schema documenting it?&amp;rdquo; And, if you find a schema (or OWL ontology) describing a model, ask &amp;ldquo;where can I see sample data that follows this schema?&amp;rdquo; (Schema.org sample data tends to be in JSON-LD, but you can &lt;a href=&#34;../jenagems/#riot&#34;&gt;convert&lt;/a&gt; it to Turtle easily enough.)&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;em&gt;Comments? Reply to &lt;a href=&#34;https://twitter.com/bobdc/status/1419329037005205504&#34;&gt;my tweet&lt;/a&gt; announcing this blog entry.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://creativecommons.org/licenses/by/2.0/&#34;&gt;CC BY 2.0&lt;/a&gt; &lt;a href=&#34;https://www.flickr.com/photos/h_duncan/50259072881/in/photolist-2jzdTnk-7HmJg6-7HqDcb-2kjuRxL-jpPNH9-6tX2xu-FLV2Gk-P8tcKE-6tSUk8-6tX3is-jvFuFV-joAz5Y-jpLVtu-4n7jAD-2cegzFj-6tX3g7-6tSUap-7HqDhy-6PBVjR-6tX3jA-6tSUg8-jrh79d-3VFfGk-98uTat-5Rf3kN-67kHPt-6tkLQB-5PQECt-jvv58E-2Bf8o5-6tX2uh-5CRngn-7nTViq-6tSU2X-aooPVM-6tX2Eo-8VHpaL-d4uS5u-6tSTNc-6tX2yQ-6tX2wC-6rJh81-ieejJg-6tSU5M-a9cwu-6T3cGR-6TY6HT-4JaYY9-6tSU1X-7HmHWr&#34;&gt;photo&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/photos/h_duncan/&#34;&gt;Howard Duncan&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
    </item>
    
    <item>
      <title>What is RDF?</title>
      <link>https://www.bobdc.com/blog/whatisrdf/</link>
      <pubDate>Sun, 27 Jun 2021 13:20:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/whatisrdf/</guid>
      
      
      <description><div>What can this simple standardized model do for you?</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/rdflogo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; /&gt;
&lt;p&gt;&lt;em&gt;I have usually assumed that people reading this blog already know what RDF is. After recent discussions with people coming to RDF from the Linked (Open) Data and Knowledge Graph worlds, I realized that it would be useful to have a simple explanation that I could point to. This builds on material from the first three minutes of my video &lt;a href=&#34;https://www.youtube.com/watch?v=FvGndkpa4K0&#34;&gt;SPARQL in 11 Minutes&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;RDF, or Resource Description Framework, is a &lt;a href=&#34;https://www.w3.org/RDF/&#34;&gt;W3C standard&lt;/a&gt; (along with HTML, CSS, and XML) for a simple, flexible data model. RDF lets you describe data using a collection of three-part statements that can say things like &amp;ldquo;employee 3 has a title of &amp;lsquo;Vice President&amp;rsquo;.&amp;rdquo; We call these three parts the subject, predicate, and object. You can think of them as an entity identifier, an attribute name, and an attribute value.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/emp3TitleVP.png&#34; class=&#34;centered&#34; width=&#34;400&#34; alt=&#34;sample triple: emp3 title Vice President&#34;/&gt;
&lt;p&gt;The subject and predicate are actually represented using URIs (Uniform Resource Identifiers) to make it absolutely clear what we&amp;rsquo;re talking about. URIs are similar to URLs (Uniform Resource Locators), and often look like them, but they&amp;rsquo;re not locators, or addresses; they&amp;rsquo;re just identifiers.&lt;/p&gt;
&lt;p&gt;The URIs in the following show that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we mean  employee 3 from a specific company&lt;/li&gt;
&lt;li&gt;we mean &amp;ldquo;title&amp;rdquo; in the sense of job title and not a label for a book, movie, or other creative work, because we&amp;rsquo;re using the URI for title defined by the W3C&amp;rsquo;s published version of the &lt;a href=&#34;https://www.w3.org/TR/vcard-rdf/&#34;&gt;vCard business card&lt;/a&gt; ontology&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/emp3TitleVPURIs.png&#34; class=&#34;centered&#34; width=&#34;500&#34; alt=&#34;sample triple with URIs&#34;/&gt;
&lt;p&gt;The object, or third part of a triple, can also be a URI:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/emp8reportsToemp3.png&#34; class=&#34;centered&#34; width=&#34;600&#34; alt=&#34;triple: emp8 reports to emp3&#34;/&gt;
&lt;p&gt;This way, the same resource can be the object of some triples and the subject of others, which lets you connect up triples into networks of data called graphs.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/emp3ReportsToemp8Graph.png&#34; class=&#34;centered&#34; width=&#34;400&#34; alt=&#34;graph with previous image&#39;s triple&#34;/&gt;
&lt;p&gt;RDF&amp;rsquo;s popular Turtle syntax often shortens the URIs by having an abbreviated prefix stand in for everything in the URI before the last part. This makes URIs simpler to read and write.&lt;/p&gt;
&lt;pre&gt;
@prefix &lt;span style=&#34;color:red&#34;&gt;vcard:&lt;/span&gt; &amp;lt;http://www.w3.org/2006/vcard/ns#&gt; .
@prefix &lt;span style=&#34;color:red&#34;&gt;sn:&lt;/span&gt;    &amp;lt;http://www.snee.com/hr/&gt; .

&lt;span style=&#34;color:red&#34;&gt;sn:&lt;/span&gt;emp3 &lt;span style=&#34;color:red&#34;&gt;vcard:&lt;/span&gt;title &#34;Vice President&#34; . 
&lt;/pre&gt;
&lt;p&gt;Just about any data can be represented as a collection of triples. For example, we can usually represent each entry of a table by using the row identifier as the subject, the column name as the predicate, and the value as the object.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/tableWithTriple.png&#34; class=&#34;centered&#34; width=&#34;600&#34; alt=&#34;triple in table&#34;/&gt;
&lt;p&gt;This can give us triples for every fact in the table.&lt;/p&gt;
&lt;pre&gt;
@prefix vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&gt; .
@prefix sn: &amp;lt;http://www.snee.com/hr/&gt; .

sn:emp1   vcard:given-name   &#34;Heidi&#34; .
sn:emp1   vcard:family-name   &#34;Smith&#34; .
sn:emp1   vcard:title   &#34;CEO&#34; .
sn:emp1   sn:hireDate   &#34;2015-01-13&#34; .
sn:emp1   sn:completedOrientation   &#34;2015-01-30&#34; .

sn:emp2   vcard:given-name   &#34;John&#34; .
sn:emp2   vcard:family-name   &#34;Smith&#34; .
sn:emp2   sn:hireDate   &#34;2015-01-28&#34; .
sn:emp2   vcard:title   &#34;Engineer&#34; .
sn:emp2   sn:completedOrientation   &#34;2015-01-30&#34; .
sn:emp2   sn:completedOrientation   &#34;2015-03-15&#34; .

sn:emp3   vcard:given-name   &#34;Francis&#34; .
sn:emp3   vcard:family-name   &#34;Jones&#34; .
sn:emp3   sn:hireDate   &#34;2015-02-13&#34; .
&lt;span style=&#34;color:red&#34;&gt;sn:emp3   vcard:title   &#34;Vice President&#34; .&lt;/span&gt;

sn:emp4   vcard:given-name   &#34;Jane&#34; .
sn:emp4   vcard:family-name   &#34;Berger&#34; .
sn:emp4   sn:hireDate   &#34;2015-03-10&#34; .
sn:emp4   vcard:title   &#34;Sales&#34; .
&lt;/pre&gt;
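&lt;p&gt;To make the row/column/value mapping concrete, here is a small sketch of my own (the &lt;code&gt;rowToTriples&lt;/code&gt; function is a hypothetical illustration, not part of any RDF tool): each cell of a row becomes one triple.&lt;/p&gt;

```javascript
// Hypothetical helper illustrating the mapping described above:
// subject = row identifier, predicate = column name, object = cell value.
function rowToTriples(rowId, row) {
  return Object.entries(row).map(
    ([column, value]) => `sn:${rowId}   ${column}   "${value}" .`
  );
}

console.log(rowToTriples("emp4", {
  "vcard:given-name": "Jane",
  "vcard:family-name": "Berger",
  "sn:hireDate": "2015-03-10",
  "vcard:title": "Sales"
}).join("\n"));
```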
&lt;p&gt;Some of the property names here come from the vcard standard vocabulary. For the properties not available in vcard or another standard vocabulary that I knew of, I made up my own property names using my own domain name.  Many other standardized vocabularies such as &lt;a href=&#34;https://schema.org/docs/schemas.html&#34;&gt;schema.org&lt;/a&gt;, &lt;a href=&#34;https://www.geonames.org/ontology/documentation.html&#34;&gt;geonames&lt;/a&gt;, and &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/&#34;&gt;Dublin Core&lt;/a&gt; provide URIs to help you make the exact sense of a term clear.  (As one example, I would have used Dublin Core if I wanted to use the term &amp;ldquo;title&amp;rdquo; to &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/title&#34;&gt;refer to a book&lt;/a&gt;.) RDF makes it easy to mix and match standard vocabularies and customizations.&lt;/p&gt;
&lt;p&gt;The data in the example above fits neatly into the table shown. Imagine that it was in a relational table and we wanted to add information about Heidi Smith&amp;rsquo;s university degree. With a relational table, we&amp;rsquo;d have to add a new column to the table—a structural change to the database itself that would probably require a database administrator. To do this with RDF, it&amp;rsquo;s just one more triple:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;sn:emp1 sn:degree &amp;#34;MFA University of Iowa 2015&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Imagine that a database administrator had added a &lt;code&gt;degree&lt;/code&gt; column to the relational table, but now Heidi has an additional degree to describe in the data. The degree column can only store one degree description, so to allow for employees having more than one degree in a relational database, the database administrator would probably remove the new &lt;code&gt;degree&lt;/code&gt; column from that table and then create one or more entirely new tables to track the relationship of employees to degrees. In RDF, it would be just one more triple:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;sn:emp1 sn:degree &amp;#34;MBA Wharton 2019&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;A triple object that is not a URI is known as a literal. In the examples we&amp;rsquo;ve seen so far, the literals are all strings, but they can be other data types.  They can be &lt;a href=&#34;https://www.w3.org/TR/xmlschema-2/&#34;&gt;XSD data types&lt;/a&gt; such as boolean, integer or float, and they can be data types that you define yourself:&lt;/p&gt;
&lt;pre&gt;
@prefix sn:  &amp;lt;http://www.snee.com/hr/&gt; .
@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&gt; .

sn:emp1 sn:startDate &#34;2021-03-04&#34;&lt;span style=&#34;color:red&#34;&gt;^^xsd:date&lt;/span&gt; . 
sn:emp1 sn:empCode   &#34;D1&#34;&lt;span style=&#34;color:red&#34;&gt;^^sn:myCustomDataType&lt;/span&gt; . 
&lt;/pre&gt;
&lt;h1 id=&#34;rdf-syntaxes&#34;&gt;RDF syntaxes&lt;/h1&gt;
&lt;p&gt;I mentioned earlier that RDF is a standardized data &lt;strong&gt;model&lt;/strong&gt;. There have been various syntaxes to write it down. The original was called &lt;a href=&#34;https://www.w3.org/TR/rdf-syntax-grammar/&#34;&gt;RDF/XML&lt;/a&gt;; XML was used because it was standardized and flexible, and also because one of the original RDF use cases was to add arbitrary metadata to web pages—the idea was that an additional block of XML would fit well into an HTML file&amp;rsquo;s &lt;code&gt;head&lt;/code&gt; element. As it turned out, using XML to represent arbitrary collections of relationships could get verbose and messy. Because of this, no one uses RDF/XML anymore, but unfortunately, in the early days, this particular syntax gave RDF itself a bad reputation. (My own theory is that the file naming convention of giving RDF/XML files an extension of &amp;ldquo;.rdf&amp;rdquo; made people think that that&amp;rsquo;s what RDF really was.)&lt;/p&gt;
&lt;p&gt;Now most people use Turtle, which is much simpler and also a &lt;a href=&#34;https://www.w3.org/TR/turtle/&#34;&gt;W3C standard&lt;/a&gt;. &lt;a href=&#34;https://en.wikipedia.org/wiki/Resource_Description_Framework#Serialization_formats&#34;&gt;Other syntaxes&lt;/a&gt; are available, including the increasingly popular JSON-LD. All the examples shown in this introduction use Turtle syntax.&lt;/p&gt;
&lt;h1 id=&#34;sparql&#34;&gt;SPARQL&lt;/h1&gt;
&lt;p&gt;SPARQL (&amp;ldquo;SPARQL Protocol and RDF Query Language&amp;rdquo;) is another W3C standard. The protocol part is usually only an issue for people writing programs that pass SPARQL queries back and forth between different machines.&lt;/p&gt;
&lt;p&gt;SPARQL queries typically use a Turtle-like syntax to describe patterns of what kinds of triples to retrieve from a dataset. The patterns often resemble Turtle triples but with variables serving as wildcards to add flexibility to the matching patterns and to store values that result from matches. The following query asks for the given name and family name  of everyone with a job title of &amp;ldquo;Vice President&amp;rdquo;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX  vcard: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt;
PREFIX  sn:    &amp;lt;http://www.snee.com/hr/&amp;gt;

SELECT ?givenName ?familyName
WHERE
  { ?employee vcard:title &amp;#34;Vice President&amp;#34; .
    ?employee vcard:given-name  ?givenName .
    ?employee vcard:family-name ?familyName .
  }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can see more examples of simple SPARQL queries in the video &lt;a href=&#34;https://www.youtube.com/watch?v=FvGndkpa4K0&#34;&gt;SPARQL in 11 Minutes&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;triplestores&#34;&gt;Triplestores&lt;/h1&gt;
&lt;p&gt;A triplestore is a database manager for RDF triples. A wide choice of open source and commercial triplestores is available, some of which can store billions of triples. They typically offer both web-based graphical user interfaces and programmatic ways to add, edit, and retrieve data.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;P&amp;rdquo; for &amp;ldquo;Protocol&amp;rdquo; in &amp;ldquo;SPARQL&amp;rdquo; is the basis for some of the programmatic interfaces. This is yet another example of how tools for working with RDF are all based on open, published standards and supported by a broad range of implementations.&lt;/p&gt;
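&lt;p&gt;As a rough sketch of what that protocol amounts to (the endpoint URL below is hypothetical, and this is my own illustration rather than any particular triplestore&amp;rsquo;s API): a query travels as an ordinary HTTP request whose &lt;code&gt;query&lt;/code&gt; parameter carries the query text, with a content-negotiation header asking for the results in a particular format.&lt;/p&gt;

```javascript
// A minimal sketch of the HTTP side of the SPARQL Protocol.
// The endpoint URL passed in below is hypothetical.
function buildSparqlRequest(endpoint, query) {
  return {
    // GET form of the protocol: the query text goes in a URL parameter
    url: endpoint + "?query=" + encodeURIComponent(query),
    // ask for the results as SPARQL Query Results JSON
    headers: { Accept: "application/sparql-results+json" }
  };
}

const req = buildSparqlRequest(
  "http://localhost:3030/hr/sparql", // hypothetical local endpoint
  "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"
);
console.log(req.url);
```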
&lt;h1 id=&#34;data-integration&#34;&gt;Data Integration&lt;/h1&gt;
&lt;p&gt;The second sentence of the &lt;a href=&#34;https://www.w3.org/RDF/&#34;&gt;W3C RDF Overview&lt;/a&gt; page tells us that &amp;ldquo;RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed&amp;rdquo;. At the simplest level, you can integrate two different RDF datasets by just concatenating the files together, assuming that both use a syntax such as Turtle or &lt;a href=&#34;https://www.w3.org/TR/n-triples/&#34;&gt;N-Triples&lt;/a&gt;. Loading multiple datasets into the same dataset of a triplestore, whether those datasets use the same syntax or not, is also easy and popular.&lt;/p&gt;
&lt;p&gt;This ease of data integration has been a big driver in RDF&amp;rsquo;s success as people convert data from various other formats and models to RDF in order to easily use the combination. (In an upcoming &amp;ldquo;What is RDFS?&amp;rdquo; blog entry I will describe how RDF Schema can define optional models that make this even easier.)&lt;/p&gt;
&lt;h1 id=&#34;the-semantic-web&#34;&gt;The Semantic Web&lt;/h1&gt;
&lt;p&gt;In the early days of RDF, the idea of sharing machine-readable data across the World Wide Web as the &amp;ldquo;Semantic Web&amp;rdquo; was popular to the point of being overhyped because it sometimes got mixed up in vague, old-school Artificial Intelligence ideas of machines &amp;ldquo;understanding&amp;rdquo; things. We saw above how to show that &amp;ldquo;title&amp;rdquo; was meant in the sense of &amp;ldquo;job title&amp;rdquo; instead of a label for a book; this indicates some of the meaning, or semantics, of the word in a useful, machine-readable way.&lt;/p&gt;
&lt;p&gt;In &amp;ldquo;What is RDFS?&amp;rdquo; we&amp;rsquo;ll see how triples can show that Heidi Smith is an instance of the Employee class, and how if Employee is a subclass of Person, then we can infer that Heidi is also an instance of Person and has the associated properties. OWL lets you do even more. These little bits of semantics can be very useful, but the hype around the possibilities of  a connected web of such semantics—and around this web&amp;rsquo;s potential destiny as a platform for end-user applications—led to the term &amp;ldquo;Semantic Web&amp;rdquo; falling out of fashion.&lt;/p&gt;
&lt;h1 id=&#34;linked-open-data&#34;&gt;Linked (Open) Data&lt;/h1&gt;
&lt;p&gt;There is no standard specification for what counts as Linked Data. Many point to a &lt;a href=&#34;https://www.w3.org/DesignIssues/LinkedData.html&#34;&gt;Design Issues document&lt;/a&gt; that web inventor and W3C Director Tim Berners-Lee wrote with the caveat &amp;ldquo;personal view only&amp;rdquo;. The document outlines some rules and best practices for sharing of machine-readable data across platforms.&lt;/p&gt;
&lt;p&gt;Below the document&amp;rsquo;s four rules of Linked Data is an enumeration of the &amp;ldquo;5 Stars of Linked Data&amp;rdquo; that reflects how I&amp;rsquo;ve seen the term widely used. It includes the possibility that a CSV file available on a web server can be considered Linked Data, if not 5 Star Linked Data, and this has appealed to many people who admire the ideas behind Linked Data but don&amp;rsquo;t necessarily like RDF in any syntax—especially in RDF/XML. In general, Linked Data puts more emphasis on the sharing of machine-readable data using URIs and URLs than on the syntax of the data itself.&lt;/p&gt;
&lt;p&gt;Many organizations have found that Linked Data principles for sharing data across platforms have benefited their own use of data integration behind their firewalls. Linked Open Data applies these principles to data shared with the world. Berners-Lee&amp;rsquo;s document describes Linked Open Data as &amp;ldquo;Linked Data which is released under an open licence, which does not impede its reuse for free&amp;rdquo;; this typically means data shared on the public Internet where everyone can access it.&lt;/p&gt;
&lt;p&gt;Whether your Linked Data is open or not, the on-line book &lt;a href=&#34;https://patterns.dataincubator.org/book/&#34;&gt;Linked Data Patterns&lt;/a&gt; by Leigh Dodds and Ian Davis is a great place to learn about best practices for sharing data using Linked Data principles.  Jonathan Blaney&amp;rsquo;s &lt;a href=&#34;https://programminghistorian.org/en/lessons/intro-to-linked-data&#34;&gt;Introduction to the Principles of Linked Open Data&lt;/a&gt; also provides some good background.&lt;/p&gt;
&lt;h1 id=&#34;knowledge-graphs&#34;&gt;Knowledge Graphs&lt;/h1&gt;
&lt;p&gt;We&amp;rsquo;ve seen  how RDF triples can combine into graphs. Graph data structures are older than computer science itself. The term &amp;ldquo;knowledge graph&amp;rdquo; has been around for a few years too, but it became especially popular after an engineering SVP at Google published &lt;a href=&#34;https://blog.google/products/search/introducing-knowledge-graph-things-not/&#34;&gt;Introducing the Knowledge Graph: things, not strings&lt;/a&gt; in 2012. After this, many people working with different kinds of graph data tools started saying &amp;ldquo;Google stores their data in a knowledge graph? So do we, and you can, too!&amp;rdquo; RDF-based systems store data in a graph and include many options for storing semantics, so they&amp;rsquo;re an excellent candidate for storing knowledge graphs. The ease of data integration is also appealing to people interested in knowledge graphs, who often want to merge multiple graphs into a whole that is greater than the sum of its parts. I wrote more about this at &lt;a href=&#34;http://www.bobdc.com/blog/knowledgegraphs/&#34;&gt;Knowledge Graphs!&lt;/a&gt;&lt;/p&gt;
&lt;h1 id=&#34;rdf-and-you&#34;&gt;RDF and You&lt;/h1&gt;
&lt;p&gt;If you first learned about RDF from one of the approaches described above, I hope that I&amp;rsquo;ve given you a broader context of what it has done and can do. It&amp;rsquo;s important to remember that RDF and SPARQL are open standards with many implementations in the commercial and open source worlds. Because of their popularity in the academic world, many accuse these standards of being limited to academia, but that&amp;rsquo;s just not true. &lt;a href=&#34;http://sparql.club/&#34;&gt;Brand-name companies&lt;/a&gt; all over the world are seeing the value and increasing their usage of these standards all the time.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d like to close with a quote from the foreword that &lt;a href=&#34;https://twitter.com/danbri&#34;&gt;Dan Brickley&lt;/a&gt; and &lt;a href=&#34;https://twitter.com/libbymiller&#34;&gt;Libby Miller&lt;/a&gt; wrote for the book &lt;a href=&#34;http://book.validatingrdf.com/bookHtml005.html&#34;&gt;Validating RDF Data&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;People think RDF is a pain because it is complicated. The truth is even worse. RDF is painfully simplistic, but it allows you to work with real-world data and problems that are horribly complicated. While you can avoid RDF, it is harder to avoid complicated data and complicated computer problems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Next time we&amp;rsquo;ll see how RDFS can help deal with some of those complications.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Calling your own JavaScript functions from SPARQL queries</title>
      <link>https://www.bobdc.com/blog/arqjavascript/</link>
      <pubDate>Sun, 23 May 2021 11:25:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/arqjavascript/</guid>
      
      
      <description><div>More Jena arq fun.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/sparqlAndJSLogos.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; /&gt;
&lt;p&gt;When I saw &amp;ldquo;Add support for scripting languages other than JavaScript&amp;rdquo; in the &lt;a href=&#34;https://mail-archives.apache.org/mod_mbox/jena-users/202104.mbox/%3C05b4ad3b-0da8-4016-77b6-9aef7933da9d%40apache.org%3E&#34;&gt;Jena 4.0.0 release notes&lt;/a&gt;, my first reaction was &amp;ldquo;What? I can run the &lt;code&gt;arq&lt;/code&gt; command line SPARQL processor and call my own functions that I wrote in JavaScript?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://jena.apache.org/documentation/query/javascript-functions.html&#34;&gt;ARQ - JavaScript SPARQL Functions&lt;/a&gt; page of the Jena documentation shows how to do this. I had some fun playing with this capability, and as you&amp;rsquo;ll see, it offers some easy opportunities to clean up and improve your data.&lt;/p&gt;
&lt;p&gt;First, let&amp;rsquo;s see how it looks on the command line to run &lt;code&gt;arq&lt;/code&gt; with a SPARQL query that calls external JavaScript functions. It&amp;rsquo;s basically a typical invocation of &lt;code&gt;arq&lt;/code&gt; with an additional &lt;code&gt;--set&lt;/code&gt; parameter to point at a file of JavaScript functions, which in this example is called &lt;code&gt;myjs.js&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;arq --set arq:js-library=myjs.js --query jstest.rq --data phoneNumbers.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The data file that I used for my experiments simply lists a few people and their phone numbers. The &lt;code&gt;v:homeTel&lt;/code&gt; values use several different conventions for notating US phone numbers:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix v: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:i9771 v:given-name &amp;#34;Cindy&amp;#34; ;
        v:homeTel &amp;#34;1 (203) 446-5478&amp;#34; .

d:i0432 v:given-name &amp;#34;Richard&amp;#34; ;
        v:homeTel &amp;#34;   (729)556-5135   &amp;#34; .

d:i8301 v:given-name &amp;#34;Craig&amp;#34; ;
        v:homeTel &amp;#34;9232765135&amp;#34; .

d:i8309 v:given-name &amp;#34;Leigh&amp;#34; ;
        v:homeTel &amp;#34;843-5544&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The query in &lt;code&gt;jstest.rq&lt;/code&gt; copies the triples and also does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Passes the &lt;code&gt;v:homeTel&lt;/code&gt; value to a &lt;code&gt;normalizeUSPhoneNumber()&lt;/code&gt; function that I wrote in the &lt;code&gt;myjs.js&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Calls the &lt;code&gt;createRating()&lt;/code&gt; function in the same JavaScript file and passes the result to the CONSTRUCT clause, which puts the generated value in a &lt;code&gt;d:rating&lt;/code&gt; triple.&lt;/li&gt;
&lt;li&gt;Calls a JavaScript &lt;code&gt;Date()&lt;/code&gt; function directly (as opposed to calling it via something in &lt;code&gt;myjs.js&lt;/code&gt;) and assigns the returned value to an &lt;code&gt;?updateDate&lt;/code&gt; variable that also gets used in the CONSTRUCT clause.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice how all of the JavaScript function calls in the SPARQL query have a &lt;code&gt;js:&lt;/code&gt; prefix that is declared at the top like any other prefix. This is how &lt;code&gt;arq&lt;/code&gt; knows that these are external JavaScript functions.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# jstest.rq

PREFIX js: &amp;lt;http://jena.apache.org/ARQ/jsFunction#&amp;gt;
PREFIX d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; 
PREFIX v:  &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; 

CONSTRUCT {
  ?s v:given-name ?name ; 
  v:homeTel ?normalizedUSPhoneNumber ;
  d:rating  ?starRating ;
  d:as-of   ?updateDate;
}
WHERE {
  ?s v:given-name ?name ;
  v:homeTel ?phoneNum .
  BIND (js:normalizeUSPhoneNumber(?phoneNum) AS ?normalizedUSPhoneNumber)
  BIND (js:createRating() AS ?starRating)
  BIND (js:Date() AS ?updateDate)  # calling JavaScript function directly
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The JavaScript file defines two functions, both mentioned above:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;normalizeUSPhoneNumber()&lt;/code&gt; uses regular expressions to convert the phone number to an nnn-nnn-nnnn format if it has an area code and nnn-nnnn if it doesn&amp;rsquo;t. While SPARQL offers some support for regular expressions when you&amp;rsquo;re calculating a Boolean value to use in a FILTER expression, it doesn&amp;rsquo;t let you use regular expressions to manipulate values that can then be used in output, so I wanted to write a function that would demonstrate that.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;createRating()&lt;/code&gt; generates a random integer between one and five to demonstrate how we can call the &lt;code&gt;random()&lt;/code&gt; function to generate a number and then use other functions to massage that number into something we want.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// myjs.js

function normalizeUSPhoneNumber(phoneNumber) {
  phoneNumber = phoneNumber.replace(/ /g, &amp;#34;&amp;#34;)
    .replace(/^1/g,&amp;#34;&amp;#34;)
    .replace(/-/g,&amp;#34;&amp;#34;)
    .replace(/\(/g,&amp;#34;&amp;#34;)
    .replace(/\)/g,&amp;#34;&amp;#34;)
    .replace(/(\d\d\d\d$)/, &amp;#34;-$1&amp;#34;);
  if (phoneNumber.length &amp;gt; 10) {
     phoneNumber = phoneNumber.replace(/^(\d\d\d)/,&amp;#34;$1-&amp;#34;);
  }
  return phoneNumber;
}

function createRating() {
   return Math.ceil(Math.random()*5);
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Running the command line shown with these files gives us this output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:     &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix v:     &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix js:    &amp;lt;http://jena.apache.org/ARQ/jsFunction#&amp;gt; .

d:i9771  d:as-of      &amp;#34;Mon May 10 2021 08:02:35 GMT-0400 (EDT)&amp;#34; ;
        d:rating      3 ;
        v:given-name  &amp;#34;Cindy&amp;#34; ;
        v:homeTel     &amp;#34;203-446-5478&amp;#34; .

d:i8309  d:as-of      &amp;#34;Mon May 10 2021 08:02:35 GMT-0400 (EDT)&amp;#34; ;
        d:rating      5 ;
        v:given-name  &amp;#34;Leigh&amp;#34; ;
        v:homeTel     &amp;#34;843-5544&amp;#34; .

d:i0432  d:as-of      &amp;#34;Mon May 10 2021 08:02:35 GMT-0400 (EDT)&amp;#34; ;
        d:rating      4 ;
        v:given-name  &amp;#34;Richard&amp;#34; ;
        v:homeTel     &amp;#34;729-556-5135&amp;#34; .

d:i8301  d:as-of      &amp;#34;Mon May 10 2021 08:02:35 GMT-0400 (EDT)&amp;#34; ;
        d:rating      2 ;
        v:given-name  &amp;#34;Craig&amp;#34; ;
        v:homeTel     &amp;#34;923-276-5135&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Running it more than once gives different values for &lt;code&gt;d:rating&lt;/code&gt; each time, as I had hoped. (You always want to double-check that with random functions.)&lt;/p&gt;
&lt;p&gt;I also wanted to demonstrate a filter condition using a function that takes multiple arguments and returns true or false. That&amp;rsquo;s easy enough to do, but I couldn&amp;rsquo;t think of a good example that did something I couldn&amp;rsquo;t already do in SPARQL: comparing the values of multiple variables to arrive at a Boolean might take a few more lines in standard SPARQL, making the query more verbose, but it&amp;rsquo;s still straightforward without calling an external function.&lt;/p&gt;
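&lt;p&gt;For example, a multiple-argument Boolean function for use in a FILTER might look like this sketch (the &lt;code&gt;sameAreaCode&lt;/code&gt; function is a hypothetical one of my own, not part of the demo above):&lt;/p&gt;

```javascript
// Hypothetical multiple-argument Boolean function for a SPARQL FILTER:
// true when two phone numbers already normalized to nnn-nnn-nnnn form
// share an area code.
function sameAreaCode(num1, num2) {
  // the area code is everything before the first hyphen
  return num1.split("-")[0] === num2.split("-")[0];
}

console.log(sameAreaCode("203-446-5478", "203-555-0100")); // true
console.log(sameAreaCode("203-446-5478", "729-556-5135")); // false
```

&lt;p&gt;A query could then use it in a line like &lt;code&gt;FILTER(js:sameAreaCode(?tel1, ?tel2))&lt;/code&gt;.&lt;/p&gt;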
&lt;p&gt;Since writing this little demo, I have already used this ability to call external JavaScript functions to clean up some data in another project, the way I did with the phone numbers above. I had the SPARQL query above call &lt;code&gt;js:Date()&lt;/code&gt; directly to show that we &lt;em&gt;can&lt;/em&gt; call JavaScript functions directly from such queries. If I hadn&amp;rsquo;t, I would have had the query call a new function in the &lt;code&gt;myjs.js&lt;/code&gt; file that called JavaScript&amp;rsquo;s &lt;code&gt;Date()&lt;/code&gt; and then used regular expressions or other string manipulation tools to trim the returned date value down or convert it to &lt;a href=&#34;https://en.wikipedia.org/wiki/ISO_8601&#34;&gt;ISO 8601&lt;/a&gt; format. It would be another good example of how the ability to call external JavaScript functions from a SPARQL query makes the excellent library of native JavaScript functions available to a SPARQL developer.&lt;/p&gt;
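&lt;p&gt;Such a date-trimming wrapper function could be as short as this sketch (the name &lt;code&gt;isoDate&lt;/code&gt; is my own invention):&lt;/p&gt;

```javascript
// Hypothetical wrapper around JavaScript's Date: return just the
// ISO 8601 yyyy-mm-dd date instead of the full default date string.
function isoDate() {
  // toISOString() gives e.g. "2021-05-10T12:02:35.000Z";
  // the first ten characters are the date portion
  return new Date().toISOString().substring(0, 10);
}

console.log(isoDate());
```

&lt;p&gt;A query could then bind the result of &lt;code&gt;js:isoDate()&lt;/code&gt; instead of calling &lt;code&gt;js:Date()&lt;/code&gt; directly.&lt;/p&gt;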
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Hidden gems included with Jena’s command line utilities</title>
      <link>https://www.bobdc.com/blog/jenagems/</link>
      <pubDate>Sun, 25 Apr 2021 11:58:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/jenagems/</guid>
      
      
      <description><div>Lots of ways to manipulate your RDF from the open-source multiplatform tool kit</div><div>&lt;img id=&#34;idm45478314451696&#34; src=&#34;https://www.bobdc.com/img/main/gems.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;240&#34;/&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#rdfdiff&#34;&gt;rdfdiff&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#shacl&#34;&gt;shacl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#qparse&#34;&gt;qparse and uparse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rsparql&#34;&gt;rsparql&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rupdate&#34;&gt;rupdate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rdfparse&#34;&gt;rdfparse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fusekiDatasets&#34;&gt;Working with Fuseki datasets from the command line&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#dumpingDatasets&#34;&gt;Dumping dataset contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#queryingFuseki&#34;&gt;Querying a Fuseki dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#updatingFuseki&#34;&gt;Updating a Fuseki dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#loadingFile&#34;&gt;Loading a data file into a Fuseki dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#otherUtilities&#34;&gt;Other command line utilities for Fuseki datasets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#riot&#34;&gt;riot&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#convertingSerializations&#34;&gt;Converting serializations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#counting&#34;&gt;Counting triples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#concatenating&#34;&gt;Concatenating&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#inferencing&#34;&gt;Inferencing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On page 5 of my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; I described how the open source RDF processing framework &lt;a href=&#34;https://jena.apache.org/&#34;&gt;Apache Jena&lt;/a&gt; includes command line utilities called &lt;code&gt;arq&lt;/code&gt; and &lt;code&gt;sparql&lt;/code&gt; that let you run SPARQL queries with a simple command line like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;arq --data mydata.ttl --query myquery.rq
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At the time, the &lt;code&gt;arq&lt;/code&gt; one supported some SPARQL extensions that the &lt;code&gt;sparql&lt;/code&gt; one didn&amp;rsquo;t. I don&amp;rsquo;t even remember what they were; I tended to use &lt;code&gt;arq&lt;/code&gt; just because its name is shorter. I have since learned that, with support for those extensions added to &lt;code&gt;sparql&lt;/code&gt;, there is now no particular difference between the two.&lt;/p&gt;
&lt;p&gt;Jena (which recently celebrated &lt;a href=&#34;https://mail-archives.apache.org/mod_mbox/jena-users/202104.mbox/%3C05b4ad3b-0da8-4016-77b6-9aef7933da9d%40apache.org%3E&#34;&gt;release 4.0.0&lt;/a&gt;) includes Linux and Windows versions of many other utilities in addition to &lt;code&gt;arq&lt;/code&gt; and &lt;code&gt;sparql&lt;/code&gt;. I&amp;rsquo;ve mentioned several here when I used one or another to accomplish a particular task, and I thought it would be nice to summarize some of the ones that I have and have not mentioned before. I may be repeating some earlier explanations, but it should be handy to have them in one place.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;ll find Linux utilities such as &lt;code&gt;arq&lt;/code&gt; and &lt;code&gt;shacl&lt;/code&gt; in Jena&amp;rsquo;s &lt;code&gt;bin&lt;/code&gt; directory and corresponding Windows utilities such as &lt;code&gt;arq.bat&lt;/code&gt; and &lt;code&gt;shacl.bat&lt;/code&gt; in its &lt;code&gt;bat&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;Remember that, like &lt;code&gt;arq&lt;/code&gt; and &lt;code&gt;sparql&lt;/code&gt;, many of these utilities support additional command line parameters beyond the ones I show here. Use &lt;code&gt;--help&lt;/code&gt; with each to find out more. I&amp;rsquo;ve tried to demonstrate what I found most useful about each.&lt;/p&gt;
&lt;p&gt;You can find more background about some of these utilities on the Jena documentation pages &lt;a href=&#34;https://jena.apache.org/documentation/query/cmds.html&#34;&gt;ARQ - Command Line Applications&lt;/a&gt; (which covers more than just &lt;code&gt;arq&lt;/code&gt;) and the &amp;ldquo;Command line tools&amp;rdquo; section of the &lt;a href=&#34;https://jena.apache.org/documentation/io/&#34;&gt;Reading and Writing RDF in Apache Jena&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;And thanks to &lt;a href=&#34;https://twitter.com/AndySeaborne&#34;&gt;Andy Seaborne&lt;/a&gt; for reviewing a draft of this!&lt;/p&gt;
&lt;h1 id=&#34;rdfdiff&#34;&gt;rdfdiff&lt;/h1&gt;
&lt;p&gt;Use the &lt;code&gt;rdfdiff&lt;/code&gt; utility to compare two dataset files. It&amp;rsquo;s like the venerable UNIX command &lt;a href=&#34;https://www.man7.org/linux/man-pages/man1/diff.1.html&#34;&gt;&lt;code&gt;diff&lt;/code&gt;&lt;/a&gt;, except that it looks for different triples instead of lines. The order of the input triples doesn&amp;rsquo;t matter to &lt;code&gt;rdfdiff&lt;/code&gt;, and it can compare data files in different serializations. For example, here is a little RDF/XML file:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-xml&#34; data-lang=&#34;xml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;&amp;lt;!-- joereceiving.rdf --&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;rdf:RDF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a6e22e&#34;&gt;xmlns:rdf=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a6e22e&#34;&gt;xmlns:d=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://whatever/&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;rdf:Description&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;rdf:about=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://whatever/emp3&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;d:dept&amp;gt;&lt;/span&gt;receiving&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/d:dept&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;d:name&amp;gt;&lt;/span&gt;joe&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/d:name&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;d:insurance&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;rdf:resource=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://www.uhc.com&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;/&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/rdf:Description&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/rdf:RDF&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here is a Turtle file with roughly the same information:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# joereceiving.ttl

@prefix w: &amp;lt;http://whatever/&amp;gt; .

w:emp3 w:name &amp;#34;Joseph&amp;#34; ;
       w:dept &amp;#34;receiving&amp;#34; ;
       w:insurance &amp;lt;http://www.uhc.com&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I ran this command to compare the two, also including the names of their formats:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;rdfdiff joereceiving.rdf joereceiving.ttl RDF/XML TURTLE
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I got this output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt; [http://whatever/emp3, http://whatever/name, &amp;#34;joe&amp;#34;]
&amp;gt; [http://whatever/emp3, http://whatever/name, &amp;#34;Joseph&amp;#34;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Like the text file comparison utility &lt;code&gt;diff&lt;/code&gt;, the report uses &lt;code&gt;&amp;lt;&lt;/code&gt; as a prefix to show you what was in the first file but not the second and &lt;code&gt;&amp;gt;&lt;/code&gt; to show you what was in the second but not the first.&lt;/p&gt;
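&lt;p&gt;The comparison itself boils down to set difference over parsed triples. Here is a minimal Python sketch of that idea (the hand-parsed triples and the &lt;code&gt;graph_diff&lt;/code&gt; helper are my own illustration, not Jena code):&lt;/p&gt;

```python
# A graph diff is set difference over (subject, predicate, object) tuples.
# These hand-parsed triples mirror joereceiving.rdf and joereceiving.ttl.
first = {
    ("http://whatever/emp3", "http://whatever/dept", "receiving"),
    ("http://whatever/emp3", "http://whatever/name", "joe"),
    ("http://whatever/emp3", "http://whatever/insurance", "http://www.uhc.com"),
}
second = {
    ("http://whatever/emp3", "http://whatever/dept", "receiving"),
    ("http://whatever/emp3", "http://whatever/name", "Joseph"),
    ("http://whatever/emp3", "http://whatever/insurance", "http://www.uhc.com"),
}

def graph_diff(a, b):
    """Return (only_in_a, only_in_b), like rdfdiff's two report sections."""
    return sorted(a - b), sorted(b - a)

only_a, only_b = graph_diff(first, second)
print("only in first:", only_a)
print("only in second:", only_b)
```

&lt;p&gt;Because RDF graphs are sets of triples, serialization details like RDF/XML versus Turtle drop out entirely once both files are parsed.&lt;/p&gt;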
&lt;p&gt;As with many other Jena utilities, you can use the URL of a remote file instead of the name of a local file for either or both of the first two arguments.&lt;/p&gt;
&lt;h1 id=&#34;shacl&#34;&gt;shacl&lt;/h1&gt;
&lt;p&gt;In &lt;a href=&#34;../validating-rdf-data-with-shacl/&#34;&gt;Validating RDF data with SHACL&lt;/a&gt; I described how to use an open source tool developed by &lt;a href=&#34;http://www.topquadrant.com&#34;&gt;TopQuadrant&lt;/a&gt; to validate RDF data against constraints described using the W3C SHACL standard. Jena includes a &lt;a href=&#34;https://jena.apache.org/documentation/shacl/index.html&#34;&gt;&lt;code&gt;shacl&lt;/code&gt;&lt;/a&gt; utility to do the same kind of validation; when I ran it with the &lt;a href=&#34;http://snee.com/bobdc.blog/files/employees.ttl&#34;&gt;&lt;code&gt;employees.ttl&lt;/code&gt;&lt;/a&gt; file linked from that blog entry, all of the examples described there worked with Jena &lt;code&gt;shacl&lt;/code&gt; as well.&lt;/p&gt;
&lt;p&gt;Because the &lt;code&gt;employees.ttl&lt;/code&gt; file has its class definitions, instance data, and SHACL shapes all in one file, I passed that filename as both the &lt;code&gt;--data&lt;/code&gt; and the &lt;code&gt;--shapes&lt;/code&gt; parameters when I ran this command line tool:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;shacl validate --data employees.ttl --shapes employees.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It found all of my test constraint violations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After I uncommented the data&amp;rsquo;s &lt;code&gt;e2&lt;/code&gt; example, &lt;code&gt;shacl&lt;/code&gt; reported that it was missing the required &lt;code&gt;hr:jobGrade&lt;/code&gt; value.&lt;/li&gt;
&lt;li&gt;After I uncommented the &lt;code&gt;e3&lt;/code&gt; example, it reported that its &lt;code&gt;hr:jobGrade&lt;/code&gt; value was not an integer.&lt;/li&gt;
&lt;li&gt;After I uncommented the &lt;code&gt;e4&lt;/code&gt; example, it reported that its &lt;code&gt;hr:jobGrade&lt;/code&gt; value fell outside the allowed range.&lt;/li&gt;
&lt;/ul&gt;
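&lt;p&gt;The three checks above boil down to presence, datatype, and range constraints. Here is a small Python sketch of that logic (the specific &lt;code&gt;jobGrade&lt;/code&gt; rules and the range of 1 to 10 are my paraphrase for illustration, not the actual shapes or SHACL itself):&lt;/p&gt;

```python
# Mimic three SHACL-style checks on a jobGrade value:
# required presence, integer datatype, and an allowed range
# (assumed here to be 1 through 10).
def check_job_grade(record, lo=1, hi=10):
    violations = []
    if "jobGrade" not in record:
        violations.append("missing required jobGrade")
        return violations
    value = record["jobGrade"]
    if not isinstance(value, int):
        violations.append("jobGrade is not an integer")
    elif value not in range(lo, hi + 1):
        violations.append("jobGrade out of allowed range")
    return violations

print(check_job_grade({"name": "e2"}))                  # required value absent
print(check_job_grade({"name": "e3", "jobGrade": "9"})) # wrong datatype
print(check_job_grade({"name": "e4", "jobGrade": 99}))  # out of range
```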
&lt;p&gt;As &lt;a href=&#34;https://www.w3.org/TR/shacl/#validation-report&#34;&gt;the SHACL specification requires&lt;/a&gt;, the validation reports produced by &lt;code&gt;shacl&lt;/code&gt; were themselves sets of triples, whether it found violations or not. This makes it easier to fit the tool into an RDF processing pipeline.&lt;/p&gt;
&lt;p&gt;Adding &lt;code&gt;-v&lt;/code&gt; for &amp;ldquo;verbose&amp;rdquo; after &lt;code&gt;shacl validate&lt;/code&gt; in that command line adds additional information to the output.&lt;/p&gt;
&lt;p&gt;The utility&amp;rsquo;s &lt;code&gt;print&lt;/code&gt; option outputs the shapes in the file. It can do this as regular RDF, &lt;a href=&#34;https://w3c.github.io/shacl/shacl-compact-syntax/&#34;&gt;compact SHACL syntax&lt;/a&gt; (surprisingly useful if you have a lot of shapes), or the default: a simple text representation.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;shacl print --out=RDF employees.ttl     # out=RDF, compact, or text
&lt;/code&gt;&lt;/pre&gt;&lt;span id=&#34;qparse&#34;/&gt;
&lt;h1 id=&#34;qparse-and-uparse&#34;&gt;qparse and uparse&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;qparse&lt;/code&gt; utility parses a query and can do various things with it as described by its &lt;code&gt;--help&lt;/code&gt; option. I recently learned that it can pretty-print queries, so if the spacing and indentation of a query that you&amp;rsquo;re trying to understand is a mess, &lt;code&gt;qparse&lt;/code&gt; can make it easier to understand and even capitalize keywords and add line numbers.&lt;/p&gt;
&lt;p&gt;Here is a sloppily formatted little query:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# namedept.rq
prefix w: &amp;lt;http://whatever/&amp;gt; Select
* WHERE { ?s w:name ?name . optiONAL {       ?s w:dept ?dept } }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I run this command,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;qparse --query namedept.rq
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;and I get this output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX  w:    &amp;lt;http://whatever/&amp;gt;

SELECT  *
WHERE
  { ?s  w:name  ?name
    OPTIONAL
      { ?s  w:dept  ?dept }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Adding &lt;code&gt;--num&lt;/code&gt; to the command line would add line numbers to the output.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;uparse&lt;/code&gt; utility can do the same thing for update queries. The following pretty-prints the file &lt;code&gt;updatetest.ru&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;uparse --file=updatetest.ru
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Further documentation about both commands is available in the &lt;a href=&#34;https://jena.apache.org/documentation/query/cmds.html&#34;&gt;Jena documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;rsparql&#34;&gt;rsparql&lt;/h1&gt;
&lt;p&gt;This sends a local query to a SPARQL endpoint specified with a URL. I would typically use &lt;a href=&#34;https://curl.se/&#34;&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/a&gt; for this, but after reviewing the &lt;code&gt;--help&lt;/code&gt; options for &lt;code&gt;rsparql&lt;/code&gt; I see that it makes it easier to specify that you want the results in text, XML, JSON, CSV, or TSV. When sending a SPARQL query with &lt;code&gt;curl&lt;/code&gt;, you can&amp;rsquo;t assume that the endpoint supports all of these result formats, and you probably have to look up their &lt;a href=&#34;https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types&#34;&gt;MIME types&lt;/a&gt;, because I certainly haven&amp;rsquo;t memorized them.&lt;/p&gt;
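&lt;p&gt;For comparison, here is roughly what you have to juggle yourself when using &lt;code&gt;curl&lt;/code&gt;: a Python sketch mapping result format names to the Accept header that an endpoint expects. (The format names and the &lt;code&gt;accept_header&lt;/code&gt; helper are my own; the MIME types are the standard SPARQL 1.1 result types.)&lt;/p&gt;

```python
# The Accept header value a SPARQL endpoint needs for each result format;
# rsparql hides this lookup behind a simple format flag.
SPARQL_RESULT_TYPES = {
    "xml":  "application/sparql-results+xml",
    "json": "application/sparql-results+json",
    "csv":  "text/csv",
    "tsv":  "text/tab-separated-values",
}

def accept_header(fmt):
    """Build the header dict for an HTTP SPARQL query request."""
    return {"Accept": SPARQL_RESULT_TYPES[fmt]}

print(accept_header("json"))
```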
&lt;p&gt;The following sends the SPARQL query in the &lt;code&gt;5triples.rq&lt;/code&gt; file to the Wikidata endpoint and then outputs the results at the command line:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;rsparql --query 5triples.rq --service=https://query.wikidata.org/sparql
&lt;/code&gt;&lt;/pre&gt;&lt;h1 id=&#34;rupdate&#34;&gt;rupdate&lt;/h1&gt;
&lt;p&gt;This sends a local update query to a SPARQL endpoint specified with a URL. It will have to be one where you have update permission, which may well be a locally running copy of Fuseki. The following executes the update request stored in &lt;code&gt;updatetest.ru&lt;/code&gt; on the test1 dataset in the locally running copy of Fuseki (assuming that &lt;code&gt;fuseki-server&lt;/code&gt; was started up with the &lt;code&gt;--update&lt;/code&gt; parameter, as described below):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;rupdate --service=http://localhost:3030/test1 --update=updatetest.ru
&lt;/code&gt;&lt;/pre&gt;&lt;h1 id=&#34;rdfparse&#34;&gt;rdfparse&lt;/h1&gt;
&lt;p&gt;This parses an RDF/XML document. People don&amp;rsquo;t use RDF/XML much anymore, and with good reason, but if you find any RDF/XML this is a simple way to convert it. The &lt;code&gt;riot&lt;/code&gt; utility, described below, is even better, but I especially like the &lt;code&gt;-R&lt;/code&gt; switch available with &lt;code&gt;rdfparse&lt;/code&gt;; this tells it to search through an arbitrary XML document and extract any triples stored within embedded &lt;code&gt;rdf:RDF&lt;/code&gt; elements. That can be great for processing some RDF that was embedded into XML before &lt;a href=&#34;http://www.bobdc.com/blog/json-ld/&#34;&gt;JSON-LD&lt;/a&gt; or even &lt;a href=&#34;http://www.bobdc.com/blog/rdfa-can-be-so-simple/&#34;&gt;RDFa&lt;/a&gt; were around.  Here&amp;rsquo;s a nice arbitrary XML document that I called &lt;code&gt;xproduct1.xml&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-xml&#34; data-lang=&#34;xml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;myDoc&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;header&amp;gt;&amp;lt;whatev/&amp;gt;&amp;lt;/header&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;rdf:RDF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#a6e22e&#34;&gt;xmlns:rdf=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#a6e22e&#34;&gt;xmlns:d=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://whatever/&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;rdf:Description&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;rdf:about=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://whatever/emp1&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;d:dept&amp;gt;&lt;/span&gt;shipping&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/d:dept&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;d:name&amp;gt;&lt;/span&gt;jane&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/d:name&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/rdf:Description&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/rdf:RDF&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;arbitraryElement/&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;rdf:RDF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#a6e22e&#34;&gt;xmlns:rdf=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#a6e22e&#34;&gt;xmlns:d=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://whatever/&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;rdf:Description&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;rdf:about=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://whatever/emp3&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;d:dept&amp;gt;&lt;/span&gt;receiving&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/d:dept&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;d:name&amp;gt;&lt;/span&gt;joe&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/d:name&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/rdf:Description&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/rdf:RDF&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;/myDoc&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I run the following command,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;rdfparse -R xproduct1.xml 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;and it produces this nice ntriples output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;http://whatever/emp1&amp;gt; &amp;lt;http://whatever/dept&amp;gt; &amp;#34;shipping&amp;#34; .
&amp;lt;http://whatever/emp1&amp;gt; &amp;lt;http://whatever/name&amp;gt; &amp;#34;jane&amp;#34; .
&amp;lt;http://whatever/emp3&amp;gt; &amp;lt;http://whatever/dept&amp;gt; &amp;#34;receiving&amp;#34; .
&amp;lt;http://whatever/emp3&amp;gt; &amp;lt;http://whatever/name&amp;gt; &amp;#34;joe&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;h1 id=&#34;fusekiDatasets&#34;&gt;Working with Fuseki datasets from the command line&lt;/h1&gt;
&lt;p&gt;Jena includes several utilities that let you work with datasets created using Jena&amp;rsquo;s &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Fuseki&lt;/a&gt; SPARQL server. Their ability to load and update data can be very helpful in an automated system that uses Fuseki as its backend data store.&lt;/p&gt;
&lt;p&gt;To create some of this data to test with, I used the following command to start up Fuseki in a mode that would allow updates to data that it was storing:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;fuseki-server --update
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When you go to Fuseki&amp;rsquo;s GUI interface at http://localhost:3030 and tell it that you want to create a new dataset, you have to choose between three types of dataset: in-memory ones that will not persist from session to session, &amp;ldquo;Persistent&amp;rdquo; ones that use the older TDB format, and &amp;ldquo;Persistent (TDB2)&amp;rdquo; ones that use the more advanced TDB2 format. For my examples below I just created TDB2 datasets. TDB versions of the commands are also included with Jena, but if you&amp;rsquo;re creating a new dataset, you may as well use TDB2.&lt;/p&gt;
&lt;p&gt;Most of these utilities expect you to specify a path to an assembler file to tell those utilities which Fuseki dataset to operate on. I never tried making my way through the &lt;a href=&#34;https://jena.apache.org/documentation/assembler/assembler-howto.html&#34;&gt;Jena Assembler howto&lt;/a&gt; documentation, but I recently noticed that Fuseki creates assembler files for us, so I don&amp;rsquo;t have to worry about their structure and syntax because I can have Fuseki make them for me. When I used Fuseki&amp;rsquo;s GUI to create a TDB2 dataset called test1, Fuseki created the assembler file &lt;code&gt;apache-jena-fuseki/run/configuration/test1.ttl&lt;/code&gt;, so I knew where to point the command line utilities.&lt;/p&gt;
&lt;p&gt;These command line tools won&amp;rsquo;t work with the Fuseki datasets if you have Fuseki running because Fuseki locks the files. My examples below assume that I have created the test1 dataset described above, used the web-based interface to upload data to it (although, as we&amp;rsquo;ll see, this can be done with command line tools as well), and then shut down the Fuseki server.&lt;/p&gt;
&lt;p&gt;Additional information about these commands is available at &lt;a href=&#34;https://jena.apache.org/documentation/tdb2/tdb2_cmds.html&#34;&gt;TDB2 - Command Line Tools&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;dumpingDatasets&#34;&gt;Dumping dataset contents&lt;/h2&gt;
&lt;p&gt;The following command showed me the contents of that TDB2 dataset at the command line:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;tdb2.tdbdump --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl 
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;queryingFuseki&#34;&gt;Querying a Fuseki dataset&lt;/h2&gt;
&lt;p&gt;With a SPARQL query stored in &lt;code&gt;myquery.rq&lt;/code&gt;, this command queries the test1 dataset and outputs the results at the command line:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;tdb2.tdbquery --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl --query myquery.rq
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Setting the output format works much as it does with &lt;code&gt;arq&lt;/code&gt;. Run &lt;code&gt;tdb2.tdbquery --help&lt;/code&gt; to find out more.&lt;/p&gt;
&lt;h2 id=&#34;updatingFuseki&#34;&gt;Updating a Fuseki dataset&lt;/h2&gt;
&lt;p&gt;With  the file &lt;code&gt;updatetest.ru&lt;/code&gt; storing a SPARQL INSERT update request that inserts a single triple, the following command didn&amp;rsquo;t show anything at the command line,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;tdb2.tdbupdate --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl --update updatetest.ru
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;but when I restarted the Fuseki server and used the web-based interface to query dataset test1 for all of its triples, I saw the triple inserted by the &lt;code&gt;updatetest.ru&lt;/code&gt; query in there with the triples that had been in there before.&lt;/p&gt;
&lt;h2 id=&#34;loadingFile&#34;&gt;Loading a data file into a Fuseki dataset&lt;/h2&gt;
&lt;p&gt;The following loaded the triples in the file &lt;code&gt;furniture.ttl&lt;/code&gt; into the test1 dataset (which I confirmed the same way I did with my previous example) and displayed some status messages:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;tdb2.tdbloader --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl furniture.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It&amp;rsquo;s best to make sure that there are no parsing problems with the file you load before you load it. A quick way to do that is with the &lt;code&gt;--validate&lt;/code&gt; parameter of the &lt;code&gt;riot&lt;/code&gt; command:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;riot --validate furniture.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;otherUtilities&#34;&gt;Other command line utilities for Fuseki datasets&lt;/h2&gt;
&lt;p&gt;The following commands all work on the dataset  whose assembler file you point to with the &lt;code&gt;--tdb&lt;/code&gt; parameter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;tdb2.tdbstats&lt;/code&gt; outputs a LISPy set of parenthesized expressions telling you about the dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;tdb2.tdbbackup&lt;/code&gt; creates a gzipped copy of the dataset&amp;rsquo;s triples.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I tried &lt;code&gt;tdb2.tdbcompact&lt;/code&gt; and got a status message of &amp;ldquo;Compacted in 0.570s&amp;rdquo;; someday I&amp;rsquo;ll try this with a larger dataset to really investigate the effect.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;riot&#34;&gt;riot&lt;/h1&gt;
&lt;p&gt;Jena includes many command line utilities that I won&amp;rsquo;t describe here because &lt;code&gt;riot&lt;/code&gt; (&amp;ldquo;RDF I/O Technology&amp;rdquo;) combines them all into one utility that I have been using more and more lately. I mentioned in &lt;a href=&#34;http://www.bobdc.com/blog/turtlefromgooglekg/&#34;&gt;Pulling Turtle RDF triples from the Google Knowledge Graph&lt;/a&gt; how it can accept triples via standard input, which was great for the use case that I described there of converting Google Knowledge Graph JSON-LD to Turtle triples on the fly.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve already seen another nice use of &lt;code&gt;riot&lt;/code&gt; above: validating a file of triples before loading it into a dataset stored on a server.&lt;/p&gt;
&lt;h2 id=&#34;convertingSerializations&#34;&gt;Converting serializations&lt;/h2&gt;
&lt;p&gt;To simply convert an RDF file from one serialization to another, use the &lt;code&gt;riot&lt;/code&gt; &lt;code&gt;--output&lt;/code&gt; parameter to name the new serialization:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;riot --output=JSONLD emps.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The Jena utilities &lt;code&gt;nquads&lt;/code&gt;, &lt;code&gt;ntriples&lt;/code&gt;, &lt;code&gt;rdfxml&lt;/code&gt;, &lt;code&gt;trig&lt;/code&gt;, and &lt;code&gt;turtle&lt;/code&gt; are all specialized versions of &lt;code&gt;riot&lt;/code&gt; that produce the named serializations with no need for an &lt;code&gt;--output&lt;/code&gt; parameter.&lt;/p&gt;
&lt;span id=&#34;counting&#34;/&gt;
&lt;h2 id=&#34;counting-triples&#34;&gt;Counting triples&lt;/h2&gt;
&lt;p&gt;When I want to know how many triples are in a Turtle file, here&amp;rsquo;s what I usually do:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Look around my hard disk for a query file that uses COUNT to count all the triples.&lt;/li&gt;
&lt;li&gt;Give up looking.&lt;/li&gt;
&lt;li&gt;Look up the COUNT syntax in my book &amp;ldquo;Learning SPARQL&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Write another query file for counting all the triples.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now I can just use &lt;code&gt;riot&lt;/code&gt; with this simple command line:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;riot --count furniture.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It also works with quads.&lt;/p&gt;
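&lt;p&gt;If what you have is N-Triples, the count is simple enough to sketch in a few lines of Python. (This is only a rough approximation of what a real parser does: it skips blank lines and comments but does no actual syntax checking, and the sample data omits the angle brackets that real N-Triples URIs require.)&lt;/p&gt;

```python
# Count statements in N-Triples text: one triple per non-blank,
# non-comment line. A real parser would also validate each line.
def count_ntriples(text):
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count

# Made-up sample data (URI angle brackets omitted for brevity).
sample = """# furniture sample
_:chair1 http://whatever/legCount "4" .

_:chair1 http://whatever/material "oak" .
"""
print(count_ntriples(sample))
```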
&lt;span id=&#34;concatenating&#34;/&gt;
&lt;h2 id=&#34;concatenating&#34;&gt;Concatenating&lt;/h2&gt;
&lt;p&gt;Jena includes an &lt;code&gt;rdfcat&lt;/code&gt; utility that outputs the concatenated contents of any data files listed on its command line, but the first thing it outputs is a header that says &amp;ldquo;DEPRECATED: Please use &amp;lsquo;riot&amp;rsquo; instead&amp;rdquo;. Providing multiple data file names as arguments when running &lt;code&gt;riot&lt;/code&gt; (I think I just got &lt;a href=&#34;https://idioms.thefreedictionary.com/run+riot&#34;&gt;another pun&lt;/a&gt; out of the name) outputs an ntriples version of their concatenated triples by default, with status messages showing where each file&amp;rsquo;s triples start. Adding &lt;code&gt;--quiet&lt;/code&gt; suppresses the status messages, and &lt;code&gt;--output&lt;/code&gt; lets you specify a different output serialization.&lt;/p&gt;
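&lt;p&gt;When the inputs are all loaded into a single model, as &lt;code&gt;rdfcat&lt;/code&gt; did, concatenating RDF files amounts to a set union of their triples, so a triple that appears in more than one input file shows up only once. A quick Python sketch with made-up triples:&lt;/p&gt;

```python
# Merging RDF files into one model is a set union of triples,
# so a triple present in both inputs appears only once.
file_a = {
    ("http://whatever/emp1", "http://whatever/dept", "shipping"),
    ("http://whatever/emp1", "http://whatever/name", "jane"),
}
file_b = {
    ("http://whatever/emp1", "http://whatever/dept", "shipping"),  # duplicate
    ("http://whatever/emp3", "http://whatever/name", "joe"),
}

merged = file_a | file_b
print(len(merged))  # 3 distinct triples, not 4
```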
&lt;h2 id=&#34;inferencing&#34;&gt;Inferencing&lt;/h2&gt;
&lt;p&gt;Jena includes an &lt;code&gt;infer&lt;/code&gt; utility that does inferencing from an RDFS model, but I no longer bother with it because &lt;code&gt;riot&lt;/code&gt; can do this as well.
The following little RDFS model shows that two properties from the Oracle and Microsoft sample relational databases are subproperties of similar schema.org properties:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# empmodel.ttl
@prefix rdfs:     &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; . 
@prefix schema:   &amp;lt;http://schema.org/&amp;gt; . 
@prefix oraclehr: &amp;lt;http://snee.com/vocab/schema/OracleHR#&amp;gt; .
@prefix nw:       &amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#&amp;gt; .

oraclehr:employees_first_name rdfs:subPropertyOf schema:givenName  . 
oraclehr:employees_last_name  rdfs:subPropertyOf schema:familyName . 
nw:employees_FirstName        rdfs:subPropertyOf schema:givenName  . 
nw:employees_LastName         rdfs:subPropertyOf schema:familyName . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is some data using the Oracle and Microsoft properties:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# emps.ttl
@prefix rdfs:     &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; . 
@prefix schema:   &amp;lt;http://schema.org/&amp;gt; . 
@prefix oraclehr: &amp;lt;http://snee.com/vocab/schema/OracleHR#&amp;gt; .
@prefix nw:       &amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#&amp;gt; .

oraclehr:employees_100 oraclehr:employees_last_name &amp;#34;King&amp;#34; ;
    oraclehr:employees_first_name &amp;#34;Steven&amp;#34; .

nw:employees_2 nw:employees_LastName &amp;#34;Fuller&amp;#34; ;
    nw:employees_FirstName &amp;#34;Andrew&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This command tells &lt;code&gt;riot&lt;/code&gt; to do inferencing on &lt;code&gt;emps.ttl&lt;/code&gt; using the RDFS modeling in &lt;code&gt;empmodel.ttl&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;riot --rdfs empmodel.ttl emps.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;And here is the ntriples result with spaces added for more readability:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_100&amp;gt;
  &amp;lt;http://snee.com/vocab/schema/OracleHR#employees_last_name&amp;gt; &amp;#34;King&amp;#34; .
  
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_100&amp;gt;
  &amp;lt;http://schema.org/familyName&amp;gt; &amp;#34;King&amp;#34; .
  
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_100&amp;gt;
  &amp;lt;http://snee.com/vocab/schema/OracleHR#employees_first_name&amp;gt; &amp;#34;Steven&amp;#34; .
  
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_100&amp;gt;
  &amp;lt;http://schema.org/givenName&amp;gt; &amp;#34;Steven&amp;#34; .
  
&amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_2&amp;gt;
  &amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_LastName&amp;gt; &amp;#34;Fuller&amp;#34; .
  
&amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_2&amp;gt;
  &amp;lt;http://schema.org/familyName&amp;gt; &amp;#34;Fuller&amp;#34; .
  
&amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_2&amp;gt;
  &amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_FirstName&amp;gt; &amp;#34;Andrew&amp;#34; .
  
&amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_2&amp;gt;
  &amp;lt;http://schema.org/givenName&amp;gt; &amp;#34;Andrew&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The new triples show that these employees have schema.org properties in addition to the original OracleHR and Northwind properties. This ability makes this kind of inferencing great for data integration, as I described in &lt;a href=&#34;http://www.bobdc.com/blog/driving-hadoop-data-integratio/&#34;&gt;Driving Hadoop data integration with standards-based models instead of code&lt;/a&gt;. (In that article I used the Python library &lt;a href=&#34;https://rdflib.dev/&#34;&gt;rdflib&lt;/a&gt; to do the same kind of inferencing, but that&amp;rsquo;s the beauty of standards—having a choice of tools to implement the same expected behavior.)&lt;/p&gt;
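&lt;p&gt;The inference rule at work here is easy to sketch: for every data triple whose predicate has an &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; superproperty in the model, add a second triple that uses the superproperty. Here is a minimal Python version of that single rule (it ignores transitive subproperty chains and the rest of RDFS, which full inferencing also handles; prefixed names stand in for full URIs to keep it short):&lt;/p&gt;

```python
# rdfs:subPropertyOf inferencing, single level: each data triple
# using a subproperty also implies a triple using its superproperty.
SUBPROPERTY = {
    "oraclehr:employees_first_name": "schema:givenName",
    "oraclehr:employees_last_name":  "schema:familyName",
    "nw:employees_FirstName":        "schema:givenName",
    "nw:employees_LastName":         "schema:familyName",
}

data = [
    ("oraclehr:employees_100", "oraclehr:employees_last_name",  "King"),
    ("oraclehr:employees_100", "oraclehr:employees_first_name", "Steven"),
    ("nw:employees_2", "nw:employees_LastName",  "Fuller"),
    ("nw:employees_2", "nw:employees_FirstName", "Andrew"),
]

def infer_subproperties(triples, model):
    """Return the input triples plus the ones they imply."""
    inferred = list(triples)
    for s, p, o in triples:
        if p in model:
            inferred.append((s, model[p], o))
    return inferred

for t in infer_subproperties(data, SUBPROPERTY):
    print(t)
```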
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Pulling Turtle RDF triples from the Google Knowledge Graph</title>
      <link>https://www.bobdc.com/blog/turtlefromgooglekg/</link>
      <pubDate>Sun, 28 Mar 2021 13:05:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/turtlefromgooglekg/</guid>
      
      
      <description><div>Even querying by type!</div><div>&lt;p&gt;When I wrote about my first deep dive into &lt;a href=&#34;../knowledgegraphs/&#34;&gt;Knowledge Graphs&lt;/a&gt;, I mentioned that although the term was around well before 2012, the idea of a Knowledge Graph was blessed as an official Google thing that year when one of their engineering SVPs published the article  &lt;a href=&#34;https://blog.google/products/search/introducing-knowledge-graph-things-not/&#34;&gt;Introducing the Knowledge Graph: things, not strings&lt;/a&gt;. This blessing gave some focus to many members of the graph database community because they could say that what they had been doing was similar, if not the same, as what Google was doing.&lt;/p&gt;
&lt;p&gt;I still didn&amp;rsquo;t think of the Google Knowledge Graph as a specific thing, but as more of a marketing term describing a set of technologies, like IBM&amp;rsquo;s Watson. I have changed my mind: in Pascal Hitzler&amp;rsquo;s &lt;a href=&#34;https://cacm.acm.org/magazines/2021/2/250085-a-review-of-the-semantic-web-field/fulltext&#34;&gt;A Review of the Semantic Web Field&lt;/a&gt; in the Communications of the ACM I learned that there is an actual, RESTful &lt;a href=&#34;https://developers.google.com/knowledge-graph&#34;&gt;Google Knowledge Graph Search API&lt;/a&gt;, and I&amp;rsquo;ve been having some fun pulling Turtle RDF triples out of it.&lt;/p&gt;
&lt;p&gt;That Google page demonstrates what you can put in a URL to request JSON-LD data from their Knowledge Graph. Their first example sends a search for &amp;ldquo;Taylor Swift&amp;rdquo;; below I have used that example with &lt;a href=&#34;https://curl.se/&#34;&gt;curl&lt;/a&gt; and piped the output through the &lt;a href=&#34;https://jena.apache.org/documentation/io/&#34;&gt;Jena riot&lt;/a&gt; command line utility (not to be confused with &lt;a href=&#34;http://jennariot.com/&#34;&gt;DJ Jenna Riot&lt;/a&gt;, who I just learned about in a web search) so that I could get Turtle triples of the result. I won&amp;rsquo;t even bother showing the JSON-LD version here because I can get the Turtle version with this single command:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;curl \
  &amp;#34;https://kgsearch.googleapis.com/v1/entities:search?query=taylor+swift&amp;amp;key=API_KEY&amp;amp;limit=1&amp;amp;indent=True&amp;#34; \
  | riot --syntax=JSONLD --output=turtle
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Two notes about this command line:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I substituted my own API key for &amp;ldquo;API_KEY&amp;rdquo; above. You can get your own at &lt;a href=&#34;https://developers.google.com/maps/documentation/javascript/get-api-key&#34;&gt;API Key&lt;/a&gt; by filling out a few forms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When you feed RDF to riot, it can usually guess the serialization from the end of the input filename, but when piping data to it on standard input as I do above, you need the &lt;code&gt;--syntax&lt;/code&gt; parameter to tell it what flavor of RDF you are feeding it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
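&lt;p&gt;If you&amp;rsquo;d rather build that request in a program than on a command line, a few lines of Python&amp;rsquo;s standard library can assemble the same URL. (This is just a sketch: the &lt;code&gt;kg_search_url&lt;/code&gt; helper name is mine, not part of any API, and you&amp;rsquo;d still pipe the fetched JSON-LD through riot as above.)&lt;/p&gt;

```python
# Sketch: assemble the same Knowledge Graph Search API request URL in Python.
# "API_KEY" is a placeholder, as in the curl example; the helper name is mine.
from urllib.parse import urlencode

def kg_search_url(query, api_key, limit=1, types=None):
    """Build a URL for the kgsearch entities:search endpoint."""
    params = {"query": query, "key": api_key, "limit": limit, "indent": "True"}
    if types:
        # e.g. "Person", or a comma-delimited list like "Person,Corporation"
        params["types"] = types
    return ("https://kgsearch.googleapis.com/v1/entities:search?"
            + urlencode(params))

url = kg_search_url("taylor swift", "API_KEY")
```

&lt;p&gt;Fetching that URL (with &lt;code&gt;urllib.request&lt;/code&gt;, say) returns the same JSON-LD that the curl command does.&lt;/p&gt;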
&lt;p&gt;That command gave me 14 triples, including these:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;http://g.co/kg/m/0dl567&amp;gt;
        &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt;  &amp;lt;http://schema.org/Thing&amp;gt; ;
        &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt;  &amp;lt;http://schema.org/Person&amp;gt; ;
        goog:detailedDescription  _:b2 ;
        &amp;lt;http://schema.org/description&amp;gt;  &amp;#34;American singer&amp;#34; ;
        &amp;lt;http://schema.org/name&amp;gt;  &amp;#34;Taylor Swift&amp;#34; ;
        &amp;lt;http://schema.org/url&amp;gt;   &amp;#34;http://www.taylorswift.com/&amp;#34; .

_:b2    &amp;lt;http://schema.org/articleBody&amp;gt;  &amp;#34;Taylor Alison Swift is an American 
            singer-songwriter. Her narrative songwriting, which often takes 
            inspiration from her personal life, has received widespread 
            critical praise and media coverage.\n&amp;#34; ;
        &amp;lt;http://schema.org/url&amp;gt;  &amp;#34;https://en.wikipedia.org/wiki/Taylor_Swift&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/Freebase_(database)&#34;&gt;Wikipedia page&lt;/a&gt; for the now-defunct Freebase database tells us that &amp;ldquo;On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to be a replacement to the Freebase API&amp;rdquo;, so I&amp;rsquo;ve been missing out on this for a while. The Taylor Swift data above includes an interesting bit of the Freebase legacy: the local name of the URI used to represent her as a resource in the Google Knowledge Graph is &lt;code&gt;m/0dl567&lt;/code&gt;, which we can see on her &lt;a href=&#34;https://www.wikidata.org/wiki/Q26876&#34;&gt;Wikidata page&lt;/a&gt; was the identifier that Freebase used for her. For people, places, and things that were not represented in Freebase at the time that Freebase shut down in 2016 (for example, &lt;a href=&#34;https://www.wikidata.org/wiki/Q62591281&#34;&gt;Lil Nas X&lt;/a&gt;, whose Wikidata page shows no Freebase identifier and says that he has been active since 2018) I assume that some Google algorithm just generates new identifiers in their Knowledge Graph.&lt;/p&gt;
&lt;h1 id=&#34;more-query-api-options&#34;&gt;More query API options&lt;/h1&gt;
&lt;p&gt;You can pick apart the URL with the Taylor Swift query and then reassemble it with new pieces using the Google Knowledge Graph &lt;a href=&#34;https://developers.google.com/knowledge-graph/reference/rest/v1&#34;&gt;API Reference&lt;/a&gt;. For instance, that query has a &lt;code&gt;limit&lt;/code&gt; value of 1, but the API reference tells us that this can be up to 500, with a default value of 20. The reference page also includes a form you can fill out with sample API call parameters to learn about them more interactively than you would by revising a curl command over and over.&lt;/p&gt;
&lt;p&gt;A more interesting option for the query URL is &lt;code&gt;types&lt;/code&gt;, which lets you limit your search to entities of one or more specified &lt;a href=&#34;https://developers.google.com/knowledge-graph#knowledge_graph_entities&#34;&gt;schema.org types&lt;/a&gt;. For example, a query that uses parameters of &lt;code&gt;query=charles+schwab&amp;amp;types=Corporation&lt;/code&gt; returns information about the company with that name, but &lt;code&gt;query=charles+schwab&amp;amp;types=Person&lt;/code&gt; returns information about its founder. (Because &lt;code&gt;types&lt;/code&gt; is plural, you can also specify a comma-delimited list as that parameter&amp;rsquo;s value.)&lt;/p&gt;
&lt;p&gt;With no &lt;code&gt;limit&lt;/code&gt; parameter in the URL, the query about Charles Schwab the person actually returned eight people: Charles R. Schwab, the founder of the financial services firm; Pennsylvania steel magnate Charles M. Schwab; Émile Martin Charles Schwabe, a Swiss Symbolist painter and printmaker; and five other people.&lt;/p&gt;
&lt;p&gt;This brings me to a few triples returned by my command line above that I didn&amp;rsquo;t show in the Taylor Swift example. Because the request sends a query to Google, just like a search entered at &lt;a href=&#34;https://www.google.com&#34;&gt;www.google.com&lt;/a&gt;, the server actually returns a list of search results. Here is the beginning of the Turtle version of the search result for the person Charles Schwab:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;_:b0    &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt;  &amp;lt;http://schema.org/ItemList&amp;gt; ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b1 ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b2 ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b3 ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b4 ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b5 ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b6 ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b7 ;
        &amp;lt;http://schema.org/itemListElement&amp;gt;  _:b8 .

_:b1    &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt; goog:EntitySearchResult ;
        goog:resultScore            1.105882568359375E3 ;
        &amp;lt;http://schema.org/result&amp;gt;  &amp;lt;http://g.co/kg/m/028lhc&amp;gt; .

&amp;lt;http://g.co/kg/m/028lhc&amp;gt;
        &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt; &amp;lt;http://schema.org/Person&amp;gt; ;
        &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt; &amp;lt;http://schema.org/Thing&amp;gt; ;
        goog:detailedDescription  _:b9 ;
        &amp;lt;http://schema.org/description&amp;gt;  &amp;#34;American magnate&amp;#34; ;
        &amp;lt;http://schema.org/name&amp;gt;  &amp;#34;Charles M. Schwab&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The first instance in the data is an item list. This points at instances of &lt;code&gt;EntitySearchResult&lt;/code&gt;; the first of these has the blank node &lt;code&gt;_:b1&lt;/code&gt; as its identifier. This search result points to information about the steel magnate, which identifies him with his &lt;a href=&#34;https://www.wikidata.org/wiki/Q365218&#34;&gt;Freebase ID&lt;/a&gt;, and it also has a search result score.&lt;/p&gt;
&lt;p&gt;The API documentation tells us that the result score is &amp;ldquo;an indicator of how well the entity matched the request constraints&amp;rdquo;. I imagine that this is not simply a score of string similarity but also takes into account the popularity of each search result—otherwise, I don&amp;rsquo;t know how the result score would be 12 for financial services firm founder Charles R. Schwab, 1.1 for steel magnate Charles M. Schwab, and 6 for Swiss symbolist Émile Martin Charles Schwabe.&lt;/p&gt;
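&lt;p&gt;A side note on reading those scores: riot serializes them as &lt;code&gt;xsd:double&lt;/code&gt; values in E notation, which any programming language&amp;rsquo;s float parser can turn into ordinary numbers. For example, in Python:&lt;/p&gt;

```python
# The resultScore literals use the same E notation as most languages' doubles,
# so converting Charles M. Schwab's score from the Turtle above is one call.
score = float("1.105882568359375E3")
print(round(score, 2))  # 1105.88
```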
&lt;h1 id=&#34;linking-that-data&#34;&gt;Linking that data&lt;/h1&gt;
&lt;p&gt;The Google Knowledge Graph API doesn&amp;rsquo;t return a large amount of data for each entity, but when you have the Freebase ID, you can use it to retrieve additional data about that entity from Wikidata. The following simple little Wikidata query (try it &lt;a href=&#34;https://query.wikidata.org/#CONSTRUCT%20%7B%3Fs%20%3Fp%20%3Fo%20%7D%20WHERE%20%7B%0A%20%20%20%3Fs%20wdtn%3AP646%20%3Chttp%3A%2F%2Fg.co%2Fkg%2Fm%2F028lhc%3E%20%3B%0A%20%20%20%20%20%20%3Fp%20%3Fo%20.%0A%20%20%7D&#34;&gt;here&lt;/a&gt;) uses the Freebase ID that we saw above for steel magnate Charles Schwab to pull down 140 triples about him from Wikidata:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;CONSTRUCT {?s ?p ?o } WHERE {
   ?s wdtn:P646 &amp;lt;http://g.co/kg/m/028lhc&amp;gt; ;
      ?p ?o .
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h1 id=&#34;exploring-for-more-data&#34;&gt;Exploring for more data&lt;/h1&gt;
&lt;p&gt;The Google Knowledge Graph API includes a boolean &lt;code&gt;prefix&lt;/code&gt; parameter that &amp;ldquo;[e]nables prefix (initial substring) match against names and aliases of entities&amp;rdquo;. The following asks for all entities of type &lt;code&gt;MusicGroup&lt;/code&gt; whose name begins with &amp;ldquo;bea&amp;rdquo;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;curl \
  &amp;#34;https://kgsearch.googleapis.com/v1/entities:search?prefix=true&amp;amp;query=bea&amp;amp;types=MusicGroup&amp;amp;limit=500&amp;amp;key=API-KEY&amp;#34; \
  | riot --syntax=JSONLD &amp;gt; beagroups.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The 481 results included the Beatles, the Beach Boys, and the Beastie Boys, as I expected.&lt;/p&gt;
&lt;p&gt;I was wondering if a sorted list of result scores would reveal any pattern, and then I realized, duh, I can write a SPARQL query to do that; it&amp;rsquo;s why I pulled the data as triples! (I could execute a &lt;a href=&#34;../json-ld/&#34;&gt;query against the JSON-LD&lt;/a&gt;, but I prefer to work with Turtle because it&amp;rsquo;s easier to read.)&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX s: &amp;lt;http://schema.org/&amp;gt;

SELECT ?resultScore ?bandName WHERE {
  ?result      &amp;lt;http://schema.googleapis.com/resultScore&amp;gt; ?resultScore ;
               s:result ?musicGroup .
  ?musicGroup  s:name ?bandName . 
}
ORDER BY DESC(?resultScore)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here are the first few results when running this query against the RDF of &amp;ldquo;bea&amp;rdquo; music groups that the curl command above pulled down:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;---------------------------------------------------------------------------------------------
| resultScore         | bandName                                                            |
=============================================================================================
| 2.518089111328125E3 | &amp;#34;The Beatles&amp;#34;                                                       |
| 3.5488818359375E2   | &amp;#34;Beastie Boys&amp;#34;                                                      |
| 1.969714050292969E2 | &amp;#34;Beak&amp;#34;                                                              |
| 1.761080169677734E2 | &amp;#34;Beatrice&amp;#34;                                                          |
| 1.361932220458984E2 | &amp;#34;Brooklyn Bounce&amp;#34;                                                   |
| 1.338223876953125E2 | &amp;#34;Battle Beast&amp;#34;                                                      |
| 1.335271911621094E2 | &amp;#34;Beatsteaks&amp;#34;                                                        |
| 1.331909942626953E2 | &amp;#34;Beartooth&amp;#34;                                                         |
| 1.256562881469727E2 | &amp;#34;Beady Belle&amp;#34;                                                       |
| 1.212170104980469E2 | &amp;#34;Beatfreakz&amp;#34;                                                        |
| 1.101853561401367E2 | &amp;#34;The Trammps&amp;#34;                                                       |
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(Yes, the Trammps, of &lt;a href=&#34;https://www.youtube.com/watch?v=BPV6kpNnr3c&#34;&gt;Disco Inferno&lt;/a&gt; fame.) The Beach Boys ranked at 111, well below many groups I&amp;rsquo;ve never heard of, some of which, like the Trammps, didn&amp;rsquo;t even have &amp;ldquo;bea&amp;rdquo; anywhere in their name: Vansire? The Parlotones? Turbotronic?&lt;/p&gt;
&lt;p&gt;The ability to pull typed data directly from Google&amp;rsquo;s Knowledge Graph is pretty great, especially since we can link much of that data to other good data sources. I had considered titling this blog entry “Piping data to stdin of Jena’s riot utility” (talk about your clickbait!) but as you can see decided to go with the Knowledge Graph angle—not because this term is a popular way to talk about graph databases in general, but because we&amp;rsquo;re pulling data from the graph that Google itself is calling a Knowledge Graph.&lt;/p&gt;
&lt;p&gt;Still, the ability to feed data to riot via stdin is pretty nice; it handles a key handoff in this trick the old-fashioned UNIX way, with small tools piped together. Assembled like this, these pieces make it easier to incorporate Google Knowledge Graph data into the wide range of RDF-based tools out there, and I expect that combination to have many great applications.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/knowledge-graphs">knowledge-graphs</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
      <category domain="https://www.bobdc.com//categories/linked-data">linked-data</category>
      
    </item>
    
    <item>
      <title>Linking different knowledge graphs together</title>
      <link>https://www.bobdc.com/blog/linkingkgs/</link>
      <pubDate>Sun, 28 Feb 2021 11:10:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/linkingkgs/</guid>
      
      
      <description><div>Really linking them, not doing ETL.</div><div>&lt;p&gt;Lately I&amp;rsquo;ve been thinking about some aspects of RDF technology that I have taken for granted as basic building blocks of dataset design but that Knowledge Graph fans who are new to RDF may not be fully aware of—especially when they compare RDF to alternative ways to build knowledge graphs. A key building block is the ability to link independently created knowledge graphs.&lt;/p&gt;
&lt;blockquote id=&#34;id202592&#34; class=&#34;pullquote&#34;&gt;...it gives a better idea of what the “semantic web” was about: the world-wide linking of, not just documents, but (in more 2021 terminology) knowledge graphs.&lt;/blockquote&gt;
&lt;p&gt;For a little historical perspective: before Tim Berners-Lee invented the web, hypertext systems were all very closed systems. A &lt;a href=&#34;https://en.wikipedia.org/wiki/Storyspace&#34;&gt;Storyspace&lt;/a&gt; story (one of which I still own on a three-and-a-half-inch floppy disk) could not link to an Apple &lt;a href=&#34;https://en.wikipedia.org/wiki/HyperCard&#34;&gt;HyperCard&lt;/a&gt; &amp;ldquo;stack&amp;rdquo;, and a HyperCard stack could not link to a Storyspace story. The World Wide Web let any hypertext page anywhere in the world link to any other, and just look how far that has scaled.&lt;/p&gt;
&lt;p&gt;Imagine that you and I want to create relational data and use it in the same SQL system. We can&amp;rsquo;t just go off and each define our own database schema and expect our two databases to work together. The design work must be coordinated so that our respective contributions are essentially designed as a single system. Otherwise, the data from your system must be read (Extracted from your system), converted to be compatible with my system (Transformed), and then Loaded into my system—a process known in the industry as &lt;a href=&#34;https://en.wikipedia.org/wiki/Extract,_transform,_load&#34;&gt;ETL&lt;/a&gt;. If the data in your system later gets updated, my system&amp;rsquo;s users won&amp;rsquo;t know it until we repeat the whole process or invoke some custom ETL process to identify and retrieve the new parts.&lt;/p&gt;
&lt;p&gt;This was never the case with independently designed web pages, because anyone&amp;rsquo;s page could link to anyone else&amp;rsquo;s web page, and it&amp;rsquo;s not the case with RDF knowledge graphs. If I make one available on the public Internet, you can connect yours to mine so that as your and my datasets evolve, the connections themselves can remain the same but you&amp;rsquo;ll gain the benefits of the updated datasets. If we&amp;rsquo;re using different identifiers to refer to the same things, a little modeling can be part of the connection to indicate which things are the same, and then you&amp;rsquo;re off and running using the two datasets as one knowledge graph.&lt;/p&gt;
&lt;p&gt;The format of RDF graph node identifiers follows a &lt;a href=&#34;https://tools.ietf.org/html/rfc3986&#34;&gt;published IETF standard&lt;/a&gt;. The identifiers themselves are universally unique (as with Java package names, they&amp;rsquo;re built off of domain names, which lets domain owners establish their own naming conventions), so your ability to reference one of my graph nodes from your data means that a link from your data to mine will work very simply. This was the “linked” part of &lt;a href=&#34;https://en.wikipedia.org/wiki/Linked_data&#34;&gt;Linked Data&lt;/a&gt;, and getting back to the once-revolutionary possibility of any hypertext document linking to any other hypertext document, it gives a better idea of what the “semantic web” was about: the world-wide linking of, not just documents, but (in more 2021 terminology) knowledge graphs, especially when modeling of the graph can be part of the graph itself.&lt;/p&gt;
&lt;p&gt;(Just to whet your appetite, I&amp;rsquo;m going to demonstrate all of this below by linking a new graph of the Beatles&amp;rsquo; favorite drinks and sports to a remote graph I made several years ago about who played what instruments on which songs.)&lt;/p&gt;
&lt;p&gt;There are two basic steps to linking your knowledge graph to someone else&amp;rsquo;s. As with HTML documents, you don&amp;rsquo;t need any kind of permission or cooperation from  anyone on the destination system to make the link if that destination is available via HTTP.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Either use the same resource identifiers that the graph you are linking to does or add some modeling that maps your identifiers to theirs.&lt;/li&gt;
&lt;li&gt;Use a SPARQL query (and implicitly, the SPARQL protocol—the &amp;ldquo;P&amp;rdquo; in SPARQL that defines a standard way to transmit queries and results back and forth) to ask an endpoint for a graph subset meeting the conditions for your application.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The nodes of a graph need identifiers, and some non-RDF graph storage systems keep these under the covers to make your life simpler. If you can see these identifiers, though, both in your own graphs and in the graphs of others, it&amp;rsquo;s easier to identify nodes in the different graphs and create connections between them, connecting their host graphs into a larger graph. (And, RDF URI identifiers are no more difficult to read than URLs, which aren&amp;rsquo;t that difficult to read&amp;hellip; unless you&amp;rsquo;re in SharePoint world, in which case you have my sympathy.) If you know that two graphs use different identifiers for the same resource, your own data model can assert that both identifiers reference the same resource—with the data model statements just being additional edges on your own graph—and then standards-compliant (often free!) software can then take advantage of those assertions.&lt;/p&gt;
&lt;p&gt;To quote Pascal Hitzler&amp;rsquo;s recent Communications of the ACM article &lt;a href=&#34;https://cacm.acm.org/magazines/2021/2/250085-a-review-of-the-semantic-web-field/fulltext&#34;&gt;A Review of the Semantic Web Field&lt;/a&gt; (which uses the abbreviation IRI for &amp;ldquo;Internationalized Resource Identifiers&amp;rdquo;, a superset of URIs that allow a broader range of character choices),&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is usually associated with the term &amp;ldquo;linked data&amp;rdquo; is that linked data consists of a (by now rather large) set of RDF graphs that are linked in the sense that many IRI identifiers in the graphs appear also in other, sometimes multiple, graphs. In a sense, the collection of all these linked RDF graphs can be understood as one very big RDF graph.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1 id=&#34;retrieving-remote-data-from-a-sparql-endpoint&#34;&gt;Retrieving remote data from a SPARQL endpoint&lt;/h1&gt;
&lt;p&gt;A SPARQL query can use the SERVICE keyword to request data from another graph via an endpoint. It can then combine the retrieved data with local data and use the combination as a larger graph with more helpful connections than the local data has. For example, let&amp;rsquo;s say that after reviewing the website &lt;a href=&#34;http://www.beatlesinterviews.org&#34;&gt;The Beatles Interview Database&lt;/a&gt; you&amp;rsquo;ve compiled the following data about the Fab Four, and you have it stored locally in a file called &lt;code&gt;BeatlesFaves.ttl&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix b: &amp;lt;http://www.bobdc.com/ns/beatles/&amp;gt; .
@prefix wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .

# Sources:
# http://www.beatlesinterviews.org/db1964.0614b.beatles.html
# http://www.beatlesinterviews.org/db1964.0906.beatles.html

wd:Q2632 rdfs:label &amp;#34;Ringo Starr&amp;#34; ;
    b:favoriteDrink &amp;#34;bourbon&amp;#34; ;
    b:favoriteBritishGroup &amp;#34;Rolling Stones&amp;#34; . 

wd:Q1203 rdfs:label &amp;#34;John Lennon&amp;#34; ;
    b:favoriteBritishGroup &amp;#34;Rolling Stones&amp;#34; . 

wd:Q2599 rdfs:label &amp;#34;Paul McCartney&amp;#34; ;
    b:favoriteBritishGroup &amp;#34;The Searchers&amp;#34; ;
    b:favoriteDrink &amp;#34;scotch and Coke&amp;#34; . 

wd:Q2643 rdfs:label &amp;#34;George Harrison&amp;#34; ;
    b:favoriteBritishGroup &amp;#34;The Animals&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This data uses the Wikidata identifiers to identify the individual Beatles so that the data will more easily integrate with other data that may use these identifiers—just as Pascal described above—such as Wikidata itself. (The Wikidata identifiers are easy to find; just look for &amp;ldquo;Wikidata item&amp;rdquo; on the left side of any Wikipedia page, such as &lt;a href=&#34;https://en.wikipedia.org/wiki/Ringo_Starr&#34;&gt;Ringo&amp;rsquo;s&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;You have learned that several years ago some guy (OK, me) published an RDF graph of data at &lt;a href=&#34;http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl&#34;&gt;http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl&lt;/a&gt; about who played what instruments on which Beatles songs. (The creation of this dataset is described at &lt;a href=&#34;http://www.bobdc.com/blog/sparql-queries-of-beatles-reco/&#34;&gt;SPARQL queries of Beatles recording sessions&lt;/a&gt; along with some fun queries.) The URIs that identify the musicians in that dataset are built from the musicians&amp;rsquo; names instead of being taken from Wikidata. (There were so many musicians that I didn&amp;rsquo;t want to look all of them up in Wikidata manually, and some have names that are &lt;a href=&#34;https://en.wikipedia.org/wiki/Chris_Thomas&#34;&gt;common enough&lt;/a&gt; that automating the lookup wouldn&amp;rsquo;t have worked too well.)&lt;/p&gt;
&lt;p&gt;To show that &lt;code&gt;wd:Q2632&lt;/code&gt; from one graph is the same as &lt;code&gt;m:RingoStarr&lt;/code&gt; from the other I created a triple using &lt;code&gt;owl:sameAs&lt;/code&gt;. This predicate basically says &amp;ldquo;all facts about each of these two resources are true for the other one, so they are effectively the same resource&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;My use of an OWL predicate required me to use a SPARQL processor that could handle more than RDFS. (See &lt;a href=&#34;http://www.bobdc.com/blog/partialschemas/&#34;&gt;Transforming data with inferencing and (partial!) schemas&lt;/a&gt; for examples of RDFS inferencing as part of a graph processing pipeline.) I only needed a little more than RDFS; &amp;ldquo;RDFS Plus&amp;rdquo; is a non-standard superset that adds &lt;code&gt;owl:sameAs&lt;/code&gt; support and a few other useful OWL bits to a SPARQL processor without committing to a full implementation of one of the &lt;a href=&#34;https://www.w3.org/TR/owl2-primer/#OWL_2_Profiles&#34;&gt;official OWL Profiles&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To get this  &lt;code&gt;owl:sameAs&lt;/code&gt; support I used the free version of the GraphDB triplestore, which I&amp;rsquo;ve also &lt;a href=&#34;http://www.bobdc.com/blog/geosparqlgraphdb/&#34;&gt;used recently&lt;/a&gt; because of its GeoSPARQL support. &amp;ldquo;RDFS Support&amp;rdquo; is something you select when creating a GraphDB repository, so I did that and unchecked GraphDB&amp;rsquo;s &amp;ldquo;Disable owl:SameAs&amp;rdquo; checkbox. (I&amp;rsquo;m guessing that this checkbox is available because overuse of &lt;code&gt;owl:sameAs&lt;/code&gt; can use a lot of computing cycles.)&lt;/p&gt;
&lt;p&gt;After loading the &lt;code&gt;BeatlesFaves.ttl&lt;/code&gt; data above, I loaded the following &lt;code&gt;mapToWikidata.ttl&lt;/code&gt; file:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix m:   &amp;lt;http://learningsparql.com/ns/musician/&amp;gt; .
@prefix wd:  &amp;lt;http://www.wikidata.org/entity/&amp;gt; .
@prefix owl: &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .

wd:Q2632 owl:sameAs m:RingoStarr  . 
wd:Q1203 owl:sameAs m:JohnLennon . 
wd:Q2599 owl:sameAs m:PaulMcCartney .
wd:Q2643 owl:sameAs m:GeorgeHarrison .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After doing this, a query of this repository for all the triples showed statements like &lt;code&gt;{m:GeorgeHarrison rdfs:label &amp;quot;George Harrison&amp;quot;}&lt;/code&gt;, which was not a triple in either of the loaded knowledge graphs but was inferred from the combination, so I knew I was all set.&lt;/p&gt;
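&lt;p&gt;To picture what that inference is doing, here is a toy sketch (my own illustration, not GraphDB&amp;rsquo;s actual machinery) of one pass of &lt;code&gt;owl:sameAs&lt;/code&gt; reasoning: copy every statement about a resource over to each of that resource&amp;rsquo;s &lt;code&gt;owl:sameAs&lt;/code&gt; partners, in both directions.&lt;/p&gt;

```python
# Toy illustration of one pass of owl:sameAs inference over triples stored
# as (subject, predicate, object) tuples. Real reasoners also handle
# transitive chains of sameAs; this sketch only copies facts across
# directly linked pairs.
SAME_AS = "owl:sameAs"

def same_as_copy(triples):
    """Return the input triples plus copies rewritten across owl:sameAs pairs."""
    aliases = {}
    for s, p, o in triples:
        if p == SAME_AS:
            # sameAs is symmetric, so record the mapping both ways.
            aliases.setdefault(s, set()).add(o)
            aliases.setdefault(o, set()).add(s)
    inferred = set(triples)
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        for alias in aliases.get(s, ()):
            inferred.add((alias, p, o))
        for alias in aliases.get(o, ()):
            inferred.add((s, p, alias))
    return inferred

facts = {
    ("wd:Q2643", "rdfs:label", "George Harrison"),
    ("wd:Q2643", "owl:sameAs", "m:GeorgeHarrison"),
}
```

&lt;p&gt;Running &lt;code&gt;same_as_copy(facts)&lt;/code&gt; on those two statements yields the extra triple &lt;code&gt;{m:GeorgeHarrison rdfs:label &amp;quot;George Harrison&amp;quot;}&lt;/code&gt; described above.&lt;/p&gt;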
&lt;h1 id=&#34;the-sparql-query&#34;&gt;The SPARQL query&lt;/h1&gt;
&lt;p&gt;I could have read the &lt;a href=&#34;http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl&#34;&gt;http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl&lt;/a&gt; file into GraphDB just like I read in &lt;code&gt;BeatlesFaves.ttl&lt;/code&gt; and &lt;code&gt;mapToWikidata.ttl&lt;/code&gt;, but that would be the old-fashioned ETL approach, where querying across datasets is really a query of a single dataset created by copying them all into one place. What if the remote dataset got updated with the names of the cellists on &amp;ldquo;The Long and Winding Road&amp;rdquo;, which are currently &lt;a href=&#34;https://www.beatlesbible.com/songs/the-long-and-winding-road/&#34;&gt;not there&lt;/a&gt;? I would have to either identify the new triples added to the remote data or reload the whole remote dataset. Instead of reading in the entire remote dataset, I would rather read the data that I need from it dynamically at query time to make sure that I had the most recent data.&lt;/p&gt;
&lt;p&gt;I can do this with SPARQL&amp;rsquo;s SERVICE keyword. This specifies the URL of a SPARQL endpoint and a query to send to it. The following query uses this keyword to find out the favorite British band of the bass player from &amp;ldquo;The Long and Winding Road&amp;rdquo;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 
PREFIX i:    &amp;lt;http://learningsparql.com/ns/instrument/&amp;gt;
PREFIX s:    &amp;lt;http://learningsparql.com/ns/schema/&amp;gt; 
PREFIX b:    &amp;lt;http://www.bobdc.com/ns/beatles/&amp;gt;

SELECT ?britishGroup
WHERE { ?bassist b:favoriteBritishGroup ?britishGroup .
  SERVICE &amp;lt;https://dydra.com/bobdc/beatles-musicians/sparql&amp;gt;
  { SELECT ?bassist 
    WHERE { ?song a s:Song ; 
                  rdfs:label &amp;#34;The Long And Winding Road&amp;#34; ;
            i:bass ?bassist .
          }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(Fun fact I had never noticed before: John plays bass on that.) For this demo, I stored the data on &lt;a href=&#34;https://dydra.com/about&#34;&gt;Dydra&lt;/a&gt;, which made it very easy to create a free account, upload data, make it available via a SPARQL endpoint, and then set user access levels for that data. The data can be maintained on Dydra easily enough, so a call to a Dydra endpoint really is retrieval from a dynamic database.&lt;/p&gt;
&lt;p&gt;The inner query above asks the remote data about the song&amp;rsquo;s bass player, binding the URI representing the bassist to the &lt;code&gt;?bassist&lt;/code&gt; variable.  The outer query then asks for the favorite British group of this bass player, which turned out to be the Rolling Stones.&lt;/p&gt;
&lt;p&gt;Note that the &lt;code&gt;?bassist&lt;/code&gt; variable will store the identifier &lt;code&gt;http://learningsparql.com/ns/musician/JohnLennon&lt;/code&gt;, while the locally-stored data says that the Stones were the favorite British band of resource &lt;code&gt;wd:Q1203&lt;/code&gt;. That&amp;rsquo;s why I added the modeling triple &lt;code&gt;wd:Q1203 owl:sameAs m:JohnLennon&lt;/code&gt; and used GraphDB, a triplestore that handles &lt;code&gt;owl:sameAs&lt;/code&gt; as part of the RDFS superset that it implements. Remember, not all triplestores do, so that&amp;rsquo;s something to think about when planning an application.&lt;/p&gt;
&lt;p&gt;This ability to send a subquery off to a remote system and then use the result locally is an important aspect of both the SPARQL query language and the SPARQL protocol, which has its &lt;a href=&#34;https://www.w3.org/TR/sparql11-protocol/&#34;&gt;own standardized specification&lt;/a&gt;. When you consider different systems that may play roles in building and using knowledge graphs, keep in mind that SPARQL&amp;rsquo;s mechanics for tying local and remote data together are both standardized and widely implemented.&lt;/p&gt;
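&lt;p&gt;For the curious, the protocol part is plain HTTP. Here is a hedged sketch, using only Python&amp;rsquo;s standard library, of the request that a client sends to an endpoint. The endpoint URL is the Dydra one from the query above; actually opening the request requires network access, so this just builds it.&lt;/p&gt;

```python
# Sketch of a SPARQL protocol request: the query travels as an
# application/x-www-form-urlencoded POST body, and the Accept header
# asks for a standard results format.
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://dydra.com/bobdc/beatles-musicians/sparql"

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX i:    <http://learningsparql.com/ns/instrument/>
PREFIX s:    <http://learningsparql.com/ns/schema/>
SELECT ?bassist WHERE {
  ?song a s:Song ;
        rdfs:label "The Long And Winding Road" ;
        i:bass ?bassist .
}
"""

req = Request(
    ENDPOINT,
    data=urlencode({"query": query}).encode("utf-8"),
    headers={"Accept": "application/sparql-results+json"},
)
# urllib.request.urlopen(req) would return the JSON results document.
```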
&lt;p&gt;(A background note: I had also planned to show another query that retrieved the recording session data from &lt;a href=&#34;http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl&#34;&gt;http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl&lt;/a&gt; using SPARQL&amp;rsquo;s FROM keyword. When I investigated why some SPARQL processors did retrieve remote RDF files specified by this keyword and some didn&amp;rsquo;t, I learned that as a &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#security&#34;&gt;security consideration&lt;/a&gt; this retrieval is not required.)&lt;/p&gt;
&lt;p&gt;SPARQL&amp;rsquo;s ability to link together different RDF knowledge graphs—even when those graphs aren&amp;rsquo;t necessarily using the same identifiers to refer to the same resources—provides another huge benefit: it reduces the need for large complex schemas (typically, ontologies) to create useful knowledge graphs. Imagine that I create a small RDF knowledge graph that achieves certain goals, and then you create another that achieves different goals, and then a third person realizes that these two graphs are both related to the application that she is working on. Ideally, you and I would have each included a schema (which is just more triples!) listing the classes and properties we used; even small schemas would help people like this third person take advantage of our datasets. Whether we made schemas available or not, though, she can use the technique described above to connect the two graphs into a whole that is greater than the sum of its parts, growing into a larger knowledge graph the way that the collection of HTML pages available via HTTP has grown into the World Wide Web since 1993.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/knowledge-graphs">knowledge-graphs</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">rdf/owl</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic-web</category>
      
      <category domain="https://www.bobdc.com//categories/linked-data">linked-data</category>
      
    </item>
    
    <item>
      <title>Transforming data with inferencing and (partial!) schemas</title>
      <link>https://www.bobdc.com/blog/partialschemas/</link>
      <pubDate>Sun, 24 Jan 2021 11:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/partialschemas/</guid>
      
      
      <description><div>An excellent compromise between schemas and &#34;schemaless&#34; development.</div><div>&lt;blockquote id=&#34;id202455&#34; class=&#34;pullquote&#34;&gt;If you’re working with more than one RDF dataset, then the use of RDFS to identify little subsets of those datasets and to specify relationships between components of those subsets can help your knowledge graph and the applications that use it become useful a lot sooner.&lt;/blockquote&gt;
&lt;p&gt;I originally planned to title this &amp;ldquo;Partial schemas!&amp;rdquo; but as I assembled the example I realized that in addition to demonstrating the value of partial, incrementally-built schemas, the steps shown below also show how inferencing with schemas can implement transformations that are very useful in data integration. In the right situations this can be even better than SPARQL, because instead of using code—whether procedural or declarative—the transformation is driven by the data model itself.&lt;/p&gt;
&lt;p&gt;Also, the models are &lt;a href=&#34;https://www.w3.org/TR/rdf-schema/&#34;&gt;RDF Schemas&lt;/a&gt;, also known as RDFS. When people talk about RDF inferencing, they&amp;rsquo;re often talking about some of the more advanced inferencing that the superset (actually, &lt;a href=&#34;https://www.w3.org/TR/owl2-primer/#OWL_2_Profiles&#34;&gt;supersets&lt;/a&gt;) of RDFS known as OWL can do. Many people don&amp;rsquo;t realize how much you can do with simple RDFS inferencing.&lt;/p&gt;
&lt;h1 id=&#34;schema-no-schema-or-some-schema&#34;&gt;Schema, no schema, or&amp;hellip; some schema&lt;/h1&gt;
&lt;p&gt;For most of the history of data processing on computers, people needed to spell out the structure of their data before they could actually start accumulating data. For example, when using relational databases, you can&amp;rsquo;t add a row to a table unless you (or someone) has already specified all the columns that are going to be in that table and all of their types. In fact, you probably had to do this for all the database&amp;rsquo;s other tables as well, because the tables aren&amp;rsquo;t really ready until their relationships have all been straightened out through the process of &lt;a href=&#34;https://en.wikipedia.org/wiki/Database_normalization&#34;&gt;normalization&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The rise of NoSQL databases—especially &lt;a href=&#34;https://www.mongodb.com/blog/post/why-schemaless&#34;&gt;MongoDB&lt;/a&gt;—and the fact that schemas were optional for XML got developers excited about the ability to add any data of any structure they wished to a dataset. Since then, blogs have been full of debates about the value of developing with vs. without schemas. Not enough people appreciate the wonderful compromise offered by RDF knowledge graphs, where partial schemas can give you the best of both worlds, so I wanted to demonstrate that.&lt;/p&gt;
&lt;h1 id=&#34;finding-a-big-mess-of-rdf&#34;&gt;Finding a big mess of RDF&lt;/h1&gt;
&lt;p&gt;I wanted to start with an RDF dataset that was bigger and more complex than I needed so that I could show how a schema for just a subset of it could help to get only the parts that I wanted. On a page for the &lt;a href=&#34;https://serpapi.com/youtube-search-api&#34;&gt;YouTube offering&lt;/a&gt; of a &lt;a href=&#34;https://serpapi.com&#34;&gt;search engine API company&lt;/a&gt; I found a 26K sample of JSON that their API would return on a search for &amp;ldquo;Star Wars&amp;rdquo;, so I used AtomGraph&amp;rsquo;s &lt;a href=&#34;http://www.bobdc.com/blog/json2rdf/&#34;&gt;JSON2RDF&lt;/a&gt; to convert that to an RDF file that I called &lt;a href=&#34;http://bobdc.com/miscfiles/ytstarwars.ttl&#34;&gt;ytstarwars.ttl&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This turned the JSON&amp;rsquo;s unnamed containers into a lot of triples with blank nodes in the RDF. The structural relationships of these triples were easier to see after a look at the JSON, which had some header data and then an array named &lt;code&gt;movie_results&lt;/code&gt; with JSON objects about movies that each had &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, and several other properties. After that was a similar array named &lt;code&gt;video_results&lt;/code&gt; that had objects with &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; properties and others (not identical to the movie object properties) that I didn&amp;rsquo;t care about.&lt;/p&gt;
&lt;p&gt;After telling JSON2RDF to use a base URI of &lt;code&gt;http://bobdc.com/ns/pschemademo&lt;/code&gt; in its output I got a lot of RDF triples with predicates of &lt;code&gt;http://bobdc.com/ns/pschemademo#video_results&lt;/code&gt; (hereafter, &lt;code&gt;t:video_results&lt;/code&gt;), blank node subjects that represented the array, and blank node objects that represented the videos themselves. To describe the individual videos, the RDF included triples with these videos as subjects and predicate-object pairs like (&lt;code&gt;t:title&lt;/code&gt; &amp;ldquo;Star Wars: The Empire Strikes Back&amp;rdquo;) and (&lt;code&gt;t:link&lt;/code&gt; &amp;ldquo;&lt;a href=&#34;https://www.youtube.com/watch?v=Ooh3k8cJDBg&#34;&gt;https://www.youtube.com/watch?v=Ooh3k8cJDBg&lt;/a&gt;&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;It sounds messy but isn&amp;rsquo;t too bad when you flip back and forth between the JSON and the RDF that JSON2RDF created. The fun part was transforming this RDF into something simpler and cleaner—not with the SPARQL CONSTRUCT queries that I would typically use to turn one set of RDF into another, but with a schema and inferencing.&lt;/p&gt;
&lt;h1 id=&#34;transforming-with-a-schema&#34;&gt;Transforming with a schema&lt;/h1&gt;
&lt;p&gt;I started the schema with just this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# pschema1.ttl
@prefix t:    &amp;lt;http://bobdc.com/ns/pschemademo#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;  .

t:Video a rdfs:Class .
t:title rdfs:domain t:Video .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The first triple declares &lt;code&gt;t:Video&lt;/code&gt; to be a class.&lt;/p&gt;
&lt;p&gt;We often use the &lt;code&gt;rdfs:domain&lt;/code&gt; property to say &amp;ldquo;this property is associated with this class&amp;rdquo;, which is a typical thing to do in a data model, but the second triple above actually does more than that: it says that if a resource has a &lt;code&gt;t:title&lt;/code&gt; property, then an inferencing parser should infer that this resource is an instance of the &lt;code&gt;t:Video&lt;/code&gt; class. (Or, in triple terms: if the parser finds a triple with &lt;code&gt;t:title&lt;/code&gt; as its predicate, then infer a new triple saying that the found triple&amp;rsquo;s subject is an instance of the specified class.)&lt;/p&gt;
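&lt;p&gt;(A toy sketch of that rule in plain Python, treating triples as tuples; the &lt;code&gt;infer_domain&lt;/code&gt; function and the shortened prefixed names are mine for illustration, not part of any real RDFS engine:)&lt;/p&gt;

```python
# Toy illustration of the rdfs:domain rule (not a real RDFS engine).
# Triples are (subject, predicate, object) tuples; the prefixed names
# stand in for the full URIs used in the post.

def infer_domain(data, schema):
    """If the schema says (p, rdfs:domain, C), then every data triple
    (s, p, o) lets us infer (s, rdf:type, C) about its subject."""
    domains = {p: c for (p, pred, c) in schema if pred == "rdfs:domain"}
    return {(s, "rdf:type", domains[p]) for (s, p, o) in data if p in domains}

schema = {("t:title", "rdfs:domain", "t:Video")}
data = {("_:b1", "t:title", "LEGO Star Wars The Build Zone")}
print(infer_domain(data, schema))  # {('_:b1', 'rdf:type', 't:Video')}
```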
&lt;p&gt;Several of the &lt;a href=&#34;https://github.com/stain/jena-docker/tree/master/jena&#34;&gt;command line utilities&lt;/a&gt; that come with &lt;a href=&#34;https://jena.apache.org/&#34;&gt;Apache Jena&lt;/a&gt; let you use an &lt;code&gt;--rdfs&lt;/code&gt; switch to point to a vocabulary file of triples to use for inferencing. Here&amp;rsquo;s how I used Jena&amp;rsquo;s &lt;code&gt;riot&lt;/code&gt; utility to parse the Turtle version of the YouTube Star Wars query result with inferencing based on the schema above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;riot --rdfs=pschema1.ttl ytstarwars.ttl &amp;gt; temp.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result is a copy of the input with triples like these added:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;_:Ba465efcc265d609003ef1776e61da647 rdf:type t:Video .
_:Ba465efcc265d609003ef1776e61da647 t:title &amp;#34;LEGO® Star Wars™ The Build Zone&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In addition to videos in the search results, there were also movie results from the &lt;code&gt;movie_results&lt;/code&gt; array, so let&amp;rsquo;s declare the same &lt;code&gt;rdfs:Class&lt;/code&gt; and &lt;code&gt;rdfs:domain&lt;/code&gt; triples for them and then do more inferencing&amp;hellip;&lt;/p&gt;
&lt;p&gt;But there&amp;rsquo;s a problem. Movie results also have &lt;code&gt;t:title&lt;/code&gt; properties, and the schema above says that anything with a &lt;code&gt;t:title&lt;/code&gt; is a video result. How can the schema distinguish between videos and movies, and how can we say that both videos and movies have titles?&lt;/p&gt;
&lt;p&gt;I mentioned earlier that the RDF created by AtomGraph includes triples with predicates of &lt;code&gt;t:video_results&lt;/code&gt;, blank node subjects that represent the video results array, and blank node objects that represent the members of the array—the videos themselves. It also includes similar &lt;code&gt;t:movie_results&lt;/code&gt; triples to store movies.&lt;/p&gt;
&lt;p&gt;The first draft of the schema above used RDF&amp;rsquo;s &lt;code&gt;rdfs:domain&lt;/code&gt; property to say that if a triple has a particular predicate then the resource represented by its subject is an instance of a particular class. The second draft uses a different part of the RDFS vocabulary: &lt;code&gt;rdfs:range&lt;/code&gt;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# pschema2.ttl
@prefix t:    &amp;lt;http://bobdc.com/ns/pschemademo#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;  .

t:Video a rdfs:Class .
t:Movie a rdfs:Class .

t:video_results rdfs:range t:Video . 
t:movie_results rdfs:range t:Movie . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Unlike the &lt;code&gt;rdfs:domain&lt;/code&gt; property, the &lt;code&gt;rdfs:range&lt;/code&gt; property tells the inferencing engine that if a particular property is used as a triple&amp;rsquo;s predicate, then that triple&amp;rsquo;s &lt;em&gt;object&lt;/em&gt; is a member of a particular class. The &lt;code&gt;t:video_results&lt;/code&gt; triple in this new schema tells the inferencing engine that when it sees the triple &lt;code&gt;{_:blankNode1 t:video_results _:blankNode2}&lt;/code&gt; in the input, it should create the triple &lt;code&gt;{_:blankNode2 a t:Video}&lt;/code&gt;. The other &lt;code&gt;rdfs:range&lt;/code&gt; triple in the schema does something similar to say that the objects of &lt;code&gt;t:movie_results&lt;/code&gt; triples are instances of &lt;code&gt;t:Movie&lt;/code&gt;.&lt;/p&gt;
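&lt;p&gt;(Sketched the same toy way in plain Python, with triples as tuples; the &lt;code&gt;infer_range&lt;/code&gt; function is mine for illustration, not part of any real RDFS engine:)&lt;/p&gt;

```python
# Toy illustration of the rdfs:range rule (a sketch, not a real engine).
# Unlike rdfs:domain, which types a triple's subject, rdfs:range types
# its object.

def infer_range(data, schema):
    """If the schema says (p, rdfs:range, C), then every data triple
    (s, p, o) lets us infer (o, rdf:type, C) about its object."""
    ranges = {p: c for (p, pred, c) in schema if pred == "rdfs:range"}
    return {(o, "rdf:type", ranges[p]) for (s, p, o) in data if p in ranges}

schema = {("t:video_results", "rdfs:range", "t:Video"),
          ("t:movie_results", "rdfs:range", "t:Movie")}
data = {("_:array1", "t:video_results", "_:video1"),
        ("_:array2", "t:movie_results", "_:movie1")}
for triple in sorted(infer_range(data, schema)):
    print(triple)
# ('_:movie1', 'rdf:type', 't:Movie')
# ('_:video1', 'rdf:type', 't:Video')
```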
&lt;p&gt;The first two triples in the new schema declare those two classes, but strictly speaking this isn&amp;rsquo;t necessary. If the schema says that &lt;code&gt;_:blankNode1&lt;/code&gt; is a member of a particular class, then the inference engine will infer that that class exists. It&amp;rsquo;s still worth declaring the classes, though, because an important reason to have schemas in the first place is to show the structure of the data to people using that data so that they can get more out of it.&lt;/p&gt;
&lt;p&gt;Running a similar &lt;code&gt;riot&lt;/code&gt; command with the new schema then creates new triples such as the following:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;_:B43a50d34335d3e6c8db6403bc5bea2cf a t:Movie .
_:B2f3a9c7d55b4e5ab6272a20db6a16b97 a t:Video . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;How do we show that the title, description, and link properties in the triples generated by AtomGraph apply to videos and movies but not necessarily to other classes that may come up in this data? With another incremental modeling step: we&amp;rsquo;ll make the Movie and Video classes subclasses of another class (in this case, &lt;a href=&#34;https://schema.org/CreativeWork&#34;&gt;CreativeWork&lt;/a&gt; from schema.org; I may as well take advantage of an existing standard to make the data more interoperable with other applications) and declare that the properties go with that superclass:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# pschema3.ttl
@prefix t:    &amp;lt;http://bobdc.com/ns/pschemademo#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;  .
@prefix s:    &amp;lt;http://schema.org/&amp;gt; .

t:Video a rdfs:Class ;
        rdfs:subClassOf s:CreativeWork . 

t:Movie a rdfs:Class ;
        rdfs:subClassOf s:CreativeWork . 

t:video_results rdfs:range t:Video . 
t:movie_results rdfs:range t:Movie . 

t:title rdfs:domain s:CreativeWork .
t:link rdfs:domain s:CreativeWork .
t:description rdfs:domain s:CreativeWork .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here are some of the triples generated by &lt;code&gt;riot&lt;/code&gt; from that schema, with blank node names and &lt;code&gt;t:description&lt;/code&gt; values shortened to fit here better:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;_:Ba2f a t:Video .
_:Ba2f a s:CreativeWork .
_:Ba2f t:title &amp;#34;2020 Portrayed by Star Wars&amp;#34; .
_:Ba2f t:link &amp;#34;https://www.youtube.com/watch?v=L8Sezzl7_zU&amp;#34; .
_:Ba2f t:description &amp;#34;A Parody of Star Wars in which...&amp;#34; .

_:B166 a t:Movie .
_:B166 a s:CreativeWork .
_:B166 t:title &amp;#34;Star Wars: The Empire Strikes Back&amp;#34; .
_:B166 t:link &amp;#34;https://www.youtube.com/watch?v=Ooh3k8cJDBg&amp;#34; .
_:B166 t:description &amp;#34;Discover the conflict between good and ...&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;There is a lot more modeling that I could do with this data. I could take greater advantage of the schema.org ontology and maybe &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms&#34;&gt;Dublin Core&lt;/a&gt; as well so that my data interoperates with other data and applications better. The remainder of the data converted by AtomGraph has more properties and classes which I may or may not care about. If I do, I can add more to my schema; if I don&amp;rsquo;t, I&amp;rsquo;m done.&lt;/p&gt;
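&lt;p&gt;(The subclass propagation behind those &lt;code&gt;s:CreativeWork&lt;/code&gt; triples can also be sketched as a toy in plain Python, with triples as tuples; the &lt;code&gt;infer_superclasses&lt;/code&gt; function is mine for illustration, not part of any real RDFS engine:)&lt;/p&gt;

```python
# Toy illustration of the rdfs:subClassOf rule (a sketch, not a real
# engine): membership in a class propagates up to its superclasses.

def infer_superclasses(data, schema):
    """If (s, rdf:type, C) holds and the schema says
    (C, rdfs:subClassOf, D), infer (s, rdf:type, D); loop so that
    chains of subclasses propagate all the way up."""
    supers = {c: d for (c, pred, d) in schema if pred == "rdfs:subClassOf"}
    triples = set(data)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(triples):
            if p == "rdf:type" and o in supers:
                new = (s, "rdf:type", supers[o])
                if new not in triples:
                    triples.add(new)
                    changed = True
    return triples - set(data)  # just the newly inferred triples

schema = {("t:Video", "rdfs:subClassOf", "s:CreativeWork")}
data = {("_:Ba2f", "rdf:type", "t:Video")}
print(infer_superclasses(data, schema))
# {('_:Ba2f', 'rdf:type', 's:CreativeWork')}
```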
&lt;p&gt;The value of inferencing from schemas is really just a bonus to this exercise. The original key points I meant to prove here are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A little schema can provide a little value right away.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Incrementally building on it can provide more and more value.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Your schema doesn&amp;rsquo;t need to cover all of your input data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In my &lt;a href=&#34;../knowledgegraphs/&#34;&gt;last blog entry&lt;/a&gt; I wrote about the excellent &amp;ldquo;Knowledge Graphs&amp;rdquo; paper (&lt;a href=&#34;https://arxiv.org/pdf/2003.02320.pdf&#34;&gt;pdf&lt;/a&gt;) written by some experts in many related topics as a product of a &lt;a href=&#34;https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=18371&#34;&gt;Schloss Dagstuhl conference&lt;/a&gt; in 2018. One bit of that paper that I quoted is very relevant to this blog entry as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Graphs allow maintainers to postpone the definition of a schema, allowing the data – and its scope – to evolve in a more flexible manner than typically possible in a relational setting, particularly for capturing incomplete knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This idea of letting the data and its schema evolve in a more flexible manner is especially great for data integration projects. My example here started off with a (somewhat) big mess of RDF; if you&amp;rsquo;re working with more than one RDF dataset—maybe with some converted from other formats such as JSON or relational databases—then the use of RDFS to identify little subsets of those datasets &lt;em&gt;and to specify relationships between components of those subsets&lt;/em&gt; can help your knowledge graph and the applications that use it become useful a lot sooner.&lt;/p&gt;
&lt;p&gt;It works at the other end of the scale as well. For proof of concept work, a small bit of data with a small schema can help to prove your concept. From there, incrementally adding to this data and schema can get those who saw the proved concept more and more interested as you build it up. This agile approach goes over well with software developers, who have good reasons to be suspicious of starting off with a large complex schema. (I actually consider a schema with no corresponding data to only be a schema proposal: how do we know that the schema is doing a good job? The academic world is full of these, although they are more often known as ontologies.)&lt;/p&gt;
&lt;p&gt;Note that I did all of this without any SPARQL. I would probably use some SPARQL as one more step to pull out the inferred triples instead of keeping all those original triples about the JSON file&amp;rsquo;s structure that AtomGraph generated, but that would just be a convenience. The main work of transforming the data subset that I had into the model that I wanted was still performed with the RDFS model.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve written another example of how incremental schema development can benefit an application in &lt;a href=&#34;../driving-hadoop-data-integratio/&#34;&gt;Driving Hadoop data integration with standards-based models instead of code&lt;/a&gt;. (Note the subtitle: &amp;ldquo;RDFS Models!&amp;rdquo;) The main point at the time was to show how this could all work on a Hadoop infrastructure. I took RDF generated from two different employee databases with two different structures, built a small model that integrated subsets of them, ran a script that performed the integration, expanded the model, and ran the same script to perform a larger integration with no changes to the script itself. Hadoop or no Hadoop, this provides another nice demonstration of how RDFS inferencing with gradually growing schemas can help you take advantage of existing datasets that were not originally designed for your application.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2021">2021</category>
      
      <category domain="https://www.bobdc.com//categories/rdfs">RDFS</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/knowledge-graphs">knowledge-graphs</category>
      
    </item>
    
    <item>
      <title>Knowledge Graphs!</title>
      <link>https://www.bobdc.com/blog/knowledgegraphs/</link>
      <pubDate>Sun, 20 Dec 2020 11:45:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/knowledgegraphs/</guid>
      
      
      <description><div>Semantic Linked Knowledge Web Data Graphs?</div><div>&lt;img id=&#34;idm45478314451696&#34; src=&#34;https://www.bobdc.com/img/main/knowledgegraphrdf.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Google Knowledge Graph with RDF triples&#34;/&gt;
&lt;p&gt;For several years I thought of &amp;ldquo;knowledge graphs&amp;rdquo; as the buzzphrase that had partially replaced &amp;ldquo;Linked Data&amp;rdquo;, which was the buzzphrase that had partially replaced &amp;ldquo;Semantic Web&amp;rdquo;. In a &lt;a href=&#34;http://www.bobdc.com/blog/selling-rdf-technology-to-big/&#34;&gt;2012 blog entry&lt;/a&gt; I explained how Hadoop and the new-at-the-time NoSQL databases had convinced me that even if a technology has a funny name, selling it based on the problems it solves makes more sense and ages better than selling a buzz phrase vision and then, if that goes well, describing the technology that enables that vision. (In &lt;a href=&#34;http://www.bobdc.com/blog/coming-soon-new-expanded-editi/&#34;&gt;another blog entry&lt;/a&gt; I described how the second edition of my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; had &amp;ldquo;55% more pages! 23% fewer mentions of the semantic web!&amp;rdquo;) In other words, I&amp;rsquo;ve had time to get more suspicious of buzz phrase visions over the years.&lt;/p&gt;
&lt;h1 id=&#34;the-hot-new-thing&#34;&gt;The hot new thing&lt;/h1&gt;
&lt;p&gt;I also knew that RDF-related vendors have been talking about knowledge graph capabilities for several years, but these same vendors were also talking about Semantic Web and Linked Data capabilities before that, so I thought that they were just rebranding with the new buzz phrase as a marketing strategy. Recently, though, I realized how far the excitement about knowledge graphs had spread independently of that community. My initial surprise was Ben Lorica&amp;rsquo;s &lt;a href=&#34;https://thedataexchange.media/building-and-deploying-knowledge-graphs/&#34;&gt;interview with Mayank Kejriwal&lt;/a&gt; on his &amp;ldquo;Data Exchange&amp;rdquo; podcast about knowledge graphs. Then, when I &lt;a href=&#34;https://twitter.com/bobdc/status/1317507842291748865&#34;&gt;tweeted&lt;/a&gt; about it, &lt;a href=&#34;https://twitter.com/pacoid/status/1317531289495375872&#34;&gt;Paco Nathan&lt;/a&gt; recommended that I join the &lt;a href=&#34;https://www.knowledgegraph.tech/&#34;&gt;Knowledge Graph Conference&lt;/a&gt; Slack group. I&amp;rsquo;d been aware of Lorica and Nathan&amp;rsquo;s work for years but had given up on RDF-like technology making much of a blip on their radar.&lt;/p&gt;
&lt;p&gt;I joined the Slack group and found old friends and some new ones there. When I asked the group about a good definition of &amp;ldquo;knowledge graph&amp;rdquo; I was a bit inundated, especially with pointers to vendor explanations. Because it is a bandwagon buzzphrase for our time, many vendors are shouting &amp;ldquo;That currently hot thing? Yeah! That&amp;rsquo;s what we do!&amp;rdquo; (even the SEO sharks have &lt;a href=&#34;https://www.google.com/search?q=%22KNOWLEDGE+GRAPH%22+SEO&#34;&gt;smelled blood in this water&lt;/a&gt;) so I was less interested in the vendor perspectives on a good definition. In that Slack thread, Tomas Deely pointed me to the paper simply titled &amp;ldquo;Knowledge Graphs&amp;rdquo; (&lt;a href=&#34;https://arxiv.org/pdf/2003.02320.pdf&#34;&gt;pdf&lt;/a&gt;) written by &lt;a href=&#34;https://twitter.com/juansequeda&#34;&gt;@juansequeda&lt;/a&gt;, Antoine Zimmermann (&lt;a href=&#34;https://twitter.com/MonsieurAZ&#34;&gt;@MonsieurAZ&lt;/a&gt;), and 14 other people whose names were less familiar to me. As Juan &lt;a href=&#34;https://twitter.com/bobdc/status/1326546671304536077&#34;&gt;explained to me&lt;/a&gt;, the paper came out of a &lt;a href=&#34;https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=18371&#34;&gt;Schloss Dagstuhl conference&lt;/a&gt; in 2018.&lt;/p&gt;
&lt;h1 id=&#34;a-serious-informative-review-of-knowledge-graph-technology&#34;&gt;A serious, informative review of knowledge graph technology&lt;/h1&gt;
&lt;p&gt;It was nice to see the formal discipline of this paper—for example, its description of &amp;ldquo;the distinction between nodes/edges and entities/relations&amp;rdquo;—when compared with all the me-too vendor definitions of knowledge graphs floating around. After 5 or 6 years of my looking at knowledge graphs through RDF-colored glasses this paper gave me a broader perspective. Its introduction tells us:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The goal of this tutorial paper is to motivate and give a comprehensive introduction to knowledge graphs: to describe their foundational data models and how they can be queried; to discuss representations relating to schema, identity, and context; to discuss deductive and inductive ways to make knowledge explicit; to present a variety of techniques that can be used for the creation and enrichment of graph-structured data; to describe how the quality of knowledge graphs can be discerned and how they can be refined; to discuss standards and best practices by which knowledge graphs can be published; and to provide an overview of existing knowledge graphs found in practice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That&amp;rsquo;s a lot of material to cover, so the paper is long. After I read the 78-page main body of the 132-page paper, Appendix A on page 108 (after 30 pages of 583 footnote references) in particular gave me the perspective that I was looking for on both the long-term and recent histories of knowledge graphs as well as the relative roles of RDF and non-RDF technologies along the way. I very strongly recommend that people interested in knowledge graphs start with this five-page appendix. (Note: I read and took notes about the paper a few weeks ago and just noticed that all of my numbers earlier in this paragraph were off. I then saw the &amp;ldquo;11 Dec 2020&amp;rdquo; date stamp on the first page and realized that it has been revised since I read it. I&amp;rsquo;ve tried to update my numbers and quotes here to reflect the latest version.)&lt;/p&gt;
&lt;p&gt;The short version of the history of knowledge graphs begins in 2012 (the year I blogged about reducing my usage of the term &amp;ldquo;semantic web&amp;rdquo;!) when an engineering SVP at Google published &lt;a href=&#34;https://blog.google/products/search/introducing-knowledge-graph-things-not/&#34;&gt;Introducing the Knowledge Graph: things, not strings&lt;/a&gt;. Sections 2 and 3 of the Schloss Dagstuhl Knowledge Graph paper&amp;rsquo;s Appendix A are titled &amp;ldquo;&amp;lsquo;Knowledge Graphs&amp;rsquo;: Pre 2012&amp;rdquo; and &amp;ldquo;&amp;lsquo;Knowledge Graphs&amp;rsquo;: 2012 Onwards&amp;rdquo; because this Google article was such a key event in knowledge graph history. Appendix A gives perspective on what makes which research—both before and after 2012—relevant to whatever is now considered knowledge graph technology. Section 3 also sorts out various classes of &amp;ldquo;knowledge graph&amp;rdquo; definitions, providing good context on the paper&amp;rsquo;s own definition given in both its introduction and in its summary at the end: &amp;ldquo;a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities&amp;rdquo;. We can all think of followup questions for that definition, but for 28 words it&amp;rsquo;s pretty good.&lt;/p&gt;
&lt;p&gt;Section 3 of the appendix includes an important, lower-level supplement to this definition, addressing the question of what makes a graph data structure a knowledge graph: &amp;ldquo;We refer to a knowledge graph as a data graph potentially enhanced with representations of schema, identity, context, ontologies and/or rules&amp;rdquo;.&lt;/p&gt;
&lt;h1 id=&#34;reading-through-rdf-colored-glasses&#34;&gt;Reading through RDF-colored glasses&lt;/h1&gt;
&lt;p&gt;All of these potential enhancements are covered in some detail in the main body of the paper. Although that coverage often goes for pages without mentioning any RDF-related technology, when I see (through my admittedly RDF-colored glasses) discussions of schema, identity, and ontologies around a tourism example of named node-edge-node triples, I of course see lots of RDF. Here&amp;rsquo;s an example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Graphs allow maintainers to postpone the definition of a schema, allowing the data – and its scope – to evolve in a more flexible manner than typically possible in a relational setting, particularly for capturing incomplete knowledge. Unlike (other) NoSQL models, specialised graph query languages support not only standard relational operators (joins, unions, projections, etc.), but also navigational operators for recursively finding entities connected through arbitrary-length paths.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There&amp;rsquo;s no mention of RDF or SPARQL there, but it certainly lists many of their key capabilities. (I&amp;rsquo;ll be discussing the wonderful possibilities of partial RDF schemas, as a compromise between no schemas and fully detailed ones, in an upcoming blog entry.) It goes on:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Standard knowledge representation formalisms – such as ontologies and rules – can be employed to define and reason about the semantics of the terms used to label and describe the nodes and edges in the graph.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was interesting how that and many other parts of the paper discussed capabilities that OWL has provided since RDF&amp;rsquo;s early years. OWL has a smaller profile in the RDF world than it used to (people who once thought that OWL might help to define data structures that could help maintain data quality have turned to &lt;a href=&#34;http://www.bobdc.com/blog/the-w3c-standard-constraint-la/&#34;&gt;SHACL&lt;/a&gt;, because that&amp;rsquo;s not really what OWL was for, and these users&amp;rsquo; confusion over all the different OWL profiles didn’t help) so it was interesting to see how much the Schloss Dagstuhl Knowledge Graph paper discussed ontologies, Description Logics (the DL in &lt;a href=&#34;http://www.bobdc.com/blog/the-dl-in-owl-dl/&#34;&gt;OWL-DL&lt;/a&gt;), T-Boxes, A-Boxes, Individuals, and especially entailment and related topics where OWL can contribute plenty.&lt;/p&gt;
&lt;h1 id=&#34;graph-and-non-graph-technologies&#34;&gt;Graph and non-graph technologies&lt;/h1&gt;
&lt;p&gt;Something else that made me think of knowledge graphs as a vague buzzphrase was the way people used the term to reference technologies that are quite separate from the use of graph data structures: relational databases, text indexing, named entity recognition and other areas that fall under the currently overlapping umbrellas of machine learning and artificial intelligence.&lt;/p&gt;
&lt;p&gt;Some of those do have direct applications to graphs, such as the use of &lt;a href=&#34;http://www.bobdc.com/blog/semantic-web-semantics-vs-vect/&#34;&gt;embeddings&lt;/a&gt; with triples, and the Schloss Dagstuhl paper covers these technologies. It has good reasons for this, describing them as techniques by which knowledge graphs can be &amp;ldquo;enriched from diverse sources of legacy data that may range from plain text to structured formats (and anything in between).&amp;rdquo; The paper&amp;rsquo;s conclusion sums up the relationship among these technologies nicely:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Research on knowledge graphs can become a confluence of techniques arising
from different areas with the common objective of maximising the knowledge – and thus value
– that can be distilled from diverse sources at large scale using a graph-based data abstraction.
Pursuing this objective will benefit from expertise on graph databases, knowledge representation,
logic, machine learning, graph algorithms and theory, ontology engineering, data quality, natural
language processing, information extraction, privacy and security, and more besides.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(A side note on embeddings, RDF and knowledge graphs: The RDF2VEC algorithm used to do embeddings with RDF has been around since 2016, and you can find many discussions about it since then that refer to &lt;a href=&#34;https://www.google.com/search?q=rdf2vec%22rdf+graph+embeddings%22&#34;&gt;RDF graph embeddings&lt;/a&gt;. More recent discussions of it, though, have titles like &lt;a href=&#34;https://towardsdatascience.com/how-to-create-representations-of-entities-in-a-knowledge-graph-using-pyrdf2vec-82e44dad1a0&#34;&gt;How to Create Representations of Entities in a Knowledge Graph using pyRDF2Vec&lt;/a&gt;. It&amp;rsquo;s another example of an RDF thing that&amp;rsquo;s been around for a while now being described as a knowledge graph thing because of the current cachet of the term.)&lt;/p&gt;
&lt;h1 id=&#34;knowledge-graphs-and-rdf&#34;&gt;Knowledge graphs and RDF&lt;/h1&gt;
&lt;p&gt;Section 10.2 of the Schloss Dagstuhl Knowledge Graph paper, &amp;ldquo;Enterprise Knowledge Graphs&amp;rdquo;, includes footnoted mentions of over a dozen brand-name companies who have discussed their knowledge graph initiatives. I did a quick skim of all the referenced works to check for references to RDF technology and found that &lt;a href=&#34;https://www.thomsonreuters.com/en/press-releases/2017/october/thomson-reuters-launches-first-of-its-kind-knowledge-graph-feed.html&#34;&gt;Thomson Reuters&lt;/a&gt; has got RDF in that mix, which didn&amp;rsquo;t surprise me. The big surprise for me was Pinterest not only using RDF but &lt;a href=&#34;https://arxiv.org/pdf/1907.02106.pdf&#34;&gt;using OWL&lt;/a&gt;. Put a pin in &lt;a href=&#34;https://www.pinterest.com/shannonalford/owls/&#34;&gt;that&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Discussions of the work at the other companies didn&amp;rsquo;t mention RDF, but most were fairly high-level discussions, so I&amp;rsquo;m guessing that some of them use RDF and some don&amp;rsquo;t. (While the &lt;a href=&#34;https://www.astrazeneca.com/what-science-can-do/labtalk-blog/uncategorized/how-data-and-ai-are-helping-unlock-the-secrets-of-disease.html&#34;&gt;referenced AstraZeneca article&lt;/a&gt; didn&amp;rsquo;t mention RDF, I know that as a customer of &lt;a href=&#34;https://allegrograph.com/customers/astrazeneca/&#34;&gt;Allegrograph&lt;/a&gt;, &lt;a href=&#34;https://www.ontotext.com/knowledgehub/case-studies/astrazeneca-early-hypotheses-testing-linked-data/&#34;&gt;Ontotext&lt;/a&gt; and TopQuadrant—for whom I did training at AstraZeneca—they have been RDF fans for a while, especially because of the data integration possibilities.)&lt;/p&gt;
&lt;p&gt;Knowledge graphs are not as synonymous with RDF as those of us with the aforementioned glasses might like to think. In fact, the knowledge graph world currently looks bigger than that, and it&amp;rsquo;s easy enough to picture companies like Google, Facebook, and eBay defining their own data structures and schema languages to build graphs with no reference to the relevant W3C standards. I don&amp;rsquo;t think this is necessarily a bad thing; it looks like a pretty big tent.&lt;/p&gt;
&lt;h1 id=&#34;the-vision-thing&#34;&gt;The vision thing&lt;/h1&gt;
&lt;p&gt;I described how I became suspicious of selling RDF technology by building a buzz phrase vision around it and starting the marketing pitch with that. The pleasant surprise in my study of the knowledge graph world is that it was built around currently important ideas and needs, not any specific technology, and RDF-related technology turns out to provide excellent, standardized, widely implemented open source and commercial support for the implementation of knowledge graphs. In other words, this newer vision came along fairly independently of RDF and happens to be a great fit for it, so I&amp;rsquo;ll just try to be grateful. I&amp;rsquo;ll still be a bit self-conscious when I insert the phrase &amp;ldquo;knowledge graph&amp;rdquo; into a technology discussion that I would have written anyway even if knowledge graphs weren&amp;rsquo;t so hot—as if I were jumping on the bandwagon—but it&amp;rsquo;s a pretty nice bandwagon, and RDF people have a lot that we can contribute to it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/knowledge-graphs">knowledge-graphs</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">rdf</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">rdf/owl</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic-web</category>
      
      <category domain="https://www.bobdc.com//categories/linked-data">linked-data</category>
      
    </item>
    
    <item>
      <title>Using SPARQL to combine Wikidata and OSM triples</title>
      <link>https://www.bobdc.com/blog/osmpluswikidata/</link>
      <pubDate>Sun, 22 Nov 2020 12:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/osmpluswikidata/</guid>
      
      
      <description><div>Linking that data.</div><div>&lt;p&gt;&lt;a href=&#39;https://www.amnh.org/&#39;&gt; &lt;img width=&#39;200&#39; class=&#39;centered&#39; src=&#39;http://commons.wikimedia.org/wiki/Special:FilePath/American%20Museum%20of%20Natural%20History%20New%20York%20City.jpg&#39;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Last month in &lt;a href=&#34;../geosparqlgraphdb/&#34;&gt;GeoSPARQL queries on OSM Data in GraphDB&lt;/a&gt; I showed how to use SPARQL to retrieve triples about Manhattan museums from OpenStreetMap&amp;rsquo;s SPARQL endpoint. Then, after loading the triples into Ontotext&amp;rsquo;s free GraphDB triplestore, I showed how GraphDB&amp;rsquo;s support for the GeoSPARQL standard let me query for all the museums within a mile of the Museum of Modern Art. The OSM data doesn&amp;rsquo;t include pictures of the museums, but I mentioned that it does include the museums&amp;rsquo; Wikidata URIs, so today we&amp;rsquo;ll see how to use those URIs to retrieve the images from Wikidata and connect them to the data retrieved from OSM. The result of this process includes the images you see here, each linking to the pictured museum&amp;rsquo;s website.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#39;http://www.folkartmuseum.org&#39;&gt; &lt;img width=&#39;200&#39;  class=&#39;centered&#39; src=&#39;http://commons.wikimedia.org/wiki/Special:FilePath/The%20American%20Folk%20Art%20Museum.JPG&#39;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Before I get to that I wanted to show a nice query that Ontotext founder Atanas Kiryakov showed me after I published that last blog entry. I had used curl to send a SPARQL CONSTRUCT query to OSM&amp;rsquo;s endpoint and save the triples in a local Turtle file. Once I had that file I loaded it into GraphDB and ran the query about museums near MoMA there. Atanas&amp;rsquo;s query uses the SPARQL SERVICE keyword to do the retrieval from within GraphDB so that all the steps that I did can happen with one query:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX geof: &amp;lt;http://www.opengis.net/def/function/geosparql/&amp;gt;
PREFIX uom:  &amp;lt;http://www.opengis.net/def/uom/OGC/1.0/&amp;gt;
PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;
PREFIX osmm: &amp;lt;https://www.openstreetmap.org/meta/&amp;gt;

SELECT ?museum ?museumName ?metersFromMoma where {
    SERVICE &amp;lt;https://sophox.org/sparql&amp;gt; {
        ?moma   osmt:official_name &amp;#34;The Museum of Modern Art&amp;#34; ;
                osmm:loc ?momaLoc .
        ?museum osmt:tourism &amp;#34;museum&amp;#34; ;
                osmt:name ?museumName ;
                osmm:loc ?museumLoc .     }   
    BIND(round(geof:distance(?museumLoc,?momaLoc, uom:metre)) AS ?metersFromMoma)
    FILTER(?metersFromMoma &amp;lt; 1610)  # Only those less than a mile away.
    FILTER(?museum != ?moma)        # Don&amp;#39;t bother showing MoMA itself.   
} ORDER BY ?metersFromMoma
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;His query uses no features that are specific to GraphDB, so this query would work with any SPARQL engine that supports the GeoSPARQL standard—which, in this case, means supporting that &lt;code&gt;geof:distance()&lt;/code&gt; function call. GraphDB was the first triplestore I found that had this support.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#39;https://momath.org/&#39;&gt; &lt;img width=&#39;200&#39; class=&#39;centered&#39; src=&#39;http://commons.wikimedia.org/wiki/Special:FilePath/National%20Museum%20of%20Mathematics%2011%20East%2026th%20Street%20entrance.jpg&#39;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;To get pictures of the retrieved museums, I created a variation on Atanas&amp;rsquo;s query that retrieved triples about the Manhattan museums and inserted them into the active local repository:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;

INSERT { ?museum ?p ?o } WHERE
{
    SERVICE &amp;lt;https://sophox.org/sparql&amp;gt; {
    ?museum osmt:addr:city &amp;#34;New York&amp;#34;;
            osmt:tourism &amp;#34;museum&amp;#34;;
            ?p ?o .
    }   
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The following query then showed me that each museum&amp;rsquo;s &lt;code&gt;osmt:wikidata&lt;/code&gt; value in that locally stored data was a Wikidata entity URI such as &lt;code&gt;http://www.wikidata.org/entity/Q636942&lt;/code&gt; for the International Center of Photography:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;
SELECT * WHERE {
 ?museum osmt:addr:city &amp;#34;New York&amp;#34;;
         osmt:tourism &amp;#34;museum&amp;#34;;
         osmt:wikidata ?wikidataID . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#39;https://www.icp.org/&#39;&gt; &lt;img width=&#39;200&#39; class=&#39;centered&#39; src=&#39;http://commons.wikimedia.org/wiki/Special:FilePath/Intnl%20Cenf%20Photog%2043%20jeh.JPG&#39;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you look at the &lt;a href=&#34;https://www.wikidata.org/wiki/Q636942&#34;&gt;Wikidata page for the ICP&lt;/a&gt; you&amp;rsquo;ll see that it includes a picture of it, and if you click on the &amp;ldquo;image&amp;rdquo; property name there you&amp;rsquo;ll see that this is property &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P18&#34;&gt;P18&lt;/a&gt; in Wikidata. So, my next query took each of the Wikidata ID values of the museums and used the SERVICE keyword to send them off to Wikidata where it used them to retrieve image URLs, which it stored locally:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;
PREFIX wdt: &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;

INSERT { ?museum wdt:P18 ?imageURL} WHERE {
  ?museum osmt:addr:city &amp;#34;New York&amp;#34;;
          osmt:tourism &amp;#34;museum&amp;#34;;
          osmt:wikidata ?wikidataID . 
  SERVICE &amp;lt;https://query.wikidata.org/sparql&amp;gt; {
    ?wikidataID wdt:P18 ?imageURL. 
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(As always, I first ran the query above with the CONSTRUCT keyword instead of INSERT just to make sure that I was properly asking for what I was trying to get.)&lt;/p&gt;
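&lt;p&gt;That dry run is the INSERT query above with just the one keyword changed, so its results can be inspected without touching the local repository:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;
PREFIX wdt: &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;

CONSTRUCT { ?museum wdt:P18 ?imageURL} WHERE {
  ?museum osmt:addr:city &amp;#34;New York&amp;#34;;
          osmt:tourism &amp;#34;museum&amp;#34;;
          osmt:wikidata ?wikidataID . 
  SERVICE &amp;lt;https://query.wikidata.org/sparql&amp;gt; {
    ?wikidataID wdt:P18 ?imageURL. 
  }
}
&lt;/code&gt;&lt;/pre&gt;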
&lt;p&gt;&lt;a href=&#39;https://www.paleycenter.org/&#39;&gt; &lt;img width=&#39;200&#39; class=&#39;centered&#39; src=&#39;http://commons.wikimedia.org/wiki/Special:FilePath/Museum%20of%20Television%20and%20Radio%202006.jpg&#39;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The OSM data that I pulled included website URLs for most of the museums, so I queried the data I had aggregated from the two endpoints to list the websites and image URLs for museums within a mile of MoMA (actually, within 2 miles to give me a nicer choice of pictures to include here):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;
PREFIX geof: &amp;lt;http://www.opengis.net/def/function/geosparql/&amp;gt;
PREFIX uom:  &amp;lt;http://www.opengis.net/def/uom/OGC/1.0/&amp;gt;
PREFIX wdt: &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;
PREFIX osmm: &amp;lt;https://www.openstreetmap.org/meta/&amp;gt;

SELECT ?website ?imageURL WHERE {
   ?moma   osmt:official_name &amp;#34;The Museum of Modern Art&amp;#34; ;
           osmm:loc ?momaLoc .
   ?museum wdt:P18 ?imageURL ;
           osmt:website ?website ;
           osmm:loc ?museumLoc . 
   BIND(round(geof:distance(?museumLoc,?momaLoc, uom:metre)) AS ?metersFromMoma)
   FILTER(?metersFromMoma &amp;lt; 3220)  # Only those less than 2 miles away.
   FILTER(?museum != ?moma)   # Don&amp;#39;t bother showing MoMA itself.   
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#39;https://www.guggenheim.org/new-york&#39;&gt; &lt;img class=&#39;centered&#39; width=&#39;200&#39; src=&#39;http://commons.wikimedia.org/wiki/Special:FilePath/NYC%20-%20Guggenheim%20Museum.jpg&#39;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When displaying query results, GraphDB adds a handy &amp;ldquo;Download as&amp;rdquo; button, so I saved a tab-separated value version of that query&amp;rsquo;s results and used the ancient Linux utility &lt;a href=&#34;https://en.wikipedia.org/wiki/Sed&#34;&gt;sed&lt;/a&gt; to wrap the values in a bit of HTML:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;sed -E &amp;#34;s/(.+)\t&amp;lt;(.+)&amp;gt;/\&amp;lt;a href=&amp;#39;\1&amp;#39;&amp;gt; \
&amp;lt;img width=&amp;#39;200&amp;#39; src=&amp;#39;\2&amp;#39;\/&amp;gt;&amp;lt;\/a&amp;gt;/&amp;#34; query-result.tsv &amp;gt; temp.html
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I could then copy the bits of HTML from the resulting file to the text file I&amp;rsquo;m typing now so that the images you see can be links to the home pages.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re reading this more than a few months after November of 2020 and the URLs of any of those images have changed, they&amp;rsquo;ll show up as broken links. With any application that uses data from remote sources, we have to consider various factors when making the decision whether to dynamically grab certain data when necessary or grab it once and store it locally for future use. Isn&amp;rsquo;t it nice how SPARQL and the widely-implemented open source and commercial RDF tools out there give us so many options when we make these decisions?&lt;/p&gt;
&lt;p&gt;&lt;a href=&#39;https://www.nyhistory.org/&#39;&gt; &lt;img width=&#39;200&#39; class=&#39;centered&#39; src=&#39;http://commons.wikimedia.org/wiki/Special:FilePath/NJ%20Loyalists%20in%20N-Y%20HS%20hall%20jeh.jpg&#39;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Have you ever pulled data from two different endpoints to answer a question that neither endpoint could answer by itself? Let me know at &lt;a href=&#34;https://twitter.com/bobdc&#34;&gt;@bobdc&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/openstreetmap">OpenStreetMap</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>GeoSPARQL queries on OSM Data in GraphDB</title>
      <link>https://www.bobdc.com/blog/geosparqlgraphdb/</link>
      <pubDate>Sun, 25 Oct 2020 11:40:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/geosparqlgraphdb/</guid>
      
      
      <description><div>Or, Querying geospatial data with SPARQL Part 2</div><div>&lt;img id=&#34;idm45478314451696&#34; src=&#34;https://www.bobdc.com/img/main/OSMSPARQL.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;OSM and SPARQL logo&#34;/&gt;
&lt;p&gt;Over a year ago, in &lt;a href=&#34;../geosparql1/&#34;&gt;Querying geospatial data with SPARQL: Part 1&lt;/a&gt;, I described my dream of pulling geospatial data down from OpenStreetMap, loading it into a local triplestore, and then querying it with queries that conformed to the GeoSPARQL standard. At the time, I tried several triplestores and data sources and never quite got there. When I tried it recently with Ontotext&amp;rsquo;s free version of &lt;a href=&#34;https://www.ontotext.com/products/graphdb/&#34;&gt;GraphDB&lt;/a&gt;, it all turned out to be quite easy.&lt;/p&gt;
&lt;p&gt;For some background, read that blog entry up through the paragraph beginning &amp;ldquo;The geosparql.org website has some preloaded data&amp;hellip;&amp;rdquo; The rest of the entry describes my only somewhat successful attempts to do geospatial queries with Blazegraph and Parliament and how I looked forward to Apache Jena&amp;rsquo;s growing GeoSPARQL support. (A few years earlier I wrote a bit about GeoSPARQL in &lt;a href=&#34;../visualizing-dbpedia-geographic/&#34;&gt;Visualizing DBpedia geographic data with some help from SPARQL&lt;/a&gt;.)&lt;/p&gt;
&lt;h1 id=&#34;graphdb&#34;&gt;GraphDB&lt;/h1&gt;
&lt;p&gt;The GraphDB page that I link to above includes a chart that shows that the free version does plenty, and most importantly, doesn&amp;rsquo;t expire or limit the amount of data you load. Once I downloaded it, installed it, started it up, and had it running at http://localhost:7200, its web-based interface had a tutorial to  &amp;ldquo;(1) Create a repository (2) Load a sample dataset (3) Run a SPARQL query&amp;rdquo; so I went through those steps. When you use GraphDB&amp;rsquo;s form to create a new repository, you&amp;rsquo;ll see that the &amp;ldquo;Rulesets&amp;rdquo; field has a default value of &amp;ldquo;RDFS-Plus (Optimized)&amp;rdquo; and offers 10 other choices, including several OWL choices and an &amp;ldquo;Upload custom ruleset&amp;rdquo; option. The form also includes a &amp;ldquo;Supports SHACL validation&amp;rdquo; checkbox and other options, so these were all great to see.&lt;/p&gt;
&lt;p&gt;Before trying GraphDB with geospatial data I wanted to test out its support for inferencing and for RDF* and SPARQL*. I had a nice short example ready to go at my blog entry &lt;a href=&#34;../rdf-and-sparql&#34;&gt;RDF* and SPARQL*: Reification can be pretty cool&lt;/a&gt; after the paragraph beginning &amp;ldquo;Blazegraph lets you do inferencing, so I couldn’t resist mixing that with RDF* and SPARQL*.&amp;rdquo; Treating two triples as resources themselves (thanks, RDF*!), the sample data in that example makes one triple an instance of &lt;code&gt;d:Class2&lt;/code&gt; and the other an instance of &lt;code&gt;d:Class3&lt;/code&gt;, and then it makes both of those classes subclasses of &lt;code&gt;d:Class1&lt;/code&gt; without creating any instances of &lt;code&gt;d:Class1&lt;/code&gt;. The query that follows this sample data doesn&amp;rsquo;t just ask for the instances of &lt;code&gt;d:Class1&lt;/code&gt;, which GraphDB&amp;rsquo;s RDFS-Plus support will find in its subclasses; it asks for the subject, predicate, and object of each of these instances. (Thanks, SPARQL*!) It all worked fine in GraphDB.&lt;/p&gt;
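&lt;p&gt;In outline, that test looked something like the following sketch; the resource names and the &lt;code&gt;d:&lt;/code&gt; namespace here are hypothetical stand-ins, not the exact ones from that earlier entry. First the data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:    &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .

# Each embedded triple is itself typed as a class instance (thanks, RDF*!)
&amp;lt;&amp;lt; d:x d:p d:y &amp;gt;&amp;gt; a d:Class2 .
&amp;lt;&amp;lt; d:y d:p d:z &amp;gt;&amp;gt; a d:Class3 .

d:Class2 rdfs:subClassOf d:Class1 .
d:Class3 rdfs:subClassOf d:Class1 .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and then a query that relies on RDFS inferencing to recognize the two embedded triples as instances of &lt;code&gt;d:Class1&lt;/code&gt; and return each one&amp;rsquo;s subject, predicate, and object:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt;

SELECT ?s ?p ?o WHERE {
  &amp;lt;&amp;lt; ?s ?p ?o &amp;gt;&amp;gt; a d:Class1 .
}
&lt;/code&gt;&lt;/pre&gt;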
&lt;h1 id=&#34;using-geosparql-with-graphdb&#34;&gt;Using GeoSPARQL with GraphDB&lt;/h1&gt;
&lt;p&gt;In my &amp;ldquo;Part 1&amp;rdquo; blog entry I described how a database manager&amp;rsquo;s ability to deal properly with geospatial data usually requires an add-on. GraphDB does use what they call a plugin for this, but there&amp;rsquo;s no need to download and plug it in yourself; it&amp;rsquo;s already in GraphDB, and you turn it on by adding a triple to the repository that sets &lt;code&gt;geoSparql:enabled&lt;/code&gt; to true for some resource, as described in &lt;a href=&#34;https://graphdb.ontotext.com/documentation/9.4/free/geosparql-support.html#usage&#34;&gt;their GeoSPARQL documentation&lt;/a&gt;. I got all of that page&amp;rsquo;s &lt;a href=&#34;https://graphdb.ontotext.com/documentation/9.4/free/geosparql-support.html#geosparql-examples&#34;&gt;GeoSPARQL examples&lt;/a&gt; to work easily enough after loading the data that it pointed to.&lt;/p&gt;
&lt;p&gt;In Part 1 I also wrote &amp;ldquo;Because I just love converting triples from one namespace to another so that I can use new tools and standards with them, I hoped to get some OSM triples and convert them to the right namespaces to enable geospatial queries on them using a local triplestore.&amp;rdquo; Having gotten the GeoSPARQL examples mentioned above to work in GraphDB I had a model to use when converting the OSM triples, and then I got a nice surprise: I didn&amp;rsquo;t have to convert them!&lt;/p&gt;
&lt;p&gt;I pulled all the triples about museums in &amp;ldquo;New York&amp;rdquo; from the OpenStreetMap SPARQL endpoint with the following simple query:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;

CONSTRUCT { ?museum ?p ?o }
WHERE {
  ?museum osmt:addr:city &amp;#34;New York&amp;#34;;
          osmt:tourism &amp;#34;museum&amp;#34;;
          ?p ?o .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(Despite requesting &amp;ldquo;New York&amp;rdquo; museums, the results all seemed to be in Manhattan. An &lt;code&gt;osmt:addr:city&lt;/code&gt; value of &amp;ldquo;Brooklyn&amp;rdquo; got other museums.)&lt;/p&gt;
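&lt;p&gt;That Brooklyn retrieval is just the query above with one literal changed:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;

CONSTRUCT { ?museum ?p ?o }
WHERE {
  ?museum osmt:addr:city &amp;#34;Brooklyn&amp;#34;;
          osmt:tourism &amp;#34;museum&amp;#34;;
          ?p ?o .
}
&lt;/code&gt;&lt;/pre&gt;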
&lt;p&gt;After storing that query in the file &lt;code&gt;manhattanMuseums.rq&lt;/code&gt;, the following curl command (split at the \ for display here) retrieved the triples and stored them in the file &lt;code&gt;manhattanMuseums.ttl&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;curl --data-urlencode &amp;#34;query@manhattanMuseums.rq&amp;#34; \
 https://sophox.org/sparql -H &amp;#34;Accept: text/turtle&amp;#34; &amp;gt; manhattanMuseums.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(On October 25th when I first published this I thought that their SPARQL endpoint was down, but it turned out that my re-testing of the curl call was failing because of my own dumb typo.)&lt;/p&gt;
&lt;p&gt;Here are two triples that it retrieved about one museum that I highly recommend:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;osmnode:368061660 osmm:loc &amp;#34;Point(-73.9900266 40.7187837)&amp;#34;^^geo:wktLiteral ;
	&amp;lt;https://wiki.openstreetmap.org/wiki/Key:name&amp;gt; &amp;#34;Lower East Side Tenement Museum&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;h1 id=&#34;why-no-need-to-convert-the-data&#34;&gt;Why no need to convert the data?&lt;/h1&gt;
&lt;p&gt;Here is the cool part that meant that I didn&amp;rsquo;t have to convert any triples before loading &lt;code&gt;manhattanMuseums.ttl&lt;/code&gt; into GraphDB and issuing standard GeoSPARQL queries on it: while SPARQL has a perfectly decent selection of &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#operandDataTypes&#34;&gt;data types&lt;/a&gt;, you can define your own, and section 8.5.1 of the &lt;a href=&#34;https://www.ogc.org/standards/geosparql&#34;&gt;GeoSPARQL specification&lt;/a&gt; defines the &lt;a href=&#34;http://www.opengis.net/ont/geosparql#wktLiteral&#34;&gt;http://www.opengis.net/ont/geosparql#wktLiteral&lt;/a&gt; datatype for specifying geospatial coordinates. As you can see in the Tenement Museum example above, the OSM triples use that type, so I was all set.&lt;/p&gt;
&lt;p&gt;In Part 1 I also wrote &amp;ldquo;A proper geospatial query for something like all the museums within a mile of the &lt;a href=&#34;https://www.moma.org/&#34;&gt;Museum of Modern Art&lt;/a&gt; is more complicated because of the effect of the earth’s curvature.&amp;rdquo; It&amp;rsquo;s not so complicated with proper GeoSPARQL support because I can call the &lt;code&gt;geof:distance&lt;/code&gt; function, which is not supported by OpenStreetMap&amp;rsquo;s SPARQL endpoint but is supported by GraphDB as part of its GeoSPARQL support. I loaded &lt;code&gt;manhattanMuseums.ttl&lt;/code&gt; into GraphDB and ran the following query:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;
PREFIX geof: &amp;lt;http://www.opengis.net/def/function/geosparql/&amp;gt;
PREFIX uom:  &amp;lt;http://www.opengis.net/def/uom/OGC/1.0/&amp;gt;
PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;
PREFIX osmm: &amp;lt;https://www.openstreetmap.org/meta/&amp;gt; 

SELECT ?museumName ?metersFromMoma
WHERE  {
   ?moma   osmt:official_name &amp;#34;The Museum of Modern Art&amp;#34; ;
           osmm:loc ?momaLoc .
   ?museum osmt:tourism &amp;#34;museum&amp;#34; ;
           osmt:name ?museumName ;
           osmm:loc ?museumLoc . 
    # Find the distance from each museum to MoMA and save it
    BIND(round(geof:distance(?museumLoc,?momaLoc, uom:metre)) 
        AS ?metersFromMoma)
    FILTER(?metersFromMoma &amp;lt; 1610)  # Only those less than a mile away.
    FILTER(?museum != ?moma)        # Don&amp;#39;t bother showing MoMA itself.
}
ORDER BY ?metersFromMoma
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(I tried pulling address data as well, but not all museums had that, especially the ones that were close to MoMA.) With that query pasted into the file &lt;code&gt;museumsNearMoma.rq&lt;/code&gt;, the following pulled a TSV version of the results from my locally running copy of GraphDB&amp;hellip;&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;curl --header &amp;#34;Accept: text/tab-separated-values&amp;#34; --data-urlencode \
  &amp;#34;query@museumsNearMoma.rq&amp;#34; http://localhost:7200/repositories/OSMManhattanData
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;so that I could paste them here:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;?museumName	?metersFromMoma
Paley Center for Media	88
Museum of Arts and Design	766
International Center of Photography	827
National Geographic Encounter - Ocean Odyssey	925
American Folk Art Museum	1350
Frick Collection	1399
Asia Society	1450
Mount Vernon Hotel Museum	1503
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;GeoSPARQL has a lot more for GIS geeks than the &lt;code&gt;geof:distance&lt;/code&gt; function, so check out the spec for that. Also, after I wrote the first draft of this blog entry, I found out &lt;a href=&#34;https://twitter.com/opengeospatial/status/1313836646614343688&#34;&gt;on Twitter&lt;/a&gt; about a new document from the Open Geospatial Consortium, the standards group responsible for GeoSPARQL: &lt;a href=&#34;http://docs.ogc.org/wp/19-078r1/19-078r1.html&#34;&gt;OGC Benefits of Representing Spatial Data Using Semantic and Graph Technologies&lt;/a&gt;. It lists nice use cases that show the benefits of semantic technologies, describes the use cases addressed by GeoSPARQL, and proposes some extensions to that specification.&lt;/p&gt;
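&lt;p&gt;As one small, untested sketch of those other functions: the spec also defines boolean simple-features predicates such as &lt;code&gt;geof:sfWithin&lt;/code&gt;, so a query along these lines (the bounding polygon coordinates here are invented for illustration) would list museums inside a rectangle rather than within a radius:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX geof: &amp;lt;http://www.opengis.net/def/function/geosparql/&amp;gt;
PREFIX osmt: &amp;lt;https://wiki.openstreetmap.org/wiki/Key:&amp;gt;
PREFIX osmm: &amp;lt;https://www.openstreetmap.org/meta/&amp;gt;
PREFIX geo:  &amp;lt;http://www.opengis.net/ont/geosparql#&amp;gt;

SELECT ?museumName WHERE {
  ?museum osmt:tourism &amp;#34;museum&amp;#34; ;
          osmt:name ?museumName ;
          osmm:loc ?museumLoc .
  # A made-up rectangle over part of Midtown (longitude latitude order).
  FILTER(geof:sfWithin(?museumLoc,
      &amp;#34;POLYGON((-74.00 40.74, -73.97 40.74, -73.97 40.77, -74.00 40.77, -74.00 40.74))&amp;#34;^^geo:wktLiteral))
}
&lt;/code&gt;&lt;/pre&gt;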
&lt;p&gt;There is also an excellent Linked Data/Knowledge Graph angle to my example above, especially for &lt;a href=&#34;https://en.wikipedia.org/wiki/GLAM_(industry_sector)&#34;&gt;GLAM&lt;/a&gt; researchers: because the OSM data includes triples like this additional one about the Tenement Museum,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;osmnode:368061660 &amp;lt;https://wiki.openstreetmap.org/wiki/Key:wikidata&amp;gt; wd:Q901533 .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;you can connect up the geospatial data in OSM with triples from Wikidata to aggregate even more cool data about the entities in OSM. And, you can do it all in a local, free triplestore!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/openstreetmap">OpenStreetMap</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/geosparql">GeoSPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/gis">gis</category>
      
      <category domain="https://www.bobdc.com//categories/linked-data">linked-data</category>
      
    </item>
    
    <item>
      <title>Using SPARQL to do quick and dirty joins of CSV data</title>
      <link>https://www.bobdc.com/blog/sparqlcsvjoin/</link>
      <pubDate>Sun, 27 Sep 2020 12:03:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparqlcsvjoin/</guid>
      
      
      <description><div>Or data with other delimiters.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/join.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SPARQL and CSV logos&#34;/&gt;
&lt;p&gt;I recently needed to join two datasets at work, cross-referencing one property in a spreadsheet with another in a JSON file. I used a combination of &lt;a href=&#34;https://stedolan.github.io/jq/&#34;&gt;&lt;code&gt;jq&lt;/code&gt;&lt;/a&gt;, &lt;code&gt;perl&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, &lt;a href=&#34;https://www.man7.org/linux/man-pages/man1/uniq.1.html&#34;&gt;&lt;code&gt;uniq&lt;/code&gt;&lt;/a&gt;, and&amp;hellip; I won&amp;rsquo;t go into details.&lt;/p&gt;
&lt;p&gt;I wondered later if it would have been easier if I had used tarql (which I&amp;rsquo;ve blogged about &lt;a href=&#34;../tarql/&#34;&gt;before&lt;/a&gt;) to convert it all to RDF and then done the join with a SPARQL query. It turned out to be quite easy. A single SPARQL conversion query to run with tarql, with one line changed per dataset that I applied it to, was all I needed to create the RDF that let me do all the joins I wanted with additional simple queries. This will be even easier in the future as I re-use the conversion query with other datasets that I want to join.&lt;/p&gt;
&lt;p&gt;To demonstrate, I will show how I used this approach to join three CSV files: a list of student names and IDs, a list of course names and IDs, and a list of student and course IDs that shows who took which courses.&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t include field name headers in the data files because tarql would use them as property names and I wanted to make my scripts more generic by letting tarql use its default generic names of &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt;, and so on through the alphabet for dataset property names.&lt;/p&gt;
&lt;p&gt;Here are the data files, starting with &lt;code&gt;students.txt&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;s1001,Craig Ellis
s1002,Jane Jones
s1003,Richard Mutt
s1004,Cindy Marshall
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;courses.txt:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;c2001,Linear Algebra I
c2002,Impressionists and Post-Impressionists
c2003,Intro to Theravada Buddhism
c2004,Democracy in the Gilded Age
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;studentCourse.txt:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;s1002,c2001
s1002,c2004
s1003,c2001
s1003,c2004
s1004,c2001
s1004,c2003
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I had two goals that would require joins: to list the student names next to the names of the courses they took, without showing any IDs, and then to list the course names with the number of students enrolled in each. The first step was to convert the delimited files to RDF with tarql; I could then write short queries to fulfill the two goals.&lt;/p&gt;
&lt;p&gt;I used the query below to convert the &lt;code&gt;students.txt&lt;/code&gt; file to RDF. The &lt;code&gt;?u a t:student&lt;/code&gt; triple pattern in the CONSTRUCT clause creates a triple saying &amp;ldquo;this row of data represents an instance of this class&amp;rdquo; so that the join queries will know which data represents what kinds of things. Modifying this script to handle other data types merely requires changing the object of this one triple pattern. For example, the query that converts &lt;code&gt;courses.txt&lt;/code&gt; to Turtle has &lt;code&gt;t:course&lt;/code&gt; in that triple pattern instead of &lt;code&gt;t:student&lt;/code&gt;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# constructAllStudents.rq
PREFIX t: &amp;lt;http://learningsparql.com/ns/tarql/&amp;gt;

CONSTRUCT {
   ?u a t:student .

   ?u t:a ?a .
   ?u t:b ?b .
   ?u t:c ?c .
   ?u t:d ?d .
   ?u t:e ?e .
   ?u t:f ?f .
   # As many as you want. Can be more than the number of input columns.
} 
 WHERE {
   BIND (UUID() AS ?u)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This command line tells tarql to use that query to create the Turtle file for that dataset:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;  tarql -H constructAllStudents.rq students.txt &amp;gt; students.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I used similar command lines with the slight variations described above on that CONSTRUCT query to create &lt;code&gt;courses.ttl&lt;/code&gt; and &lt;code&gt;studentCourse.ttl&lt;/code&gt;.&lt;/p&gt;
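&lt;p&gt;For example, with the one-line variation saved under a hypothetical name such as &lt;code&gt;constructAllCourses.rq&lt;/code&gt; (identical to &lt;code&gt;constructAllStudents.rq&lt;/code&gt; except that its first triple pattern is &lt;code&gt;?u a t:course .&lt;/code&gt;), the corresponding command line would be:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;  tarql -H constructAllCourses.rq courses.txt &amp;gt; courses.ttl
&lt;/code&gt;&lt;/pre&gt;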
&lt;p&gt;Here is the SPARQL query that uses the data from those three Turtle files to join the student names with the course names. Your JOIN query will look a little different from mine, but not too different, because the use of properties such as &lt;code&gt;t:a&lt;/code&gt;, &lt;code&gt;t:b&lt;/code&gt;, and &lt;code&gt;t:c&lt;/code&gt; that correspond to tarql variables like &lt;code&gt;?a&lt;/code&gt; and &lt;code&gt;?b&lt;/code&gt; (instead of more specific names from a data file header line) let me make the query more generic.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# joinThem.rq

PREFIX t: &amp;lt;http://learningsparql.com/ns/tarql/&amp;gt;

SELECT ?studentName ?courseName WHERE {

   ?student a t:student ;
            t:a ?studentID ;
            t:b ?studentName .

   ?course a t:course ; 
            t:a ?courseID ;
            t:b ?courseName .
   
   ?class a t:studentCourse ; 
           t:a ?studentID ;
           t:b ?courseID .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;a href=&#34;https://jena.apache.org/documentation/query/index.html&#34;&gt;arq&lt;/a&gt; SPARQL processor&amp;rsquo;s ability to accept more than one &lt;code&gt;--data&lt;/code&gt; argument let me use a single command to run this join query with the three Turtle files as input:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;arq --query joinThem.rq --data courses.ttl --data students.ttl --data studentCourse.ttl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is the result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;----------------------------------------------------
| studentName      | courseName                    |
====================================================
| &amp;#34;Richard Mutt&amp;#34;   | &amp;#34;Linear Algebra I&amp;#34;            |
| &amp;#34;Richard Mutt&amp;#34;   | &amp;#34;Democracy in the Gilded Age&amp;#34; |
| &amp;#34;Jane Jones&amp;#34;     | &amp;#34;Linear Algebra I&amp;#34;            |
| &amp;#34;Jane Jones&amp;#34;     | &amp;#34;Democracy in the Gilded Age&amp;#34; |
| &amp;#34;Cindy Marshall&amp;#34; | &amp;#34;Linear Algebra I&amp;#34;            |
| &amp;#34;Cindy Marshall&amp;#34; | &amp;#34;Intro to Theravada Buddhism&amp;#34; |
----------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This data enables joins for other purposes as well. This next query joins the data and then shows course names with the number of students enrolled in each one:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# coursePopularity.rq

PREFIX t: &amp;lt;http://learningsparql.com/ns/tarql/&amp;gt;

SELECT ?courseName (COUNT(*) as ?students)
WHERE {

   ?student a t:student ;
            t:a ?studentID ;
            t:b ?studentName .

   ?course a t:course ; 
            t:a ?courseID ;
            t:b ?courseName .
   
   ?class a t:studentCourse ; 
           t:a ?studentID ;
           t:b ?courseID .
}

GROUP BY ?courseName
ORDER BY DESC(?students)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The command to run this just substitutes the new query for the previous one on the command line used earlier:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;arq --query coursePopularity.rq --data courses.ttl --data students.ttl --data studentCourse.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;And here is the result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--------------------------------------------
| courseName                    | students |
============================================
| &amp;#34;Linear Algebra I&amp;#34;            | 3        |
| &amp;#34;Democracy in the Gilded Age&amp;#34; | 2        |
| &amp;#34;Intro to Theravada Buddhism&amp;#34; | 1        |
--------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I want to reiterate that the trick described here is strictly for quick-and-dirty joins. Calling the first property in all the datasets &lt;code&gt;a&lt;/code&gt;, the second one &lt;code&gt;b&lt;/code&gt;, the third &lt;code&gt;c&lt;/code&gt;, and so on is just a convenience to reduce the amount of query editing needed. If I were going to convert data like this for long-term use, I would use more descriptive names for each property (maybe even take advantage of property names in header rows) and add some more modeling triples that define classes, properties, and their relationships.&lt;/p&gt;
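As a sketch of what that longer-term version might look like: if `students.txt` began with a header line of `id,name`, then dropping tarql's `-H` flag would bind the columns to `?id` and `?name`, so the query could use more descriptive property names (the `d:` namespace here is made up for illustration):

```sparql
# Hypothetical: descriptive properties driven by a header row of "id,name"
PREFIX d: <http://learningsparql.com/ns/demo#>

CONSTRUCT {
   ?u a d:Student ;
      d:studentID ?id ;
      d:name ?name .
}
WHERE {
   BIND (UUID() AS ?u)
}
```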
&lt;p&gt;I&amp;rsquo;m starting to think of &lt;code&gt;tarql&lt;/code&gt; and &lt;code&gt;arq&lt;/code&gt; as members of the Linux command toolbox that includes venerable old tools like &lt;code&gt;sort&lt;/code&gt; and &lt;code&gt;uniq&lt;/code&gt; as well as recent tools such as &lt;code&gt;jq&lt;/code&gt; and &lt;code&gt;xmllint&lt;/code&gt; that I&amp;rsquo;m seeing in more standard Linux distributions. (There&amp;rsquo;s actually one called &lt;a href=&#34;https://linux.die.net/man/1/join&#34;&gt;&lt;code&gt;join&lt;/code&gt;&lt;/a&gt; that can do simple joins with two files, but no more than two.) I can mix and match these tools to perform many different tasks with many different kinds of data, with no need to crank up a server or a memory-intensive GUI tool.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cropped photo of &amp;ldquo;JOIN&amp;rdquo; sign by Marcel van Schooten  via &lt;a href=&#34;https://www.flickr.com/photos/mvs/2726291360/in/photolist-59UXrs-AjAhi9-2j6NNxw-2if6SEK-2hrE2vk-59QJqB-59UXB1-59QJfD-59QJ4n-59UXd7-59UXaG-59UXoU-2imbSbT-59UXvQ-59QJm2-59UXmE-BH1Vtx-ebmbsH-6VfRaa-9f8Ckf-s3k6my-LcZE5m-S9L8dG-izJEJ9-9UwU19-P2nxMK-dUELft-8ChTRX-2hkY38S-2ixdWLS-opqjdc-69YrFP-2g3sJnF-22ekDfV-6FNeta-UqocvK-XrmL5J-2iN2wLZ-VrLJHd-F7k7jf-2hT2PXk-2djTWk8-244sMY8-7MNL4-svpQAK-Y5Ckcq-9TRAb1-iB35PL-5daoZo-XW1oEZ&#34;&gt;flickr&lt;/a&gt; &lt;a href=&#34;https://creativecommons.org/licenses/by/2.0/&#34;&gt;(CC BY 2.0)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/csv">CSV</category>
      
    </item>
    
    <item>
      <title>Generating MODS XML from RDF with Go templates</title>
      <link>https://www.bobdc.com/blog/rdf2modsxml/</link>
      <pubDate>Sun, 30 Aug 2020 12:20:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdf2modsxml/</guid>
      
      
      <description><div>Using a built-in Go(lang) feature to drive an RDF application.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/rdfgo.png&#34; width=&#34;200px&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;RDF, MODS, and Go logos&#34;/&gt;
&lt;p&gt;I had heard that &lt;a href=&#34;https://golang.org/&#34;&gt;Go&lt;/a&gt; (also known as &amp;ldquo;golang&amp;rdquo;) was an &lt;a href=&#34;http://pypl.github.io/PYPL.html&#34;&gt;increasingly popular&lt;/a&gt; newish programming language before I &lt;a href=&#34;http://www.bobdc.com/blog/changing-my-blogs-domain-name/&#34;&gt;migrated my blog&lt;/a&gt; from being generated by handmade XSLT scripts on snee.com to using the &lt;a href=&#34;https://gohugo.io/&#34;&gt;Hugo&lt;/a&gt; platform to generate it on bobdc.com. Hugo is written in Go, which was invented at Google (get it?) by three people, two of whom had contributed to the development of C, Unix, and important related technology at Bell Labs. Go provides an excellent basis for a website generation system because, although it prides itself on a fairly minimal core feature set, it provides templating of output with its standard libraries. As I wrote when I described the website migration, I never had to learn the programming language to get the website up and running, but I tweaked many &lt;a href=&#34;https://gohugo.io/templates&#34;&gt;Hugo templates&lt;/a&gt; to customize the website&amp;rsquo;s appearance.&lt;/p&gt;
&lt;p&gt;This made me wonder whether Go and its templates would be a good way to generate content from RDF. Short answer: yes. After learning some Go I wrote a program that reads in triples, loads them into an appropriate data structure, and then hands that off to a template for output. Once I&amp;rsquo;d written the program, most of my work consisted of building up the template text file with little need to go back and tweak and recompile the Go code.&lt;/p&gt;
&lt;p&gt;My demo project was to convert journal publishing RDF metadata into &lt;a href=&#34;https://www.loc.gov/standards/mods/&#34;&gt;MODS&lt;/a&gt; XML. The Metadata Object Description Schema standard is hosted at the Library of Congress and is very popular for library metadata. It has &lt;a href=&#34;https://www.loc.gov/standards/mods/modsrdf-primer.html&#34;&gt;its own RDF vocabulary&lt;/a&gt;, but a &lt;a href=&#34;https://wiki.lyrasis.org/display/samvera/MODS+and+RDF+Descriptive+Metadata+Subgroup&#34;&gt;MODS to RDF Working Group&lt;/a&gt; decided to &amp;ldquo;consider a range of widely-adopted RDF namespaces, rather than pursuing a straight XML-to-RDF approach using the MODS RDF Ontology or proposing a new formal ontology&amp;rdquo;. This quote comes from their &amp;ldquo;MODS to RDF Mapping Recommendations&amp;rdquo; (&lt;a href=&#34;https://t.co/L20MBi0BBs&#34;&gt;pdf&lt;/a&gt;), which describes how to use Dublin Core, Library of Congress, schema.org, Europeana, and other RDF vocabularies to express MODS metadata.&lt;/p&gt;
&lt;p&gt;This idea appealed to me because I see great potential  in modeling relationships between the rich metadata standards of the publishing and library worlds in order to help people take better advantage of combinations of these standards. Using RDFS (or more high-powered modeling tools such as OWL or SHACL, but maybe just RDFS), a  single system can more easily support multiple standards because it knows that if another system expects a &lt;a href=&#34;https://id.loc.gov/ontologies/bibframe.html#c_Title&#34;&gt;BIBFRAME title&lt;/a&gt; but the host system stores book titles as &lt;a href=&#34;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/title&#34;&gt;Dublin Core titles&lt;/a&gt;, RDFS triples defining these as equivalent can help to automate the delivery of whatever the destination system wants.&lt;/p&gt;
&lt;p&gt;In the publishing world, metadata delivery is more likely to be in XML because the content itself is often in XML. (Many people forget: that&amp;rsquo;s &lt;a href=&#34;http://www.bobdc.com/blog/a-brief-opinionated-history-of/&#34;&gt;why XML was invented&lt;/a&gt;.) So, after RDF-based tools reap the benefits of the modeling described above, eventual delivery often needs to be in XML.&lt;/p&gt;
&lt;p&gt;Shortly after I started this I described my plan for using Go templates to my daughter and she told me about the &lt;a href=&#34;https://jinja.palletsprojects.com/en/2.11.x/&#34;&gt;Jinja Python template library&lt;/a&gt;. Using that would have made this all much easier for me, because I already know Python and a &lt;a href=&#34;https://pypi.org/project/rdflib/&#34;&gt;nice RDF library for it&lt;/a&gt;, but I wanted to try it with Go specifically because templating is a standard part of the language as opposed to a community add-on library. (For embedding RDF values in templated XML, slicker proprietary alternatives to my novice Go coding are also available from MarkLogic and TopQuadrant.)&lt;/p&gt;
&lt;h1 id=&#34;the-goal&#34;&gt;The goal&lt;/h1&gt;
&lt;p&gt;For converting semi-realistic publishing RDF to MODS XML I took &lt;a href=&#34;https://www.loc.gov/standards/mods/userguide/examples.html#journal_article&#34;&gt;this sample journal XML&lt;/a&gt; from the MODS website and then wrote out an RDF version of that metadata using the MODS to RDF Mapping Recommendations mentioned above. I copied all of these triples with a new subject and slight changes to their objects so that they could play the role of metadata about a dummy second document; this let me test whether my program could output MODS data for multiple documents. Finally, I used &lt;a href=&#34;http://xmlsoft.org/xmllint.html&#34;&gt;&lt;code&gt;xmllint&lt;/code&gt;&lt;/a&gt; to validate the result against the &lt;a href=&#34;http://www.loc.gov/standards/mods/v3/mods-3-3.xsd&#34;&gt;MODS XML Schema&lt;/a&gt; to ensure that the result was valid MODS XML.&lt;/p&gt;
&lt;h1 id=&#34;writing-and-running-the-go-code&#34;&gt;Writing and running the Go code&lt;/h1&gt;
&lt;p&gt;Many resources for learning Go are available. &lt;a href=&#34;https://tour.golang.org/welcome/1&#34;&gt;This tour&lt;/a&gt; was fine for me. I postponed reading &lt;a href=&#34;https://golang.org/doc/code.html&#34;&gt;How to Write Go Code&lt;/a&gt; because it looked to be more about large-scale systems, but I should have read it earlier on to better understand how to import the RDF library (or, in Go terminology, &amp;ldquo;package&amp;rdquo;) that  I used.&lt;/p&gt;
&lt;p&gt;Somewhere in the middle of this project I started reading the book &lt;a href=&#34;http://www.gopl.io/&#34;&gt;The Go Programming Language&lt;/a&gt;, which was co-authored by Brian Kernighan—another Bell Labs alum with plenty of impressive UNIX-related accomplishments to his credit, including co-authoring the seminal book “The C Programming Language” with Dennis Ritchie. (I had to look up that book&amp;rsquo;s title just now because everyone has referred to it as “The K&amp;amp;R” since it was published in 1978.) I’d been considering a return trip to the K&amp;amp;R recently but don’t need to now because the Go book is more or less the modern version of that book for a modern version of C. The book&amp;rsquo;s website includes the complete tutorial chapter, and I highly recommend it. I am tempted to put this wonderful line from the tutorial on my blog’s template so that it shows up underneath all of my blog posts: “In the interests of keeping code samples to a reasonable size, our early examples are intentionally somewhat cavalier about error handling”.&lt;/p&gt;
&lt;p&gt;Having written the original SQL page for the &lt;a href=&#34;https://learnxinyminutes.com/&#34;&gt;Learn X in Y minutes&lt;/a&gt; site, I should have thought to look at its &lt;a href=&#34;https://learnxinyminutes.com/docs/go/&#34;&gt;Go page&lt;/a&gt; sooner. It&amp;rsquo;s a concise, handy resource that gives you a broad tour of the language quickly.&lt;/p&gt;
&lt;p&gt;Go has clear roots in C. It&amp;rsquo;s easier, though, with no pointer arithmetic or malloc memory management to worry about. I was surprised at how often I wanted to make it do something I hadn&amp;rsquo;t done with it before and got it to work by the second try.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://golang.org/pkg/&#34;&gt;standard Go packages&lt;/a&gt; include one for creating text templates and one for HTML templates. The HTML one includes some extra bits to protect against code injection and does not pass along &lt;code&gt;&amp;lt;!--&lt;/code&gt; HTML and XML comments &lt;code&gt;--&amp;gt;&lt;/code&gt; in the template to the output. I didn&amp;rsquo;t notice any other differences and found the HTML one to be fine for generating MODS XML.&lt;/p&gt;
&lt;p&gt;While Go packages are available to ease the querying of SPARQL endpoints, there is currently no Go equivalent of Python&amp;rsquo;s RDFlib, which has its own SPARQL engine. I used the &lt;a href=&#34;https://github.com/knakk/rdf&#34;&gt;knakk/rdf&lt;/a&gt; Go package to read the triples out of the disk files that provide my program&amp;rsquo;s input. (As a bonus, this package can read several different RDF serializations.) My program was really just a variation on the &lt;a href=&#34;https://github.com/knakk/rdf2rdf/blob/master/rdf2rdf.go&#34;&gt;sample program&lt;/a&gt; that came with that package, so there are parts of my Go program that I don&amp;rsquo;t completely understand, but hey, it works. This package does have &lt;a href=&#34;https://godoc.org/github.com/knakk/rdf&#34;&gt;godoc documentation&lt;/a&gt; available.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://goinbigdata.com/example-of-using-templates-in-golang/&#34;&gt;This Yury Pitsishin blog post&lt;/a&gt; was a good way to get started with templates. You can define the templates within the Go source code but will more typically put them in a separate document. This offers the benefit of letting you tune the output without recompiling the conversion code. The Go code&amp;rsquo;s template definition uses the &lt;code&gt;Template.ParseFiles&lt;/code&gt; method to specify the external file to use as a template, and then in the code the defined template&amp;rsquo;s &lt;code&gt;Execute&lt;/code&gt; method passes along a data structure that has been populated in the program to use with the output template.&lt;/p&gt;
&lt;p&gt;(Because this blog entry is getting long and my current Go skills are not something to show off, I&amp;rsquo;m not including my sample code here. You can find the Go code, template, sample input, and sample output &lt;a href=&#34;https://github.com/bobdc/misc/tree/master/rdf2modsxml&#34;&gt;on github&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The data structure that my program passes to my template is a map of maps that I called &lt;code&gt;docsMetadata&lt;/code&gt; because it stores the metadata for a set of documents. A Go map is like a Python dictionary, letting you store and retrieve things using a key. The &lt;code&gt;docsMetadata&lt;/code&gt; keys are subject URIs—three different subject URIs used in a given &lt;code&gt;docsMetadata&lt;/code&gt; instance would be specifying metadata for three different documents—and the things stored with them are maps whose keys are predicate URIs. Those keys give access to simple arrays (well, actually, &amp;ldquo;slices&amp;rdquo;, which are Go&amp;rsquo;s dynamic version of arrays) so that I can store more than one value for a given subject-predicate combination such as the following two triples from my sample input:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;https://example.org/objects/1&amp;gt; dce:subject &amp;#34;College librarians--Recruiting&amp;#34; .
&amp;lt;https://example.org/objects/1&amp;gt; dce:subject &amp;#34;College librarians--United States&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Outside of the &lt;code&gt;dce:subject&lt;/code&gt; values my demo rarely uses any slice entries beyond the first one. The MODS schema does allow multiple values for many of its metadata properties, though, so this provided a simple way to store more than one publisher address, media type, or other property if necessary. Also, the knakk RDF package does let you check whether a triple&amp;rsquo;s object is a URI, a literal value, or a blank node, so I could have done some fancier RDF processing of those. As a proof-of-concept demo I thought it best to just treat them all as strings for now.&lt;/p&gt;
&lt;h1 id=&#34;developing-the-mods-xml-template&#34;&gt;Developing the MODS XML template&lt;/h1&gt;
&lt;p&gt;You can put pretty much any text you want in a Go template. Nested pairs of curly braces store codes that give instructions to the compiled Go program about what to do with the data being passed in; often, this means inserting a particular component of that data structure. If the program passes an &lt;code&gt;Employee&lt;/code&gt; data structure that has a &lt;code&gt;Name&lt;/code&gt; field, then when the program sees &amp;ldquo;&amp;lt;p&amp;gt;Hello, &lt;b&gt;{{.Name}}&lt;/b&gt;&amp;lt;/p&amp;gt;&amp;rdquo; in the template it will replace the curly brace expression with the value of the &lt;code&gt;Name&lt;/code&gt; field. If you edit that part of the template file to say &amp;ldquo;&amp;lt;p&amp;gt;Hello, {{.Name}}&lt;b&gt; at {{.Address}}&lt;/b&gt;&amp;lt;/p&amp;gt;&amp;rdquo; you can then run the program and see the new version of the output with no need to recompile the program.&lt;/p&gt;
&lt;p&gt;Go&amp;rsquo;s templating language includes special keywords for tasks like conditional formatting. For example, if you don&amp;rsquo;t have &lt;code&gt;Address&lt;/code&gt; values for all of the employees, you could format the phrase above to only include &amp;quot; at &amp;quot; and the address value if there actually is an address value: &amp;ldquo;&amp;lt;p&amp;gt;Hello, {{.Name}}&lt;b&gt;{{if .Address}}&lt;/b&gt; at {{.Address}}&lt;b&gt;{{end}}&lt;/b&gt;&amp;lt;/p&amp;gt;&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;You can see more of the special template codes at &lt;a href=&#34;https://curtisvermeeren.github.io/2017/09/14/Golang-Templates-Cheatsheet&#34;&gt;Golang Templates Cheatsheet&lt;/a&gt;. An important one for my MODS project was &lt;code&gt;range&lt;/code&gt;, which lets you iterate over multiple values for a given property. I used it for the &lt;code&gt;dce:subject&lt;/code&gt; values mentioned above and also to enclose nearly the whole template so that I could output metadata about multiple journal documents. Because I was passing a map of maps to the template, referencing just the right bits was not as simple as pulling a Name value out of an Employee data structure, but it wasn&amp;rsquo;t too bad.&lt;/p&gt;
&lt;p&gt;One downside to working with these templates is the cryptic error messages caused by template problems. A missing curly brace could lead to an error message of &amp;ldquo;panic: runtime error: invalid memory address or nil pointer dereference&amp;rdquo; with no line number pointing to the template problem or other helpful information. Instead of celebrating Brian Kernighan&amp;rsquo;s cavalier approach to error handling in examples I should probably dig further into Go&amp;rsquo;s &lt;a href=&#34;https://blog.golang.org/error-handling-and-go&#34;&gt;facilities for that&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;running-it&#34;&gt;Running it&lt;/h1&gt;
&lt;p&gt;The knakk sample program uses an &lt;code&gt;in&lt;/code&gt; command line parameter to indicate the input file, so mine does too:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;rdf2modsxml -in modsjournals2.ttl &amp;gt; modsjournals2.xml
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The output had a lot of extra blank lines, which ultimately don&amp;rsquo;t matter in XML, but I sometimes ran the program like this to remove them:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;rdf2modsxml -in modsjournals2.ttl | awk &amp;#39;NF &amp;gt; 0&amp;#39; &amp;gt; modsjournals2.xml
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(Raise your hand if you know what the &amp;ldquo;k&amp;rdquo; in &amp;ldquo;awk&amp;rdquo; &lt;a href=&#34;https://en.wikipedia.org/wiki/AWK&#34;&gt;stands for&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Was it valid MODS XML? As I mentioned above, I used &lt;code&gt;xmllint&lt;/code&gt; (which seems to be part of most standard Linux distributions now and can be downloaded for Windows or MacOS) to validate the result against the &lt;a href=&#34;http://www.loc.gov/standards/mods/v3/mods-3-3.xsd&#34;&gt;MODS XML Schema&lt;/a&gt;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;xmllint --schema mods-3-4.xsd modsjournals2.xml --noout
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;--noout&lt;/code&gt; parameter tells &lt;code&gt;xmllint&lt;/code&gt; not to show any of the content and to just indicate whether the XML document conforms to the schema or not. The output of my rdf2modsxml program did conform.&lt;/p&gt;
&lt;h1 id=&#34;go-and-rdf-and-publishing-and-library-metadata&#34;&gt;Go and RDF and publishing and library metadata&lt;/h1&gt;
&lt;p&gt;If the development of a useful new tool requires the writing of code that imports libraries and then needs to be compiled to a binary version, that can be asking a bit much of people who are not full-time software developers. If that code is fairly simple (with a package to do the most difficult part already available) and the main work of using the tool consists of just editing a separate text file, then I think that the use of Go templates for RDF application development offers some real promise. With Hugo as a model, this could obviously be done to use RDF data in applications destined for browsers; I was especially happy to see that it works to generate XML that conforms to an important standard unrelated to HTML.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m also going to start being braver about messing around with the Hugo templates used to generate this blog!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>The HTML interface to your SPARQL endpoint is not your SPARQL endpoint</title>
      <link>https://www.bobdc.com/blog/endpointandcurl/</link>
      <pubDate>Sun, 19 Jul 2020 11:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/endpointandcurl/</guid>
      
      
      <description><div>Remember what the &#39;P&#39; in &#39;SPARQL&#39; stands for.</div><div>&lt;blockquote id=&#34;id103368&#34; class=&#34;pullquote&#34;&gt;If you have interesting data, we want to use it in application development!&lt;/blockquote&gt;
&lt;p&gt;Something that happens to me now and then: I&amp;rsquo;ll hear that an organization with a lot of interesting data (science, music, whatever) makes the data available on a SPARQL endpoint. I send my browser to the URL listed as the SPARQL endpoint and I see a web form. I enter a simple query on the web form to retrieve a few random triples, click the form&amp;rsquo;s button, and the results of my query appear. Then I enter fancier queries to &lt;a href=&#34;../exploring-a-sparql-endpoint/&#34;&gt;explore the endpoint&amp;rsquo;s data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then, if there is a clear indication of an endpoint URL that is different from their form&amp;rsquo;s URL, I append &lt;code&gt;?query=&lt;/code&gt; and an escaped version of a simple query to it so that I can &lt;a href=&#34;../curling-sparql/&#34;&gt;send the query to the endpoint with curl&lt;/a&gt;. If I see no clear indication of an endpoint URL that is different from this form&amp;rsquo;s URL, I&amp;rsquo;ll look around the website a bit for it, and if I still have no luck I&amp;rsquo;ll try using the form&amp;rsquo;s URL and several variations on it. (Below are some hints on these variations.)&lt;/p&gt;
&lt;p&gt;Sometimes I just can&amp;rsquo;t find a working endpoint URL. There are sites out there advertising a SPARQL endpoint where the only way to send a query to the endpoint is via the HTML form interface. I won&amp;rsquo;t name specific sites here, but it&amp;rsquo;s definitely a pattern I&amp;rsquo;ve noticed.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;SPARQL&amp;rdquo; stands for &amp;ldquo;SPARQL &lt;em&gt;Protocol&lt;/em&gt; and RDF Query Language&amp;rdquo;. The &lt;a href=&#34;https://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/&#34;&gt;SPARQL 1.1 Protocol specification&lt;/a&gt; tells us &amp;ldquo;This document specifies the SPARQL Protocol; it describes a means for conveying SPARQL queries and updates to a SPARQL processing service and returning the results via HTTP to the entity that requested them.&amp;rdquo; It also tells us that a SPARQL Protocol service is &amp;ldquo;[a]n HTTP server that services HTTP requests and sends back HTTP responses for SPARQL Protocol operations. The URI at which a SPARQL Protocol service listens for requests is generally known as a SPARQL endpoint&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;An &amp;ldquo;endpoint&amp;rdquo; that doesn&amp;rsquo;t support this protocol is not a SPARQL endpoint.  Curl provides many ways to send a query via HTTP and then process the results—my mention of it above links to something I wrote with several examples—and it&amp;rsquo;s a great way to test a proper endpoint.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not about curl, though; curl is just a great way to explore a service&amp;rsquo;s HTTP support. Any modern programming language supports HTTP, which means that you should be able to write a program in any of these languages that sends a request to a SPARQL endpoint and then processes the result &lt;em&gt;without needing any special SPARQL or RDF library&lt;/em&gt;. (Of course, there are many such libraries to make this processing even easier.) The curl utility just provides a convenient way to do quick and dirty tests of a SPARQL endpoint from the command line. The ability to do this from the command line, and from within a programming language that provides HTTP support, means that you can automate the execution of these queries and then mix and match the results with other processing to create cool applications. If the only way to issue SPARQL queries against your data is to enter a query on a web form and then click a button, then I can&amp;rsquo;t use your data in this kind of application development. If you have interesting data, we want to use it in application development!&lt;/p&gt;
&lt;h2 id=&#34;finding-the-endpoint&#34;&gt;Finding the endpoint&lt;/h2&gt;
&lt;p&gt;Ideally, the announcement of the endpoint tells you both the URL for the endpoint where you send HTTP requests and the URL for a web form front end to that endpoint. For example, &lt;a href=&#34;https://wiki.dbpedia.org/&#34;&gt;DBpedia&lt;/a&gt;&amp;rsquo;s endpoint is at &lt;code&gt;http://dbpedia.org/sparql&lt;/code&gt; and the web form interface is at &lt;code&gt;http://dbpedia.org/snorql/&lt;/code&gt;, where it uses a UI tool called &amp;ldquo;snorql.&amp;rdquo; Note that the snorql form says &amp;ldquo;SPARQL Explorer for &lt;a href=&#34;http://dbpedia.org/sparql&#34;&gt;http://dbpedia.org/sparql&lt;/a&gt;&amp;rdquo; right at the top. That&amp;rsquo;s the kind of clarity about the relationship between the form and the endpoint that I want to see more of out there. The &lt;a href=&#34;https://yago-knowledge.org/sparql&#34;&gt;yago&lt;/a&gt; endpoint form also does this nicely.&lt;/p&gt;
&lt;p&gt;Some places use the same URL for both the endpoint and the web interface to the endpoint, such as the European Bioinformatics Institute&amp;rsquo;s endpoint at &lt;a href=&#34;https://www.ebi.ac.uk/rdf/services/sparql&#34;&gt;https://www.ebi.ac.uk/rdf/services/sparql&lt;/a&gt;, the AGROVOC Thesaurus endpoint at &lt;a href=&#34;http://agrovoc.uniroma2.it/sparql&#34;&gt;http://agrovoc.uniroma2.it/sparql&lt;/a&gt;, and the JazzCats one at &lt;a href=&#34;http://cdhr-linkeddata.anu.edu.au/jazzcats-sparql/sparql&#34;&gt;http://cdhr-linkeddata.anu.edu.au/jazzcats-sparql/sparql&lt;/a&gt;. Using the same URL doesn&amp;rsquo;t mean that the HTML interface to the SPARQL endpoint is the same as the endpoint itself; their HTTP servers check whether a &lt;code&gt;query&lt;/code&gt; parameter was passed with the URL, and if not, they deliver the HTML page with the web form.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;ll notice how these endpoint URLs all end in &lt;code&gt;/sparql&lt;/code&gt;. Not all SPARQL endpoints do, but it&amp;rsquo;s a nice convention. If a SPARQL endpoint web form is at &lt;a href=&#34;http://www.example.com&#34;&gt;http://www.example.com&lt;/a&gt; and I see no clear indication of an endpoint URL, I&amp;rsquo;ll try &lt;a href=&#34;http://www.example.com/sparql&#34;&gt;http://www.example.com/sparql&lt;/a&gt; as an endpoint by appending a query parameter with a &lt;a href=&#34;https://www.motobit.com/util/url-decoder.asp&#34;&gt;URL-escaped&lt;/a&gt; version of a very simple query such as &amp;ldquo;SELECT * WHERE { ?s ?p ?o } LIMIT 5&amp;rdquo;. With curl, I can then test it with this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;curl http://example.com/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%205
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If that doesn&amp;rsquo;t work (for example, if the curl request gets you nothing or the HTML of an error message page) and the URL begins with &amp;ldquo;http://&amp;rdquo;, try adding an &amp;ldquo;s&amp;rdquo; after the &amp;ldquo;p&amp;rdquo;. Once you do get a SPARQL result set from an endpoint, it&amp;rsquo;s typically XML of the query results, and you can start exploring ways to get other formats such as &lt;a href=&#34;https://www.w3.org/TR/2013/REC-sparql11-results-json-20130321/&#34;&gt;JSON&lt;/a&gt; or &lt;a href=&#34;https://www.w3.org/TR/2013/REC-sparql11-results-csv-tsv-20130321/&#34;&gt;TSV&lt;/a&gt;. (Again, see my &lt;a href=&#34;../curling-sparql/&#34;&gt;curling SPARQL&lt;/a&gt; post for a quick tour of some possibilities.)&lt;/p&gt;
&lt;p&gt;You can also email the people running the site and say &amp;ldquo;Hey! Great data! I enjoyed entering queries on your form! Does your site have a SPARQL endpoint that supports the SPARQL protocol so that I can get the data with curl and other HTTP tools instead of just using a browser to see rendered HTML of the results?&amp;rdquo; It&amp;rsquo;s one of the reasons that I&amp;rsquo;m writing this blog entry—so I can just point to this long-winded explanation of the difference instead of trying to do a short summary in another email to one of those sites.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/curl">curl</category>
      
      <category domain="https://www.bobdc.com//categories/linked-data">linked-data</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Converting CSV to RDF with Tarql</title>
      <link>https://www.bobdc.com/blog/tarql/</link>
      <pubDate>Sun, 21 Jun 2020 11:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/tarql/</guid>
      
      
<description><div>Quick and easy and, if you like, streaming.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/csv2rdf.png&#34; width=&#34;200px&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;CSV to RDF&#34;/&gt;
&lt;!-- regarding https://twitter.com/namedgraph/status/1271065806693003266 I didn&#39;t cover https://github.com/AtomGraph/CSV2RDF because there is no executable file there. You have to build it.  --&gt; 
&lt;p&gt;I have seen several tools for converting spreadsheets to RDF over the years. They typically try to cover so many different cases that learning how to use them has taken more effort than just writing a short perl script that uses the &lt;code&gt;split()&lt;/code&gt; function, so that&amp;rsquo;s what I usually ended up doing. (Several years ago I did come up with &lt;a href=&#34;../converting-csv-to-rdf/&#34;&gt;another way&lt;/a&gt; that was more of a cute trick with Turtle syntax.)&lt;/p&gt;
&lt;p&gt;A year or two ago I learned about &lt;a href=&#34;https://github.com/tarql/tarql/&#34;&gt;Tarql&lt;/a&gt;, which lets you query delimited files as if they were RDF triples, and I definitely liked it. It seemed so simple, though, that I didn&amp;rsquo;t think it was worth a whole blog post. Recently, however, I was chatting with &lt;a href=&#34;https://www.semanticarts.com/team/#dave&#34;&gt;Dave McComb&lt;/a&gt; of Semantic Arts and learned that this simple utility often plays a large role in the work they do for their clients, so I played some more with Tarql. I also interviewed &lt;a href=&#34;https://www.semanticarts.com/team/#boris&#34;&gt;Boris Pelakh&lt;/a&gt; of Semantic Arts about what kinds of tasks they use Tarql for in their customer work.&lt;/p&gt;
&lt;p&gt;I downloaded Tarql from &lt;a href=&#34;https://github.com/tarql/tarql/releases&#34;&gt;https://github.com/tarql/tarql/releases&lt;/a&gt;, unzipped it,  found a shell script and batch file in a &lt;code&gt;bin&lt;/code&gt; subdirectory of the unzipped version, and was ready to run it.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll just jump in with a simple example before discussing the various possibilities. Here is a file I called &lt;code&gt;test1.csv&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;name,quantity,description,available
widget,3,for framing the blivets,false
blivet,2,needed for widgets,true
&amp;#34;like, wow&amp;#34;,4,testing the CSV parsing,true
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is a sample query to run against it:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# test1.rq
SELECT ?name ?quantity ?available
WHERE {}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;From the command line I tell Tarql to run the &lt;code&gt;test1.rq&lt;/code&gt; query with &lt;code&gt;test1.csv&lt;/code&gt; as input:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;tarql test1.rq test1.csv
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is the result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--------------------------------------
| name        | quantity | available |
======================================
| &amp;#34;widget&amp;#34;    | &amp;#34;3&amp;#34;      | &amp;#34;false&amp;#34;   |
| &amp;#34;blivet&amp;#34;    | &amp;#34;2&amp;#34;      | &amp;#34;true&amp;#34;    |
| &amp;#34;like, wow&amp;#34; | &amp;#34;4&amp;#34;      | &amp;#34;true&amp;#34;    |
--------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The first thing I like here is that the comma in &amp;ldquo;like, wow&amp;rdquo; doesn&amp;rsquo;t cause the problems that I had when using the perl &lt;code&gt;split()&lt;/code&gt; function, which split lines at every comma—even the quoted ones. (Perl has a library to get around that, but finding it and installing it was too much trouble for such a simple task.)&lt;/p&gt;
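Just to show what proper quote-aware CSV parsing looks like, here is a little standard-library Python sketch with the tricky row inlined (an illustration, not part of Tarql):

```python
import csv, io

# Two of the test1.csv rows; the csv module respects the quotes
# around "like, wow" instead of splitting at every comma.
data = '''name,quantity,description,available
widget,3,for framing the blivets,false
"like, wow",4,testing the CSV parsing,true
'''
rows = list(csv.reader(io.StringIO(data)))
print(rows[2][0])  # -> like, wow  (the quoted comma survives as one field)
```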
&lt;p&gt;If the query above had specified the dataset with the FROM keyword, like this,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# test2.rq
SELECT ?name ?quantity ?available

FROM &amp;lt;file:test1.csv&amp;gt;
WHERE {}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;then I wouldn&amp;rsquo;t have to mention the data source on the command line,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;tarql test2.rq
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;and I would get the same result.&lt;/p&gt;
&lt;p&gt;To really turn the data into triples, we can use a CONSTRUCT query. The following does this with the same data and, because Tarql treats everything as a string, it casts the &lt;code&gt;quantity&lt;/code&gt; values to integers and the &lt;code&gt;available&lt;/code&gt; values to Booleans:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX ex:  &amp;lt;http://www.learningsparql.com/ns/example/&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;

CONSTRUCT { 
   ?u ex:name ?name ;
      ex:quantity ?q ;
      ex:available ?a . 
}
FROM &amp;lt;file:test1.csv&amp;gt;
WHERE { 
  BIND (UUID() AS ?u) 
  BIND (xsd:integer(?quantity) AS ?q)
  BIND (xsd:boolean(?available) AS ?a)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is the result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix ex:  &amp;lt;http://www.learningsparql.com/ns/example/&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .

&amp;lt;urn:uuid:8a6ad6dc-1b2d-4900-a63f-d25286379a0a&amp;gt;
        ex:name       &amp;#34;widget&amp;#34; ;
        ex:quantity   3 ;
        ex:available  false .

&amp;lt;urn:uuid:66ddf7f2-8c37-4ecb-86cf-056234aad317&amp;gt;
        ex:name       &amp;#34;blivet&amp;#34; ;
        ex:quantity   2 ;
        ex:available  true .

&amp;lt;urn:uuid:c8db5512-3772-4193-a172-525181a712de&amp;gt;
        ex:name       &amp;#34;like, wow&amp;#34; ;
        ex:quantity   4 ;
        ex:available  true .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;a href=&#34;http://tarql.github.io/&#34;&gt;Tarql documentation&lt;/a&gt; shows a lot more options and its &lt;a href=&#34;http://tarql.github.io/examples/&#34;&gt;examples&lt;/a&gt; page shows several cool things. And, of course, you have the full power of SPARQL to manipulate the data that you&amp;rsquo;re pulling from tables; one example is my use of the &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#func-uuid&#34;&gt;&lt;code&gt;UUID()&lt;/code&gt;&lt;/a&gt; function in the CONSTRUCT query above. Another nice example is a &lt;a href=&#34;https://gist.github.com/jaw111/902a03f40eca46b685a1096fda1d3542&#34;&gt;federated query&lt;/a&gt; with Tarql that &lt;a href=&#34;https://www.twitter.com/wohnjalker&#34;&gt;John Walker&lt;/a&gt; put together.&lt;/p&gt;
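For comparison, the UUID minting and datatype casts in that CONSTRUCT query can be sketched in plain standard-library Python (just an illustration with shortened ex: names, not how Tarql works internally):

```python
import csv, io, uuid

def row_to_triples(row):
    # Mint a urn:uuid: subject the way UUID() does, and cast the
    # quantity and available values like xsd:integer and xsd:boolean.
    s = "urn:uuid:" + str(uuid.uuid4())
    return [(s, "ex:name", row["name"]),
            (s, "ex:quantity", int(row["quantity"])),
            (s, "ex:available", row["available"] == "true")]

reader = csv.DictReader(io.StringIO(
    "name,quantity,description,available\nwidget,3,for framing the blivets,false\n"))
triples = [t for r in reader for t in row_to_triples(r)]
```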
&lt;p&gt;The original version of Tarql is among many contributions that &lt;a href=&#34;http://richard.cyganiak.de/&#34;&gt;Richard Cyganiak&lt;/a&gt; has made to RDF-related software over the years. As he told me in an email,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I started the project in 2013 when I was still at NUI Galway (formerly DERI), with large contributions from my then-colleague Fadi Maali, and Emir Munoz from Fujitsu Labs. We were working with open data from a number of government data catalogs at the time, and this data often came as CSV files. Tarql started out as a quick hack to help with ingesting that data into our RDF-based tools. The hack proved quite successful. But to this day, Tarql is really just a thin wrapper around Apache Jena&amp;rsquo;s ARQ query engine. All the hard work happens there.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;One important point in the design is that it can stream. That is, only a small part of input and output need to be kept in-memory at any given time. That makes it work well on large CSV files. Again, Jena made it possible by providing building blocks that support streaming operation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;It&amp;rsquo;s a testament to the design of SPARQL, really. The syntax is so nice and concise, and the underlying model so flexible, that it can be adapted to quite different tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Because Richard pointed out that it can stream, I wanted to show this alternative to my first command line above, which does the same thing but takes its input from stdin:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;tarql --stdin test1.rq &amp;lt; test1.csv
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I asked Semantic Arts&amp;rsquo; Boris Pelakh a few things about the role that Tarql plays in the work that Semantic Arts does for their customers and it turns out that it&amp;rsquo;s a pretty big role.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Boris, to start, tell me a little about what Semantic Arts does and where Tarql fits in.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Semantic Arts provides consulting services, helping companies transform their data models to a semantic graph paradigm, while helping them achieve data harmonization and improve comprehension and efficiency. We use Tarql to transform tabular data (either spreadsheets or SQL exports) into RDF for further processing. It is an essential part of our ETL process.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Where do the tables come from that you&amp;rsquo;re feeding to Tarql? From customer data or from tables that Semantic Arts staff develop as part of their research into the company?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is primarily customer data—bulk CSV, XLS, or SQL exports. In almost all our engagements, the customers already have a large volume of data, either in relational databases or some sort of data warehouse using something like Hadoop or S3. We have used Tarql to transform transaction data, asset inventories, dataset metadata, and so forth.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What do you do with the Tarql output?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is generally loaded into our local AllegroGraph store during the development process or the client&amp;rsquo;s chosen triple store during production. We then help our clients build semantic applications on top of that triple store. I have also set up ETL pipelines where Tarql runs in EC2 instances and uses S3 to load the generated RDF into Neptune for a scalable solution.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;I believe Semantic Arts helps customers come up with some overall business process schemas or related artifacts; having this data in a triplestore like Allegrograph probably helps a lot with that.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Absolutely. In several engagements we were able to run graph analytics on the imported data for insights as well as running validation, either via SHACL or SPARQL, to help improve data quality.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What kinds of roles does that RDF play in the deliverables that you eventually provide to the customer?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In our view, RDF provides all the best features of both relational and property graph databases, and is an ideal foundation for an enterprise data system.  We help our customers migrate their siloed data into a unified, semantic model defined by an enterprise ontology that we help build. Then, we develop a semantic application stack (APIs and UIs) that take advantage of the newly enriched data.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;So triples in Allegrograph provide the raw material for what eventually ends up as the enterprise ontology.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;While we use AllegroGraph internally, we do not mandate a specific triple store to our customers, instead working with their preferred infrastructure. We have worked with Stardog, AWS Neptune, and MarkLogic, among others. But yes, the instance data created via Tarql, along with classes and properties defined in Protege, provided a unified enterprise ontology for the customer to use.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Tarql provides a lot of potential command line switches to use. Are there interesting ones that you feel many people miss out on?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--dedup&lt;/code&gt; option added in Tarql 1.2 (I believe) helps reduce the size of the generated RDF by avoiding the generation of duplicate triples. Tuning the deduplication window size is a careful compromise between the memory footprint of the transform and the output size, and is tuned per pipeline. The support for &lt;code&gt;apf:strSplit&lt;/code&gt;, which allows for the generation of multiple RDF result sets from a single input line, has also been helpful in the past, though that is internal to the query and not a command line option.&lt;/p&gt;
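To picture why that window size is a compromise, here is a toy Python sketch of windowed deduplication (my illustration, not Tarql&amp;rsquo;s actual code): a duplicate inside the window gets dropped, but one arriving after the window has scrolled past slips through into the output:

```python
from collections import OrderedDict

def dedup(items, window):
    # Remember only the last `window` distinct items seen;
    # a bigger window means fewer duplicates but more memory.
    seen = OrderedDict()
    for item in items:
        if item in seen:
            continue
        seen[item] = True
        if len(seen) > window:
            seen.popitem(last=False)  # forget the oldest item
        yield item

out = list(dedup(["a", "b", "a", "c", "d", "e", "a"], window=3))
print(out)  # -> ['a', 'b', 'c', 'd', 'e', 'a']: the last 'a' slipped past the window
```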
&lt;p&gt;&lt;em&gt;Tarql has some &amp;ldquo;magic&amp;rdquo; predicates?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Yes. For example, &lt;code&gt;tarql:expandPrefix&lt;/code&gt; is very useful when minting URIs in a CONSTRUCT query.  It avoids hard-coding of namespaces into the transformation, lending flexibility and ease of maintenance. Also, the magic &lt;code&gt;?ROWNUM&lt;/code&gt; variable that Tarql provides into the bindings is nice for generating unique IRIs when the data set does not have unique keys.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Boris is also working on a Python implementation of Tarql called &lt;a href=&#34;https://github.com/RDFLib/pyTARQL&#34;&gt;pyTARQL&lt;/a&gt; that looks like it could be useful for a lot of developers.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>SPARQL in a Jupyter Notebook</title>
      <link>https://www.bobdc.com/blog/jupytersparql/</link>
      <pubDate>Sun, 31 May 2020 11:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/jupytersparql/</guid>
      
      
      <description><div>For real this time.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/jupyterSparql.png&#34; width=&#34;200px&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Jupyter and SPARQL logos&#34;/&gt;
&lt;p&gt;A few years ago I wrote a blog post titled &lt;a href=&#34;../sparql-in-a-jupyter-aka-ipytho/&#34;&gt;SPARQL in a Jupyter (a.k.a. IPython) notebook: With just a bit of Python to frame it all&lt;/a&gt;. It described how &lt;a href=&#34;https://jupyter.org/&#34;&gt;Jupyter&lt;/a&gt; notebooks, which have become increasingly popular in the data science world, are an excellent way to share executable code and the results and documentation of that code. Not only do these notebooks make it easy to package all of this in a very presentable way; they also make it easy for your reader to tweak the code in a local copy of your notebook, run the new version, and see the result. This is an especially effective way to help someone understand how a given block of code works.&lt;/p&gt;
&lt;p&gt;When these notebooks were first invented they were known as IPython (&amp;ldquo;Interactive Python&amp;rdquo;) notebooks. At the time, all the executable code was Python, but since then the renaming to &amp;ldquo;Jupyter&amp;rdquo; has been accompanied by support for more and more languages—even &lt;a href=&#34;https://vatlab.github.io/sos-docs/doc/user_guide/multi_kernel_notebook.html&#34;&gt;multiple languages in the same notebook&lt;/a&gt;. It wasn&amp;rsquo;t supporting SPARQL yet when I wrote the post described above, but my &amp;ldquo;just a bit of Python to frame it all&amp;rdquo; automated the handoff of SPARQL queries to the &lt;a href=&#34;https://github.com/RDFLib/rdflib&#34;&gt;rdflib&lt;/a&gt; Python library so that ideally even someone who didn&amp;rsquo;t know Python could enter SPARQL queries into a notebook and see the results as part of the notebook.&lt;/p&gt;
&lt;p&gt;The wait for the real thing is over. &lt;a href=&#34;https://github.com/paulovn&#34;&gt;Paulo Villegas&lt;/a&gt; has released a SPARQL kernel for Jupyter notebooks that lets us run queries natively, and I have been having some fun with it. The project&amp;rsquo;s &lt;a href=&#34;https://github.com/paulovn/sparql-kernel&#34;&gt;sparql-kernel&lt;/a&gt; git repository has good documentation in its readme file. There&amp;rsquo;s no need to clone the project; the following three commands installed the sparqlkernel files locally for me, installed those into my copy of Jupyter, and then started up Jupyter.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;pip install sparqlkernel
jupyter sparqlkernel install --user
jupyter notebook
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At this point I was looking at Jupyter in my browser, and when I clicked the &amp;ldquo;New&amp;rdquo; button to create a new notebook I saw SPARQL as a choice right under &amp;ldquo;Python 3&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;While the SPARQL processing in my earlier post about Jupyter was handled by rdflib, this SPARQL kernel functions more as a very nice interface to a SPARQL endpoint that you specify. Or endpoints that you specify—as we&amp;rsquo;ll see, it&amp;rsquo;s very easy to switch between endpoints in one notebook. You specify the endpoint to talk to using a Jupyter &amp;ldquo;magic&amp;rdquo; command, which is a special command that begins with a percent sign.&lt;/p&gt;
&lt;p&gt;Once I was set up with this, I created a new notebook titled &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/JupyterSPARQL/Jupyter%20and%20SPARQL%20and%20Dort%20or%20Dordrecht.ipynb&#34;&gt;Jupyter and SPARQL and Dort or Dordrecht&lt;/a&gt; where you can read and see the various steps I took to retrieve triples from two different endpoints about a famous J.M.W. Turner painting. (Another great thing about Jupyter: github understands it well enough to host the notebooks so that they look the same as they look in a browser pointing at a local Jupyter server. Sometimes when I follow the link to my new notebook, after a minute it tells me &amp;ldquo;Sorry, something went wrong&amp;rdquo; and displays a &amp;ldquo;reload&amp;rdquo; button, and then after clicking that button it usually works pretty quickly.) You can see the results of my queries right in the notebook, and if you download it and install Jupyter and sparql-kernel you can modify the queries and rerun them yourself. (For the notebook&amp;rsquo;s last query you&amp;rsquo;d need a triplestore such as &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Fuseki&lt;/a&gt; running locally at localhost:3030. It doesn&amp;rsquo;t even have to have any data in it; as you&amp;rsquo;ll see in my new notebook, I used Fuseki to execute a federated query across the other two endpoints.)&lt;/p&gt;
&lt;p&gt;While creating my new notebook, sometimes I was about to plug a new query into it and thought &amp;ldquo;I should put this query into its own file and send it off to the endpoint &lt;a href=&#34;http://www.bobdc.com/blog/curling-sparql/&#34;&gt;with curl&lt;/a&gt; just to make sure it works properly&amp;rdquo; because that&amp;rsquo;s such a reflex reaction for me. For trying out queries and iteratively tuning them, though, doing them right in the notebook is much easier than editing a text file and sending it off to the endpoint with a shell command, because I can see the query and results (or errors) right there in the same glance. Despite being a diehard &lt;a href=&#34;https://www.gnu.org/software/emacs/&#34;&gt;Emacs&lt;/a&gt; guy I&amp;rsquo;m pretty confident that this will be my new routine from now on. When I develop multiple related queries in parallel, although I love Emacs&amp;rsquo; &lt;a href=&#34;https://github.com/ljos/sparql-mode&#34;&gt;sparql-mode&lt;/a&gt; (which also hooks up to an endpoint and shows your result right with your query), I still have to keep track of which query is in which buffer. In a Jupyter notebook, I can put nicely-formatted text blocks before and after each query to describe what each query is supposed to do and to annotate my progress with each query.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t want to write things here that are redundant with my new notebook about the Turner painting or with my earlier blog entry about Jupyter, so I encourage you to read the latter if you&amp;rsquo;d like to learn more about why Jupyter notebooks are so great and the former if you want to see the new powers that sparql-kernel adds to Jupyter for SPARQL users. I know I&amp;rsquo;m going to be a much more regular user of this nice tool.&lt;/p&gt;
&lt;p&gt;(Note: just yesterday I learned that Jupyter&amp;rsquo;s competitor &lt;a href=&#34;https://zeppelin.apache.org/&#34;&gt;Apache Zeppelin&lt;/a&gt; also has a &lt;a href=&#34;https://zeppelin.apache.org/docs/0.9.0-preview1/interpreter/sparql.html&#34;&gt;SPARQL plugin&lt;/a&gt;, so that is something to check out as well.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/jupyter">Jupyter</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Living in a materialized world</title>
      <link>https://www.bobdc.com/blog/materializing/</link>
      <pubDate>Sun, 26 Apr 2020 11:15:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/materializing/</guid>
      
      
      <description><div>Managing inferenced triples with named graphs.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/harrisonAlbumCover.png&#34; width=&#34;200px&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Living in the Material World album cover&#34;/&gt;
&lt;p&gt;I&amp;rsquo;ve often thought that named graphs could provide an infrastructure for managing inferenced triples, and a &lt;a href=&#34;https://twitter.com/linkedktk/status/1231975320703688704&#34;&gt;recent Twitter exchange&lt;/a&gt; with Adrian Gschwend inspired me to follow through with a little demo.&lt;/p&gt;
&lt;p&gt;Before I describe this demo I&amp;rsquo;m going to review some basic ideas about RDF inferencing and database denormalization. Then I&amp;rsquo;ll describe one approach to managing your own inferencing with an RDF version of database denormalization.&lt;/p&gt;
&lt;h1 id=&#34;inferencing&#34;&gt;Inferencing&lt;/h1&gt;
&lt;p&gt;As I wrote in the &amp;ldquo;What Is Inferencing?&amp;rdquo; section of the &amp;ldquo;RDF Schema, OWL, and Inferencing&amp;rdquo; chapter of my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt;, &amp;ldquo;Webster&amp;rsquo;s New World College Dictionary defines &amp;lsquo;infer&amp;rsquo; as &amp;rsquo;to conclude or decide from something known or assumed.&amp;rsquo; When you do RDF inferencing, your existing triples are the &amp;lsquo;something known,&amp;rsquo; and your inference tools will infer new triples from them.&amp;rdquo; If you have triples saying that Lassie is an instance of dog, and dog is a subclass of mammal, and mammal is a subclass of animal, then an inferencing tool such as a SPARQL engine that implements RDFS will recognize the implications of the &lt;code&gt;rdfs:subClassOf&lt;/code&gt; predicate used to make the last two statements. This means that if you query for all instances of mammal or animal it will include Lassie in the result.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;Using SPARQL to Do Your Inferencing&amp;rdquo; section of that same chapter shows how a query like the following can implement some inferencing for this RDFS property if your SPARQL engine doesn&amp;rsquo;t have this feature built in:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 

CONSTRUCT { ?instance a ?super . }
WHERE { 
  ?instance a ?subclass . 
  ?subclass rdfs:subClassOf ?super . 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You&amp;rsquo;d need to write such rules for all of the parts of RDFS and OWL that you wanted to implement—and even that might not be enough. Once the query above created a triple saying that Lassie is a mammal, it would be done, but a proper inferencing engine would then infer from that new triple that Lassie is also an animal.&lt;/p&gt;
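That &amp;ldquo;keep applying the rule until nothing new appears&amp;rdquo; behavior is a fixpoint computation; here is a small illustrative Python sketch of the subclass rule (classes and instances are just strings here, not real RDF terms):

```python
def infer_types(instances, subclass_of):
    # instances: set of (instance, class) facts;
    # subclass_of: set of (subclass, superclass) statements.
    # Re-apply the rdfs:subClassOf rule until no new facts appear.
    facts = set(instances)
    while True:
        new = {(i, sup) for (i, c) in facts
               for (sub, sup) in subclass_of if sub == c} - facts
        if not new:
            return facts
        facts |= new

facts = infer_types({("Lassie", "dog")},
                    {("dog", "mammal"), ("mammal", "animal")})
print(("Lassie", "animal") in facts)  # -> True: this took two passes, not one
```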
&lt;p&gt;The above technique can still be useful for simple inferencing like implementation of the &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; property for data integration, as long as your subproperties don&amp;rsquo;t have subproperties, so I&amp;rsquo;ll call this technique &amp;ldquo;one-pass inferencing.&amp;rdquo; (I wrote about the implementation of similar inferencing in &lt;a href=&#34;http://www.bobdc.com/blog/driving-hadoop-data-integratio/&#34;&gt;Driving Hadoop data integration with standards-based models instead of code&lt;/a&gt;.)&lt;/p&gt;
&lt;h1 id=&#34;database-denormalization&#34;&gt;Database denormalization&lt;/h1&gt;
&lt;p&gt;To oversimplify a bit, relational database &lt;a href=&#34;https://en.wikipedia.org/wiki/Database_normalization&#34;&gt;normalization&lt;/a&gt; is the process of working out which properties should be stored in which tables to avoid redundancy, because redundancy generally leads to inefficiency. When you store your customers&amp;rsquo; addresses and information about each item that they ordered, you don&amp;rsquo;t want these in the same table; if one customer ordered three different items, then storing a copy of the address with all three items would take up unnecessary space and make it more difficult to update the address if that customer moves. If you store a unique customer number with the address in the customers table and also with each of the customer&amp;rsquo;s orders in a separate orders table, then when you want to list customer addresses with the items that each customer ordered, you tell the database system to do a &lt;a href=&#34;https://en.wikipedia.org/wiki/Relational_algebra#Joins_and_join-like_operators&#34;&gt;join&lt;/a&gt; of the tables using the customer number to cross-reference the information.&lt;/p&gt;
&lt;p&gt;Sometimes day-to-day operations of a large database system require millions of complex joins to fulfill common requests. This can lead a database administrator to introduce some redundancy in certain tables to increase the efficiency of these requests. We call this &lt;a href=&#34;https://en.wikipedia.org/wiki/Denormalization&#34;&gt;denormalization&lt;/a&gt;. Because of the potential problems of these redundancies, this requires careful management—perhaps clearing out and repopulating the denormalized tables every night at 2AM.&lt;/p&gt;
&lt;p&gt;Storing RDF triples that could otherwise be inferred dynamically, or &amp;ldquo;materializing&amp;rdquo; those triples, is similar. They&amp;rsquo;re considered redundant because if you have all the information necessary to infer a certain piece of information, why store that information in your dataset? Because repeated inferencing of that information will require repeated usage of compute power to perform the same task. When you&amp;rsquo;re doing SPARQL queries this also limits your choice of SPARQL processors, because different SPARQL processors support different levels of inferencing depending on their support for RDFS and different OWL profiles. Many can&amp;rsquo;t do any inferencing at all.&lt;/p&gt;
&lt;p&gt;Because you can do your own one-pass inferencing with CONSTRUCT queries (and with INSERT queries if you are using a triplestore that supports SPARQL UPDATE), you can do  your own materializing to get the effects of denormalization.&lt;/p&gt;
&lt;h1 id=&#34;using-named-graphs-to-manage-materialized-triples&#34;&gt;Using named graphs to manage materialized triples&lt;/h1&gt;
&lt;p&gt;The rest of this assumes that you are familiar with querying and updating of SPARQL named graphs. To be honest, I use these rarely enough that I re-read the &amp;ldquo;Named Graphs&amp;rdquo; section of my book&amp;rsquo;s &amp;ldquo;Updating Data with SPARQL&amp;rdquo; chapter as a review before I assembled the steps below.&lt;/p&gt;
&lt;p&gt;I mentioned above how the manager of a relational database might have to clear out and repopulate the denormalized tables periodically so that their information stays synchronized with the canonical data. With RDF, we can store materialized triples in named graphs to enable a similar effect. The steps below walk through one possible scenario for this using the &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Jena Fuseki&lt;/a&gt; triplestore.&lt;/p&gt;
&lt;p&gt;Imagine that my company has two subsidiaries, company1 and company2, that use different schemas to keep track of their employees, and I&amp;rsquo;m using RDFS inferencing to treat all that data as if it conformed to the same schema. Here is a sample of company1 data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# company1.ttl

@prefix c1d: &amp;lt;http://learningsparql.com/ns/company1/data#&amp;gt; . 
@prefix c1m: &amp;lt;http://learningsparql.com/ns/company1/model#&amp;gt; . 

c1d:rich c1m:firstName &amp;#34;Richard&amp;#34; . 
c1d:rich c1m:lastName &amp;#34;Mutt&amp;#34; . 
c1d:rich c1m:phone &amp;#34;342-667-9256&amp;#34; . 

c1d:jane c1m:firstName &amp;#34;Jane&amp;#34; . 
c1d:jane c1m:lastName &amp;#34;Smith&amp;#34; . 
c1d:jane c1m:phone &amp;#34;546-700-2543&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is some company2 data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# company2.ttl 

@prefix c2d: &amp;lt;http://learningsparql.com/ns/company2/data#&amp;gt; . 
@prefix c2m: &amp;lt;http://learningsparql.com/ns/company2/model#&amp;gt; . 

c2d:i432 c2m:firstname &amp;#34;Nanker Phelge&amp;#34; . 
c2d:i432 c2m:surname &amp;#34;Mutt&amp;#34; . 
c2d:i432 c2m:homephone &amp;#34;879-334-5234&amp;#34; . 

c2d:i245 c2m:firstname &amp;#34;Cindy&amp;#34; . 
c2d:i245 c2m:surname &amp;#34;Marshall&amp;#34; . 
c2d:i245 c2m:homephone &amp;#34;634-452-4678&amp;#34; . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The two datasets use different properties in different namespaces (such as &lt;code&gt;c1m:lastName&lt;/code&gt; vs. &lt;code&gt;c2m:surname&lt;/code&gt;) to keep track of the same kinds of information.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;upload files&amp;rdquo; tab of Fuseki&amp;rsquo;s web-based interface includes a &amp;ldquo;Destination graph name&amp;rdquo; field with a prompt of &amp;ldquo;Leave blank for default graph&amp;rdquo;. I specified a graph name of company1 when I uploaded company1.ttl and Fuseki gave this graph a full name of http://localhost:3030/myDataset/data/company1 because it was running on the default port of 3030 on my computer. (All of my queries below define &lt;code&gt;d:&lt;/code&gt; as a prefix for http://localhost:3030/myDataset/data/, so I&amp;rsquo;ll use that to save some typing here.)&lt;/p&gt;
&lt;p&gt;After uploading company2.ttl into a &lt;code&gt;d:company2&lt;/code&gt; named graph, I uploaded the following bit of modeling into a named graph called &lt;code&gt;d:model&lt;/code&gt;. It names the company1 and company2 properties as subproperties of equivalent schema.org properties.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# integrationModel.ttl

@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix c1m: &amp;lt;http://learningsparql.com/ns/company1/model#&amp;gt; . 
@prefix c2m: &amp;lt;http://learningsparql.com/ns/company2/model#&amp;gt; . 
@prefix schema: &amp;lt;http://schema.org/&amp;gt; . 

c1m:firstName rdfs:subPropertyOf schema:givenName . 
c1m:lastName rdfs:subPropertyOf schema:familyName . 
c1m:phone rdfs:subPropertyOf schema:telephone . 

c2m:firstname rdfs:subPropertyOf schema:givenName . 
c2m:surname rdfs:subPropertyOf schema:familyName . 
c2m:homephone  rdfs:subPropertyOf schema:telephone . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If I loaded all of the above triples into a triplestore that could do inferencing, I could query for &lt;code&gt;schema:givenName&lt;/code&gt;, &lt;code&gt;schema:familyName&lt;/code&gt;, and &lt;code&gt;schema:telephone&lt;/code&gt; values right away and get all of the company1 and company2 data with that one query. For this example, though, I&amp;rsquo;m going to show how to do one-pass inferencing to set the stage for a query that can retrieve all that data using the schema.org property names.&lt;/p&gt;
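&lt;p&gt;With an inferencing triplestore, that single query could be as simple as this sketch, which just asks for the schema.org versions of the properties:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX schema: &amp;lt;http://schema.org/&amp;gt;

SELECT ?person ?given ?family ?phone
WHERE
{
  ?person schema:givenName  ?given ;
          schema:familyName ?family ;
          schema:telephone  ?phone .
}
&lt;/code&gt;&lt;/pre&gt;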
&lt;p&gt;The next step was to do that inferencing—that is, to create the inferred triples. Before updating data in a triplestore with an INSERT command, it&amp;rsquo;s good to do a CONSTRUCT query to double-check that you&amp;rsquo;ll be creating what you had hoped to, so I ran the following query. It looks in a dataset&amp;rsquo;s default graph and any named graphs for resources that have properties that are subproperties of other properties and then creates triples using those superproperties:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX d: &amp;lt;http://localhost:3030/myDataset/data/&amp;gt;

CONSTRUCT  { ?s ?superProp ?o }
WHERE
{
   { ?s ?p ?o }
   UNION
   { GRAPH ?g { ?s ?p ?o } }
   GRAPH d:model {?p rdfs:subPropertyOf ?superProp } .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For example, when this query sees that &lt;code&gt;c2m:firstname&lt;/code&gt; is a subproperty of &lt;code&gt;schema:givenName&lt;/code&gt; and that &lt;code&gt;c2d:i245&lt;/code&gt; has a &lt;code&gt;c2m:firstname&lt;/code&gt; of &amp;ldquo;Cindy&amp;rdquo;, it constructs a triple saying that &lt;code&gt;c2d:i245&lt;/code&gt; has a &lt;code&gt;schema:givenName&lt;/code&gt; of &amp;ldquo;Cindy&amp;rdquo;. In other words, it expresses the original fact using a schema.org property in addition to the property from company2&amp;rsquo;s schema.&lt;/p&gt;
&lt;p&gt;The complete result of this query showed all of the company1 and company2 data, but expressed with the schema.org properties instead of the companies&amp;rsquo; original ones. Being the result of a CONSTRUCT query, though, these triples are temporary.&lt;/p&gt;
&lt;p&gt;I was then ready to run the INSERT version of this query so that the new triples would become part of my dataset. That is, I was ready to do the actual inferencing. The following similar query inserts those triples into their own &lt;code&gt;d:inferredData&lt;/code&gt; named graph so that when the time comes to update this redundant data, it will be simple to clean out these materialized triples.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX d: &amp;lt;http://localhost:3030/myDataset/data/&amp;gt;

INSERT { GRAPH d:inferredData  { ?s ?superProp ?o }}
WHERE
{
  { ?s ?p ?o }
  UNION
  { GRAPH ?g { ?s ?p ?o } }
  GRAPH d:model {?p rdfs:subPropertyOf ?superProp } .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I used this next query to see if the triples I had seen with the recent CONSTRUCT query all got added to this new &lt;code&gt;d:inferredData&lt;/code&gt; graph by the INSERT request. They had:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d: &amp;lt;http://localhost:3030/myDataset/data/&amp;gt;

SELECT ?s ?p ?o
WHERE
{ 
   GRAPH d:inferredData { ?s ?p ?o } 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At this point I had integrated data from the two companies to conform to a common, standard model, and I could proceed with all the benefits of this arrangement as I queried across the two sets of employees by using the shared schema.&lt;/p&gt;
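&lt;p&gt;For example, this query (a sketch using the same prefixes as before) lists everyone from both companies using the shared schema.org property names:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX d: &amp;lt;http://localhost:3030/myDataset/data/&amp;gt;

SELECT ?given ?family ?phone
WHERE
{
  GRAPH d:inferredData
  {
    ?person schema:givenName  ?given ;
            schema:familyName ?family ;
            schema:telephone  ?phone .
  }
}
ORDER BY ?family
&lt;/code&gt;&lt;/pre&gt;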
&lt;p&gt;But, let&amp;rsquo;s say that Jane Smith changes her contact number from 546-700-2543 to 546-111-2222. This gets updated in the original company1 data in the &lt;code&gt;d:company1&lt;/code&gt; named graph with the following update request:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d: &amp;lt;http://localhost:3030/myDataset/data/&amp;gt;
PREFIX c1d: &amp;lt;http://learningsparql.com/ns/company1/data#&amp;gt; 
PREFIX c1m: &amp;lt;http://learningsparql.com/ns/company1/model#&amp;gt; 

DELETE
{ GRAPH d:company1 { c1d:jane c1m:phone &amp;#34;546-700-2543&amp;#34; . } }
INSERT
{ GRAPH d:company1 { c1d:jane c1m:phone &amp;#34;546-111-2222&amp;#34; . } }
WHERE
{ GRAPH d:company1 { c1d:jane c1m:phone &amp;#34;546-700-2543&amp;#34; . } }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If I query the schema.org version of the data for Jane&amp;rsquo;s phone number I will still get her old one. This is easy enough to fix; first I blow away all the materialized triples,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d: &amp;lt;http://localhost:3030/myDataset/data/&amp;gt;
DROP GRAPH d:inferredData
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;and then I regenerate up-to-date versions with the same INSERT command I used earlier. Problem solved. (If I have terabytes of triples of employee data, this DROP GRAPH followed by a new inferencing pass is the part that I&amp;rsquo;d do at 2AM each morning.)&lt;/p&gt;
&lt;h1 id=&#34;applying-these-steps&#34;&gt;Applying These Steps&lt;/h1&gt;
&lt;p&gt;I did all this by going to various Fuseki screens and pasting queries in. Fuseki has a nice feature in which after you run any query—even an update query—it shows you the URL and the &lt;a href=&#34;https://curl.haxx.se/&#34;&gt;curl&lt;/a&gt; command that would execute the same operation. This lets you string together these steps in a shell script that automates their execution, which would be handy for a production application. Instead of pasting all those queries into web forms I could just run that script, or, in the case of the 2AM updates, have a cron job run the script as I slept.&lt;/p&gt;
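&lt;p&gt;A minimal sketch of such a script, assuming that the DROP GRAPH and INSERT requests above are saved in files called &lt;code&gt;dropInferred.ru&lt;/code&gt; and &lt;code&gt;insertInferred.ru&lt;/code&gt; (those file names, and the update endpoint URL, are just my local choices, not anything Fuseki requires):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;#!/bin/sh
# refreshInferredData.sh: drop the materialized triples, then
# regenerate them from the current source data.
curl --data-urlencode &amp;#34;update@dropInferred.ru&amp;#34; http://localhost:3030/myDataset/update
curl --data-urlencode &amp;#34;update@insertInferred.ru&amp;#34; http://localhost:3030/myDataset/update
&lt;/code&gt;&lt;/pre&gt;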
&lt;p&gt;For a production application, there are a few other things I might change. For example, if there were millions of triples of company1 data and millions of triples of company2 data I might do the inferencing over just one or the other instead of everything at once. Assuming that they got updated on different schedules (because they are, after all, different companies) this would skip some unnecessary processing.&lt;/p&gt;
&lt;p&gt;The ultimate lesson is that while named graphs are not particularly popular in typical SPARQL usage, they can be useful for managing large collections of triples in which different sets of triples play different roles, and the materialization of inferred triples is one nice example.&lt;/p&gt;
&lt;p&gt;Are you using named graphs for any production application? Let me know at &lt;a href=&#34;https://twitter.com/bobdc&#34;&gt;@bobdc&lt;/a&gt; or at &lt;a href=&#34;https://twitter.com/learningsparql&#34;&gt;@learningsparql&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Querying Wikidata for data that you just entered yourself</title>
      <link>https://www.bobdc.com/blog/editingwikidata/</link>
      <pubDate>Sun, 29 Mar 2020 12:08:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/editingwikidata/</guid>
      
      
      <description><div>After about four minutes.</div><div>&lt;p&gt;Last month in &lt;a href=&#34;../wd2so/&#34;&gt;Populating a Schema.org dataset from Wikidata&lt;/a&gt; I talked about pulling data out of Wikidata and using it to create Schema.org triples, and I hinted about the possibility of updating Wikidata data directly. The SPARQL fun of this is to then perform queries against Wikidata and to see your data edits reflected within a few minutes. I was pleasantly surprised at how quickly edits showed up in query results, so I thought I would demo it with a little video.&lt;/p&gt;
&lt;p&gt;I had hoped that a video of a single unbroken shot could show me edit some data and then query for it and see the  edits reflected. As it turned out, it wasn&amp;rsquo;t updated in the back end database quickly enough for that, so you don&amp;rsquo;t see the edit reflected in the query I made right after performing the edit in the video. As you&amp;rsquo;ll see in the screenshot below, the new data did show up about four minutes later.&lt;/p&gt;
&lt;p&gt;Here is my four-minute video that would have been about seven minutes if, after editing data and trying immediately to query Wikidata&amp;rsquo;s SPARQL endpoint for the new data, I had kept recording and kept querying until I saw the edit reflected in the query result.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/HfpdS_5omi8&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;(One quick apology: not minding my Ps and Qs, I said &amp;ldquo;pname&amp;rdquo; at 1:01 when I meant to say &lt;a href=&#34;https://www.w3.org/TR/1999/REC-xml-names-19990114/#NT-QName&#34;&gt;qname&lt;/a&gt;, and even that wasn&amp;rsquo;t quite right; I was just talking about the URI&amp;rsquo;s local name, which would need a prefix to be a proper qname.)&lt;/p&gt;
&lt;p&gt;As you see in the video, I queried for Keith Richards&amp;rsquo; roles in the Rolling Stones, used the web interface to add &amp;ldquo;songwriter&amp;rdquo; as an additional role, and queried right away to see if this value showed up. It didn&amp;rsquo;t, and the &lt;code&gt;date&lt;/code&gt; command showed that I was checking this at 11:09:52.&lt;/p&gt;
&lt;p&gt;After I  finished recording the video I created a shell script called &lt;code&gt;temp1.sh&lt;/code&gt; with the curl command that sent the &lt;code&gt;kr.rq&lt;/code&gt; SPARQL query to Wikidata&amp;rsquo;s endpoint and a &lt;code&gt;date&lt;/code&gt; command to show when this happened. Once I saw that this two-line script worked, I added two more lines to make it a perpetual loop so that I could watch it and see what time &amp;ldquo;songwriter&amp;rdquo; showed up as one of Keith&amp;rsquo;s roles. As soon as I started up the looping version for the first time (11:13, as you can see below) it turned out that I didn&amp;rsquo;t need the loop: the available data was apparently updated just as I made that last edit to the script.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/keithQueryTerminal.jpg&#34; border=&#34;0&#34;  /&gt;
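&lt;p&gt;Reconstructing that looping script as a sketch (the 30-second pause and the exact ordering here are guesses, not necessarily what my script looked like), it was something like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;#!/bin/sh
# Poll the Wikidata endpoint with the kr.rq query, timestamping each pass.
while true
do
  date
  curl --data-urlencode &amp;#34;query@kr.rq&amp;#34; -H &amp;#34;Accept: text/tab-separated-values&amp;#34; https://query.wikidata.org/bigdata/namespace/wdq/sparql
  sleep 30
done
&lt;/code&gt;&lt;/pre&gt;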
&lt;p&gt;Here is the query if you&amp;rsquo;d like to try it yourself:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# following two lines should be executed as one if you use curl for this:
# curl --data-urlencode &amp;#34;query@kr.rq&amp;#34; -H &amp;#34;Accept: text/tab-separated-values&amp;#34;  
# https://query.wikidata.org/bigdata/namespace/wdq/sparql

SELECT ?roleName WHERE {
  wd:Q189599 p:P361 ?roleStatement .          # Keith Richards has-role
  ?roleStatement rdf:type wikibase:BestRank ; # The best role statement!
                 pq:P2868 ?role .             # subject-has-role ?role.
  ?role rdfs:label ?roleName . 
  FILTER ( lang(?roleName) = &amp;#34;en&amp;#34; )
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I&amp;rsquo;m sure that sometimes it will take longer than four minutes and sometimes it may be quicker, but that&amp;rsquo;s not a lot of time to wait, and it was fun seeing how my edit to this wonderful growing database was available to a SPARQL query sent to the database&amp;rsquo;s endpoint just a few minutes later.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/angiesingle.jpg&#34; border=&#34;0&#34; width=&#34;280&#34; style=&#34;display: block; margin-left: auto; margin-right: auto; &#34; /&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">sparql</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">wikidata</category>
      
    </item>
    
    <item>
      <title>Populating a Schema.org dataset from Wikidata</title>
      <link>https://www.bobdc.com/blog/wd2so/</link>
      <pubDate>Sun, 23 Feb 2020 11:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/wd2so/</guid>
      
      
      <description><div>Rock and Roll!</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/wp2so.jpg&#34; border=&#34;0&#34; width=&#34;400&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; /&gt;
&lt;p&gt;As the &lt;a href=&#34;https://schema.org/&#34;&gt;Schema.org&lt;/a&gt; vocabulary gets applied to more and more data and the data in &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt; grows and grows, it&amp;rsquo;s only natural to think about the possibilities of creating Schema.org datasets that are populated from Wikidata.&lt;/p&gt;
&lt;p&gt;From the Wikidata side, the &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Schema.org&#34;&gt;Wikidata:Schema.org&lt;/a&gt; page provides an excellent discussion of the relationship between the two efforts. To summarize some key points: Schema.org is structurally much simpler than Wikidata to ease adoption, but because Schema.org provides no entity identifiers (for example, identifiers for specific people and places) &amp;ldquo;Schema.org is considering to encourage the use of &lt;a href=&#34;https://www.wikidata.org/wiki/Q2013&#34;&gt;Wikidata&lt;/a&gt; as a common entity base for the target of the &lt;a href=&#34;http://schema.org/sameAs&#34;&gt;schema:sameAs&lt;/a&gt; relation (not to be confused with &lt;a href=&#34;https://www.w3.org/TR/owl-ref/#sameAs-def&#34;&gt;owl:sameAs&lt;/a&gt;).&amp;rdquo;&lt;/p&gt;
&lt;p&gt;From the Schema.org side, &lt;a href=&#34;https://github.com/schemaorg/schemaorg/issues/280&#34;&gt;https://github.com/schemaorg/schemaorg/issues/280&lt;/a&gt; has some discussion about the mapping of Schema.org to the Wikidata model. It&amp;rsquo;s mostly about modeling the relationships between common classes and properties—important tasks if you want to automate large-scale conversion between the two models. The &lt;a href=&#34;https://github.com/okfn-brasil/schemaOrg-Wikidata-Map/blob/master/docs/quering-Wikidata.md&#34;&gt;schemaOrg-Wikidata-Map&lt;/a&gt; page, &amp;ldquo;for issue-280&amp;rsquo;s working group subsidy and reference&amp;rdquo;, has some good ideas for creating those mappings.&lt;/p&gt;
&lt;p&gt;In a recent &lt;a href=&#34;https://twitter.com/danbri/status/1205210324435193856&#34;&gt;Twitter thread&lt;/a&gt; about Wikidata &lt;a href=&#34;https://www.twitter.com/danbri&#34;&gt;Dan Brickley&lt;/a&gt; asked me if I was &amp;ldquo;interested in cooking up clever queries to help slurp out subsets&amp;rdquo;. Yes! The query below pulls out almost 21,000 Wikidata triples of album and musician data for bands with a genre of rock and roll (or, in Wikidata terms, bands with a &lt;code&gt;wdt:P136&lt;/code&gt; of &lt;code&gt;wd:Q11399&lt;/code&gt;). Wikidata currently has this kind of data for about 530 bands.&lt;/p&gt;
&lt;p&gt;As with any mapping from one data model to another, some properties let you simply substitute a new name for an old name but others require judgment calls and some model traversal to get at what you want. I wanted to point out a domain-specific data model traversal issue I came across and a more general Wikidata one that will be an issue for people working with any data domain, not just rock and roll bands.&lt;/p&gt;
&lt;p&gt;The domain-specific issues are important because while there are dreams of a generalized mapping between Wikidata and Schema.org, these two schemas both cover so much territory that it&amp;rsquo;s just not feasible. Here is my small example: while the Kinks studio album &amp;ldquo;Face to Face&amp;rdquo; is an instance of &amp;ldquo;album&amp;rdquo; in Wikidata (&lt;code&gt;wd:Q675825 wdt:P31 wd:Q482994&lt;/code&gt;), the Rolling Stones studio album &amp;ldquo;Beggars Banquet&amp;rdquo; is an instance of studio album (&lt;code&gt;wd:Q339065 wdt:P31 wd:Q208569&lt;/code&gt;) which is a subclass of album (&lt;code&gt;wd:Q208569 wdt:P279 wd:Q482994&lt;/code&gt;), as are live album (&lt;a href=&#34;https://www.wikidata.org/wiki/Q209939&#34;&gt;&lt;code&gt;wd:Q209939&lt;/code&gt;&lt;/a&gt;) and compilation album (&lt;a href=&#34;https://www.wikidata.org/wiki/Q222910&#34;&gt;&lt;code&gt;wd:Q222910&lt;/code&gt;&lt;/a&gt;). Because of this, my query that pulls out Wikidata triples to convert to Schema.org must look for instances of album and instances of subclasses of album. If the SPARQL engine could do inferencing, I could just ask for instances of album, because an instance of a subclass is an instance of its superclass, but this SPARQL engine won&amp;rsquo;t do inferencing. Schema.org actually does have a &lt;a href=&#34;https://schema.org/MusicAlbumProductionType&#34;&gt;&lt;code&gt;schema:MusicAlbumProductionType&lt;/code&gt;&lt;/a&gt; class whose instances such as &lt;code&gt;schema:StudioAlbum&lt;/code&gt;, &lt;code&gt;schema:LiveAlbum&lt;/code&gt;, and &lt;code&gt;schema:CompilationAlbum&lt;/code&gt; could store this distinction between various types of albums, but this doesn&amp;rsquo;t change the fact that Wikidata lists the studio album &amp;ldquo;Beggars Banquet&amp;rdquo; as an instance of &amp;ldquo;studio album&amp;rdquo; but the studio album &amp;ldquo;Face to Face&amp;rdquo; as an instance of the studio album superclass &amp;ldquo;album&amp;rdquo;.  
(Coming soon: how to correct the Wikidata data!)&lt;/p&gt;
&lt;p&gt;Wikidata&amp;rsquo;s SPARQL engine has enough to do without doing inferencing; my query asks for a lot, and getting it to run in under 60 seconds to avoid a timeout took some rearrangement of triple patterns here and there to make it more efficient. I was surprised that I got away with including an OPTIONAL graph pattern and still kept everything under 60 seconds.&lt;/p&gt;
&lt;p&gt;The use of UNION also helped retrieve the albums despite their different relationships to the data model. You&amp;rsquo;ll see that I UNIONed a third expression in there, which brings me to a key aspect of the Wikidata data model that queries must deal with: &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer#Statements&#34;&gt;statements&lt;/a&gt;. Instead of having a triple saying that the work is an album, certain albums have triples saying that there are statements claiming that they are albums. (I&amp;rsquo;m not 100% sure about my wording describing the role of statements here and I&amp;rsquo;m open to correction.) This gives the query a bit more indirection to follow. Because Wikidata may have multiple statements about a topic, a query can request the highest ranked of these: we want the one that is an instance of  &lt;code&gt;wikibase:BestRank&lt;/code&gt;.&lt;/p&gt;
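&lt;p&gt;As a quick sketch of the difference, here are the direct form and the statement form of &amp;ldquo;this work is an album&amp;rdquo; as triple patterns, using the standard Wikidata prefixes:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# the direct claim:
?album wdt:P31 wd:Q482994 .             # instance of album

# the same claim through a statement node:
?album p:P31 ?statement .               # album has an instance-of statement
?statement ps:P31 wd:Q482994 ;          # ...whose value is album
           rdf:type wikibase:BestRank . # ...and which is highest ranked
&lt;/code&gt;&lt;/pre&gt;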
&lt;p&gt;Whether you&amp;rsquo;re modeling rock and roll bands or commodity prices, the structure of these statements and availability of classes such as &lt;code&gt;wikibase:BestRank&lt;/code&gt; will play a role in your programmatic access to Wikidata data. Removing the levels of indirection added by these statements will be typical of any mapping of Wikidata data to simpler models such as Schema.org. My query for band data also references Wikidata statements in order to request information about each album&amp;rsquo;s release date and each member&amp;rsquo;s role within the band—for example, that &lt;a href=&#34;https://www.wikidata.org/wiki/Q189599&#34;&gt;Keith Richards&lt;/a&gt; has the role &amp;ldquo;lead guitarist&amp;rdquo; with the Rolling Stones. (I would not rank this statement&amp;rsquo;s claim very highly; when Richards was paired with Brian Jones originally and with Ron Wood since 1976, the lack of clear lead and rhythm guitar roles was always an important part of the band&amp;rsquo;s sound, and when paired with Mick Taylor, Taylor was the lead guitarist.) Wikidata had minimal data about rock and roll band member roles, so I gingerly put the request in the OPTIONAL graph pattern mentioned above.&lt;/p&gt;
&lt;p&gt;Here is the query. Note the use of comments to explain the meaning of each cryptic Wikidata prefixed name for easier readability.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# rockAndRollBandData.rq: retrieve personnel and album data about
# bands with a genre of rock and roll from Wikidata and output triples
# that use the schema.org model.

# From the command line (but executed on a single line): 

# curl --data-urlencode &amp;#34;query@rockAndRollBandData.rq&amp;#34;  
#   -H &amp;#34;Accept: text/turtle&amp;#34; 
#   https://query.wikidata.org/bigdata/namespace/wdq/sparql

PREFIX schema: &amp;lt;http://schema.org/&amp;gt; 
PREFIX wd:     &amp;lt;http://www.wikidata.org/entity/&amp;gt;
PREFIX wdt:    &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;
PREFIX rdfs:   &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

CONSTRUCT {
  
   ?band   a schema:MusicGroup ;
           schema:name ?bandName ; 
           schema:musicGroupMember ?member ;
           schema:albums ?album .   

   ?album  a schema:MusicAlbum ;
           schema:name ?albumTitle ;
           schema:datePublished ?releaseDate . 

   ?member schema:name ?memberName ;
           schema:roleName ?roleName .  

}
WHERE {
   ?band wdt:P136 wd:Q11399 ;            # band has genre of rock and roll
         rdfs:label ?bandName ;
         wdt:P527 ?member  .             # band has-part ?member
   FILTER ( lang(?bandName) = &amp;#34;en&amp;#34; )

   ?member rdfs:label ?memberName .
   FILTER ( lang(?memberName) = &amp;#34;en&amp;#34; )
   OPTIONAL {                                     # Member&amp;#39;s role. 
      ?member p:P361 ?roleStatement .             # part-of role statement.
      ?roleStatement rdf:type wikibase:BestRank ; # The best role statement!
                     pq:P2868 ?role .             # subject-has-role ?role.
      ?role rdfs:label ?roleName . 
      FILTER ( lang(?roleName) = &amp;#34;en&amp;#34; )
   }

   { ?album wdt:P31 wd:Q482994 . }       # instance of album (wd:Q482994)
   UNION
   { ?album wdt:P31 ?albumSubclass .     # or a subclass of that such as
     ?albumSubclass wdt:P279 wd:Q482994 . # live or compilation album
   }
   UNION 
   { ?album wdt:P31 ?albumSubclass .
     ?albumSubclass p:P279 ?albumClassStatement .   # subclass of
     ?albumClassStatement ps:P279 wd:Q482994 ;
                          rdf:type wikibase:BestRank . 
   }

   ?album wdt:P175 ?band ;                      # has performer
          rdfs:label ?albumTitle ;
          p:P577 ?releaseDateStatement .        # publication date   
  
   FILTER ( lang(?albumTitle) = &amp;#34;en&amp;#34; )

   ?releaseDateStatement ps:P577 ?releaseDate ; # release date as ISO 8601
          rdf:type wikibase:BestRank .          # Only the best!

}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I would provide a link to the results, but you can run it yourself with the &lt;a href=&#34;https://curl.haxx.se/&#34;&gt;curl&lt;/a&gt; command shown in the query&amp;rsquo;s header if you store the query in a file called &lt;code&gt;rockAndRollBandData.rq&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Once I had the Schema.org version it was fun to query that with queries that were much simpler than what would have been necessary with Wikidata. For example, the following asks this extracted data who has been a member of more than one band and what the bands were:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX schema: &amp;lt;http://schema.org/&amp;gt; 
PREFIX rdfs:   &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT ?member ?group1 ?group2  WHERE {
  
  ?groupURI1 a schema:MusicGroup ;
             schema:name ?group1 ;
             schema:musicGroupMember ?memberURI . 

  ?groupURI2 a schema:MusicGroup ;
             schema:name ?group2 ;
             schema:musicGroupMember ?memberURI . 
  
  ?memberURI schema:name ?member .

  FILTER(?groupURI1 != ?groupURI2)
  
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It would make an interesting class project to retrieve a larger, more complex set of data from Wikidata and then map it to a model such as Schema.org. The coordination of the participants&amp;rsquo; activity (and triples) would be good work experience for everyone involved, and the project could result in something valuable to a particular domain&amp;rsquo;s community. This could include the development of procedures for the updating of their locally stored version as Wikidata evolves, as well as for updates to the source Wikidata data itself when there are gaps for that domain. (Again, coming soon: more on that latter issue!)&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re doing something like this on your own or with a group, let me know. I&amp;rsquo;d love to hear about it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">sparql</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">wikidata</category>
      
      <category domain="https://www.bobdc.com//categories/schema.org">schema.org</category>
      
    </item>
    
    <item>
      <title>One-click replacement of an IMDb page with the corresponding Wikipedia page</title>
      <link>https://www.bobdc.com/blog/imdb2wp/</link>
      <pubDate>Sun, 19 Jan 2020 11:03:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/imdb2wp/</guid>
      
      
      <description><div>With some Python, JavaScript, and of course, SPARQL.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/imdb2wp.png&#34; border=&#34;0&#34; width=&#34;400&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; /&gt;
&lt;p&gt;I &lt;a href=&#34;https://twitter.com/bobdc/status/1203428231484981248&#34;&gt;recently tweeted&lt;/a&gt; &amp;ldquo;I find that @imdb  is so crowded with ads that’s it’s easier to use Wikipedia to look up movies and actors and directors and their careers. And then there’s that Wikidata SPARQL endpoint!&amp;rdquo; Instead of just &lt;a href=&#34;https://quoteinvestigator.com/2017/03/19/candle/&#34;&gt;cursing the darkness&lt;/a&gt;, I decided to light a little SPARQL-Python-JavaScript candle, and it was remarkably easy.&lt;/p&gt;
&lt;p&gt;Drag this bookmarklet link to your browser&amp;rsquo;s bookmarks bar:  &lt;a href=&#39;javascript:function imdb2wp(currentURL) {newURL = currentURL.replace(/.+imdb.com\/.*?\/(.+?)\/.*/,&#34;http://learningsparql.com/cgi/imdb2wp.cgi?imdbID=$1&#34;); window.location.href = newURL; }; imdb2wp(location.href)&#39;&gt;imdb2wp&lt;/a&gt;. Then, when you&amp;rsquo;re looking at the IMDb page of a person, movie, or television show, the link should take you right to the Wikipedia page for that entity.&lt;/p&gt;
&lt;p&gt;The key to it all is the impressive number of non-Wikidata identifiers that Wikidata has been adding. If you look at the IMDb page of, for example, the movie &lt;a href=&#34;https://www.imdb.com/title/tt0064652/&#34;&gt;Medium Cool&lt;/a&gt;, in its URL of &lt;code&gt;https://www.imdb.com/title/tt0064652/&lt;/code&gt; you&amp;rsquo;ll see the movie&amp;rsquo;s IMDb identifier tt0064652. If you look at the movie&amp;rsquo;s &lt;a href=&#34;https://www.wikidata.org/wiki/Q1284125&#34;&gt;Wikidata page&lt;/a&gt;, you&amp;rsquo;ll see that IMDb ID stored there. You won&amp;rsquo;t see the URL of its English Wikipedia page, but that&amp;rsquo;s easy enough to look up with the IMDb ID in the following SPARQL query:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?wppage WHERE {
   ?subject wdt:P345 &amp;#39;tt0064652&amp;#39; .
   ?wppage schema:about ?subject .
   FILTER(contains(str(?wppage),&amp;#39;//en.wikipedia&amp;#39;))
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3Fwppage%20WHERE%20%7B%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%3Fsubject%20wdt%3AP345%20%27tt0064652%27%20.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%20%20%3Fwppage%20schema%3Aabout%20%3Fsubject%20.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%20%20FILTER%28contains%28str%28%3Fwppage%29%2C%27%2F%2Fen.wikipedia%27%29%29%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%7D&#34;&gt;Try it yourself.&lt;/a&gt; (Of course, a different filter condition can tell the query to find the corresponding Wikipedia page in language other than English.)&lt;/p&gt;
&lt;p&gt;How does the click on the browser&amp;rsquo;s bookmark bar execute the SPARQL query with the appropriate IMDb ID? Last August in &lt;a href=&#34;http://www.bobdc.com/blog/htmlform/&#34;&gt;Custom HTML form front end, SPARQL endpoint back end&lt;/a&gt; I wrote about an application in which the end user enters the name of a cocktail ingredient, clicks the search button, and then (after a SPARQL query asks Wikidata for drinks that have that ingredient) that user sees a web page displaying those drinks with links to their Wikipedia pages. This new script, in Python this time, is also a CGI script. It accepts a parameter, plugs that parameter into a SPARQL query, sends the query off to the Wikidata endpoint, and then uses the result to give users what they want:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#!/usr/bin/env python&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# imdb2wp.cgi:go to Wikipedia page for a movie or &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# person based on their IMDB ID value. Sample call:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# http://learningsparql.com/cgi/imdb2wp.cgi?imdbID=nm0000598&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; sys
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Following needed for hosted version to find SPARQLWrapper library&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;sys&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;path&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;append(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;/home/bobdc/lib/python/&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; SPARQLWrapper &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; SPARQLWrapper, JSON
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; cgi
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;form &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; cgi&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;FieldStorage() 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;imdbID &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; form&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;getvalue(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;imdbID&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;sparql &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; SPARQLWrapper(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;https://query.wikidata.org/sparql&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# SPARQL query of Wikidata asks for the Wikipedia &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# page of whatever has this IMDB ID.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;queryString &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;SELECT ?wppage WHERE {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;?subject wdt:P345 &amp;#39;IMDB-ID&amp;#39; . 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  ?wppage schema:about ?subject .
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  FILTER(contains(str(?wppage),&amp;#39;//en.wikipedia&amp;#39;))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;queryString &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; queryString&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;replace(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;IMDB-ID&amp;#34;&lt;/span&gt;,imdbID)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;sparql&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;setQuery(queryString)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;sparql&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;setReturnFormat(JSON)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;try&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  results &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; sparql&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;query()&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;convert()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  requestGood &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;except&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;Exception&lt;/span&gt;, e:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  results &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; str(e)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  requestGood &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;False&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Content-type: text/html&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\n\n&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; requestGood &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;False&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  print &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;h1&amp;gt;Problem communicating with the server&amp;lt;/h1&amp;gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  print &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;p&amp;gt;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; results &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;/p&amp;gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;elif&lt;/span&gt; (len(results[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;results&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;bindings&amp;#34;&lt;/span&gt;]) &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  print &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;p&amp;gt;No results found.&amp;lt;/p&amp;gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;else&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; result &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; results[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;results&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;bindings&amp;#34;&lt;/span&gt;]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    wppage &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; result[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;wppage&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;value&amp;#34;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print (&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;meta http-equiv=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;Refresh&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt; content=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;0;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; wppage &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34; &lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note how short the script is even with its comments, white space, and error handling. As its header comment tells us, the script is called as a web service. If you replace the sample call to it shown there with the Medium Cool ID of tt0064652, you&amp;rsquo;ll get the URL &lt;a href=&#34;http://learningsparql.com/cgi/imdb2wp.cgi?imdbID=tt0064652&#34;&gt;http://learningsparql.com/cgi/imdb2wp.cgi?imdbID=tt0064652&lt;/a&gt;, which, as you can see by clicking it, calls the script that sends you to the movie&amp;rsquo;s Wikipedia page. The script stores the passed value in an &lt;code&gt;imdbID&lt;/code&gt; variable and then inserts it into a query that looks just like the one hard-coded for &amp;ldquo;Medium Cool&amp;rdquo; above. Then, the script sends the query off to the Wikidata SPARQL endpoint.&lt;/p&gt;
&lt;p&gt;At a similar point in the Perl script that lists which cocktails have the entered ingredients, the script displays some HTML showing the results. The imdb2wp script does not render a page with results but instead sends back a &lt;a href=&#34;https://en.wikipedia.org/wiki/Meta_refresh&#34;&gt;meta refresh page&lt;/a&gt;. (I only recently learned that that was the actual name for these, and it is an excellent name.) This just sends the user to the Wikipedia page found by the SPARQL query.&lt;/p&gt;
&lt;p&gt;How does the single click call the CGI script? The &lt;a href=&#34;https://en.wikipedia.org/wiki/Bookmarklet&#34;&gt;bookmarklet&amp;rsquo;s&lt;/a&gt; URL is actually a bit of JavaScript that pulls the IMDb ID from the displayed page&amp;rsquo;s URL, appends it to &lt;a href=&#34;http://learningsparql.com/cgi/imdb2wp.cgi?imdbID=&#34;&gt;http://learningsparql.com/cgi/imdb2wp.cgi?imdbID=&lt;/a&gt;, and sends the browser off to the result. So to review:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When viewing an IMDb page, you click the bookmarklet.&lt;/li&gt;
&lt;li&gt;JavaScript in the bookmarklet calls the CGI script with the IMDb ID.&lt;/li&gt;
&lt;li&gt;The CGI script plugs the IMDb ID into a SPARQL query and uses that query to ask Wikidata for the entity&amp;rsquo;s Wikipedia URL.&lt;/li&gt;
&lt;li&gt;The CGI script redirects you to that URL.&lt;/li&gt;
&lt;/ol&gt;
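The bookmarklet itself is a snippet of JavaScript, but the ID extraction it performs is simple string work. A rough Python equivalent, just for illustration (the regex and function name here are mine, not taken from the actual bookmarklet), might look like this:

```python
import re

def extract_imdb_id(url):
    """Pull an IMDb title (tt...) or name (nm...) ID out of a page URL.

    Returns None if the URL does not look like an IMDb title or name page.
    """
    match = re.search(r"/(?:title|name)/((?:tt|nm)\d+)", url)
    return match.group(1) if match else None

# The bookmarklet appends the extracted ID to the CGI script's URL:
base = "http://learningsparql.com/cgi/imdb2wp.cgi?imdbID="
imdb_id = extract_imdb_id("https://www.imdb.com/title/tt0064652/")
redirect_url = base + imdb_id if imdb_id else None
```

In the real bookmarklet the same extraction and concatenation happen in the browser, and assigning the result to `window.location` is what sends you on your way.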
&lt;p&gt;SPARQL is just a query language—a syntax for describing what to do with a certain kind of data. The real value is in the data that we can query with SPARQL, and Wikidata is becoming more and more valuable. I found it surprisingly easy to use some otherwise old-fashioned (and standardized!) technologies to go from complaining about IMDb to actually doing something about the annoyance.&lt;/p&gt;
&lt;p&gt;This is just a taste of the many possibilities we’ll see from Wikidata&amp;rsquo;s storage of so many standard identifiers for real-world entities. Whatever domain you work in or want to work in, take a look at what kind of identifiers and other data Wikidata stores about that domain&amp;rsquo;s entities and you may very well be inspired to do something no one else has done in that domain by using SPARQL and scripting to mix and match that data with other data in that domain. Let me know if you do!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2020">2020</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">sparql</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">wikidata</category>
      
    </item>
    
    <item>
      <title>Ancient Mesopotamian metadata</title>
      <link>https://www.bobdc.com/blog/firstmetadata/</link>
      <pubDate>Sun, 29 Dec 2019 10:56:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/firstmetadata/</guid>
      
      
      <description><div>4,000 years old!</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/tabletBasketLabels.jpg&#34; border=&#34;0&#34; width=&#34;400&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Cuneiform tablet basket labels&#34;/&gt;
&lt;p&gt;In an &lt;a href=&#34;https://www.newyorker.com/magazine/2019/10/14/can-a-machine-learn-to-write-for-the-new-yorker&#34;&gt;October 14th article&lt;/a&gt; in the New Yorker about the use of Artificial Intelligence to generate prose, John Seabrook wrote: &amp;ldquo;A recent exhibition on the written word at the British Library dates the emergence of cuneiform writing to the fourth millennium B.C.E., in Mesopotamia&amp;rdquo;. That got me thinking about some notes I once took on the early history of metadata, and I wondered if there was any scholarship to show that the earliest metadata is as old as the earliest writing. Not quite, but cuneiform tablets of metadata from the early second millennium B.C.E. are still some pretty old metadata.&lt;/p&gt;
&lt;p&gt;First, how do I define &amp;ldquo;metadata&amp;rdquo;? The classic definition &amp;ldquo;data about data&amp;rdquo; is a bit vague; a movie review is data about data, but it&amp;rsquo;s not metadata. I would define metadata as data—ideally, structured data—recorded to aid in the navigation of other data. I was going to say &amp;ldquo;navigation and retrieval and maintenance&amp;rdquo;, but you can&amp;rsquo;t efficiently retrieve or maintain data that you have difficulty finding, so it all builds from navigation. As a working definition I think this covers most uses of metadata.&lt;/p&gt;
&lt;p&gt;I followed a footnote from the 2000 book &lt;a href=&#34;https://www.amazon.com/Great-Libraries-Antiquity-Renaissance/dp/1584560185&#34;&gt;The Great Libraries: From Antiquity to the Renaissance&lt;/a&gt; to the article &lt;a href=&#34;https://books.google.com/books/about/Archive_and_Library_Technique_in_Ancient.html?id=o5ENIgAACAAJ&#34;&gt;Archive and Library Technique in Ancient Mesopotamia&lt;/a&gt; published by Danish researcher Mogens Weitemeyer in the International Library Review journal &lt;a href=&#34;https://www.worldcat.org/title/libri-international-library-review/oclc/50370733&#34;&gt;Libri&lt;/a&gt; in 1956. The article&amp;rsquo;s main point is to explore the idea of a &amp;ldquo;library&amp;rdquo; as opposed to an &amp;ldquo;archive&amp;rdquo; as these terms may apply to a particular archaeological site. To describe one particular set of cuneiform tablets that led to a library vs. archive debate among scholars, Weitemeyer wrote&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Some small tablets from the III Dynasty of Ur (a few somewhat older) found in Lagash, Umma (Djoha), Puzurish-Dagan (Drehem), and Ur tell us about the way in which the archive tablets were stored. At the left edge of the small tablets there are two holes comparatively near each other. From one hole to the other extended a strand of reed (thin like bast), the impression of which is still clearly visible in the clay (Fig. 4b). By means of this reed-strand the small tablet was fastened to a container of tablets. This appears from the first line of the small tablet, which reads, in Sumerian, gá-dub-ba (dub=tablet, gá=container), i.e. tablet container. Hence, the small tablets were no doubt labels, attached to the receptacles and indicating their contents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&amp;ldquo;Figure 4b&amp;rdquo; refers to the label tablet on the right in the picture above. Weitemeyer went on to point out how you can see the pattern from the basketwork in the tablet on the left of the picture. He also went on to say&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The labels first stated that the receptacle was a tablet basket; then followed information about the contents of the tablets, e.g. legal verdicts, accounts, receipts and expenses. At the end was an indication of the period covered; in most cases the period was one year, in some cases the beginning year (or month) and the finishing year (or month) were indicated.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Each small tablet had information about a larger dataset (the content of the container it was attached to) to help people determine whether the information they needed was in that container. Not only is this clearly metadata, but with the apparently regular practice of indicating the period covered by the referenced data at the end of the small tablet&amp;rsquo;s description, this metadata even has some structure to it. Recording the date range covered by a set of described data has continued to be a pretty classic piece of metadata, and with the &lt;a href=&#34;https://en.wikipedia.org/wiki/Third_Dynasty_of_Ur&#34;&gt;Third Dynasty of Ur&lt;/a&gt; being 4,000 years ago, that&amp;rsquo;s some pretty old structured metadata.&lt;/p&gt;
&lt;p&gt;I have been researching the history of metadata on and off for a few years and may write up some more of what I found in future blog entries. (The next stop would be Mycenaean Greece.) It has been fun to find that the idea of metadata, which we consider to be so modern today, has actually been around for literally thousands of years.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Avoiding accidental cross products in SPARQL queries</title>
      <link>https://www.bobdc.com/blog/crossproducts/</link>
      <pubDate>Sun, 17 Nov 2019 09:30:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/crossproducts/</guid>
      
      
      <description><div>Because one can sneak into your query when you didn&#39;t want  it.</div><div>&lt;blockquote class=&#34;pullquote&#34;&gt;Check the variables in your triple patterns that are connecting up sets of triples with other sets. They may not be doing a good job of it. &lt;/blockquote&gt;
&lt;p&gt;Have you ever written a SPARQL query that returned a suspiciously large number of results, especially with too many combinations of values? You may have accidentally requested a cross product. I have spent too much time debugging queries where this turned out to be the problem, so I wanted to talk about avoiding it.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s look at a simple example. The following RDF triples show the names of three people and the departments where they work:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;@prefix d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:emp1 d:name &amp;#34;jane&amp;#34; .
d:emp2 d:name &amp;#34;joe&amp;#34; .
d:emp3 d:name &amp;#34;pat&amp;#34; .

d:emp1 d:dept &amp;#34;shipping&amp;#34; .
d:emp2 d:dept &amp;#34;receiving&amp;#34; .
d:emp3 d:dept &amp;#34;accounting&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The following SPARQL query attempts to list each person and their department:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; 

SELECT ?name ?dept WHERE {
  ?employee d:name ?name .
  ?emp d:dept ?dept .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result of this query somehow shows that all the employees work in all the departments:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-------------------------
| name   | dept         |
=========================
| &amp;#34;pat&amp;#34;  | &amp;#34;shipping&amp;#34;   |
| &amp;#34;pat&amp;#34;  | &amp;#34;receiving&amp;#34;  |
| &amp;#34;pat&amp;#34;  | &amp;#34;accounting&amp;#34; |
| &amp;#34;jane&amp;#34; | &amp;#34;shipping&amp;#34;   |
| &amp;#34;jane&amp;#34; | &amp;#34;receiving&amp;#34;  |
| &amp;#34;jane&amp;#34; | &amp;#34;accounting&amp;#34; |
| &amp;#34;joe&amp;#34;  | &amp;#34;shipping&amp;#34;   |
| &amp;#34;joe&amp;#34;  | &amp;#34;receiving&amp;#34;  |
| &amp;#34;joe&amp;#34;  | &amp;#34;accounting&amp;#34; |
-------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Why? Experienced SPARQL users probably already saw the problem: the query&amp;rsquo;s first triple pattern says &amp;ldquo;find any triples where the predicate is &lt;code&gt;d:name&lt;/code&gt; and store the subject in &lt;code&gt;?employee&lt;/code&gt; and the object in &lt;code&gt;?name&lt;/code&gt;&amp;rdquo;. The second triple pattern should ask for the department of any employee that we found in the first triple pattern (&lt;code&gt;?employee&lt;/code&gt;). Instead, it&amp;rsquo;s just asking for all triples with &lt;code&gt;d:dept&lt;/code&gt; as the predicate and binding the subject and object to the &lt;code&gt;?emp&lt;/code&gt; and &lt;code&gt;?dept&lt;/code&gt; variables, which have nothing to do with the first triple pattern. If the second triple pattern had used the variable name &lt;code&gt;?employee&lt;/code&gt; instead of &lt;code&gt;?emp&lt;/code&gt;, the query would have asked for resources that matched both triple patterns, and would have given this result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-------------------------
| name   | dept         |
=========================
| &amp;#34;pat&amp;#34;  | &amp;#34;accounting&amp;#34; |
| &amp;#34;jane&amp;#34; | &amp;#34;shipping&amp;#34;   |
| &amp;#34;joe&amp;#34;  | &amp;#34;receiving&amp;#34;  |
-------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I got three times as many results as I wanted because I created the new variable name &lt;code&gt;?emp&lt;/code&gt; when I should have re-used the existing &lt;code&gt;?employee&lt;/code&gt;. Avoiding such variable name sloppiness is why some programming languages force you to declare variables. It&amp;rsquo;s also why others that don&amp;rsquo;t, such as &lt;a href=&#34;https://www.typescriptlang.org/&#34;&gt;JavaScript&lt;/a&gt; and &lt;a href=&#34;https://perldoc.perl.org/strict.html&#34;&gt;Perl&lt;/a&gt;, offer optional add-ins that force this extra bit of housekeeping.&lt;/p&gt;
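The effect of sharing (or not sharing) a variable between the two triple patterns can be mimicked in a few lines of Python. The dictionaries below stand in for the employee triples above, and the join logic is a deliberately simplified sketch of the idea, not how a real SPARQL engine is implemented:

```python
# Toy illustration of why a shared variable turns a cross product into a join.
names = {"d:emp1": "jane", "d:emp2": "joe", "d:emp3": "pat"}
depts = {"d:emp1": "shipping", "d:emp2": "receiving", "d:emp3": "accounting"}

# Disjoint variables (?employee vs. ?emp): every name pairs with every dept.
cross_product = [(n, d) for n in names.values() for d in depts.values()]

# Shared variable (?employee in both patterns): pair only matching subjects.
join = [(names[s], depts[s]) for s in names if s in depts]

print(len(cross_product))  # 9
print(len(join))           # 3
```

Three names times three departments gives nine rows; requiring the same subject in both patterns gives the three rows we actually wanted.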
&lt;p&gt;When the Franz &lt;a href=&#34;https://allegrograph.com/&#34;&gt;AllegroGraph&lt;/a&gt; triplestore sees a cross product, it can issue a &lt;a href=&#34;https://franz.com/agraph/support/documentation/current/sparql-reference.html#query-warnings&#34;&gt;query warning&lt;/a&gt; called &lt;code&gt;warn-bgp-cross-product&lt;/code&gt;; I&amp;rsquo;ll bet that has saved their developers a lot of wasted time. The documentation for this warning has a nice summary of what causes cross products: &amp;ldquo;there are patterns in the query that have disjoint sets of variables which will cause the SPARQL engine to find all possible matches between the sets which can lead to very large solution sets&amp;rdquo;. (Some &lt;a href=&#34;https://www.cs.colostate.edu/~cs430dl/yr2016su/more_examples/Ch2/Relational%20algebra%20-%20cross%20product%20and%20natural%20join.pdf&#34;&gt;pdf&lt;/a&gt; class notes for a Colorado State University database class show how this works with relational databases.)&lt;/p&gt;
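The core of that "disjoint sets of variables" check can be sketched in Python: if two triple patterns share no variables, their solutions can only combine as a cross product. The pattern representation and function below are my own simplification for illustration, not AllegroGraph's implementation:

```python
# Sketch of the kind of check behind a cross-product warning: two triple
# patterns with no variable in common can only combine as a cross product.
def shares_variables(pattern_a, pattern_b):
    """Report whether two triple patterns have at least one variable in common."""
    vars_a = {term for term in pattern_a if term.startswith("?")}
    vars_b = {term for term in pattern_b if term.startswith("?")}
    return bool(vars_a & vars_b)

p1 = ("?employee", "d:name", "?name")
p2 = ("?emp", "d:dept", "?dept")       # disjoint from p1: cross product likely
p3 = ("?employee", "d:dept", "?dept")  # shares ?employee with p1: a join

print(shares_variables(p1, p2))  # False
print(shares_variables(p1, p3))  # True
```

A real engine has to consider whole groups of patterns connected through chains of shared variables, but the pairwise check captures the warning's basic idea.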
&lt;p&gt;In my example cross product above, note that the offending variable names are not mentioned in the SELECT statement and therefore are not in the results. I have found that this can add plenty to the time it takes to identify a cross product as the source of a problem, because these mismatched variables are like cogs that are not meshing together correctly deep inside a machine where you can&amp;rsquo;t see them very well. This is especially true in a larger, more complex query; my query above is a small toy example to make the problem as clear as possible.&lt;/p&gt;
&lt;p&gt;One larger, more complex query where this happened was the second SPARQL query in my &lt;a href=&#34;http://www.bobdc.com/blog/docembeddings/&#34;&gt;Document analysis with machine learning&lt;/a&gt; blog entry last month. Not only did it cost me extra hours of work; the results were so bloated that &lt;a href=&#34;https://jena.apache.org/documentation/query/index.html&#34;&gt;arq&lt;/a&gt; was running out of memory, so I started doing the query in &lt;a href=&#34;http://bobdc.com/blog/trying-out-blazegraph/&#34;&gt;Blazegraph&lt;/a&gt; instead. When I noticed the same cosine similarity figure coming up with dozens of recipe pairings, this was the first warning that I had a cross product problem, just like with the repetitive patterns of all employees working in all departments above. I had no problem running the query with arq once I found the mismatched variable names and straightened out the cross product problem.&lt;/p&gt;
&lt;p&gt;So, if you see such repetition and get suspicious, check the variables in your triple patterns that are connecting up sets of triples with other sets. They may not be doing a good job of it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Document analysis with machine learning</title>
      <link>https://www.bobdc.com/blog/docembeddings/</link>
      <pubDate>Sun, 27 Oct 2019 11:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/docembeddings/</guid>
      
      
      <description><div>Cookbook recipes!</div><div>&lt;blockquote class=&#34;pullquote&#34;&gt;For people doing digital humanities work, the possibilities in the document embeddings corner of the machine learning world look especially promising.&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve been thinking about which machine learning tools can contribute the most to the field of &lt;a href=&#34;https://en.wikipedia.org/wiki/Digital_humanities&#34;&gt;digital humanities&lt;/a&gt;, and an obvious candidate is document embeddings. I&amp;rsquo;ll describe what these are below but I&amp;rsquo;ll start with the fun part: after using some document embedding Python scripts to compare the roughly 560  &lt;a href=&#34;https://en.wikibooks.org/wiki/Category:Recipes&#34;&gt;Wikibooks recipes&lt;/a&gt; to each other, I created an &lt;a href=&#34;http://www.bobdc.com/miscfiles/similarRecipes.html&#34;&gt;If you liked&amp;hellip;&lt;/a&gt; web page that shows, for each recipe, what other recipes were calculated to be most similar to that recipe.&lt;/p&gt;
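Once the pairwise similarity scores exist, building the "If you liked..." listing is a small sorting job. Here is a Python sketch with made-up scores and recipe names; in the real page the scores came from comparing the recipes' document embedding vectors:

```python
# Build an "If you liked..." listing from pairwise similarity scores.
# The scores and recipe names below are invented for illustration.
similarities = {
    ("Pancakes", "Waffles"): 0.91,
    ("Pancakes", "Goulash"): 0.22,
    ("Waffles", "Goulash"): 0.25,
}

def most_similar(recipe, scores, k=2):
    """Return up to k other recipes, most similar first."""
    pairs = []
    for (a, b), score in scores.items():
        if a == recipe:
            pairs.append((score, b))
        elif b == recipe:
            pairs.append((score, a))
    return [name for score, name in sorted(pairs, reverse=True)[:k]]

print(most_similar("Pancakes", similarities))  # ['Waffles', 'Goulash']
```

Running this for every recipe, with the top few matches kept for each, gives the structure of the generated web page.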
&lt;p&gt;In &lt;a href=&#34;http://www.bobdc.com/blog/semantic-web-semantics-vs-vect/&#34;&gt;Semantic web semantics vs. vector embedding machine learning semantics&lt;/a&gt; I wrote about how neural networks can assign vectors of values to words based on the relationships among words in a given text corpus. Once these word vectors are &amp;ldquo;embedded&amp;rdquo; in a common vector space, relationships between those vectors can reflect the semantics of the words. The classic examples are asking a system that has done this for a decent-sized corpus of English text &amp;ldquo;king is to queen as man is to what&amp;rdquo; or &amp;ldquo;London is to England as Berlin is to what&amp;rdquo;. By comparing the calculated vectors, it&amp;rsquo;s relatively easy for a system to answer &amp;ldquo;woman&amp;rdquo; to the first question and &amp;ldquo;Germany&amp;rdquo; to the second.&lt;/p&gt;
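The analogy arithmetic can be demonstrated with toy vectors. The four-dimensional vectors below are invented for the example (real word embeddings have hundreds of dimensions learned from a corpus), but the vector math is the same:

```python
import numpy as np

# Invented toy vectors for the "king is to queen as man is to ?" arithmetic.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.7]),
    "woman": np.array([0.1, 0.1, 0.8, 0.7]),
    "apple": np.array([0.5, 0.5, 0.5, 0.1]),
}

def cosine(a, b):
    """Cosine similarity of two vectors: 1.0 means pointing the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - queen ~= man - ?, so find the vector closest to man - king + queen.
target = vectors["man"] - vectors["king"] + vectors["queen"]
answer = max((w for w in vectors if w not in ("man", "king", "queen")),
             key=lambda w: cosine(vectors[w], target))
print(answer)  # woman
```

With real embeddings the arithmetic works the same way; the system answers "woman" or "Germany" because the offsets between related word vectors end up roughly parallel.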
&lt;p&gt;That post also mentioned how we can assign vectors to other things besides words. Plenty of code is available to generate and work with document embeddings, so I tried this with the &lt;a href=&#34;https://github.com/zalandoresearch/flair&#34;&gt;flair&lt;/a&gt; Python NLP framework available on GitHub. For an introduction to flair, I recommend the GitHub page&amp;rsquo;s tutorial and the article &lt;a href=&#34;https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f&#34;&gt;Text Classification with State of the Art NLP Library — Flair&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To generate document embedding vectors for the Wikibooks recipes and then compare them all with each other, I based my demo script below on the &lt;a href=&#34;https://github.com/swapnilg915/cosine_similarity_using_embeddings/blob/master/flair_embeddings.py&#34;&gt;flair&lt;/a&gt; example at the &lt;a href=&#34;https://github.com/swapnilg915/cosine_similarity_using_embeddings&#34;&gt;cosine_similarity_using_embeddings&lt;/a&gt; git repo. My demo shown here does just a few recipes, for reasons explained further down, and outputs RDF about the similarity scores it calculated so that I could perform SPARQL queries about those similarities.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#!/usr/bin/env python     &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Read Wikibook recipes, calculate document vectors for each, calculate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# all cosine similarity pairings, and output RDF about the result. Recipes&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# were downloaded from https://en.wikibooks.org/wiki/Category:Recipes, tags&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# were stripped, and then &amp;lt;title&amp;gt;&amp;lt;/title&amp;gt; and &amp;lt;url&amp;gt;&amp;lt;/url&amp;gt; tags added to each.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; glob
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; re
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; pickle
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; flair.embeddings &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; Sentence, StackedEmbeddings, FlairEmbeddings,WordEmbeddings
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; time
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; numpy &lt;span style=&#34;color:#66d9ef&#34;&gt;as&lt;/span&gt; np
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; regex &lt;span style=&#34;color:#66d9ef&#34;&gt;as&lt;/span&gt; re
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; nltk.corpus &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; stopwords
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; nltk.tokenize &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; word_tokenize
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; nltk.stem &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; WordNetLemmatizer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Most of this code is based on&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# https://github.com/swapnilg915/cosine_similarity_using_embeddings/blob/master/flair_embeddings.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# initialize embeddings&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;glove_embedding &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; WordEmbeddings(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;glove&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;flair_embedding_forward &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; FlairEmbeddings(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;news-forward&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;flair_embedding_backward &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; FlairEmbeddings(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;news-backward&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;class&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;FlairEmbeddings&lt;/span&gt;(object):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; __init__(self):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            self&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;stop_words &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; list(stopwords&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;words(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;english&amp;#39;&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            self&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;lemmatizer &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; WordNetLemmatizer()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            self&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;stacked_embeddings &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; StackedEmbeddings(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                    embeddings&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;[flair_embedding_forward, flair_embedding_backward])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;word_token&lt;/span&gt;(self, tokens, lemma&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;False&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            tokens &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; str(tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            tokens &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;sub(&lt;span style=&#34;color:#e6db74&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;([\w].)([\~\!\@\#\$\%\^\&amp;amp;\*\(\)\-\+\[\]\{\}\/&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;\&amp;#39;\:\;])([\s\w].)&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;1 &lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;2 &lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;3&amp;#34;&lt;/span&gt;, tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            tokens &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;sub(&lt;span style=&#34;color:#e6db74&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;\s+&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34; &amp;#34;&lt;/span&gt;, tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; lemma:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;join([self&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;lemmatizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;lemmatize(token, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;v&amp;#39;&lt;/span&gt;) &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; token &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                                     word_tokenize(tokens&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;lower()) &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; token &lt;span style=&#34;color:#f92672&#34;&gt;not&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; self&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;stop_words \
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                                     &lt;span style=&#34;color:#f92672&#34;&gt;and&lt;/span&gt; token&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;isalpha()])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;else&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;join([token &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; token &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; word_tokenize(tokens&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;lower()) &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; token &lt;span style=&#34;color:#f92672&#34;&gt;not&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                                     self&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;stop_words &lt;span style=&#34;color:#f92672&#34;&gt;and&lt;/span&gt; token&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;isalpha()])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;cos_sim&lt;/span&gt;(self, a, b):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; np&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;inner(a, b) &lt;span style=&#34;color:#f92672&#34;&gt;/&lt;/span&gt; (np&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;linalg&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;norm(a) &lt;span style=&#34;color:#f92672&#34;&gt;*&lt;/span&gt; (np&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;linalg&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;norm(b)))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;getFlairEmbedding&lt;/span&gt;(self, text):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            sentence &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; Sentence(text)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            self&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;stacked_embeddings&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;embed(sentence)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; np&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;mean([np&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;array(token&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;embedding) &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; token &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; sentence], axis&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#################&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; __name__ &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    recipeDirectory &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#75715e&#34;&gt;# For this demo, just get the recipes whose titles begin with &amp;#34;J&amp;#34;. &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    filenameArray &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; glob&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;glob(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;/home/bob/temp/wprecipes/data/g-p/Cookbook:J*&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;# start: &amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; time&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;strftime(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;%H:%M:%S&amp;#39;&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    recipeDataArray &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; []   &lt;span style=&#34;color:#75715e&#34;&gt;# each entry will be an array with the following entries so&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#75715e&#34;&gt;# that they can be referenced like this: recipeDataArray[3][recipeTitleField]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    recipeTitleField &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    urlField &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    recipeField &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    recipeEmbeddingField &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    obj &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; FlairEmbeddings()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; file &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; filenameArray:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        recipeContent &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        input &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; open(file, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;r&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; line &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; input:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; (&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;title&amp;gt;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; line): 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                title &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;sub(&lt;span style=&#34;color:#e6db74&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;^\s*&amp;lt;title&amp;gt;&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;,line)  &lt;span style=&#34;color:#75715e&#34;&gt;# Remove title tags. &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                title &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;sub(&lt;span style=&#34;color:#e6db74&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;\s*&amp;lt;/title&amp;gt;\s*&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;,title) 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            recipeContent &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; recipeContent &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; line
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; (&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;url&amp;gt;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; line): 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                url &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;sub(&lt;span style=&#34;color:#e6db74&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;^\s*&amp;lt;url&amp;gt;&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;,line)  &lt;span style=&#34;color:#75715e&#34;&gt;# Remove url tags.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                url &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;sub(&lt;span style=&#34;color:#e6db74&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;\s*&amp;lt;/url&amp;gt;\s*&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;,url) 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                &lt;span style=&#34;color:#75715e&#34;&gt;##print(file + &amp;#39;: &amp;#39; + url)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        input&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;close()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        recipeDataArray&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;append([title,url,recipeContent])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;# starting to calculate embeddings: &amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; time&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;strftime(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;%H:%M:%S&amp;#39;&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#75715e&#34;&gt;# Calculate and save embeddings&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; r &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; recipeDataArray:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        recipeEmbedding &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; obj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;getFlairEmbedding(r[recipeField])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        r&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;append(recipeEmbedding)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;# starting comparisons: &amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; time&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;strftime(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;%H:%M:%S&amp;#39;&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#75715e&#34;&gt;# output header of RDF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;@prefix d: &amp;lt;http://learningsparql.com/data#&amp;gt; .&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;@prefix m: &amp;lt;http://learningsparql.com/model#&amp;gt; .&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;@prefix dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; .&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\n&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#75715e&#34;&gt;# Find the cosine similarity of all the combinations&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    recipesToCompare &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; len(recipeDataArray)  &lt;span style=&#34;color:#75715e&#34;&gt;# or some small number for tests&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    i1 &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; i1 &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;&lt;/span&gt; recipesToCompare:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        title &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; recipeDataArray[i1][recipeTitleField]&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;replace(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#75715e&#34;&gt;# Output a triple with the recipe&amp;#39;s title. &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;lt;&amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; recipeDataArray[i1][urlField] &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;gt;  dc:title &amp;#34;&amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; title &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;#34; .&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        i2 &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; i1 &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; i2 &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;&lt;/span&gt; recipesToCompare:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#75715e&#34;&gt;# output triples like [ m:doc recipeN, recipeN+1 ; m:recipeCosineSim 0.8249611 ] &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            recipeCosineSim &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            obj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;cos_sim(recipeDataArray[i1][recipeEmbeddingField],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                        recipeDataArray[i2][recipeEmbeddingField])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;[ m:doc &amp;lt;&amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; recipeDataArray[i1][urlField] &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                  &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;gt;, &amp;lt;&amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; recipeDataArray[i2][urlField] &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                  &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;&amp;gt; ; m:recipeCosineSim &amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; str(recipeCosineSim) &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39; ] . &amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            i2 &lt;span style=&#34;color:#f92672&#34;&gt;+=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        i1 &lt;span style=&#34;color:#f92672&#34;&gt;+=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;# finished: &amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; time&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;strftime(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;%H:%M:%S&amp;#39;&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On my Dell XPS 13 9350 laptop it took about 30 seconds to calculate each embedding. For 546 recipes that works out to several hours, and my laptop was running very hot after the first half hour.  (This got &lt;a href=&#34;https://www.youtube.com/watch?v=SGyOaCXr8Lw#t=0m31s&#34;&gt;Start Me Up&lt;/a&gt; completely stuck in my head throughout the experiment.) The script above demonstrates the steps at a small scale, but to create the full &amp;ldquo;If you liked&amp;hellip;&amp;rdquo; recipe comparison page I did the following. (You can find the scripts and query results for this in &lt;a href=&#34;https://github.com/bobdc/misc/tree/master/recipeDocEmbeddings&#34;&gt;my own&lt;/a&gt; github repository.)&lt;/p&gt;
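For the record, the arithmetic behind "several hours" is simple; the 30-second figure is the rough per-recipe average from the runs above:

```python
# Rough runtime estimate: about 30 seconds per embedding, 546 recipes.
seconds_per_embedding = 30
recipe_count = 546

total_hours = seconds_per_embedding * recipe_count / 3600
print(f'{total_hours:.2f} hours')  # 4.55 hours
```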
&lt;p&gt;Instead of reading all the recipes, calculating their embeddings, and calculating their similarities in one run, I split the script above in half. The first half performed three steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Read a third of the recipes. Without the &amp;ldquo;J&amp;rdquo; in &lt;code&gt;data/g-p/Cookbook:J*&lt;/code&gt; above, that&amp;rsquo;s the middle third of the recipe collection; the other two thirds were in &lt;code&gt;data/a-f&lt;/code&gt; and &lt;code&gt;data/q-z&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Calculate embeddings for each recipe.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Store the resulting array in a &lt;a href=&#34;https://docs.python.org/2/library/pickle.html&#34;&gt;Python pickle file&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
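Step 3 is just a `pickle.dump()` call; here's a minimal sketch of the save-and-reload round trip. The filename and the single stand-in entry are made up for illustration, and in the real run the embedding is the numpy array returned by `getFlairEmbedding`:

```python
import pickle

import numpy as np

# One stand-in [title, url, recipe text, embedding] entry; the real
# embedding comes from getFlairEmbedding in the script above.
recipeDataArray = [
    ['Cookbook:Jambalaya',
     'https://en.wikibooks.org/wiki/Cookbook:Jambalaya',
     'recipe text goes here',
     np.array([0.1, 0.2, 0.3])],
]

# Save the batch so a later script can skip the slow embedding step.
with open('recipes-g-p.pkl', 'wb') as f:
    pickle.dump(recipeDataArray, f)

# The second script reads it back like this:
with open('recipes-g-p.pkl', 'rb') as f:
    reloaded = pickle.load(f)
```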
&lt;p&gt;(If I pursue this further I plan to do all of that in one batch on an AWS EC2 instance. Machine learning in the cloud is a topic you hear about often, and when you can fry an egg on your own laptop it starts to look especially appealing.)&lt;/p&gt;
&lt;p&gt;After running that first script on the three batches of recipes, my second script read the pickle files that the three runs created into one big &lt;code&gt;recipeDataArray&lt;/code&gt; array and then did the &amp;ldquo;Find the cosine similarity of all the combinations&amp;rdquo; part of the script above on that array. Even with 546 recipes, that only took two seconds. It&amp;rsquo;s nice to know that although a linear increase in the number of documents to compare means a quadratic increase in the number of comparisons to make, the calculation of each pair&amp;rsquo;s similarity is so quick that the quadratic increase is not a big deal, at least at this scale. (Some of the embedding vectors didn&amp;rsquo;t come out because, according to the error messages, the input was apparently an empty string. This resulted in about 1% of the cosine similarity figures coming out as &amp;ldquo;nan&amp;rdquo;, or Not a Number, values. If I were doing this for a paying client I would find the input that caused these problems and do something about it, but for a fun personal demo I just removed the offending vectors before moving on to the next step.)&lt;/p&gt;
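The merge-and-clean step of that second script can be sketched like this. The pickle filename pattern is made up, and the NaN check is one way to detect the bad vectors, not necessarily how the original script did it:

```python
import glob
import pickle

import numpy as np

recipeEmbeddingField = 3  # same field layout as the script above

def load_batches(pattern):
    # Read each batch's pickle file into one big array.
    merged = []
    for name in sorted(glob.glob(pattern)):
        with open(name, 'rb') as f:
            merged.extend(pickle.load(f))
    return merged

def drop_bad_embeddings(recipeDataArray):
    # Remove entries whose embedding came out as NaN values.
    return [r for r in recipeDataArray
            if not np.isnan(r[recipeEmbeddingField]).any()]

# A linear increase in documents means a quadratic increase in pairings:
n = 546
print(n * (n - 1) // 2)  # 148785 cosine similarity calculations
```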
&lt;p&gt;After this script output RDF about the recipe similarities I could then explore the results. The following excerpt from that RDF gives you the flavor of what the queries had to work with. Each comparison is a blank node connecting up information about what two documents were compared and their comparison score. The &lt;code&gt;dc:title&lt;/code&gt; triples show the actual titles of recipes:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[ m:doc 
  &amp;lt;https://en.wikibooks.org/wiki/Cookbook:Apple_Raisin_Oat_Muffins&amp;gt;,
  &amp;lt;https://en.wikibooks.org/wiki/Cookbook:Chewy_Ginger_Cookies&amp;gt; ; 
  m:recipeCosineSim 0.90684676 ] . 
  
&amp;lt;https://en.wikibooks.org/wiki/Cookbook:Apple_Raisin_Oat_Muffins&amp;gt; 
   dc:title &amp;#34;Cookbook:Apple Raisin Oat Muffins&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The following SPARQL query lists all the pairings in ascending order of cosine similarity:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX m: &amp;lt;http://learningsparql.com/model#&amp;gt; 
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; 

SELECT ?score ?title1 ?title2 WHERE {
  ?comparison m:doc ?recipe1, ?recipe2 ;
              m:recipeCosineSim ?score .
  ?recipe1 dc:title ?title1 .
  ?recipe2 dc:title ?title2 .
  FILTER (?recipe1 != ?recipe2)
}
ORDER BY ?score
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It turns out that the two most similar recipes are &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:Wonton_Soup&#34;&gt;Wonton Soup&lt;/a&gt; and the &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:Egg_Roll&#34;&gt;Egg Roll&lt;/a&gt; recipe, with a cosine similarity score of 0.9928804. The pairing of &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:Pork_Pot_Pie&#34;&gt;Pork Pot Pie&lt;/a&gt; and &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:Chicken_Pot_Pie_II&#34;&gt;Chicken Pot Pie II&lt;/a&gt; came in second. (I was relieved to see that there was no Chicken Pot Pie I, because if there had been and it wasn&amp;rsquo;t more similar to its sequel than the Pork Pot Pie, then the whole model&amp;rsquo;s ability to determine similarities would be much more questionable. As you&amp;rsquo;ll see below, it&amp;rsquo;s questionable anyway, but there are actions I can take to try to improve it.)&lt;/p&gt;
&lt;p&gt;A slight variation on the above query created the basis of my &amp;ldquo;If you liked&amp;hellip;&amp;rdquo; page. It sorts the results by recipe title and then by descending similarity score, so the recipes most similar to each one appear first.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX m: &amp;lt;http://learningsparql.com/model#&amp;gt; 
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;	

# A comparison object looks like this:
# [ m:doc &amp;lt;https://en.wikibooks.org/wiki/Cookbook:Apple_Raisin_Oat_Muffins&amp;gt;, 
#   &amp;lt;https://en.wikibooks.org/wiki/Cookbook:Adobo&amp;gt; ; m:recipeCosineSim 0.8590696 ] .

SELECT ?score ?doc1URL ?doc1title ?doc2URL ?doc2title WHERE {
   ?comparison m:doc ?doc1URL, ?doc2URL; m:recipeCosineSim ?score .
   ?doc1URL dc:title ?doc1title .
   ?doc2URL dc:title ?doc2title .
  
  FILTER(?doc1URL != ?doc2URL)
  
  # Experimenting with the cutoff value led to the figure of .975:
  # .92: 189970 result lines; .95: 76866; .97: 8562; .98: 544; .975: 2604
  FILTER(?score &amp;gt; .975)
}
ORDER BY ?doc1title desc(?score)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When I ran this with &lt;a href=&#34;https://jena.apache.org/documentation/query/index.html&#34;&gt;arq&lt;/a&gt; I asked it for tab-separated value output, and I wrote a perl script to convert that to HTML. I also did a little hand-editing at the top of the HTML file to add the introduction, and that&amp;rsquo;s what you see at the  &lt;a href=&#34;http://www.bobdc.com/miscfiles/similarRecipes.html&#34;&gt;If you liked&amp;hellip;&lt;/a&gt; page. This web page is hopefully useful because you can easily look up a given recipe and find out which others are most similar to it. All recipe names there are links to the original recipes; their URLs were easy to store throughout the various steps because I treated each recipe&amp;rsquo;s URL as the URI for that document. This whole Linked Data thing is pretty useful sometimes!&lt;/p&gt;
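The actual conversion was a perl script, but the TSV-to-HTML idea can be sketched in a few lines of Python. The column layout and the markup here are my guesses, not the real script:

```python
import html

# Hypothetical rows in the shape of arq's tab-separated output:
# score, first recipe title, second recipe title.
tsv = (
    "0.9928804\tCookbook:Wonton Soup\tCookbook:Egg Roll\n"
    "0.9912\tCookbook:Pork Pot Pie\tCookbook:Chicken Pot Pie II\n"
)

rows = [line.split("\t") for line in tsv.splitlines()]
items = "\n".join(
    f"<li>{html.escape(t1)} &rarr; {html.escape(t2)} ({score})</li>"
    for score, t1, t2 in rows
)
page = f"<html><body><ul>\n{items}\n</ul></body></html>"
```

In the real page each title would also carry its recipe URL as a link, which is easy because the URL travels with the title through every step.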
&lt;p&gt;You can find the perl script and queries, along with the scripts I used to pull down and prepare the recipe data, in the git repository I created.&lt;/p&gt;
&lt;p&gt;A great feature of the Wikibooks recipe collection is how many different cuisines are represented, so I was hoping for some interesting cross-cultural pairings, but so many of the pairings make so little sense that it&amp;rsquo;s difficult to take the non-obvious ones seriously. This starts with the very first one on the list: I have no idea why it rates &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:Macaroni_and_Cheese&#34;&gt;Macaroni and Cheese&lt;/a&gt; as the closest thing to &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:B%C3%A1nh_M%C3%AC&#34;&gt;Bánh mì&lt;/a&gt;.  Ranking &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:Guacamole&#34;&gt;Guacamole&lt;/a&gt; as very similar to &lt;a href=&#34;https://en.wikibooks.org/wiki/Cookbook:A_Nice_Cup_of_Tea&#34;&gt;A Nice Cup of Tea&lt;/a&gt; is even worse.&lt;/p&gt;
&lt;p&gt;This reminds us that, as VCs throw their money at AI startups that promise easy, plug-and-play machine learning, we must remember what machine learning people call the &amp;ldquo;no free lunch&amp;rdquo; principle: no single model is going to do everything well. Getting good results means tweaking parameters for how tools do their work, and knowing what tweaks to make requires some study.&lt;/p&gt;
&lt;p&gt;I have ideas for experiments to get the cosine similarity scores to make more intuitive sense. Many of the values set in the &amp;ldquo;initialize embeddings&amp;rdquo; and &amp;ldquo;class FlairEmbeddings&amp;rdquo; sections of the script above were one set of choices among several possibilities. (I didn&amp;rsquo;t make those choices myself; I just copied them from other examples.) For example, instead of using news-forward and news-backward as the character-level language models, I could have selected from the other choices described in the &lt;a href=&#34;https://github.com/zalandoresearch/flair/blob/master/flair/embeddings.py&#34;&gt;source code&lt;/a&gt;. For the word embeddings that the flair document embeddings build on, the same source code shows alternatives to &lt;a href=&#34;https://nlp.stanford.edu/projects/glove/&#34;&gt;GloVe&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But the time it takes to calculate all of those document embeddings makes it difficult to churn quickly through different combinations of initialization settings. I started up a few small, cheap AWS instances from &lt;a href=&#34;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html&#34;&gt;Amazon Machine Images&lt;/a&gt; and was unable to install flair on them, so my next line of research is to keep looking for an image that works for this. (I&amp;rsquo;d appreciate any suggestions&amp;hellip;)&lt;/p&gt;
&lt;p&gt;Still, the fact that I could take someone else&amp;rsquo;s &lt;a href=&#34;https://github.com/swapnilg915/cosine_similarity_using_embeddings/blob/master/flair_embeddings.py&#34;&gt;63-line script&lt;/a&gt;, modify it a bit, and use machine learning to create an HTML index of recipe similarity for a good-sized yet diverse cookbook shows that getting and then tweaking such tools is within reach for anyone with a collection of documents. For people doing digital humanities work, the possibilities in the document embeddings corner of the machine learning world look especially promising.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
      <category domain="https://www.bobdc.com//categories/digital-humanities">digital humanities</category>
      
    </item>
    
    <item>
      <title>Converting JSON to RDF</title>
      <link>https://www.bobdc.com/blog/json2rdf/</link>
      <pubDate>Sun, 22 Sep 2019 11:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/json2rdf/</guid>
      
      
      <description><div>Any JSON at all.</div><div>&lt;blockquote class=&#34;pullquote&#34;&gt;The real payoff of easy conversion of JSON to RDF is the ease with which you can then integrate that data with other datasets.&lt;/blockquote&gt;
&lt;p&gt;When I was at &lt;a href=&#34;https://www.topquadrant.com&#34;&gt;TopQuadrant&lt;/a&gt;, I learned that their &lt;a href=&#34;https://sparqlmotion.org&#34;&gt;SPARQLMotion&lt;/a&gt; scripting language had a module that could convert JSON to RDF. This had nothing to do with &lt;a href=&#34;http://www.bobdc.com/blog/json-ld/&#34;&gt;JSON-LD&lt;/a&gt;—it worked with any JSON at all, using blank nodes to indicate the grouping of data within arbitrary structures.&lt;/p&gt;
&lt;p&gt;Because this tool is only available to paying TopQuadrant customers (or those in the first 30 days of the trial version of TopBraid Composer Maestro Edition), I&amp;rsquo;ve kept my eye out for a free tool that would do this, and I was happy to see AtomGraph&amp;rsquo;s &lt;a href=&#34;https://github.com/AtomGraph/JSON2RDF&#34;&gt;JSON2RDF&lt;/a&gt; on github. I had to build the binary myself, but this was easy enough after a quick install of the &lt;a href=&#34;https://maven.apache.org&#34;&gt;maven&lt;/a&gt; build tool. As the JSON2RDF github readme file tells us, &lt;code&gt;mvn clean install&lt;/code&gt; is all you need to build a jar file. A &lt;a href=&#34;https://www.docker.com&#34;&gt;Docker&lt;/a&gt; image is also available.&lt;/p&gt;
&lt;p&gt;I could then run it on a &lt;code&gt;myinput.json&lt;/code&gt; input file to create a &lt;code&gt;myoutput.ttl&lt;/code&gt; file with this command line:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;java -jar json2rdf-1.0.0-SNAPSHOT-jar-with-dependencies.jar http://example.com/test# &amp;lt; myinput.json &amp;gt; myoutput.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;As you&amp;rsquo;ll see in the sample output below, the converter uses the URL provided in the command line as the base URI for the properties in the output.&lt;/p&gt;
&lt;p&gt;To test it I ran that command line using the following handmade JSON as input:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{ 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;color&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;red&amp;#34;&lt;/span&gt;, 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;amount&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt;,                      
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;arrayTest&amp;#34;&lt;/span&gt;: [&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;north&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;south&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;east&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;west&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt;,&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;escaped \/string&amp;#34;&lt;/span&gt;],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;boolTest&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#66d9ef&#34;&gt;true&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;nullTest&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#66d9ef&#34;&gt;null&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;addressBookEntry&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;first&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Richard&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;last&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Mutt&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;address&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;street&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;1 Main St&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;city&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Springfield&amp;#34;&lt;/span&gt; ,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;zip&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;10045&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here is the output that AtomGraph&amp;rsquo;s JSON2RDF created:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;_:B6bba &amp;lt;http://example.com/test#color&amp;gt; &amp;#34;red&amp;#34; .
_:B6bba &amp;lt;http://example.com/test#amount&amp;gt; &amp;#34;3&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#int&amp;gt; .
_:B6bba &amp;lt;http://example.com/test#arrayTest&amp;gt; &amp;#34;north&amp;#34; .
_:B6bba &amp;lt;http://example.com/test#arrayTest&amp;gt; &amp;#34;south&amp;#34; .
_:B6bba &amp;lt;http://example.com/test#arrayTest&amp;gt; &amp;#34;east&amp;#34; .
_:B6bba &amp;lt;http://example.com/test#arrayTest&amp;gt; &amp;#34;west&amp;#34; .
_:B6bba &amp;lt;http://example.com/test#arrayTest&amp;gt; &amp;#34;3&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#int&amp;gt; .
_:B6bba &amp;lt;http://example.com/test#arrayTest&amp;gt; &amp;#34;escaped /string&amp;#34; .
_:B6bba &amp;lt;http://example.com/test#boolTest&amp;gt; &amp;#34;true&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#boolean&amp;gt; .
_:B6bba &amp;lt;http://example.com/test#addressBookEntry&amp;gt; _:Bcd68.

_:Bcd68 &amp;lt;http://example.com/test#first&amp;gt; &amp;#34;Richard&amp;#34; .
_:Bcd68 &amp;lt;http://example.com/test#last&amp;gt; &amp;#34;Mutt&amp;#34; .
_:Bcd68 &amp;lt;http://example.com/test#address&amp;gt; _:B9a02 .

_:B9a02 &amp;lt;http://example.com/test#street&amp;gt; &amp;#34;1 Main St&amp;#34; .
_:B9a02 &amp;lt;http://example.com/test#city&amp;gt; &amp;#34;Springfield&amp;#34; .
_:B9a02 &amp;lt;http://example.com/test#zip&amp;gt; &amp;#34;10045&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;(To make it easier to read on this page I replaced the original blank node identifiers created by JSON2RDF with shorter versions and added two carriage returns.) You can see that the converter handled the data types, the escaped string, and the nested structures just fine. This output also provides a nice lesson in how, although the simplicity of the RDF data model means that any data collection is a flat list of triples, you can still represent more complex data structures with very little trouble.&lt;/p&gt;
&lt;p&gt;That was a hand-curated example. To test it on something from the wild, I grabbed the following from the &lt;a href=&#34;https://www.mongodb.com/json-and-bson&#34;&gt;JSON and BSON&lt;/a&gt; page of mongodb.com:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;_id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;name&amp;#34;&lt;/span&gt; : { &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;first&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;John&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;last&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Backus&amp;#34;&lt;/span&gt; },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;contribs&amp;#34;&lt;/span&gt; : [ &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Fortran&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;ALGOL&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Backus-Naur Form&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;FP&amp;#34;&lt;/span&gt; ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;awards&amp;#34;&lt;/span&gt; : [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;award&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;W.W. McDowell Award&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;year&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#ae81ff&#34;&gt;1967&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;by&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;IEEE Computer Society&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }, {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;award&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Draper Prize&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;year&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#ae81ff&#34;&gt;1993&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;by&amp;#34;&lt;/span&gt; : &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;National Academy of Engineering&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;JSON2RDF turned it into this (again, with blank node identifiers replaced with shorter versions for easier reading):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;_:Bcd72 &amp;lt;http://example.com/test#_id&amp;gt; &amp;#34;1&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#int&amp;gt; .
_:Bcd72 &amp;lt;http://example.com/test#name&amp;gt; _:Be87 .
_:Be87 &amp;lt;http://example.com/test#first&amp;gt; &amp;#34;John&amp;#34; .
_:Be87 &amp;lt;http://example.com/test#last&amp;gt; &amp;#34;Backus&amp;#34; .
_:Bcd72 &amp;lt;http://example.com/test#contribs&amp;gt; &amp;#34;Fortran&amp;#34; .
_:Bcd72 &amp;lt;http://example.com/test#contribs&amp;gt; &amp;#34;ALGOL&amp;#34; .
_:Bcd72 &amp;lt;http://example.com/test#contribs&amp;gt; &amp;#34;Backus-Naur Form&amp;#34; .
_:Bcd72 &amp;lt;http://example.com/test#contribs&amp;gt; &amp;#34;FP&amp;#34; .
_:Bcd72 &amp;lt;http://example.com/test#awards&amp;gt; _:Bbc13 .
_:Bbc13 &amp;lt;http://example.com/test#award&amp;gt; &amp;#34;W.W. McDowell Award&amp;#34; .
_:Bbc13 &amp;lt;http://example.com/test#year&amp;gt; &amp;#34;1967&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#int&amp;gt; .
_:Bbc13 &amp;lt;http://example.com/test#by&amp;gt; &amp;#34;IEEE Computer Society&amp;#34; .
_:Bcd72 &amp;lt;http://example.com/test#awards&amp;gt; _:Ba9d .
_:Ba9d &amp;lt;http://example.com/test#award&amp;gt; &amp;#34;Draper Prize&amp;#34; .
_:Ba9d &amp;lt;http://example.com/test#year&amp;gt; &amp;#34;1993&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#int&amp;gt; .
_:Ba9d &amp;lt;http://example.com/test#by&amp;gt; &amp;#34;National Academy of Engineering&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I ran this SPARQL query against those triples to find awards from after 1990,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX e: &amp;lt;http://example.com/test#&amp;gt;

SELECT ?awardName ?year WHERE {
   ?award e:year ?year ;
  e:award ?awardName  .
  
  FILTER (?year &amp;gt; 1990)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;and got this result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-------------------------------------------------------------------
| awardName      | year                                           |
===================================================================
| &amp;#34;Draper Prize&amp;#34; | &amp;#34;1993&amp;#34;^^&amp;lt;http://www.w3.org/2001/XMLSchema#int&amp;gt; |
-------------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This is still a rather artificial example. Before converting that JSON about John Backus I could have just queried it directly with a tiny bit of JavaScript or an even tinier &lt;a href=&#34;https://stedolan.github.io/jq/&#34;&gt;jq&lt;/a&gt; expression. The real payoff of easy conversion of JSON to RDF is the ease with which you can then integrate that data with other datasets. With the vast amount of JSON data out there, this means that there is even more data to take advantage of in RDF-based applications.&lt;/p&gt;
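For instance, answering the same "awards after 1990" question directly against the JSON, with no RDF involved, really does take only a few lines of Python (a jq one-liner would be shorter still):

```python
import json

# The John Backus sample data from above, abbreviated to the relevant parts.
backus = json.loads("""
{
  "name": {"first": "John", "last": "Backus"},
  "awards": [
    {"award": "W.W. McDowell Award", "year": 1967, "by": "IEEE Computer Society"},
    {"award": "Draper Prize", "year": 1993, "by": "National Academy of Engineering"}
  ]
}
""")

# Same filter as the SPARQL query's FILTER (?year > 1990).
late_awards = [(a["award"], a["year"])
               for a in backus["awards"] if a["year"] > 1990]
```

The point is not that this is hard in JSON; it's that once the data is RDF, the same query syntax works across this dataset and any others you merge with it.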
&lt;p&gt;For example, imagine that you have two different MongoDB JSON datasets designed independently by two different developers. Merging these into a single JSON dataset so that you can treat the combination as a whole that is greater than the sum of its parts is going to be a lot of ETL work. With the data in RDF, you only need a CONSTRUCT query for each dataset to rename some properties. (A few class, subclass, and subproperty declarations might be handy for a little data modeling, but these are optional.) Then, you just append one set of transformed triples to the other and you&amp;rsquo;ve got a single dataset.&lt;/p&gt;
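One of those renaming CONSTRUCT queries might look something like this; the dataset-specific and shared namespaces here are invented for illustration:

```sparql
PREFIX ds1:    <http://example.com/dataset1#>
PREFIX shared: <http://example.com/shared#>

# Copy dataset 1's triples, renaming its properties
# to the shared vocabulary used by the merged dataset.
CONSTRUCT {
  ?entry shared:givenName  ?first ;
         shared:familyName ?last .
}
WHERE {
  ?entry ds1:first ?first ;
         ds1:last  ?last .
}
```

A similar query for the second dataset maps its own property names to the same shared vocabulary, and then the two sets of output triples can simply be concatenated.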
&lt;p&gt;Two more notes about AtomGraph&amp;rsquo;s JSON2RDF:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Make sure to read through all the readme information on AtomGraph&amp;rsquo;s &lt;a href=&#34;https://github.com/AtomGraph/JSON2RDF&#34;&gt;github page&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;As with SPARQLMotion&amp;rsquo;s ConvertJSONToRDF module, AtomGraph&amp;rsquo;s utility is part of a collection of tools that they make available to pipeline together for application development. Unlike SPARQLMotion, it&amp;rsquo;s open source and can be run from the command line, so in the old-fashioned Unix sense of the word &amp;ldquo;pipeline&amp;rdquo; it can be connected up to tools from other developers as well, such as the aforementioned jq.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/json">JSON</category>
      
    </item>
    
    <item>
      <title>Custom HTML form front end, SPARQL endpoint back end</title>
      <link>https://www.bobdc.com/blog/htmlform/</link>
      <pubDate>Sun, 25 Aug 2019 09:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/htmlform/</guid>
      
      
      <description><div>Your website&#39;s users sending SPARQL queries, even if they haven&#39;t heard of SPARQL.</div><div>&lt;img id=&#34;idm45478314451696&#34; src=&#34;https://www.bobdc.com/img/main/negronisparql.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Negroni and SPARQL logo&#34;/&gt;
&lt;p&gt;In a &lt;a href=&#34;https://twitter.com/Joanne_Paul_/status/1153268516541358080&#34;&gt;recent Twitter exchange&lt;/a&gt;, &lt;a href=&#34;https://twitter.com/Joanne_Paul_&#34;&gt;Dr Joanne Paul&lt;/a&gt; asked &amp;ldquo;Does/can this exist? A website where I enter a title (eg. &amp;rsquo;earl of pembroke&amp;rsquo;) and a year (eg. 1553) and it spits out who held that title in that year (in this case, William Herbert).&amp;rdquo; &lt;a href=&#34;https://twitter.com/MichWatsonOz&#34;&gt;Michelle Watson&lt;/a&gt; &lt;a href=&#34;https://twitter.com/MichWatsonOz/status/1153281549896339456&#34;&gt;replied&lt;/a&gt; &amp;ldquo;I bet you could probably write SPARQL query to Wikipedia that would come close to doing that. Not sure how you&amp;rsquo;d embed that into a webpage though.&amp;rdquo; I &lt;a href=&#34;https://twitter.com/bobdc/status/1155482790651138049&#34;&gt;replied&lt;/a&gt; to that: &amp;ldquo;Have an HTML form that hands the entered values to a CGI script (Perl or Python or whatever) that plugs the values into a SPARQL query, sends that off to Wikipedia, and formats the result as HTML&amp;rdquo; and then &amp;ldquo;See pages 285 - 291 of my book &amp;ldquo;Learning SPARQL&amp;rdquo; for an example that uses Python and IMDB. The Python script is at &lt;a href=&#34;http://www.learningsparql.com/1steditionexamples/ex364-cgi.txt&#34;&gt;http://www.learningsparql.com/1steditionexamples/ex364-cgi.txt&lt;/a&gt; .&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I thought I&amp;rsquo;d done a simple example on my blog outside of the book and couldn&amp;rsquo;t find it, so I&amp;rsquo;m doing another one here because it&amp;rsquo;s so easy. Instead of a Python CGI &lt;a href=&#34;https://en.wikipedia.org/wiki/Common_Gateway_Interface&#34;&gt;(Common Gateway Interface)&lt;/a&gt; script calling linkedmdb.org like I did in the book, I wrote a Perl CGI script that calls Wikidata. Instead of having the end user enter the names of two directors on a form and then listing all the actors who have been in films by both directors, like I did in the Python example, in my new one the end user enters the name of a cocktail ingredient and clicks a button. Then, a dynamic web page lists the cocktails that use that ingredient with links to each cocktail&amp;rsquo;s Wikipedia page. (The example in the book called a SPARQL  server at the Linked Movie Database, which doesn&amp;rsquo;t seem to work anymore anyway.) Either way, the key is that the person entering the query criteria is simply filling out a form and they don&amp;rsquo;t need to know anything about the technology on the back end.&lt;/p&gt;
&lt;p&gt;Before creating such a query, I had to ask: does Wikidata have the data  I need to determine which drinks have which ingredients? Wikipedia infoboxes are usually the quickest way to assess whether the data you need is available in a structured form. If you look at the Wikipedia page for a &lt;a href=&#34;https://en.wikipedia.org/wiki/Negroni&#34;&gt;Negroni&lt;/a&gt;, the infobox lists the ingredients in a fairly structured way, which usually means that the data is available in Wikidata with enough structure to query it. The infobox also shows that a Negroni is an &lt;a href=&#34;https://en.wikipedia.org/wiki/List_of_IBA_official_cocktails&#34;&gt;IBA (International Bartenders Association) Official Cocktail&lt;/a&gt;, or in data modeling terms, it&amp;rsquo;s an instance of a class that we can query for. (The narrative text of the page also has an excellent origin story about how Count Camillo Negroni inspired the drink&amp;rsquo;s creation in 1919 and how Orson Welles had something clever to say about the drink 28 years later.)&lt;/p&gt;
&lt;p&gt;The basic steps for creating a web form that calls a SPARQL endpoint:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Write a SPARQL query that requests a specific example of the thing you want from the endpoint. My query asked for cocktails where &amp;ldquo;bitters&amp;rdquo; was an ingredient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a web page with an HTML form where the end user can enter the value or values that will customize the query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add the SPARQL query to a CGI script that takes the values passed from the web form, plugs them into the appropriate places in the query, sends the query off to the endpoint, and then displays the result as HTML.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The results of steps 1 and 3 end up in the same CGI file, and the result of step 2 is so small and simple (526 bytes, even with a dash of CSS) that you should take a quick look at my SPARQL cocktail query &lt;a href=&#34;http://bobdc.com/miscfiles/sparqlcocktail.html&#34;&gt;HTML form&lt;/a&gt; and its source before I describe the CGI file. As you’ll see, when the user clicks the form’s &amp;ldquo;search&amp;rdquo; button, the form passes the entered value to the script in a &lt;code&gt;q&lt;/code&gt; variable.&lt;/p&gt;
&lt;p&gt;Here is the Perl CGI script:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;#!/usr/bin/perl

# sample call: http://www.bobdc.com/cgi/sparqlcocktail.cgi?q=scotch

require sparql;  # Assumes that sparql.pm is in this directory; comes
# from https://github.com/swh/Perl-SPARQL-client-library

use strict;
use CGI;

# Usage of Perl-SPARQL-client-library based on test.pl included with it
my $params = CGI-&amp;gt;new;
my $searchTerm = $params-&amp;gt;param(&amp;#39;q&amp;#39;);

my $sparql = sparql-&amp;gt;new();
my $endpoint = &amp;#39;https://query.wikidata.org/sparql&amp;#39;;

# Prefixes used in query don&amp;#39;t need declarations
# because the endpoint has all of these predeclared. 
my $query = &amp;#39;
SELECT ?cocktailName ?wikipediaURL ?ingredientName WHERE {
  BIND (&amp;#34;SEARCHTERM&amp;#34; AS ?searchTerm )
  # ?cocktail instance of IBA official cocktail, 
  ?cocktail wdt:P31 wd:Q2536409 ;  
          # material used ?ingredient,
          wdt:P186 ?ingredient ;    
          rdfs:label ?cocktailName . 
  ?ingredient rdfs:label ?ingredientName . 
  FILTER (lang(?ingredientName) = &amp;#34;en&amp;#34;)
  FILTER (lang(?cocktailName) = &amp;#34;en&amp;#34;)
  # substring query so that &amp;#34;lime&amp;#34; finds &amp;#34;lime juice&amp;#34;, &amp;#34;lime wedge&amp;#34;, etc.
  FILTER(contains(lcase(?ingredientName),lcase(?searchTerm)))
  ?wikipediaURL schema:about ?cocktail . 
  FILTER(contains(str(?wikipediaURL),&amp;#34;/en.wikipedia.org&amp;#34;))
}
ORDER BY ?cocktailName 
&amp;#39;;

# Insert the search term into the query
$query =~ s/SEARCHTERM/$searchTerm/;

# Perform the query
my $queryResult = $sparql-&amp;gt;query($endpoint,$query);

# Output the result as HTML
print &amp;#34;Content-type: text/html\n\n&amp;#34;;
print &amp;#34;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;SPARQL Cocktails Results&amp;lt;/title&amp;gt;\n&amp;#34;;
print &amp;#34;&amp;lt;style type=&amp;#39;text/css&amp;#39;&amp;gt; * { font-family: arial,helvetica}&amp;lt;/style&amp;gt;\n&amp;#34;;
print &amp;#34;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;\n&amp;#34;;

if (scalar(@{$queryResult}) == 0) {
    print &amp;#34;No drinks found with $searchTerm as an ingredient.\n&amp;#34;;
}
else {
    print &amp;#34;&amp;lt;h2&amp;gt;Cocktails with $searchTerm as an ingredient&amp;lt;/h2&amp;gt;\n&amp;#34;;
    for my $row (@{$queryResult}) {
        my $wikipediaURL = $row-&amp;gt;{&amp;#39;wikipediaURL&amp;#39;};
        my $cocktailName = $row-&amp;gt;{&amp;#39;cocktailName&amp;#39;};
        my $ingredientName = $row-&amp;gt;{&amp;#39;ingredientName&amp;#39;};

        # Remove delimiters and language tags.
        $wikipediaURL =~ s/&amp;lt;(.+)&amp;gt;/$1/;
        $cocktailName =~ s/\&amp;#34;(.+)\&amp;#34;\@en/$1/;
        $ingredientName =~ s/\&amp;#34;(.+)\&amp;#34;\@en/$1/;

        print &amp;#34;&amp;lt;p&amp;gt;&amp;lt;a href=&amp;#39;$wikipediaURL&amp;#39;&amp;gt;$cocktailName&amp;lt;/a&amp;gt;:&amp;#34;;
        print &amp;#34; $ingredientName&amp;lt;/p&amp;gt;\n&amp;#34;;
    }
}
print &amp;#34;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;\n&amp;#34;;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The SPARQL query is stored in the Perl variable &lt;code&gt;$query&lt;/code&gt;, and the script takes the &lt;code&gt;q&lt;/code&gt; value passed from the form and replaces the string &amp;ldquo;SEARCHTERM&amp;rdquo; in the SPARQL query with that value.&lt;/p&gt;
&lt;p&gt;The workings of the query are described by comments within it. It uses &lt;code&gt;sparql.pm&lt;/code&gt; from the &lt;a href=&#34;https://github.com/swh/Perl-SPARQL-client-library&#34;&gt;Perl-SPARQL-client-library&lt;/a&gt; library that &lt;a href=&#34;http://steve.harris.name/&#34;&gt;Steve Harris&lt;/a&gt; (a.k.a. &lt;a href=&#34;https://twitter.com/theno23&#34;&gt;@theno23&lt;/a&gt;) added there six years ago. It&amp;rsquo;s nice that when Steve&amp;rsquo;s library passes the query to the endpoint, the comments cause no problems—I have seen libraries that pass SPARQL queries to endpoints without the carriage returns so that embedded comments screw up the parsing of the query. So, my comments describing how the query works are right in the query instead of here.&lt;/p&gt;
&lt;p&gt;CGI scripts have been around since the 1990s and played an important role in the web evolving from static web pages to something more interactive and dynamic. They still work, as you can see, and make it easy to automate the use of SPARQL endpoints for people who&amp;rsquo;ve never heard of SPARQL or RDF. The layers of UI technology that have been developed since, typically  as JavaScript libraries, can of course be incorporated here so that a modern responsive interface can take advantage of SPARQL endpoints on the back end such as Wikidata as well.&lt;/p&gt;
&lt;p&gt;If you write an HTML form and CGI script that sends a SPARQL query to a SPARQL endpoint such as Wikidata, let me know. I&amp;rsquo;d love to see it!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/cgi">CGI</category>
      
      <category domain="https://www.bobdc.com//categories/perl">Perl</category>
      
    </item>
    
    <item>
      <title>Converting sqlite browser cookies to Turtle and querying them with SPARQL</title>
      <link>https://www.bobdc.com/blog/sqlite/</link>
      <pubDate>Sun, 28 Jul 2019 10:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/sqlite/</guid>
      
      
      <description><div>Because you have more SQLite data than you realized.</div><div>&lt;img id=&#34;idm45478314451696&#34; src=&#34;https://www.bobdc.com/img/main/sparqlsqllite.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Negroni and SPARQL logo&#34;/&gt;
&lt;p&gt;There is a reasonable chance that you&amp;rsquo;ve never heard of &lt;a href=&#34;https://sqlite.org/index.html&#34;&gt;SQLite&lt;/a&gt; and are unaware that this database management program and many database files in its format may be stored on all of your computing devices. Firefox and Chrome in particular use it to keep track of your cookies and, as I&amp;rsquo;ve recently learned, many other things. Of course I want to query all that data with SPARQL, so I wrote some short simple scripts to convert these tables of data to Turtle.&lt;/p&gt;
&lt;p&gt;From a Linux, Windows, or MacOS command prompt (or from the prompt that the excellent &lt;a href=&#34;https://termux.com/&#34;&gt;termux&lt;/a&gt; app adds to Android phones), type &lt;code&gt;sqlite3&lt;/code&gt; to get to the SQLite prompt. If you enter &lt;code&gt;sqlite3 someFileName&lt;/code&gt; it opens that file if it&amp;rsquo;s an SQLite database or creates one with that name if it doesn&amp;rsquo;t exist. From the SQLite prompt, the &lt;code&gt;.quit&lt;/code&gt; command quits SQLite, &lt;code&gt;.tables&lt;/code&gt; lists tables, and &lt;code&gt;.help&lt;/code&gt; tells you about the other commands. Other than that, at the prompt you can enter the typical SQL commands to create tables, insert data into them, as well as to query, update, and delete data. (I did a blog entry titled &lt;a href=&#34;http://www.bobdc.com/blog/my-sql-quick-reference/&#34;&gt;My SQL quick reference&lt;/a&gt; several years ago and have since contributed an &lt;a href=&#34;https://learnxinyminutes.com/docs/sql/&#34;&gt;updated version of it&lt;/a&gt; to the excellent &lt;a href=&#34;https://learnxinyminutes.com/&#34;&gt;Learn X in Y Minutes&lt;/a&gt; site.)&lt;/p&gt;
&lt;p&gt;A search of my hard disk found dozens and dozens of files whose names ended with &lt;code&gt;.sqlite&lt;/code&gt;. I believe it&amp;rsquo;s an older convention to end SQLite database filenames with &lt;code&gt;.db&lt;/code&gt;, and I had some Chrome and Firefox files with that as well. The &lt;code&gt;~/.config/google-chrome/Default&lt;/code&gt; directory had many files that didn&amp;rsquo;t have &lt;code&gt;.sqlite&lt;/code&gt; or &lt;code&gt;.db&lt;/code&gt; extensions but still turned out to be SQLite files.&lt;/p&gt;
&lt;p&gt;SQLite can execute a series of commands stored in a script. For example, my &lt;code&gt;tableList.scr&lt;/code&gt; file has just these two lines,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;.tables
.quit
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;and from my operating system command line I can quickly list the tables in the &lt;code&gt;cookies.sqlite&lt;/code&gt; database file that I found in a Firefox directory with this command line:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;sqlite3 cookies.sqlite &amp;lt; tableList.scr 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result shows that &lt;code&gt;cookies.sqlite&lt;/code&gt; has just one table: &lt;code&gt;moz_cookies&lt;/code&gt;.&lt;/p&gt;
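&lt;p&gt;As a quick sketch of my own (not part of the original scripts), the same table listing can be done from Python&amp;rsquo;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module by reading the &lt;code&gt;sqlite_master&lt;/code&gt; catalog table that every SQLite database file contains:&lt;/p&gt;

```python
# Sketch: list the tables in an SQLite file with Python's standard
# sqlite3 module instead of the sqlite3 command-line shell.
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in the given SQLite database file."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    names = [row[0] for row in rows]
    conn.close()
    return names
```

&lt;p&gt;Pointed at a copy of that Firefox file, this should return just the one &lt;code&gt;moz_cookies&lt;/code&gt; name.&lt;/p&gt;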
&lt;p&gt;Once I know what tables are in an SQLite file, my &lt;code&gt;sqliteToTSV.sh&lt;/code&gt; script pulls the data from a named table within that file and saves it as a Turtle file so that I can query it with SPARQL. (You can find all of the scripts and queries that I wrote for this in &lt;a href=&#34;https://github.com/bobdc/misc/tree/master/sqliterdf&#34;&gt;github&lt;/a&gt;.) If you pass the database filename and table name to this shell script like this,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt; sqliteToTSV.sh cookies.sqlite moz_cookies
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;it first creates an  SQLite script that executes an SQL SELECT command to save everything in that table to a tab-separated value file. It then runs a Perl script that converts the TSV file to Turtle. (This shell script should work fine as a Windows batch file with minimal modifications.)&lt;/p&gt;
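&lt;p&gt;As a rough sketch of that first step (the real one is the shell script in the repo; the function name here is made up for the example), the same dump can be written with Python&amp;rsquo;s standard &lt;code&gt;sqlite3&lt;/code&gt; and &lt;code&gt;csv&lt;/code&gt; modules:&lt;/p&gt;

```python
# Sketch of the table-to-TSV step: select everything from one table and
# write it out with tab separators, column headers on the first line.
import csv
import sqlite3

def dump_table_to_tsv(db_path, table, tsv_path):
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT * FROM " + table)
    with open(tsv_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        # Header row comes from the cursor's column descriptions.
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur)
    conn.close()
```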
&lt;p&gt;The Perl script would be especially short if I hadn&amp;rsquo;t found escaped JSON data in some SQLite column values and binary data in others, so I had the script check for those and just output stub labels instead of trying to do anything useful with them. (Note that for all the SQLite files that I played with, I actually played with copies in a new directory, not the originals created by applications such as the browsers.)&lt;/p&gt;
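&lt;p&gt;The heart of that TSV-to-Turtle step is small. Here is a sketch of my own, in Python rather than Perl and without those JSON and binary checks, of turning tabular rows into Turtle blank nodes in the square-bracket style that you will see below; it assumes that the &lt;code&gt;m:&lt;/code&gt; prefix gets declared at the top of the output file:&lt;/p&gt;

```python
# Sketch: one Turtle blank node per table row, with properties named
# after the column headers. Assumes the m: prefix is declared elsewhere
# in the output file; string values only, with minimal escaping.
def rows_to_turtle(headers, rows, prefix="m"):
    chunks = []
    for row in rows:
        props = []
        for name, value in zip(headers, row):
            value = str(value).replace("\\", "\\\\").replace('"', '\\"')
            props.append('   %s:%s "%s"' % (prefix, name, value))
        chunks.append("[\n" + " ;\n".join(props) + " ] .")
    return "\n".join(chunks)
```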
&lt;p&gt;The remaining ASCII data still offers plenty of interesting things to look at. My files in the git repo include an &lt;code&gt;ffCookiesHosts.rq&lt;/code&gt; SPARQL query that counts how many Firefox cookies come from each base domain and outputs a list sorted by the counts in descending order. Here are the first few lines of output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-------------------------------------
| baseDomain              | cookies |
=====================================
| &amp;#34;google.com&amp;#34;            | 57      |
| &amp;#34;pubmatic.com&amp;#34;          | 41      |
| &amp;#34;tremorhub.com&amp;#34;         | 32      |
| &amp;#34;verizon.com&amp;#34;           | 29      |
| &amp;#34;cnn.com&amp;#34;               | 22      |
| &amp;#34;verizonwireless.com&amp;#34;   | 21      |
| &amp;#34;nfl.com&amp;#34;               | 19      |
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I&amp;rsquo;m not a big NFL guy, but I do remember that when having some Internet trouble the technician and I were using that site to check connectivity. The big surprises for me were the high scores of two names that I didn&amp;rsquo;t recognize: pubmatic.com and tremorhub.com. The tremorhub.com domain name redirects to telaria.com, some company that manages &amp;ldquo;premium&amp;rdquo; video advertising, which sounds just like the kind of company that would dump cookies on  your hard disk without telling you. The pubmatic.com site is about &amp;ldquo;monetization&amp;rdquo; of &amp;ldquo;content&amp;rdquo;, so they too look like a cookie-dumping ad tech firm.&lt;/p&gt;
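&lt;p&gt;The &lt;code&gt;ffCookiesHosts.rq&lt;/code&gt; query itself is in the repo. For comparison, the same count-and-sort can be done directly against the SQLite table with a SQL GROUP BY; here is a sketch in Python, using &lt;code&gt;baseDomain&lt;/code&gt;, the relevant column in Firefox&amp;rsquo;s &lt;code&gt;moz_cookies&lt;/code&gt; table:&lt;/p&gt;

```python
# Sketch: count Firefox cookies per base domain straight from SQL,
# most-cookied domains first, like the SPARQL query's output above.
import sqlite3

def cookie_counts(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT baseDomain, COUNT(*) AS cookies FROM moz_cookies "
        "GROUP BY baseDomain ORDER BY cookies DESC").fetchall()
    conn.close()
    return rows
```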
&lt;p&gt;The &lt;code&gt;googleCookiesHosts.rq&lt;/code&gt; query in the git repo performs a similar query on data from the &lt;code&gt;cookies&lt;/code&gt; table in the &lt;code&gt;~/.config/google-chrome/Default/Cookies&lt;/code&gt; SQLite database. Its output listed rubiconproject.com as a leading cookie depositor along with sites that I actually visit often; they&amp;rsquo;re another ad tech firm that I haven&amp;rsquo;t heard of but has clearly been dumping plenty of cookies onto my hard disk.&lt;/p&gt;
&lt;p&gt;I started looking into this so that I could do SPARQL queries about these deposited cookies and it was interesting to see how many other kinds of SQLite files I had. That same &lt;code&gt;google-chrome/Default&lt;/code&gt; directory has a &lt;code&gt;History&lt;/code&gt; SQLite database file with 11 tables, including &lt;code&gt;keyword_search_terms&lt;/code&gt; and &lt;code&gt;visits&lt;/code&gt;. (Not all the files in that directory are SQLite files, so the lack of file extensions to indicate which ones are SQLite files is a bit annoying.) After conversion to Turtle, the &lt;code&gt;keyword_search_terms&lt;/code&gt; data had triples like this, showing that it had stored my search terms in both the entered case and in lower case:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[
   m:keyword_id &amp;#34;2&amp;#34; ;
   m:url_id &amp;#34;18899&amp;#34; ;
   m:lower_term &amp;#34;coca y sus exploradores lo añoro&amp;#34; ;
   m:term &amp;#34;Coca y Sus Exploradores Lo Añoro&amp;#34; ] .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a href=&#34;https://www.thunderbird.net/en-US/&#34;&gt;Thunderbird&lt;/a&gt;, &lt;a href=&#34;https://www.skype.com/en/&#34;&gt;Skype&lt;/a&gt;, and even &lt;a href=&#34;https://ipython.org/&#34;&gt;iPython&lt;/a&gt; have also deposited SQLite files on my Ubuntu laptop&amp;rsquo;s hard disk.&lt;/p&gt;
&lt;p&gt;If I were writing a script to use in an application that used that &lt;code&gt;keyword_search_terms&lt;/code&gt; data, then instead of representing the values with blank nodes, I&amp;rsquo;d probably give the triples above a subject built on that &lt;code&gt;m:url_id&lt;/code&gt; value. When converting SQL or CSV or other tabular data to Turtle in the past, I&amp;rsquo;ve usually generated URIs to be the subjects; I finally realized that doing it as blank nodes with square brackets, like the example above, is a nice clean way to represent a row from tabular data and a little less trouble.&lt;/p&gt;
&lt;p&gt;One note about date formats: the Google cookies (and maybe more SQLite files) store dates in a strange format that I could not work out how to convert to proper ISO 8601 format in my Perl script. I found an explanation on &lt;a href=&#34;https://stackoverflow.com/questions/19429577/converting-the-date-within-the-places-sqlite-file-in-firefox-to-a-datetime&#34;&gt;Stack Overflow&lt;/a&gt; of how to convert the date formats as part of an SQL SELECT statement that retrieves the data. You could even use the same logic in SQLite to convert the dates within (a copy of!) the database file itself so that my generic Turtle extraction script would pull out more readable dates. For example, the following example of an SQL UPDATE command does this to the &lt;code&gt;lastAccessed&lt;/code&gt; column of a &lt;code&gt;moz_cookies&lt;/code&gt; table:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;UPDATE moz_cookies SET lastAccessed = datetime(lastAccessed / 1000000, &amp;#39;unixepoch&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Overall, it&amp;rsquo;s cool to see how much data is spread around our hard disks using SQLite so that, after some simple scripting, we can explore it with SPARQL.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/turtle">Turtle</category>
      
      <category domain="https://www.bobdc.com//categories/sql">SQL</category>
      
      <category domain="https://www.bobdc.com//categories/sqlite">SQLite</category>
      
    </item>
    
    <item>
      <title>Querying geospatial data with SPARQL</title>
      <link>https://www.bobdc.com/blog/geosparql1/</link>
      <pubDate>Sun, 30 Jun 2019 10:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/geosparql1/</guid>
      
      
      <description><div>Part 1.</div><div>&lt;img id=&#34;idm45478314451696&#34; src=&#34;https://www.bobdc.com/img/main/OSMSPARQL.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;OSM and SPARQL logo&#34;/&gt;
&lt;p&gt;&lt;a href=&#34;https://www.openstreetmap.org/&#34;&gt;OpenStreetMap&lt;/a&gt;, or &amp;ldquo;OSM&amp;rdquo; to geospatial folk, is a crowd-sourced online map that has made tremendous achievements in its role as the Wikipedia of geospatial data. (The &lt;a href=&#34;https://en.wikipedia.org/wiki/OpenStreetMap&#34;&gt;Wikipedia page for OpenStreetMap&lt;/a&gt; is really worth a skim to learn more about its impressive history.) OSM offers a free alternative to commercial mapping systems out there—and you better believe that the commercial mapping systems are reading that great free data into their own databases.&lt;/p&gt;
&lt;p&gt;OSM provides a &lt;a href=&#34;https://sophox.org/&#34;&gt;SPARQL endpoint&lt;/a&gt; and a &lt;a href=&#34;https://wiki.openstreetmap.org/wiki/SPARQL_examples&#34;&gt;nice page of example queries&lt;/a&gt;. With their endpoint, the following query lists the names and addresses of  all the museums in New York City (or, in RDF terms, everything with an &lt;code&gt;osmt:addr:city&lt;/code&gt; value of &amp;ldquo;New York&amp;rdquo; and an &lt;code&gt;osmt:tourism&lt;/code&gt; value of &amp;ldquo;museum&amp;rdquo;):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT ?name ?housenumber ?street 
WHERE {
   ?museum osmt:addr:city &amp;#34;New York&amp;#34;;
      osmt:tourism &amp;#34;museum&amp;#34;;
      osmm:loc ?loc ;
      osmt:name ?name ;
      osmt:addr:housenumber ?housenumber ;
      osmt:addr:street ?street .
      # The following tells it to only get museums south of the Javits Center
      # FILTER(geof:latitude(?loc) &amp;lt; 40.758289)
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can try it &lt;a href=&#34;https://sophox.org/#SELECT%20%3Fname%20%3Fhousenumber%20%3Fstreet%20%0AWHERE%20%7B%0A%3Fmuseum%20osmt%3Aaddr%3Acity%20%22New%20York%22%3B%0A%20%20%20%20osmt%3Atourism%20%22museum%22%3B%0A%20%20%20osmm%3Aloc%20%3Floc%20%3B%0A%20%20%20osmt%3Aname%20%3Fname%20%3B%0A%20%20%20osmt%3Aaddr%3Ahousenumber%20%3Fhousenumber%20%3B%0A%20%20%20osmt%3Aaddr%3Astreet%20%3Fstreet%20.%0A%20%20%20%23%20The%20following%20tells%20it%20to%20only%20get%20museums%20south%20of%20the%20Javits%20Center%0A%20%20%20%23%20FILTER%28geof%3Alatitude%28%3Floc%29%20%3C%2040.758289%29%0A%7D%0A%20%20%20%0A&#34;&gt;here&lt;/a&gt;. As I write this, it returns 32 results, and if you uncomment the filter condition to only get museums south of that latitude, it returns 17. That filter condition is just a taste of actually using geospatial data; the &lt;code&gt;osmm:loc&lt;/code&gt; value has a type of &lt;code&gt;http://www.opengis.net/ont/geosparql#wktLiteral&lt;/code&gt; and takes a form like &lt;code&gt;Point(-73.9900266 40.7187837)&lt;/code&gt;. As you can see, the filter uses the &lt;code&gt;geof:latitude()&lt;/code&gt; function to pull the latitude value out of the &lt;code&gt;Point&lt;/code&gt; value.&lt;/p&gt;
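&lt;p&gt;Outside of SPARQL, pulling the latitude out of one of those WKT literals is simple string surgery. A little sketch of my own in Python:&lt;/p&gt;

```python
# Sketch: extract the latitude from a WKT literal such as
# "Point(-73.9900266 40.7187837)". WKT points are written as
# (longitude latitude), so the latitude is the second number.
def wkt_point_latitude(wkt):
    inside = wkt[wkt.index("(") + 1 : wkt.index(")")]
    return float(inside.split()[1])
```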
&lt;p&gt;This is a very basic level of geospatial data use. A proper geospatial query for something like all the museums within a mile of the Museum of Modern Art is more complicated because of the effect of the earth&amp;rsquo;s curvature. Although OSM stores each entity&amp;rsquo;s latitude and longitude values, its query engine doesn&amp;rsquo;t support such queries. (The &lt;a href=&#34;https://wiki.openstreetmap.org/wiki/Sophox#How_OSM_data_is_stored&#34;&gt;How OSM Data is Stored&lt;/a&gt; documentation of their SPARQL endpoint is good if you want to explore their SPARQL endpoint more.)&lt;/p&gt;
&lt;p&gt;The ability to execute real geospatial queries typically comes from an add-in to most databases. For example, if you already use Oracle for your relational data, you pay extra for &lt;a href=&#34;https://www.oracle.com/database/technologies/spatialandgraph.html&#34;&gt;Oracle Spatial&lt;/a&gt;. If you&amp;rsquo;re using the open source &lt;a href=&#34;https://www.postgresql.org/&#34;&gt;PostgreSQL&lt;/a&gt; relational database, you get the open source &lt;a href=&#34;https://postgis.net/&#34;&gt;PostGIS&lt;/a&gt; add-in. Even little &lt;a href=&#34;https://sqlite.org/index.html&#34;&gt;SQLite&lt;/a&gt; has &lt;a href=&#34;https://www.gaia-gis.it/fossil/libspatialite/index&#34;&gt;SpatiaLite&lt;/a&gt;. (If you&amp;rsquo;re storing massive amounts of data using &lt;a href=&#34;https://accumulo.apache.org/&#34;&gt;Apache Accumulo&lt;/a&gt; on a Hadoop platform, the add-in would be the open source &lt;a href=&#34;https://www.geomesa.org/&#34;&gt;GeoMesa&lt;/a&gt; suite developed at my employer &lt;a href=&#34;http://www.ccri.com/&#34;&gt;CCRi&lt;/a&gt;. Being around this project has taught me a lot about the issues of geospatial processing.)&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://linkedgeodata.org/About&#34;&gt;LinkedGeoData.org&lt;/a&gt; project from the University of Leipzig&amp;rsquo;s Agile Knowledge Engineering and Semantic Web (AKSW) research group &amp;ldquo;uses the information collected by the OpenStreetMap project and makes it available as an RDF knowledge base according to the Linked Data principles&amp;rdquo;. It includes a &lt;a href=&#34;http://linkedgeodata.org/sparql&#34;&gt;SPARQL endpoint&lt;/a&gt;, but I could find no documentation or examples of geospatial extensions to SPARQL. The endpoint is currently up and running, but the &amp;ldquo;About/News&amp;rdquo; page shows no activity on the project since May of last year. (A query for resources with an &lt;code&gt;rdfs:label&lt;/code&gt; of &amp;ldquo;Grand Central Station&amp;rdquo; returned the URIs &lt;a href=&#34;http://linkedgeodata.org/triplify/node291087340&#34;&gt;http://linkedgeodata.org/triplify/node291087340&lt;/a&gt; and &lt;a href=&#34;http://linkedgeodata.org/triplify/way189853520&#34;&gt;http://linkedgeodata.org/triplify/way189853520&lt;/a&gt;, both of which returned &lt;a href=&#34;https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#5xx_Server_errors&#34;&gt;HTTP 500&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;A standardized extension for SPARQL called &lt;a href=&#34;https://en.wikipedia.org/wiki/OGC_GeoSPARQL&#34;&gt;GeoSPARQL&lt;/a&gt; specifies how to make queries about spatial information in which you can do things like specify criteria in terms of miles or kilometers, and a SPARQL engine that supports this standard will do the necessary trigonometry to give you the right answers. GeoSPARQL is sponsored by the &lt;a href=&#34;https://www.opengeospatial.org/&#34;&gt;Open Geospatial Consortium&lt;/a&gt;, who is also responsible for other popular geospatial standards such as the Web Feature Service and Web Map Service standards for REST API access to geospatial data. I have used both often at work. Looking at their &lt;a href=&#34;https://www.opengeospatial.org/docs/is&#34;&gt;standards&lt;/a&gt; page, I only just now learned that they are also the standards body behind &lt;a href=&#34;https://www.opengeospatial.org/standards/kml&#34;&gt;KML&lt;/a&gt;. Their &lt;a href=&#34;http://defs.opengis.net/elda-common/ogc-def/resource?uri=http://www.opengis.net/def/function/geosparql/&amp;amp;_format=html&#34;&gt;GeoSPARQL Functions&lt;/a&gt; page documents the extension functions. (I have co-workers who understand what the mathematical concept of a &amp;ldquo;convex hull&amp;rdquo; is; I have tried with little success.)&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://www.geosparql.org/&#34;&gt;geosparql.org website&lt;/a&gt; has some preloaded data where you can try GeoSPARQL queries. I wanted to explore the possibilities of using a geospatial SPARQL extension, ideally GeoSPARQL, with data that I could control. Because I &lt;a href=&#34;http://www.bobdc.com/blog/json2skos/&#34;&gt;just love&lt;/a&gt; converting triples from one namespace to another so that I can use new tools and standards with them, I hoped to get some OSM triples and convert them to the right namespaces to enable geospatial queries on them using a local triplestore. I decided that a simpler first step would be to pull down some triples from geosparql.org and load those into the local triplestore, because I already knew that those would work with standard GeoSPARQL queries.&lt;/p&gt;
&lt;p&gt;The two downloadable triplestores that I could find that claimed geospatial support were &lt;a href=&#34;https://www.blazegraph.com/&#34;&gt;Blazegraph&lt;/a&gt; and &lt;a href=&#34;http://parliament.semwebcentral.org/&#34;&gt;Parliament&lt;/a&gt;. (Blazegraph&amp;rsquo;s 2010 slides &amp;ldquo;Geospatial RDF Data&amp;rdquo; (&lt;a href=&#34;https://www.blazegraph.com/whitepapers/bigdata_geospatial.pdf&#34;&gt;pdf&lt;/a&gt;) provide a good introduction to issues of geospatial indexing.) I got the sample query to run against their sample data as described on their &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/GeoSpatial&#34;&gt;Querying Geospatial Data&lt;/a&gt; page, but I had no luck when I tried to modify it to work with data that I had loaded into it. The &lt;code&gt;geo:predicate&lt;/code&gt; triple in their sample query seems to be necessary, but I wasn&amp;rsquo;t querying for both location and time like their example does, and although I tried different objects for a triple using this property I couldn&amp;rsquo;t get it to work and gave up. (Since Amazon Neptune&amp;rsquo;s acqui-hire of most if not all of the Blazegraph staff, it doesn&amp;rsquo;t seem to be under active development anyway.)&lt;/p&gt;
&lt;p&gt;Parliament comes from Raytheon subsidiary BBN, a company with a &lt;a href=&#34;https://en.wikipedia.org/wiki/BBN_Technologies&#34;&gt;long history&lt;/a&gt; in important computer technology. This triplestore  promised not just geospatial support but support for the GeoSPARQL standard. I got Parliament up and running locally and found a localhost page about indexes that showed that the data I was using did not have a geospatial index, and I saw no way to create one. Their five-year-old User Guide (&lt;a href=&#34;http://parliament.semwebcentral.org/ParliamentUserGuide.pdf&#34;&gt;PDF&lt;/a&gt;) has a &amp;ldquo;Configuring Indexes&amp;rdquo; section consisting of the four words &amp;ldquo;Yet to be written&amp;rdquo;. I gave up on Parliament after some LinkedIn searches showed that the main people attached to the project are no longer at BBN.&lt;/p&gt;
&lt;p&gt;In the middle of all this research I learned some great news: &lt;a href=&#34;https://jena.apache.org/&#34;&gt;Apache Jena&lt;/a&gt; had &lt;a href=&#34;https://jena.apache.org/documentation/query/spatial-query.html&#34;&gt;some geospatial support&lt;/a&gt; that required the use of Lucene or Solr, the use of a custom querying vocabulary, and a lot of manual index configuration, but they are now &lt;a href=&#34;https://github.com/galbiston/geosparql-jena&#34;&gt;ramping up code development&lt;/a&gt; on direct support for the GeoSPARQL standard. That&amp;rsquo;s why this blog entry has a subtitle of &amp;ldquo;Part 1&amp;rdquo;, and I look forward to trying out GeoSPARQL in a locally running copy of Jena&amp;rsquo;s Fuseki server and then writing Part 2. (I&amp;rsquo;m going to be patient as I wait for it to be included in the binary release of Fuseki—or to put it another way, I&amp;rsquo;m too lazy to set up the environment to build it from the current &lt;a href=&#34;https://gitbox.apache.org/repos/asf?p=jena.git&#34;&gt;source&lt;/a&gt;.) And, once a &lt;a href=&#34;https://www.w3.org/community/sparql-12/&#34;&gt;SPARQL 1.2&lt;/a&gt; Recommendation gets closer and I update my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt;, I thought it would be a good idea to cover GeoSPARQL, so I&amp;rsquo;ll be happy to see support for SPARQL&amp;rsquo;s standardized geospatial extension in the triplestore that is already used in many of the book&amp;rsquo;s examples.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/geosparql">GeoSPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/gis">gis</category>
      
    </item>
    
    <item>
      <title>Converting JSON-LD schema.org RDF to other vocabularies</title>
      <link>https://www.bobdc.com/blog/json2skos/</link>
      <pubDate>Sun, 12 May 2019 11:20:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/json2skos/</guid>
      
      
<description><div>So that we can use tools designed around those vocabularies.</div><div>
&lt;blockquote class=&#34;pullquote&#34;&gt;Once you&#39;ve got data in any standardized RDF syntax, you can convert it to use whatever namespaces you want.&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&#34;http://www.bobdc.com/blog/json-ld/&#34;&gt;Last month&lt;/a&gt; I wrote about how we can treat the growing amount of JSON-LD in the world as RDF. By &amp;ldquo;treat&amp;rdquo; I mean &amp;ldquo;query it with SPARQL and use it with the wide choice of RDF application development tools out there&amp;rdquo;. While I did demonstrate that JSON-LD does just fine with URIs from outside of the  &lt;a href=&#34;https://schema.org/&#34;&gt;schema.org&lt;/a&gt; vocabulary, the vast majority of JSON-LD out there uses schema.org.&lt;/p&gt;
&lt;p&gt;Some people fret about the &amp;ldquo;one schema to rule them all&amp;rdquo; approach. I don&amp;rsquo;t worry so much because one of the great things about RDF is that once you&amp;rsquo;ve got data in any standardized RDF syntax, you can convert it to use whatever namespaces you want. Today we&amp;rsquo;ll see how I did this so that I could load JSON-LD schema.org metadata from my blog into a SKOS visualization tool.&lt;/p&gt;
&lt;p&gt;I also mentioned last month that the Hugo platform that I recently started using for my blog, in its default configuration, automatically generates JSON-LD metadata about my blog entries. The old Movable Type platform that I formerly used let me assign categories and tags to the entries, so when I migrated the old entries I brought those along.&lt;/p&gt;
&lt;p&gt;Here is an excerpt of some metadata from one of my blog entries after I converted it from JSON-LD to Turtle:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[ a                         schema:BlogPosting ;
  schema:author             &amp;#34;Bob DuCharme&amp;#34; ;
  schema:datePublished      &amp;#34;2019-02-24 10:45:30 -0500 EST&amp;#34;^^schema:Date ;
  schema:description        &amp;#34;A quick reference.&amp;#34; ;
  schema:headline           &amp;#34;curling SPARQL&amp;#34; ;
  schema:inLanguage         &amp;#34;en&amp;#34; ;
  schema:keywords           &amp;#34;SPARQL&amp;#34; , &amp;#34;curl&amp;#34; , &amp;#34;Blog&amp;#34; ;
  schema:name               &amp;#34;curling SPARQL&amp;#34; ;
  schema:url                &amp;lt;http://www.bobdc.com/blog/curling-sparql/&amp;gt;
] .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I wanted to convert that to RDF that met three conditions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Instead of using a blank node as the subject, use the &lt;code&gt;schema:url&lt;/code&gt; value included with the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Define SKOS concepts for each &lt;code&gt;schema:keywords&lt;/code&gt; value that the metadata uses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use &lt;a href=&#34;https://en.wikipedia.org/wiki/Dublin_Core&#34;&gt;Dublin Core&lt;/a&gt; properties to connect as much of the metadata as possible, including the SKOS concepts, to the posting.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This SPARQL query made this all quite straightforward:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# convertTriples.rq

PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc:  &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;

CONSTRUCT {
  ?url   dc:title ?name ;
         dc:creator ?author ;
         dc:description ?description ;
         dc:subject ?kwURI . 
  ?kwURI a skos:Concept ;
         skos:prefLabel ?keyword . 
}
WHERE {
  ?entry schema:url ?url ;
         schema:name ?name ;
         schema:author ?author ;
         schema:description ?description .
  OPTIONAL {
    ?entry schema:keywords ?keyword .
    BIND(URI(concat(&amp;#34;http://bobdc.com/tags/&amp;#34;,?keyword))
         AS ?kwURI)
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The WHERE clause, as always, grabs the needed values. Instead of assuming that every blog entry has keywords assigned, I put the part that handles those inside of an OPTIONAL clause.&lt;/p&gt;
&lt;p&gt;That part also creates a URI from each &lt;code&gt;schema:keywords&lt;/code&gt; value to be the identity for the SKOS concept built from that keyword. To do that, I originally concatenated the values onto the base URI &lt;code&gt;http://bobdc.com/blog/kwords/&lt;/code&gt; that I just made up, but when I noticed that Hugo creates pages for each keyword at &lt;code&gt;http://www.bobdc.com/tags/&lt;/code&gt; I realized something nice: instead of a base URI that I made up from scratch, using the Hugo-generated one would give me dereferenceable URIs for the SKOS concepts. (For example, my earlier blog entries about DBpedia are assigned the keyword &amp;ldquo;dbpedia&amp;rdquo;, which becomes the URI &lt;a href=&#34;http://www.bobdc.com/tags/dbpedia&#34;&gt;http://www.bobdc.com/tags/dbpedia&lt;/a&gt;  of a concept about DBpedia, and you can click that URI to see a list of those blog entries.) So I used that as the base URI when creating the URI for each new SKOS concept.&lt;/p&gt;
&lt;p&gt;The CONSTRUCT clause follows through on the tasks listed in my bulleted list above.&lt;/p&gt;
&lt;p&gt;SKOS is typically used to arrange topics into hierarchies, so that if for example your SKOS vocabulary says that &amp;ldquo;collie&amp;rdquo; has a broader value of &amp;ldquo;dog&amp;rdquo; and you&amp;rsquo;re looking for articles about dogs, you can retrieve all the ones tagged with &amp;ldquo;dog&amp;rdquo; or with any of the values in the SKOS subtree below &amp;ldquo;dog&amp;rdquo; such as &amp;ldquo;collie&amp;rdquo;. After running the query above I had a list of concepts with no hierarchy, so I created one. Of course there are GUI tools that let you click and drag to turn such a list into a hierarchy; the use of these tools  is one of the reasons for converting the schema.org keyword metadata into SKOS metadata. Instead of using one of these tools, though, I found it simpler to just type out a text file with lines like this,&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;XSLT broader XML 
XBRL broader XML 
mysql broader SQL 
audio broader music 
bass broader music 
D2RQ broader RDF 
RDFa broader RDF 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;and then, after typing in some namespace declarations at the very top, doing a few global replacements to turn it into this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;bt:XSLT skos:broader bt:XML . 
bt:XBRL skos:broader bt:XML . 
bt:mysql skos:broader bt:SQL . 
bt:audio skos:broader bt:music . 
bt:bass skos:broader bt:music . 
bt:D2RQ skos:broader bt:RDF . 
bt:RDFa skos:broader bt:RDF . 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Useful data modeling can sometimes be simple.&lt;/p&gt;
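Those global replacements are simple enough to script. Here is a minimal Python sketch of the same transformation; the prefix declarations and the function name are illustrative assumptions, not part of the original workflow:

```python
# Convert lines like "XSLT broader XML" into Turtle triples like
# "bt:XSLT skos:broader bt:XML ." -- a sketch of the global replacements
# described above. The prefix declarations here are assumptions.
def to_turtle(lines):
    header = [
        "@prefix bt:   <http://www.bobdc.com/tags/> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
    ]
    triples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        child, rel, parent = line.split()
        triples.append(f"bt:{child} skos:{rel} bt:{parent} .")
    return "\n".join(header + triples)

print(to_turtle(["XSLT broader XML", "mysql broader SQL"]))
```

A text editor's global replace does the same job, of course; a script just makes the step repeatable when the keyword list grows.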
&lt;p&gt;I added these triples to the result of the CONSTRUCT query above and loaded the resulting SKOS into the wonderful &lt;a href=&#34;http://labs.sparna.fr/skos-play/&#34;&gt;SKOS Play!&lt;/a&gt; site&amp;rsquo;s visualizer. (&lt;a href=&#34;https://en.wikipedia.org/wiki/Cosplay&#34;&gt;Pun&lt;/a&gt; intended?) My not-very-controlled-vocabulary had a lot of orphan elements, so to make a nicer visualization I used the following query to pull,  from the result of the earlier CONSTRUCT query, only SKOS concepts taking part in some hierarchy:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# getChildrenAndParents.rq

PREFIX skos:  &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;

CONSTRUCT {
  ?child ?childP ?childO .
  ?parent ?parentP ?parentO .
}
WHERE {
  ?child skos:broader ?parent .
  ?child ?childP ?childO .
  ?parent ?parentP ?parentO .
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Nothing OPTIONAL in that query!&lt;/p&gt;
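The payoff of that broader/narrower hierarchy, retrieving items tagged with a concept or with anything below it in the tree, can be sketched in a few lines of Python. The pairs here are hypothetical sample data, not my real tag hierarchy:

```python
# Given (child, parent) skos:broader pairs, collect a concept plus all
# of its descendants -- i.e. everything a search on the broader term
# should also match. The pairs are hypothetical sample data.
from collections import defaultdict

def concept_and_narrower(pairs, root):
    children = defaultdict(list)
    for child, parent in pairs:
        children[parent].append(child)
    result, stack = set(), [root]
    while stack:
        concept = stack.pop()
        if concept not in result:
            result.add(concept)
            stack.extend(children[concept])
    return result

pairs = [("collie", "dog"), ("beagle", "dog"), ("dog", "mammal")]
print(sorted(concept_and_narrower(pairs, "dog")))
# A search for "dog" also matches items tagged "collie" or "beagle".
```

In SPARQL the same expansion is usually done with a property path such as `skos:broader*`, which is what makes the hierarchy worth building in the first place.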
&lt;p&gt;With the results of the first &lt;code&gt;convertTriples.rq&lt;/code&gt; CONSTRUCT query in a file called &lt;code&gt;convertedTriples.ttl&lt;/code&gt; and the additional &lt;code&gt;skos:broader&lt;/code&gt; triples in the &lt;code&gt;additionalModeling.ttl&lt;/code&gt; file, I had the  &lt;a href=&#34;https://jena.apache.org/documentation/query/cmds.html&#34;&gt;Jena arq command line tool&lt;/a&gt; run this new query on the combined data to create something to load into SKOS Play:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;arq --query getChildrenAndParents.rq --data convertedTriples.ttl --data additionalModeling.ttl &amp;gt; conceptTrees.ttl
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;arq&amp;rsquo;s ability to accept multiple &lt;code&gt;--data&lt;/code&gt; arguments (potentially each using different RDF syntaxes!) can be very handy sometimes.&lt;/p&gt;
&lt;p&gt;On the SKOS Play &lt;a href=&#34;http://labs.sparna.fr/skos-play/upload&#34;&gt;Play&lt;/a&gt; page, I used the local file option to upload the &lt;code&gt;conceptTrees.ttl&lt;/code&gt; file created by the arq command line shown above. (The page includes some options that look fun to play with: Infer on subclasses and subproperties, Handle SKOS-XL properties, and Transform an OWL ontology to SKOS.)&lt;/p&gt;
&lt;p&gt;When I clicked the page&amp;rsquo;s orange Next button, the site parsed my uploaded file, told me how many concepts it found, and offered some options for how to display it. I went with the default Visualize option of Tree Visualization, which displayed the &lt;code&gt;skosplay:allData&lt;/code&gt; node you see on the left below and the first row of nodes to the right of that. Clicking blue nodes displays their child nodes, and you can see the result after I clicked the RDF and XML ones.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/skosplay1.png&#34; alt=&#34;SKOS Play image&#34; /&gt;&lt;/p&gt;
&lt;p&gt;SKOS concepts use URIs as their identity and &lt;code&gt;skos:prefLabel&lt;/code&gt; values to show human-readable values in as many languages as you like. You can see that the SKOS Play diagram uses &lt;code&gt;skos:prefLabel&lt;/code&gt; values when available, and the full URLs at the top of my diagram show that a few concepts still need &lt;code&gt;skos:prefLabel&lt;/code&gt; values. (The &lt;code&gt;convertTriples.rq&lt;/code&gt; query created them for most concepts.) It&amp;rsquo;s a nice example of how such tools can help us identify ways to improve our data, but of course a query for concepts that lack &lt;code&gt;skos:prefLabel&lt;/code&gt; values would be easy enough.&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t even do anything with the triples that I converted to use the Dublin Core vocabulary, but as a long-standing popular standard, there are plenty of tools out there that can work with it. They can help to make an even better case that if schema.org JSON-LD triples don&amp;rsquo;t conform to the vocabulary that you want to use, just convert them!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/json">JSON</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Exploring JSON-LD</title>
      <link>https://www.bobdc.com/blog/json-ld/</link>
      <pubDate>Sun, 21 Apr 2019 11:20:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/json-ld/</guid>
      
      
      <description><div>And of course, querying it with SPARQL.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/json-ld-data.png&#34; width=&#34;200px&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;JSON-LD logo&#34;/&gt;
&lt;p&gt;I paid little attention to JSON-LD until recently. I just thought of it as another RDF serialization format that, because it&amp;rsquo;s valid JSON, had more appeal to people normally uninterested in RDF. Dan Brickley&amp;rsquo;s December tweet that &amp;ldquo;&lt;a href=&#34;https://twitter.com/danbri/status/1078760462723022849&#34;&gt;JSON-LD is much more widely used than Turtle&lt;/a&gt;&amp;rdquo; inspired me to look a little harder at the JSON-LD ecosystem, and I found a lot of great things. To summarize: the amount of JSON-LD data out there is exploding, and we can query it with SPARQL, so it offers many new possibilities for RDF-based applications.&lt;/p&gt;
&lt;h2 id=&#34;json-ld-structure&#34;&gt;JSON-LD structure&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://json-ld.org/primer/latest/&#34;&gt;primer on the json-ld.org site&lt;/a&gt; is a good way to get a quick introduction to the syntax. The W3C&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/2013/dwbp/wiki/RDF_AND_JSON-LD_UseCases&#34;&gt;RDF AND JSON-LD UseCases&lt;/a&gt; document has a  &lt;a href=&#34;https://www.w3.org/2013/dwbp/wiki/RDF_AND_JSON-LD_UseCases#Differences_with_RDF&#34;&gt;Differences with RDF&lt;/a&gt; section that provides a nice summary for people coming to JSON-LD from the RDF world.&lt;/p&gt;
&lt;p&gt;To get to know the JSON-LD syntax, I created a Turtle file with examples of some trickier RDF features and then converted it to JSON-LD to see what it looked like. My Turtle:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;    @prefix ab:   &amp;lt;http://learningsparql.com/ns/sample#&amp;gt; .
    @prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
    @prefix v:    &amp;lt;http://www.w3.org/2006/vcard/&amp;gt; .
    @prefix dc:   &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; .
    
    # Sample comment: I wish I could get Hugo to do syntax highlighting of Turtle!
    
    ab:i432 ab:firstName     &amp;#34;Richard&amp;#34; ;
            ab:lastName      &amp;#34;Mutt&amp;#34; ;
            ab:startYear     2013 ;
            ab:officer       true ;
            ab:reportsTo     ab:i193 ;
            ab:linkedIn      &amp;lt;https://www.linkedin.com/in/rmutt&amp;gt; ;
            ab:address       _:b1 .
    
    _:b1    ab:city          &amp;#34;Springfield&amp;#34; ;
            ab:streetAddress &amp;#34;32 Main St.&amp;#34; .
    
    ab:i193 ab:firstName     &amp;#34;Joan&amp;#34; ;
            ab:lastName      &amp;#34;Jones&amp;#34; ;
            v:title &amp;#34;Director&amp;#34;@en ; 
            v:title &amp;#34;Directeur&amp;#34;@fr .
            
    &amp;lt;urn:isbn:123456789X&amp;gt; dc:creator ab:i193 ;
            dc:title &amp;#34;Chicken Soup for the JSON-LD Soul&amp;#34; . 
    
    ab:firstName rdfs:label  &amp;#34;first name&amp;#34; .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It describes employee Richard Mutt with values of several different types, including an object property to identify his boss and a blank node to hold together the details of his address. Triples about his boss list her job title in both English and French; they also show her as the author of a book whose name is specified with a &amp;ldquo;title&amp;rdquo; property in a different namespace from the property identifying her job title.&lt;/p&gt;
&lt;p&gt;The Jena command line utilities that I currently have installed don&amp;rsquo;t write JSON-LD, although as we&amp;rsquo;ll see they can read it. (2021 update: they write it now as well, for example with &lt;a href=&#34;http://www.bobdc.com/blog/jenagems/#riot&#34;&gt;riot&lt;/a&gt; &lt;code&gt;--syntax=jsonld&lt;/code&gt;.) So I used the &lt;a href=&#34;http://www.easyrdf.org/converter&#34;&gt;easyrdf.org&lt;/a&gt; website to convert the Turtle sample above to JSON-LD. I&amp;rsquo;m tempted to include a screen shot of the result: it was a dense mass without a single carriage return, showing that the &lt;a href=&#34;https://json-ld.org/&#34;&gt;JSON-LD&lt;/a&gt; home page&amp;rsquo;s assertion that JSON-LD is &amp;ldquo;easy for humans to read and write&amp;rdquo; should be qualified with &amp;ldquo;if you add carriage returns and indenting in all the right places&amp;rdquo;. Of course, just about any programming or markup language is easy for humans to read and write if you add white space in all the right places, so this does not make JSON-LD special. (I do find it amusing when a set of software developers generalizes from themselves to their entire species.)&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://stedolan.github.io/jq/&#34;&gt;jq&lt;/a&gt; utility nicely converted the easyrdf output into something easier for humans to read. Here is the result:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;_:b0&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#city&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Springfield&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#streetAddress&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;32 Main St.&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#firstName&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://www.w3.org/2000/01/rdf-schema#label&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;first name&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#i193&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#firstName&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Joan&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#lastName&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Jones&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://www.w3.org/2006/vcard/title&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Director&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@language&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;en&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Directeur&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@language&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;fr&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#i432&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#firstName&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Richard&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#lastName&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Mutt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#startYear&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;2013&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#officer&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#66d9ef&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#reportsTo&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#i193&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#linkedIn&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;https://www.linkedin.com/in/rmutt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#address&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;_:b0&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;https://www.linkedin.com/in/rmutt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;urn:isbn:123456789X&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://purl.org/dc/elements/1.1/creator&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;http://learningsparql.com/ns/sample#i193&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;http://purl.org/dc/elements/1.1/title&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;@value&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Chicken Soup for the JSON-LD Soul&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Except for JSON&amp;rsquo;s inability to store comments, the converted version shows that JSON-LD managed to represent all the tricky RDF bits that I included in the input.&lt;/p&gt;
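If jq isn't handy, Python's standard library can do the same reformatting of a dense single-line JSON-LD dump. This sketch uses an abbreviated stand-in for the easyrdf output:

```python
# Pretty-print minified JSON (such as the single-line easyrdf output)
# the way jq does, using only the standard library.
import json

# An abbreviated stand-in for the dense easyrdf conversion result.
dense = '[{"@id":"_:b0","http://learningsparql.com/ns/sample#city":[{"@value":"Springfield"}]}]'
pretty = json.dumps(json.loads(dense), indent=2)
print(pretty)
```

The command-line equivalent is `python -m json.tool`, which reads from standard input just like jq does.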
&lt;p&gt;With the &lt;a href=&#34;https://jena.apache.org/documentation/query/cmds.html&#34;&gt;Jena arq command line tool&lt;/a&gt; I successfully executed the following SPARQL query against the JSON-LD data above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;CONSTRUCT { ?s ?p ?o } WHERE
{ ?s ?p ?o }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The query simply asks for all the triples. My arq command line asked for output in the default format of Turtle, and it worked fine.&lt;/p&gt;
&lt;p&gt;There are two bits of big news here for RDF people evaluating JSON-LD:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I round-tripped some fairly complex RDF in and out of JSON-LD with no loss of anything but the comment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I performed a SPARQL query on JSON-LD. This demonstrates that the exploding amount of JSON-LD out there is available for use in RDF applications.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;sparql-queries-of-public-json-ld&#34;&gt;SPARQL queries of public JSON-LD&lt;/h2&gt;
&lt;p&gt;Next I queried some real-world data. The &lt;a href=&#34;https://www.overstock.com/&#34;&gt;overstock.com&lt;/a&gt; website has rich JSON-LD data about all of its products and even includes some nice JSON-LD in its search results pages. After searching the site for &amp;ldquo;headphones&amp;rdquo; and pulling the JSON-LD from the &lt;a href=&#34;https://www.overstock.com/headphones,/k,/results.html&#34;&gt;first page of search results&lt;/a&gt;, I wrote a script to pull the JSON-LD for the 60 or so products listed there.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.bobdc.com/miscfiles/overstockComHeadphones.ttl&#34;&gt;aggregated data&lt;/a&gt; has 8,808 triples with 27 different predicates. If you do a View Source on the web page of a &lt;a href=&#34;https://www.overstock.com/Electronics/Mpow-Jaws-Wireless-Bluetooth-4.1-Stereo-Headset-Universal-Headphone-with-Hands-Free-Calling-for-iPhone-Other-Bluetooth-Devices/14783413/product.html?refccid=C6P6EYFJ6I62ZNKNRHZHC2PPEY&amp;amp;searchidx=56&#34;&gt;typical entry&lt;/a&gt; from the headphones list (search for &amp;ldquo;ld+json&amp;rdquo;) you&amp;rsquo;ll see that its JSON-LD provides more than just a product name and images—it includes a full paragraph of description, pricing, reviews, availability, and more.&lt;/p&gt;
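A script for that job can be short. This is a hedged sketch using only Python's standard library; the sample HTML is a made-up stand-in for an actual product page:

```python
# Pull the contents of <script type="application/ld+json"> elements out
# of an HTML page, using only the standard library.
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.blocks.append(json.loads(data))

# Made-up stand-in for a real product page.
html = ('<html><script type="application/ld+json">'
        '{"@type": "Product", "name": "Headphones"}</script></html>')
parser = JSONLDExtractor()
parser.feed(html)
print(parser.blocks[0]["name"])
```

Real pages often hold several `ld+json` blocks, which is why the sketch collects a list rather than stopping at the first one.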
&lt;p&gt;The following query of that data requests the price and name (but not description) of any headphones under $30 that include &amp;ldquo;Bluetooth&amp;rdquo; in their description:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; 
PREFIX s:   &amp;lt;http://schema.org/&amp;gt; 

SELECT ?price ?name WHERE {
   ?i a s:Product ;
   s:name ?name ;
   s:offers ?offer ; 
   s:description ?description .
  
   ?offer s:price ?price .
   FILTER(contains(?description,&amp;#34;Bluetooth&amp;#34;)) 
   FILTER(xsd:decimal(?price) &amp;lt; 30) 
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here is the result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--------------------------------------------------------------------------------------------------------------------------------------------
| price   | name                                                                                                                           |
============================================================================================================================================
| &amp;#34;16.49&amp;#34; | &amp;#34;BL1 Mini Bluetooth Monaural Headphone Stereo Wireless Stealth Business Wireless Bluetooth 4.1 Headphones&amp;#34;                     |
| &amp;#34;11.24&amp;#34; | &amp;#34;Mini Wireless Bluetooth 4.0 Stereo In-Ear Headset (Black)&amp;#34;                                                                    |
| &amp;#34;13.49&amp;#34; | &amp;#34;Mpow EM 13 Mini Wireless Earbud, Bluetooth V4.1 Invisible Earphone&amp;#34;                                                           |
| &amp;#34;21.99&amp;#34; | &amp;#34;X18 Wireless Bluetooth Earbuds Headphones Stereo Sound Built-in 6.0 Noise Cancelling Mic&amp;#34;                                     |
| &amp;#34;20.99&amp;#34; | &amp;#34;Mpow Bluetooth Headphones V4.1 Wireless Sport Headphones Noise Cancelling In-ear Stereo Earbuds 8-hour Playing Time with Mic&amp;#34; |
--------------------------------------------------------------------------------------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Of course, all of the properties use the schema.org vocabulary, but RDFS gives us ways to map this data to other, more specialized vocabularies. I&amp;rsquo;ll show some of that next time; meanwhile, the casting of the price above from a string to a decimal value is one taste of how SPARQL can turn the data into something more useful.&lt;/p&gt;
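The same cast-then-filter logic can be mirrored outside of SPARQL. This Python sketch works over hypothetical schema.org-style records rather than the real Overstock data:

```python
# Mirror the SPARQL query's logic: keep products whose description
# mentions "Bluetooth" and whose string-typed price, cast to a decimal,
# is under 30. The records below are hypothetical sample data.
from decimal import Decimal

products = [
    {"name": "Mini Wireless Bluetooth 4.0 Stereo In-Ear Headset",
     "description": "Bluetooth headset", "price": "11.24"},
    {"name": "Studio Headphones",
     "description": "Wired studio cans", "price": "89.99"},
]

matches = [p["name"] for p in products
           if "Bluetooth" in p["description"]
           and Decimal(p["price"]) < 30]
print(matches)
```

Using Decimal rather than float matches the spirit of the `xsd:decimal` cast: prices compare exactly rather than through binary floating point.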
&lt;h2 id=&#34;a-bright-future&#34;&gt;A bright future&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s been a pleasant surprise to see how many different sites include JSON-LD these days. The Hugo website generation framework that I wrote about migrating to &lt;a href=&#34;http://www.bobdc.com/blog/changing-my-blogs-domain-name/&#34;&gt;last month&lt;/a&gt; adds JSON-LD metadata by default, so my new blog website had JSON-LD before I even knew it did. I&amp;rsquo;ve also been surprised by how popular JSON-LD is with the search engine optimization crowd—a &lt;a href=&#34;https://www.google.com/search?q=%22json-ld%22+seo&amp;amp;oq=%22json-ld%22+seo&#34;&gt;Google search&lt;/a&gt; for &lt;code&gt;JSON-LD SEO&lt;/code&gt; gets over 200,000 hits, and many don&amp;rsquo;t even mention RDF. They just see it as a way to add metadata that Google&amp;rsquo;s crawlers are more likely to notice.&lt;/p&gt;
&lt;p&gt;While I&amp;rsquo;m currently only interested in JSON-LD as a growing source of data that I can query with SPARQL, there are some interesting things happening with the syntax and structure of JSON-LD itself. &lt;a href=&#34;https://twitter.com/gkellogg&#34;&gt;Gregg Kellogg&lt;/a&gt;&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/Data/events/data-ws-2019/assets/lightning/GreggKellogg.pptx&#34;&gt;JSON-LD 1.1 Update&lt;/a&gt; gives a nice overview of the additions to JSON-LD that are being considered. I certainly plan to play with it more.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/json">JSON</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/json-ld">json-ld</category>
      
    </item>
    
    <item>
      <title>Changing my blog&#39;s domain name and platform</title>
      <link>https://www.bobdc.com/blog/changing-my-blogs-domain-name/</link>
      <pubDate>Sun, 24 Mar 2019 09:00:48 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/changing-my-blogs-domain-name/</guid>
      
      
      <description><div>New look, new domain name.</div><div>&lt;img id=&#34;idm45478314451696&#34; src=&#34;https://www.bobdc.com/img/main/hugologo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Hugo logo&#34;/&gt;
&lt;p&gt;For too long I&amp;rsquo;ve postponed the migration of my blog to something more phone-friendly. I accumulated many notes about doing this, and I also wanted to move more of my online life from the snee.com domain to bobdc.com. When someone recently asked me about changing the stylesheet (I have dug and dug in the aforementioned notes but can&amp;rsquo;t remember who and will add their name here if I ever find it) I thought I&amp;rsquo;d take a deep breath and follow through with this. This is the last new blog entry you&amp;rsquo;ll see on the snee.com domain; you&amp;rsquo;ll also find it at &lt;a href=&#34;http://www.bobdc.com/blog&#34;&gt;bobdc.com/blog&lt;/a&gt; along with converted versions of all my other blog entries since I started on snee.com/bobdc.blog in 2005. I will continue my blog on bobdc.com/blog after this entry.&lt;/p&gt;
&lt;p&gt;The conversion of the old entries was most of the work, but with some Perl and XSLT and &lt;a href=&#34;https://pandoc.org/&#34;&gt;pandoc&lt;/a&gt; and spit and duct tape I got the legacy content into pretty good shape for the new platform.&lt;/p&gt;
&lt;p&gt;Of course, the platform choice was a geeky thing to agonize over. I finally went with &lt;a href=&#34;https://gohugo.io/&#34;&gt;Hugo&lt;/a&gt;, a Go-based static site generator. (I never had to learn the &lt;a href=&#34;https://golang.org/&#34;&gt;Go&lt;/a&gt; programming language, but it looks cool enough.)&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a bit scary to think of the high percentage of the world&amp;rsquo;s blog entries that are created by data entry into web forms that then use a bunch of PHP to manage that content&amp;rsquo;s storage in relational databases. Having spent much of my career helping people store non-tabular content in standards-based non-tabular storage tools, I definitely wanted to get away from using PHP and relational database managers for narrative content, so I researched various static site generators before settling on Hugo.&lt;/p&gt;
&lt;p&gt;Simple web sites like my &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;learningsparql.com&lt;/a&gt; and &lt;a href=&#34;http://www.datascienceglossary.org&#34;&gt;datascienceglossary.org&lt;/a&gt; sites are just plain static sites: HTML files that I edit as necessary. A static site generator lets you store content separate from the styling and then generates HTML for your site based on the combination. If you want to change your website&amp;rsquo;s layout or styling, you edit the CSS or whatever and then regenerate the HTML. (The version of MovableType that I used on snee.com actually did static site generation, but all the styling was managed with a mess of old PHP. I haven&amp;rsquo;t upgraded it in ten years because the &lt;a href=&#34;https://www.bobdc.com/blog/upgrading-to-movable-type-4&#34;&gt;last time I did&lt;/a&gt; it broke so much.) A selling point of Hugo is that it does this very quickly&amp;ndash;or, to use the now-clichéd phrase that they prefer, &amp;ldquo;&lt;a href=&#34;https://www.google.com/search?q=%22blazingly+fast%22+hugo&#34;&gt;blazingly fast&lt;/a&gt;&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;I knew about &lt;a href=&#34;https://jekyllrb.com/&#34;&gt;Jekyll&lt;/a&gt; and &lt;a href=&#34;http://www.sphinx-doc.org/en/master/&#34;&gt;Sphinx&lt;/a&gt; from work because both are used for &lt;a href=&#34;https://www.geomesa.org/&#34;&gt;geomesa.org&lt;/a&gt;. After researching alternatives I decided that I liked the available Hugo &lt;a href=&#34;https://themes.gohugo.io/&#34;&gt;themes&lt;/a&gt; the most. The Hugo documentation isn&amp;rsquo;t very good, but the people on the &lt;a href=&#34;https://discourse.gohugo.io/&#34;&gt;discussion forum&lt;/a&gt; are very helpful, sometimes answering within minutes. If there is any interest I may write a blog entry about the important Hugo techniques I had to track down to customize my blog because they were not written up in an easily findable place.&lt;/p&gt;
&lt;p&gt;You store your Hugo content separately from the styling using Hugo&amp;rsquo;s &lt;a href=&#34;https://gohugo.io/content-management/formats/&#34;&gt;own variation&lt;/a&gt; of &lt;a href=&#34;https://en.wikipedia.org/wiki/Markdown&#34;&gt;markdown&lt;/a&gt;. As a longstanding XML guy ever since it was a &lt;a href=&#34;https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language&#34;&gt;four-letter word&lt;/a&gt;, I have ranted about what&amp;rsquo;s wrong with markdown&amp;ndash;or, as I should say, &amp;ldquo;the markdowns&amp;rdquo;&amp;ndash; but it works for what I want to do in my blog and you can embed just about any sensible HTML you want in places where markdown falls short. I would have preferred a static site generator where the content I wrote for each new blog entry conformed to some simple XHTML profile but I just couldn&amp;rsquo;t find anything with good themes and the right level of automation.&lt;/p&gt;
&lt;p&gt;In the lower-right of my &lt;a href=&#34;http://snee.com/bobdc.blog/&#34;&gt;snee.com blog&lt;/a&gt; you&amp;rsquo;ll see four variations on Atom and RSS feeds. Offering more than one Atom or RSS feed seems to be difficult in Hugo, so my new blog&amp;rsquo;s &lt;a href=&#34;http://www.bobdc.com/blog/atom.xml&#34;&gt;Atom feed&lt;/a&gt; has summaries and links to the original postings and the new blog&amp;rsquo;s &lt;a href=&#34;http://www.bobdc.com/blog/index.xml&#34;&gt;RSS&lt;/a&gt; feed has the full entries. I will be setting the snee.com ones to redirect to the bobdc.com ones shortly, but you can just subscribe to the new ones now if you like.&lt;/p&gt;
&lt;p&gt;So, I apologize for the lack of phone-friendliness of my blog for the last few years and hope you enjoy the new more &lt;a href=&#34;https://en.wikipedia.org/wiki/Responsive_web_design&#34;&gt;responsive&lt;/a&gt; version of my blog.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/blogging-about-blogging">blogging about blogging</category>
      
    </item>
    
    <item>
      <title>curling SPARQL</title>
      <link>https://www.bobdc.com/blog/curling-sparql/</link>
      <pubDate>Sun, 24 Feb 2019 10:45:30 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/curling-sparql/</guid>
      
      
      <description><div>A quick reference.</div><div>&lt;p&gt;I&amp;rsquo;ve been using the &lt;a href=&#34;https://curl.haxx.se/&#34;&gt;curl&lt;/a&gt; utility to retrieve data from SPARQL endpoints for years, but I still have trouble remembering some of the important syntax, so I jotted down a quick reference for myself and I thought I&amp;rsquo;d share it. I also added some background.&lt;/p&gt;
&lt;h2 id=&#34;idm45504284143088&#34;&gt;Quick reference&lt;/h2&gt;
&lt;p&gt;Submit a URL-encoded SPARQL query on the operating system command line to the endpoint &lt;code&gt;http://edan.si.edu/saam/sparql&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl &amp;quot;http://edan.si.edu/saam/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Quoting the URL isn&amp;rsquo;t always necessary, but won&amp;rsquo;t hurt. Omitting it may hurt if some of the characters mean something special to your operating system&amp;rsquo;s command line interpreter.)&lt;/p&gt;
&lt;p&gt;Submit the same query stored in the file query1.rq:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl --data-urlencode &amp;quot;query@query1.rq&amp;quot; http://edan.si.edu/saam/sparql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is no need to escape the query in the file, because the &lt;code&gt;--data-urlencode&lt;/code&gt; parameter tells curl to do so.&lt;/p&gt;
&lt;p&gt;The above queries return the data in whatever format the endpoint&amp;rsquo;s system administrators chose as the default. You can pass a request header to specify that you want a particular format. The following requests comma-separated values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept: text/csv&amp;quot; --data-urlencode &amp;quot;query@query1.rq&amp;quot;  http://edan.si.edu/saam/sparql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Other possible content types are &lt;code&gt;application/sparql-results+json&lt;/code&gt;, &lt;code&gt;application/sparql-results+xml&lt;/code&gt;, and &lt;code&gt;text/tab-separated-values&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The above examples all use a SELECT query. A CONSTRUCT query requests triples, so instead of CSV or one of the other tabular formats you want an RDF serialization such as Turtle:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept: text/turtle&amp;quot; --data-urlencode &amp;quot;query@query2.rq&amp;quot;  http://edan.si.edu/saam/sparql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Other possible content types for CONSTRUCT queries are &lt;code&gt;application/rdf+xml&lt;/code&gt;, &lt;code&gt;application/rdf+json&lt;/code&gt;, and, for N-Triples, &lt;code&gt;text/plain&lt;/code&gt;. The &lt;a href=&#34;https://github.com/bio2rdf/bio2rdf-scripts/wiki/REST-API&#34;&gt;bio2rdf github page&lt;/a&gt; has good long lists for both SELECT and CONSTRUCT content types, although not all endpoints will support all of the listed types. (It lists &lt;code&gt;text/plain&lt;/code&gt; for N-Triples, but you&amp;rsquo;re better off using &lt;code&gt;application/n-triples&lt;/code&gt;.)&lt;/p&gt;
&lt;h2 id=&#34;idm45504284132128&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;curl lets you submit many kinds of HTTP requests to HTTP servers. It comes with macOS and most Linux distributions, and if you don&amp;rsquo;t have it on your Windows machine, you can &lt;a href=&#34;https://curl.haxx.se/windows/&#34;&gt;download&lt;/a&gt; it.&lt;/p&gt;
&lt;p&gt;If you enter &lt;code&gt;curl&lt;/code&gt; with no parameters other than a URL, like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://www.learningsparql.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;it does the same HTTP GET that a browser would do. This has the same effect as doing a browser View Source on that web page.&lt;/p&gt;
&lt;p&gt;It gets more interesting when you&amp;rsquo;re not pointing curl at a static web page like &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;http://www.learningsparql.com&lt;/a&gt; but at a dynamic resource such as a SPARQL endpoint. A SPARQL endpoint is usually identified with a URL ending with &lt;code&gt;/sparql&lt;/code&gt;. I tested everything shown above with these endpoint URLs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;https://query.wikidata.org/bigdata/namespace/wdq/sparql&lt;/code&gt;, the SPARQL endpoint for Wikidata.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;http://localhost:3030/myDataset/sparql&lt;/code&gt;, the SPARQL endpoint for a local instance of &lt;a href=&#34;https://jena.apache.org/documentation/fuseki2/&#34;&gt;Apache Jena Fuseki&lt;/a&gt;. This is the triplestore that I described in the &amp;ldquo;Updating Data with SPARQL&amp;rdquo; chapter of my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; because, for a server that accepts SPARQL UPDATE commands, it&amp;rsquo;s so easy to get up and running. Before running the queries against this endpoint I created a dataset on this running instance with the clever name of myDataset and loaded some triples into it. As you can see, a Fuseki endpoint URL includes the dataset name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;http://edan.si.edu/saam/sparql&lt;/code&gt;, the SPARQL endpoint for the Smithsonian Institution. I used this one in the examples here because it&amp;rsquo;s the shortest of the three endpoint URLs that I used for testing.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The simplest way to send a query to a SPARQL endpoint is to add &lt;code&gt;query=[your URL-encoded query]&lt;/code&gt; to the end of the endpoint&amp;rsquo;s URL as with the very first example above. You can paste the resulting URL into the address bar of a web browser so that the browser will retrieve the query results from the endpoint, but curl lets you retrieve the results from a command line so that you can save the returned data and use it as part of an application.&lt;/p&gt;
&lt;p&gt;URL encoding is the process of taking characters that might screw up the parsing of the URL and converting each to a percent sign followed by two hexadecimal digits for each byte of the character&amp;rsquo;s encoding&amp;ndash;most often, converting each space to %20. For example, the escaped version of the query &lt;code&gt;SELECT * WHERE {?s ?p ?o} LIMIT 8&lt;/code&gt; that I used in the examples above is &lt;code&gt;SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208&lt;/code&gt;. Most programming languages offer built-in functions to do this; I usually paste one of these queries into a form on a website like &lt;a href=&#34;https://meyerweb.com/eric/tools/dencoder/&#34;&gt;this one&lt;/a&gt; and then copy the result after having the form do the conversion.&lt;/p&gt;
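&lt;p&gt;If you&amp;rsquo;d rather not paste queries into a web form, Python&amp;rsquo;s standard library can do the conversion. This little sketch reproduces the escaped query shown above (the asterisk doesn&amp;rsquo;t need escaping, so it&amp;rsquo;s listed as safe):&lt;/p&gt;

```python
from urllib.parse import quote

query = "SELECT * WHERE {?s ?p ?o} LIMIT 8"
# Letters and digits are never encoded; the spaces, braces, and question
# marks become %20, %7B/%7D, and %3F respectively.
print(quote(query, safe="*"))
# prints SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208
```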
&lt;p&gt;When you add the escaped query to a SPARQL endpoint URL such as the Smithsonian one and enter the result as a parameter to curl at your command line, like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://edan.si.edu/saam/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;it should retrieve a &lt;a href=&#34;https://www.w3.org/TR/sparql11-results-json/&#34;&gt;SPARQL Query Results JSON Format&lt;/a&gt; version of the data requested by that query, because that&amp;rsquo;s the default format for that endpoint.&lt;/p&gt;
&lt;p&gt;I actually don&amp;rsquo;t escape queries and add them to a curl command line often. When I&amp;rsquo;m refining a query by iteratively editing and running it, re-encoding the URL each time can be a pain, so I usually store the query in a text file (query1.rq for the sample SELECT query above and query2.rq for the CONSTRUCT query) and tell curl to URL-encode the file&amp;rsquo;s contents and send the result off to the SPARQL endpoint.&lt;/p&gt;
&lt;p&gt;If I keep the file with the query in a text editor, I can refine it, save it, and run the same command over and over without worrying about escaping each revision of the query. (Because my editor is Emacs, I could actually send the query to the endpoint using Emacs &lt;a href=&#34;https://www.emacswiki.org/emacs/SPARQLMode&#34;&gt;SPARQLMode&lt;/a&gt;, but today&amp;rsquo;s topic is curl.)&lt;/p&gt;
&lt;p&gt;The curl website has plenty of &lt;a href=&#34;https://curl.haxx.se/docs/&#34;&gt;documentation&lt;/a&gt;, but you can learn a lot with just this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  curl --help
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Among the many, many options, some useful ones are &lt;code&gt;-o&lt;/code&gt; to redirect output to a file and &lt;code&gt;-L&lt;/code&gt; for &amp;ldquo;follow location hints&amp;rdquo; (that is, if the server has instructions to redirect a request for a given URL to something else, take the hint). Another is &lt;code&gt;-I&lt;/code&gt; for &amp;ldquo;Show document info only&amp;rdquo;: it sends an HTTP HEAD request to get information about the requested &amp;ldquo;document&amp;rdquo; without actually retrieving it, which is useful for debugging. The classic &lt;code&gt;-v&lt;/code&gt; for &amp;ldquo;verbose&amp;rdquo; is also handy for debugging.&lt;/p&gt;
&lt;p&gt;Take a look at the available options, experiment with some SPARQL endpoints, and soon you&amp;rsquo;ll be using &amp;ldquo;curl&amp;rdquo; as a verb (for example, &amp;ldquo;I tried to curl it but I didn&amp;rsquo;t have the right certs&amp;rdquo;&amp;ndash;see the &lt;code&gt;-E&lt;/code&gt; command line option for more on that) and you won&amp;rsquo;t be talking about hairstyling, arm exercises, or sliding round stones across the ice.&lt;/p&gt;
&lt;p&gt;(I just learned about &lt;a href=&#34;http://blog.mynarz.net/2015/05/curling-sparql-http-graph-store-protocol.html&#34;&gt;Curling SPARQL HTTP Graph Store protocol&lt;/a&gt; by &lt;a href=&#34;https://twitter.com/jindrichmynarz?lang=en&#34;&gt;@jindrichmynarz&lt;/a&gt;, so if you&amp;rsquo;ve gotten this far, you&amp;rsquo;ll like that too.)&lt;/p&gt;
&lt;img id=&#34;idm45504284108544&#34; src=&#34;https://www.bobdc.com/img/main/curling.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;curling lamp&#34; width=&#34;320&#34;/&gt;
&lt;p&gt;&lt;em&gt;Curling image by Greg Scheckter via &lt;a href=&#34;https://www.flickr.com/photos/gregthebusker/4767909833/&#34;&gt;Flickr&lt;/a&gt;, CC &lt;a href=&#34;https://creativecommons.org/licenses/by/2.0/&#34;&gt;some rights reserved&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Comments? Just tweet to @bobdc for now, because Google+ is &lt;a href=&#34;https://smallbiztrends.com/2019/02/google-plus-shutting-down.html&#34;&gt;shutting down&lt;/a&gt;. I will be moving my blog to a new more phone-responsive platform shortly and I&amp;rsquo;m researching options for hosted comments.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Querying machine learning distributional semantics with SPARQL</title>
      <link>https://www.bobdc.com/blog/querying-machine-learning-dist/</link>
      <pubDate>Sun, 20 Jan 2019 09:57:40 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-machine-learning-dist/</guid>
      
      
      <description><div>Bringing together my two favorite kinds of semantics.</div><div>&lt;blockquote id=&#34;idm45622484840592&#34; class=&#34;pullquote&#34;&gt;I recommend the paper to anyone interested in SPARQL or the embedding vectors side of machine learning. They seem to have a productive future together.&lt;/blockquote&gt;
&lt;p&gt;When I wrote &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-semantics-vs-vect&#34;&gt;Semantic web semantics vs. vector embedding machine learning semantics&lt;/a&gt;, I described how distributional semantics&amp;ndash;whose machine learning implementations are very popular in modern natural language processing&amp;ndash;are quite different from the kind of semantics that RDF people usually talk about. I recently learned of a fascinating project that brings RDF technology and distributional semantics together, letting our SPARQL query logic take advantage of entity similarity as rated by machine learning models.&lt;/p&gt;
&lt;p&gt;To review a little from that blog entry: machine learning implementations of distributional semantics can identify some of the meanings of words by analyzing their relationships with other words in a set of training data. For example, after analyzing the distribution of terms in a large enough text corpus, such a system can answer the question &amp;ldquo;woman is to man as queen is to what?&amp;rdquo; Along with the answer of &amp;ldquo;king&amp;rdquo;, discussions of this technology typically bring up other examples such as the questions &amp;ldquo;walking is to walked as swimming is to what?&amp;rdquo; (an especially nice one because &amp;ldquo;swim&amp;rdquo; is an irregular verb) and &amp;ldquo;London is to England as Berlin is to what?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;These examples are a bit oversimplified. Instead of such a straightforward answer, an implementation such as &lt;a href=&#34;https://github.com/dav/word2vec&#34;&gt;word2vec&lt;/a&gt; typically responds with a list of scored words. If the analyzed corpus is large enough, asking word2vec to complete the analogy &amp;ldquo;woman man queen&amp;rdquo; will get you a list of words with &amp;ldquo;king&amp;rdquo; having the highest score. In my experiments, this list format was a nice touch for the &amp;ldquo;london england berlin&amp;rdquo; case: while germany had the highest score, prussia had the second highest, and Berlin was the capital of Prussia for a few centuries.&lt;/p&gt;
&lt;p&gt;word2vec doesn&amp;rsquo;t actually compare the strings &amp;ldquo;london&amp;rdquo; and &amp;ldquo;england&amp;rdquo; and &amp;ldquo;berlin&amp;rdquo;. It uses &lt;a href=&#34;https://en.wikipedia.org/wiki/Cosine_similarity&#34;&gt;cosine similarity&lt;/a&gt; to compare vectors that were assigned to each word as a result of the training step done with the input corpus&amp;ndash;the machine &amp;ldquo;learning&amp;rdquo; part. Then, it looks for vectors whose similarity to the berlin vector is comparable to the similarity between the london and england vectors.&lt;/p&gt;
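&lt;p&gt;Cosine similarity itself is easy to compute: it&amp;rsquo;s the dot product of two vectors divided by the product of their magnitudes. A minimal Python sketch with made-up three-dimensional vectors (real word2vec vectors typically have a few hundred dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    mag_u = math.sqrt(sum(a * a for a in u))
    mag_v = math.sqrt(sum(b * b for b in v))
    return dot / (mag_u * mag_v)

# Made-up vectors: "london" and "england" point in similar directions,
# so their cosine similarity is close to 1.
london = [0.9, 0.3, 0.1]
england = [0.8, 0.4, 0.2]
print(cosine_similarity(london, england))
```

&lt;p&gt;A value near 1 means the vectors point in nearly the same direction; word2vec ranks candidate words by exactly this kind of score.&lt;/p&gt;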
&lt;p&gt;Some of the most interesting work in machine learning of the past few years has built on the use of vectors to represent entities other than words. The popular &lt;a href=&#34;https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e&#34;&gt;doc2vec&lt;/a&gt; (originally implemented by my &lt;a href=&#34;http://www.ccri.com&#34;&gt;CCRi&lt;/a&gt; co-worker Tim Emerick) does it with documents, and others have done it with &lt;a href=&#34;https://arxiv.org/abs/1603.00982&#34;&gt;audio clips&lt;/a&gt; and images.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s one thing to pick out an entity and then ask for a list of entities whose vectors are similar to that of the selected entity. Researchers at King Abdullah University of Science and Technology, the University of Birmingham, and Maastricht University have collaborated to take this further by mixing in some SPARQL. Their paper &lt;a href=&#34;https://www.biorxiv.org/content/early/2018/11/07/463778&#34;&gt;Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings&lt;/a&gt; describes &amp;ldquo;a general framework for integrating structured data and their vector space representations [that] allows jointly querying vector functions such as computing similarities (cosine, correlations) or classifications with machine learning models within a single SPARQL query&amp;rdquo;. They have made their implementation available as a Docker image and also put up a SPARQL endpoint with their sample data and SPARQL extensions.&lt;/p&gt;
&lt;p&gt;Vec2SPARQL lets you use SPARQL to move beyond simple comparison of vector similarity scores to combine SPARQL&amp;rsquo;s abilities with this. As they write,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, once feature vectors are extracted from images, meta-data that is associated with the images (such as geo-locations, image types, author, or similar) could be queried using SPARQL and &lt;em&gt;combined&lt;/em&gt; with the semantic queries over the feature vectors extracted from the images themselves. Such a combination would, for example, allow to identify the images authored by person &lt;em&gt;a&lt;/em&gt; that are most similar to an image of author &lt;em&gt;b&lt;/em&gt;; it can enable similarity- or analogy-based search and retrieval in precisely delineated subsets; or, when feature learning is applied to structured datasets, can combine similarity search and link prediction based on knowledge graph embeddings with structured queries based on SPARQL.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper&amp;rsquo;s authors extended &lt;a href=&#34;https://jena.apache.org/documentation/query/&#34;&gt;Apache Jena ARQ&lt;/a&gt; (the open source cross-platform command line SPARQL processor that I recommend in my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt;) with two new functions that make it easier to work with these vectors. The &lt;code&gt;similarity(?x,?y)&lt;/code&gt; function lets you compute the similarity of two vectors so that you can use the result in a &lt;code&gt;FILTER&lt;/code&gt;, &lt;code&gt;BIND&lt;/code&gt;, or &lt;code&gt;SELECT&lt;/code&gt; statement. For example, you might use it in a &lt;code&gt;FILTER&lt;/code&gt; statement to only retrieve resources whose similarity to a particular resource was above a specified threshold. Their &lt;code&gt;mostSimilar(?x,n)&lt;/code&gt; function asks for the &lt;code&gt;n&lt;/code&gt; most similar entities to the one passed as the first argument.&lt;/p&gt;
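&lt;p&gt;Based on the paper&amp;rsquo;s description, a query that uses these functions might look roughly like this (a hypothetical sketch: the &lt;code&gt;ex:&lt;/code&gt; prefix, the property names, and the threshold are all invented for illustration; only &lt;code&gt;similarity()&lt;/code&gt; comes from Vec2SPARQL):&lt;/p&gt;

```sparql
# Hypothetical sketch; not the paper's actual data model.
PREFIX ex: &lt;http://example.org/&gt;
SELECT ?image ?score WHERE {
  ?image ex:author ex:personA ;
         ex:featureVector ?v .
  ex:referenceImage ex:featureVector ?refV .
  BIND (similarity(?refV, ?v) AS ?score)
  FILTER (?score &gt; 0.8)
}
ORDER BY DESC(?score)
```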
&lt;p&gt;Their paper discusses two applications of Vec2SPARQL, in which they &amp;ldquo;demonstrate using biomedical, clinical, and bioinformatics use cases how [their] approach can enable new kinds of queries and applications that combine symbolic processing and retrieval of information through sub-symbolic semantic queries within vector spaces&amp;rdquo;. As they described the first of their two examples,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;we can use Vec2SPARQL to perform queries of a knowledge graph of mouse genes, diseases and phenotypes and incorporate Vec2SPARQL similarity functions&amp;hellip; Our aim in this use case is to find mouse gene associations with human diseases by prioritizing them using their phenotypic similarity, and simultaneously restrict the similarity comparisons to genes and diseases with specific properties (such as being associated with a particular phenotype).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper describes where they got their data and how they prepared it, and it shows a brief but expressive query that let them achieve their goal.&lt;/p&gt;
&lt;p&gt;In their second example, after assigning vectors to over 112,000 human chest x-ray images that also included gender, age, and diagnosis metadata, they could query for image similarity and also add filters to these queries such as combinations of age range and gender to find other patterns of similarity.&lt;/p&gt;
&lt;p&gt;The paper goes into greater detail on the data used for their samples and the similarity measures that they used. It also points to their &lt;a href=&#34;https://github.com/bio-ontology-research-group/vec2sparql&#34;&gt;source code on github&lt;/a&gt; and a &amp;ldquo;SPARQL endpoint&amp;rdquo; at &lt;a href=&#34;http://sparql.bio2vec.net/&#34;&gt;http://sparql.bio2vec.net/&lt;/a&gt; that is really more of a SPARQL endpoint query form. (The actual endpoint is at &lt;code&gt;http://sparql.bio2vec.net/patient_embeddings/query&lt;/code&gt;, and I successfully &lt;a href=&#34;https://twitter.com/coolmaksat/status/1079594129997348864&#34;&gt;sent a query there&lt;/a&gt; with curl.)&lt;/p&gt;
&lt;p&gt;For an academic paper, &amp;ldquo;Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings&amp;rdquo; is quite readable. (Although I didn&amp;rsquo;t have the right biology background to closely follow all the discussions of their sample query data, I could just about handle the math as shown.) I recommend the paper to anyone interested in SPARQL or the embedding vectors side of machine learning. They seem to have a productive future together.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2019">2019</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Playing with wdtaxonomy</title>
      <link>https://www.bobdc.com/blog/playing-with-wdtaxonomy/</link>
      <pubDate>Sun, 23 Dec 2018 09:51:49 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/playing-with-wdtaxonomy/</guid>
      
      
      <description><div>Those queries from my last blog entry? Never mind!</div><div>&lt;p&gt;After I wrote about &lt;a href=&#34;https://www.bobdc.com/blog/extracting-rdf-data-models-fro&#34;&gt;Extracting RDF data models from Wikidata&lt;/a&gt; in my blog last month, &lt;a href=&#34;https://twitter.com/Ettore_Rizza&#34;&gt;Ettore Rizza&lt;/a&gt; &lt;a href=&#34;https://twitter.com/Ettore_Rizza/statuses/1064428103068467200&#34;&gt;suggested&lt;/a&gt; that I check out &lt;a href=&#34;https://www.npmjs.com/package/wikidata-taxonomy&#34;&gt;wdtaxonomy&lt;/a&gt;, which extracts taxonomies from Wikidata by retrieving the kinds of data that my blog entry&amp;rsquo;s sample queries retrieved, and it then displays the results as a tree. After playing with it, I&amp;rsquo;m tempted to tell everyone who read that blog entry to ignore the example queries I included, because you can learn a lot more from wdtaxonomy.&lt;/p&gt;
&lt;p&gt;The queries in that blog entry might still give you some useful perspective on how SPARQL can retrieve triples from &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt; that express tree-ish relationships between the concepts of a given domain that have Wikipedia pages&amp;ndash;whether you want to call that a taxonomy or an ontology&amp;ndash;but I was just dabbling, while wdtaxonomy is a full-featured serious application for this.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://jakobvoss.de/&#34;&gt;Jakob Voss&lt;/a&gt; designed wdtaxonomy as both a command line utility and as an NPM &lt;a href=&#34;https://www.npmjs.com/package/wikidata-taxonomy#user-content-usage-as-module&#34;&gt;module&lt;/a&gt; that you can reference from applications. I tried the command line version and had a lot of fun. To try it with my periodic table element example that I wrote about last month, I started by entering &amp;ldquo;wdtaxonomy Q11344&amp;rdquo; (using the same local name for the &lt;a href=&#34;https://www.wikidata.org/wiki/Q11344&#34;&gt;Wikidata identifier&lt;/a&gt; that I used before) and the results were impressive.&lt;/p&gt;
&lt;p&gt;wdtaxonomy typically outputs a text-based tree with various information about the nodes of the tree. Instead of pasting a sample here, I&amp;rsquo;m showing a screen shot of the beginning of the output so that you can see the nice color coding:&lt;/p&gt;
&lt;img id=&#34;idm45289136971824&#34; width=&#34;320&#34; src=&#34;https://www.bobdc.com/img/main/wdtaxonomy1.png&#34;/&gt;
&lt;p&gt;The wdtaxonomy readthedocs.io &lt;a href=&#34;https://wdtaxonomy.readthedocs.io/en/latest/&#34;&gt;documentation&lt;/a&gt; lists over two dozen command line options that you can use to customize the output. (Entering &amp;ldquo;wdtaxonomy&amp;rdquo; alone at the command line gives a good summary.) My favorite is &lt;code&gt;-s&lt;/code&gt;, which shows you the SPARQL query that wdtaxonomy would use to retrieve the requested information from Wikidata. Here is what that gives you when you add it to the Q11344 command line I entered above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ wdtaxonomy -s Q11344
  SELECT ?item ?broader ?itemLabel ?instances ?sites WITH {
    SELECT DISTINCT ?item { ?item wdt:P279* wd:Q11344 }
  } AS %items WHERE { 
    INCLUDE %items .
    OPTIONAL { ?item wdt:P279 ?broader } .
    {
      SELECT ?item (count(distinct ?element) as ?instances) {
        INCLUDE %items.
        OPTIONAL { ?element wdt:P31 ?item }
      } GROUP BY ?item
    }
    {
      SELECT ?item (count(distinct ?site) as ?sites) {
        INCLUDE %items.
        OPTIONAL { ?site schema:about ?item }
      } GROUP BY ?item
    }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language &amp;quot;en&amp;quot;
    }
  }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The INCLUDE keyword used in this query is a Blazegraph and Anzo &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/NamedSubquery&#34;&gt;extension&lt;/a&gt; to the SPARQL standard.) Combining this &lt;code&gt;-s&lt;/code&gt; option with other options, such as &lt;code&gt;-i&lt;/code&gt; to include instances or &lt;code&gt;-d&lt;/code&gt; to include item descriptions, shows what SPARQL query the tool would generate to retrieve this additional information. It&amp;rsquo;s a great opportunity to learn more about SPARQL, about the Wikidata data model, and about their relationship. (I have worried that this data model would scare off people who are new to SPARQL&amp;ndash;that if their first data set to query was Wikidata, they might think that the complexity of the necessary queries was because of SPARQL and not because of Wikidata&amp;ndash;but when I see all the great activity on Twitter around the use of SPARQL with Wikidata lately, I don&amp;rsquo;t worry so much anymore.)&lt;/p&gt;
&lt;p&gt;The ability to get at the generated SPARQL queries is also a huge help to my original goal of retrieving triples that let me store an RDFS/OWL ontology or a SKOS taxonomy about Wikipedia entities. I can change the SELECT part to a CONSTRUCT clause to create triples that use the variables bound in wdtaxonomy&amp;rsquo;s WHERE clauses. wdtaxonomy (or rather, Jakob) has done the difficult work of assembling the necessary query logic and we can just take it and use it.&lt;/p&gt;
&lt;p&gt;Some of the other command line options I liked include &lt;code&gt;-U&lt;/code&gt; to get full URIs and &lt;code&gt;-r&lt;/code&gt; to get superclasses of the named entity instead of its subclasses. I encourage everyone interested in SPARQL and Wikidata to install wdtaxonomy and start playing with it. Especially with that &lt;code&gt;-s&lt;/code&gt; option!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>Extracting RDF data models from Wikidata</title>
      <link>https://www.bobdc.com/blog/extracting-rdf-data-models-fro/</link>
      <pubDate>Sun, 18 Nov 2018 09:41:46 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/extracting-rdf-data-models-fro/</guid>
      
      
      <description><div>That&#39;s &#34;models&#34;, plural.</div><div>&lt;blockquote id=&#34;idm46211189631152&#34; class=&#34;pullquote&#34;&gt; Their avoidance of the standard model vocabularies is not a big deal, and we should be glad that they make this available in RDF at all.&lt;/blockquote&gt;
&lt;p&gt;Some people complain when an RDF dataset lacks a documented data model. A great thing about RDF and SPARQL is that if you want to know what kind of modeling might have been done for a dataset, &lt;em&gt;you just look&lt;/em&gt;, even if they&amp;rsquo;re using non-(W3C-)standard modeling structures. They&amp;rsquo;re still using triples, so you look at the triples.&lt;/p&gt;
&lt;p&gt;If I know that there is an entity &lt;code&gt;x:thing23&lt;/code&gt; in a dataset, I&amp;rsquo;m going to query for &lt;code&gt;{x:thing23 ?p ?o}&lt;/code&gt; and see what information there is about that entity. Hopefully I will find an &lt;code&gt;rdf:type&lt;/code&gt; triple saying that it&amp;rsquo;s a member of a class. If not, maybe it uses some other home-grown way to indicate class membership; either way, you can then start querying to find out about the class&amp;rsquo;s relationships to properties and other classes, and you&amp;rsquo;ve got a data model. What if it doesn&amp;rsquo;t use RDFS to describe these modeling structures and their relationships? A CONSTRUCT query will convert it to a data model that does.&lt;/p&gt;
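&lt;p&gt;To make that last point concrete, here is a small Python sketch (not SPARQL, and with invented predicate names) of what such a CONSTRUCT query effectively does: map home-grown class-membership predicates onto their RDFS equivalents so that standard tools can use them:&lt;/p&gt;

```python
# Sketch of what a normalizing CONSTRUCT query does: rewrite home-grown
# modeling predicates as their W3C standard equivalents. All the x: names
# here are made up for illustration.
MAPPING = {
    "x:isA": "rdf:type",
    "x:kindOf": "rdfs:subClassOf",
}

def normalize(triples):
    """Return the triples with any home-grown predicates replaced."""
    return [(s, MAPPING.get(p, p), o) for s, p, o in triples]

data = [
    ("x:thing23", "x:isA", "x:Widget"),
    ("x:Widget", "x:kindOf", "x:Product"),
]
for triple in normalize(data):
    print(triple)
```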
&lt;p&gt;And, if &lt;code&gt;{x:thing23 ?p ?o}&lt;/code&gt; triples don&amp;rsquo;t indicate any class membership, just seeing what the &lt;code&gt;?p&lt;/code&gt; values are tells you something about the data model. If certain entities use certain properties for their predicates, and other entities use a list that overlaps with that, you&amp;rsquo;ve learned more about relationships between sets of entities in the dataset. All of these things can be investigated with simple queries.&lt;/p&gt;
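&lt;p&gt;That kind of exploration doesn&amp;rsquo;t even need a triple store. As a rough illustration (all the names below are invented), grouping each subject&amp;rsquo;s predicate signature in a few lines of Python already suggests which entities play the same role in an undocumented data model:&lt;/p&gt;

```python
# Sketch: inferring an implicit data model from raw triples by grouping
# the predicates used with each subject. Subjects sharing a predicate
# signature probably belong to the same implicit class. Toy data.
from collections import defaultdict

triples = [
    ("x:thing23", "x:partNumber", "88-3921"),
    ("x:thing23", "x:supplier", "x:acme"),
    ("x:thing24", "x:partNumber", "88-4410"),
    ("x:thing24", "x:supplier", "x:acme"),
    ("x:acme", "x:homePage", "http://acme.example/"),
]

# Collect the set of predicates used with each subject.
signatures = defaultdict(set)
for s, p, o in triples:
    signatures[s].add(p)

# Invert: subjects grouped by identical predicate signature.
roles = defaultdict(list)
for subject, preds in signatures.items():
    roles[frozenset(preds)].append(subject)

for preds, subjects in roles.items():
    print(sorted(preds), ":", sorted(subjects))
```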
&lt;p&gt;Wikidata offers tons of great data and modeling for us RDF people, but it wasn&amp;rsquo;t designed for us. They created their own model and then expressed the model and instance data in RDF, and I&amp;rsquo;m not going to complain; can you imagine how cool it would be if Google did the same with their knowledge graph? (When I &lt;a href=&#34;https://twitter.com/bobdc/status/1051140112543875072&#34;&gt;tweeted&lt;/a&gt; &amp;ldquo;Handy Wikidata hints for people who have been using RDF and SPARQL since before Wikidata was around: use wdt:P31 instead of rdf:type and wdt:P279 instead of rdfs:subClassOf&amp;rdquo;, &lt;a href=&#34;https://twitter.com/mark_l_watson&#34;&gt;Mark Watson&lt;/a&gt; replied that he liked my sense of humor. While I hadn&amp;rsquo;t meant to be funny I do appreciate &lt;em&gt;his&lt;/em&gt; sense of humor.) As I&amp;rsquo;ve worked at understanding Wikidata&amp;rsquo;s documentation about their mapping to RDF I&amp;rsquo;ve had fun just querying around to understand the structures. Again: this is one of the key reasons that RDF and SPARQL are great! Because we can do that!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/sparql-full-text-wikipedia-sea&#34;&gt;Last month&lt;/a&gt; I described how you can find the subclass tree under a given class in Wikidata and since then I&amp;rsquo;ve done further exploration of how to pull data models out of Wikidata. Note that I say &amp;ldquo;models&amp;rdquo; and not &amp;ldquo;model&amp;rdquo;. &lt;a href=&#34;https://twitter.com/datao&#34;&gt;Olivier Rossel&lt;/a&gt; recently &lt;a href=&#34;https://twitter.com/datao/statuses/1056911654879969280&#34;&gt;referred to&lt;/a&gt; extracting the data model of Wikidata (my translation from his French), but I worry that looking for &amp;ldquo;the&amp;rdquo; grand RDF data model of Wikidata might set someone up for disappointment. I think that looking for data models to suit various projects will be more productive. (Olivier and I discussed this further in the &amp;ldquo;Handy Wikidata hints&amp;rdquo; thread mentioned above.)&lt;/p&gt;
&lt;p&gt;The following query builds on the one I did last month to either get a class tree below a given one or to get its superclasses instead. It creates triples that express the classes and their relationships using W3C standard properties.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CONSTRUCT {
  ?class a owl:Class . 
  ?class rdfs:subClassOf ?superclass . 
  ?class rdfs:label ?classLabel . 
  ?property rdfs:domain ?class . 
  ?property rdfs:label ?propertyLabel .
}
WHERE {
  BIND(wd:Q11344 AS ?mainClass) .    # Q11344 chemical element; Q1420 automobile

  
  # Pick one or the other of the following two triple patterns. 
  ?class wdt:P279* ?mainClass.     # Find subclasses of the main class. 
  #?mainClass wdt:P279* ?class.     # Find superclasses of the main class. 

  
  ?class wdt:P279 ?superclass .     # So we can create rdfs:subClassOf triples
  ?class rdfs:label ?classLabel.
  OPTIONAL {
    ?class wdt:P1963 ?property.
    ?property rdfs:label ?propertyLabel.
    FILTER((LANG(?propertyLabel)) = &amp;quot;en&amp;quot;)
    }
  FILTER((LANG(?classLabel)) = &amp;quot;en&amp;quot;)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Because the query uses prefixes that Wikidata already understands, I didn&amp;rsquo;t need to declare any.) When run in the Wikidata query service form, there are too many triples to see at once, so I put the query into a subtreeClasses.rq file and ran it with curl from the command line like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl --data-urlencode &amp;quot;query@subtreeClasses.rq&amp;quot; https://query.wikidata.org/sparql -H &amp;quot;Accept: text/turtle&amp;quot;  &amp;gt; chemicalElementSubClasses.ttl
&lt;/code&gt;&lt;/pre&gt;
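&lt;p&gt;If you&amp;rsquo;d rather stay in Python than shell out to curl, the same request can be assembled with nothing but the standard library. This is only a sketch&amp;ndash;the query string below is a placeholder for the contents of subtreeClasses.rq, and the request is built but never actually sent:&lt;/p&gt;

```python
# Sketch: the same POST request that the curl command makes, built with
# Python's standard library. The query here is a stand-in for the real
# subtreeClasses.rq contents; urlopen(req) is what would send it.
from urllib.parse import urlencode
from urllib.request import Request

query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"  # placeholder query

req = Request(
    "https://query.wikidata.org/sparql",
    data=urlencode({"query": query}).encode("utf-8"),  # like --data-urlencode
    headers={"Accept": "text/turtle"},                 # like -H "Accept: ..."
)
print(req.get_method())  # POST, because a data payload is attached
```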
&lt;p&gt;Loading the result into TopBraid Composer Free edition (available &lt;a href=&#34;https://www.topquadrant.com/downloads/topbraid-composer-install/&#34;&gt;here&lt;/a&gt;; the Free edition is a choice on the Product dropdown list) showed a class tree of the result like this:&lt;/p&gt;
&lt;img id=&#34;idm46211189613984&#34; width=&#34;280&#34; src=&#34;https://www.bobdc.com/img/main/elementClassTree.png&#34;/&gt;
&lt;p&gt;(It&amp;rsquo;s tempting to add an entry for &lt;a href=&#34;https://en.wikipedia.org/wiki/Professor_Frink&#34;&gt;Frinkonium&lt;/a&gt; as a subclass of &amp;ldquo;hypothetical chemical element&amp;rdquo;.) I understand that the Wikimedia Foundation had their reasons for not describing their models with the standard vocabularies, but this shows the value of using the standards: interoperability with other tools. It also shows that the Foundation&amp;rsquo;s avoidance of the standard model vocabularies is not a big deal, and that we should be glad that they make this available in RDF at all, because the sheer fact that it&amp;rsquo;s in RDF makes it easy to convert to whatever RDF we want with a CONSTRUCT query. (Again, imagine if Google did this with any portion of their knowledge graph&amp;hellip;)&lt;/p&gt;
&lt;p&gt;The query above also looks for properties for those classes so that it can express those in the output with the RDFS vocabulary. It didn&amp;rsquo;t find many, but this bears further investigation. &lt;a href=&#34;https://query.wikidata.org/#SELECT%20DISTINCT%20%3FconstraintLabel%20WHERE%20%7B%0A%20%20wd%3AQ11344%20wdt%3AP1963%20%3Fproperty.%0A%20%20%3Fproperty%20wdt%3AP2302%20%3Fconstraint.%0A%20%20%3Fconstraint%20rdfs%3Alabel%20%3FconstraintLabel.%0A%20%20FILTER%28%28LANG%28%3FconstraintLabel%29%29%20%3D%20%22en%22%29%0A%7D%0AORDER%20BY%20%3FconstraintLabel%0A&#34;&gt;This query&lt;/a&gt; shows that in addition to the chemical element class having properties, there are constraints on those properties described with triples, so there&amp;rsquo;s a lot more that can be done here to pull richer models out of Wikidata and then express them in more standard vocabularies.&lt;/p&gt;
&lt;p&gt;And of course there&amp;rsquo;s the possibility of pulling out instance data to go with these models. Queries for that would be easy enough to assemble but you might end up with so much data that Wikidata times out before giving it to you; you could use the techniques I described in &lt;a href=&#34;https://www.bobdc.com/blog/pipelining-sparql-queries-in-m&#34;&gt;Pipelining SPARQL queries in memory with the rdflib Python library&lt;/a&gt; to retrieve instance URIs and then retrieve the additional triples about those instances in batches of queries that use the VALUES keywords.&lt;/p&gt;
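&lt;p&gt;As a rough sketch of that batching step (the URIs and batch size below are made up), a little Python can chunk a list of instance identifiers into VALUES clauses ready to paste into follow-up queries:&lt;/p&gt;

```python
# Sketch: splitting a list of Wikidata instance URIs into SPARQL VALUES
# clauses so each follow-up query stays small enough to avoid timeouts.
# The URIs and batch size are invented for illustration.

def values_batches(uris, batch_size):
    """Yield one VALUES clause per batch of URIs, using wd: prefixed names."""
    for i in range(0, len(uris), batch_size):
        batch = uris[i:i + batch_size]
        # Turn each full entity URI into its wd: prefixed local name.
        bindings = " ".join("wd:" + uri.rsplit("/", 1)[-1] for uri in batch)
        yield "VALUES ?instance { " + bindings + " }"

uris = ["http://www.wikidata.org/entity/Q%d" % n for n in range(1, 8)]
for clause in values_batches(uris, 3):
    print(clause)
```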
&lt;p&gt;Lots of data instances of rich models, all transformed to conform to the W3C standards so that they work with lots of open source and commercial tools&amp;ndash;the possibilities are pretty impressive. If anyone pulls datasets like this out of Wikidata for their field, let me know about it!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>SPARQL full-text Wikipedia searching and Wikidata subclass inferencing</title>
      <link>https://www.bobdc.com/blog/sparql-full-text-wikipedia-sea/</link>
      <pubDate>Sun, 28 Oct 2018 12:37:19 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-full-text-wikipedia-sea/</guid>
      
      
      <description><div>Wikipedia querying techniques inspired by a recent paper.</div><div>&lt;img id=&#34;idm9&#34; src=&#34;https://www.bobdc.com/img/main/MilhaudBacharach.png&#34; width=&#34;300&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Milhaud and Bacharach&#34;/&gt;
&lt;p&gt;I found all kinds of interesting things in the article &amp;ldquo;Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia&amp;rsquo;s Knowledge Graph&amp;rdquo; (&lt;a href=&#34;https://iccl.inf.tu-dresden.de/w/images/5/5a/Malyshev-et-al-Wikidata-SPARQL-ISWC-2018.pdf&#34;&gt;pdf&lt;/a&gt;) by Stanislav Malyshev of the Wikimedia Foundation and four co-authors from the Technical University of Dresden. I wanted to highlight two particular things that I will find useful in the future and then I&amp;rsquo;ll list a few more.&lt;/p&gt;
&lt;p&gt;Before I cover them, I wanted to mention that I&amp;rsquo;ve really grown to appreciate the little diamond icon in the upper-left of the Wikidata &lt;a href=&#34;https://query.wikidata.org&#34;&gt;query form&lt;/a&gt;. As I refine queries on that form, the queries typically get messier and messier, so the ability to clean it all up with one click is very convenient.&lt;/p&gt;
&lt;h2 id=&#34;idm15&#34;&gt;Full text searching of Wikipedia with SPARQL&lt;/h2&gt;
&lt;p&gt;The paper&amp;rsquo;s &amp;ldquo;Custom SPARQL Extensions&amp;rdquo; section describes several extensions, including the MediaWiki Web API. The &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI&#34;&gt;Wikidata Query Service/User Manual/MWAPI&lt;/a&gt; page describes how you can call the &lt;a href=&#34;https://www.mediawiki.org/wiki/API:Main_page&#34;&gt;MediaWiki API&lt;/a&gt; search functions by using special property functions (that is, properties that instruct the query engine to execute certain special functions).&lt;/p&gt;
&lt;p&gt;This API is definitely one of those topics where reviewing the examples will get you started more quickly than trying to read the actual documentation. Their first SPARQL query search example, &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI#Find_all_entities_with_labels_%22cheese%22_and_get_their_types&#34;&gt;Find all entities with labels &amp;ldquo;cheese&amp;rdquo; and get their types&lt;/a&gt;, searches Wikipedia for entries that have &amp;ldquo;cheese&amp;rdquo; in one of their labels such as the page title or alternative names.&lt;/p&gt;
&lt;p&gt;The key difference in the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI#Find_articles_in_Wikipedia&#34;&gt;Find articles in Wikipedia&lt;/a&gt; example that follows the first cheese example is that its fifth line uses the property function &lt;code&gt;mwapi:srsearch&lt;/code&gt; as a predicate instead of &lt;code&gt;mwapi:search&lt;/code&gt;, telling the query to search the contents of all of the English (note the &amp;ldquo;.en&amp;rdquo; on the fourth line) Wikipedia pages. You can try that example yourself to do a full-text search for &amp;ldquo;cheese&amp;rdquo;. I did a similar search for &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%2a%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Amwapi%20%7B%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Aapi%20%22Search%22%20.%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Aendpoint%20%22en.wikipedia.org%22%20.%0A%20%20%20%20%20%20bd%3AserviceParam%20mwapi%3Asrsearch%20%22Darius%20Milhaud%20Burt%20Bacharach%22%20.%0A%20%20%20%20%20%20%3Ftitle%20wikibase%3AapiOutput%20mwapi%3Atitle%20.%0A%20%20%7D%0A%7D&#34;&gt;Darius Milhaud Burt Bacharach&lt;/a&gt; because I&amp;rsquo;ve recently been fascinated by the connections between Milhaud, a French composer who rose to prominence in the 1920s as a member of &lt;a href=&#34;https://en.wikipedia.org/wiki/Les_Six&#34;&gt;Les Six&lt;/a&gt;, and Bacharach, one of the greatest pop songwriters of the 1960s. (Listening to some Milhaud once, it struck me as odd that his use of horns would remind me of some Bacharach songs and arrangements until I found out that the author of &amp;ldquo;The Look of Love&amp;rdquo;, &amp;ldquo;Walk on By&amp;rdquo;, and &amp;ldquo;I Say a Little Prayer&amp;rdquo; studied with Milhaud in the 1940s at McGill University.) 
This query certainly doesn&amp;rsquo;t need the &amp;ldquo;LIMIT 20&amp;rdquo; at the end like the full-text search for &amp;ldquo;cheese&amp;rdquo; does, because these two guys don&amp;rsquo;t get mentioned on the same page as often as cheese gets mentioned, but it does return an interesting set of pages.&lt;/p&gt;
&lt;h2 id=&#34;idm28&#34;&gt;Subclass inferencing with Wikidata&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m still surprised at how many people use RDF without adding any schema information, or worse, without using schema information that&amp;rsquo;s already there. Wikidata provides plenty for us, and while the Blazegraph instance used as the back end to its SPARQL engine does not have its RDFS &lt;a href=&#34;https://www.bobdc.com/blog/trying-out-blazegraph&#34;&gt;inferencing&lt;/a&gt; capabilities turned on&amp;ndash;understandably, because queries that take advantage of this ask more of a processor and could therefore hamper scalability&amp;ndash;a nice property path trick does let us ask for all the instances of a particular class and of its subclasses. This wasn&amp;rsquo;t even mentioned in the &amp;ldquo;Getting the Most out of Wikidata&amp;rdquo; paper, but a mention of how Wikidata uses &lt;code&gt;owl:ObjectProperty&lt;/code&gt; inspired me to dig more into its data modeling, and I came up with this.&lt;/p&gt;
&lt;p&gt;The following (try it &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%28count%28%2a%29%20as%20%3Finstances%29%20WHERE%20%20%7B%0A%20%20%3Finstance%20wdt%3AP31%20wd%3AQ473708%20%20%20%20%20%23%20Instance%20has%20a%20type%20of%20%22home%20computers%22%0A%7D%0A&#34;&gt;here&lt;/a&gt;) shows that Wikidata currently has data about 125 instances of home computer models:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT (count(*) as ?instances) WHERE  {
  ?instance wdt:P31 wd:Q473708     # Instance has a type of &amp;quot;home computers&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This next query (try it &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%28COUNT%28%2a%29%20AS%20%3Finstances%29%20WHERE%20%7B%0A%20%20%3Finstance%20wdt%3AP31%20%3Fclass.%0A%20%20%3Fclass%20wdt%3AP279%20wd%3AQ473708.%0A%7D%0A&#34;&gt;here&lt;/a&gt;) shows that there are 28 instances of classes that are a direct subclass of &amp;ldquo;home computers&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279 wd:Q473708.     # wdt:P279: subclass of 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Merely adding the property path asterisk operator to &lt;code&gt;wdt:P279&lt;/code&gt; tells the query engine to find instances of the home computer class and also instances of any class in the subclass tree below it (try it &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%28COUNT%28%2a%29%20AS%20%3Finstances%29%20WHERE%20%7B%0A%20%20%3Finstance%20wdt%3AP31%20%3Fclass.%0A%20%20%3Fclass%20wdt%3AP279%2a%20wd%3AQ473708.%0A%7D%0A&#34;&gt;here&lt;/a&gt;) and it finds 154 of them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279* wd:Q473708.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with regular expressions, the asterisk means &amp;ldquo;0 or more steps away,&amp;rdquo; so that instances of wd:Q473708 would be counted along with instances of classes from its subclass tree. Using a plus sign instead would have meant &amp;ldquo;1 or more steps away,&amp;rdquo; so that query would not have counted direct instances of wd:Q473708 itself.&lt;/p&gt;
&lt;p&gt;The ability to use class relationships to identify potentially useful data is just one example of how schema metadata adds value to data. And, we get more than just these additional instances; we get additional class names that tell us more about these instances. For example, we can find that the &lt;a href=&#34;https://www.wikidata.org/wiki/Q55267838&#34;&gt;Thomson MO5-CnAM 43737&lt;/a&gt; computer is an instance of the class &lt;a href=&#34;https://www.wikidata.org/wiki/Q2396081&#34;&gt;Thomson MO5&lt;/a&gt;, which is a subclass of &lt;a href=&#34;https://www.wikidata.org/wiki/Q3095025&#34;&gt;MOTO Gamme&lt;/a&gt;, which is a subclass of &lt;a href=&#34;https://www.wikidata.org/wiki/Q473708&#34;&gt;home computer&lt;/a&gt;.&lt;/p&gt;
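&lt;p&gt;If the asterisk semantics still seem abstract, here is a toy Python model of them (the class and instance names are invented): computing the subclass closure by hand shows exactly why the starred query counts more instances than the plus version would:&lt;/p&gt;

```python
# Sketch: what wdt:P279* versus wdt:P279+ computes, modeled as a plain
# transitive closure over a toy class tree. All names are made up.

subclass_of = {            # child -> parent: one wdt:P279 step
    "ThomsonMO5": "MOTOGamme",
    "MOTOGamme": "HomeComputer",
    "Amiga500": "HomeComputer",
}
instance_of = {            # instance -> class: one wdt:P31 step
    "unit1": "ThomsonMO5",
    "unit2": "Amiga500",
    "unit3": "HomeComputer",
}

def classes_under(root):
    """All classes reachable by zero or more subclass steps down from root."""
    found = {root}                       # zero steps: the root itself
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

star = classes_under("HomeComputer")     # P279*: includes the root class
plus = star - {"HomeComputer"}           # P279+: at least one step away

count_star = sum(1 for c in instance_of.values() if c in star)
count_plus = sum(1 for c in instance_of.values() if c in plus)
print(count_star, count_plus)            # the plus count misses unit3
```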
&lt;h2 id=&#34;idm49&#34;&gt;And more&lt;/h2&gt;
&lt;p&gt;Some other nice things I learned about in the paper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The use of &lt;code&gt;wikibase:around&lt;/code&gt; and &lt;code&gt;wikibase:box&lt;/code&gt; for additional kinds of geographic queries in addition to the ability to search within a city&amp;rsquo;s limits as I described &lt;a href=&#34;https://www.bobdc.com/blog/dividing-and-conquering-sparql&#34;&gt;in July&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A list of additional endpoints that you can use in federated queries sent to Wikidata.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Support for Blazegraph&amp;rsquo;s graph traversal features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Multiple live Grafana dashboards about Wikidata usage such as data about &lt;a href=&#34;https://grafana.wikimedia.org/dashboard/db/wikidata-special-entitydata?refresh=30m&amp;amp;orgId=1&#34;&gt;agents and formats requested&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you&amp;rsquo;re interested in SPARQL, Wikidata, or especially the combination, you&amp;rsquo;ll learn some fascinating things from this paper.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>Panic over &#34;superhuman&#34; AI</title>
      <link>https://www.bobdc.com/blog/panic-over-superhuman-ai/</link>
      <pubDate>Sun, 23 Sep 2018 11:27:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/panic-over-superhuman-ai/</guid>
      
      
      <description><div>Robot overlords not on the way.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.imdb.com/title/tt2145829/&#34;&gt;&lt;img id=&#34;idm45830488598352&#34; src=&#34;https://www.bobdc.com/img/main/robotoverlords.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Robot Overlords movie poster&#34; width=&#34;320&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When someone describes their worries about AI taking over the world, I usually think to myself &amp;ldquo;I recently bookmarked a good article about why this is silly and I should point this person to it&amp;rdquo;, but in that instant I can&amp;rsquo;t remember what the article was. I recently re-read a few and thought I&amp;rsquo;d summarize them here in case anyone wants to point their friends to some sensible discussions of why such worries are unfounded.&lt;/p&gt;
&lt;h2 id=&#34;the-impossibility-of-intelligence-explosionhttpsmediumcomfrancoischolletthe-impossibility-of-intelligence-explosion-5be4a9eda6ec-by-françois-chollet&#34;&gt;&lt;a href=&#34;https://medium.com/@francois.chollet/the-impossibility-of-intelligence-explosion-5be4a9eda6ec&#34;&gt;The impossibility of intelligence explosion&lt;/a&gt; by François Chollet&lt;/h2&gt;
&lt;p&gt;Chollet is an AI researcher at Google and the author of the Keras deep learning framework and the Manning books &amp;ldquo;Deep Learning with Python&amp;rdquo; and &amp;ldquo;Deep Learning with R&amp;rdquo;. Like some of the other articles covered here, his piece takes on the idea that we will someday build an AI system that can build a better one on its own, and then that one will build a better one, and so on until the &lt;a href=&#34;https://en.wikipedia.org/wiki/Technological_singularity&#34;&gt;singularity&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;His outline gives you a general idea of his line of reasoning; the bulleted lists in his last two sections are also good:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A flawed reasoning that stems from a misunderstanding of intelligence&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Intelligence is situational&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Our environment puts a hard limit on our individual intelligence&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Most of our intelligence is not in our brain, it is externalized as our civilization&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An individual brain cannot implement recursive intelligence augmentation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What we know about recursively self-improving systems&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Conclusions&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One especially nice paragraph:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In particular, there is no such thing as &amp;ldquo;general&amp;rdquo; intelligence. On an abstract level, we know this for a fact via the &amp;ldquo;no free lunch&amp;rdquo; theorem &amp;ndash; stating that no problem-solving algorithm can outperform random chance across all possible problems. If intelligence is a problem-solving algorithm, then it can only be understood with respect to a specific problem. In a more concrete way, we can observe this empirically in that all intelligent systems we know are highly specialized. The intelligence of the AIs we build today is hyper specialized in extremely narrow tasks &amp;ndash; like playing Go, or classifying images into 10,000 known categories. The intelligence of an octopus is specialized in the problem of being an octopus. The intelligence of a human is specialized in the problem of being human.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;idm45830488584656&#34;&gt;&lt;a href=&#34;https://www.theguardian.com/technology/2018/jul/25/ai-artificial-intelligence-social-media-bots-wrong&#34;&gt;&amp;lsquo;The discourse is unhinged&amp;rsquo;: how the media gets AI alarmingly wrong&lt;/a&gt; by Oscar Schwartz&lt;/h2&gt;
&lt;p&gt;This &lt;a href=&#34;https://www.theguardian.com&#34;&gt;Guardian&lt;/a&gt; piece focuses on how the media encourages silly thinking about the future of AI. As the article&amp;rsquo;s subtitle tells us,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Social media has allowed self-proclaimed &amp;lsquo;AI influencers&amp;rsquo; who do nothing more than paraphrase Elon Musk to cash in on this hype with low-quality pieces. The result is dangerous.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Much of the article focuses on the efforts of Zachary Lipton, a machine learning assistant professor at Carnegie Mellon, to call out bad journalism on the topic. One example is an article that I was also guilty of taking too seriously: Fast Company&amp;rsquo;s &lt;a href=&#34;https://www.fastcompany.com/90132632/ai-is-inventing-its-own-perfect-languages-should-we-let-it&#34;&gt;AI Is Inventing Languages Humans Can&amp;rsquo;t Understand. Should We Stop It?&lt;/a&gt; The actual &amp;ldquo;language&amp;rdquo; was just overly repetitive sentences made possible by recursive grammar rules, which I had &lt;a href=&#34;http://www.snee.com/bob/docs/solfish.pdf&#34;&gt;experienced myself&lt;/a&gt; many years ago doing a LISP-based project for a Natural Language Processing course. Schwartz quotes the Sun article &lt;a href=&#34;https://www.thesun.co.uk/tech/4141624/facebook-robots-speak-in-their-own-language/&#34;&gt;Facebook shuts off AI experiment after two robots begin speaking in their OWN language only they can understand&lt;/a&gt; as saying that the incident &amp;ldquo;closely resembled the plot of The Terminator in which a robot becomes self-aware and starts waging a war on humans&amp;rdquo;. (The Sun article also says &amp;ldquo;Experts have called the incident exciting but also incredibly scary&amp;rdquo;; according to the Guardian article, &amp;ldquo;These findings were considered to be fairly interesting by other experts in the field, but not totally surprising or groundbreaking&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;Schwartz&amp;rsquo;s piece describes how the term &amp;ldquo;electronic brain&amp;rdquo; is as old as electronic computers, and how overhyped media coverage of machines that &amp;ldquo;think&amp;rdquo; as far back as the 1940s led to inflated expectations about AI that greatly contributed to the several &lt;a href=&#34;https://en.wikipedia.org/wiki/AI_winter&#34;&gt;AI winters&lt;/a&gt; we&amp;rsquo;ve had since then.&lt;/p&gt;
&lt;h2 id=&#34;idm45830488576080&#34;&gt;&lt;a href=&#34;https://www.ben-evans.com/benedictevans/2018/06/22/ways-to-think-about-machine-learning-8nefy&#34;&gt;Ways to Think About Machine Learning&lt;/a&gt; by Benedict Evans&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re going to read only one of the articles I describe here all the way through, I recommend this one. I don&amp;rsquo;t listen to every episode of the &lt;a href=&#34;https://a16z.com/podcasts/&#34;&gt;a16z podcast&lt;/a&gt;, but I do listen to every one that includes Benedict Evans (this week&amp;rsquo;s episode, on &lt;a href=&#34;https://a16z.com/2018/09/17/hallway-conversation-tesla-disruption/&#34;&gt;Tesla and the Nature of Disruption&lt;/a&gt;, was typically excellent), and I have subscribed to his &lt;a href=&#34;https://www.ben-evans.com/newsletter/&#34;&gt;newsletter&lt;/a&gt; for years. He&amp;rsquo;s a sharp guy with sensible attitudes about how technologies and societies fit together and where it may lead.&lt;/p&gt;
&lt;p&gt;One theme of many of the articles I describe here is the false notion that intelligence is a single thing that can be measured on a one-dimensional scale. As Evans puts it,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This gets to the heart of the most common misconception that comes up in talking about machine learning - that it is in some way a single, general purpose thing, on a path to HAL 9000, and that Google or Microsoft have each built *one*, or that Google &amp;lsquo;has all the data&amp;rsquo;, or that IBM has an actual thing called &amp;lsquo;Watson&amp;rsquo;. Really, this is always the mistake in looking at automation: with each wave of automation, we imagine we&amp;rsquo;re creating something anthropomorphic or something with general intelligence. In the 1920s and 30s we imagined steel men walking around factories holding hammers, and in the 1950s we imagined humanoid robots walking around the kitchen doing the housework. We didn&amp;rsquo;t get robot servants - we got washing machines.&lt;/p&gt;
&lt;p&gt;Washing machines &lt;em&gt;are&lt;/em&gt; robots, but they&amp;rsquo;re not &amp;lsquo;intelligent&amp;rsquo;. They don&amp;rsquo;t know what water or clothes are. Moreover, they&amp;rsquo;re not general purpose even in the narrow domain of washing - you can&amp;rsquo;t put dishes in a washing machine, nor clothes in a dishwasher (or rather, you can, but you won&amp;rsquo;t get the result you want). They&amp;rsquo;re just another kind of automation, no different conceptually to a conveyor belt or a pick-and-place machine. Equally, machine learning lets us solve classes of problem that computers could not usefully address before, but each of those problems will require a different implementation, and different data, a different route to market, and often a different company. Each of them is a piece of automation. Each of them is a washing machine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After bringing up relational databases as a point of comparison for what new technology can do (&amp;ldquo;Relational databases gave us Oracle, but they also gave us SAP, and SAP and its peers gave us global just-in-time supply chains - they gave us Apple and Starbucks&amp;rdquo;), he asks &amp;ldquo;What, then, are the washing machines of machine learning, for real companies?&amp;rdquo; He offers some good suggestions, some of which can be summarized as &amp;ldquo;AI will allow the automation of more things&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;He also discusses low-hanging fruit for what new things AI may automate. As an excellent followup to that, I recommend Kathryn Hume&amp;rsquo;s Harvard Business Review article &lt;a href=&#34;https://hbr.org/2017/10/how-to-spot-a-machine-learning-opportunity-even-if-you-arent-a-data-scientist&#34;&gt;How to Spot a Machine Learning Opportunity, Even If You Aren&amp;rsquo;t a Data Scientist&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;idm45830488564368&#34;&gt;&lt;a href=&#34;https://www.wired.com/2017/04/the-myth-of-a-superhuman-ai/&#34;&gt;The Myth of a Superhuman AI&lt;/a&gt; by Kevin Kelly&lt;/h2&gt;
&lt;p&gt;In this Wired article by one of the magazine&amp;rsquo;s founders, after a discussion of some of the panicky scenarios out there, we read that &amp;ldquo;buried in this scenario of a takeover of superhuman artificial intelligence are five assumptions which, when examined closely, are not based on any evidence&amp;rdquo;. He lists them, then lists five &amp;ldquo;heresies [that] have more evidence to support them&amp;rdquo;; these five provide the structure for the rest of his piece:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Intelligence is not a single dimension, so &amp;ldquo;smarter than humans&amp;rdquo; is a meaningless concept.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Humans do not have general purpose minds, and neither will AIs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Emulation of human thinking in other media will be constrained by cost.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dimensions of intelligence are not infinite.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Intelligences are only one factor in progress.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A good point about how artificial general intelligence is not something to worry about makes a nice analogy with artificial flight:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When we invented artificial flying we were inspired by biological modes of flying, primarily flapping wings. But the flying we invented &amp;ndash; propellers bolted to a wide fixed wing &amp;ndash; was a new mode of flying unknown in our biological world. It is alien flying. Similarly, we will invent whole new modes of thinking that do not exist in nature. In many cases they will be new, narrow, &amp;ldquo;small,&amp;rdquo; specific modes for specific jobs &amp;ndash; perhaps a type of reasoning only useful in statistics and probability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(This reminds me of Evans writing &amp;ldquo;We didn&amp;rsquo;t get robot servants - we got washing machines&amp;rdquo;.) Another good metaphor is Kelly&amp;rsquo;s comparison of attitudes about superhuman AI with &lt;a href=&#34;https://en.wikipedia.org/wiki/Cargo_cult&#34;&gt;cargo cults&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is possible that superhuman AI could turn out to be another cargo cult. A century from now, people may look back to this time as the moment when believers began to expect a superhuman AI to appear at any moment and deliver them goods of unimaginable value. Decade after decade they wait for the superhuman AI to appear, certain that it must arrive soon with its cargo.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;idm45830488554880&#34;&gt;&lt;a href=&#34;https://www.businessinsider.com/myths-misconceptions-about-artificial-intelligence-2015-9&#34;&gt;19 A.I. experts reveal the biggest myths about robots&lt;/a&gt; by Guia Marie Del Prado&lt;/h2&gt;
&lt;p&gt;This Business Insider piece is almost three years old but still relevant. Most of the experts it quotes are actual computer scientist professors, so you get much more sober assessments than you&amp;rsquo;ll see in the panicky articles out there. Here&amp;rsquo;s a good one from Berkeley computer scientist Stuart Russell:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The most common misconception is that what AI people are working towards is a conscious machine, that until you have a conscious machine there&amp;rsquo;s nothing to worry about. It&amp;rsquo;s really a red herring.&lt;/p&gt;
&lt;p&gt;To my knowledge, nobody, no one who is publishing papers in the main field of AI, is even working on consciousness. I think there are some neuroscientists who are trying to understand it, but I&amp;rsquo;m not aware that they&amp;rsquo;ve made any progress.&lt;/p&gt;
&lt;p&gt;As far as AI people, nobody is trying to build a conscious machine, because no one has a clue how to do it, at all. We have less clue about how to do that than we have about building a faster-than-light spaceship.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From Pieter Abbeel, another Berkeley computer scientist:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In robotics there is something called Moravec&amp;rsquo;s Paradox: &amp;ldquo;It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;This is well appreciated by researchers in robotics and AI, but can be rather counter-intuitive to people not actively engaged in the field.&lt;/p&gt;
&lt;p&gt;Replicating the learning capabilities of a toddler could very well be the most challenging problem for AI, even though we might not typically think of a one-year-old as the epitome of intelligence.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I was happy to see the article quote NYU&amp;rsquo;s Ernie Davis, whose AI class I took over 20 years ago while working on my master&amp;rsquo;s degree there. (Reviewing my class notebook I see a lot of LISP and Prolog code, so things have changed a lot.)&lt;/p&gt;
&lt;p&gt;This article implicitly offers a nice guideline for deciding when to take predictions about the future of AI seriously: is the person making them a computer scientist familiar with the actual work going on lately? If they&amp;rsquo;re experts in other fields engaging in science fiction riffing (or as the Guardian article put it more cleverly, paraphrasing Elon Musk), take it all with a big grain of salt.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t mean to imply that the progress of technologies labeled as &amp;ldquo;Artificial Intelligence&amp;rdquo; has no potential problems to worry about. Just as automobiles and chain saws and a lot of other technology invented over the years can do harm as well as good, the new power brought by advanced processors, storage, and memory can be misused intentionally or accidentally, so it&amp;rsquo;s important to think through all kinds of scenarios when planning for the future. In fact, this is all the more reason not to worry about sentient machines: as the Guardian piece quotes Lipton, &amp;ldquo;There are policymakers earnestly having meetings to discuss the rights of robots when they should be talking about discrimination in algorithmic decision making. But this issue is terrestrial and sober, so not many people take an interest.&amp;rdquo; Sensible stuff to keep in mind.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
      <category domain="https://www.bobdc.com//categories/technology-future">technology, future</category>
      
    </item>
    
    <item>
      <title>Pipelining SPARQL queries in memory with the rdflib Python library</title>
      <link>https://www.bobdc.com/blog/pipelining-sparql-queries-in-m/</link>
      <pubDate>Mon, 27 Aug 2018 08:55:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/pipelining-sparql-queries-in-m/</guid>
      
      
      <description><div>Using retrieved data to make more queries.</div><div>&lt;img id=&#34;idm45434532866288&#34; src=&#34;https://www.bobdc.com/img/main/pipelines.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;&#34;/&gt;
&lt;p&gt;Last month in &lt;a href=&#34;https://www.bobdc.com/blog/dividing-and-conquering-sparql&#34;&gt;Dividing and conquering SPARQL endpoint retrieval&lt;/a&gt; I described how you can avoid timeouts for certain kinds of SPARQL endpoint queries by first querying for the resources that you want to know about and then querying for more data about those resources a subset at a time using the VALUES keyword. (The example query retrieved data, including the latitude and longitude, about points within a specified city.) I built my demo with some shell scripts, some Perl scripts, and a bit of spit and glue.&lt;/p&gt;
&lt;p&gt;I started playing with &lt;a href=&#34;https://github.com/RDFLib/rdflib&#34;&gt;RDFLib&amp;rsquo;s&lt;/a&gt; SPARQL capabilities a few years ago as I put together the demo for &lt;a href=&#34;https://www.bobdc.com/blog/driving-hadoop-data-integratio&#34;&gt;Driving Hadoop data integration with standards-based models instead of code&lt;/a&gt;. I was pleasantly surprised to find out how easily it could run a CONSTRUCT query on triples stored in memory and then pass the result on to one or more additional queries, letting you pipeline a series of such queries with no disk I/O. Applying these techniques to replace my shell scripts and Perl scripts from last month showed me that these same techniques could be used for all kinds of RDF applications.&lt;/p&gt;
&lt;p&gt;When I was at &lt;a href=&#34;https://www.topquadrant.com/&#34;&gt;TopQuadrant&lt;/a&gt; I got to know &lt;a href=&#34;https://www.topquadrant.com/technology/sparqlmotion/&#34;&gt;SPARQLMotion&lt;/a&gt;, their (proprietary) drag-and-drop system for pipelining components that can do this sort of thing. RDFLib offers several graph manipulation methods that can extend what I&amp;rsquo;ve done here to do many additional SPARQLMotion-ish things. When I recently &lt;a href=&#34;https://twitter.com/bobdc/status/1023234084091453440&#34;&gt;asked&lt;/a&gt; about other pipeline component-based RDF development tools out there, I learned of &lt;a href=&#34;https://etl.linkedpipes.com/&#34;&gt;Linked Pipes ETL&lt;/a&gt;, &lt;a href=&#34;https://github.com/usc-isi-i2/Web-Karma&#34;&gt;Karma&lt;/a&gt;, &lt;a href=&#34;https://github.com/StataBS/ld-pipeline&#34;&gt;ld-pipeline&lt;/a&gt;, &lt;a href=&#34;https://github.com/vivo-project/VIVO-Harvester&#34;&gt;VIVO Harvester&lt;/a&gt;, &lt;a href=&#34;http://silkframework.org/&#34;&gt;Silk&lt;/a&gt;, &lt;a href=&#34;https://github.com/UnifiedViews&#34;&gt;UnifiedViews&lt;/a&gt;, and a PoolParty &lt;a href=&#34;https://www.poolparty.biz/unifiedviews/&#34;&gt;framework around Unified Views&lt;/a&gt;. I hope to check out as many of them as I can in the future, but with the functions I&amp;rsquo;ve written for my new Python script, I can now accomplish so much with so little Python code that my motivation to go looking beyond that is diminishing&amp;ndash;especially considering that when doing it this way, I have all of Python&amp;rsquo;s abilities to manipulate strings and data structures standing by in case I need them.&lt;/p&gt;
&lt;p&gt;For me, the two most basic RDF tasks to augment the general Python capabilities are retrieval of triples from a remote endpoint for local storage and querying of locally stored triples. RDFLib makes the latter easy. For the former I was &lt;a href=&#34;https://twitter.com/bobdc/status/1018127733459767296&#34;&gt;looking for a library&lt;/a&gt;, but Jindřich Mynarz &lt;a href=&#34;https://twitter.com/jindrichmynarz/status/1018134809544134656&#34;&gt;pointed out&lt;/a&gt; that no specialized library was necessary; he even showed me the &lt;a href=&#34;https://gist.github.com/jindrichmynarz/a947af0712682e0d23584719f0b6b400&#34;&gt;basic code&lt;/a&gt; to make it happen. (I swear I had tried a few times before posting the question on Twitter, so the brevity and elegance of his example were a bit embarrassing for me.)&lt;/p&gt;
&lt;p&gt;You can find my new Python script to replace last month&amp;rsquo;s work &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/pythonsparql/pipelining.py&#34;&gt;on github&lt;/a&gt;. More than half of it is made up of the actual SPARQL queries being stored in variables. This is a good thing, because it means that the Python instructions (to retrieve triples from the endpoint, to load up the local graph with retrieved triples, to query that graph, and to build and then run new queries based on those query results) all together take up less than half of the script. In other words, the script is more about the queries than about the code to execute them.&lt;/p&gt;
&lt;p&gt;The main part of the script isn&amp;rsquo;t very long:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Get the qnames for the geotagged entities within the city and store in graph g. 


queryRetrieveGeoPoints = queryRetrieveGeoPoints.replace(&amp;quot;CITY-QNAME&amp;quot;,cityQname)
url = endpoint + &amp;quot;?&amp;quot; + urllib.urlencode({&amp;quot;query&amp;quot;: queryRetrieveGeoPoints})
g.parse(url)
logging.info(&#39;Triples in graph g after queryRetrieveGeoPoints: &#39; + str(len(g)))


# 2. Take the subjects in graph g and create queries with a VALUES clause 
#    of up to maxValues of the subjects. 


subjectQueryResults = g.query(queryListSubjects)
splitAndRunRemoteQuery(&amp;quot;querySubjectData&amp;quot;,subjectQueryResults,
                       entityDataQueryHeader,entityDataQueryFooter)


# 3. See what classes are used and get their names and those of their superclasses.
classList = g.query(listClassesQuery)
splitAndRunRemoteQuery(&amp;quot;queryGetClassInfo&amp;quot;,classList,
                       queryGetClassesHeader,queryGetClassesFooter)


# 4. See what objects need labels and get them.
objectsThatNeedLabel = g.query(queryObjectsThatNeedLabel)
splitAndRunRemoteQuery(&amp;quot;queryObjectsThatNeedLabel&amp;quot;,objectsThatNeedLabel,
                       queryGetObjectLabelsHeader,queryGetObjectLabelsFooter)


print(g.serialize(format = &amp;quot;n3&amp;quot;))   # (Actually Turtle, which is what we want, not n3.)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;splitAndRunRemoteQuery&lt;/code&gt; function was one I wrote based on my prototype from last month.&lt;/p&gt;
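&lt;p&gt;The batching half of that function can be sketched in a few lines of plain Python (a hypothetical reconstruction, not the actual code from the script; the real function also sends each generated query to the endpoint and loads the results into the graph):&lt;/p&gt;

```python
# Hypothetical sketch of the query-splitting step: wrap successive batches
# of subject URIs in VALUES clauses between a query header and footer.
def build_batched_queries(subjects, header, footer, max_values=50):
    queries = []
    for i in range(0, len(subjects), max_values):
        batch = subjects[i:i + max_values]
        values = "\n".join("<{0}>".format(uri) for uri in batch)
        queries.append(header + "\nVALUES ?s {\n" + values + "\n}\n" + footer)
    return queries

uris = ["http://www.wikidata.org/entity/Q{0}".format(n) for n in range(120)]
qs = build_batched_queries(uris, "CONSTRUCT { ?s ?p ?o } WHERE {", "?s ?p ?o . }")
print(len(qs))  # 120 subjects in batches of 50 -> 3 queries
```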
&lt;p&gt;I first used RDFLib over &lt;a href=&#34;https://www.xml.com/pub/a/2003/02/12/rdflib.html&#34;&gt;15 years ago&lt;/a&gt;, when SPARQL hadn&amp;rsquo;t even been invented yet. Hardcore RDFLib fans will prefer the greater efficiency of its native functions over the use of SPARQL queries, but my goal here was to have SPARQL 1.1 queries drive all the action, and RDFLib supports this very nicely. Its native functions also offer additional capabilities that bring it closer to some of the pipelining things I remember from SPARQLMotion. For example, the &lt;a href=&#34;https://rdflib.readthedocs.io/en/stable/intro_to_graphs.html#set-operations-on-rdflib-graphs&#34;&gt;set operations on graphs&lt;/a&gt; let you perform actions such as unions, intersections, differences, and XORs of graphs, which can be handy when mixing and matching data from multiple sources to massage that data into a single cleaned-up dataset&amp;ndash;just the kind of thing that makes RDF so great in the first place.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Picture by &lt;a href=&#34;https://www.flickr.com/photos/mikecogh/&#34;&gt;Michael Coghlan&lt;/a&gt; on &lt;a href=&#34;https://www.flickr.com/photos/mikecogh/11429811244/&#34;&gt;Flickr&lt;/a&gt; (CC BY-SA 2.0)&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Dividing and conquering SPARQL endpoint retrieval</title>
      <link>https://www.bobdc.com/blog/dividing-and-conquering-sparql/</link>
      <pubDate>Sun, 22 Jul 2018 11:52:42 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/dividing-and-conquering-sparql/</guid>
      
      
      <description><div>With the VALUES keyword.</div><div>&lt;img id=&#34;idm46294214072368&#34; src=&#34;https://www.bobdc.com/img/main/neonvalues.png&#34; border=&#34;0&#34; width=&#34;300&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;VALUES neon sign&#34;/&gt;
&lt;p&gt;When I &lt;a href=&#34;https://www.bobdc.com/blog/sparql-11s-new-values-keyword&#34;&gt;first tried SPARQL&amp;rsquo;s VALUES keyword&lt;/a&gt; (at which point it was pretty new to SPARQL, having only recently been added to SPARQL 1.1) I demoed it with a fairly artificial example. I &lt;a href=&#34;http://www.snee.com/bobdc.blog/2013/07/using-values-to-map-values-in.html&#34;&gt;later found&lt;/a&gt; that it solved one particular problem for me by letting me create a little lookup table. Recently, it gave me huge help in one of the most classic SPARQL development problems of all: how to retrieve so much data from an endpoint that the first attempts at that retrieval resulted in timeouts.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries&#34;&gt;Wikidata:SPARQL query service/queries&lt;/a&gt; page includes an excellent Wikidata query to &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries#Query_to_find_latitudes_and_longitudes_for_places_in_Paris&#34;&gt;find latitudes and longitudes for places in Paris&lt;/a&gt;. You can easily modify this query to retrieve from places within other cities, and I wanted to build on this query to make it retrieve additional available data about those places as well. While &lt;a href=&#34;https://www.bobdc.com/blog/the-wikidata-data-model-and-yo&#34;&gt;accounting for the indirection in the Wikidata query model&lt;/a&gt; made this a little more complicated, it wasn&amp;rsquo;t much trouble to write.&lt;/p&gt;
&lt;p&gt;The expanded query worked great for a city like Charlottesville, where I live, but for larger cities, the query was just asking for too much information from the endpoint and timed out. My new idea was to first ask for roughly the same information that the Paris query above does, and to then request additional data about those entities a batch at a time with a series of queries that use the VALUES keyword to specify each batch. (I&amp;rsquo;ve pasted a sample query requesting one batch below.)&lt;/p&gt;
&lt;p&gt;It worked just fine. I put all the queries and other relevant files in a &lt;a href=&#34;http://snee.com/bobdc.blog/files/dividAndConquerWithVALUES.zip&#34;&gt;zip file&lt;/a&gt; for people who want to check it out, but it&amp;rsquo;s probably not worth looking at too closely, because in a month or two I&amp;rsquo;ll be replacing it with a Python version that does everything more efficiently. It&amp;rsquo;s still worth explaining the steps in this version&amp;rsquo;s shell script driver file, because the things I worked out for this prototype effort&amp;ndash;despite its Perl scripting and extensive disk I/O&amp;ndash;mean that the Python version should come together pretty quickly. That&amp;rsquo;s what prototypes are for!&lt;/p&gt;
&lt;h2 id=&#34;idm46294214061168&#34;&gt;The driver shell script&lt;/h2&gt;
&lt;p&gt;Before running the shell script, you specify the Wikidata local name of the city to query near the top of the &lt;code&gt;getCityEntities.rq&lt;/code&gt; SPARQL query file. (This is easier than it sounds&amp;ndash;for example, to do it for Charlottesville, go to its &lt;a href=&#34;https://en.wikipedia.org/wiki/Charlottesville,_Virginia&#34;&gt;Wikipedia page&lt;/a&gt; and click &lt;a href=&#34;https://www.wikidata.org/wiki/Q123766&#34;&gt;Wikidata item&lt;/a&gt; in the menu on the left to find that Q123766 is the local name.)&lt;/p&gt;
&lt;p&gt;Once that&amp;rsquo;s done, running the zip file&amp;rsquo;s &lt;code&gt;getCityData.sh&lt;/code&gt; shell script executes these main steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;It uses a &lt;a href=&#34;https://curl.haxx.se/&#34;&gt;curl&lt;/a&gt; command to send the &lt;code&gt;getCityEntities.rq&lt;/code&gt; CONSTRUCT query to the &lt;a href=&#34;https://query.wikidata.org/sparql&#34;&gt;https://query.wikidata.org/sparql&lt;/a&gt; endpoint. The curl command saves the resulting triples in a file called &lt;code&gt;cityEntities.ttl&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It uses &lt;a href=&#34;https://jena.apache.org/documentation/query/index.html&#34;&gt;ARQ&lt;/a&gt; to run the &lt;code&gt;listSubjects.rq&lt;/code&gt; query on the new &lt;code&gt;cityEntities.ttl&lt;/code&gt; file, specifying that the result should be a TSV file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The results of &lt;code&gt;listSubjects.rq&lt;/code&gt; get piped to a Perl script called &lt;code&gt;makePart2Queries.pl&lt;/code&gt;. This creates a series of CONSTRUCT query files that ask Wikidata for data about entities listed in a VALUES section. It puts 50 entries in each file&amp;rsquo;s VALUES section; this figure of 50 is stored in a &lt;code&gt;$maxLines&lt;/code&gt; variable in &lt;code&gt;makePart2Queries.pl&lt;/code&gt; where it can be reset if the endpoint is still timing out. This step also adds lines to a shell script called &lt;code&gt;callTempQueries.sh&lt;/code&gt;, where each line uses curl to call one of the queries that uses VALUES to request a batch of data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;getCityData.sh&lt;/code&gt; next runs the &lt;code&gt;callTempQueries.sh&lt;/code&gt; shell script to execute all of these new queries, storing the resulting triples in the file &lt;code&gt;tempCityData.ttl&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;tempCityData.ttl&lt;/code&gt; file has plenty of good data, but it can be used to get additional relevant data, so the script&amp;rsquo;s next line runs a query that creates a TSV file with a list of all of the classes found in &lt;code&gt;tempCityData.ttl&lt;/code&gt; triples of the form {?instance wdt:P31 ?class}. (wdt:P31 is the Wikidata equivalent of &lt;code&gt;rdf:type&lt;/code&gt;, indicating that a resource is an instance of a particular class.) That TSV file then drives the creation of a query that gets sent to the SPARQL endpoint to ask about the classes&amp;rsquo; parent and grandparent classes, and that data gets added to &lt;code&gt;tempCityData.ttl&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Another ARQ call in the script uses a local query to check for triple objects in the &lt;a href=&#34;http://www.wikidata.org/entity/&#34;&gt;http://www.wikidata.org/entity/&lt;/a&gt; namespace that don&amp;rsquo;t have &lt;code&gt;rdfs:label&lt;/code&gt; values and get them&amp;ndash;or at least, get the English ones, but it&amp;rsquo;s easy to fix if you want labels in different or additional languages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The script runs one final ARQ query on &lt;code&gt;tempCityData.ttl&lt;/code&gt;: the classic &lt;code&gt;SELECT * WHERE {?s ?p ?o}&lt;/code&gt;. This request for all the triples actually tidies up the Turtle data a bit, storing all the triples with common subjects together. It puts the result in &lt;code&gt;cityData.ttl&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
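&lt;p&gt;The heart of step 3&amp;ndash;chunking a list of subject URIs into VALUES clauses of up to 50 entries&amp;ndash;can be sketched like this (a hypothetical Python equivalent of part of &lt;code&gt;makePart2Queries.pl&lt;/code&gt;, not the actual Perl code):&lt;/p&gt;

```python
# Sketch of the batching done by makePart2Queries.pl: group subject URIs
# (as read from ARQ's TSV output) into VALUES clauses of up to 50 entries.
max_lines = 50  # plays the role of $maxLines in makePart2Queries.pl

def values_clauses(subject_uris, batch_size=max_lines):
    for i in range(0, len(subject_uris), batch_size):
        batch = subject_uris[i:i + batch_size]
        yield "VALUES ?s { " + " ".join(batch) + " }"

subjects = ["<http://www.wikidata.org/entity/Q{0}>".format(n) for n in range(105)]
clauses = list(values_clauses(subjects))
print(len(clauses))  # 105 subjects -> batches of 50, 50, and 5
```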
&lt;p&gt;One running theme of some of the shell script&amp;rsquo;s steps is the retrieval of labels associated with &lt;a href=&#34;https://www.w3.org/TeamSubmission/turtle/#qname&#34;&gt;qnames&lt;/a&gt;. Wikidata has a lot of triples like {wd:Q69040 wd:P361 wd:Q16950} that are just three qnames, so retrieved data will have more value to applications if people and processes can find out what each qname refers to.&lt;/p&gt;
&lt;p&gt;The main shell script has other housekeeping steps such as recording of the start and end times and deletion of the temporary files. I had more ideas for things to add, but I&amp;rsquo;ll save those for the Python version.&lt;/p&gt;
&lt;p&gt;The Python version won&amp;rsquo;t just be a more efficient version of my use of VALUES to do batch retrievals of data that might otherwise time out. It will demonstrate, more nicely, something that only gets hinted at in this mess of shell and Perl scripts: the ability to automate the generation of SPARQL queries that build on the results of previously executed queries so that they can all work together as a pipeline to drive increasingly sophisticated RDF application development.&lt;/p&gt;
&lt;p&gt;Here is a sample of one of the queries created to request data about one batch of entities within the specified city:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX p: &amp;lt;http://www.wikidata.org/prop/&amp;gt; 
PREFIX wgs84: &amp;lt;http://www.w3.org/2003/01/geo/wgs84_pos#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; 


CONSTRUCT
{ ?s ?p ?o. 
  ?s ?p1 ?o1 . 
  ?s wgs84:lat ?lat . 
  ?s wgs84:long ?long .
  ?p rdfs:label ?pname .
  ?s wdt:P31 ?class .   
}
WHERE {
  VALUES ?s {
&amp;lt;http://www.wikidata.org/entity/Q42537129&amp;gt;
&amp;lt;http://www.wikidata.org/entity/Q30272197&amp;gt;
# about 48 more of those here...
}
  # wdt:P131 means &#39;located in the administrative territorial entity&#39; .
  ?s wdt:P131+ ?geoEntityWikidataID .  
      ?s p:P625 ?statement . # coordinate-location statement
  ?statement psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLatitude ?lat .
  ?coordinate_node wikibase:geoLongitude ?long .


  # Reduce the indirection used by Wikidata triples. Based on Tommy Potter query
  # at http://www.snee.com/bobdc.blog/2017/04/the-wikidata-data-model-and-yo.html.
  ?s ?directClaimP ?o .                   # Get the truthy triples. 
  ?p wikibase:directClaim ?directClaimP . # Find the wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples&#39; predicates.


  # the following VALUES clause is actually faster than just
  # having specific triple patterns for those 3 p1 values.
  ?s ?p1 ?o1 .
  VALUES ?p1 {
    schema:description
    rdfs:label        
    skos:altLabel
  }


  ?s wdt:P31 ?class . # Class membership. Pull this and higher level classes out in later query.

  
  # If only English names desired
  FILTER (isURI(?o1) || lang(?o1) = &#39;en&#39; )
  # For English + something else, follow this pattern: 
  # FILTER (isURI(?o1) || lang(?o1) = &#39;en&#39; || lang(?o1) = &#39;de&#39;)


  FILTER(lang(?pname) = &#39;en&#39;)
}  
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Neon sign picture by &lt;a href=&#34;https://www.flickr.com/photos/jeremybrooks/&#34;&gt;Jeremy Brooks&lt;/a&gt; on &lt;a href=&#34;https://www.flickr.com/photos/jeremybrooks/2633307750/&#34;&gt;Flickr&lt;/a&gt; (CC BY-NC 2.0)&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Running and querying my own Wikibase instance</title>
      <link>https://www.bobdc.com/blog/running-and-querying-my-own-wi/</link>
      <pubDate>Sun, 17 Jun 2018 11:17:14 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/running-and-querying-my-own-wi/</guid>
      
      
      <description><div>Querying it, of course, with SPARQL.</div><div>&lt;blockquote id=&#34;idm46076157065344&#34; class=&#34;pullquote&#34;&gt;Many of us have waited years for an open-source framework that makes the development of web-based RDF applications as easy as Ruby on Rails does for web-based SQL applications. This dockerized version of Wikibase looks like a big step in this direction.&lt;/blockquote&gt;
&lt;p&gt;When Dario Taraborelli &lt;a href=&#34;https://twitter.com/ReaderMeter/status/959921584159866881&#34;&gt;tweeted&lt;/a&gt; about how quickly he got a local wikibase instance and SPARQL endpoint up and running with &lt;a href=&#34;https://github.com/wmde/wikibase-docker&#34;&gt;wikibase-docker&lt;/a&gt;, he inspired me to give it a shot, and it was surprisingly easy and fun.&lt;/p&gt;
&lt;p&gt;I have minimal experience with docker. As instructed by wikibase-docker&amp;rsquo;s &lt;a href=&#34;https://github.com/wmde/wikibase-docker/blob/master/README-compose.md&#34;&gt;README&lt;/a&gt; page, I installed docker and docker-compose. (When I got to the &lt;a href=&#34;https://docs.docker.com/get-started/#test-docker-installation&#34;&gt;Test Docker Installation&lt;/a&gt; part of the &lt;a href=&#34;https://docs.docker.com/get-started/&#34;&gt;Get Started, Part 1: Orientation and setup&lt;/a&gt; page for setting up docker, the hello-world app gave me a &amp;ldquo;permission denied&amp;rdquo; problem, but &lt;a href=&#34;https://techoverflow.net/2017/03/01/solving-docker-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket/&#34;&gt;this solution&lt;/a&gt; described at Techoverflow solved it. I did have to reboot, as it suggested.)&lt;/p&gt;
&lt;p&gt;Continuing along with the wikibase-docker README, when I clicked &amp;ldquo;http://localhost:8181&amp;rdquo; under &lt;a href=&#34;https://github.com/wmde/wikibase-docker/blob/master/README-compose.md#user-content-accessing-your-wikibase-instance-and-the-query-service-ui&#34;&gt;Accessing your Wikibase instance and the Query Service UI&lt;/a&gt; it was pretty cool to see my own local running instance of the wiki:&lt;/p&gt;
&lt;img id=&#34;idm46076157056352&#34; src=&#34;https://www.bobdc.com/img/main/wikidata1.png&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;Moving along in the README, I clicked &amp;ldquo;Create a new item&amp;rdquo; before I clicked &amp;ldquo;Create a new property&amp;rdquo;, but when I saw that the new item&amp;rsquo;s property list offered no choices, I realized that I should define some properties before creating any items. Properties and items can have names, aliases, and descriptions in a wide choice of spoken languages, and Wikibase includes a nice choice of data types.&lt;/p&gt;
&lt;p&gt;After defining a property and creating items that had a value for that property, the &amp;ldquo;Query Service UI @ http://localhost:8282&amp;rdquo; link on the README led to a web form where I could enter a SPARQL query. I entered &lt;code&gt;SELECT * WHERE { ?s ?p ?o}&lt;/code&gt; and saw the default triples that were part of the store as well as triples about the items and property that I had created.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;Get an RDF dump from wikibase&amp;rdquo; docker command on the README page did just fine. Reviewing the triples in its output, I saw that the created entities fit the Wikidata data model described at &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer&#34;&gt;Wikibase/DataModel/Primer&lt;/a&gt;, which I wrote about at &lt;a href=&#34;https://www.bobdc.com/blog/the-wikidata-data-model-and-yo&#34;&gt;The Wikidata data model and your SPARQL queries&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It took me some time (and a &lt;a href=&#34;https://twitter.com/bobdc/status/1001176880798752768&#34;&gt;tweet&lt;/a&gt;) to realize that the &amp;ldquo;Query Service Backend (Behind a proxy)&amp;rdquo; URL listed on the README file was the URL for the SPARQL endpoint. The first query I tried after that worked with no problem:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://localhost:8989/bigdata/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D
&lt;/code&gt;&lt;/pre&gt;
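&lt;p&gt;(A side note that isn&amp;rsquo;t from the README: if you&amp;rsquo;d rather not URL-encode queries by hand, curl&amp;rsquo;s &lt;code&gt;-G&lt;/code&gt; and &lt;code&gt;--data-urlencode&lt;/code&gt; options do the percent-encoding for you. This is the same query as above, just easier to read:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -G http://localhost:8989/bigdata/sparql \
     --data-urlencode 'query=SELECT DISTINCT ?p WHERE { ?s ?p ?o }'
&lt;/code&gt;&lt;/pre&gt;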
&lt;p&gt;It was also easy to access this server from my phone across my home wifi when I substituted the machine&amp;rsquo;s name or IP address for &amp;ldquo;localhost&amp;rdquo; in the URLs above. The web interface was the same on a phone as on a big screen; the MediaWiki project&amp;rsquo;s &lt;a href=&#34;https://www.mediawiki.org/wiki/Manual:Mobiles,_tablets_and_responsive_design&#34;&gt;Mobiles, tablets and responsive design&lt;/a&gt; manual page describes some options for extending the interface. If someone out there is looking for UI work and has some time on their hands, contributing some phone and tablet responsiveness to this open source project would be a great line on their résumé.&lt;/p&gt;
&lt;p&gt;And finally, while the docker version of this is quick to get up and running, if you&amp;rsquo;re going far with your own MediaWiki installation, you&amp;rsquo;ll want to look over the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Installation&#34;&gt;Installation instructions&lt;/a&gt; for the regular, non-docker version.&lt;/p&gt;
&lt;p&gt;After I did these experiments and wrote my first draft of this, I discovered the medium.com posting &lt;a href=&#34;https://medium.com/@thisismattmiller/wikibase-for-research-infrastructure-part-1-d3f640dfad34&#34;&gt;Wikibase for Research Infrastructure &amp;ndash; Part 1&lt;/a&gt; by Pratt Institute librarian and researcher &lt;a href=&#34;http://thisismattmiller.com/&#34;&gt;Matt Miller&lt;/a&gt;. His piece describes a nice use case of following through on creating a Wikibase application and points to some handy Python scripts for automating the creation of classes and other structures from spreadsheets. His use case happens to be one of my favorite available RDF-related data sources: the &lt;a href=&#34;https://linkedjazz.org/&#34;&gt;Linked Jazz Project&lt;/a&gt;. I look forward to Part 2.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s great to have such a comprehensive system running on my local machine, complete with a web interface that lets non-RDF people create and edit any data they want and, for the RDF people, a SPARQL interface to let them pull and manipulate that data. For more serious dataset development, the MediaWiki project includes some &lt;a href=&#34;https://www.mediawiki.org/wiki/Extension:Page_Forms/Quick_start_guide#The_easy_way_-_Special:CreateClass_2&#34;&gt;helpful documentation&lt;/a&gt; about how to define your own classes and associated properties and forms. (&lt;em&gt;July 20th note: that page is actually about &lt;a href=&#34;https://www.semantic-mediawiki.org/wiki/Semantic_MediaWiki&#34;&gt;Semantic MediaWiki&lt;/a&gt;, which I played around with a few years ago&amp;ndash;apparently I didn&amp;rsquo;t keep my notes on that and Wikibase as organized as I should have.&lt;/em&gt;)&lt;/p&gt;
&lt;p&gt;Many of us have waited years for an open-source framework that makes the development of web-based RDF applications as easy as &lt;a href=&#34;https://www.bobdc.com/blog/joining-the-ruby-on-rails-chor&#34;&gt;Ruby on Rails&lt;/a&gt; does for web-based SQL applications. The dockerized version of Wikibase looks like a big step in this direction.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>RDF* and SPARQL*</title>
      <link>https://www.bobdc.com/blog/rdf-and-sparql/</link>
      <pubDate>Mon, 28 May 2018 09:36:59 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdf-and-sparql/</guid>
      
      
      <description><div>Reification can be pretty cool.</div><div>&lt;img id=&#34;idm45504699000944&#34; src=&#34;https://www.bobdc.com/img/main/rdfrdf.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;160&#34; alt=&#34;triple within a triple&#34;/&gt;
&lt;p&gt;After I posted &lt;a href=&#34;https://www.bobdc.com/blog/reification-is-a-red-herring&#34;&gt;Reification is a red herring (and you don&amp;rsquo;t need property graphs to assign data to individual relationships)&lt;/a&gt; last month, I had an amusingly difficult time explaining to my wife how that would generate so much Twitter activity. This month I wanted to make it clear that I&amp;rsquo;m not opposed to reification in and of itself, and I wanted to describe the fun I&amp;rsquo;ve been having playing with Olaf Hartig and Bryan Thompson&amp;rsquo;s RDF* and SPARQL* extensions to these standards to make reification more elegant.&lt;/p&gt;
&lt;p&gt;In that post, I said that in many years of using RDF I&amp;rsquo;ve never needed to use reification because, for most use cases where it was a candidate solution, I was better off using RDFS to declare classes and properties that reflected the use case domain instead of going right to the standard reification syntax (awkward in any standardized serialization) that let me create triples about triples. My soapbox ranting in that post focused on the common argument that the property graph approach of systems like Tinkerpop and Neo4j is better than RDF because achieving similar goals in RDF would require reification; as I showed, it doesn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;But, reification can still be very useful, especially in the world of metadata. (I am slightly jealous of the &lt;a href=&#34;https://journals.ala.org/index.php/lrts/article/view/5557/6839&#34;&gt;metadata librarians&lt;/a&gt; of the world for having the word &amp;ldquo;metadata&amp;rdquo; in their job title&amp;ndash;it sounds even cooler &lt;a href=&#34;https://cbpq.qc.ca/offre-demploi/bibliothecaire-aux-metadonnees&#34;&gt;in Canada&lt;/a&gt;: Bibliothécaire aux métadonnées.) If metadata is data about data, and more and more of the Information Science world is taking advantage of linked data technologies, then triples about triples are bound to be useful for provenance, curation, and all kinds of scholarship about datasets.&lt;/p&gt;
&lt;p&gt;The conclusion of my blog post mentioned how, just as I was finishing it up, I discovered Olaf Hartig and Bryan Thompson&amp;rsquo;s 2014 paper &lt;a href=&#34;https://arxiv.org/pdf/1406.3399.pdf&#34;&gt;Foundations of an Alternative Approach to Reification in RDF&lt;/a&gt; and Blazegraph&amp;rsquo;s implementation of it. I decided to play with this a bit in Blazegraph in order to get a hands-on appreciation of what was possible, and I like it. (Olaf recently &lt;a href=&#34;https://twitter.com/olafhartig/statuses/995040531947409408&#34;&gt;mentioned on Twitter&lt;/a&gt; that these capabilities are being added into &lt;a href=&#34;https://jena.apache.org/&#34;&gt;Apache Jena&lt;/a&gt; as well, so this isn&amp;rsquo;t just a Blazegraph thing.)&lt;/p&gt;
&lt;p&gt;As I described in &lt;a href=&#34;https://www.bobdc.com/blog/trying-out-blazegraph&#34;&gt;Trying out Blazegraph&lt;/a&gt; two years ago, it&amp;rsquo;s pretty simple to download the Blazegraph jar, start it up, load RDF data, and query it. For my RDF* experiments, I started up Blazegraph and created a Blazegraph namespace with a mode of rdr and then did my first few experiments there.&lt;/p&gt;
&lt;p&gt;I started with the examples in Olaf&amp;rsquo;s slides &lt;a href=&#34;http://olafhartig.de/slides/RDFStarInvitedTalkWSP2018.pdf&#34;&gt;RDF* and SPARQL*: An Alternative Approach to Statement-Level Metadata in RDF&lt;/a&gt;. To make the slides visually cleaner, he left out full URIs and prefixes, so I added some to properly see the querying in action. I loaded his slide 15 data into my new Blazegraph namespace, specifying a format of Turtle-RDR. The double brackets that you see here are the RDF* extension that lets us create triples that are themselves resources that we can use as subjects and objects of other triples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; .
&amp;lt;&amp;lt;d:Kubrick d:influencedBy d:Welles&amp;gt;&amp;gt; d:significance 0.8 ;
      d:source &amp;lt;https://nofilmschool.com/2013/08/films-directors-that-influenced-stanley-kubrick&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data tells us that the triple about Kubrick being influenced by Welles has a significance of 0.8 and a source at an article on nofilmschool.com.&lt;/p&gt;
&lt;p&gt;I then executed the following query, based on Olaf&amp;rsquo;s from slide 16, with no problem:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; 
SELECT ?x WHERE {
  &amp;lt;&amp;lt;?x d:influencedBy d:Welles&amp;gt;&amp;gt; d:significance ?sig .
  FILTER (?sig &amp;gt; 0.7)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, the use of the double angle brackets is the SPARQL* extension that lets us do the same thing that this syntax does in RDF*. This query asks for whoever was named as being influenced by Welles in statements that have a significance greater than 0.7. The query worked just fine in Blazegraph.&lt;/p&gt;
&lt;p&gt;SPARQL* also lets you query for the components of triples that are being treated as independent resources. From Olaf&amp;rsquo;s slide 17, this next query asks for whoever was influenced by Welles and the significance and source of any returned statements, and it worked fine with the data above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; 
SELECT ?x ?sig ?src WHERE {
  &amp;lt;&amp;lt;?x d:influencedBy d:Welles&amp;gt;&amp;gt; d:significance ?sig ;
  d:source ?src .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;His slide 18 query returns the same result as that one, but takes the syntax a bit further by binding the triple pattern about someone influenced by Welles to a variable and then querying for that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; 
SELECT ?x ?sig ?src WHERE {
  BIND(&amp;lt;&amp;lt;?x d:influencedBy d:Welles&amp;gt;&amp;gt; AS ?t)
  ?t  d:significance ?sig ;
      d:source ?src .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Moving on to more easy experiments, I found that all the examples on the Blazegraph page &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/Reification_Done_Right&#34;&gt;Reification Done Right&lt;/a&gt; worked exactly as shown there. That page also provides some nice background for ways to use RDF* and SPARQL* in Blazegraph.&lt;/p&gt;
&lt;p&gt;Blazegraph lets you do inferencing, so I couldn&amp;rsquo;t resist mixing that with RDF* and SPARQL*. I had to create a new Blazegraph namespace that not only had a Mode of rdr but also had the &amp;ldquo;Inference&amp;rdquo; box checked upon creation, and then I loaded this data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d:    &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .


&amp;lt;&amp;lt;d:s1 d:p1 d:o1&amp;gt;&amp;gt; a d:Class2 .
&amp;lt;&amp;lt;d:s2 d:p2 d:o2&amp;gt;&amp;gt; a d:Class3 .


d:Class2 rdfs:subClassOf d:Class1 . 
d:Class3 rdfs:subClassOf d:Class1 . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data creates two triples that are themselves resources, one an instance of Class2 and the other an instance of Class3. Two final triples tell us that each of those classes is a subclass of Class1. The following query asked for triples that are instances of Class1, despite the data having no explicit triples about Class1 instances, and Blazegraph did the inferencing and found both of them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://www.learningsparql.com/ns/data/&amp;gt; 
SELECT ?x ?y ?z WHERE {
   &amp;lt;&amp;lt;?x ?y ?z&amp;gt;&amp;gt; a d:Class1 . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After doing this inferencing, I was thinking that OWL metadata and inferencing about such triples should open up a lot of new possibilities, but I realized that none of those possibilities are necessarily new: they&amp;rsquo;ll just be easier to implement than they would have been using the old method of reification that used four triples to represent one. Still, being easier to implement counts for plenty, and I think that metadata librarians and other people doing work to build value around existing triples now have a reasonable syntax and some nice tools to explore this.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Reification is a red herring</title>
      <link>https://www.bobdc.com/blog/reification-is-a-red-herring/</link>
      <pubDate>Sun, 22 Apr 2018 10:14:15 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/reification-is-a-red-herring/</guid>
      
      
      <description><div>And you don&#39;t need property graphs to assign data to individual relationships.</div><div>&lt;blockquote id=&#34;idm45510006274096&#34; class=&#34;pullquote&#34;&gt;RDF&#39;s very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better.&lt;/blockquote&gt;
&lt;p&gt;I recently &lt;a href=&#34;https://twitter.com/bobdc/status/972847454071779328&#34;&gt;tweeted&lt;/a&gt; that the ZDNet article &lt;a href=&#34;http://www.zdnet.com/article/back-to-the-future-does-graph-database-success-hang-on-query-language/&#34;&gt;Back to the future: Does graph database success hang on query language?&lt;/a&gt; was the best overview of the graph database world(s) that I&amp;rsquo;d seen so far, and I also warned that such &amp;ldquo;overviews&amp;rdquo; are often just Neo4j employees plugging their own product. (The Neo4j company is actually called Neo Technology.) The most extreme example of this is the O&amp;rsquo;Reilly book &lt;a href=&#34;https://neo4j.com/lp/book-graph-databases&#34;&gt;Graph Databases&lt;/a&gt;, which is free because it&amp;rsquo;s being given away by its three authors&amp;rsquo; common employer: Neo Technology! The book would have been more accurately titled &amp;ldquo;Building Graph Applications with Cypher&amp;rdquo;, the Neo4j query language. This 238-page book on graph databases manages to mention SPARQL and Gremlin only twice each. The ZDNet article above does a much more balanced job of covering RDF and SPARQL, Gremlin and Tinkerpop, and Cypher and Neo4j.&lt;/p&gt;
&lt;p&gt;The DZone article &lt;a href=&#34;https://dzone.com/articles/rdf-triple-stores-vs-labeled-property-graphs-whats&#34;&gt;RDF Triple Stores vs. Labeled Property Graphs: What&amp;rsquo;s the Difference?&lt;/a&gt; is by another Neo employee, field engineer &lt;a href=&#34;https://neo4j.com/blog/contributor/jesus-barrasa/&#34;&gt;Jesús Barrasa&lt;/a&gt;. It doesn&amp;rsquo;t mention Tinkerpop or Gremlin at all, but does a decent job of describing the different approach that property graph databases such as Neo4j and Tinkerpop take in describing graphs of nodes and edges when compared with RDF triplestores. Its straw man arguments about RDF&amp;rsquo;s supposed deficiencies as a data model reminded me of a common theme I&amp;rsquo;ve seen over the years.&lt;/p&gt;
&lt;p&gt;The fundamental thing that most people don&amp;rsquo;t get about RDF, including many people who are successfully using it to get useful work done, is that &lt;em&gt;RDF&amp;rsquo;s very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better&lt;/em&gt;. Just because RDF doesn&amp;rsquo;t require the use of schemas doesn&amp;rsquo;t mean that it can&amp;rsquo;t use them; the RDF Schema Language lets you declare classes, properties, and information about these that you can use to &lt;a href=&#34;https://www.bobdc.com/blog/using-sparql-queries-from-nati&#34;&gt;drive user interfaces&lt;/a&gt;, to enable more efficient and readable queries, and to do all the other things that people typically use schemas for. Even better, you can develop a schema for the subset of the data you care about (as opposed to being forced to choose between a schema for the whole data set or no schema at all, as with XML), which is great for data integration projects, and then build your schema up from there.&lt;/p&gt;
&lt;p&gt;Barrasa writes of property graphs that &amp;ldquo;[t]he important thing to remember here is that both the nodes and relationships have an internal structure, which differentiates this model from the RDF model. By internal structure, I mean this set of key-value pairs that describe them.&amp;rdquo; This is the first important difference between RDF and property graphs: in the latter, nodes and edges can each have their own separate set (implemented as an array in Neo4j) of key-value pairs. Of course, nodes in RDF don&amp;rsquo;t need this; to say that the node for Jack has an attribute-value pair of (hireDate, &amp;ldquo;2017-04-12&amp;rdquo;), we simply make another triple with Jack as the subject and these as the predicate and object.&lt;/p&gt;
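&lt;p&gt;For example (with prefixes and property names made up for illustration), Jack&amp;rsquo;s hire date is just one more triple alongside whatever relationship triples he appears in:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data/&amp;gt; .
@prefix m: &amp;lt;http://learningsparql.com/ns/model/&amp;gt; .


d:Jane m:reportsTo d:Jack .          # an edge between two nodes
d:Jack m:hireDate  &amp;quot;2017-04-12&amp;quot; .    # the RDF version of a key-value pair
&lt;/code&gt;&lt;/pre&gt;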
&lt;p&gt;Describing the other key difference, Barrasa writes that while the nodes of property graphs have unique identifiers, &amp;ldquo;[i]n the same way, edges, or connections between nodes&amp;ndash;which we call relationships&amp;ndash;have an ID&amp;rdquo;. Property graph edges are unique at the instance level; if Jane reportsTo Jack and Jack reportsTo Jill, the two reportsTo relationships here each have their own unique identifier and their own set of key-value pairs to store information about each edge.&lt;/p&gt;
&lt;p&gt;He writes that in RDF &amp;ldquo;[t]he predicate will represent an edge&amp;ndash;a relationship&amp;ndash;and the object will be another node or a literal value. But here, from the point of view of the graph, that&amp;rsquo;s going to be another vertex.&amp;rdquo; Not necessarily, at least for the literal values; these represent the values in RDF&amp;rsquo;s equivalent of the key-value pairs&amp;ndash;the non-relationship information being attached to a node such as (hireDate, &amp;ldquo;2017-04-12&amp;rdquo;) above. This ability is why a node doesn&amp;rsquo;t need its own internal key-value data structure.&lt;/p&gt;
&lt;p&gt;He begins his list of differences between property graphs and RDF with the big one mentioned above: &amp;ldquo;Difference #1: RDF Does Not Uniquely Identify Instances of Relationships of the Same Type,&amp;rdquo; which is certainly true. But, his example, which he describes as &amp;ldquo;an RDF graph in which Dan cannot like Ann three times&amp;rdquo;, is very artificial.&lt;/p&gt;
&lt;p&gt;One of his &amp;ldquo;RDF workarounds&amp;rdquo; for using RDF to describe that Dan liked Ann three times is reification, in which we convert each triple to four triples: one saying that a given resource is an RDF statement, a second identifying the resource&amp;rsquo;s subject, a third naming the predicate, and a fourth naming the object. This way, the statement itself has identity, and we can add additional information about it as triples that use the statement&amp;rsquo;s identifier as a subject and additional predicates and objects as key-value pairs such as (time, &amp;ldquo;2018-03-04T11:43:00&amp;rdquo;) to show when a particular &amp;ldquo;like&amp;rdquo; took place. Barrasa writes &amp;ldquo;This is quite ugly&amp;rdquo;; I agree, and it can also do bad things to storage requirements.&lt;/p&gt;
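&lt;p&gt;To make the ugliness concrete, here&amp;rsquo;s a sketch (with my own made-up d: and m: names) of a single reified &amp;ldquo;Dan likes Ann&amp;rdquo; statement in Turtle, using the rdf:Statement, rdf:subject, rdf:predicate, and rdf:object terms from the RDF vocabulary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix d:   &amp;lt;http://learningsparql.com/ns/data/&amp;gt; .
@prefix m:   &amp;lt;http://learningsparql.com/ns/model/&amp;gt; .


d:like1 a rdf:Statement ;        # triple 1: this resource is a statement
        rdf:subject   d:Dan ;    # triple 2: its subject
        rdf:predicate m:likes ;  # triple 3: its predicate
        rdf:object    d:Ann ;    # triple 4: its object
        m:time &amp;quot;2018-03-04T11:43:00&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that those four triples only describe a statement; they don&amp;rsquo;t even assert that Dan likes Ann.&lt;/p&gt;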
&lt;p&gt;In my 15 years of working with RDF, I have never felt the need to use reification. It&amp;rsquo;s funny how the 2004 &lt;a href=&#34;https://www.w3.org/TR/rdf-primer/&#34;&gt;RDF Primer 1.0&lt;/a&gt; has a &lt;a href=&#34;https://www.w3.org/TR/rdf-primer/#reification&#34;&gt;section on reification&lt;/a&gt; but the 2014 &lt;a href=&#34;https://www.w3.org/TR/rdf11-primer/&#34;&gt;RDF Primer 1.1&lt;/a&gt; (of which I am proud to be listed in the &lt;a href=&#34;https://www.w3.org/TR/rdf11-primer/#section-Acknowledgments&#34;&gt;Acknowledgments&lt;/a&gt;) doesn&amp;rsquo;t even mention reification: simpler modeling techniques were available, so reification was rarely if ever used.&lt;/p&gt;
&lt;p&gt;By &amp;ldquo;modeling techniques&amp;rdquo; I mean &amp;ldquo;declaring and then using a model&amp;rdquo;, although in RDF, you don&amp;rsquo;t even have to declare it. If you want to keep track of separate instances of employees, or games, or buildings, you can declare any of these as a class and then create instances of it; similarly, if you want to keep track of separate instances of a particular relationship, declare a class for that relationship and then create instances of it.&lt;/p&gt;
&lt;p&gt;How would we apply this to Barrasa&amp;rsquo;s example, where he wants to keep track of information about Likes? We use a class called Like, where each instance identifies who liked who. (When I first wrote that previous sentence, I wrote that we can &lt;em&gt;declare&lt;/em&gt; a class called Like, but again, we don&amp;rsquo;t need to declare it to use it. Declaring it is better for serious applications where multiple developers must work together, because part of the point of a schema is to give everyone a common frame of reference about the data they&amp;rsquo;re working with.) The instance could also identify the date and time of the Like, comments associated with it, and anything else you wanted to add as a set of key-value pairs for each Like instance that is implemented as just more triples.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s an example. After optional declarations of the relevant class and properties associated with it, the following has four Likes showing who liked who when and a &amp;ldquo;foo&amp;rdquo; value to demonstrate the association of arbitrary metadata with that Like.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d:    &amp;lt;http://learningsparql.com/ns/data/&amp;gt; .
@prefix m:    &amp;lt;http://learningsparql.com/ns/model/&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; . 


# Optional schema.
m:Like  a rdfs:Class .          # A class...
m:liker rdfs:domain m:Like .    # and properties that go with this class.
m:liked rdfs:domain m:Like .
m:foo   rdfs:domain m:Like .


[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time &amp;quot;2018-03-04T11:43:00&amp;quot; ;
   m:foo &amp;quot;bar&amp;quot; .


[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time &amp;quot;2018-03-04T11:58:00&amp;quot; ;
   m:foo &amp;quot;baz&amp;quot; .


[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time &amp;quot;2018-03-04T12:04:00&amp;quot; ;
   m:foo &amp;quot;bat&amp;quot; .


[] a m:Like ;
   m:liker d:Ann ;
   m:liked d:Dan ;
   m:time &amp;quot;2018-03-04T12:06:00&amp;quot; ;
   m:foo &amp;quot;bam&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Instead of making up specific identifiers for each Like, I made them blank nodes so that the RDF processing software will generate identifiers and keep track of them.&lt;/p&gt;
&lt;p&gt;As to Barrasa&amp;rsquo;s use case of counting how many times Dan liked Ann, it&amp;rsquo;s pretty easy with SPARQL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://learningsparql.com/ns/data/&amp;gt; 
PREFIX m: &amp;lt;http://learningsparql.com/ns/model/&amp;gt;


SELECT (count(*) AS ?likeCount) WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(This query would actually work with just the &lt;code&gt;m:liker&lt;/code&gt; and &lt;code&gt;m:liked&lt;/code&gt; triple patterns, but as with the &lt;a href=&#34;https://twitter.com/bobdc/status/952910713751769088&#34;&gt;example that I tweeted to Dan Brickley about&lt;/a&gt;, declaring your RDF resources as instances of classes can lay the groundwork for more efficient and readable queries.) Here is &lt;a href=&#34;https://jena.apache.org/documentation/query/&#34;&gt;ARQ&lt;/a&gt;&amp;rsquo;s output for this query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-------------
| likeCount |
=============
| 3         |
-------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let&amp;rsquo;s get a little fancier. Instead of counting all of Dan&amp;rsquo;s likes of Ann, we&amp;rsquo;ll just list the ones from before noon on March 4, sorted by their foo values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://learningsparql.com/ns/data/&amp;gt; 
PREFIX m: &amp;lt;http://learningsparql.com/ns/model/&amp;gt;


SELECT ?fooValue ?time WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann ;
        m:time ?time ;
        m:foo ?fooValue .
FILTER (?time &amp;lt; &amp;quot;2018-03-04T12:00&amp;quot;)
}
ORDER BY ?fooValue
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here is ARQ&amp;rsquo;s result for this query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;------------------------------------
| fooValue | time                  |
====================================
| &amp;quot;bar&amp;quot;    | &amp;quot;2018-03-04T11:43:00&amp;quot; |
| &amp;quot;baz&amp;quot;    | &amp;quot;2018-03-04T11:58:00&amp;quot; |
------------------------------------
&lt;/code&gt;&lt;/pre&gt;
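&lt;p&gt;One aside about that FILTER: the time values in my sample data are plain string literals, so the comparison is lexical. That happens to give the right answer here because these ISO 8601 timestamps all share the same date prefix; if the data typed the values as xsd:dateTime (with the xsd: prefix declared in both the data and the query), the same comparison would work on actual date and time values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In the data:
m:time &amp;quot;2018-03-04T11:43:00&amp;quot;^^xsd:dateTime ;


# In the query:
FILTER (?time &amp;lt; &amp;quot;2018-03-04T12:00:00&amp;quot;^^xsd:dateTime)
&lt;/code&gt;&lt;/pre&gt;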
&lt;p&gt;After working through a similar example for modeling flights between New York and San Francisco, Barrasa begins a sentence &amp;ldquo;Because we can&amp;rsquo;t create such a simple model in RDF&amp;hellip;&amp;rdquo; This is ironic; the RDF model is simpler than the Labeled Property Graph model, because it&amp;rsquo;s all subject-predicate-object triples without the use of additional data structures attached to the graph nodes and edges. His RDF version would have been much simpler if he had just created instances of a class called Flight, because again, while the base model of RDF is the simple triple, more complex models can easily be created by declaring classes, properties, and information about those classes and properties&amp;ndash;which we can do by just creating new triples!&lt;/p&gt;
&lt;p&gt;To summarize, complaints about RDF that focus on reification are so 2004, and they are a red herring, because they distract from the greater power that RDF&amp;rsquo;s modeling abilities bring to application development.&lt;/p&gt;
&lt;p&gt;A funny thing happened after writing all this, though. As part of my plans to look into Tinkerpop and Gremlin and potential connections to RDF as a next step, I was looking into Stardog and Blazegraph&amp;rsquo;s common support of both. I found a Blazegraph page called &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/Reification_Done_Right&#34;&gt;Reification Done Right&lt;/a&gt; where I learned of Olaf Hartig and Bryan Thompson&amp;rsquo;s 2014 paper &lt;a href=&#34;https://arxiv.org/pdf/1406.3399.pdf&#34;&gt;Foundations of an Alternative Approach to Reification in RDF&lt;/a&gt;. If Blazegraph has implemented their ideas, then there is a lot of potential there. And if the Blazegraph folks brought this with them to Amazon Neptune, that would be even more interesting, although apparently that hasn&amp;rsquo;t shown up yet.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Album &#34;Gin &amp; Heptatonic&#34; by my band The Heptatonic Jazz Quintet</title>
      <link>https://www.bobdc.com/blog/album-gin-heptatonic-by-my-ban/</link>
      <pubDate>Sun, 25 Mar 2018 12:52:28 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/album-gin-heptatonic-by-my-ban/</guid>
      
      
      <description><div>Now available on the big streaming services.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.heptatonic.com&#34;&gt;&lt;img id=&#34;idm45698402796880&#34; src=&#34;https://www.bobdc.com/img/main/frontCover400x400.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Gin &amp; Heptatonic cover&#34; width=&#34;260&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(I promise to go back to writing about RDF and related technology with my next entry, which is tentatively titled &amp;ldquo;Reification is a red herring: you don&amp;rsquo;t need property graphs to assign data to individual relationships.&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;Along with the jazz bass playing that I&amp;rsquo;ve been working on since 2003, I&amp;rsquo;ve written a few jazz tunes to try with the people I played with, so I recently got together some of my favorite local musicians and recorded an album of these songs. As soon as I told my wife that I planned to call the band &amp;ldquo;The Heptatonic Jazz Quintet&amp;rdquo; she suggested calling the album &amp;ldquo;Gin &amp;amp; Heptatonic&amp;rdquo;, and I couldn&amp;rsquo;t argue with that. (A &lt;a href=&#34;https://en.wikipedia.org/wiki/Heptatonic_scale&#34;&gt;heptatonic scale&lt;/a&gt; is a scale with seven notes, like most scales in Western music. And of course, beginning with &amp;ldquo;hep&amp;rdquo; makes it a great name for a jazz band. I was &lt;a href=&#34;https://twitter.com/bobdc/status/718815744218284032&#34;&gt;thrilled&lt;/a&gt; to grab the domain name &lt;a href=&#34;http://www.heptatonic.com&#34;&gt;heptatonic.com&lt;/a&gt; for only $12.) The music is mostly hard bop, swing, and variations on those.&lt;/p&gt;
&lt;p&gt;My brother &lt;a href=&#34;http://mcylinder.com/index.php?title=Main_Page&#34;&gt;Peter&lt;/a&gt; produced the album and did the excellent &lt;a href=&#34;http://www.concordmusicgroup.com/labels/prestige/&#34;&gt;Prestige&lt;/a&gt; and &lt;a href=&#34;http://www.bluenote.com/&#34;&gt;Blue Note&lt;/a&gt;-inspired front cover using a &lt;a href=&#34;https://www.flickr.com/photos/jenny-pics/9568936573&#34;&gt;picture&lt;/a&gt; that I found in a Flickr search for Creative Commons CC BY 2.0 images. I did the &lt;a href=&#34;http://heptatonic.com/img/backCoverMockup634x634.png&#34;&gt;back cover&lt;/a&gt; myself with a deep dive into &lt;a href=&#34;https://www.gimp.org/&#34;&gt;GIMP&lt;/a&gt;. (On the topic of open source Linux-Windows-Mac software that played a role, I love the &lt;a href=&#34;https://musescore.com/&#34;&gt;MuseScore&lt;/a&gt; scoring program and used it for lead sheets, MIDI demos, and horn arrangements.)&lt;/p&gt;
&lt;p&gt;Two songs have lyrics. I knew that the album&amp;rsquo;s closing song &amp;ldquo;Let&amp;rsquo;s&amp;rdquo; required greater lyrical skills than I was capable of, so for that I called in my old New York music friend Philip Shelley. His illustrious musical career included the production of a &lt;a href=&#34;http://www.snee.com/music/ha/&#34;&gt;demo&lt;/a&gt; of the last serious rock band I was in many years ago, and he wrote a song on the other demo. (You can read more about my limited New York rock career in an &lt;a href=&#34;https://www.bobdc.com/blog/me-as-80s-new-york-lead-guitar&#34;&gt;older blog entry&lt;/a&gt;.) Because no one in the quintet had any singing ambitions, for those two songs we got special guest &lt;a href=&#34;https://www.youtube.com/watch?v=yv3mGENY6BY&#34;&gt;Dick Orange&lt;/a&gt;, a popular local singer who specializes in &amp;ldquo;the great American songbook&amp;rdquo;, which generally means songs made famous by Frank Sinatra.&lt;/p&gt;
&lt;p&gt;It was interesting to learn about the current infrastructure of getting music out where people can hear it. A &lt;a href=&#34;http://www.musicforpicture.com/&#34;&gt;former business partner of my brother&amp;rsquo;s&lt;/a&gt; recommended &lt;a href=&#34;http://www.tunecore.com&#34;&gt;TuneCore&lt;/a&gt;, so I had them print a hundred CDs and, more importantly, take care of the music publishing administration and distribute the album to &lt;a href=&#34;https://open.spotify.com/album/6uLlvZnQWTnU6WA1uqbj3M&#34;&gt;Spotify&lt;/a&gt;, &lt;a href=&#34;https://listen.tidal.com/album/85804598&#34;&gt;Tidal&lt;/a&gt;, &lt;a href=&#34;https://www.amazon.com/Gin-Heptatonic-Jazz-Quintet/dp/B07BF3P9KV&#34;&gt;Amazon&lt;/a&gt;, Apple Music, iTunes, and other services. (I can&amp;rsquo;t provide you with Apple Music or iTunes links to the album; just search for &amp;ldquo;heptatonic&amp;rdquo; from inside of your favorite Apple walled garden.)&lt;/p&gt;
&lt;p&gt;So if you like jazz, please check out the album and &amp;ldquo;Like&amp;rdquo; the band&amp;rsquo;s &lt;a href=&#34;https://www.facebook.com/heptatonicjazzquintet&#34;&gt;Facebook page&lt;/a&gt;. If you&amp;rsquo;re in the Charlottesville, Virginia, area on June 1st, come to our CD Release Party at Cville Coffee, which has wine and beer in addition to coffee.&lt;/p&gt;
&lt;p&gt;And I promise: next I&amp;rsquo;ll go back to blogging about triples!&lt;/p&gt;
&lt;img id=&#34;idm45698402775072&#34; src=&#34;https://www.bobdc.com/img/main/hjqwithdick.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Heptatonic Jazz Quintet with Dick Orange&#34; width=&#34;360&#34;/&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Playing jazz bass</title>
      <link>https://www.bobdc.com/blog/playing-jazz-bass/</link>
      <pubDate>Sun, 25 Feb 2018 12:15:11 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/playing-jazz-bass/</guid>
      
      
      <description><div>A brief crash course.</div><div>&lt;p&gt;I enjoy writing short tutorials to get people started on something that may have seemed intimidating to them before, and I thought it might be fun to write up something that isn&amp;rsquo;t related to software but that I have thought a lot about in the last 15 years: jazz bass playing.&lt;/p&gt;
&lt;p&gt;A few basic patterns that you can repeat over nearly any chord will get you pretty far. Any rock or classical bass player should be able to pick these up quickly. It should also work for any guitar player, because both electric and upright basses are tuned like the low four strings of a guitar. (Of course, the upright lacks frets, so you have to put your left hand&amp;rsquo;s fingers where the frets would be.) This crash course can be useful to keyboard players as well, who can treat it as a guide to what to play with their left hand for jazz tunes.&lt;/p&gt;
&lt;p&gt;You can think of just about all jazz as being composed of 7th chords: major 7th, minor 7th, dominant 7th, and, less often, diminished 7th or half-diminished chords. These each consist of four notes, and the distances between the notes are what make them sound different&amp;ndash;for example, the first two notes of a major 7th are a major third apart, and in a minor 7th they&amp;rsquo;re a minor third apart. Jazz musicians who see a three-note triad chord like D minor may just add the seventh anyway, treating it as a D minor 7th. For a dominant 7th such as G7 in the key of C, jazz musicians since the advent of bebop in the 1940s sometimes add more notes to the chord, such as the 9th, 11th, and 13th notes of the root note&amp;rsquo;s scale. They may even shift some of those added notes up or down a half step, so that you see a fancy chord name like G7#9. As a bass player, just think of that as a G7. To summarize, it&amp;rsquo;s simplest to think of it all as 7th chords.&lt;/p&gt;
&lt;p&gt;There are some classic patterns that bass players typically play over these 7th chords, and if you learn a few of them and the notes of the chords, you can play simple jazz bass lines. Guitar players know that if they play the notes of an A minor 7th chord and then move their left hand one fret up the neck and do the same thing, they&amp;rsquo;ll be playing a Bb minor 7th, so learning how to play all the chords means learning only a few patterns that you can play up and down the neck. The same applies to these jazz bassline patterns.&lt;/p&gt;
&lt;p&gt;A walking jazz bass line is nearly all quarter notes, so when you see &amp;ldquo;1357&amp;rdquo; below, for a given chord in a given bar played in 4/4 time, you would play these four notes as quarter notes: the root of the chord (the 1), the 3rd, the 5th, and the 7th. For example, over an A minor 7th chord, 1357 would mean playing A C E G.&lt;/p&gt;
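&lt;p&gt;If it helps to see that mapping spelled out in code, here is a minimal sketch (just an illustration I&amp;rsquo;m adding here, with note names simplified to sharps only, no flats) that turns a pattern like &amp;ldquo;1357&amp;rdquo; into note names for a given chord root and quality:&lt;/p&gt;

```javascript
// Minimal sketch: compute the notes of a bass pattern like "1357"
// from a chord's root and quality. Note names use sharps only.
const CHROMATIC = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B'];

// Semitone offsets for each scale-degree digit used in the patterns;
// the 3 and 7 shift depending on the chord quality.
const OFFSETS = {
  minor7:    {1: 0, 2: 2, 3: 3, 5: 7, 7: 10, 8: 12},
  major7:    {1: 0, 2: 2, 3: 4, 5: 7, 7: 11, 8: 12},
  dominant7: {1: 0, 2: 2, 3: 4, 5: 7, 7: 10, 8: 12}
};

function bassPattern(root, quality, pattern) {
  const start = CHROMATIC.indexOf(root);
  return pattern.split('').map(
    digit => CHROMATIC[(start + OFFSETS[quality][digit]) % 12]
  );
}

console.log(bassPattern('A', 'minor7', '1357'));  // [ 'A', 'C', 'E', 'G' ]
```

&lt;p&gt;Running &lt;code&gt;bassPattern('G', 'dominant7', '1353')&lt;/code&gt; gives G B D B: the 1353 pattern over a G7 chord.&lt;/p&gt;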
&lt;p&gt;For each of these patterns, we&amp;rsquo;ll look at how you would play them on the first four bars of the jazz standard Autumn Leaves. (Compare &lt;a href=&#34;https://www.youtube.com/watch?v=ZEMCeymW1Ow&#34;&gt;Nat King Cole&amp;rsquo;s version&lt;/a&gt; with &lt;a href=&#34;https://www.youtube.com/watch?v=rsz6TE6t7-A&#34;&gt;Miles Davis&amp;rsquo;s&lt;/a&gt;; Miles&amp;rsquo;s fifty-second intro delays the start of the actual song a bit.)&lt;/p&gt;
&lt;h2 id=&#34;idm45853255691392&#34;&gt;1357&lt;/h2&gt;
&lt;p&gt;This is probably the most important pattern, but not the one you&amp;rsquo;ll use the most. It&amp;rsquo;s just an arpeggio of the chord&amp;ndash;that is, the playing of each note of the 7th chord from the root up. It&amp;rsquo;s an important pattern to practice with any given song because it helps you to really understand the song&amp;rsquo;s structure. Over the first four bars of Autumn Leaves, this pattern would look like this on a bass staff (click the play button underneath it to hear the bass line with a piano and drums generated by the excellent open source scoring program &lt;a href=&#34;https://musescore.org/en&#34;&gt;MuseScore&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255688944&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1357.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255688240&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255687664&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1357.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;Repeating the same pattern for four bars is not something you&amp;rsquo;d want to do when playing with other people, but for this pattern it&amp;rsquo;s something worth doing for an entire song while practicing on your own because it helps you to get to know the song&amp;rsquo;s chords better.&lt;/p&gt;
&lt;h2 id=&#34;idm45853255685904&#34;&gt;1353&lt;/h2&gt;
&lt;p&gt;This one is so useful that I use it too often when I&amp;rsquo;m on automatic pilot. You can&amp;rsquo;t go wrong with it. I mentioned above that the main difference between a major seventh chord and a minor seventh chord is the &amp;ldquo;3&amp;rdquo; note; this pattern really brings that out while still hitting the most important notes of the chord from a bass player&amp;rsquo;s perspective&amp;ndash;the root and the fifth&amp;ndash;on the crucial first and third beats of the bar. Here it is over the start of Autumn Leaves:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255684160&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1353.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255683456&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255682928&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1353.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;h2 id=&#34;idm45853255681840&#34;&gt;1155&lt;/h2&gt;
&lt;p&gt;This seems almost too simple, but it sounds great if you give it a strong swing feel on a song like Duke Ellington&amp;rsquo;s &lt;a href=&#34;https://www.youtube.com/watch?v=TrytKuC3Z_o&#34;&gt;Satin Doll&lt;/a&gt;. Here it is over Autumn Leaves:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255680144&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1155.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255679440&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255678912&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1155.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;h2 id=&#34;idm45853255677824&#34;&gt;1231&lt;/h2&gt;
&lt;p&gt;The 2nd note of the chord&amp;rsquo;s scale is not a chord tone, but here it leads to a chord tone on the crucial third beat. This is the first pattern we&amp;rsquo;ve seen that doesn&amp;rsquo;t always have either a 1 or a 5 on the first and third beat; the 3 on the third beat brings out the color of the chord more. In Autumn Leaves:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255676704&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1231.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255676000&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255675472&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1231.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;h2 id=&#34;idm45853255674640&#34;&gt;1235&lt;/h2&gt;
&lt;p&gt;Similar to the last one, and similarly useful. In Autumn Leaves:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255673760&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1235.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255673056&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255672528&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1235.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;h2 id=&#34;idm45853255671440&#34;&gt;1875&lt;/h2&gt;
&lt;p&gt;The 8 here really refers to the 1, but an octave higher. This is our first pattern with a 7th in it. In Autumn Leaves:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255670512&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1875.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255669808&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255669280&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/1875.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;If you replace each quarter note in that with two swung eighth notes, you&amp;rsquo;d have a classic Chicago blues bass line, although major seventh chords don&amp;rsquo;t come up in Chicago blues very often:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255667984&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/11887755.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255667280&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255666752&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/11887755.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;(John Paul Jones&amp;rsquo; bass line in Led Zeppelin&amp;rsquo;s &lt;a href=&#34;https://www.youtube.com/watch?v=wEPog_WdPE4&#34;&gt;How Many More Times&lt;/a&gt; is a variation on this: 1 8757 1 8 7 5.)&lt;/p&gt;
&lt;h2 id=&#34;idm45853255664432&#34;&gt;8753&lt;/h2&gt;
&lt;p&gt;Going down from the root of the chord through the chord&amp;rsquo;s other notes is also great. Again, you have the 1 (an octave higher this time) and the 5 on the first and third beat. In Autumn Leaves:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255663424&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/8753.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255662720&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255662192&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/8753.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;h2 id=&#34;idm45853255661104&#34;&gt;Half bars&lt;/h2&gt;
&lt;p&gt;Jazz songs typically have one chord per bar. There are songs ranging from &lt;a href=&#34;https://www.youtube.com/watch?v=5G7UIeYGq0k&#34;&gt;I Got Rhythm&lt;/a&gt; (and the hundreds of songs based on it) to John Coltrane&amp;rsquo;s &lt;a href=&#34;https://www.youtube.com/watch?v=30FTr6G53VU&#34;&gt;Giant Steps&lt;/a&gt; that are mostly two chords per bar, but in most jazz you&amp;rsquo;ll see one chord per bar with the occasional two-chord bar at the end of a four- or eight-bar phrase. If you play the chord notes 13, 15, or 85 over each half bar, you&amp;rsquo;ll be fine. Here are the first four bars of &amp;ldquo;I Got Rhythm&amp;rdquo; using 13 13 15 85 13 85 15 85:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255658352&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/IGotRhythm.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255657648&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255657120&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/IGotRhythm.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;h2 id=&#34;idm45853255656032&#34;&gt;Putting some together&lt;/h2&gt;
&lt;p&gt;Good bass playing mixes and matches these (and more) patterns. Below I&amp;rsquo;ve written out a bass line for the first eight bars of Autumn Leaves, labeling which of the patterns above is used in each bar:&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;idm45853255655024&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/AutumnLeaves8bars.png&#34;/&gt;&lt;br /&gt;
&lt;audio id=&#34;idm45853255654304&#34; controls=&#34;controls&#34;&gt; &lt;source id=&#34;idm45853255653776&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/AutumnLeaves8bars.mp3&#34;/&gt; &lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;Note how all the patterns listed above start with the root note of the chord. This is a solid, dependable thing to do, and it greatly aids the jazz bass player&amp;rsquo;s job of showing the others what chord is being played. A step toward more advanced bass playing is getting away from this&amp;ndash;for example, starting on the 3 or the 5 of the chord&amp;ndash;while still making it clear to the rest of the group exactly which chord is happening. (They should already know, but still, you and the drummer and the piano or guitar player are providing the cake of which the other players&amp;rsquo; solos are the frosting.)&lt;/p&gt;
&lt;p&gt;Using more non-chord tones, the way 1231 and 1235 do above, is also a way to move past beginner status, as is moving beyond playing four quarter notes for every bar. As a first step to moving beyond the patterns above, try substituting 8 for 1 in more of the patterns, and try coming up with your own combinations of 1, 3, 5, 7, and 8. And listen to great bass players. My favorites are Paul Chambers and Ray Brown, but if you listen to older, pre-bebop jazz, you&amp;rsquo;ll hear these simple patterns come up more often.&lt;/p&gt;
&lt;img id=&#34;idm45853255650304&#34; width=&#34;500&#34; src=&#34;https://www.bobdc.com/img/main/bassCrashCourse/JayeBobVictor.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; alt=&#34;Jaye, Bob, and Victor of Jazz Collective #9&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>JavaScript SPARQL</title>
      <link>https://www.bobdc.com/blog/javascript-sparql/</link>
      <pubDate>Sun, 28 Jan 2018 09:35:35 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/javascript-sparql/</guid>
      
      
      <description><div>With rdfstore-js.</div><div>&lt;blockquote id=&#34;idm45782206154720&#34; class=&#34;pullquote&#34;&gt;... all in the world&#39;s most popular programming language.&lt;/blockquote&gt;
&lt;p&gt;I finally had a chance to play with &lt;a href=&#34;https://github.com/antoniogarrote/rdfstore-js&#34;&gt;rdfstore-js&lt;/a&gt; by &lt;a href=&#34;https://twitter.com/antoniogarrote&#34;&gt;Antonio Garrote&lt;/a&gt; and it was all pretty straightforward. I already had node.js installed, so a simple &lt;code&gt;npm install rdfstore&lt;/code&gt; installed his library. Then, I was ready to include the library in a JavaScript script that would read some RDF and query it with SPARQL. I just ran my script &lt;a href=&#34;https://nodejs.org/api/cli.html&#34;&gt;from the command line&lt;/a&gt;, but node.js fans know that they can take advantage of this library&amp;rsquo;s features in much more interesting application architectures. (Before I go on, I wanted to mention that after I tweeted yesterday that this blog entry was coming, &lt;a href=&#34;https://twitter.com/andyseaborne&#34;&gt;Andy Seaborne&lt;/a&gt; reminded me about Apache Jena&amp;rsquo;s ability to load and run JavaScript functions. I tried the example from the feature&amp;rsquo;s &lt;a href=&#34;http://jena.staging.apache.org/documentation/query/javascript-functions.html&#34;&gt;home page&lt;/a&gt; and it worked great right out of the box.)&lt;/p&gt;
&lt;p&gt;My sample script starts with a function I wrote for general-purpose output of SPARQL SELECT queries, then creates an &lt;code&gt;rdfstore&lt;/code&gt; object and saves a query that will be used twice later in the script. After loading some RDF data about my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; from the OCLC&amp;rsquo;s &lt;a href=&#34;https://www.worldcat.org/&#34;&gt;Worldcat&lt;/a&gt; online library catalog into the rdfstore, it runs the saved query against the loaded data to list ISBN numbers. The script then loads data about another book, runs the same query, and you can see the additional ISBN numbers in the new output.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Utility function for outputting SELECT results
function outputSPARQLResults(results) {
    for (row in results) {
        printedLine = &#39;&#39;
        for (column in results[row]) {
            printedLine = printedLine + results[row][column].value + &#39; &#39;
        }
        console.log(printedLine)
    }
}


// Create an rdfstore
var rdfstore = require(&#39;rdfstore&#39;) 


// Define a query to execute.
var listISBNs = &#39;PREFIX s: &amp;lt;http://schema.org/&amp;gt; \
PREFIX ls: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; \
PREFIX wco: &amp;lt;http://www.worldcat.org/title/-/oclc/&amp;gt; \
PREFIX wci: &amp;lt;http://worldcat.org/isbn/&amp;gt; \
SELECT ?isbn \
FROM ls:g1 WHERE { ?book s:isbn ?isbn } &#39;


rdfstore.create(function(err, store) {   // no error handling

   
    store.execute(
        // Load data about the book Learning SPARQL into named graph g1 in the rdfstore.
        &#39;LOAD &amp;lt;http://worldcat.org/oclc/890467322.ttl&amp;gt; \
        INTO GRAPH &amp;lt;http://learningsparql.com/ns/data#g1&amp;gt;&#39;, function(err) {


            store.setPrefix(&#39;s&#39;, &#39;http://schema.org/&#39;)
            store.setPrefix(&#39;ls&#39;, &#39;http://learningsparql.com/ns/data#&#39;)
            store.setPrefix(&#39;wco&#39;, &#39;http://www.worldcat.org/title/-/oclc/&#39;)
            store.setPrefix(&#39;wci&#39;, &#39;http://worldcat.org/isbn/&#39;)

           
        store.execute(listISBNs, function(err, results) {
                console.log(&amp;quot;=== ISBN value ===&amp;quot;)
                outputSPARQLResults(results)
        })
        }
    )


    store.execute(
        // Load data about the book &amp;quot;XML: The Annotated Specification&amp;quot; into the same graph
        &#39;LOAD &amp;lt;http://worldcat.org/oclc/40768745.ttl&amp;gt; \
        INTO GRAPH &amp;lt;http://learningsparql.com/ns/data#g1&amp;gt;&#39;, function(err) {
        store.execute(listISBNs, function(err, results) {
                console.log(&amp;quot;\n=== ISBN values after adding 2nd book&#39;s data ===&amp;quot;)
                outputSPARQLResults(results)
        })
        }
    )

    
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script produces this output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;=== ISBN value ===
9781449371432 
1449371434 


=== ISBN values after adding 2nd book&#39;s data ===
9781449371432 
1449371434 
9780130826763 
0130826766
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I loaded the data into a named graph because the library documentation&amp;rsquo;s sample query for loading remote data did. I briefly tried loading the data into the default graph, but had no luck; I&amp;rsquo;m all for the use of named graphs, anyway. I also tried deleting triples from and inserting them into the &lt;code&gt;g1&lt;/code&gt; named graph and then querying again to see the results, and I didn&amp;rsquo;t have much luck there either (no error messages&amp;ndash;I just didn&amp;rsquo;t see the query results I expected after the deletion and insertion), but my minimal understanding of node.js asynchronous behavior was probably to blame. The library&amp;rsquo;s &lt;a href=&#34;https://github.com/antoniogarrote/rdfstore-js&#34;&gt;github page&lt;/a&gt; shows that it does support INSERT and DELETE queries.&lt;/p&gt;
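&lt;p&gt;My best guess about what went wrong: independent &lt;code&gt;store.execute()&lt;/code&gt; calls are started without waiting for each other, so nothing guarantees that a query runs after the update it depends on. The sketch below (plain node.js, no rdfstore-js, with a stand-in for the store call) shows the general fix of starting the second operation inside the first one&amp;rsquo;s callback:&lt;/p&gt;

```javascript
// Sketch of serializing two async operations with nested callbacks.
// asyncOp stands in for something like rdfstore-js's store.execute().
function asyncOp(name, delayMs, done) {
  setTimeout(() => done(name), delayMs);
}

const order = [];

// Even though "second" would finish sooner on its own (5ms vs 20ms),
// nesting it in the first callback guarantees it runs afterward.
asyncOp('first', 20, (a) => {
  order.push(a);
  asyncOp('second', 5, (b) => {
    order.push(b);
    console.log(order.join(','));  // first,second
  });
});
```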
&lt;p&gt;I wouldn&amp;rsquo;t use this library&amp;rsquo;s triplestore for ongoing production maintenance of a set of triples, anyway; I see it as a great lightweight way to grab triples from one or more sources and then perform SPARQL queries on those triples to look for subsets and patterns that can contribute to an application, all in the world&amp;rsquo;s &lt;a href=&#34;http://www.businessinsider.com/the-9-most-popular-programming-languages-according-to-the-facebook-for-programmers-2017-10/#1-javascript-15&#34;&gt;most popular programming language&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The rdfstore-js github page also shows that it offers many ways to query and manipulate the loaded data that, for a JavaScript programmer, would be more direct. If Antonio&amp;rsquo;s ultimate goal was to bring RDF to JavaScript developers, I won&amp;rsquo;t complain; I&amp;rsquo;m just glad that he brought a useful JavaScript library to RDF (and SPARQL) developers.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2018">2018</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>SPARQL and Amazon Web Service&#39;s Neptune database</title>
      <link>https://www.bobdc.com/blog/sparql-and-amazon-web-services/</link>
      <pubDate>Sun, 31 Dec 2017 09:53:14 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-and-amazon-web-services/</guid>
      
      
      <description><div>Promising news for large-scale RDF development.</div><div>&lt;p&gt;Amazon recently &lt;a href=&#34;https://yukon.aws.amazon.com/rds/gdb?region=us-east-1#&#34;&gt;announced&lt;/a&gt; Neptune as an AWS service. As its &lt;a href=&#34;https://aws.amazon.com/neptune/&#34;&gt;home page&lt;/a&gt; describes it,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Amazon Neptune is a fast, scalable graph database service. Neptune efficiently stores and navigates highly connected data. Its query processing engine is optimized for leading graph query languages, Apache TinkerPop™ Gremlin and the W3C&amp;rsquo;s RDF SPARQL. Neptune provides high performance through the open and standard APIs of these graph frameworks. And, Neptune is fully managed, so you no longer need to worry about database management tasks such as hardware provisioning, software patching, setup, configuration, or backups.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Apart from the practical aspects of the scalable yet convenient use of RDF and SPARQL that Neptune will enable, it&amp;rsquo;s exciting to see such a high-profile acknowledgment of SPARQL as a serious development tool. &lt;a href=&#34;http://sparql.club/&#34;&gt;Many organizations&lt;/a&gt; already knew this, but judging from the reaction to the Neptune announcement on Twitter, many more people are finally understanding this.&lt;/p&gt;
&lt;blockquote id=&#34;idm139982883658048&#34; class=&#34;pullquote&#34;&gt;It&#39;s exciting to see such a high-profile acknowledgment of SPARQL as a serious development tool. &lt;/blockquote&gt;
&lt;p&gt;Rumors have been flying that the &lt;a href=&#34;https://www.blazegraph.com/&#34;&gt;Blazegraph&lt;/a&gt; triplestore may play some role in Amazon&amp;rsquo;s new graph store. As Stardog CEO &lt;a href=&#34;https://twitter.com/kendall&#34;&gt;Kendall Clark&lt;/a&gt; wrote on &lt;a href=&#34;https://news.ycombinator.com/item?id=15809687&#34;&gt;ycombinator recently&lt;/a&gt;, &amp;ldquo;Amazon acquired the domains, etc. Many former Blazegraph engineers are now Amazon Neptune engineers according to LinkedIn, etc. It was rumored widely in the graph db world fwiw.&amp;rdquo; Yahoo Knowledge Graph science and data lead &lt;a href=&#34;https://twitter.com/nicolastorzec&#34;&gt;Nicolas Torzec&lt;/a&gt; responded to Kendall&amp;rsquo;s comment with a link showing that &lt;a href=&#34;https://www.trademarkia.com/blazegraph-86498414.html&#34;&gt;Amazon now owns the Blazegraph trademark&lt;/a&gt;. (Blazegraph&amp;rsquo;s website hasn&amp;rsquo;t shown much activity in a while, with the latest post on their &lt;a href=&#34;https://www.blazegraph.com/press/&#34;&gt;Press&lt;/a&gt; page being from May of last year.)&lt;/p&gt;
&lt;p&gt;May of last year was also when I wrote &lt;a href=&#34;https://www.bobdc.com/blog/trying-out-blazegraph&#34;&gt;Trying out Blazegraph&lt;/a&gt; about my positive experiences with this graph store, and after the recent announcement I &lt;a href=&#34;https://twitter.com/bobdc/status/936995695655116800&#34;&gt;tweeted&lt;/a&gt; that if Blazegraph was part of Neptune, it would be very cool if that included Blazegraph&amp;rsquo;s inferencing. Pavel Klinov replied by &lt;a href=&#34;https://twitter.com/klinovp/status/937599682582319105&#34;&gt;pointing out&lt;/a&gt; a &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E&#34;&gt;Neptune announcement video&lt;/a&gt; where they explicitly say that inferencing is not supported.&lt;/p&gt;
&lt;p&gt;This hour-long &amp;ldquo;AWS re:Invent 2017: NEW LAUNCH! Deep dive on Amazon Neptune&amp;rdquo; video included some other interesting points. Because Neptune supports property graphs via TinkerPop as well as SPARQL, early in the video the speaker provides some &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=8m20s&#34;&gt;background on property graphs versus RDF&lt;/a&gt;. He devotes a good portion of his presentation to talking through an SQL query for people who are unfamiliar with graph databases and then covering comparable SPARQL and TinkerPop Gremlin queries.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=4m51s&#34;&gt;plug from Thomson Reuters&lt;/a&gt; early in the video was nice to see, coming from a large, well-known organization that has been taking SPARQL seriously for a while. Later in the video, &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=21m50s&#34;&gt;one slide&amp;rsquo;s&lt;/a&gt; use of Thomson Reuters&amp;rsquo; &lt;a href=&#34;https://permid.org/&#34;&gt;PermID&lt;/a&gt; vocabulary alongside the geonames vocabulary in the same triple was especially nice to see. While the extent of RDF&amp;rsquo;s usage continues to be a pleasant surprise for me, I&amp;rsquo;m also surprised by how many people use it only for the simplicity of the triples data model&amp;ndash;they&amp;rsquo;re missing the data integration power that comes from mixing and matching the wide variety of existing vocabularies (and hence data sources) with their own data.&lt;/p&gt;
&lt;p&gt;The video&amp;rsquo;s &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=26m&#34;&gt;second speaker&lt;/a&gt; talks more about Neptune&amp;rsquo;s enterprise features such as fast failover, encryption at rest and in transit, and backup and restore, which are all great things to see in a cloud-based triplestore. Neptune offers a lot of room; as this speaker &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=30m58s&#34;&gt;mentions&lt;/a&gt;, &amp;ldquo;Storage volumes are not required to be statically allocated; they actually grow automatically up to a maximum size of 64 terabytes.&amp;rdquo; The ability to &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=36m10s&#34;&gt;restore a dataset to its state from a previous point in time&lt;/a&gt; also sounds very useful.&lt;/p&gt;
&lt;p&gt;Once the speakers &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=38m07s&#34;&gt;started taking questions&lt;/a&gt;, it looked to me like there were more questions about RDF and SPARQL than there were about TinkerPop and Gremlin. The former included the &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=41m32s&#34;&gt;question about inferencing&lt;/a&gt;, which got a response (as Pavel had pointed out to me) of &amp;ldquo;we do not have in-database inference currently&amp;hellip; we are very interested in use cases for inferencing.&amp;rdquo; They also &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=39m45s&#34;&gt;said&lt;/a&gt; that Neptune&amp;rsquo;s underlying graph engine was custom-built by Amazon as a graph system, which left me more curious about the potential role of Blazegraph in the released version of Neptune. (Maybe &amp;ldquo;by Amazon&amp;rdquo; includes former Blazegraph engineers.)&lt;/p&gt;
&lt;p&gt;Some more interesting facts from the question and answer session:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=42m30s&#34;&gt;Timeouts&lt;/a&gt; of SPARQL end points can be configured.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;They have &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=43m&#34;&gt;tested it&lt;/a&gt; with pretty close to a hundred billion triples. (Who remembers the &lt;a href=&#34;http://km.aifb.kit.edu/projects/btc-2014/&#34;&gt;Billion Triples Challenge&lt;/a&gt;?)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=43m19s&#34;&gt;Neptune supports release 1.1&lt;/a&gt; of SPARQL &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/&#34;&gt;Query&lt;/a&gt; and &lt;a href=&#34;https://www.w3.org/TR/2013/REC-sparql11-update-20130321/&#34;&gt;Update&lt;/a&gt;, and the endpoint supports 1.1 of the SPARQL &lt;a href=&#34;https://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/&#34;&gt;Protocol&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It supports &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=44m06s&#34;&gt;named graphs&lt;/a&gt;, which will be particularly handy for managing multiple datasets when dealing with data at that scale.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While the preview configuration of Neptune &lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=49m30s&#34;&gt;does not allow federated SPARQL queries for security reasons&lt;/a&gt;, they &amp;ldquo;do see a lot of use cases for SPARQL federation.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While Neptune currently doesn&amp;rsquo;t support &amp;ldquo;&lt;a href=&#34;https://www.youtube.com/watch?v=6o1Ezf6NZ_E#t=51m02s&#34;&gt;schema concepts or constraints in the graph schema&lt;/a&gt;,&amp;rdquo; it &amp;ldquo;is something that [they] have on their roadmap.&amp;rdquo; The Amazon rep first responded to this question by asking if the questioner was talking about something like &lt;a href=&#34;https://www.bobdc.com/blog/validating-rdf-data-with-shacl&#34;&gt;SHACL&lt;/a&gt;; although they do not currently support this, just hearing him mention SHACL showed me that this great new standard is gaining some mindshare out there.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
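&lt;p&gt;That protocol support means any standard SPARQL 1.1 Protocol client should work against a Neptune endpoint. As a rough sketch (the endpoint URL below is just a placeholder, not a real Neptune address), a query request is an HTTP GET with a URL-encoded &lt;code&gt;query&lt;/code&gt; parameter:&lt;/p&gt;

```javascript
// Build a SPARQL 1.1 Protocol query request. The endpoint URL passed
// in below is a placeholder; a real Neptune cluster has its own address.
function sparqlRequest(endpoint, query) {
  return {
    url: endpoint + '?query=' + encodeURIComponent(query),
    headers: { Accept: 'application/sparql-results+json' }
  };
}

const req = sparqlRequest(
  'https://example.org/sparql',
  'SELECT ?s WHERE { ?s ?p ?o } LIMIT 1'
);
console.log(req.url);
```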
&lt;p&gt;I&amp;rsquo;m looking forward to playing with SPARQL on AWS Neptune and will certainly be reporting back about my experiences here.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>SPARQL queries of Beatles recording sessions</title>
      <link>https://www.bobdc.com/blog/sparql-queries-of-beatles-reco/</link>
      <pubDate>Sun, 19 Nov 2017 10:40:53 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-queries-of-beatles-reco/</guid>
      
      
      <description><div>Who played what when?</div><div>&lt;img id=&#34;idm140697882958784&#34; src=&#34;https://www.bobdc.com/img/main/sparqlbeatles.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SPARQL and Beatles logos&#34;/&gt;
&lt;p&gt;While listening to the song &lt;a href=&#34;https://www.youtube.com/watch?v=ERoS6y5zE0Y&#34;&gt;Dear Life&lt;/a&gt; on the new Beck album, I wondered who played the piano on the Beatles&amp;rsquo; &lt;a href=&#34;https://www.youtube.com/watch?v=xT4a5RYLBSw&#34;&gt;Martha My Dear&lt;/a&gt;. A web search found the website &lt;a href=&#34;https://www.beatlesbible.com&#34;&gt;Beatles Bible&lt;/a&gt;, where the &lt;a href=&#34;https://www.beatlesbible.com/songs/martha-my-dear/&#34;&gt;Martha My Dear&lt;/a&gt; page showed that it was Paul.&lt;/p&gt;
&lt;p&gt;This was not a big surprise, but one pleasant surprise was how that page listed absolutely everyone who played on the song and what they played. For example, a musician named Leon Calvert played both trumpet and flugelhorn. The site&amp;rsquo;s &lt;a href=&#34;https://www.beatlesbible.com/songs/&#34;&gt;Beatles&amp;rsquo; Songs&lt;/a&gt; page links to pages for every song, listing everyone who played on them, with very few exceptions&amp;ndash;for example, for giant Phil Spector productions like &lt;a href=&#34;https://www.beatlesbible.com/songs/the-long-and-winding-road/&#34;&gt;The Long and Winding Road&lt;/a&gt;, it does list all the instruments, but not who played them. On the other hand, for the orchestra on &lt;a href=&#34;https://www.beatlesbible.com/songs/a-day-in-the-life/&#34;&gt;A Day in the Life&lt;/a&gt;, it lists the individual names of all 12 violin players, all 4 violists, and the other 25 or so musicians who joined the Fab Four for that.&lt;/p&gt;
&lt;p&gt;An especially nice surprise on this website was how syntactically consistent the listings were, leading me to think &amp;ldquo;with some curl commands, python scripting, and some regular expressions, I could, &lt;a href=&#34;https://www.youtube.com/watch?v=F3kky9yMm14#t=0m05s&#34;&gt;dare I say it&lt;/a&gt;, convert all these listings to an RDF database of everyone who played on everything, then do some really cool SPARQL queries!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;So I did, and the RDF is available in the file &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/BeatlesMusicians.ttl&#34;&gt;BeatlesMusicians.ttl&lt;/a&gt;. The great part about having this is the ability to query across the songs to find out things such as how many different people played a given instrument on Beatles recordings or what songs a given person may have played on, regardless of instrument. In a pop music geek kind of way, it&amp;rsquo;s been kind of exciting to think that I could ask and answer questions about the Beatles that may have never been answered before.&lt;/p&gt;
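&lt;p&gt;The conversion step can be sketched in a few lines of Python. Everything below is illustrative rather than taken from my actual scripts: the credits-line format, the helper names, and the one-letter prefixes are all stand-ins.&lt;/p&gt;

```python
import re

# Hypothetical credits line of the form "Name: instrument, instrument".
# (Illustrative only; the real Beatles Bible markup is more varied.)
CREDIT = re.compile(r"^([A-Za-z' .-]+): (.+)$")

def to_camel(text):
    """Turn 'Martha My Dear' into 'MarthaMyDear' for a Turtle local name."""
    parts = re.split(r"[^A-Za-z0-9]+", text)
    return "".join(p[:1].upper() + p[1:] for p in parts if p)

def credits_to_turtle(song, lines):
    """Emit one triple per (song, instrument, musician) combination."""
    triples = []
    for line in lines:
        match = CREDIT.match(line)
        if not match:
            continue
        musician = "m:" + to_camel(match.group(1))
        for instrument in match.group(2).split(", "):
            triples.append("t:{} i:{} {} .".format(
                to_camel(song), instrument.replace(" ", ""), musician))
    return triples

print(credits_to_turtle("Martha My Dear", ["Leon Calvert: trumpet, flugelhorn"]))
```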
&lt;p&gt;Here are three typical triples. All of these resources have corresponding rdfs:label values to make query output look nicer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;t:HereComesTheSun i:Moogsynthesiser  m:GeorgeHarrison .
t:EleanorRigby    i:cello            m:NormanJones, m:DerekSimpson .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are some of the queries I entered.&lt;/p&gt;
&lt;h2 id=&#34;idm140697882945360&#34;&gt;Who ever played piano for the Beatles, and on how many songs?&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  i:     &amp;lt;http://learningsparql.com/ns/instrument/&amp;gt; 
PREFIX  rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 


SELECT ?pianistName (COUNT(?pianist) AS ?pianistCount)  WHERE {
  ?song i:piano ?pianist .
  ?pianist rdfs:label ?pianistName . 
}
GROUP BY ?pianistName
ORDER BY DESC(?pianistCount)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-------------------------------------
| pianistName        | pianistCount |
=====================================
| &amp;quot;Paul McCartney&amp;quot;   | 60           |
| &amp;quot;George Martin&amp;quot;    | 22           |
| &amp;quot;John Lennon&amp;quot;      | 16           |
| &amp;quot;John &#39;Duff&#39; Lowe&amp;quot; | 2            |
| &amp;quot;Chris Thomas&amp;quot;     | 1            |
| &amp;quot;Kenny Powell&amp;quot;     | 1            |
| &amp;quot;Mal Evans&amp;quot;        | 1            |
| &amp;quot;Ringo Starr&amp;quot;      | 1            |
-------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Paul&amp;rsquo;s number one spot is no surprise, and these results and other data support the assertion that George Martin truly was the fifth Beatle. Seeing &lt;a href=&#34;https://en.wikipedia.org/wiki/Chris_Thomas_(record_producer)&#34;&gt;Chris Thomas&lt;/a&gt; there was a surprise to me; he went on to produce the Sex Pistols album, the first three Pretenders albums, the second through fifth Roxy Music albums, and many more classics. And we have to wonder: &amp;ldquo;what song had Ringo on piano?&amp;rdquo; As we&amp;rsquo;ll see, that was easy enough to query.&lt;/p&gt;
&lt;p&gt;This variation on the query above is slightly broader, because it looks for people who played any instruments with the string &amp;ldquo;piano&amp;rdquo; in their name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 


SELECT ?pianistName (COUNT(?pianist) AS ?pianistCount)  WHERE {
  ?instrument rdfs:label ?instrumentName . 
  FILTER(contains(?instrumentName,&amp;quot;piano&amp;quot;))
  ?song ?instrument ?pianist . 
  ?pianist rdfs:label ?pianistName . 
}
GROUP BY ?pianistName
ORDER BY DESC(?pianistCount)


-------------------------------------
| pianistName        | pianistCount |
=====================================
| &amp;quot;Paul McCartney&amp;quot;   | 67           |
| &amp;quot;George Martin&amp;quot;    | 22           |
| &amp;quot;John Lennon&amp;quot;      | 20           |
| &amp;quot;Billy Preston&amp;quot;    | 6            |
| &amp;quot;Chris Thomas&amp;quot;     | 2            |
| &amp;quot;John &#39;Duff&#39; Lowe&amp;quot; | 2            |
| &amp;quot;Kenny Powell&amp;quot;     | 1            |
| &amp;quot;Mal Evans&amp;quot;        | 1            |
| &amp;quot;Nicky Hopkins&amp;quot;    | 1            |
| &amp;quot;Ringo Starr&amp;quot;      | 1            |
-------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This raises Paul and John&amp;rsquo;s numbers and adds Nicky Hopkins (who also did important piano work for the Stones, the Kinks, and the Who) and Billy Preston, who in addition to the &lt;a href=&#34;https://www.youtube.com/watch?v=p6gKe9Fr2ok#t=1m15s&#34;&gt;electric piano on Get Back&lt;/a&gt;, apparently played on five other songs. (The increase in numbers isn&amp;rsquo;t all from electric pianos, but also from the &lt;a href=&#34;https://en.wikipedia.org/wiki/Pianet&#34;&gt;pianet&lt;/a&gt; that John and Paul each played once or twice.)&lt;/p&gt;
&lt;h2 id=&#34;idm140697882935536&#34;&gt;What song had Ringo on piano?&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 
PREFIX  i:     &amp;lt;http://learningsparql.com/ns/instrument/&amp;gt; 
PREFIX  m:     &amp;lt;http://learningsparql.com/ns/musician/&amp;gt; 


SELECT ?song WHERE {
  ?songURI i:piano m:RingoStarr .
  ?songURI rdfs:label ?song .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is a White Album song that Ringo apparently wrote himself:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;----------------------------
| song                     |
============================
| &amp;quot;Don&#39;t Pass Me By&amp;quot;       |
----------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;idm140697882933024&#34;&gt;Who were all the cellists the Beatles ever used, and on what songs?&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  i:     &amp;lt;http://learningsparql.com/ns/instrument/&amp;gt; 
PREFIX  rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 


SELECT ?name ?songTitle WHERE {
  ?song i:cello ?musician .
  ?song rdfs:label ?songTitle .
  ?musician rdfs:label ?name . 
}
ORDER BY ?name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-------------------------------------------------------
| name                  | songTitle                   |
=======================================================
| &amp;quot;Alan Dalziel&amp;quot;        | &amp;quot;A Day In The Life&amp;quot;         |
| &amp;quot;Alan Dalziel&amp;quot;        | &amp;quot;She&#39;s Leaving Home&amp;quot;        |
| &amp;quot;Alex Nifosi&amp;quot;         | &amp;quot;A Day In The Life&amp;quot;         |
| &amp;quot;Allen Ford&amp;quot;          | &amp;quot;Within You Without You&amp;quot;    |
| &amp;quot;Bram Martin&amp;quot;         | &amp;quot;I Am The Walrus&amp;quot;           |
| &amp;quot;Dennis Vigay&amp;quot;        | &amp;quot;A Day In The Life&amp;quot;         |
| &amp;quot;Dennis Vigay&amp;quot;        | &amp;quot;She&#39;s Leaving Home&amp;quot;        |
| &amp;quot;Derek Simpson&amp;quot;       | &amp;quot;Eleanor Rigby&amp;quot;             |
| &amp;quot;Derek Simpson&amp;quot;       | &amp;quot;Strawberry Fields Forever&amp;quot; |
| &amp;quot;Eldon Fox&amp;quot;           | &amp;quot;Glass Onion&amp;quot;               |
| &amp;quot;Eldon Fox&amp;quot;           | &amp;quot;I Am The Walrus&amp;quot;           |
| &amp;quot;Eldon Fox&amp;quot;           | &amp;quot;Piggies&amp;quot;                   |
| &amp;quot;Francisco Gabarro&amp;quot;   | &amp;quot;A Day In The Life&amp;quot;         |
| &amp;quot;Francisco Gabarro&amp;quot;   | &amp;quot;Yesterday&amp;quot;                 |
| &amp;quot;Frederick Alexander&amp;quot; | &amp;quot;Martha My Dear&amp;quot;            |
| &amp;quot;Jack Holmes&amp;quot;         | &amp;quot;All You Need Is Love&amp;quot;      |
| &amp;quot;John Hall&amp;quot;           | &amp;quot;Strawberry Fields Forever&amp;quot; |
| &amp;quot;Lionel Ross&amp;quot;         | &amp;quot;All You Need Is Love&amp;quot;      |
| &amp;quot;Lionel Ross&amp;quot;         | &amp;quot;I Am The Walrus&amp;quot;           |
| &amp;quot;Norman Jones&amp;quot;        | &amp;quot;Eleanor Rigby&amp;quot;             |
| &amp;quot;Norman Jones&amp;quot;        | &amp;quot;Strawberry Fields Forever&amp;quot; |
| &amp;quot;Peter Beavan&amp;quot;        | &amp;quot;Within You Without You&amp;quot;    |
| &amp;quot;Peter Willison&amp;quot;      | &amp;quot;Blue Jay Way&amp;quot;              |
| &amp;quot;Reginald Kilbey&amp;quot;     | &amp;quot;Glass Onion&amp;quot;               |
| &amp;quot;Reginald Kilbey&amp;quot;     | &amp;quot;Martha My Dear&amp;quot;            |
| &amp;quot;Reginald Kilbey&amp;quot;     | &amp;quot;Piggies&amp;quot;                   |
| &amp;quot;Reginald Kilbey&amp;quot;     | &amp;quot;Within You Without You&amp;quot;    |
| &amp;quot;Terry Weil&amp;quot;          | &amp;quot;I Am The Walrus&amp;quot;           |
| &amp;quot;Uncredited&amp;quot;          | &amp;quot;Let It Be&amp;quot;                 |
-------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have no reason to recognize any of the names here, but when I sent the URL of the &lt;a href=&#34;https://www.beatlesbible.com/songs/shes-leaving-home/&#34;&gt;She&amp;rsquo;s Leaving Home&lt;/a&gt; page to a friend who was a London session string player in the sixties, he said that the members of the double string quartet on that song were very top people and that some were friends of his.&lt;/p&gt;
&lt;h2 id=&#34;idm140697882924080&#34;&gt;Who played on how many songs, period?&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 


SELECT ?playerName (COUNT(?player) AS ?playerCount)  WHERE {
  ?song ?instrument ?player . 
  ?player rdfs:label ?playerName . 
}
GROUP BY ?playerName
# don&#39;t bother with people who only played on one song
HAVING (COUNT(?player) &amp;gt; 1)        
ORDER BY DESC(?playerCount)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;--------------------------------------
| playerName           | playerCount |
======================================
| &amp;quot;Paul McCartney&amp;quot;     | 678         |
| &amp;quot;John Lennon&amp;quot;        | 576         |
| &amp;quot;George Harrison&amp;quot;    | 502         |
| &amp;quot;Ringo Starr&amp;quot;        | 412         |
| &amp;quot;Uncredited&amp;quot;         | 58          |
| &amp;quot;George Martin&amp;quot;      | 45          |
| &amp;quot;Unknown&amp;quot;            | 28          |
| &amp;quot;Mal Evans&amp;quot;          | 16          |
| &amp;quot;Pete Best&amp;quot;          | 14          |
| &amp;quot;Billy Preston&amp;quot;      | 11          |
| &amp;quot;Tony Sheridan&amp;quot;      | 8           |
| &amp;quot;Chris Thomas&amp;quot;       | 6           |
| &amp;quot;John Underwood&amp;quot;     | 5           |
| &amp;quot;Neil Aspinall&amp;quot;      | 5           |
| &amp;quot;Sidney Sax&amp;quot;         | 5           |
| &amp;quot;Yoko Ono&amp;quot;           | 5           |
| &amp;quot;David Mason&amp;quot;        | 4           |
| &amp;quot;Jeff Lynne&amp;quot;         | 4           |
| &amp;quot;Reginald Kilbey&amp;quot;    | 4           |
| &amp;quot;Eldon Fox&amp;quot;          | 3           |
| &amp;quot;Eric Bowie&amp;quot;         | 3           |
| &amp;quot;Erich Gruenberg&amp;quot;    | 3           |
| &amp;quot;Harry Klein&amp;quot;        | 3           |
| &amp;quot;Henry Datyner&amp;quot;      | 3           |
| &amp;quot;Leon Calvert&amp;quot;       | 3           |
| &amp;quot;Neil Sanders&amp;quot;       | 3           |
| &amp;quot;Pattie Harrison&amp;quot;    | 3           |
| &amp;quot;Rex Morris&amp;quot;         | 3           |
| &amp;quot;Stuart Sutcliffe&amp;quot;   | 3           |
| &amp;quot;Alan Civil&amp;quot;         | 2           |
| &amp;quot;Alan Dalziel&amp;quot;       | 2           |
| &amp;quot;Andy White&amp;quot;         | 2           |
| &amp;quot;Bill Povey&amp;quot;         | 2           |
| &amp;quot;Brian Jones&amp;quot;        | 2           |
| &amp;quot;Colin Hanton&amp;quot;       | 2           |
| &amp;quot;Dennis Vigay&amp;quot;       | 2           |
| &amp;quot;Dennis Walton&amp;quot;      | 2           |
| &amp;quot;Derek Simpson&amp;quot;      | 2           |
| &amp;quot;Derek Watkins&amp;quot;      | 2           |
| &amp;quot;Eric Clapton&amp;quot;       | 2           |
| &amp;quot;Francisco Gabarro&amp;quot;  | 2           |
| &amp;quot;Fred Lucas&amp;quot;         | 2           |
| &amp;quot;Freddy Clayton&amp;quot;     | 2           |
| &amp;quot;Geoff Emerick&amp;quot;      | 2           |
| &amp;quot;Gordon Pearce&amp;quot;      | 2           |
| &amp;quot;Irene King&amp;quot;         | 2           |
| &amp;quot;Jack Greene&amp;quot;        | 2           |
| &amp;quot;Jack Rothstein&amp;quot;     | 2           |
| &amp;quot;John &#39;Duff&#39; Lowe&amp;quot;   | 2           |
| &amp;quot;Johnnie Scott&amp;quot;      | 2           |
| &amp;quot;Jurgen Hess&amp;quot;        | 2           |
| &amp;quot;Keith Cummings&amp;quot;     | 2           |
| &amp;quot;Kenneth Essex&amp;quot;      | 2           |
| &amp;quot;Leo Birnbaum&amp;quot;       | 2           |
| &amp;quot;Lionel Ross&amp;quot;        | 2           |
| &amp;quot;Mahapurush Misra&amp;quot;   | 2           |
| &amp;quot;Marianne Faithfull&amp;quot; | 2           |
| &amp;quot;Mick Jagger&amp;quot;        | 2           |
| &amp;quot;Mike Redway&amp;quot;        | 2           |
| &amp;quot;Norman Jones&amp;quot;       | 2           |
| &amp;quot;Norman Lederman&amp;quot;    | 2           |
| &amp;quot;Norman Smith&amp;quot;       | 2           |
| &amp;quot;Other musicians&amp;quot;    | 2           |
| &amp;quot;Pat Whitmore&amp;quot;       | 2           |
| &amp;quot;Ralph Elman&amp;quot;        | 2           |
| &amp;quot;Ronald Thomas&amp;quot;      | 2           |
| &amp;quot;Stephen Shingles&amp;quot;   | 2           |
| &amp;quot;Tony Gilbert&amp;quot;       | 2           |
| &amp;quot;Tony Tunstall&amp;quot;      | 2           |
| &amp;quot;Tristan Fry&amp;quot;        | 2           |
| &amp;quot;Victor Spinetti&amp;quot;    | 2           |
--------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No big surprises in the top 10, but there definitely are some after that. For example&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;idm140697882915408&#34;&gt;What 4 Beatles tracks did ELO founder Jeff Lynne play on?&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 
PREFIX  m:     &amp;lt;http://learningsparql.com/ns/musician/&amp;gt; 


SELECT ?instrument ?songName WHERE {
  ?song ?instrumentURI m:JeffLynne .
  ?song rdfs:label ?songName .
  ?instrumentURI rdfs:label ?instrument . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apparently, after John died, he sang and played overdubs with Paul, George, and Ringo on some of John&amp;rsquo;s demos to create &amp;ldquo;new&amp;rdquo; Beatles material to go with the &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Beatles_Anthology&#34;&gt;Anthology&lt;/a&gt; documentary and rereleases.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------------
| instrument       | songName         |
=======================================
| &amp;quot;backing vocals&amp;quot; | &amp;quot;Real Love&amp;quot;      |
| &amp;quot;guitar&amp;quot;         | &amp;quot;Real Love&amp;quot;      |
| &amp;quot;harmony vocals&amp;quot; | &amp;quot;Free As A Bird&amp;quot; |
| &amp;quot;guitar&amp;quot;         | &amp;quot;Free As A Bird&amp;quot; |
---------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you look through the big list of musicians above, you&amp;rsquo;ll probably want to plug more names into that last query. For example, any Beatles or Eric Clapton fan knows that he played the guitar solo on &lt;a href=&#34;https://www.youtube.com/watch?v=D-dONCnY_Yg#t=2m00s&#34;&gt;While My Guitar Gently Weeps&lt;/a&gt;, but why does he get a &amp;ldquo;2&amp;rdquo; up there? It turns out that he and some other big names sang backing vocals on &lt;a href=&#34;https://www.beatlesbible.com/songs/all-you-need-is-love/&#34;&gt;All You Need Is Love&lt;/a&gt;.&lt;/p&gt;
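&lt;p&gt;For example, assuming that Eric Clapton&amp;rsquo;s resource follows the same naming pattern as the musicians shown above (&lt;code&gt;m:EricClapton&lt;/code&gt; is my guess at the local name), a one-line change to that last query lists his credits:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 
PREFIX  m:     &amp;lt;http://learningsparql.com/ns/musician/&amp;gt; 


SELECT ?instrument ?songName WHERE {
  ?song ?instrumentURI m:EricClapton .
  ?song rdfs:label ?songName .
  ?instrumentURI rdfs:label ?instrument . 
}
&lt;/code&gt;&lt;/pre&gt;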
&lt;p&gt;Let me know what kinds of queries and results you come up with!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>An HTML form trick to add some convenience to life</title>
      <link>https://www.bobdc.com/blog/an-html-form-trick-to-add-some/</link>
      <pubDate>Sun, 29 Oct 2017 10:07:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/an-html-form-trick-to-add-some/</guid>
      
      
      <description><div>With a little JavaScript as needed.</div><div>&lt;p&gt;On the computers that I use the most, the browser home page is an HTML file with links to my favorite pages and a &amp;ldquo;single&amp;rdquo; form that lets me search the sites that I search the most. I can enter a search term in the field for any of the sites, press Enter, and then that site gets searched. The two tricks that I use to create these fields have been handy enough that I thought I&amp;rsquo;d share them in case they&amp;rsquo;re useful to others.&lt;/p&gt;
&lt;p&gt;I quote the word &amp;ldquo;single&amp;rdquo; above because it appears to be a single form but is actually multiple little forms in the HTML. Here is an example with four of my entries; enter something into any of the fields and press Enter to see what I mean:&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;wikipedia    &lt;form id=&#34;idm139672803255216&#34; action=&#34;http://en.wikipedia.org/wiki/Special:Search&#34;&gt; &lt;input id=&#34;idm139672803254608&#34; type=&#34;text&#34; name=&#34;search&#34; autofocus=&#34;autofocus&#34;/&gt; &lt;/form&gt;
youtube      &lt;form id=&#34;idm139672803252864&#34; action=&#34;http://www.youtube.com/results&#34;&gt; &lt;input id=&#34;idm139672803252320&#34; type=&#34;text&#34; name=&#34;search_query&#34;/&gt; &lt;/form&gt;
dictionary   &lt;form id=&#34;idm139672803250800&#34; name=&#34;dictionaryform&#34; action=&#34;javascript:window.location.href = &#39;http://www.dictionary.com/browse/&#39;.concat(document.dictionaryform[&#39;queryword&#39;].value)&#34;&gt; &lt;input id=&#34;idm139672803249888&#34; type=&#34;text&#34; name=&#34;queryword&#34;/&gt; &lt;/form&gt;
whois        &lt;form id=&#34;idm139672803248416&#34; name=&#34;whoisform&#34; action=&#34;javascript:window.location.href = &#39;http://www.whois.com/whois/&#39;.concat(document.whoisform[&#39;domain&#39;].value)&#34;&gt; &lt;input id=&#34;idm139672803247520&#34; type=&#34;text&#34; name=&#34;domain&#34;/&gt; &lt;/form&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The first two fields search the way most search forms do, by passing a search string as a parameter to some back-end process. To add one of these fields to my form, I just had to look at the source of the actual website&amp;rsquo;s search form to find out what variable it was passing to what URL and then reproduce that in a little form around that field in my home page file. For Wikipedia, I set the form&amp;rsquo;s &lt;code&gt;action&lt;/code&gt; attribute to &amp;ldquo;&lt;a href=&#34;http://en.wikipedia.org/wiki/Special:Search&#34;&gt;http://en.wikipedia.org/wiki/Special:Search&lt;/a&gt;&amp;rdquo; and the &lt;code&gt;input&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;name&lt;/code&gt; attribute to &amp;ldquo;search&amp;rdquo;. This way, if I enter &amp;ldquo;foobar&amp;rdquo; in my version of their search field above, the form creates the URL &lt;a href=&#34;https://en.wikipedia.org/wiki/Special:Search?search=foobar&#34;&gt;https://en.wikipedia.org/wiki/Special:Search?search=foobar&lt;/a&gt; to perform the search, and it works. (The &lt;code&gt;input&lt;/code&gt; element of the Wikipedia form also has its &lt;code&gt;autofocus&lt;/code&gt; attribute set to &amp;ldquo;autofocus&amp;rdquo; so that when a browser displays the page, the cursor is in that field, and I can then just press Tab a few times to quickly get to the others.) For YouTube there&amp;rsquo;s a different URL and the search parameter variable name is &amp;ldquo;search_query&amp;rdquo;, so I set the &lt;code&gt;name&lt;/code&gt; attribute on that second little form&amp;rsquo;s &lt;code&gt;input&lt;/code&gt; element to have that value.&lt;/p&gt;
&lt;p&gt;The third and fourth input fields above search websites with a more RESTful interface, so instead of passing a value in a particular variable name to a CGI script, they just construct a URL with the search term at the end. From within a form, this is actually trickier to do than the CGI way, because some JavaScript must be embedded in the form&amp;rsquo;s &lt;code&gt;action&lt;/code&gt; attribute to concatenate the entered value onto the appropriate URL and then send the browser to the resulting URL. You can see how this is done with a View Source of this blog entry. (Note how verbose the JavaScript way to grab that form value is&amp;ndash;I&amp;rsquo;d appreciate any suggestions for a simpler way.) You&amp;rsquo;ll also see that, to send the browser to the appropriate destination, the form sets the &lt;code&gt;href&lt;/code&gt; property of the &lt;code&gt;window.location&lt;/code&gt; object to the new URL.&lt;/p&gt;
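&lt;p&gt;Both URL styles can be mimicked outside the browser. Here is a rough sketch with Python&amp;rsquo;s &lt;code&gt;urllib&lt;/code&gt; standing in for the form machinery; the helper names are mine:&lt;/p&gt;

```python
from urllib.parse import quote, urlencode

def cgi_style(base, params):
    """Wikipedia/YouTube style: the search term travels as a query parameter."""
    return base + "?" + urlencode(params)

def rest_style(base, term):
    """Dictionary/whois style: the search term is appended to the path."""
    return base + quote(term)

print(cgi_style("https://en.wikipedia.org/wiki/Special:Search", {"search": "foobar"}))
print(rest_style("http://www.dictionary.com/browse/", "foobar"))
```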
&lt;p&gt;Just about all the search forms I&amp;rsquo;ve found fall into one of these two categories, so for my master search forms at home and at work I&amp;rsquo;ve also added fields to search Google Maps, JIRA, Amazon, and more. You can see three more examples at the end of my entry from last April, &lt;a href=&#34;https://www.bobdc.com/blog/the-wikidata-data-model-and-yo&#34;&gt;The Wikidata data model and your SPARQL queries&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It all makes a nice example of doing a little fun scripting, instead of real work, to save upwards of minutes a day!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://xkcd.com/1205/&#34;&gt;&lt;img id=&#34;idm139672803236800&#34; src=&#34;https://imgs.xkcd.com/comics/is_it_worth_the_time.png&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;xkcd cartoon&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
    </item>
    
    <item>
      <title>Understanding activation functions better</title>
      <link>https://www.bobdc.com/blog/understanding-activation-funct/</link>
      <pubDate>Sun, 17 Sep 2017 13:11:14 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/understanding-activation-funct/</guid>
      
      
      <description><div>And making neural networks look a little less magic.</div><div>&lt;p&gt;&lt;a href=&#34;https://dataskeptic.com/blog/episodes/2017/activation-functions&#34;&gt;&lt;img id=&#34;idm139816220203472&#34; width=&#34;240&#34; src=&#34;https://www.bobdc.com/img/main/activationfunctions.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;activation function graphs&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Trying to get my data science and machine learning knowledge more caught up with my colleagues at &lt;a href=&#34;http://www.ccri.com&#34;&gt;CCRi&lt;/a&gt;, I have been regularly listening to the podcasts &lt;a href=&#34;http://www.thetalkingmachines.com/&#34;&gt;Talking Machines&lt;/a&gt; and &lt;a href=&#34;http://lineardigressions.com/&#34;&gt;Linear Digressions&lt;/a&gt;. One colleague recently recommended &lt;a href=&#34;https://dataskeptic.com/&#34;&gt;Data Skeptic&lt;/a&gt;, which I had tried before and didn&amp;rsquo;t get hooked on, but after listening to their episode on &lt;a href=&#34;https://dataskeptic.com/blog/episodes/2017/activation-functions&#34;&gt;Activation Functions&lt;/a&gt; I am now hooked. I am so hooked that I am going back through their four-year history and listening to all of their episodes marked &amp;ldquo;[MINI]&amp;rdquo;; these are shorter episodes focused on single specific important concepts, like the activation function episode.&lt;/p&gt;
&lt;p&gt;In my blog entry &lt;a href=&#34;https://www.bobdc.com/blog/a-modern-neural-network-in-11&#34;&gt;A modern neural network in 11 lines of Python&lt;/a&gt; last December, I quoted Per Harald Borgen&amp;rsquo;s &lt;a href=&#34;https://medium.com/learning-new-stuff/how-to-learn-neural-networks-758b78f2736e#.qkx5pzw2b&#34;&gt;Learning How To Code Neural Networks&lt;/a&gt;, where he says that backpropagation &amp;ldquo;essentially means that you look at how wrong the network guessed, and then adjust the networks weights accordingly.&amp;rdquo; I now have a better understanding of a key design decision made when adjusting those weights. (All corrections to my explanations below are welcome.)&lt;/p&gt;
&lt;p&gt;You can&amp;rsquo;t adjust the weights with just any old number. For one thing, they usually have to fit within a certain range. If your input value is between 1 and 5000 and the adjustment function expects a number between 0 and 1, you could divide the number by 5000 before passing it along, but that won&amp;rsquo;t give your model much help adjusting its future guesses. Division is linear, which means that if you plot a graph where the function&amp;rsquo;s inputs are the x values and the outputs for each input are the corresponding y values, the result is a straight line. (&lt;a href=&#34;https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net&#34;&gt;Some technical definitions&lt;/a&gt; of linearity consider that one to be overly simplified, but the &lt;a href=&#34;https://en.wikipedia.org/wiki/Linearity&#34;&gt;Wikipedia entry&lt;/a&gt; is pretty close.) Combining linear functions just gives you another linear function, and a neural network&amp;rsquo;s goal is to converge on a value, which requires non-linearity. As Alan Richmond wrote in &lt;a href=&#34;http://python3.codes/a-neural-network-in-python-part-2-activation-functions-bias-sgd-etc/&#34;&gt;A Neural Network in Python, Part 2: activation functions, bias, SGD, etc.&lt;/a&gt;, without non-linearity, &amp;ldquo;adding layers adds nothing that couldn&amp;rsquo;t be done with just one layer,&amp;rdquo; and those extra layers are what give deep learning its depth.&lt;/p&gt;
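&lt;p&gt;The &amp;ldquo;adding layers adds nothing&amp;rdquo; point is easy to check for yourself: compose two linear functions (the coefficients here are arbitrary) and you get back another linear function.&lt;/p&gt;

```python
def f(x):
    return 2 * x + 3    # one linear "layer": scale and shift

def g(x):
    return 0.5 * x - 1  # another linear "layer"

def h(x):
    return g(f(x))      # two layers of purely linear processing

# h collapses to a single linear function: h(x) = 0.5*(2x + 3) - 1 = x + 0.5,
# so the second layer added nothing that one layer could not do alone.
for x in (-2.0, 0.0, 3.0):
    assert h(x) == x + 0.5
print("g(f(x)) is just x + 0.5")
```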
&lt;p&gt;So, squeezing the input value down within a particular range won&amp;rsquo;t be enough. The sigmoid function that I described last December maps the input value to an S-curve so that greater positive values and lower negative values affect the output less than input values that are closer to 0. Ultimately, it does return a value between 0 and 1, and that&amp;rsquo;s what the 11-lines-of-Python network used to adjust weights in its earlier layer.&lt;/p&gt;
&lt;p&gt;For some situations, though, instead of a value between 0 and 1, a value between -1 and 1 might be more useful&amp;ndash;for example, if there is a potential need to adjust a weight downward. The &lt;a href=&#34;https://en.wikipedia.org/wiki/Hyperbolic_function&#34;&gt;hyperbolic tangent function&lt;/a&gt; also returns values that follow an S-curve, but they fall between -1 and 1. (While the regular tangent function you may have learned about in trigonometry class is built around a circle, the hyperbolic tangent function, or tanh, is built around a hyperbola. I don&amp;rsquo;t completely understand the difference, but when I look at a &lt;a href=&#34;https://www.varsitytutors.com/hotmath/hotmath_help/topics/graphing-tangent-function&#34;&gt;graph of the regular tangent&lt;/a&gt; function, I have a much more difficult time picturing how it would be helpful for tweaking a weight&amp;rsquo;s value.)&lt;/p&gt;
&lt;p&gt;When you choose between a sigmoid function, a tanh function, and one of the other alternatives mentioned below, you&amp;rsquo;re choosing an &lt;a href=&#34;https://en.wikipedia.org/wiki/Activation_function&#34;&gt;activation function&lt;/a&gt;. The best choice depends on what you&amp;rsquo;re trying to do with your data, and the knowledge of what each can do for you is an important part of the model-building process. (The need for this knowledge when building a machine learning model is one reason that machine learning cannot be commoditized as easily as many people claim; see the &amp;ldquo;MLaaS dies a second death&amp;rdquo; section of Bradford Cross&amp;rsquo;s &lt;a href=&#34;http://www.bradfordcross.com/blog/2017/3/3/five-ai-startup-predictions-for-2017&#34;&gt;Five AI Startup Predictions for 2017&lt;/a&gt; for an excellent discussion of some related issues.)&lt;/p&gt;
&lt;p&gt;The Data Skeptic podcast episode covers two other possible activation functions: a step function, which only outputs 0 or 1, and the &lt;a href=&#34;https://en.wikipedia.org/wiki/Rectifier_(neural_networks)&#34;&gt;Rectified Linear Unit&lt;/a&gt; (ReLU) function, which sets negative values to 0 and leaves others alone. ReLU activation functions come up in a &lt;a href=&#34;https://github.com/temerick/pytorch-lnl/blob/master/PyTorch%20lunch%20and%20learn.ipynb&#34;&gt;Jupyter notebook&lt;/a&gt; that accompanied the CCRi blog entry &lt;a href=&#34;http://www.ccri.com/2017/05/31/deep-learning-pytorch-jupyter-notebook/&#34;&gt;Deep Learning with PyTorch in a Jupyter notebook&lt;/a&gt; that I wrote last May, and they also appear in an earlier, more detailed draft of a recent CCRi blog entry that I edited called &lt;a href=&#34;http://www.ccri.com/2017/08/25/deep-reinforcement-learning-win-battleship/&#34;&gt;Deep Reinforcement Learning-of how to win at Battleship&lt;/a&gt;. Both times, I had no idea what a ReLU function was. Now I do; maybe I am catching up with these colleagues after all.&lt;/p&gt;
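&lt;p&gt;All four activation functions mentioned above are short enough to write out. This is just the shape of each function, with none of the surrounding network plumbing:&lt;/p&gt;

```python
import math

def sigmoid(x):
    """Squashes any input into the range (0, 1) along an S-curve."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """The same S shape, but the output falls between -1 and 1."""
    return math.tanh(x)

def step(x):
    """Outputs only 0 or 1."""
    return 1.0 if x > 0 else 0.0

def relu(x):
    """Sets negative values to 0 and leaves the others alone."""
    return max(0.0, x)

for fn in (sigmoid, tanh, step, relu):
    print(fn.__name__, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```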
&lt;p&gt;If you want a better understanding of the choices developers make when designing neural network models to solve specific problems, I strongly recommend listening to the Data Skeptic podcast episode on activation functions, which is only 14 minutes. I especially liked its cornbread cooking examples, where questions of how much you might adjust the amount of different ingredients provided excellent examples of which activation functions would push numbers where you wanted them.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Images from Data Skeptic podcast page are &lt;a href=&#34;https://creativecommons.org/licenses/by-nc-sa/4.0/&#34;&gt;CC-BY-NC-SA&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
    </item>
    
    <item>
      <title>Validating RDF data with SHACL</title>
      <link>https://www.bobdc.com/blog/validating-rdf-data-with-shacl/</link>
      <pubDate>Sun, 20 Aug 2017 10:54:36 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/validating-rdf-data-with-shacl/</guid>
      
      
      <description><div>Setting some constraints--then violating them!</div><div>&lt;img id=&#34;idm140654375893984&#34; src=&#34;https://www.bobdc.com/img/main/shackles.jpg&#34; width=&#34;250&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;shackles&#34;/&gt;
&lt;p&gt;Last month, in &lt;a href=&#34;https://www.bobdc.com/blog/the-w3c-standard-constraint-la&#34;&gt;The W3C standard constraint language for RDF: SHACL&lt;/a&gt;, I described the history of this new standard that lets us define constraints on RDF data and an &lt;a href=&#34;https://github.com/TopQuadrant/shacl#user-content-command-line-usage&#34;&gt;open source tool&lt;/a&gt; that lets us identify where such constraints were violated. The presence of the standard, along with tools that implement it, will be a big help to the use of RDF in production environments.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a lot you can do with SHACL&amp;ndash;enough that the full collection of features available and their infrastructure can appear a bit complicated. I wanted to create some simple constraints for some simple data and then use the &lt;code&gt;shaclvalidate.sh&lt;/code&gt; tool to identify which parts of the data violated which constraints, and it went very nicely.&lt;/p&gt;
&lt;p&gt;I started by going through a &lt;a href=&#34;http://www.topquadrant.com/technology/shacl/tutorial/&#34;&gt;TopQuadrant tutorial&lt;/a&gt; that builds some SHACL exercises using their TopBraid Composer GUI tool (free edition available by selecting &amp;ldquo;Free Edition&amp;rdquo; from the &amp;ldquo;Product&amp;rdquo; field on the &lt;a href=&#34;http://www.topquadrant.com/downloads/topbraid-composer-install/&#34;&gt;TopBraid Composer Installation&lt;/a&gt; page). Then, after I examined the triples that Composer generated when I followed the tutorial&amp;rsquo;s steps, I created my own new example called &lt;code&gt;employees.ttl&lt;/code&gt; to run with &lt;code&gt;shaclvalidate.sh&lt;/code&gt;. (To make my example as stripped-down as possible, I used a &lt;a href=&#34;https://www.gnu.org/software/emacs/&#34;&gt;text editor&lt;/a&gt; for this, not Composer.) You can download my file &lt;a href=&#34;http://snee.com/bobdc.blog/files/employees.ttl&#34;&gt;right here&lt;/a&gt;; below I describe the file a few lines at a time to show what I was doing and how the pieces fit together.&lt;/p&gt;
&lt;p&gt;I started off with declarations for prefixes, a class, and a few properties for that class:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix hr: &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .
@prefix d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix sh: &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .


#### Regular RDFS modeling ####


hr:Employee a rdfs:Class .


hr:name
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee .


hr:hireDate
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:date .


hr:jobGrade
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:integer .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is nothing new and interesting there, but it&amp;rsquo;s worth reviewing why these declarations are useful: so that applications using instances of this class know more about it and can do more with it. For example, when generating a form to let users edit Employee instances, an application noting that &lt;code&gt;hr:hireDate&lt;/code&gt; has an &lt;code&gt;rdfs:range&lt;/code&gt; of &lt;code&gt;xsd:date&lt;/code&gt; might provide a date-picking widget on the form instead of just providing a text field to fill out. (And, if the application sees an additional property for this class declared someday, it can automatically generate a field for the new property on the edit form, so that this model really is driving application behavior.) These &lt;code&gt;rdfs:range&lt;/code&gt; values are &lt;em&gt;not&lt;/em&gt; there so that an automated process can check whether that instance data conforms to these types, although some applications may have done that. This is the hole that SHACL fills, as we will see below.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  #### Additional SHACL modeling ####


hr:Employee
# Following two lines are an alternative to the line above
#hr:EmployeeShape
#  sh:targetClass hr:Employee ;
   a sh:NodeShape ;
   sh:property hr:nameShape ;
   sh:property hr:jobGradeShape .


hr:nameShape
   sh:path hr:name ;
   sh:datatype xsd:string ;
   sh:minCount 1 ;
   sh:maxCount 1 .


hr:jobGradeShape
   sh:path hr:jobGrade ;
   sh:datatype xsd:integer ;
   sh:minCount 1 ;
   sh:maxCount 1 ;
   sh:minInclusive 1;
   sh:maxInclusive 7 .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The SHACL vocabulary is associated here with the prefix &lt;code&gt;sh:&lt;/code&gt;. Some of the best documentation of this vocabulary is right where it should be&amp;ndash;in &lt;code&gt;rdfs:comment&lt;/code&gt; values of the class and property declarations in &lt;a href=&#34;https://www.w3.org/ns/shacl.ttl&#34;&gt;https://www.w3.org/ns/shacl.ttl&lt;/a&gt;. (As we&amp;rsquo;ll see, &lt;a href=&#34;https://www.w3.org/TR/shacl/&#34;&gt;the spec itself&lt;/a&gt; is also a good place to find out what&amp;rsquo;s what.)&lt;/p&gt;
&lt;p&gt;Above, we see that &lt;code&gt;hr:Employee&lt;/code&gt;, which had already been declared to be an &lt;code&gt;rdfs:Class&lt;/code&gt;, is also declared to be an &lt;code&gt;sh:NodeShape&lt;/code&gt;. To quote a few of the &lt;code&gt;shacl.ttl&lt;/code&gt; vocabulary file&amp;rsquo;s &lt;code&gt;rdfs:comment&lt;/code&gt; values, &amp;ldquo;a shape is a collection of constraints that may be targeted for certain nodes,&amp;rdquo; &amp;ldquo;a node shape is a shape that specifies constraint [sic] that need to be met with respect to focus nodes,&amp;rdquo; and (quoting &lt;a href=&#34;https://www.w3.org/TR/shacl/#focusNodes&#34;&gt;the spec&lt;/a&gt; this time) &amp;ldquo;an RDF term that is validated against a shape using the triples from a data graph is called a focus node.&amp;rdquo; So, declaring &lt;code&gt;hr:Employee&lt;/code&gt; to also be a &lt;code&gt;sh:NodeShape&lt;/code&gt; lets it serve as a collection of constraints for certain nodes.&lt;/p&gt;
&lt;p&gt;Note the commented-out alternative lines after that first one. Instead of making the existing &lt;code&gt;hr:Employee&lt;/code&gt; class also serve as a collection of constraints for instances of that class, we could declare a separate new resource as an instance of &lt;code&gt;sh:NodeShape&lt;/code&gt; (in the commented-out example, a new instance called &lt;code&gt;hr:EmployeeShape&lt;/code&gt;) and go on to define the constraints there. How would the validator know that &lt;code&gt;hr:EmployeeShape&lt;/code&gt; was storing constraints for the &lt;code&gt;hr:Employee&lt;/code&gt; class? Because, as the last commented-out line shows, its &lt;code&gt;sh:targetClass&lt;/code&gt; property would point to the &lt;code&gt;hr:Employee&lt;/code&gt; class. (Thanks to my former TopQuadrant colleague Holger for helping me to understand how that works.)&lt;/p&gt;
&lt;p&gt;After naming the place to store the constraints, we create some using the SHACL vocabulary&amp;rsquo;s &lt;code&gt;sh:property&lt;/code&gt; property. The &lt;code&gt;rdfs:comment&lt;/code&gt; for this property in &lt;code&gt;shacl.ttl&lt;/code&gt; tells us that it &amp;ldquo;Links a shape to its property shapes.&amp;rdquo; In the SHACL files created by TopBraid Composer, it links to property shapes grouped together with blank nodes, but as you can see above, I pointed them at shapes for the Employee name and jobGrade properties that have their own URIs.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;hr:nameShape&lt;/code&gt; and &lt;code&gt;hr:jobGradeShape&lt;/code&gt; property shapes above are pretty self-explanatory. To require that exactly one value for each be included with each instance of &lt;code&gt;hr:Employee&lt;/code&gt;, I gave each an &lt;code&gt;sh:minCount&lt;/code&gt; and a &lt;code&gt;sh:maxCount&lt;/code&gt; value of 1. The property shapes also have data types specified, and unlike the &lt;code&gt;rdfs:range&lt;/code&gt; specifications for these properties above, these will be used for validation. For &lt;code&gt;hr:jobGradeShape&lt;/code&gt;, I also added &lt;code&gt;sh:minInclusive&lt;/code&gt; and &lt;code&gt;sh:maxInclusive&lt;/code&gt; values to restrict any data values to be from 1 to 7.&lt;/p&gt;
&lt;p&gt;The last part of &lt;code&gt;employees.ttl&lt;/code&gt; has four instances of &lt;code&gt;hr:Employee&lt;/code&gt;. The first meets all the defined constraints:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;d:e1
   a hr:Employee;
   hr:name &amp;quot;Barry Wom&amp;quot; ;
   hr:hireDate &amp;quot;2017-06-03&amp;quot; ;
   hr:jobGrade 6 .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When I comment out the other three instances and run shaclvalidate on the file, it gives me back a validation report, in the form of triples, about how everything is cool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  @prefix sh:    &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .


[ a            sh:ValidationReport ;
  sh:conforms  true
] .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next instance lacks the required &lt;code&gt;hr:jobGrade&lt;/code&gt; value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;d:e2
   a hr:Employee;
   hr:name &amp;quot;Ron Nasty&amp;quot; ;
   hr:hireDate &amp;quot;2017-08-11&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After I uncommented this instance in &lt;code&gt;employees.ttl&lt;/code&gt;, shaclvalidate told me this about it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d:     &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix sh:    &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix hr:    &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .


[ a            sh:ValidationReport ;
  sh:conforms  false ;
  sh:result    [ a                             sh:ValidationResult ;
                 sh:focusNode                  d:e2 ;
                 sh:resultMessage              &amp;quot;Less than 1 values&amp;quot; ;
                 sh:resultPath                 hr:jobGrade ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:MinCountConstraintComponent ;
                 sh:sourceShape                hr:jobGradeShape
               ]
] .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As I mentioned last month, returning these validation reports as triples makes it easier to plug the process into a larger automated workflow, and here we see that when constraints are violated, the triples include information to incorporate into that larger workflow&amp;ndash;for example, to build a message to display in a pop-up message box. You could also query accumulated validation reports with SPARQL to identify patterns in which kinds of violations happened and how often.&lt;/p&gt;
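&lt;p&gt;As a rough sketch of that last idea (assuming that the accumulated validation reports have all been loaded into one graph), a query along these lines would total up the violations by constraint component and property:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX sh: &amp;lt;http://www.w3.org/ns/shacl#&amp;gt;

SELECT ?component ?path (COUNT(?result) AS ?violationCount)
WHERE {
  ?result a sh:ValidationResult ;
          sh:sourceConstraintComponent ?component ;
          sh:resultPath ?path .
}
GROUP BY ?component ?path
ORDER BY DESC(?violationCount)
&lt;/code&gt;&lt;/pre&gt;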
&lt;p&gt;The third employee tests the SHACL validator&amp;rsquo;s ability to detect data type violations, because the &lt;code&gt;hr:jobGrade&lt;/code&gt; value is not an integer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  d:e3
   a hr:Employee;
   hr:name &amp;quot;Stig O&#39;Hara&amp;quot; ;
   hr:hireDate &amp;quot;2017-03-14&amp;quot; ;
   hr:jobGrade 3.14 .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;shaclvalidate does just fine with that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d:     &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix sh:    &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix hr:    &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .


[ a            sh:ValidationReport ;
  sh:conforms  false ;
  sh:result    [ a                             sh:ValidationResult ;
                 sh:focusNode                  d:e3 ;
                 sh:resultMessage              &amp;quot;Value does not have datatype xsd:integer&amp;quot; ;
                 sh:resultPath                 hr:jobGrade ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:DatatypeConstraintComponent ;
                 sh:sourceShape                hr:jobGradeShape ;
                 sh:value                      3.14
               ]
] .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last employee instance tests the SHACL validator&amp;rsquo;s ability to detect a value that falls outside of a specified range, because &lt;code&gt;hr:jobGrade&lt;/code&gt; is greater than 7:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;d:e4
   a hr:Employee;
   hr:name &amp;quot;Dirk McQuickly&amp;quot; ;
   hr:hireDate &amp;quot;2017-01-08&amp;quot; ;
   hr:jobGrade 8 .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This isn&amp;rsquo;t a problem either:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d:     &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix sh:    &amp;lt;http://www.w3.org/ns/shacl#&amp;gt; .
@prefix hr:    &amp;lt;http://learningsparql.com/ns/humanResources#&amp;gt; .


[ a            sh:ValidationReport ;
  sh:conforms  false ;
  sh:result    [ a                             sh:ValidationResult ;
                 sh:focusNode                  d:e4 ;
                 sh:resultMessage              &amp;quot;Value is not &amp;lt;= 7&amp;quot; ;
                 sh:resultPath                 hr:jobGrade ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:MaxInclusiveConstraintComponent ;
                 sh:sourceShape                hr:jobGradeShape ;
                 sh:value                      8
               ]
] .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I deliberately picked simple examples to see how difficult they would be to implement, and as with many powerful software systems, my only problem was navigating the detailed documentation of the architecture and many features to find the parts that I wanted.&lt;/p&gt;
&lt;p&gt;What other built-in constraints are available besides &lt;code&gt;sh:datatype&lt;/code&gt;, &lt;code&gt;sh:minCount&lt;/code&gt;, &lt;code&gt;sh:maxCount&lt;/code&gt;, &lt;code&gt;sh:minInclusive&lt;/code&gt;, and &lt;code&gt;sh:maxInclusive&lt;/code&gt;? See for yourself in section 4 of the spec: &lt;a href=&#34;https://www.w3.org/TR/shacl/#core-components&#34;&gt;Core Constraint Components&lt;/a&gt;. (For a nice quick skim of the available constraints, just look through that section&amp;rsquo;s entries in the spec&amp;rsquo;s &lt;a href=&#34;https://www.w3.org/TR/shacl/#table-of-contents&#34;&gt;table of contents&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve done much work with RDF, you&amp;rsquo;re going to enjoy this.&lt;/p&gt;
&lt;p&gt;1912 farm and garden supply catalog image courtesy of &lt;a href=&#34;https://www.flickr.com/photos/internetarchivebookimages/16045403204/&#34;&gt;flickr&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>The W3C standard constraint language for RDF: SHACL</title>
      <link>https://www.bobdc.com/blog/the-w3c-standard-constraint-la/</link>
      <pubDate>Sun, 30 Jul 2017 10:46:06 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-w3c-standard-constraint-la/</guid>
      
      
      <description><div>A brief history of the new standard and some toys to play with it.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.slideshare.net/cygri/shacl-shaping-the-big-ball-of-data-mud&#34;&gt;&lt;img id=&#34;idm140501143665392&#34; src=&#34;https://www.bobdc.com/img/main/rcshacl.jpg&#34; width=&#34;360&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Richard Cyganiak SHACL slide&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Many people have complained about how the Web Ontology Language, or &lt;a href=&#34;https://www.w3.org/OWL/&#34;&gt;OWL&lt;/a&gt;, wasn&amp;rsquo;t a very good constraint language for RDF data. They didn&amp;rsquo;t realize that it wasn&amp;rsquo;t designed to be a constraint language, in which you define the structure of a dataset as a guide to applications so that these applications know what to expect. OWL was designed to do other things, and we finally have the W3C standard RDF constraint language we&amp;rsquo;ve been waiting for, but before we discuss it, a little history puts it in better context.&lt;/p&gt;
&lt;p&gt;For nearly all computer applications ever, there has been some ability to define what should be in the data, such as the columns of a relational table and their data types, the elements and attributes of a set of XML documents, or the classes of data that an object-oriented program is working with and their attributes. Data was usually not even added to these data sets until it conformed to the descriptions. These data definitions are known as prescriptive schemas, but OWL&amp;rsquo;s goal was to provide descriptive schemas: metadata about existing data sets, typically from the web, so that you could infer new knowledge about the resources that you found. (When I mention OWL, assume that I&amp;rsquo;m including its base layer &lt;a href=&#34;https://www.w3.org/TR/rdf-schema/&#34;&gt;RDFS&lt;/a&gt; as well.)&lt;/p&gt;
&lt;p&gt;When building large RDF applications, though, prescriptive schemas can provide some benefits, and while OWL can do a bit of this, it can&amp;rsquo;t do it very well. And, the OWL tools that can check whether constraints have been violated are fairly big and heavy because of all of their additional inferencing capabilities. So people complained. (For a good overview of the cool things that OWL is currently being used for, see &lt;a href=&#34;http://videolectures.net/eswc2016_hendler_wither_OWL/&#34;&gt;Jim Hendler&amp;rsquo;s 2016 ESWC talk&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;At my former employer &lt;a href=&#34;http://www.topquadrant.com&#34;&gt;TopQuadrant&lt;/a&gt;, principal engineer Holger Knublauch developed a triples-based constraint language for RDF called &lt;a href=&#34;http://spinrdf.org/&#34;&gt;SPIN&lt;/a&gt;, for &amp;ldquo;SPARQL Inferencing Notation.&amp;rdquo; It took advantage of SPARQL&amp;rsquo;s ability to define constraints&amp;ndash;basically, you would query for things you didn&amp;rsquo;t want to see in the data, like an &lt;code&gt;invoice&lt;/code&gt; instance with no &lt;code&gt;approvedBy&lt;/code&gt; value, and if you found any, you knew where the constraint was violated. SPIN provided a structure for storing these queries as metadata about a dataset, and it was very useful in TopQuadrant&amp;rsquo;s customer work. I wrote about it here in &lt;a href=&#34;https://www.bobdc.com/blog/a-rules-language-for-rdf&#34;&gt;2009&lt;/a&gt; and in &lt;a href=&#34;http://www.snee.com/bobdc.blog/2010/03/is-spin-the-schematron-of-rdf.html&#34;&gt;2010&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was so useful that eventually some of these customers, as well as TopQuadrant and some TopQuadrant colleagues at other companies, started a W3C working group to develop a new constraint language that built on the ideas of SPIN: the Shapes Constraint Language, or &lt;a href=&#34;https://www.w3.org/TR/shacl/&#34;&gt;SHACL&lt;/a&gt;. (Get it? &amp;ldquo;Shackle&amp;rdquo;? Constraints?) SHACL &lt;a href=&#34;https://www.w3.org/blog/news/archives/6421&#34;&gt;is now a Recommendation&lt;/a&gt;: an official W3C standard just like HTML, XML, CSS, and the RDF standards.&lt;/p&gt;
&lt;p&gt;The TopQuadrant page &lt;a href=&#34;https://www.topquadrant.com/shacl-features-and-specifications/&#34;&gt;An Overview of SHACL Features and Specifications&lt;/a&gt; gives a nice overview of the components of SHACL and their relationships, and I will be digging deeper into that in the coming weeks.&lt;/p&gt;
&lt;p&gt;Some more great recent SHACL news is the &lt;a href=&#34;https://lists.w3.org/Archives/Public/public-rdf-shapes/2017Jul/0002.html&#34;&gt;availability&lt;/a&gt; of an API with command line tools to try SHACL out. I&amp;rsquo;ve been playing with the shaclvalidate.sh shell script tool (a Windows batch file is also included), which reads a file of triples that include constraints and instance data and then lists any violated constraints. A form-based &lt;a href=&#34;http://shacl.org/playground/&#34;&gt;SHACL playground&lt;/a&gt; is also available; I took the sample constraints and the Turtle version of the sample data available on that page, combined them into a single file, and fed that file to shaclvalidate.sh. The validation report that it created pointed out that the &lt;code&gt;schema:Person&lt;/code&gt; instance&amp;rsquo;s death date was earlier than its birth date, thereby violating one of the defined constraints. These reports are themselves sets of triples, making it easier to plug this validation process into a larger workflow.&lt;/p&gt;
&lt;p&gt;The open source SHACL API that Holger created is &lt;a href=&#34;https://github.com/TopQuadrant/shacl&#34;&gt;available on github&lt;/a&gt;. The week that he released the command line tools I was actually trying to code up a SHACL command line validator myself around the API (with much kind help to my atrophied Java skills from Andy Seaborne), so I was very glad to see Holger release something that saved me from further Java coding.&lt;/p&gt;
&lt;p&gt;Holger&amp;rsquo;s API includes many &lt;a href=&#34;https://github.com/TopQuadrant/shacl/tree/master/src/test/resources/sh/tests&#34;&gt;test cases&lt;/a&gt; that I know will teach me a lot about SHACL&amp;rsquo;s capabilities. For example, &lt;a href=&#34;https://github.com/TopQuadrant/shacl/blob/master/src/test/resources/sh/tests/sparql/component/propertyValidator-select-001.test.ttl&#34;&gt;one test&lt;/a&gt; demonstrates the ability to define a constraint with a SPARQL query, one of the original inspirations for this constraint language, and I have already successfully run this test with the validation shell script. SPARQL-based constraints are less necessary in SHACL than you might think, because the core of SHACL is a vocabulary to define many common constraint conditions, but it&amp;rsquo;s still great to see them, because they add so much flexibility to the constraints that you can define&amp;ndash;for example, you could specify that an &lt;code&gt;approvedBy&lt;/code&gt; value is only required if &lt;code&gt;invoiceAmount&lt;/code&gt; is greater than a certain value.&lt;/p&gt;
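&lt;p&gt;As a rough sketch of what that last example might look like (the &lt;code&gt;ex:&lt;/code&gt; invoice properties are made up, and I&amp;rsquo;ve left out the &lt;code&gt;sh:prefixes&lt;/code&gt; declaration that the spec requires for prefixes used inside the query string), a SPARQL-based constraint flags each focus node that its &lt;code&gt;SELECT&lt;/code&gt; query returns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ex:InvoiceShape
   a sh:NodeShape ;
   sh:targetClass ex:Invoice ;
   sh:sparql [
      a sh:SPARQLConstraint ;
      sh:message &amp;quot;Invoices over 1000 need an approvedBy value&amp;quot; ;
      sh:select &amp;quot;&amp;quot;&amp;quot;
         SELECT $this
         WHERE {
           $this ex:invoiceAmount ?amount .
           FILTER (?amount &amp;gt; 1000)
           FILTER NOT EXISTS { $this ex:approvedBy ?approver }
         }
      &amp;quot;&amp;quot;&amp;quot; ;
   ] .
&lt;/code&gt;&lt;/pre&gt;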
&lt;p&gt;I&amp;rsquo;m looking forward to playing more with SHACL. A good next step for anyone interested is to review the slides titled &lt;a href=&#34;https://www.slideshare.net/cygri/shacl-shaping-the-big-ball-of-data-mud&#34;&gt;Shaping the Big Ball of Data Mud: W3C&amp;rsquo;s Shapes Constraint Language (SHACL)&lt;/a&gt; that TopQuadrant&amp;rsquo;s Richard Cyganiak gave to the Lotico Berlin Semantic Web meetup last November. I&amp;rsquo;ve copied his excellent conclusion slide above.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Creating Wide CSV files with SPARQL</title>
      <link>https://www.bobdc.com/blog/creating-wide-csv-files-with-s/</link>
      <pubDate>Sun, 25 Jun 2017 09:47:13 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/creating-wide-csv-files-with-s/</guid>
      
      
      <description><div>Lots of columns and commas, but all in the right place.</div><div>&lt;blockquote id=&#34;idm139864526871584&#34; class=&#34;pullquote&#34;&gt;I was a bit proud that I came up with this simple way to make sure all the values came out in the right places in this fairly complicated target output.&lt;/blockquote&gt;
&lt;p&gt;I recently decided to copy my address book, which I have in an RDF file, to Google Contacts. The basic steps are pretty straightforward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;In Google Contacts, create an entry with test data in every field: TestGiveName, TestFamilyName, &lt;a href=&#34;mailto:testemail@whatever.com&#34;&gt;testemail@whatever.com&lt;/a&gt;, and so forth.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Export the contacts as a &lt;a href=&#34;https://en.wikipedia.org/wiki/Comma-separated_values&#34;&gt;CSV&lt;/a&gt; file. The currently default &amp;ldquo;preview&amp;rdquo; version of Google Contacts doesn&amp;rsquo;t allow this yet, but you can &amp;ldquo;go to old version&amp;rdquo; and then find &lt;strong&gt;Export&lt;/strong&gt; on the &lt;strong&gt;More&lt;/strong&gt; drop-down menu.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In the exported CSV, move the test entry created in step 1 to the second line, just under the field names.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Using the field names and test entry as a guide, write a SPARQL query that returns the relevant information from the RDF address book file in the order shown in the exported file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Execute the query, requesting CSV output.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Replace the query output&amp;rsquo;s header row with the header row from the original exported file and then import the result into Google contacts.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Step 4 seemed a bit intimidating. With something like 88 columns in step 2&amp;rsquo;s exported CSV, I knew that messing up one comma (for example, putting the 47th piece of information after the 47th comma instead of before it) would mess up all the information after it. I have made plenty of mistakes like this when creating wide-body CSV before.&lt;/p&gt;
&lt;p&gt;I had a great idea, though, that made it much simpler: I created the SELECT statement from the first line of the exported CSV. I copied that line to a text editor, replaced the spaces in the field names with underscores, removed the hyphens (not allowed in SPARQL variable names), and then replaced each comma with a space and a question mark to turn the name after it into a variable name. Finally, I manually added a question mark to the very first name (the global replace in the previous step didn&amp;rsquo;t do that because there was no comma there) and added the word SELECT before it, and I had the SELECT statement that my query needed.&lt;/p&gt;
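&lt;p&gt;For example (with a few made-up field names standing in for Google&amp;rsquo;s), a header row that begins like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Name,Given Name,Additional Name,E-mail 1 - Value
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;would turn into this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?Name ?Given_Name ?Additional_Name ?Email_1__Value
&lt;/code&gt;&lt;/pre&gt;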
&lt;p&gt;This way, before I&amp;rsquo;d even begun implementing the logic to pull each piece of data out of the address book RDF, I knew that when I did they would come out in the right places.&lt;/p&gt;
&lt;p&gt;Adding two bits of that logic to a WHERE clause gave me this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  v: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt;


SELECT ?Name ?Given_Name ?Additional_Name ?Family_Name ?Yomi_Name
       ?Given_Name_Yomi ?Additional_Name_Yomi ?Family_Name_Yomi ?Name_Prefix
       ?Name_Suffix ?Initials ?Nickname ?Short_Name ?Maiden_Name ?Birthday
        # 21 more lines of variable names
       ?Custom_Field_2__Value ?Custom_Field_3__Type ?Custom_Field_3__Value
WHERE {
          ?entry v:family-name ?Family_Name . 
          ?entry v:given-name  ?Given_Name .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When I ran &lt;a href=&#34;https://jena.apache.org/documentation/query/&#34;&gt;arq&lt;/a&gt; with this command,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;arq --query addrbook2csv.rq --data addrbook.rdf --results=CSV
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;it gave me CSV output with the &lt;code&gt;?Family_Name&lt;/code&gt; and &lt;code&gt;?Given_Name&lt;/code&gt; values right where they needed to be for Google Contacts to import them properly.&lt;/p&gt;
&lt;p&gt;I wish I could say that the rest of the query development was just a matter of adding triple patterns like the &lt;code&gt;?Family_Name&lt;/code&gt; and &lt;code&gt;?Given_Name&lt;/code&gt; ones shown above, but it got more complicated because of the ad hoc structure of my address book data. I needed a UNION, lots of OPTIONAL blocks, and even some nested OPTIONAL blocks that I&amp;rsquo;m not proud of. Still, I was a bit proud that I came up with this simple way to make sure that all the values came out in the right places in this fairly complicated target output.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Instead of writing SPARQL queries for Wikipedia--query for them!</title>
      <link>https://www.bobdc.com/blog/instead-of-writing-sparql-quer/</link>
      <pubDate>Mon, 29 May 2017 10:11:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/instead-of-writing-sparql-quer/</guid>
      
      
      <description><div>Queries as data to help you get at more data.</div><div>&lt;p&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/Category:Portraits_with_fruits&#34;&gt;&lt;img id=&#34;idm139674224380352&#34; height=&#34;140&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;10px&#34; src=&#34;https://www.bobdc.com/img/main/352px-Portrait_of_Cornelis_Cornelisz_Schellinger_%281551-1635%29.jpg&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say, hypothetically, that you want to execute a SPARQL query that lists all of Wikimedia&amp;rsquo;s portraits with fruit. Wikimedia does have a &lt;a href=&#34;https://commons.wikimedia.org/wiki/Category:Portraits_with_fruits&#34;&gt;category&lt;/a&gt; for this, so what would be the quickest way to come up with the query?&lt;/p&gt;
&lt;p&gt;If you click the &lt;a href=&#34;https://www.wikidata.org/wiki/Q29789760&#34;&gt;Wikidata item&lt;/a&gt; link on this category&amp;rsquo;s page, you&amp;rsquo;ll see all the data about it that you can retrieve with a SPARQL query to the Wikidata endpoint, as I&amp;rsquo;ve described in my last few blog entries. The cool thing for this particular resource is that one property is called &lt;a href=&#34;https://www.wikidata.org/wiki/Property:P3921&#34;&gt;Wikidata SPARQL query equivalent&lt;/a&gt;, and its value is the query that will retrieve a list of the portraits with fruit. In other words, Wikidata has a triple that looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;subject&lt;/strong&gt;:     &lt;code&gt;wd:Q29789760&lt;/code&gt; (the Wikidata category &amp;ldquo;portraits with fruit&amp;rdquo;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;predicate&lt;/strong&gt;:   &lt;code&gt;p:P3921&lt;/code&gt; (&amp;ldquo;Wikidata SPARQL query equivalent&amp;rdquo;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;object&lt;/strong&gt;:      &lt;code&gt;SELECT DISTINCT ?item WHERE { ?item wdt:P31/wdt:P279* wd:Q838948 . ?item wdt:P136/wdt:P31?/wdt:P279* wd:Q134307 . ?item wdt:P180/wdt:P31?/wdt:P279* wd:Q3314483 . }&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Paste that object value into the &lt;a href=&#34;https://query.wikidata.org/&#34;&gt;Wikidata query service&lt;/a&gt;, and you can run it to get a list of the portraits.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/Category:Portraits_with_fruits&#34;&gt;&lt;img id=&#34;idm139674224368480&#34; height=&#34;140&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;10px&#34; src=&#34;https://www.bobdc.com/img/main/Aase_Bye_1932.jpg&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That may seem like a lot of trouble to get this list, but that&amp;rsquo;s not really the point. This query gives you a head start in developing more sophisticated queries on the topic.&lt;/p&gt;
&lt;p&gt;When I wondered how many Wikimedia resources used this predicate, I found that the ones using it were easier to understand if they also had an rdfs:label value. So, I entered this query to count the subjects that had both:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT (count(*) as ?count) WHERE { 
  ?s wdt:P3921 ?o ;
     rdfs:label ?label .
  }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/Category:Portraits_with_fruits&#34;&gt;&lt;img id=&#34;idm139674224364448&#34; height=&#34;140&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;10px&#34; src=&#34;https://www.bobdc.com/img/main/376px-Felix_Esterl_-_Frau_des_Kunstlers_mit_Fruchtteller_-_1925.jpeg&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Two weeks ago there were 316, but as I write this there are almost a hundred more, so the number is growing at a good pace.&lt;/p&gt;
&lt;p&gt;The idea of a SPARQL query as an object in an RDF triple is not new. It&amp;rsquo;s part of the &lt;a href=&#34;https://www.w3.org/TR/shacl/&#34;&gt;Shapes Constraint Language&lt;/a&gt; (SHACL), as demonstrated by &lt;a href=&#34;https://github.com/w3c/data-shapes/blob/gh-pages/data-shapes-test-suite/tests/sparql/node/sparql-001.ttl&#34;&gt;one of its test cases&lt;/a&gt;. SHACL is a W3C specification that lets you specify constraints on data&amp;ndash;for example, to validate that certain properties are required for instances of a particular class and that others are optional. (This is a lot more difficult using OWL.) I&amp;rsquo;ll be looking at SHACL more closely in the coming months; meanwhile, I&amp;rsquo;ll be keeping an eye on the SPARQL queries being added to Wikidata where we can retrieve them with our own SPARQL queries.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/Category:Portraits_with_fruits&#34;&gt;&lt;img id=&#34;idm139674224358768&#34; height=&#34;140&#34; border=&#34;0&#34; vspace=&#34;10px&#34; src=&#34;https://www.bobdc.com/img/main/Dido_Elizabeth_Belle.jpg&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34;/&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>The Wikidata data model and your SPARQL queries</title>
      <link>https://www.bobdc.com/blog/the-wikidata-data-model-and-yo/</link>
      <pubDate>Sun, 23 Apr 2017 09:43:01 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-wikidata-data-model-and-yo/</guid>
      
      
      <description><div>Reference works to get you taking advantage of the fancy parts quickly.</div><div>&lt;blockquote id=&#34;idm140623751257040&#34; class=&#34;pullquote&#34;&gt;RDF standards were used to describe the Wikibase model that was developed independently of W3C standards.&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/wikidatas-excellent-sample-spa&#34;&gt;Last month&lt;/a&gt; I promised that I would dig further into the Wikidata data model, its mapping to RDF, and how we can take advantage of this with SPARQL queries. I had been trying to understand the structure of the data based on the RDF classes and properties I saw and the documentation that I could find, and some of the vocabulary discussing these issues confused me&amp;ndash;for example, RDF is about describing resources, but I was seeing lots of references to entities, which can mean slightly different things in different branches of computer science. But, as Daniel Kinzler &lt;a href=&#34;https://lists.wikimedia.org/pipermail/wikidata/2017-March/010418.html&#34;&gt;explained&lt;/a&gt; to me, &amp;ldquo;The Wikidata (or technically, Wikibase) data model is not defined in terms of RDF&amp;rdquo;; RDF standards were used to describe the Wikibase model that was developed independently of W3C standards.&lt;/p&gt;
&lt;p&gt;Wikibase, as described by its &lt;a href=&#34;http://wikiba.se/&#34;&gt;home page&lt;/a&gt;, &amp;ldquo;is a collection of applications and libraries for creating, managing and sharing structured data&amp;hellip;Wikibase was developed for and is used by Wikidata, the free knowledge base and Wikipedia, the encyclopedia that anyone can edit.&amp;rdquo; The same page describes Wikidata as one of the &amp;ldquo;projects powered by Wikibase&amp;rdquo;, along with the &lt;a href=&#34;http://www.eagle-network.eu/wiki/index.php/Main_Page&#34;&gt;europeana eagle project&lt;/a&gt; and &lt;a href=&#34;https://data.droidwiki.org/wiki/Hauptseite&#34;&gt;Droid wiki&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/DataModel&#34;&gt;Wikibase/DataModel&lt;/a&gt; document is fairly long and detailed, and I would suggest starting instead with the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer&#34;&gt;Wikibase/DataModel/Primer&lt;/a&gt;. The Primer describes how &amp;ldquo;Entities are the basic elements of the knowledge base&amp;rdquo; and how &amp;ldquo;there are two predefined kinds of Entities: Items and Properties&amp;rdquo; (both of which RDF people consider to be resources). The document goes on to describe the information that can be associated with items and properties.&lt;/p&gt;
&lt;p&gt;I had originally found their &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format&#34;&gt;RDF Dump Format&lt;/a&gt; document abstruse and confusing, but it was easier to follow after I read the Wikibase data model primer because I had a better idea of the dump format&amp;rsquo;s basis. It&amp;rsquo;s even easier to follow if you just skim the Dump Format document to get a general idea of what it covers and then go to the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual&#34;&gt;Wikidata query service/User Manual&lt;/a&gt;, where you&amp;rsquo;ll get an even faster start querying Wikidata. (Their &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples&#34;&gt;sample queries&lt;/a&gt; that I &lt;a href=&#34;https://www.bobdc.com/blog/wikidatas-excellent-sample-spa&#34;&gt;described last month&lt;/a&gt; also help a lot.) The User Manual describes the declared prefixes, some nice tricks for taking advantage of different kinds of labels, how to work with geo data, available endpoints that you can federate into your queries, and more. It also provides more context for understanding the Dump Format document.&lt;/p&gt;
&lt;p&gt;The Data Model document &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer#Statements&#34;&gt;describes&lt;/a&gt; the fundamental role of &lt;em&gt;statements&lt;/em&gt; in the Wikibase data model. (Longstanding members of the RDF community will enjoy Kingsley Idehen&amp;rsquo;s &lt;a href=&#34;https://lists.wikimedia.org/pipermail/wikidata/2017-March/thread.html#10456&#34;&gt;continuation&lt;/a&gt; of my thread with Daniel, in which Kingsley insists that Wikidata is a collection of reified RDF statements, and Daniel says that, well, no, not really. They eventually agree to disagree.) The RDF Dump Format document describes two &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Statement_types&#34;&gt;statement types&lt;/a&gt; that are important to how we treat Wikidata as an RDF repository but are also potentially very confusing. The first type is known as a &lt;a href=&#34;https://en.wikipedia.org/wiki/Truthiness&#34;&gt;truthy&lt;/a&gt; statement, or &amp;ldquo;direct claim&amp;rdquo;; these are simple triples that assert facts. The other statement type is the full statement, which is used to &amp;ldquo;represent all data about the statement in the system&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;As one way to quickly recognize the difference, Wikimedia usually uses specific namespaces in specific places in both truthy and full statements. For example, the namespace &lt;a href=&#34;http://www.wikidata.org/prop/direct/&#34;&gt;http://www.wikidata.org/prop/direct/&lt;/a&gt;, which is abbreviated using the prefix &lt;code&gt;wdt:&lt;/code&gt;, is usually used for the predicate of a truthy statement. (The Dump format document has a nice list of all of these in the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Predicates&#34;&gt;Predicates&lt;/a&gt; section. As you work with this data, you&amp;rsquo;ll often go back to the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Prefixes_used&#34;&gt;Prefixes used&lt;/a&gt; section of the RDF Dump Format and also the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Full_list_of_prefixes&#34;&gt;Full list of prefixes&lt;/a&gt; section that follows it.)&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s an example of the two kinds of statements that Daniel provided me: the triple {&lt;code&gt;wd:Q64 wdt:P1376 wd:Q183&lt;/code&gt;} is a truthy triple saying that Berlin is the capital of Germany. Here is the full version of that statement:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wds:Q64-43CCD3D6-F52E-4742-B0E3-BCA671B69D2C a wikibase:Statement,
                 wikibase:BestRank ;
   wikibase:rank wikibase:PreferredRank ;
   ps:P1376 wd:Q183 ;
   prov:wasDerivedFrom wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 .

wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 a wikibase:Reference ;
   pr:P248 wd:Q451546 ;
   pr:P958 &amp;quot;Artikel 2 (1)&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To understand this better, I wanted to see this for a different statement: the fact that bebop musician Tommy Potter played the bass. First, I clicked the &amp;ldquo;Wikidata item&amp;rdquo; link on Potter&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Tommy_Potter&#34;&gt;Wikipedia page&lt;/a&gt; and substituted /entity/ for /wiki/, as I described in my February blog entry &lt;a href=&#34;https://www.bobdc.com/blog/getting-to-know-wikidata&#34;&gt;Getting to know Wikidata&lt;/a&gt;, to get the URI that represents him: &lt;a href=&#34;http://www.wikidata.org/entity/Q1369941&#34;&gt;http://www.wikidata.org/entity/Q1369941&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, finding the triple about the instrument that he played wasn&amp;rsquo;t as simple as you might think. A &lt;a href=&#34;https://query.wikidata.org/embed.html#SELECT%20*%20WHERE%20%7bwd%3aQ1369941%20%3fp%20%3fo%7d&#34;&gt;query for {wd:Q1369941 ?p ?o}&lt;/a&gt; (using the prefix substitution for brevity) retrieves all the triples about him, but they&amp;rsquo;re the &amp;ldquo;truthy&amp;rdquo; ones, in which the predicates are known as direct claim predicates. Three of these triples described him as a Jazzbassist, a contrebassiste de jazz, and a contrabbassista statunitense, but none listed the &amp;ldquo;bass&amp;rdquo; as the instrument that he played in any language. Querying the direct claim predicates themselves&amp;ndash;for example, to see whether they have &lt;code&gt;rdfs:label&lt;/code&gt; values in different languages&amp;ndash;showed very little information. It turned out that each direct claim predicate appears as the &lt;em&gt;object&lt;/em&gt; of a triple whose predicate is &lt;code&gt;wikibase:directClaim&lt;/code&gt; and whose subject is the actual Wikidata data model property. When I queried for triples that had these data model properties as subjects, I found plenty of information about them.&lt;/p&gt;
&lt;p&gt;To put these relationships to use, I entered the following query to find out more about Tommy Potter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?pname ?o ?olabel WHERE 
{
  wd:Q1369941 ?directClaimP ?o .          # Get the truthy triples.
  ?p wikibase:directClaim ?directClaimP . # Find the Wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples&#39; predicates
  FILTER ( lang(?pname) = &amp;quot;en&amp;quot; )          # and their labels, in English.
  OPTIONAL {
     ?o rdfs:label ?olabel  
     FILTER ( lang(?olabel) = &amp;quot;en&amp;quot; )
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href=&#34;https://query.wikidata.org/embed.html#SELECT%20%3Fpname%20%3Fo%20%3Folabel%20WHERE%20%0A%7B%0A%20%20wd%3AQ1369941%20%3FdirectClaimP%20%3Fo%20.%0A%20%20%3Fp%20wikibase%3AdirectClaim%20%3FdirectClaimP%20.%0A%20%20%3Fp%20rdfs%3Alabel%20%3Fpname%20.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%20%20FILTER%20%28%20lang%28%3Fpname%29%20%3D%20%22en%22%20%29%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%20%3Fo%20rdfs%3Alabel%20%3Folabel%20%20%0A%20%20%20%20%20FILTER%20%28%20lang%28%3Folabel%29%20%3D%20%22en%22%20%29%0A%20%20%7D%0A%7D&#34;&gt;result of this query&lt;/a&gt; is a mostly human-readable statement of facts about him. You could substitute the URI for just about any Wikidata entity as the subject in that first triple pattern to see information about that entity. You could also view the property names in other languages besides English, which is a big advantage of the Wikibase data model.&lt;/p&gt;
&lt;p&gt;If you send your browser to the &lt;a href=&#34;http://www.wikidata.org/entity/Q1369941&#34;&gt;http://www.wikidata.org/entity/Q1369941&lt;/a&gt; URI that represents Potter, you will get redirected to a Wikidata page with a nicely formatted human-readable version of data about Potter at &lt;a href=&#34;https://www.wikidata.org/wiki/Q1369941&#34;&gt;https://www.wikidata.org/wiki/Q1369941&lt;/a&gt;. On the other hand, if you add .ttl (or .nt or .rdf) to the end of the /entity/ version of the URI, you&amp;rsquo;ll get RDF of all the data about Potter, including the full representations with triples that include predicates such as &lt;code&gt;wikibase:BestRank&lt;/code&gt; and &lt;code&gt;prov:wasDerivedFrom&lt;/code&gt;, just like the full version of the data above about Berlin being the capital of Germany.&lt;/p&gt;
&lt;p&gt;After looking at the full data about Potter, some queries to find out more about it often found less than what I expected. I eventually learned from the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences&#34;&gt;WDQS data differences&lt;/a&gt; section of the RDF Dump Format document that &amp;ldquo;Data nodes (&lt;code&gt;wdata:Q2&lt;/code&gt;) are not stored&amp;hellip; This is done for performance reasons.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;After all this exploration, I still haven&amp;rsquo;t gotten to the kinds of structural queries I&amp;rsquo;ve been planning on&amp;ndash;for example, looking for instances based on their class&amp;rsquo;s relationship(s) to other classes. The Stack Exchange question &lt;a href=&#34;https://opendata.stackexchange.com/questions/9591/how-to-include-sub-classes-in-a-wikidata-sparql-query-example-when-querying&#34;&gt;How to include sub-classes in a Wikidata SPARQL query?&lt;/a&gt;, which has a solid answer, looks pretty inspirational. I&amp;rsquo;m looking forward to playing with it.&lt;/p&gt;
&lt;p&gt;Meanwhile, as you use SPARQL to play with Wikidata, you&amp;rsquo;re going to see a lot of cryptic resource names, like &lt;code&gt;wdt:P279&lt;/code&gt; in the Stack Exchange answer, and you&amp;rsquo;ll wonder what their human-readable name is. I created the form below to help me with the prefixes I used the most. You can use this form yourself (for example, enter P279 in the &lt;code&gt;wdt:&lt;/code&gt; field and press Enter), but you&amp;rsquo;d probably be best off copying it from this page&amp;rsquo;s source into your own page that you can customize.&lt;/p&gt;
&lt;p&gt;It turns out that &lt;code&gt;wdt:P279&lt;/code&gt; means &amp;ldquo;subclass of&amp;rdquo;. This is something I&amp;rsquo;ll certainly be getting to know better in the future.&lt;/p&gt;
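For the curious, the lookup forms below just concatenate the local name that you enter onto the Wikidata entity namespace (all three forms use that namespace). A rough Python equivalent of their JavaScript:

```python
# Rough equivalent of the lookup forms below: concatenate the local name
# onto the Wikidata entity namespace to get a URL you can browse to.
ENTITY_NS = "http://www.wikidata.org/entity/"

def lookup_url(local_name):
    return ENTITY_NS + local_name

print(lookup_url("P279"))  # http://www.wikidata.org/entity/P279
```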
&lt;hr /&gt;
&lt;p&gt;&lt;code&gt;wd:&lt;/code&gt;    &lt;form id=&#34;idm140623751218144&#34; name=&#34;wdform&#34; action=&#34;javascript:window.location.href = &#39;http://www.wikidata.org/entity/&#39;.concat(document.wdform[&#39;localName&#39;].value)&#34;&gt; &lt;input id=&#34;idm140623751217200&#34; type=&#34;text&#34; name=&#34;localName&#34;/&gt; &lt;/form&gt;
&lt;code&gt;wdt:&lt;/code&gt;   &lt;form id=&#34;idm140623751215680&#34; name=&#34;wdtform&#34; action=&#34;javascript:window.location.href = &#39;http://www.wikidata.org/entity/&#39;.concat(document.wdtform[&#39;localName&#39;].value)&#34;&gt; &lt;input id=&#34;idm140623751214784&#34; type=&#34;text&#34; name=&#34;localName&#34;/&gt; &lt;/form&gt;
&lt;code&gt;p:&lt;/code&gt;     &lt;form id=&#34;idm140623751213296&#34; name=&#34;pform&#34; action=&#34;javascript:window.location.href = &#39;http://www.wikidata.org/entity/&#39;.concat(document.pform[&#39;localName&#39;].value)&#34;&gt; &lt;input id=&#34;idm140623751212400&#34; type=&#34;text&#34; name=&#34;localName&#34;/&gt; &lt;/form&gt;&lt;/p&gt;
&lt;hr /&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>Wikidata&#39;s excellent sample SPARQL queries</title>
      <link>https://www.bobdc.com/blog/wikidatas-excellent-sample-spa/</link>
      <pubDate>Sun, 26 Mar 2017 12:40:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/wikidatas-excellent-sample-spa/</guid>
      
      
      <description><div>Learning about the data, its structure, and more.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Children_of_Genghis_Khan&#34;&gt;&lt;img id=&#34;idm139673944676400&#34; width=&#34;300&#34; src=&#34;https://www.bobdc.com/img/main/wikidatagkhan.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;part of Khan and descendants graph&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/getting-to-know-wikidata&#34;&gt;Last month&lt;/a&gt; I finally got to know &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt; more and saw that it has a lot of great stuff to explore. I&amp;rsquo;ve continued to explore the data and its model using two strategies: exploring the ontology built around the data and playing with the sample queries.&lt;/p&gt;
&lt;p&gt;Exploring the ontology takes some work. I&amp;rsquo;ll describe the resources available for this (and the ontology!) in greater detail when I have a better handle on it all. For sample queries, I have my own queries that I use to explore a dataset, as I described in the &amp;ldquo;Exploring the Data&amp;rdquo; section of the &lt;a href=&#34;http://learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; chapter &amp;ldquo;A SPARQL Cookbook&amp;rdquo;, but the wise people behind Wikidata have done much better than this by giving us a page of &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples&#34;&gt;sample queries&lt;/a&gt; that highlight some of the data and syntax available.&lt;/p&gt;
&lt;p&gt;The sample queries range from simple to complex, and each has a &amp;ldquo;Try it!&amp;rdquo; link that loads the query into the &lt;a href=&#34;https://query.wikidata.org/&#34;&gt;query form&lt;/a&gt;. (Before you get too far into the list of queries, note that the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format&#34;&gt;RDF Dump Format&lt;/a&gt; documentation page, which I will describe more next time, has a &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Full_list_of_prefixes&#34;&gt;list of the URIs represented by the prefixes in the queries&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Here are some that I particularly liked after my brief tour:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The second example query, for data about &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Horses&#34;&gt;Horses&lt;/a&gt;, is a good example of the excellent commenting that you will find in many of the sample queries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Recent_Events&#34;&gt;Recent Events&lt;/a&gt; query nicely demonstrates how Wikidata models time and how a query can use that to identify events within a particular time window&amp;ndash;in the case of this sample query, between 0 and 31 days ago.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Popular_eye_colors&#34;&gt;Popular eye colors&lt;/a&gt; one demonstrates the use of &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Default_views&#34;&gt;Default views&lt;/a&gt;&amp;ndash;special comment directives that the Wikidata Query Service understands as instructions for how to present the data. The eye color query&amp;rsquo;s directive of &amp;ldquo;#defaultView:BubbleChart&amp;rdquo; means that running the query on &lt;a href=&#34;https://query.wikidata.org&#34;&gt;https://query.wikidata.org&lt;/a&gt; will (quickly!) give you this:&lt;/p&gt;
&lt;img id=&#34;idm139673944660704&#34; width=&#34;400&#34; src=&#34;https://www.bobdc.com/img/main/wikidata2eyecolors.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;result of eye color query below&#34;/&gt;
&lt;p&gt;&lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Popular_surnames_among_humans&#34;&gt;Popular surnames among humans&lt;/a&gt; creates another nice bubble chart.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Even_more_cats.2C_with_pictures&#34;&gt;Even more cats, with pictures&lt;/a&gt; query that follows the eye color one uses an ImageGrid defaultView to create the following, finally filling the gap between &amp;ldquo;SPARQL&amp;rdquo; and &amp;ldquo;cat pictures&amp;rdquo; that has bedeviled web technology for so long:&lt;/p&gt;
&lt;img id=&#34;idm139673944722400&#34; width=&#34;400&#34; src=&#34;https://www.bobdc.com/img/main/wikidatacats.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;result of the cat pictures query&#34;/&gt;
&lt;p&gt;The remaining six defaultViews also look like a lot of fun.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Children_of_Genghis_Khan&#34;&gt;Children of Genghis Khan&lt;/a&gt; sample query uses the Graph defaultView to display Khan&amp;rsquo;s children and grandchildren, with images of them when available, in a graph that lets you zoom and drag nodes around. A piece of it is shown above. The Music Genres query after that is similar. The line graph resulting from the &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Number_of_bands_by_year_and_genre&#34;&gt;Number of bands by year and genre&lt;/a&gt; query is also interesting.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After getting this far, I hadn&amp;rsquo;t even seen 10% of the sample queries, but I did find the answer to my original question about how to get to know the range of possibilities with SPARQL queries of Wikidata better. (One more nice sample query that I wanted to mention is not on the samples page but on the &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual&#34;&gt;User Manual&lt;/a&gt; one: an example of &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Geospatial_search&#34;&gt;Geospatial searches&lt;/a&gt; that lists airports within 100km of Berlin.)&lt;/p&gt;
&lt;p&gt;To really learn about how Wikidata executes SPARQL queries, the &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization&#34;&gt;SPARQL query service/query optimization&lt;/a&gt; page provides good background on how &lt;a href=&#34;https://www.bobdc.com/blog/trying-out-blazegraph&#34;&gt;Blazegraph&lt;/a&gt;, the triplestore and query engine that Wikidata&amp;rsquo;s SPARQL endpoint uses, goes about executing the queries. (I found it pretty gutsy of this page&amp;rsquo;s authors to add a &amp;ldquo;Try it!&amp;rdquo; link after a &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#A_query_that_has_difficulties&#34;&gt;sample query&lt;/a&gt; that the page itself says will time out.) As I wrote in the &amp;ldquo;Query Efficiency and Debugging&amp;rdquo; chapter of &amp;ldquo;Learning SPARQL&amp;rdquo;, query engines often optimize for you. Their methods for doing so are how these query engines try to distinguish themselves from each other, so learning more about the one that you&amp;rsquo;re using is worth it when you&amp;rsquo;re dealing with large-scale data like Wikidata. The &amp;ldquo;SPARQL query service/query optimization&amp;rdquo; page also describes how adding an &lt;code&gt;explain&lt;/code&gt; keyword to the query URL will get you a report on how it parses and optimizes your query.&lt;/p&gt;
&lt;p&gt;As much as I&amp;rsquo;d like to keep playing with the sample queries, I&amp;rsquo;m going to dig into the Wikidata data model and its mapping to RDF next. Watch this space&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>Getting to know Wikidata</title>
      <link>https://www.bobdc.com/blog/getting-to-know-wikidata/</link>
      <pubDate>Sun, 26 Feb 2017 10:23:49 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-to-know-wikidata/</guid>
      
      
      <description><div>First (SPARQL-oriented) steps.</div><div>&lt;img id=&#34;idm139800651418576&#34; src=&#34;https://www.bobdc.com/img/main/wikidatasparql.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Wikidata and SPARQL logos&#34;/&gt;
&lt;p&gt;I&amp;rsquo;ve written &lt;a href=&#34;https://www.google.com/search?q=inurl%3Asnee.com%2Fbobdc.blog+dbpedia&amp;amp;oq=inurl%3Asnee.com%2Fbobdc.blog+dbpedia/&#34;&gt;so often&lt;/a&gt; about DBpedia here that a few times I considered writing a book about it. As I saw &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt; get bigger and bigger, I kept postponing the day when I would dig in and learn more about this Wikipedia sibling project. I&amp;rsquo;ve finally done this, starting with a few basic steps and one extra fun one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Learn how to hit the SPARQL endpoint from an operating system command line with &lt;a href=&#34;https://curl.haxx.se/&#34;&gt;curl&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Explore, if available, the web form front end to the endpoint&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Learn how to find the identifier for whatever I like (a band, a person, a concept) so that I can create queries about it&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Automate the finding of the identifier when looking at a Wikipedia page&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;idm139800651465968&#34;&gt;Wikidata SPARQL queries from the command line&lt;/h2&gt;
&lt;p&gt;For that first task, you can append an &lt;a href=&#34;http://www.utilities-online.info/urlencode/&#34;&gt;escaped&lt;/a&gt; version of your query to &lt;code&gt;https://query.wikidata.org/sparql?query=&lt;/code&gt; and pass that to curl. For example, doing it with the query &amp;ldquo;SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10&amp;rdquo; gives you this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That command line retrieves the result in the default &lt;a href=&#34;https://www.w3.org/TR/rdf-sparql-XMLres/&#34;&gt;XML format&lt;/a&gt;. curl&amp;rsquo;s &lt;code&gt;-H&lt;/code&gt; option lets you add HTTP header information to your request; for example, adding &lt;code&gt;-H &amp;quot;Accept: text/csv&amp;quot;&lt;/code&gt; after &lt;code&gt;curl&lt;/code&gt; on the command line above retrieves a CSV version of the result set instead of XML.&lt;/p&gt;
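The same request is easy to script. Here is a minimal sketch using only Python's standard library that percent-encodes the query and sets the same Accept header that curl's -H option did; the actual network call is left commented out:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

query = "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10"
# Percent-encode the query and append it to the endpoint URL, just as
# in the curl example above.
url = "https://query.wikidata.org/sparql?query=" + quote(query, safe="")
req = Request(url, headers={"Accept": "text/csv"})  # ask for CSV results
# print(urlopen(req).read().decode("utf-8"))  # uncomment to run the query
```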
&lt;h2 id=&#34;idm139800651385200&#34;&gt;Web form front end for entering Wikidata SPARQL queries&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://query.wikidata.org/&#34;&gt;https://query.wikidata.org/&lt;/a&gt; is one of the nicest web forms I&amp;rsquo;ve ever seen for entering SPARQL queries. It offers color coding, auto-completion, and drop-down menus of tools, prefixes, and help.&lt;/p&gt;
&lt;p&gt;When I enter a query like the one above into this form and click the &lt;em&gt;Run&lt;/em&gt; button, the form runs the query and shows a URL in the browser&amp;rsquo;s address bar that incorporates the query. Pasting that full URL into another browser address bar takes me to the query form and enters that query (see &lt;a href=&#34;https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010&#34;&gt;this&lt;/a&gt; for an example), but doesn&amp;rsquo;t execute it the way &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+DISTINCT+%3Fp+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D+LIMIT+10&#34;&gt;DBpedia does&lt;/a&gt; in the same situation&amp;ndash;with the Wikidata form, you still need to click that &lt;em&gt;Run&lt;/em&gt; button. If anyone knows of some parameter that I can add to the Wikidata URL to make this happen, I&amp;rsquo;d love to hear about it; I could then use it to replace the delivery of the handful of JSON in the scriptlet described &lt;a href=&#34;#automating&#34;&gt;below&lt;/a&gt;. &lt;em&gt;March 4 update: I have learned from &lt;a href=&#34;https://twitter.com/JonasMKress/statuses/836905093215760384&#34;&gt;Jonas M. Kress&lt;/a&gt; that appending the escaped query to &amp;ldquo;&lt;a href=&#34;https://query.wikidata.org/embed.html&#34;&gt;https://query.wikidata.org/embed.html&lt;/a&gt;#&amp;rdquo; gives you a URL that will execute the query directly, &lt;a href=&#34;https://query.wikidata.org/embed.html#SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010&#34;&gt;like this&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;idm139800651379152&#34;&gt;Finding the identifier for a resource starting at its Wikipedia page&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Feb 27 update: it looks like I went to a lot of unnecessary trouble when I should have paid closer attention to the Wikipedia pages themselves, which now have a &amp;ldquo;Wikidata item&amp;rdquo; link on the left. I learned about this from &lt;a href=&#34;https://twitter.com/atomotic&#34;&gt;Raffaele Messuti&lt;/a&gt;, who also &lt;a href=&#34;https://twitter.com/atomotic/status/836139354548490240&#34;&gt;told me&lt;/a&gt; that a Ctrl+option+g keystroke will do the same thing. This keystroke combination didn&amp;rsquo;t work for me using a Das Keyboard under Ubuntu with either Chrome or Firefox, but may for you. The important thing is the nice link from every Wikipedia page to the corresponding Wikidata page, although you&amp;rsquo;ll want to substitute &amp;ldquo;/entity/&amp;rdquo; for &amp;ldquo;/wiki/&amp;rdquo; in the Wikidata URL to get the actual entity URI.&lt;/em&gt;&lt;/p&gt;
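That /wiki/-to-/entity/ substitution is simple enough to script. A one-function sketch:

```python
# Turn a Wikidata page URL (the "Wikidata item" link target) into the
# entity URI by swapping /wiki/ for /entity/.
def entity_uri(page_url):
    return page_url.replace("/wiki/", "/entity/", 1)

print(entity_uri("https://www.wikidata.org/wiki/Q3947"))
# https://www.wikidata.org/entity/Q3947
```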
&lt;p&gt;When viewing a Wikipedia page for something, you can usually find that thing&amp;rsquo;s DBpedia URI by rearranging the Wikipedia URL a little. &lt;a href=&#34;https://www.bobdc.com/blog/from-a-wikipedia-page-to-the-c&#34;&gt;Almost six years ago&lt;/a&gt; I automated this in a scriptlet that takes a browser from a Wikipedia page to the DBpedia URI for the page&amp;rsquo;s subject in one click.&lt;/p&gt;
&lt;p&gt;The usage of the English terms from the Wikipedia URLs in the corresponding DBpedia URIs worked pretty well for a bottom-up, easily crowd-sourced bootstrapping of the DBpedia URI design, but the English basis and the problems introduced by the occasional use of punctuation are not ideal. The Wikidata team did more initial design of the URI structure and went with the best practice of not incorporating actual names. (My favorite explication of this practice is on slides &lt;a href=&#34;https://www.slideshare.net/reduxd/beyond-the-polar-bear/41-Choosing_a_Nice_URL_design&#34;&gt;41&lt;/a&gt; and &lt;a href=&#34;https://www.slideshare.net/reduxd/beyond-the-polar-bear/42-The_resultant_URLbr_httpwwwbbccoukprogrammesb00t8wp0br_The&#34;&gt;42&lt;/a&gt; of &lt;a href=&#34;https://www.slideshare.net/reduxd/beyond-the-polar-bear/&#34;&gt;this BBC slide deck.&lt;/a&gt;) For example, while the DBpedia URI for &amp;ldquo;house&amp;rdquo; is &lt;code&gt;http://dbpedia.org/resource/House&lt;/code&gt;, the Wikidata one is &lt;code&gt;http://www.wikidata.org/entity/Q3947&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So if we can&amp;rsquo;t go from a Wikipedia page to a Wikidata URI by manipulating a string version of the Wikipedia URL, how do we do it? The &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format&#34;&gt;Wikibase/Indexing/RDF Dump Format&lt;/a&gt; page explains a lot about the structure of the data, and its &lt;a href=&#34;https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Sitelinks&#34;&gt;Sitelinks&lt;/a&gt; section describes how a triple with a predicate of &lt;code&gt;schema:about&lt;/code&gt; links a Wikipedia page to the Wikidata URI for the entity being described. If I want to know the URI for the concept of House and I know the concept&amp;rsquo;s Wikipedia URL, I can enter the query &amp;ldquo;SELECT ?uri WHERE { &lt;a href=&#34;https://en.wikipedia.org/wiki/House&#34;&gt;https://en.wikipedia.org/wiki/House&lt;/a&gt; schema:about ?uri }&amp;rdquo;. (You can try it in the Wikidata query form by clicking &lt;a href=&#34;https://query.wikidata.org/#SELECT%20%3Furi%20WHERE%20%7B%20%3Chttps%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHouse%3E%20schema%3Aabout%20%3Furi%20%7D&#34;&gt;here&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id=&#34;idm139800651368800&#34;&gt;Automating that&lt;/h2&gt;
&lt;p&gt;To go from a Wikipedia page to a Wikidata URI in one click, I needed to embed a SPARQL query about the page&amp;rsquo;s &lt;code&gt;schema:about&lt;/code&gt; value in a &lt;a href=&#34;https://en.wikipedia.org/wiki/Scriptlet&#34;&gt;scriptlet&lt;/a&gt; that would send the query to the Wikidata SPARQL endpoint. (I would have liked to send it to the query form and execute that, but as I described above, I couldn&amp;rsquo;t work out how to trigger the running of the query from the submitted URL.) I did get this to work, and you can drag this link to your Chrome bookmarks bar: &lt;a href=&#34;javascript:location.href=(%22https://query.wikidata.org/sparql?query=SELECT%20%3Furi%20WHERE%20%7B%3C%22%20+%20encodeURIComponent(location.href.replace(/_/g,%22%2520%22))%20+%20%22%3E%20schema%3Aabout%20%3Furi%7D&amp;amp;format=json%22)&#34;&gt;wp -&amp;gt; wikidata&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The scriptlet is a bit limited, though:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It returns a small handful of JSON instead of just the URI, which I would have preferred.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When used with Chrome, it displays the JSON in the browser. In a brief test with Firefox, the browser offered to download the JSON instead of displaying it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I mentioned above how Wikipedia and DBpedia use English words in their URL identifiers, and this often includes disambiguation language, so the scriptlet doesn&amp;rsquo;t work on those. For example, adding the string &amp;ldquo;Asteroid&amp;rdquo; to the base URL &amp;ldquo;&lt;a href=&#34;https://en.wikipedia.org/wiki/&#34;&gt;https://en.wikipedia.org/wiki/&lt;/a&gt;&amp;rdquo; will give you the Wikipedia URL for the English-language page describing minor planets, and if you&amp;rsquo;re looking at the &lt;a href=&#34;https://en.wikipedia.org/wiki/Asteroid&#34;&gt;Wikipedia page for that&lt;/a&gt; my new scriptlet will work just fine. However, if you add the string &amp;ldquo;Rock&amp;rdquo; to the same base URL, you get the URL for a Wikipedia &lt;a href=&#34;https://en.wikipedia.org/wiki/Rock&#34;&gt;disambiguation page&lt;/a&gt;. If you are viewing the Wikipedia page for &lt;a href=&#34;https://en.wikipedia.org/wiki/Rock_(geology)&#34;&gt;Rock (geology)&lt;/a&gt;, my scriptlet&amp;rsquo;s little bit of string manipulation that constructs a SPARQL query to send to the Wikidata endpoint won&amp;rsquo;t have enough to go on.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The scriptlet is about 180 characters of JavaScript that does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;For the current location in the browser (that is, the URL of the displayed Wikipedia page) replace any underscores with %2520. This is the escaped version of the escaped version of a space character, which I discovered is necessary through trial and error.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Escape the remainder of that URL as necessary.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Insert the result into a SPARQL query of the form &lt;code&gt;SELECT ?uri WHERE {&amp;lt;escaped-url&amp;gt; schema:about ?uri}&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a SPARQL endpoint GET request URL by appending all that to &amp;ldquo;&lt;a href=&#34;https://query.wikidata.org/sparql?query=%22&#34;&gt;https://query.wikidata.org/sparql?query=&amp;quot;&lt;/a&gt; and add &amp;ldquo;&amp;amp;format=json&amp;rdquo; at the end. (I tried &amp;ldquo;&amp;amp;format=csv&amp;rdquo; but instead of displaying the result Chrome offered to download it.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set &lt;code&gt;location.href&lt;/code&gt; to the result. This &amp;ldquo;sends&amp;rdquo; the browser to the constructed URL, which should then display the result of the query in JSON.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
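&lt;p&gt;To make those steps more concrete, here is a rough Python equivalent of the scriptlet&amp;rsquo;s string manipulation (just a sketch, not the scriptlet itself; the function name is made up):&lt;/p&gt;

```python
from urllib.parse import quote

def wikidata_query_url(wikipedia_url):
    """Sketch of the scriptlet's logic: build a Wikidata SPARQL endpoint
    GET URL asking for the schema:about value of a Wikipedia page.
    (Hypothetical function name; not part of the original scriptlet.)"""
    # Step 1: underscores in the Wikipedia URL must end up as %2520 in
    # the final URL, so turn them into %20 here; escaping the whole
    # query below turns that %20 into %2520.
    page = wikipedia_url.replace("_", "%20")
    # Step 3: insert the page URL into the SPARQL query.
    query = "SELECT ?uri WHERE {<" + page + "> schema:about ?uri}"
    # Steps 2 and 4: escape the query, append it to the endpoint URL,
    # and ask for JSON results.
    return ("https://query.wikidata.org/sparql?query="
            + quote(query, safe="") + "&format=json")

print(wikidata_query_url("https://en.wikipedia.org/wiki/House"))
```

In the browser, step 5 is just assigning this result to `location.href`; printing the URL here is enough to check the string manipulation.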
&lt;p&gt;Once I could find the URIs to represent the resources I was interested in, it was time to start querying for information about them. In my next blog entry, I&amp;rsquo;ll talk about exploring Wikidata and its RDF-related resources with SPARQL. There are definitely some great features there.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/wikidata">Wikidata</category>
      
    </item>
    
    <item>
      <title>Brand-name companies using SPARQL: the sparql.club</title>
      <link>https://www.bobdc.com/blog/brand-name-companies-using-spa/</link>
      <pubDate>Sun, 22 Jan 2017 09:37:51 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/brand-name-companies-using-spa/</guid>
      
      
      <description><div>Disney! Apple! Amazon! MasterCard!</div><div>&lt;p&gt;Since I wrote &lt;a href=&#34;https://www.bobdc.com/blog/experience-in-sparql-a-plus&#34;&gt;&amp;ldquo;Experience in SPARQL a plus&amp;rdquo;&lt;/a&gt; about SPARQL appearances in job postings almost three years ago, I still find myself &lt;a href=&#34;https://twitter.com/LearningSPARQL/status/731115175856594944&#34;&gt;pointing people to it&lt;/a&gt; to show them that SPARQL is not some academic theoretical thing but a popular tool in production use at well-known companies.&lt;/p&gt;
&lt;p&gt;On the job listing site &lt;a href=&#34;http://www.indeed.com&#34;&gt;indeed.com&lt;/a&gt;, I have a saved search for SPARQL mentions. The daily email of new search hits that this sends me typically lists a few entries for companies that I have heard of and some for companies that I haven&amp;rsquo;t. Every now and then I&amp;rsquo;ll pick out one to tweet about on &lt;a href=&#34;http://twitter.com/learningsparql&#34;&gt;@learningsparql&lt;/a&gt;, although I don&amp;rsquo;t do it nearly as often as I could.&lt;/p&gt;
&lt;p&gt;Between this ongoing stream of new job postings, the increasing age of that blog posting, and my ownership (inspired by Paul Ford&amp;rsquo;s &lt;a href=&#34;http://tilde.club/&#34;&gt;tilde.club&lt;/a&gt;) of the domain name &lt;a href=&#34;http://sparql.club/&#34;&gt;sparql.club&lt;/a&gt;, I thought it would be fun to keep an updated list there so that I can point the SPARQL haters at it.&lt;/p&gt;
&lt;p&gt;So the next time you see someone making ridiculous claims about SPARQL not catching on, tell them to check out the members of the &lt;a href=&#34;http://sparql.club/&#34;&gt;sparql.club&lt;/a&gt;!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2017">2017</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>A modern neural network in 11 lines of Python</title>
      <link>https://www.bobdc.com/blog/a-modern-neural-network-in-11/</link>
      <pubDate>Thu, 22 Dec 2016 07:52:56 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-modern-neural-network-in-11/</guid>
      
      
      <description><div>And a great learning tool for understanding neural nets.</div><div>&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Perceptron&#34;&gt;&lt;img id=&#34;idm140324229593680&#34; src=&#34;https://www.bobdc.com/img/main/Mark_I_perceptron.jpeg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;the mark I Perceptron&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When you learn new technology, it&amp;rsquo;s common to hear &amp;ldquo;don&amp;rsquo;t worry about the low-level details&amp;ndash;use the tools!&amp;rdquo; That&amp;rsquo;s a good long-term strategy, but when you learn the lower-level details of how the tools work, it gives you a fuller understanding of what they can do for you. I decided to go through Andrew Trask&amp;rsquo;s &lt;a href=&#34;http://iamtrask.github.io/2015/07/12/basic-python-network/&#34;&gt;A Neural Network in 11 lines of Python&lt;/a&gt; to really learn how every line worked, and it&amp;rsquo;s been very helpful. I had to review some matrix math and look up several &lt;a href=&#34;http://www.numpy.org/&#34;&gt;numpy&lt;/a&gt; function calls that he uses, but it was worth it.&lt;/p&gt;
&lt;p&gt;My title here refers to it as a &amp;ldquo;modern neural network&amp;rdquo; because while neural nets have been around since the 1950s, the use of backpropagation, a sigmoid function and the sigmoid&amp;rsquo;s derivative in Andrew&amp;rsquo;s script highlight the advances that have made neural nets so popular in machine learning today. For some excellent background on how we got from Frank Rosenblatt&amp;rsquo;s 1957 hard-wired &lt;a href=&#34;https://en.wikipedia.org/wiki/Perceptron&#34;&gt;Mark I Perceptron&lt;/a&gt; (pictured here) to how derivatives and backpropagation addressed the limitations of these early neural nets, see Andrey Kurenkov&amp;rsquo;s &lt;a href=&#34;http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/&#34;&gt;A &amp;lsquo;Brief&amp;rsquo; History of Neural Nets and Deep Learning, Part 1&lt;/a&gt;. The story includes a bit more drama than you might expect, with early AI pioneers Marvin Minsky and Seymour Papert convincing the community that limitations in the perceptron model would prevent neural nets from getting very far. I also recommend Michael Nielsen&amp;rsquo;s &lt;a href=&#34;http://neuralnetworksanddeeplearning.com/chap1.html&#34;&gt;Using neural nets to recognize handwritten digits&lt;/a&gt;, in particular the part on &lt;a href=&#34;http://neuralnetworksanddeeplearning.com/chap1.html#perceptrons&#34;&gt;perceptrons&lt;/a&gt;, which gives further background on that part of Kurenkov&amp;rsquo;s &amp;ldquo;Brief History,&amp;rdquo; and then Nielsen&amp;rsquo;s &lt;a href=&#34;http://neuralnetworksanddeeplearning.com/chap1.html#sigmoid_neurons&#34;&gt;sigmoid neurons&lt;/a&gt; part that follows it and describes how these limitations were addressed.&lt;/p&gt;
&lt;p&gt;Andrew&amp;rsquo;s 11-line neural network, with its lack of comments and whitespace, is more for show. The 42-line version that follows it is easier to understand and includes a great line-by-line explanation. Below are some of my own additional notes that I made as I dissected and played with his code. Often, I&amp;rsquo;m just restating something he already wrote but in my own words to try to understand it better. Hereafter, when I refer to his script, I mean the 42-line one.&lt;/p&gt;
&lt;p&gt;I took his advice of trying the script in an IPython (&lt;a href=&#34;https://www.bobdc.com/blog/sparql-in-a-jupyter-aka-ipytho&#34;&gt;Jupyter&lt;/a&gt;) notebook, where it was a lot easier to change some numbers (for example, the number of iterations in the main &lt;code&gt;for&lt;/code&gt; loop) and to add print statements that told me more about what was happening to the variables through the training step iterations. After playing with this a bit and reviewing his piece again, I realized that many of my experiments were things that he suggests in his bulleted list that begins with &amp;ldquo;Compare l1 after the first iteration and after the last iteration.&amp;rdquo; That whole list is good advice for learning more about how the script works.&lt;/p&gt;
&lt;p&gt;Beneath his script and above his line-by-line description he includes a chart explaining each variable&amp;rsquo;s role. As you read through the line-by-line description, I encourage you to refer back to that chart often.&lt;/p&gt;
&lt;p&gt;I have minimal experience with the numpy library, but based on the functions from Andrew&amp;rsquo;s script that I looked up, it seems typical that if you take a numpy function that does something to a number and pass it a data structure such as an array or matrix filled with numbers, it will do that thing to all the numbers and return the data structure.&lt;/p&gt;
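&lt;p&gt;A quick illustration of that behavior with &lt;code&gt;np.exp()&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np

# Passed a single number, np.exp() returns a single number...
print(np.exp(0))   # 1.0
# ...passed an array, it exponentiates every element and
# returns an array of the same shape.
print(np.exp(np.array([0.0, 1.0])))
```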
&lt;p&gt;Line 23 of Andrew&amp;rsquo;s script initializes the weights that tell the neural net how much attention to pay to the input at each neuron. Ultimately, a neural net&amp;rsquo;s job is to tune these weights based on what it sees in how input (in this script&amp;rsquo;s case, the rows of &lt;code&gt;X&lt;/code&gt;) corresponds to output (the values of &lt;code&gt;y&lt;/code&gt;) so that when it later sees new input it will hopefully output the right things. When this script starts, it has no idea what values to use as weights, so it puts random values in, but not completely random&amp;ndash;as Andrew writes, they should have a mean of 0. The &lt;code&gt;np.random.random((x,y))&lt;/code&gt; function returns a matrix of &lt;code&gt;x&lt;/code&gt; rows of &lt;code&gt;y&lt;/code&gt; random numbers between 0 and 1, so &lt;code&gt;2*np.random.random((3,1))&lt;/code&gt; returns 3 rows with 1 number each between 0 and 2, and the &amp;ldquo;- 1&amp;rdquo; added to that makes them random numbers between -1 and 1.&lt;/p&gt;
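&lt;p&gt;A quick check of that arithmetic (seeding the random number generator so the draws are repeatable):&lt;/p&gt;

```python
import numpy as np

np.random.seed(1)                 # fixed seed for repeatable draws
raw = np.random.random((3, 1))    # 3 rows of 1 number each, in [0, 1)
syn0 = 2 * raw - 1                # scaled to [-1, 1), mean 0
print(syn0.shape)                 # (3, 1)
print(((syn0 >= -1) & (syn0 < 1)).all())  # True
```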
&lt;p&gt;&lt;code&gt;np.dot()&lt;/code&gt; returns dot products. I found the web page &lt;a href=&#34;https://www.mathsisfun.com/algebra/matrix-multiplying.html&#34;&gt;How to multiply matrices&lt;/a&gt; (that is, how to find their dot product) helpful in reviewing something I hadn&amp;rsquo;t thought about in a while. You can reproduce that page&amp;rsquo;s &amp;ldquo;Multiplying a Matrix by a Matrix&amp;rdquo; example using numpy with this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

matrix1 = np.array([[1,2,3],[4,5,6]])
matrix2 = np.array([[7,8],[9,10],[11,12]])
print(np.dot(matrix1,matrix2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The four lines of code in Andrew&amp;rsquo;s main loop perform three tasks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;predict the output based on the input (&lt;code&gt;l0&lt;/code&gt;) and the current set of weights (&lt;code&gt;syn0&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;check how far off the predictions were&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;use that information to update the weights before proceeding to the next iteration&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you increase the number of iterations, you&amp;rsquo;ll see that first step get closer and closer to predicting an output of [[0][0][1][1]] in its final passes.&lt;/p&gt;
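&lt;p&gt;Condensing the script down to just those three tasks (my own abbreviated restatement of Andrew&amp;rsquo;s code, with a fixed random seed so the run is repeatable) looks like this:&lt;/p&gt;

```python
import numpy as np

def nonlin(x, deriv=False):
    # the sigmoid function, or its slope at x when deriv=True
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

X = np.array([[0,0,1], [0,1,1], [1,0,1], [1,1,1]])  # input rows
y = np.array([[0,0,1,1]]).T                         # desired outputs
np.random.seed(1)
syn0 = 2 * np.random.random((3,1)) - 1              # random weights, mean 0

for _ in range(10000):
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))            # 1. predict
    l1_error = y - l1                        # 2. how far off were we?
    l1_delta = l1_error * nonlin(l1, True)   # weight the error by confidence
    syn0 += np.dot(l0.T, l1_delta)           # 3. update the weights

print(np.round(l1).ravel())                  # [0. 0. 1. 1.]
```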
&lt;p&gt;Line 29 does its prediction by calculating the dot product of the input and the weights and then passing the result (a 4 x 1 matrix like [[-4.98467345] [-5.19108471] [ 5.39603866] [ 5.1896274 ]], as I learned from one of those extra &lt;code&gt;print&lt;/code&gt; statements I mentioned) to the sigmoid function named &lt;code&gt;nonlin()&lt;/code&gt; that is defined at the beginning of the script. If you graphed the values potentially returned by this function, they would not fall in a line (it&amp;rsquo;s &amp;ldquo;&lt;strong&gt;nonlin&lt;/strong&gt;ear&amp;rdquo;) but along an S (sigmoid) curve. Looking at the &lt;a href=&#34;https://en.wikipedia.org/wiki/Sigmoid_function&#34;&gt;Sigmoid function Wikipedia page&lt;/a&gt; shows that the expression &lt;code&gt;1/(1+np.exp(-x))&lt;/code&gt; that Andrew&amp;rsquo;s &lt;code&gt;nonlin()&lt;/code&gt; function uses to calculate the function&amp;rsquo;s return value (if the optional &lt;code&gt;deriv&lt;/code&gt; parameter has a value of False) corresponds to the formula shown near the top of the Wikipedia page. This &lt;code&gt;nonlin()&lt;/code&gt; function takes any number and returns a number between 0 and 1; as Andrew writes, &amp;ldquo;We use it to convert numbers to probabilities.&amp;rdquo; For example, if you pass a 0 to the function (or look at an S curve graph) you&amp;rsquo;ll see that the function returns .5; if you pass it a 4 or higher it returns a number very close to 1, and if you pass it a -4 or lower it returns a number very close to 0. The &lt;a href=&#34;https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html&#34;&gt;&lt;code&gt;np.exp()&lt;/code&gt;&lt;/a&gt; function used within that expression calculates the exponential of the passed value&amp;ndash;or all the values in an array or matrix, returning the same data structure. 
For example, &lt;code&gt;np.exp(1)&lt;/code&gt; returns &lt;code&gt;e&lt;/code&gt;, the base of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Natural_logarithm&#34;&gt;natural logarithm&lt;/a&gt;, which is about 2.718.&lt;/p&gt;
&lt;p&gt;Line 29 calls that function and stores the returned matrix in the &lt;code&gt;l1&lt;/code&gt; variable. Reviewing the variable chart, this is the &amp;ldquo;Second Layer of the Network, otherwise known as the hidden layer.&amp;rdquo; Line 32 then subtracts the &lt;code&gt;l1&lt;/code&gt; matrix from &lt;code&gt;y&lt;/code&gt; (the array of answers that it was hoping to get) and stores the difference in &lt;code&gt;l1_error&lt;/code&gt;. (Subtracting matrices follows the basic pattern of &lt;code&gt;np.array([[5],[4],[3]]) - np.array([[1],[1],[1]]) = np.array([[4],[3],[2]])&lt;/code&gt;.)&lt;/p&gt;
&lt;p&gt;Remember how line 23 assigned random values to the weights? After line 32 executes, the &lt;code&gt;l1_error&lt;/code&gt; matrix has clues about how to tune those weights, so as the comments in lines 34 and 35 say, the script multiplies how much it missed (&lt;code&gt;l1_error&lt;/code&gt;) by the slope of the sigmoid at the values in &lt;code&gt;l1&lt;/code&gt;. We find that slope by passing &lt;code&gt;l1&lt;/code&gt; to the same &lt;code&gt;nonlin()&lt;/code&gt; function, but this time, setting the &lt;code&gt;deriv&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt; to get that slope. (See &amp;ldquo;using the derivatives&amp;rdquo; in Kurenkov&amp;rsquo;s &lt;a href=&#34;http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/&#34;&gt;A &amp;lsquo;Brief&amp;rsquo; History&lt;/a&gt; for an explanation of why derivatives played such a big role in helping neural nets move beyond the simple perceptron models.) As Andrew writes, &amp;ldquo;When we multiply the &amp;lsquo;slopes&amp;rsquo; by the error, we are &lt;strong&gt;reducing the error of high confidence predictions&lt;/strong&gt;&amp;rdquo; (his emphasis). In other words, we&amp;rsquo;re putting more faith in those high confidence predictions when we create the data that will be used to update the weights.&lt;/p&gt;
&lt;p&gt;The script stores the result of multiplying the error by the slope in the &lt;code&gt;l1_delta&lt;/code&gt; variable and then uses the dot product of that and &lt;code&gt;l0&lt;/code&gt; (from the variable table: &amp;ldquo;First Layer of the Network, specified by the input data&amp;rdquo;) to update the weights stored in &lt;code&gt;syn0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Per Harald Borgen&amp;rsquo;s &lt;a href=&#34;https://medium.com/learning-new-stuff/how-to-learn-neural-networks-758b78f2736e#.qkx5pzw2b&#34;&gt;Learning How To Code Neural Networks&lt;/a&gt; (which begins with an excellent description of the relationship of a neuron&amp;rsquo;s inputs to its weights and goes on to talk about how useful Andrew&amp;rsquo;s &amp;ldquo;A Neural Network in 11 lines of Python&amp;rdquo; is) says that backpropagation &amp;ldquo;essentially means that you look at how wrong the network guessed, and then adjust the networks weights accordingly.&amp;rdquo; When someone on Quora &lt;a href=&#34;https://www.quora.com/Which-is-your-favorite-Machine-Learning-Algorithm&#34;&gt;asked Yann LeCun&lt;/a&gt; (director of AI research at Facebook and one of the &lt;a href=&#34;http://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf&#34;&gt;Three Kings&lt;/a&gt; of Deep Learning) &amp;ldquo;Which is your favorite Machine Learning Algorithm?&amp;rdquo; his answer was a single eight-letter word: &amp;ldquo;backprop.&amp;rdquo; Backpropagation is that important to why neural nets have become so fundamental in so many modern computer applications, so the updating of &lt;code&gt;syn0&lt;/code&gt; in line 39 is crucial here.&lt;/p&gt;
&lt;p&gt;And that&amp;rsquo;s it for the neural net training code. After the first iteration, the weighting values in &lt;code&gt;syn0&lt;/code&gt; will be a bit less random, and after 9,999 more iterations, they&amp;rsquo;ll be a lot closer to where you want them. I found that adding the following lines after line 29 gave me a better idea of what was happening in the &lt;code&gt;l1&lt;/code&gt; variable at the beginning and end of the script&amp;rsquo;s execution:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   if (iter &amp;lt; 4 or iter &amp;gt; 9997):
        print(&amp;quot;np.dot(l0,syn0) at iteration &amp;quot; + str(iter) + &amp;quot;: &amp;quot; + str(np.dot(l0,syn0)))
        print(&amp;quot;l1 = &amp;quot; + str(l1))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(One note for people using Python 3, like I did: in addition to adding the parentheses in calls to the &lt;code&gt;print&lt;/code&gt; function, the main &lt;code&gt;for&lt;/code&gt; loop had to say just &amp;ldquo;range&amp;rdquo; instead of &amp;ldquo;xrange&amp;rdquo;. More on this at &lt;a href=&#34;http://stackoverflow.com/questions/15014310/why-is-there-no-xrange-function-in-python3&#34;&gt;stackoverflow&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;These new lines showed that after the second iteration, &lt;code&gt;l1&lt;/code&gt; had these values, rounded to two decimal places here: [[ 0.26] [ 0.36] [ 0.23] [ 0.32 ]]. As Andrew&amp;rsquo;s output shows, at the very end, &lt;code&gt;l1&lt;/code&gt; equals [[ 0.00966449] [ 0.00786506] [ 0.99358898] [ 0.99211957]], so it got a lot closer to the [0,0,1,1] that it was shooting for. How can you make it get even closer? By increasing the iteration count to be greater than 10,000.&lt;/p&gt;
&lt;p&gt;For some real fun, I added the following after the script&amp;rsquo;s last line, because if you&amp;rsquo;re going to train a neural net on some data, why not then try the trained network (that is, the set of tuned weights) on some other data to see how well it performs? After all, Andrew does write &amp;ldquo;All of the learning is stored in the syn0 matrix.&amp;rdquo;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;X1 = np.array([ [0,1,1], [1,1,0], [1,0,1],[1,1,1] ])  
x1prediction = nonlin(np.dot(X1,syn0))
print(x1prediction)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first two rows of my new input are different from those in the training data. The &lt;code&gt;x1prediction&lt;/code&gt; variable ended up as [[ 0.00786466] [ 0.9999225 ] [ 0.99358931] [ 0.99211997]], which was great to see. Rounded, these are 0, 1, 1, and 1, so the neural net knew that for those first two rows of data&amp;ndash;which it hadn&amp;rsquo;t seen before&amp;ndash;the output should be the first value from each.&lt;/p&gt;
&lt;p&gt;Everything I describe here is from part 1 of Andrew&amp;rsquo;s exposition, &amp;ldquo;A Tiny Toy Network.&amp;rdquo; Part 2, &amp;ldquo;A Slightly Harder Problem&amp;rdquo; has a script that is eight lines longer (four lines if you don&amp;rsquo;t count white space and comments) and I plan to dig into that next, because among other things, it has a more explicit demo of backpropagation.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Image courtesy of &lt;a href=&#34;https://en.wikipedia.org/wiki/Perceptron&#34;&gt;Wikipedia&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
    </item>
    
    <item>
      <title>Pulling RDF out of MySQL</title>
      <link>https://www.bobdc.com/blog/pulling-rdf-out-of-mysql/</link>
      <pubDate>Sun, 13 Nov 2016 10:09:53 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/pulling-rdf-out-of-mysql/</guid>
      
      
      <description><div>With a command line option and a very short stylesheet.</div><div>&lt;img id=&#34;idm45376093529616&#34; src=&#34;https://www.bobdc.com/img/main/mysqlrdflogos.png&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; alt=&#34;MySQL and RDF logos&#34; width=&#34;120&#34;/&gt;
&lt;p&gt;When I wrote the blog posting &lt;a href=&#34;https://www.bobdc.com/blog/my-sql-quick-reference&#34;&gt;My SQL quick reference&lt;/a&gt; last month, I showed how you can pass an SQL query to MySQL from the operating system command line when starting up MySQL, and also how adding a &lt;code&gt;-B&lt;/code&gt; switch requests a tab-separated version of the data. I did not mention that &lt;code&gt;-X&lt;/code&gt; requests it in XML, and that this XML is simple enough that a fifteen-line XSLT 1.0 stylesheet can convert any such output to RDF.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve written before about how tools like the open source &lt;a href=&#34;http://d2rq.org/&#34;&gt;D2RQ&lt;/a&gt; and Capsenta&amp;rsquo;s &lt;a href=&#34;https://capsenta.com/&#34;&gt;Ultrawrap&lt;/a&gt; provide middleware layers that let you send SPARQL queries to relational databases&amp;ndash;and to &lt;a href=&#34;http://online.liebertpub.com/doi/full/10.1089/big.2012.0004#_i4&#34;&gt;combinations of relational databases from different vendors&lt;/a&gt;, which is where the real fun begins. This command line stylesheet trick gives you a simpler, more lightweight way to pull the relational data you want into an RDF file where you can use it with SPARQL or any other RDF tool.&lt;/p&gt;
&lt;p&gt;If you have MySQL and &lt;a href=&#34;https://linux.die.net/man/1/xsltproc&#34;&gt;xsltproc&lt;/a&gt; installed, you can do it all with a single command at the operating system prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mysql -u someuser --password=someuserpw -X -e &#39;USE employees; SELECT * FROM employees LIMIT 5&#39; | xsltproc mysql2ttl.xsl -
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Two notes about that command line: 1. don&amp;rsquo;t miss that hyphen at the very end, which tells xsltproc to read from standard in. 2. I added the LIMIT part for faster testing because the &lt;code&gt;employees&lt;/code&gt; table has 30,024 rows. To come up with that number of 30,024, I had to look at my last blog entry to remember how to count the table&amp;rsquo;s rows, so writing out that quick reference has already paid off for me.) The XML returned by MySQL looks like this, with data from subsequent rows following a similar pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;resultset statement=&amp;quot;SELECT * FROM employees LIMIT 5&amp;quot;
      xmlns:xsi=&amp;quot;http://www.w3.org/2001/XMLSchema-instance&amp;quot;&amp;gt;
  &amp;lt;row&amp;gt;
    &amp;lt;field name=&amp;quot;emp_no&amp;quot;&amp;gt;10001&amp;lt;/field&amp;gt;
    &amp;lt;field name=&amp;quot;first_name&amp;quot;&amp;gt;Georgi&amp;lt;/field&amp;gt;
    &amp;lt;field name=&amp;quot;last_name&amp;quot;&amp;gt;Facello&amp;lt;/field&amp;gt;
    &amp;lt;field name=&amp;quot;birth_date&amp;quot;&amp;gt;1953-09-02&amp;lt;/field&amp;gt;
    &amp;lt;field name=&amp;quot;gender&amp;quot;&amp;gt;M&amp;lt;/field&amp;gt;
    &amp;lt;field name=&amp;quot;hire_date&amp;quot;&amp;gt;1986-06-26&amp;lt;/field&amp;gt;
    &amp;lt;field name=&amp;quot;department&amp;quot;&amp;gt;Development&amp;lt;/field&amp;gt;
  &amp;lt;/row&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I thought the inclusion of the query as an attribute of the &lt;code&gt;resultset&lt;/code&gt; element was a nice touch. The following XSLT stylesheet converts any such XML to Turtle RDF; you&amp;rsquo;ll want to adjust the prefix declarations to use URIs more appropriate to your data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version=&amp;quot;1.0&amp;quot; xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;&amp;gt;


&amp;lt;xsl:output method=&amp;quot;text&amp;quot;/&amp;gt;


&amp;lt;xsl:template match=&amp;quot;resultset&amp;quot;&amp;gt;
  @prefix v: &amp;lt;http://learningsparql.com/ns/myVocabURI/&amp;gt; . 
  @prefix d: &amp;lt;http://learningsparql.com/ns/myDataURI/&amp;gt; . 
      &amp;lt;xsl:apply-templates/&amp;gt;
    &amp;lt;/xsl:template&amp;gt;


        &amp;lt;xsl:template match=&amp;quot;row&amp;quot;&amp;gt;
d:&amp;lt;xsl:value-of select=&amp;quot;count(preceding-sibling::row) + 1&amp;quot;/&amp;gt; 
          &amp;lt;xsl:apply-templates/&amp;gt; . 
        &amp;lt;/xsl:template&amp;gt;


    &amp;lt;xsl:template match=&amp;quot;field&amp;quot;&amp;gt;
      v:&amp;lt;xsl:value-of select=&amp;quot;@name&amp;quot;/&amp;gt; &amp;quot;&amp;lt;xsl:value-of select=&amp;quot;.&amp;quot;/&amp;gt;&amp;quot; ;
    &amp;lt;/xsl:template&amp;gt;


&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
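&lt;p&gt;If xsltproc isn&amp;rsquo;t handy, the same transformation is simple enough to sketch with Python&amp;rsquo;s standard library (the function name is my own, and the prefix URIs are the same placeholders used in the stylesheet):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

def mysql_xml_to_turtle(xml_text):
    """Mimic the stylesheet: one d:N subject per row element,
    one v:fieldname triple per field element."""
    lines = [
        "@prefix v: <http://learningsparql.com/ns/myVocabURI/> .",
        "@prefix d: <http://learningsparql.com/ns/myDataURI/> .",
    ]
    root = ET.fromstring(xml_text)
    for i, row in enumerate(root.findall("row"), start=1):
        lines.append("d:%d" % i)  # subject from the row's position
        for field in row.findall("field"):
            lines.append('  v:%s "%s" ;' % (field.get("name"), field.text))
        lines.append("  .")       # end this subject's triples
    return "\n".join(lines)

sample = """<resultset statement="SELECT * FROM employees LIMIT 1">
  <row>
    <field name="emp_no">10001</field>
    <field name="first_name">Georgi</field>
  </row>
</resultset>"""
print(mysql_xml_to_turtle(sample))
```

As with the stylesheet, a trailing semicolon before each period is legal Turtle, and real data would also need quote and backslash escaping in the literal values, which the stylesheet doesn&amp;rsquo;t do either.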
&lt;p&gt;The result includes some extra blank lines that I could suppress with &lt;code&gt;xsl:text&lt;/code&gt; elements wrapping certain bits of the stylesheet, but a Turtle parser doesn&amp;rsquo;t care, so neither do I:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  d:1

    
      v:emp_no &amp;quot;10001&amp;quot; ;

    

    
      v:first_name &amp;quot;Georgi&amp;quot; ;

    

    
      v:last_name &amp;quot;Facello&amp;quot; ;

    

    
      v:birth_date &amp;quot;1953-09-02&amp;quot; ;

    

    
      v:gender &amp;quot;M&amp;quot; ;

    

    
      v:hire_date &amp;quot;1986-06-26&amp;quot; ;

    

    
      v:department &amp;quot;Development&amp;quot; ;

    
   . 
&lt;/code&gt;&lt;/pre&gt;
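If you&amp;rsquo;d rather not run an XSLT processor, the same resultset/row/field walk can be sketched with Python&amp;rsquo;s standard library. This is my own illustrative sketch, not part of the original pipeline; the element and attribute names match the resultset XML shown above, and the prefix URIs are the placeholders from the stylesheet:

```python
import xml.etree.ElementTree as ET

def resultset_to_turtle(xml_text):
    """Convert resultset/row/field XML to Turtle, mirroring the XSLT above."""
    root = ET.fromstring(xml_text)
    lines = [
        "@prefix v: <http://learningsparql.com/ns/myVocabURI/> .",
        "@prefix d: <http://learningsparql.com/ns/myDataURI/> .",
    ]
    for i, row in enumerate(root.findall("row"), start=1):
        lines.append("d:%d" % i)  # subject built from the row's position
        for field in row.findall("field"):
            lines.append('    v:%s "%s" ;' % (field.get("name"), field.text or ""))
        lines.append("    .")
    return "\n".join(lines)

sample = """<resultset statement="SELECT * FROM employees LIMIT 1">
  <row>
    <field name="emp_no">10001</field>
    <field name="first_name">Georgi</field>
  </row>
</resultset>"""
print(resultset_to_turtle(sample))
```

Like the stylesheet output, the generated Turtle leans on the fact that a trailing semicolon before the closing period is legal.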
&lt;p&gt;You can customize the stylesheet for specific input data. For example, the URIs in your triple subjects could build on an ID value selected from the data instead of building on the position of the XML &lt;code&gt;row&lt;/code&gt; element, as I did. As another customization, instead of outputting all triple objects as strings, you could insert this template rule into the XSLT stylesheet to output the two date fields typed as actual dates, as long as you remembered to also add an &lt;code&gt;xsd&lt;/code&gt; prefix declaration at the top of the stylesheet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    &amp;lt;xsl:template match=&amp;quot;field[@name=&#39;birth_date&#39; or @name=&#39;hire_date&#39;]&amp;quot;&amp;gt;
      v:&amp;lt;xsl:value-of select=&amp;quot;@name&amp;quot;/&amp;gt; &amp;quot;&amp;lt;xsl:value-of select=&amp;quot;.&amp;quot;/&amp;gt;&amp;quot;^^xsd:date ;
    &amp;lt;/xsl:template&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or, you could leave the XSLT stylesheet in its generic form and convert the data types using a SPARQL query further down your processing pipeline with something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX v: &amp;lt;http://learningsparql.com/ns/myVocabURI/&amp;gt; 
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;


CONSTRUCT {
  ?row v:birth_date ?bdate ;
       v:hire_date ?hdate . 
}
WHERE {
  ?row v:birth_date ?bdateString ;
  v:hire_date ?hdateString . 
  BIND(xsd:date(?bdateString) AS ?bdate)
  BIND(xsd:date(?hdateString) AS ?hdate)
}
&lt;/code&gt;&lt;/pre&gt;
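A third option is to do the typing while generating the Turtle in the first place. A minimal sketch of that idea in Python (my own illustration, not from the original pipeline; the field names and the &lt;code&gt;xsd&lt;/code&gt; prefix are the ones used above):

```python
from datetime import date

def turtle_object(name, value):
    """Emit a Turtle object, typing the two date fields as xsd:date."""
    if name in ("birth_date", "hire_date"):
        date.fromisoformat(value)        # validate YYYY-MM-DD before typing it
        return '"%s"^^xsd:date' % value
    return '"%s"' % value                # everything else stays a plain string

print(turtle_object("hire_date", "1986-06-26"))  # "1986-06-26"^^xsd:date
print(turtle_object("gender", "M"))              # "M"
```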
&lt;p&gt;However you choose to do it, the nice thing is that you have lots of options for grabbing the massive amounts of data stored in the many MySQL databases out there and then using that data as triples with a variety of lightweight, open source software.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>My SQL quick reference</title>
      <link>https://www.bobdc.com/blog/my-sql-quick-reference/</link>
      <pubDate>Sun, 30 Oct 2016 11:49:36 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-sql-quick-reference/</guid>
      
      
      <description><div>Pun intended.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.flickr.com/photos/duncan/8749152201/in/photolist-6e4LKh-8ari4m-oGnN-um4W4-6z6vtc-4YqUM-ek8G8K-bPndYP-5fnqw-98uvoe-aiE4fq-62uR1H-5pMuiD-mrnJR-dc3BvJ-bPnpj2-7ARZNf-ynemY-e8UZD8-aiE4aN-4wcs8A-3nyAxk-doabzD-aiE4cQ-2erzo1-9rFiwj-9rFimf-9rCp5p-7bGugW-9rCndH-9LxX1-9rFmg3-9rCqw8-6nijTX-9rFkA7-7bCGsc-9rFkf1-9rFi6o-9rFnPN-7rGTAc-8uRMpb-9rFiHY-9rCpAB-9rFhCm-9rCrc8-9rFhRs-6nifBe-9rCnMD-6nnYzL-9rCpqe&#34;&gt;&lt;img id=&#34;idm46017889960000&#34; src=&#34;https://www.bobdc.com/img/main/sql.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SQL graffiti&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I sometimes go many months with no need to use SQL, so over the years I&amp;rsquo;ve developed my own quick reference to remind me how to do basic tasks when necessary. Most SQL quick reference sheets out there try to pack as much different syntax as they can in a small space, but mine focuses on what the basic tasks are and how to do them. I hope that someone finds it useful.&lt;/p&gt;
&lt;p&gt;Most of my SQL experience has been with MySQL, and I separated what I believe are the standard SQL parts below from the MySQL-specific ones. Corrections welcome. If you really want to know where SQL implementations differ from the standard, &lt;a href=&#34;http://troels.arvin.dk/db/rdbms/&#34;&gt;Comparison of different SQL implementations&lt;/a&gt; is an excellent, detailed reference on what&amp;rsquo;s different from one implementation to another.&lt;/p&gt;
&lt;p&gt;I tested all the SELECT commands shown with the &lt;a href=&#34;https://dev.mysql.com/doc/employee/en/&#34;&gt;MySQL employee sample database&lt;/a&gt; that I downloaded from &lt;a href=&#34;https://github.com/datacharmer/test_db&#34;&gt;github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(I also later converted this to be the &lt;a href=&#34;https://learnxinyminutes.com/docs/sql/&#34;&gt;SQL&lt;/a&gt; page for the wonderful &lt;a href=&#34;https://learnxinyminutes.com/&#34;&gt;Learn X in Y Minutes&lt;/a&gt; site; that page has since been translated to Spanish, Italian, Russian, Turkish, and Chinese!)&lt;/p&gt;
&lt;h2 id=&#34;idm46017889953792&#34;&gt;Standard SQL&lt;/h2&gt;
&lt;p&gt;Enter these at the SQL command line. I don&amp;rsquo;t think semicolons are necessary after every one of these commands, but I find it simplest to just always add them. SQL is not case-sensitive about keywords, and I tend to enter them in lower-case, but I&amp;rsquo;m showing them in the conventional upper-case here because it makes it easier to distinguish them from database, table, and column names.&lt;/p&gt;
&lt;table id=&#34;idm46017889952592&#34; border=&#34;1&#34; style=&#34;border: 1px solid; border-collapse: collapse; border-spacing: 10px; text-align: left;&#34;&gt;
&lt;tr id=&#34;idm46017889951728&#34; &gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889951600&#34; width=&#34;30%&#34;&gt;quit to return to the operating system command line&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889950992&#34;&gt;&lt;tt id=&#34;idm46017889950816&#34;&gt;quit;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889950400&#34;&gt;&lt;td style=&#34;padding: 8px; background: white; &#34; id=&#34;idm46017889950272&#34;&gt;list available databases&lt;/td&gt;
&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889949840&#34;&gt;&lt;tt id=&#34;idm46017889949712&#34;&gt;# comments start with a pound sign&lt;/tt&gt;&lt;br id=&#34;idm46017889949360&#34;/&gt;
&lt;tt id=&#34;idm46017889949104&#34;&gt;SHOW DATABASES;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889948688&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889948560&#34;&gt;select the database named &lt;tt id=&#34;idm46017889948256&#34;&gt;employees&lt;/tt&gt; to use&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889947808&#34;&gt;&lt;tt id=&#34;idm46017889947680&#34;&gt;USE employees;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889947264&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889947136&#34;&gt;create a new database called &lt;tt id=&#34;idm46017889946832&#34;&gt;someDatabase&lt;/tt&gt;&lt;/td&gt;
&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889946416&#34;&gt;&lt;tt id=&#34;idm46017889946288&#34;&gt;# database and table names are case-sensitive&lt;/tt&gt;&lt;br id=&#34;idm46017889945968&#34;/&gt;
&lt;tt id=&#34;idm46017889945712&#34;&gt;CREATE DATABASE someDatabase;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889945280&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889945152&#34;&gt;delete database &lt;tt id=&#34;idm46017889944864&#34;&gt;someDatabase&lt;/tt&gt;&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889944576&#34;&gt;&lt;tt id=&#34;idm46017889944448&#34;&gt;DROP DATABASE someDatabase;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889944016&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889943888&#34;&gt;create a table called &lt;tt id=&#34;idm46017889943600&#34;&gt;tablename1&lt;/tt&gt;, with the two columns shown, for the database currently in use&lt;/td&gt;
&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889942976&#34;&gt;&lt;tt id=&#34;idm46017889942848&#34;&gt;# lots of other options available for how you specify the columns...&lt;/tt&gt;&lt;br id=&#34;idm46017889942512&#34;/&gt;
&lt;tt id=&#34;idm46017889942256&#34;&gt;CREATE TABLE  tablename1 (&#39;fname&#39; VARCHAR(20),&#39;lname&#39; VARCHAR(20));&lt;br/&gt;
# The apostrophes in the line above should be backticks (`). &lt;br/&gt;
# Hugo&#39;s rendering engine won&#39;t let me put them there. 
&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889941792&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889941664&#34;&gt;insert a row of data into the table &lt;tt id=&#34;idm46017889941360&#34;&gt;tablename1&lt;/tt&gt;&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889941072&#34;&gt;&lt;tt id=&#34;idm46017889940944&#34;&gt;INSERT INTO tablename1 VALUES(&#39;Richard&#39;,&#39;Mutt&#39;);&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889940496&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889940368&#34;&gt;delete the table &lt;tt id=&#34;idm46017889940080&#34;&gt;tablename1&lt;/tt&gt;&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889939792&#34;&gt;&lt;tt id=&#34;idm46017889939664&#34;&gt;DROP TABLE tablename1;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889939248&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889939120&#34;&gt;show all data in the &lt;tt id=&#34;idm46017889938832&#34;&gt;departments&lt;/tt&gt; table&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889938384&#34;&gt;&lt;tt id=&#34;idm46017889938256&#34;&gt;SELECT * FROM departments;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889937824&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889937696&#34;&gt;show just the &lt;code id=&#34;idm46017889937408&#34;&gt;dept_no&lt;/code&gt; and &lt;code id=&#34;idm46017889936960&#34;&gt;dept_name&lt;/code&gt; columns from the &lt;tt id=&#34;idm46017889936512&#34;&gt;departments&lt;/tt&gt; table&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889936064&#34;&gt;&lt;tt id=&#34;idm46017889935936&#34;&gt;SELECT dept_no, dept_name FROM departments;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889935488&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889935360&#34;&gt;just get the first 5 rows from table &lt;tt id=&#34;idm46017889935056&#34;&gt;departments&lt;/tt&gt;&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889934768&#34;&gt;&lt;tt id=&#34;idm46017889934640&#34;&gt;SELECT * FROM departments LIMIT 5;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889934208&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889934080&#34;&gt;show dept_name column values in table &lt;tt id=&#34;idm46017889933776&#34;&gt;departments&lt;/tt&gt; where dept_name has the substring &#34;en&#34;&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889933312&#34;&gt;&lt;tt id=&#34;idm46017889933184&#34;&gt;SELECT dept_name FROM departments WHERE dept_name LIKE &#34;%en%&#34;;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889932336&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889932208&#34;&gt;show all columns from table &lt;tt id=&#34;idm46017889931952&#34;&gt;departments&lt;/tt&gt; where the dept_name column starts with an &#34;S&#34; and has exactly 4  characters after it&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889931440&#34;&gt;&lt;tt id=&#34;idm46017889931312&#34;&gt;SELECT * FROM departments WHERE dept_name LIKE &#34;S____&#34;;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889930864&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889930736&#34;&gt;Select &lt;tt id=&#34;idm46017889930448&#34;&gt;title&lt;/tt&gt; values from the &lt;tt id=&#34;idm46017889930000&#34;&gt;titles&lt;/tt&gt; table but don&#39;t show duplicates&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889929536&#34;&gt;&lt;tt id=&#34;idm46017889929408&#34;&gt;SELECT DISTINCT title FROM titles;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889928976&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889928848&#34;&gt;Same as above, but sorted (case-sensitive) by the &lt;tt id=&#34;idm46017889928528&#34;&gt;title&lt;/tt&gt; values&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889928080&#34;&gt;&lt;tt id=&#34;idm46017889927952&#34;&gt;SELECT DISTINCT title FROM titles ORDER BY title;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889927504&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889927376&#34;&gt;Count the rows in the &lt;tt id=&#34;idm46017889927088&#34;&gt;departments&lt;/tt&gt; table&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889926640&#34;&gt;&lt;tt id=&#34;idm46017889926512&#34;&gt;SELECT count(*) FROM departments;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889926080&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889925952&#34;&gt;Count the rows in the &lt;tt id=&#34;idm46017889925664&#34;&gt;departments&lt;/tt&gt; table that have &#34;en&#34; as a substring of the &lt;tt id=&#34;idm46017889925184&#34;&gt;dept_name&lt;/tt&gt; value&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889924736&#34;&gt;&lt;tt id=&#34;idm46017889924608&#34;&gt;SELECT count(*) FROM departments WHERE dept_name LIKE &#34;%en%&#34;;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889924144&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889924016&#34;&gt;In &lt;tt id=&#34;idm46017889923712&#34;&gt;tablename1&lt;/tt&gt;, change the &lt;code id=&#34;idm46017889923264&#34;&gt;fname&lt;/code&gt; value to &#34;John&#34; for all rows that have an &lt;code id=&#34;idm46017889922784&#34;&gt;lname&lt;/code&gt; value of &#34;Mutt&#34;&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889922336&#34;&gt;&lt;tt id=&#34;idm46017889922208&#34;&gt;UPDATE tablename1 SET fname=&#34;John&#34; WHERE lname=&#34;Mutt&#34;;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889921760&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889921632&#34;&gt;delete all rows from the &lt;tt id=&#34;idm46017889921328&#34;&gt;tablename1&lt;/tt&gt; table&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889920880&#34;&gt;&lt;tt id=&#34;idm46017889920752&#34;&gt;DELETE FROM tablename1;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889920336&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889920208&#34;&gt;delete rows from the &lt;tt id=&#34;idm46017889919920&#34;&gt;tablename1&lt;/tt&gt; table where the &lt;tt id=&#34;idm46017889919472&#34;&gt;lname&lt;/tt&gt; value begins with &#34;M&#34;&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889919024&#34;&gt;&lt;tt id=&#34;idm46017889918896&#34;&gt;DELETE FROM tablename1 WHERE lname like &#34;M%&#34;;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
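If you want to try the standard data-manipulation commands above without a MySQL server handy, Python&amp;rsquo;s built-in sqlite3 module accepts most of the same syntax (server-oriented commands such as SHOW DATABASES and USE are not part of it):

```python
import sqlite3

# In-memory database; no server or login required.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tablename1 (fname VARCHAR(20), lname VARCHAR(20))")
cur.execute("INSERT INTO tablename1 VALUES('Richard','Mutt')")
cur.execute("UPDATE tablename1 SET fname='John' WHERE lname='Mutt'")
print(cur.execute("SELECT * FROM tablename1").fetchall())  # [('John', 'Mutt')]
cur.execute("DELETE FROM tablename1 WHERE lname LIKE 'M%'")
print(cur.execute("SELECT count(*) FROM tablename1").fetchall())  # [(0,)]
```

Note that sqlite prefers single quotes around string literals, where the MySQL examples above use double quotes.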
&lt;h2 id=&#34;idm46017889918320&#34;&gt;MySQL-specific SQL prompt commands&lt;/h2&gt;
&lt;table id=&#34;idm46017889917888&#34; border=&#34;1&#34; style=&#34;border: 1px solid; border-collapse: collapse; border-spacing: 10px; text-align: left;&#34;&gt;
&lt;tr id=&#34;idm46017889917072&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889916944&#34; width=&#34;30%&#34;&gt;list the tables in the currently selected database&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889916384&#34;&gt;&lt;tt id=&#34;idm46017889916256&#34;&gt;SHOW TABLES;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889915840&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889915712&#34;&gt;Describe the columns in table &lt;tt id=&#34;idm46017889915408&#34;&gt;departments&lt;/tt&gt; (handy before doing SELECT statements to see column names and types)&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889914912&#34;&gt;&lt;tt id=&#34;idm46017889914784&#34;&gt;DESCRIBE departments;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889914368&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889914240&#34;&gt;run the SQL commands stored in the file myscript.sql&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889913920&#34;&gt;&lt;tt id=&#34;idm46017889913792&#34;&gt;SOURCE myscript.sql;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889913376&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889913248&#34;&gt;Load a local csv file (enabling this may require  &lt;tt id=&#34;idm46017889912928&#34;&gt;--local-infile&lt;/tt&gt; with the mysql startup command or the adjustment of a config file)&lt;/td&gt;
&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889912304&#34;&gt;&lt;tt id=&#34;idm46017889912176&#34;&gt;# Enter the following as one command&lt;br id=&#34;idm46017889911872&#34;/&gt;
LOAD DATA LOCAL INFILE &#39;/some/path/names.csv&#39; INTO TABLE tablename1 COLUMNS TERMINATED BY &#39;,&#39;;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889911376&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889911248&#34;&gt;Create new user &lt;tt id=&#34;idm46017889910960&#34;&gt;jane&lt;/tt&gt; with password &lt;tt id=&#34;idm46017889910512&#34;&gt;janepw&lt;/tt&gt;, then grant her access to everything&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889910048&#34;&gt;&lt;tt id=&#34;idm46017889909920&#34;&gt;CREATE USER &#39;jane&#39; IDENTIFIED BY &#39;janepw&#39;;&lt;br id=&#34;idm46017889909600&#34;/&gt;
GRANT ALL ON *.* TO &#39;jane&#39;;
&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;h2 id=&#34;idm46017889909040&#34;&gt;Handy MySQL commands from the operating system prompt&lt;/h2&gt;
&lt;p&gt;There are often multiple ways to execute some of the following tasks, but these work for me. Treat all as single-line commands.&lt;/p&gt;
&lt;table id=&#34;idm46017889908064&#34; border=&#34;1&#34; style=&#34;border: 1px solid; border-collapse: collapse; border-spacing: 10px; text-align: left;&#34;&gt;
&lt;tr id=&#34;idm46017889907248&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889907120&#34; width=&#34;30%&#34;&gt;start up MySQL with a single command (which includes the plain text password, which is not a good idea for any kind of production system)&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889906464&#34;&gt;&lt;tt id=&#34;idm46017889906336&#34;&gt;mysql -u someuser --password=somepassword&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
  &lt;tr id=&#34;idm46017889905888&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889905760&#34;&gt;Run a script of SQL commands from the operating system command line and then return to the command line; the -t option formats SELECT output as tables (without it, output is tab-delimited)&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889905328&#34;&gt;&lt;tt id=&#34;idm46017889905200&#34;&gt;mysql -u someuser --password=somepassword -t &amp;lt; employees.sql&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
  &lt;tr id=&#34;idm46017889904592&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889904464&#34;&gt;create a file of SQL commands to recreate the database employees (with the employees demo database, this created a 168MB file)&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889904064&#34;&gt;&lt;tt id=&#34;idm46017889903936&#34;&gt;mysqldump -u someuser --password=somepassword employees &amp;gt; makeemployees.sql&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889903456&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889903328&#34;&gt;Run a SQL command (or more than one using a semicolon to separate them) from the operating system prompt&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889902944&#34;&gt;&lt;tt id=&#34;idm46017889902816&#34;&gt;mysql -u someuser --password=somepassword -e &#39;USE employees; SELECT * FROM departments&#39;
&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;idm46017889902320&#34;&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889902192&#34;&gt;Same as above, but getting output as tab-separated values--only difference is to add &lt;tt id=&#34;idm46017889901744&#34;&gt;-B&lt;/tt&gt; for &#34;batch&#34; mode&lt;/td&gt;&lt;td style=&#34;padding: 8px; background: white;&#34; id=&#34;idm46017889901280&#34;&gt;&lt;tt id=&#34;idm46017889901152&#34;&gt;mysql -u someuser --password=somepassword -B -e &#39;USE employees; SELECT  * FROM departments&#39;&lt;/tt&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;h2 id=&#34;idm46017889900144&#34;&gt;Other handy tricks, as covered in the MySQL documentation&lt;/h2&gt;
&lt;p&gt;The MySQL documentation&amp;rsquo;s &lt;a href=&#34;http://dev.mysql.com/doc/refman/en/examples.html&#34;&gt;Examples of Common Queries&lt;/a&gt; covers many additional useful tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Maximum Value for a Column&lt;/li&gt;
&lt;li&gt;The Row Holding the Maximum of a Certain Column&lt;/li&gt;
&lt;li&gt;Maximum of Column per Group&lt;/li&gt;
&lt;li&gt;The Rows Holding the Group-wise Maximum of a Certain Column&lt;/li&gt;
&lt;li&gt;Using User-Defined Variables&lt;/li&gt;
&lt;li&gt;Using Foreign Keys&lt;/li&gt;
&lt;li&gt;Searching on Two Keys&lt;/li&gt;
&lt;li&gt;Calculating Visits Per Day&lt;/li&gt;
&lt;li&gt;Using AUTO_INCREMENT&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that my URL for the link to this information doesn&amp;rsquo;t include a version number, but gets redirected by mysql.com to the URL for the latest release&amp;rsquo;s version of this documentation, as documentation URLs should do.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://creativecommons.org/licenses/by-nc/2.0/&#34;&gt;CC BY-NC&lt;/a&gt; &lt;a href=&#34;https://www.flickr.com/photos/duncan/8749152201/in/photolist-6e4LKh-8ari4m-oGnN-um4W4-6z6vtc-4YqUM-ek8G8K-bPndYP-5fnqw-98uvoe-aiE4fq-62uR1H-5pMuiD-mrnJR-dc3BvJ-bPnpj2-7ARZNf-ynemY-e8UZD8-aiE4aN-4wcs8A-3nyAxk-doabzD-aiE4cQ-2erzo1-9rFiwj-9rFimf-9rCp5p-7bGugW-9rCndH-9LxX1-9rFmg3-9rCqw8-6nijTX-9rFkA7-7bCGsc-9rFkf1-9rFi6o-9rFnPN-7rGTAc-8uRMpb-9rFiHY-9rCpAB-9rFhCm-9rCrc8-9rFhRs-6nifBe-9rCnMD-6nnYzL-9rCpqe&#34;&gt;photo&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/photos/duncan/&#34;&gt;duncan&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/data-science">data science</category>
      
    </item>
    
    <item>
      <title>Semantic web semantics vs. vector embedding machine learning semantics</title>
      <link>https://www.bobdc.com/blog/semantic-web-semantics-vs-vect/</link>
      <pubDate>Sun, 25 Sep 2016 11:01:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-semantics-vs-vect/</guid>
      
      
      <description><div>It&#39;s all semantics.</div><div>&lt;img id=&#34;idm140335378958656&#34; src=&#34;https://www.bobdc.com/img/main/homersemantics.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Home and semantics&#34; width=&#34;300&#34;/&gt;
&lt;p&gt;When I presented &amp;ldquo;intro to the semantic web&amp;rdquo; slides in &lt;a href=&#34;http://www.topquadrant.com&#34;&gt;TopQuadrant&lt;/a&gt; product training classes, I described how people talking about &amp;ldquo;semantics&amp;rdquo; in the context of semantic web technology mean something specific, but that other claims for computerized semantics (especially, in many cases, &amp;ldquo;semantic search&amp;rdquo;) were often vague attempts to use the word as a marketing term. Since joining &lt;a href=&#34;http://www.ccri.com&#34;&gt;CCRi&lt;/a&gt;, though, I&amp;rsquo;ve learned plenty about machine learning applications that use semantics to get real work done (often, &amp;ldquo;semantic search&amp;rdquo;), and they can do some great things.&lt;/p&gt;
&lt;h2 id=&#34;idm140335378953440&#34;&gt;Semantic Web semantics&lt;/h2&gt;
&lt;p&gt;To review the semantic web sense of &amp;ldquo;semantics&amp;rdquo;: RDF gives us a way to state facts using {subject, predicate, object} triples. RDFS and OWL give us vocabularies to describe the resources referenced in these triples, and the descriptions can record semantics about those resources that let us get more out of the data. Of course, the descriptions themselves are triples, letting us say things like &lt;code&gt;{ex:Employee rdfs:subClassOf ex:Person}&lt;/code&gt;, which tells us that any instance of the &lt;code&gt;ex:Employee&lt;/code&gt; class is also an instance of &lt;code&gt;ex:Person&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That example indicates some of the semantics of what it means to be an employee, but people familiar with object-oriented development take that ability for granted. OWL can take the recording of semantics well beyond that. For example, because properties themselves are resources, when I say &lt;code&gt;{dm:locatedIn  rdf:type owl:TransitiveProperty}&lt;/code&gt;, I&amp;rsquo;m encoding some of the meaning of the &lt;code&gt;dm:locatedIn&lt;/code&gt; property in a machine-readable way: I&amp;rsquo;m saying that it&amp;rsquo;s transitive, so that if &lt;code&gt;{x:resource1 dm:locatedIn x:resource2}&lt;/code&gt; and &lt;code&gt;{x:resource2 dm:locatedIn x:resource3}&lt;/code&gt;, we can infer that &lt;code&gt;{x:resource1 dm:locatedIn x:resource3}&lt;/code&gt;.&lt;/p&gt;
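That inference pattern is simple enough to sketch in a few lines of Python. This is only my illustration of what &lt;code&gt;owl:TransitiveProperty&lt;/code&gt; entails, using the toy triples from the paragraph above, and not how a real reasoner implements it:

```python
# Toy triples from the dm:locatedIn example; tuples stand in for RDF triples.
triples = {
    ("x:resource1", "dm:locatedIn", "x:resource2"),
    ("x:resource2", "dm:locatedIn", "x:resource3"),
}

def transitive_closure(triples, prop):
    """Add every triple that declaring prop transitive entails."""
    inferred = set(triples)
    changed = True
    while changed:                       # repeat until no new triples appear
        changed = False
        for s, p, o in list(inferred):
            for s2, p2, o2 in list(inferred):
                if p == p2 == prop and o == s2 and (s, prop, o2) not in inferred:
                    inferred.add((s, prop, o2))
                    changed = True
    return inferred

print(("x:resource1", "dm:locatedIn", "x:resource3") in
      transitive_closure(triples, "dm:locatedIn"))  # True
```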
&lt;p&gt;A tool that understands what &lt;code&gt;owl:TransitiveProperty&lt;/code&gt; means will let me get more out of my data. My blog entry &lt;a href=&#34;https://www.bobdc.com/blog/trying-out-blazegraph&#34;&gt;Trying Out Blazegraph&lt;/a&gt; from earlier this year showed how I took advantage of OWL metadata to query for all the furniture in a particular building even though the dataset had no explicit data about any resources being furniture or any resources being in that building other than some rooms.&lt;/p&gt;
&lt;p&gt;This is all built on very explicit semantics: we use triples to say things about resources so that people and applications can understand and do more with those resources. The interesting semantics work in the machine learning world is more about inferring semantic relationships.&lt;/p&gt;
&lt;h2 id=&#34;idm140335378944400&#34;&gt;Semantics and embedded vector spaces&lt;/h2&gt;
&lt;p&gt;(All suggestions for corrections to this section are welcome.) Machine learning is essentially the use of data-driven algorithms that perform better as they have more data to work with, &amp;ldquo;learning&amp;rdquo; from this additional data. For example, Netflix can make better recommendations to you now than they could ten years ago because the additional accumulated data about what you like to watch and what other people with similar tastes have also watched gives Netflix more to go on when making these recommendations.&lt;/p&gt;
&lt;p&gt;The world of &lt;a href=&#34;https://en.wikipedia.org/wiki/Distributional_semantics&#34;&gt;distributional semantics&lt;/a&gt; shows that analysis of what words appear with what other words, in what order, can tell us a lot about these words and their relationships—if you analyze enough text. Let&amp;rsquo;s say we begin by using a neural network to assign a vector of numbers to each word. This creates a collection of vectors known as a &amp;ldquo;vector space&amp;rdquo;; adding vectors to this space is known as &amp;ldquo;embedding&amp;rdquo; them. Performing linear algebra on these vectors can provide insight about the relationships between the words that the vectors represent. In the most popular example, the mathematical relationship between the vectors for the words &amp;ldquo;king&amp;rdquo; and &amp;ldquo;queen&amp;rdquo; is very similar to the relationship between the vectors for &amp;ldquo;man&amp;rdquo; and &amp;ldquo;woman&amp;rdquo;. This diagram from the TensorFlow tutorial &lt;a href=&#34;https://www.tensorflow.org/versions/r0.10/tutorials/word2vec/index.html&#34;&gt;Vector Representations of Words&lt;/a&gt; shows that other identified relationships include grammatical and geographical ones:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.tensorflow.org/versions/r0.10/tutorials/word2vec/index.html&#34;&gt;&lt;img id=&#34;idm140335378939248&#34; src=&#34;https://www.bobdc.com/img/main/linear-relationships.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;TensorFlow diagram about inferred word relationships&#34; width=&#34;500&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
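A toy illustration of that vector arithmetic, with hand-picked two-dimensional &amp;ldquo;embeddings&amp;rdquo; standing in for real learned vectors (real word2vec vectors have hundreds of dimensions; cosine similarity picks the nearest word):

```python
import math

# Hand-picked 2-d vectors chosen to illustrate the arithmetic, not learned ones.
vec = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.3, 0.8],
    "woman": [0.3, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman should land near queen
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max(vec, key=lambda word: cosine(vec[word], analogy))
print(best)  # queen
```

Real implementations also exclude the three query words from the candidate answers, which this sketch skips.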
&lt;p&gt;The popular open source &lt;a href=&#34;https://github.com/dav/word2vec&#34;&gt;word2vec&lt;/a&gt; implementation of this developed at Google includes a script that lets you do analogy queries. (The TensorFlow tutorial mentioned above uses word2vec; another great way to get hands-on experience with word vectors is Radim Rehurek&amp;rsquo;s &lt;a href=&#34;https://radimrehurek.com/gensim/tutorial.html&#34;&gt;gensim tutorial&lt;/a&gt;.) I installed word2vec on an Ubuntu machine easily enough, started up the demo-analogy.sh script, and it prompted me to enter three words. I entered &amp;ldquo;king queen father&amp;rdquo; to ask it &amp;ldquo;king is to queen as father is to what?&amp;rdquo; It gave me a list of 40 word-score pairs with these at the top:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;     mother    0.698822
    husband    0.553576
     sister    0.552917
        her    0.548955
grandmother    0.529910
       wife    0.526212
    parents    0.512507
   daughter    0.509455
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Entering &amp;ldquo;london england berlin&amp;rdquo; produced a list that began with this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   germany     0.522487
   prussia     0.482481
   austria     0.447184
    saxony     0.435668
   bohemia     0.429096
westphalia     0.407746
     italy     0.406134
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I entered &amp;ldquo;run ran walk&amp;rdquo; in the hope of seeing &amp;ldquo;walked&amp;rdquo; but got a list that began like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   hooray      0.446358
    rides      0.445045
ninotchka      0.444158
searchers      0.442369
   destry      0.435961
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It did a pretty good job with most of these, but obviously not a great job throughout. The past tense of walk is definitely not &amp;ldquo;hooray&amp;rdquo;, but these inferences were based on a training data set of 96 megabytes, which isn&amp;rsquo;t very large. A Google search on phrases from the text8 input file included with word2vec for this demo shows that it&amp;rsquo;s probably part of a &lt;a href=&#34;http://www.mattmahoney.net/dc/textdata&#34;&gt;2006 Wikipedia dump&lt;/a&gt; used for text compression tests and other processes that need a non-trivial text collection. More serious applications of word2vec often read much larger Wikipedia subsets as training data, and of course you&amp;rsquo;re not limited to using Wikipedia data: the exploration of other datasets that use a variety of spoken languages and scripts is one of the most interesting aspects of these early days of the use of this technology.&lt;/p&gt;
&lt;p&gt;The one-to-one relationships shown in the TensorFlow diagrams above make the inferred relationships look more magical than they are. As you can see from the results of my queries, word2vec finds the words that are closest to what you asked for and lists them with their scores, and you may have several with good scores or none. Your application can just pick the result with the highest score, but you might want to first set an acceptable cutoff value so that you don&amp;rsquo;t take the &amp;ldquo;hooray&amp;rdquo; inference too seriously.&lt;/p&gt;
&lt;p&gt;On the other hand, if you just pick the single result with the highest score, you might miss some good inferences, because while Berlin is the capital of Germany, it was also the capital of Prussia for over 200 years, so I was happy to see that get the second-highest score there—although, if we put too much faith in a score of 0.482481 (or even of 0.522487) we&amp;rsquo;re going to get some &amp;ldquo;king queen father&amp;rdquo; answers that we don&amp;rsquo;t want. Again, a bigger training data set would help there.&lt;/p&gt;
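&lt;p&gt;The vector arithmetic behind these analogy queries is easy to sketch. The toy three-dimensional vectors below are hand-made stand-ins for real word2vec output, but the scoring logic (find the words whose vectors are closest to b - a + c, then keep only scores above a cutoff) is the same basic idea:&lt;/p&gt;

```python
import math

# Toy word vectors standing in for real word2vec output.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "father": [0.2, 0.9, 0.1],
    "mother": [0.2, 0.1, 0.9],
    "walked": [0.7, 0.5, 0.5],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def analogy(a, b, c, cutoff=0.0):
    # "a is to b as c is to ?": score every other word by its
    # similarity to (b - a + c) and keep scores above the cutoff.
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    scored = [(w, cosine(v, target)) for w, v in vectors.items()
              if w not in (a, b, c)]
    return sorted([p for p in scored if p[1] >= cutoff],
                  key=lambda p: p[1], reverse=True)

print(analogy("king", "queen", "father"))  # "mother" scores highest
```

&lt;p&gt;With these toy vectors, &amp;ldquo;mother&amp;rdquo; comes out on top; raising the cutoff is how an application would avoid taking a low-scoring &amp;ldquo;hooray&amp;rdquo;-style inference too seriously.&lt;/p&gt;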
&lt;p&gt;If you look at the &lt;a href=&#34;https://github.com/dav/word2vec/blob/master/scripts/demo-analogy.sh&#34;&gt;demo-analogy.sh&lt;/a&gt; script itself, you&amp;rsquo;ll see various parameters that you can tweak when creating the vector data. The use of larger training sets is not the only thing that can improve the results above, and machine learning expertise means not only getting to know the algorithms that are available but also learning how to tune parameters like these.&lt;/p&gt;
&lt;p&gt;The script is simple enough that I could easily revise it to make it read some other file instead of the text8 one included with it. I set it to read the &lt;a href=&#34;https://en.wikipedia.org/wiki/Summa_Theologica&#34;&gt;Summa Theologica&lt;/a&gt;, in which St. Thomas Aquinas laid out all the theology of the Catholic Church, as I made grand plans for Big Question analogy queries like &amp;ldquo;man is to soul as God is to what?&amp;rdquo; My eventual query results were a lot more like the &amp;ldquo;run ran walk hooray&amp;rdquo; results above than anything sensible, with low scores for what it did find. With my text file of the complete Summa Theologica weighing in at 17 megabytes, I was clearly hoping for too much from it. I do have ideas for other input to try, and I encourage you to try it for yourself.&lt;/p&gt;
&lt;p&gt;An especially exciting thing about the use of embedding vectors to identify potentially previously unknown relationships is that it&amp;rsquo;s not limited to use on text. You can use it with images, video, audio, and any other machine readable data, and at CCRi, we have. (I&amp;rsquo;m using the marketing &amp;ldquo;we&amp;rdquo; here; if you&amp;rsquo;ve read this far you&amp;rsquo;re familiar with all of my hands-on experience with embedding vectors.)&lt;/p&gt;
&lt;h2 id=&#34;idm140335378921232&#34;&gt;Embedding vector space semantics and semantic web semantics&lt;/h2&gt;
&lt;p&gt;Can there be any connection between these two &amp;ldquo;semantic&amp;rdquo; technologies? RDF-based models are designed to take advantage of explicit semantics, and a program like word2vec can infer semantic relationships and make them explicit. Modifications to the scripts included with word2vec could output OWL or SKOS triples that enumerate relationships between identified resources, making a nice contribution to the many systems using SKOS taxonomies and thesauruses. Another possibility is that if you can train a machine learning model with instances (for example, labeled pictures of dogs and cats) that are identified with declared classes in an ontology, then running the model on new data can do classifications that take advantage of the ontology—for example, after identifying new cat and dog pictures, a query for mammals can find them.&lt;/p&gt;
&lt;p&gt;Going the other way, machine learning systems designed around unstructured text can often do even more with structured text, where it&amp;rsquo;s easier to find what you want, and I&amp;rsquo;ve learned at CCRi that RDF (if not RDFS or OWL) is much more popular among such applications than I realized. Large taxonomies such as those of the &lt;a href=&#34;http://id.loc.gov/download/&#34;&gt;Library of Congress&lt;/a&gt;, &lt;a href=&#34;http://wiki.dbpedia.org/&#34;&gt;DBpedia&lt;/a&gt;, and &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt; have lots of synonyms, explicit subclass relationships, and sometimes even definitions, and they can contribute a great deal to these applications.&lt;/p&gt;
&lt;p&gt;A well-known success story in combining the two technologies is IBM&amp;rsquo;s Watson. The paper &lt;a href=&#34;http://www.aclweb.org/anthology/W13-3413&#34;&gt;Semantic Technologies in IBM Watson&lt;/a&gt; describes the technologies used in Watson and how these technologies formed the basis of a seminar course given at Columbia University; distributional semantics, semantic web technology, and DBpedia all play a role. Frederick Giasson and Mike Bergman&amp;rsquo;s &lt;a href=&#34;http://www.mkbergman.com/1981/cognonto-is-on-the-hunt-for-big-ai-game/&#34;&gt;Cognonto&lt;/a&gt; also looks like an interesting project to connect machine learning to large collections of triples. I&amp;rsquo;m sure that other interesting combinations are happening around the world, especially considering the amount of open source software available in both areas.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Converting between MIDI and RDF: readable MIDI and more fun with RDF</title>
      <link>https://www.bobdc.com/blog/converting-between-midi-and-rd/</link>
      <pubDate>Sun, 28 Aug 2016 12:24:21 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/converting-between-midi-and-rd/</guid>
      
      
      <description><div>Listen to my fun!</div><div>&lt;img id=&#34;idm140614557973488&#34; src=&#34;https://www.bobdc.com/img/main/midirdf.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;MIDI and RDF logos&#34;/&gt;
&lt;p&gt;When I first heard about Albert Meroño-Peñuela and Rinke Hoekstra&amp;rsquo;s &lt;a href=&#34;https://github.com/albertmeronyo/midi2rdf&#34;&gt;midi2rdf&lt;/a&gt; project, which converts back and forth between the venerable &lt;a href=&#34;https://en.wikipedia.org/wiki/MIDI&#34;&gt;Musical Instrument Digital Interface&lt;/a&gt; binary format and RDF, I thought it seemed like an interesting academic exercise. Thinking about it more, I realized that it makes a great contribution to both the MIDI world and to musical RDF geeks.&lt;/p&gt;
&lt;p&gt;MIDI has been the standard protocol for integrating synthesizers and related musical equipment together since the 1980s. I&amp;rsquo;ve only recently thrown out a book with the MIDI specs that I&amp;rsquo;ve owned for nearly that long because, as with so many other technical specifications, they&amp;rsquo;re now available &lt;a href=&#34;https://www.midi.org/specifications&#34;&gt;online&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Meroño-Peñuela and Hoekstra&amp;rsquo;s midi2rdf lets you convert between MIDI files and Turtle RDF. I love the title of their ESWC 2016 paper on it, &amp;ldquo;The Song Remains the Same&amp;rdquo; (&lt;a href=&#34;http://2016.eswc-conferences.org/sites/default/files/papers/Accepted%20Posters%20and%20Demos/ESWC2016_DEMO_The_Song_Remains_the_Same.pdf&#34;&gt;pdf&lt;/a&gt;)&amp;ndash;I was pretty young when Led Zeppelin&amp;rsquo;s &lt;a href=&#34;https://www.youtube.com/watch?v=8w3emvHepgU&#34;&gt;Houses of the Holy&lt;/a&gt; album came out, but I remember it vividly. The song remains the same because the project&amp;rsquo;s midi2rdf and rdf2midi scripts provide lossless round-trip conversion between the two formats, which makes it a very valuable tool: it gives us a text file serialization of MIDI based on a published standard, making MIDI downright readable. Looking at these RDF files and spending no serious time with the MIDI spec, I worked out which resources and properties were doing what and used this to create my own MIDI files.&lt;/p&gt;
&lt;p&gt;As a somewhat musical RDF geek, this was a lot of fun. I wrote Python scripts to generate different Turtle files of different kinds of random music, then converted them to MIDI so that I could listen to them. (You can find it all in &lt;a href=&#34;https://github.com/bobdc/misc/tree/master/midirdffun&#34;&gt;github&lt;/a&gt;.) The use of random functions means that running the same script several times creates different variations on the music. Below you will find links to MP3 versions of what I called fakeBebop and two versions of some whole-tone piano music that I generated, along with the MIDI and RDF files that go with them.&lt;/p&gt;
&lt;p&gt;Each MIDI file (and its RDF equivalent) starts with some setup data to identify information such as the sounds that it will play and the tempo. Instead of learning all those setup details for my program to generate, I used the excellent Linux/Mac/Windows open source &lt;a href=&#34;https://musescore.com/&#34;&gt;MuseScore&lt;/a&gt; music scoring program to generate a MIDI file with just a few notes of whatever instruments I wanted and then converted that to RDF. (This ability to convert in both directions is an important part of the value of the midi2rdf package.) Then, keeping the setup part of that RDF, I deleted the actual notes and had my script copy the setup part and then append newly generated notes to it.&lt;/p&gt;
&lt;p&gt;In RDF terms, the note generation meant two things: adding a pair of &lt;code&gt;mid:NoteOnEvent&lt;/code&gt; resources (one to start playing a note and one to stop) and then adding references to those events onto a musical track listing the events to execute. So, for example, the first &lt;code&gt;mid:NoteOnEvent&lt;/code&gt; in the following pair defines the start of a note at pitch 69, which is A above middle C on a piano. The &lt;code&gt;mid:channel&lt;/code&gt; of 0 had been defined in the setup part, and the &lt;code&gt;mid:tick&lt;/code&gt; value specifies how long the note will play until the next &lt;code&gt;mid:NoteOnEvent&lt;/code&gt;. (I was too lazy to look up how the &lt;code&gt;mid:tick&lt;/code&gt; values relate to elapsed time and picked some through trial and error.) The &lt;code&gt;mid:velocity&lt;/code&gt; values essentially turn the note on and off.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;p2:event0104 a mid:NoteOnEvent ;
    mid:channel 0 ;
    mid:pitch 69 ;
    mid:tick 400 ;
    mid:velocity 80 .


p2:event0105 a mid:NoteOnEvent ;
    mid:channel 0 ;
    mid:pitch 69 ;
    mid:tick 500 ;
    mid:velocity 0 .
&lt;/code&gt;&lt;/pre&gt;
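&lt;p&gt;As background (midi2rdf itself never needs this), MIDI pitch numbers map to frequencies by the standard equal-temperament formula, with pitch 69 coming out as concert A at 440 Hz:&lt;/p&gt;

```python
def midi_pitch_to_hz(pitch):
    # Equal temperament: pitch 69 is A above middle C at 440 Hz,
    # and each half step multiplies the frequency by 2**(1/12).
    return 440.0 * 2 ** ((pitch - 69) / 12)

print(midi_pitch_to_hz(69))            # 440.0
print(round(midi_pitch_to_hz(60), 2))  # middle C: 261.63
```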
&lt;p&gt;As my script outputs noteOn events after the setup part, it appends references to them onto a string in memory that begins like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mid:pianoHeadertrack01 a mid:Track ;
    mid:hasEvent p2:event0000,
        p2:event0001,
        p2:event0002,
        p2:event0003,
        # etc. until you finish with a period
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After outputting all the &lt;code&gt;mid:NoteOnEvent&lt;/code&gt; events, the script outputs this string. (While the triples in this resource are technically unordered, rdf2midi seemed to assume that the event names are &amp;ldquo;event&amp;rdquo; followed by a zero-padded number. When an early version of my first script didn&amp;rsquo;t do this, the notes got played in an odd order. Maybe it&amp;rsquo;s just playing them in alphabetic sort order.)&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s all for just one track. My fakeBebop script does this for three tracks: a bass track playing fairly random quarter notes in the range of an upright bass, a muted trumpet track playing fairly random triplet-feel eighth notes (sometimes with a rest substituted), and a percussion track repeating a standard bebop ride cymbal pattern. You can see some generated Turtle RDF at &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/midirdffun/fakeBebop.ttl&#34;&gt;fakeBebop.ttl&lt;/a&gt;, the MIDI file generated from the Turtle file by midi2rdf at &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/midirdffun/fakeBebop.mid&#34;&gt;fakeBebop.mid&lt;/a&gt;, and listen to what it sounds like at &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/fakeBebop.mp3&#34;&gt;fakeBebop.mp3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By &amp;ldquo;fairly random&amp;rdquo; I mean a random note within 5 half steps (a perfect fourth) of the previous note. Without any melodies beyond this random selection of notes, I think it still sounds a bit beboppy because, as the early bebop pioneers added more complex scales to the simple major and minor scales played by earlier jazz musicians, it all got more chromatic.&lt;/p&gt;
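&lt;p&gt;My actual scripts are in the github repo linked above; the sketch below just shows the general shape of the approach, with zero-padded event names, paired note-on/note-off &lt;code&gt;mid:NoteOnEvent&lt;/code&gt; resources, and pitches that wander at most five half steps per move. (The &lt;code&gt;mid:track01&lt;/code&gt; name and tick values here are illustrative stand-ins, not copied from my scripts.)&lt;/p&gt;

```python
import random

def random_walk_notes(count, start_pitch=57, low=40, high=76, seed=None):
    # Emit Turtle for paired mid:NoteOnEvent resources: one event starts
    # each note (velocity 80) and its partner stops it (velocity 0).
    rng = random.Random(seed)
    pitch = start_pitch
    event_names = []
    blocks = []
    for i in range(count):
        for offset, tick, velocity in ((0, 400, 80), (1, 500, 0)):
            name = "p2:event%04d" % (2 * i + offset)  # zero-padded names
            event_names.append(name)
            blocks.append(
                "%s a mid:NoteOnEvent ;\n"
                "    mid:channel 0 ;\n"
                "    mid:pitch %d ;\n"
                "    mid:tick %d ;\n"
                "    mid:velocity %d .\n" % (name, pitch, tick, velocity))
        # Next pitch: within 5 half steps, clamped to the instrument's range.
        pitch = min(high, max(low, pitch + rng.randint(-5, 5)))
    # The track resource lists every event that it should play.
    blocks.append("mid:track01 a mid:Track ;\n    mid:hasEvent " +
                  ",\n        ".join(event_names) + " .\n")
    return "\n".join(blocks)

print(random_walk_notes(4, seed=1))
```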
&lt;p&gt;I have joked with my brother about how if you quietly play random notes on a piano with both hands using the same whole tone scale, it can sound a bit like Debussy, who was one of the &lt;a href=&#34;https://en.wikipedia.org/wiki/List_of_pieces_which_use_the_whole_tone_scale&#34;&gt;early users&lt;/a&gt; of this scale. My wholeTonePianoQuarterNotes.py script follows logic similar to the fakeBebop script but outputs two piano tracks that correspond to a piano player&amp;rsquo;s left and right hands and use the same whole tone scale. You can see some generated Turtle RDF at &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/midirdffun/wholeTonePianoQuarterNotes.ttl&#34;&gt;wholeTonePianoQuarterNotes.ttl&lt;/a&gt;, the MIDI file generated from that by rdf2midi at &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/midirdffun/wholeTonePianoQuarterNotes.mid&#34;&gt;wholeTonePianoQuarterNotes.mid&lt;/a&gt;, and hear what it sounds like at &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/wholeTonePianoQuarterNotes.mp3&#34;&gt;wholeTonePianoQuarterNotes.mp3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before doing the whole tone piano quarter notes script I did one with random note durations, so it sounds like something from a bit later in the twentieth century. Generated Turtle RDF: &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/midirdffun/wholeTonePiano.ttl&#34;&gt;wholeTonePiano.ttl&lt;/a&gt;; MIDI file generated by rdf2midi: &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/midirdffun/wholeTonePiano.mid&#34;&gt;wholeTonePiano.mid&lt;/a&gt;; MP3: &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/wholeTonePiano.mp3&#34;&gt;wholeTonePiano.mp3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I can think of all kinds of ideas for additional experiments, such as redoing the two piano experiments with the four voices of a string quartet or having the fakeBebop one generate &lt;a href=&#34;http://www.jazzguitar.be/jazz_chord_progressions.html&#34;&gt;common jazz chord progressions&lt;/a&gt; and typical licks over them. (Speaking of string quartets and Debussy, I love that &lt;a href=&#34;http://www.popisms.com/TelevisionCommercial/127626/Apple-Commercial-for-Apple-iPad-Pro-2016.aspx&#34;&gt;Apple iPad Pro ad&lt;/a&gt; that NBC showed so often during the recent Olympics.) It would also be interesting to try some experiments with &lt;a href=&#34;https://en.wikipedia.org/wiki/Black_MIDI&#34;&gt;Black MIDI&lt;/a&gt; (or perhaps &amp;ldquo;Black RDF&amp;rdquo;!). If I had pursued these ideas, I wouldn&amp;rsquo;t be writing this blog entry right now, because I had to cut myself off at some point.&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;http://thebridgepai.org/benjamin-obrien-concert-supercollider-workshop/&#34;&gt;recently learned&lt;/a&gt; about &lt;a href=&#34;http://supercollider.github.io/&#34;&gt;Supercollider&lt;/a&gt;, an open source Windows/Mac/Linux IDE with its own programming language that several serious electronic music composers use for generating music, and I could easily picture spending all of my free time playing with that. At least midi2rdf&amp;rsquo;s RDF basis gave me the excuse of having a work-related angle as I wrote scripts to generate odd music. Although I was just slapping together some demo code for fun, I do think that midi2rdf&amp;rsquo;s ability to provide lossless round-trip conversion between a popular old binary music format and a readable standardized format has a lot of potential to help people doing music with computers.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>SPARQL in a Jupyter (a.k.a. IPython) notebook</title>
      <link>https://www.bobdc.com/blog/sparql-in-a-jupyter-aka-ipytho/</link>
      <pubDate>Sun, 31 Jul 2016 10:15:07 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-in-a-jupyter-aka-ipytho/</guid>
      
      
      <description><div>With just a bit of Python to frame it all.</div><div>&lt;p&gt;In a recent blog entry for my employer titled &lt;a href=&#34;http://www.ccri.com/2016/06/28/geomesa-analytics-jupyter-notebook/&#34;&gt;GeoMesa analytics in a Jupyter notebook&lt;/a&gt;, I wrote&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As described on its &lt;a href=&#34;http://jupyter.org/&#34;&gt;home page&lt;/a&gt;, “The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.” Once you install the open source Jupyter server on your machine, you can create notebooks, share them with others, and learn from notebooks created by others. (You can also learn from others’ notebooks without installing Jupyter locally if those notebooks are hosted on a shared server.)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;An animated GIF below that passage shows a sample mix of formatted text and executable Python code in a short Jupyter notebook, and it also demonstrates how code blocks can be tweaked, run in place, and build on previous code blocks. The blog entry goes on to describe how we at CCRi embedded Scala code in a Jupyter notebook to demonstrate the use of Apache Spark with the Hadoop-based &lt;a href=&#34;http://www.geomesa.org/&#34;&gt;GeoMesa&lt;/a&gt; spatio-temporal database to perform data analysis and visualization.&lt;/p&gt;
&lt;p&gt;Jupyter supports over 40 languages besides Scala and Python, but not SPARQL. I realized recently, though, that with a minimum of Python code (Python being the original language for these notebooks; &amp;ldquo;Jupyter&amp;rdquo; was originally called &amp;ldquo;IPython&amp;rdquo;) someone who hardly knows Python can enter and run SPARQL queries in a Jupyter notebook.&lt;/p&gt;
&lt;p&gt;I created a Jupyter notebook that you can download and try yourself called &lt;a href=&#34;https://github.com/bobdc/misc/blob/master/JupyterSPARQL/JupyterSPARQLFun.ipynb&#34;&gt;JupyterSPARQLFun&lt;/a&gt;. If you look at the raw version of the file you&amp;rsquo;ll see a lot of JSON, but if you follow that link you&amp;rsquo;ll see that github renders the notebook the same way that a Jupyter server does, so you can read through the notebook and see all the formatted explanations with the code and the results.&lt;/p&gt;
&lt;p&gt;If you did download the notebook and run it on a Jupyter server (and installed the rdflib and RDFClosure python libraries), you could edit the cells that have executable code, rerun them, and see the results, just like in the animated GIF mentioned above. In the case of this notebook, you&amp;rsquo;d be doing SPARQL manipulation of an RDF graph from your copy of the notebook. (I used the &lt;a href=&#34;https://www.continuum.io/downloads&#34;&gt;Anaconda&lt;/a&gt; Jupyter distribution. It was remarkably difficult to find out from their website how to start up Jupyter, but I did find out from the &lt;a href=&#34;https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html&#34;&gt;Jupyter Notebook Beginner Guide&lt;/a&gt; that you just enter &amp;ldquo;jupyter notebook&amp;rdquo; at the command line. When working with a notebook, you&amp;rsquo;ll also find &lt;a href=&#34;http://johnlaudun.org/20131228-ipython-notebook-keyboard-shortcuts/&#34;&gt;this list of keyboard shortcuts&lt;/a&gt; to be handy.)&lt;/p&gt;
&lt;p&gt;I won&amp;rsquo;t go into great detail here about what&amp;rsquo;s in the JupyterSPARQLFun notebook, because much of the point of these notebooks is that their ability to mix formatted text with executable code lets people take explanation of code to a new level. So, to find out how I got SPARQL and inferencing working in the notebook, I recommend that you just read the explanations and code that I put in it.&lt;/p&gt;
&lt;p&gt;I mentioned above how you can learn from others’ notebooks; some nice examples accompany the &lt;a href=&#34;https://www.youtube.com/watch?v=elojMnjn4kk&amp;amp;list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&#34;&gt;Data School Machine Learning videos&lt;/a&gt; on YouTube. These videos demonstrate various concepts by adding and running code within notebooks, adding explanatory text as well along the way. Because I could download the &lt;a href=&#34;https://github.com/justmarkham/scikit-learn-videos&#34;&gt;finished notebooks&lt;/a&gt; created in the videos, I could run all the example code myself, in place, with no need to copy it from one place and paste it to another. I could also tweak the code samples to try different variations, which made for some much more hands-on learning of the machine learning concepts being demonstrated.&lt;/p&gt;
&lt;p&gt;That experience really showed me the power of Jupyter notebooks, and it&amp;rsquo;s great to see that with just a little setup Python code, we can do SPARQL querying and RDF inferencing inside these notebooks as well.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/bobdc/misc/blob/master/JupyterSPARQL/JupyterSPARQLFun.ipynb&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/jupytersparql1.png&#34; width=&#34;640&#34; border=&#34;0&#34;   style=&#34;display: block; margin-left: auto; margin-right: auto; &#34;/&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Emoji SPARQL😝!</title>
      <link>https://www.bobdc.com/blog/emoji-sparql/</link>
      <pubDate>Sun, 12 Jun 2016 11:46:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/emoji-sparql/</guid>
      
      
      <description><div>If emojis have Unicode code points, then we can...</div><div>&lt;p&gt;I knew that emojis have Unicode code points, but it wasn&amp;rsquo;t until I saw &lt;a href=&#34;http://i.imgur.com/Tb26fCb.jpg&#34;&gt;this goofy picture&lt;/a&gt; in a chat room at work that I began to wonder about using emojis in RDF data and SPARQL queries. I have since learned that the relevant specs are fine with it, but as with the simple display of emojis on non-mobile devices, the tools you use to work with these characters (and the tools used to build those tools) aren&amp;rsquo;t always as cooperative as you&amp;rsquo;d hope.&lt;/p&gt;
&lt;p&gt;After hunting around a bit among these tools, I did have some fun with this. Black and white emojis, as shown in the &lt;strong&gt;Browser&lt;/strong&gt; column of the unicode.org &lt;a href=&#34;http://unicode.org/emoji/charts/emoji-list.html&#34;&gt;Emoji Data&lt;/a&gt; page, display with no problem in my Ubuntu terminal window and in web page forms, but I wanted the full-color emojis from that page&amp;rsquo;s &lt;strong&gt;Sample&lt;/strong&gt; column. The Emacs &lt;a href=&#34;https://github.com/iqbalansari/emacs-emojify&#34;&gt;Emojify mode&lt;/a&gt; did the trick, so what you see below are screen shots from there.&lt;/p&gt;
&lt;img id=&#34;idm45192433970336&#34; src=&#34;https://www.bobdc.com/img/main/sparqlemoji1.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;sample RDF with emoji&#34; width=&#34;200&#34;/&gt;
&lt;p&gt;I started by converting that same unicode.org web page (as opposed to the site&amp;rsquo;s much larger &lt;a href=&#34;http://unicode.org/emoji/charts/full-emoji-list.html&#34;&gt;Full Emoji Data&lt;/a&gt; page) to a Turtle file called emoji-list.ttl with a short perl script. (You can find both in github at &lt;a href=&#34;https://github.com/bobdc/misc/tree/master/emojirdf&#34;&gt;emojirdf&lt;/a&gt;.) On the right, you can see triples from that web page&amp;rsquo;s row about the french fries emoji. For the keywords assigned to each character, the Emoji Data web page has links, so it was tempting to use the link destinations as URI values for the &lt;code&gt;lse:annotation&lt;/code&gt; values instead of strings, but some of those link destinations have local names like &lt;a href=&#34;http://unicode.org/emoji/charts/emoji-annotations.html#+1&#34;&gt;+1&lt;/a&gt;, which won&amp;rsquo;t make for nice URIs in RDF triples.&lt;/p&gt;
&lt;p&gt;I thought about augmenting my emoji-list.ttl file to turn it into an emoji ontology. I first dutifully searched for &amp;ldquo;emoji rdf&amp;rdquo; on Google (which asked me &amp;ldquo;did you mean emoji pdf? emoji def?&amp;rdquo;) to avoid the reinvention of any wheels. The most promising search result was an &lt;a href=&#34;https://github.com/oarrabi/EmojiOntology&#34;&gt;Emoji Ontology&lt;/a&gt; that adds some interesting metadata to the emojis, but its &lt;a href=&#34;https://github.com/oarrabi/EmojiOntology/blob/master/deliverables/FinalEmoji.owl&#34;&gt;Final emoji ontology in OWL/XML format&lt;/a&gt; has little to do with OWL or even RDF, and I didn&amp;rsquo;t feel like writing the XSLT to convert its additional metadata to proper RDF.&lt;/p&gt;
&lt;p&gt;With no proper emoji ontology already available, I thought more about creating my own by adding triples that would arrange the emojis into a hierarchical ontology or taxonomy. This would let me say that the ant 🐜 and the honeybee 🐝 are both insects, and that the ox 🐂 and the many, many cats are mammals, and then I could query for animals and see them all or query for insects and see just first two. This would add little, though, because the existing annotation values already serve as a non-hierarchical tagging system that identifies insects, so I could just query for those &lt;code&gt;lse:annotation&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;Some of these annotation values led to some fun queries of the emoji-list.ttl file. I used Dave Beckett&amp;rsquo;s &lt;a href=&#34;http://librdf.org/rasqal/roqet.html&#34;&gt;Redland roqet&lt;/a&gt; as a query processor, telling it to give me CSV data that I redirected to a file. Here&amp;rsquo;s a query asking for the character and label of any emojis that have both &amp;ldquo;face&amp;rdquo; and &amp;ldquo;cold&amp;rdquo; in their annotation values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX lse:  &amp;lt;http://learningsparq.com/emoji/&amp;gt; 
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;


SELECT ?char ?label
WHERE 
{ 
  ?s lse:annotation &#39;face&#39;, &#39;cold&#39; ;
     rdfs:label ?label ;
     lse:char ?char .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It returned this result, showing that &amp;ldquo;cold&amp;rdquo; can refer to both low temperature and wintertime sniffles:&lt;/p&gt;
&lt;img id=&#34;idm45192433957120&#34; src=&#34;https://www.bobdc.com/img/main/sparqlemoji2.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;result of first SPARQL emoji query&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;This next query uses emojis in string data to ask which annotations have tagged both the alien head and one of the moon face emojis:&lt;/p&gt;
&lt;img id=&#34;idm45192433954448&#34; src=&#34;https://www.bobdc.com/img/main/sparqlemoji3.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SPARQL query&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;(Apparently, Emacs SPARQL mode thinks that the &amp;ldquo;not&amp;rdquo; in &amp;ldquo;annotation&amp;rdquo; is the SPARQL keyword, because it resets the substring&amp;rsquo;s font color.) Here is the query result; note that, as is typical with many query tools, the first row is the variable name, not a returned value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;annotation
face
nature
space
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Emoji Unicode code points (most fall between x1F300 and x1F6FF) lie within the x10000-xEFFFF range that SPARQL spec productions &lt;a href=&#34;https://www.w3.org/TR/sparql11-query/#rPN_CHARS_BASE%20&#34;&gt;164&lt;/a&gt; - 166 say is legal for use in variable names. The following query requests the satellite dish character&amp;rsquo;s annotation values and stores them in a variable whose three-character name is three emojis:&lt;/p&gt;
&lt;img id=&#34;idm45192433949744&#34; src=&#34;https://www.bobdc.com/img/main/sparqlemoji4.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SPARQL emoji query&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;Here is our result:&lt;/p&gt;
&lt;img id=&#34;idm45192433947168&#34; src=&#34;https://www.bobdc.com/img/main/sparqlemoji5.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SPARQL query result&#34; width=&#34;130&#34;/&gt;
&lt;p&gt;This is actually why I used roqet—the Java-based SPARQL engines that I first tried may have implemented the spec faithfully, but some layer of the Java tooling underneath them couldn&amp;rsquo;t handle the full extent of Unicode in every place where it should.&lt;/p&gt;
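&lt;p&gt;Checking where a given character falls relative to that astral range from the SPARQL grammar&amp;rsquo;s PN_CHARS_BASE production takes only a line or two of Python (letters like &amp;ldquo;a&amp;rdquo; are of course permitted in variable names by other parts of the production):&lt;/p&gt;

```python
def in_sparql_astral_range(ch):
    # PN_CHARS_BASE in the SPARQL 1.1 grammar includes [#x10000-#xEFFFF],
    # which covers the emoji blocks used in these queries.
    return 0x10000 <= ord(ch) <= 0xEFFFF

print(in_sparql_astral_range("\U0001F4E1"))  # satellite antenna emoji: True
print(in_sparql_astral_range("a"))           # BMP letter, outside this range: False
```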
&lt;p&gt;Emojis in RDF data are not limited to quoted strings. When I told roqet to run a query against this next Turtle file, which uses emoji characters as prefixes and as subject and predicate local names in its one triple, it had no problem:&lt;/p&gt;
&lt;img id=&#34;idm45192433943456&#34; src=&#34;https://www.bobdc.com/img/main/sparqlemoji6.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Turtle file with emoji properties&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;This final query went even further, and roqet had no problem with it: it defines a bowl of spaghetti emoji as a namespace prefix and then, using emojis for the variable names, asks for the subjects and objects of any triples that have the predicate from the one triple in the Turtle file above.&lt;/p&gt;
&lt;img id=&#34;idm45192433940592&#34; src=&#34;https://www.bobdc.com/img/main/sparqlemoji7.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Turtle file with emoji properties&#34; width=&#34;400&#34;/&gt;
&lt;p&gt;Of course, it&amp;rsquo;s difficult to read, and the fact that running the query and even just displaying it required me to dig around for the right combination of tools doesn&amp;rsquo;t speak well for the use of emojis in queries. Besides being a fun exercise, though, the experience and the result—that it all ultimately worked—provided a nice testament to the design of the Unicode, RDF, and SPARQL standards.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Trying out Blazegraph</title>
      <link>https://www.bobdc.com/blog/trying-out-blazegraph/</link>
      <pubDate>Tue, 17 May 2016 08:17:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/trying-out-blazegraph/</guid>
      
      
      <description><div>Especially inferencing.</div><div>&lt;p&gt;I&amp;rsquo;ve been hearing more about the &lt;a href=&#34;https://www.blazegraph.com/&#34;&gt;Blazegraph&lt;/a&gt; triplestore (well, &amp;ldquo;graph database with RDF support&amp;rdquo;), especially its &lt;a href=&#34;https://www.blazegraph.com/product/gpu-accelerated/&#34;&gt;support&lt;/a&gt; for running on &lt;a href=&#34;https://en.wikipedia.org/wiki/Graphics_processing_unit&#34;&gt;GPUs&lt;/a&gt;, and because they also advertise some degree of RDFS and OWL support, I wanted to see how quickly I could try that after downloading the community edition. It was pretty quick.&lt;/p&gt;
&lt;p&gt;Downloading from the &lt;a href=&#34;https://www.blazegraph.com/download/&#34;&gt;main download page&lt;/a&gt; with my Ubuntu machine got me an rpm file, but I found it simpler to download the jar file version that I could start as a server from the command line as described on the &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/NanoSparqlServer&#34;&gt;Nano SPARQL Server&lt;/a&gt; page. I found the jar file (and several other download options) on the &lt;a href=&#34;https://sourceforge.net/projects/bigdata/files/bigdata/2.1.0/&#34;&gt;sourceforge page&lt;/a&gt; for release 2.1.&lt;/p&gt;
&lt;p&gt;The jar file&amp;rsquo;s startup message tells you the URL for the web-based interface to the Nano SPARQL Server, shown here:&lt;/p&gt;
&lt;img id=&#34;idm45368802998704&#34; src=&#34;https://www.bobdc.com/img/main/blazegraph1.jpg&#34; width=&#34;640&#34;/&gt;
&lt;p&gt;At this point, uploading some RDF on the UPDATE tab and issuing SPARQL queries on the QUERY tab was easy. I was more interested in sending it SPARQL queries that could take advantage of RDFS and OWL inferencing, so after a &lt;a href=&#34;https://sourceforge.net/p/bigdata/mailman/bigdata-developers/?viewmonth=201605&#34;&gt;little help&lt;/a&gt; from Blazegraph Chief Scientist Bryan Thompson via their mailing list (with a quick answer on a Saturday) I learned how: I had to first create a namespace on the NAMESPACES tab with the &lt;strong&gt;Inference&lt;/strong&gt; checkbox checked. The same form also offers checkboxes for &lt;strong&gt;Isolatable indexes&lt;/strong&gt;, &lt;strong&gt;Full text index&lt;/strong&gt;, and &lt;strong&gt;Enable geospatial&lt;/strong&gt; when configuring a new namespace. I found this typical of how Blazegraph lets you configure it to take advantage of more powerful features while leaving the out-of-box configuration simple and easy to use.&lt;/p&gt;
&lt;p&gt;For finer-grained namespace configuration, after you select checkboxes and click the &lt;strong&gt;Create namespace&lt;/strong&gt; button, a dialog box lets you edit the configuration details, with each of these lines explained in the Blazegraph &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/Main_Page&#34;&gt;documentation&lt;/a&gt;:&lt;/p&gt;
&lt;img id=&#34;idm45368802992480&#34; src=&#34;https://www.bobdc.com/img/main/blazegraph2.jpg&#34; width=&#34;600&#34;/&gt;
&lt;p&gt;I wanted to check Blazegraph&amp;rsquo;s support for &lt;code&gt;owl:TransitiveProperty&lt;/code&gt;, because this is such a basic, useful OWL class, as well as its ability to do subclass inferencing. I created some data about chairs, desks, rooms, and buildings, specifying which chairs and desks were in which rooms and which rooms were in which buildings, and also made &lt;code&gt;dm:locatedIn&lt;/code&gt; a transitive property:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .
@prefix owl: &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
@prefix rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .


dm:Room rdfs:subClassOf owl:Thing .
dm:Building rdfs:subClassOf owl:Thing .
dm:Furniture rdfs:subClassOf owl:Thing .
dm:Chair rdfs:subClassOf dm:Furniture .
dm:Desk rdfs:subClassOf dm:Furniture .


dm:locatedIn a owl:TransitiveProperty. 


d:building100 rdf:type dm:Building .
d:building200 rdf:type dm:Building .
d:room101 rdf:type dm:Room ; dm:locatedIn d:building100 . 
d:room102 rdf:type dm:Room ; dm:locatedIn d:building100 . 
d:room201 rdf:type dm:Room ; dm:locatedIn d:building200 . 
d:room202 rdf:type dm:Room ; dm:locatedIn d:building200 . 


d:chair15 rdf:type dm:Chair ; dm:locatedIn d:room101 . 
d:chair23 rdf:type dm:Chair ; dm:locatedIn d:room101 . 
d:chair35 rdf:type dm:Chair ; dm:locatedIn d:room202 . 
d:desk22 rdf:type dm:Desk ; dm:locatedIn d:room101 . 
d:desk59 rdf:type dm:Desk ; dm:locatedIn d:room202 . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following query asks for furniture in building 100. No triples above will match either of the query&amp;rsquo;s two triple patterns, so a SPARQL engine that can&amp;rsquo;t do inferencing won&amp;rsquo;t return anything. I wanted the query engine to infer that if chair 15 is a Chair, and Chair is a subclass of Furniture, then chair 15 is Furniture; also, if that furniture is in room 101 and room 101 is in building 100, then that furniture is in building 100.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; 
PREFIX d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; 
SELECT ?furniture
WHERE 
{ 
  ?furniture a dm:Furniture .
  ?furniture dm:locatedIn d:building100 . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We need the first triple pattern because the data above includes triples saying that rooms 101 and 102 are located in building 100, so those would have bound to &lt;code&gt;?furniture&lt;/code&gt; in the second triple pattern if the first triple pattern wasn&amp;rsquo;t there. This is a nice example of why declaring resources as instances of specific classes, while not necessary in RDF, does a favor to anyone who will query that data—it makes it easier for them to specify more detail about exactly what data they want.&lt;/p&gt;
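&lt;p&gt;To make the two inference steps concrete, here is a toy forward-chaining sketch in Python. It is illustrative only, with short hardcoded names standing in for the URIs above; a real RDFS+ engine like Blazegraph's generalizes this to arbitrary class hierarchies and transitive properties:&lt;/p&gt;

```python
# Toy sketch of the two inferences described above: rdfs:subClassOf
# (Chair and Desk are Furniture) and owl:TransitiveProperty (locatedIn
# chains from furniture to room to building).
subclass_of = {"Chair": "Furniture", "Desk": "Furniture"}
instance_of = {
    "chair15": "Chair", "chair23": "Chair", "chair35": "Chair",
    "desk22": "Desk", "desk59": "Desk",
}
located_in = {
    "room101": "building100", "room102": "building100",
    "room201": "building200", "room202": "building200",
    "chair15": "room101", "chair23": "room101", "chair35": "room202",
    "desk22": "room101", "desk59": "room202",
}

def types_of(thing):
    """The instance's class plus every superclass (subclass inference)."""
    t = instance_of.get(thing)
    while t is not None:
        yield t
        t = subclass_of.get(t)

def locations_of(thing):
    """Direct location plus everything reachable (transitive inference)."""
    place = located_in.get(thing)
    while place is not None:
        yield place
        place = located_in.get(place)

# The analog of the SPARQL query: furniture located (perhaps indirectly)
# in building 100.
answer = sorted(
    x for x in located_in
    if "Furniture" in types_of(x) and "building100" in locations_of(x)
)
print(answer)   # ['chair15', 'chair23', 'desk22']
```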
&lt;p&gt;When using this query and data in a namespace (in the Blazegraph sense of the term) configured to do inferencing, Blazegraph executed the query against the original triples plus the inferred triples and listed the furniture in building 100:&lt;/p&gt;
&lt;img id=&#34;idm45368802983120&#34; src=&#34;https://www.bobdc.com/img/main/blazegraph3.jpg&#34;/&gt;
&lt;p&gt;Several years ago I &lt;a href=&#34;https://www.bobdc.com/blog/selling-rdf-technology-to-big&#34;&gt;backed off&lt;/a&gt; from discussions of the &amp;ldquo;semantic web&amp;rdquo; as a buzzphrase tying together technology around RDF-related standards because I felt that the phrase was not aging well and that the technology could be sold on its own without the buzzphrase, but the example above really does show semantics at work. Saying that &lt;code&gt;dm:locatedIn&lt;/code&gt; is a transitive property stores some semantics about that property, and these extra semantics let me get more out of the data set: they let me query for which furniture is in which building, even though the data has no explicit facts about furniture being in buildings. (Saying that Desk and Chair are subclasses of Furniture also stores semantics about all three terms, but that won&amp;rsquo;t be as interesting to a typical developer with object-oriented experience.)&lt;/p&gt;
&lt;p&gt;Blazegraph calls their subset of OWL &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/InferenceAndTruthMaintenance#Triples_Modes&#34;&gt;RDFS+&lt;/a&gt;, which was &lt;a href=&#34;https://sourceforge.net/p/bigdata/mailman/message/35069336/&#34;&gt;inspired by&lt;/a&gt; Jim Hendler and Dean Allemang&amp;rsquo;s RDFS+ superset of RDF that added in OWL&amp;rsquo;s most useful bits. (It&amp;rsquo;s similar but not identical to AllegroGraph&amp;rsquo;s &lt;a href=&#34;http://franz.com/agraph/support/documentation/current/reasoner-tutorial.html&#34;&gt;RDFS++&lt;/a&gt; profile, which has the same goal.) Blazegraph&amp;rsquo;s &lt;a href=&#34;https://www.blazegraph.com/product/product-description/&#34;&gt;Product description&lt;/a&gt; page describes which parts of OWL it supports, and their &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/InferenceAndTruthMaintenance#Configuring_Inference&#34;&gt;Inference And Truth Maintenance&lt;/a&gt; page describes more.&lt;/p&gt;
&lt;p&gt;A few other interesting things about Blazegraph as a triplestore and query engine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/REST_API&#34;&gt;REST&lt;/a&gt; interface offers access to a wide range of features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Queries can include &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/QueryHints&#34;&gt;Query Hints&lt;/a&gt; to optimize how the SPARQL engine executes them, which will be handy if you plan on scaling way up.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I saw no direct references to &lt;a href=&#34;https://www.bobdc.com/blog/visualizing-dbpedia-geographic&#34;&gt;GeoSPARQL&lt;/a&gt; in the Blazegraph documentation, but they recently &lt;a href=&#34;http://www.pressreleaserocket.net/blazegraph-2-1-0-graph-database-now-enables-geospatial-searching-and-pubchem-data-processing/443176/%20&#34;&gt;announced&lt;/a&gt; support for geospatial SPARQL queries. (I&amp;rsquo;ve been learning a lot about working with geospatial data at Hadoop scale with &lt;a href=&#34;http://www.ccri.com/case-studies/geomesa/&#34;&gt;GeoMesa&lt;/a&gt;.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
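&lt;p&gt;As a taste of that REST interface, a SPARQL query is just an HTTP POST to the server's endpoint. This Python sketch assumes the out-of-the-box defaults of the standalone jar (port 9999 and a namespace named kb); check your server's startup message for the actual URL:&lt;/p&gt;

```python
# Sketch of querying the Nano SPARQL Server over plain HTTP. The endpoint
# path and port below are assumptions based on the default standalone-jar
# setup, not something to rely on without checking your own installation.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"  # assumed default

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
body = urlencode({"query": query}).encode("ascii")
request = Request(ENDPOINT, data=body, headers={
    "Content-Type": "application/x-www-form-urlencoded",
    "Accept": "application/sparql-results+json",
})
print(body)   # the form-encoded request body

def run():
    # Only attempt the network call when a server is actually running.
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```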
&lt;p&gt;Blazegraph&amp;rsquo;s main selling points seem to be speed and scalability (for example, see its &lt;a href=&#34;https://wiki.blazegraph.com/wiki/index.php/ClusterGuide&#34;&gt;Scaleout Cluster&lt;/a&gt; mode), and I didn&amp;rsquo;t play with those at all, but I liked seeing that SPARQL querying with inferencing support can take advantage of such new hotness technology as GPUs. It will be interesting to see where Blazegraph takes it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Playing with a proximity beacon</title>
      <link>https://www.bobdc.com/blog/playing-with-a-proximity-beaco/</link>
      <pubDate>Sat, 23 Apr 2016 08:30:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/playing-with-a-proximity-beaco/</guid>
      
      
      <description><div>Nine-dollar devices send URLs to your phone over Bluetooth.</div><div>&lt;p&gt;I&amp;rsquo;ve been hearing about proximity beacons lately and thought it would be fun to try one of these inexpensive devices that broadcast a URL for a range of just a few meters via Bluetooth Low Energy (a.k.a. &lt;a href=&#34;https://en.wikipedia.org/wiki/Bluetooth_low_energy&#34;&gt;BLE&lt;/a&gt;, which I assume is pronounced &amp;ldquo;bleh&amp;rdquo;). Advocates often cite the use case of how a beacon device located near a work of art in a museum might broadcast a URL pointing to a web page about it—for example, one near Robert Rauschenberg&amp;rsquo;s &lt;em&gt;Bed&lt;/em&gt; in New York&amp;rsquo;s Museum of Modern Art could broadcast the URL &lt;a href=&#34;http://moma.org/collection/works/78712&#34;&gt;http://moma.org/collection/works/78712&lt;/a&gt;, their web site&amp;rsquo;s page with information about the work. When the appropriate app on your phone (or perhaps your phone&amp;rsquo;s operating system) saw this, it would alert you to the availability of this localized information.&lt;/p&gt;
&lt;img id=&#34;idm45746389029872&#34; src=&#34;http://snee.com/bobdc.blog/img/beaconfun1.jpg&#34; width=&#34;40%&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;beacon in phone charger&#34;/&gt;
&lt;p&gt;You can find these beacons for as little as &lt;a href=&#34;http://www.amazon.com/Radius-Networks-RadBeacon-Dot-Technology/dp/B00JJ4P864&#34;&gt;$14&lt;/a&gt;, and even cheaper on eBay, where colorful &lt;a href=&#34;http://www.ebay.com/sch/i.html?_nkw=ibeacon+eddystone+wristband+bracelet&#34;&gt;bracelet versions&lt;/a&gt; can cost less than $10. Most need batteries, typically the kind you put in a watch, so to avoid this I got a &lt;a href=&#34;http://www.amazon.com/RadBeacon-USB-Proximity-Eddystone-Technology/dp/B00R9NUQOG&#34;&gt;RadBeacon USB&lt;/a&gt; from Radius Networks that draws its power from any USB port where you plug it in. At the right you can see mine plugged into a conference swag phone recharger.&lt;/p&gt;
&lt;p&gt;I also chose this one because it supports Google&amp;rsquo;s &lt;a href=&#34;https://github.com/google/eddystone&#34;&gt;Eddystone&lt;/a&gt; open beacon format, Apple&amp;rsquo;s &lt;a href=&#34;https://developer.apple.com/ibeacon/&#34;&gt;iBeacon&lt;/a&gt; format, and Radius Network&amp;rsquo;s &lt;a href=&#34;http://altbeacon.org/&#34;&gt;AltBeacon&lt;/a&gt;. I haven&amp;rsquo;t dug into the pros and cons of these different formats yet; I just wanted something that was likely to work out of the box with both my Samsung S6 Android phone and my wife&amp;rsquo;s iPhone. The RadBeacon USB did fine.&lt;/p&gt;
&lt;p&gt;You configure it with a phone app built for that particular beacon product line. The Android &lt;a href=&#34;https://play.google.com/store/apps/details?id=com.radiusnetworks.radbeacon&amp;amp;hl=en&#34;&gt;RadBeacon app&lt;/a&gt; generally worked, although I often had to press &amp;ldquo;Apply&amp;rdquo; several times and restart Bluetooth before new settings would actually take hold. Its &lt;a href=&#34;https://radiusnetworks.zendesk.com/hc/en-us/articles/205022884-How-do-I-configure-Eddystone-Developer-Kit-beacons-%20&#34;&gt;documentation&lt;/a&gt; shows the kinds of properties it lets you set, such as the URL to broadcast and the Transmit Power (which affects the battery life and the distance that the URL is broadcast—in a museum, you want people receiving the URL of the painting in front of them, not the one twenty feet to the left of it).&lt;/p&gt;
&lt;p&gt;I had set mine to the URL of a &lt;a href=&#34;http://snee.com/beacontest/&#34;&gt;sample web page&lt;/a&gt; that I created for this purpose. While I was waiting for my RadBeacon to arrive in the mail, Dan Brickley &lt;a href=&#34;https://twitter.com/danbri/status/713455538785415170&#34;&gt;tweeted&lt;/a&gt; the mobiForge article &lt;a href=&#34;https://mobiforge.com/design-development/eddystone-beacon-technology-and-the-physical-web&#34;&gt;Eddystone beacon technology and the Physical Web&lt;/a&gt;, and I learned a lot from it about which components of my web page would be picked up by an app that received the broadcast URL.&lt;/p&gt;
&lt;p&gt;After I configured the beacon, the open source &lt;a href=&#34;https://play.google.com/store/apps/details?id=physical_web.org.physicalweb&amp;amp;hl=en&#34;&gt;physical web&lt;/a&gt; app found it and displayed the following on my Samsung S6:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/beaconfun2.png&#34;&gt;&lt;img id=&#34;idm45746389014208&#34; src=&#34;https://www.bobdc.com/img/main/beaconfun2.png&#34; border=&#34;0&#34; width=&#34;40%&#34; style=&#34;display:block;margin: 0 auto;&#34; alt=&#34;screenshot of physical web app&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tapping the blue title took the phone to the web page. This all worked the same, with the same app, on my wife&amp;rsquo;s iPhone.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t want to have to bring such an app to the foreground every time I want to check for nearby beacons, so I was glad to see that the app also added something to my phone&amp;rsquo;s notifications list:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/beaconfun3.png&#34;&gt;&lt;img id=&#34;idm45746389011040&#34; src=&#34;https://www.bobdc.com/img/main/beaconfun3.png&#34; border=&#34;0&#34; width=&#34;40%&#34; style=&#34;display:block;margin: 0 auto;&#34; alt=&#34;screenshot of Android notifications&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Touching the notification sent the phone to the referenced web page.&lt;/p&gt;
&lt;p&gt;Both notifications above show what the app pulled from my &lt;a href=&#34;http://snee.com/beacontest/&#34;&gt;sample web page&lt;/a&gt;: the content of the &lt;code&gt;head&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;title&lt;/code&gt; element and the value of the &lt;code&gt;content&lt;/code&gt; attribute from the &lt;code&gt;meta&lt;/code&gt; element that had a &lt;code&gt;name&lt;/code&gt; attribute value of &amp;ldquo;description&amp;rdquo;. They also displayed the hastily-drawn favicon image I created for the web page.&lt;/p&gt;
&lt;p&gt;A beacon won&amp;rsquo;t broadcast just any URI that you want, because the allowable length is somewhat limited. (This could vary by beacon product.) The article mentioned above describes the role of URL shorteners in the architecture. Still, the idea of such inexpensive hardware using URIs to identify things brings a nice semantic web touch to an Internet of Things architecture.&lt;/p&gt;
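&lt;p&gt;To see why the length limit bites so quickly, here is a Python sketch of the Eddystone-URL compression scheme as I understand it; the specific byte codes are my reading of the spec and worth verifying against the spec itself. One byte encodes the URL scheme, common substrings like &amp;ldquo;.com/&amp;rdquo; collapse to single bytes, and after that only about 17 bytes remain for everything else:&lt;/p&gt;

```python
# Sketch of Eddystone-URL compression. The scheme-prefix and expansion
# byte values are my reading of the Eddystone spec (an assumption to
# verify there), and the 17-byte budget is what the advertising frame
# leaves for the encoded URL body.
SCHEMES = {"http://www.": 0, "https://www.": 1, "http://": 2, "https://": 3}
EXPANSIONS = {".com/": 0, ".org/": 1, ".edu/": 2, ".net/": 3,
              ".info/": 4, ".biz/": 5, ".gov/": 6}
MAX_ENCODED_BYTES = 17

def encode_url(url):
    # Longest matching scheme prefix becomes a single byte.
    for scheme in sorted(SCHEMES, key=len, reverse=True):
        if url.startswith(scheme):
            rest = url[len(scheme):]
            out = bytearray([SCHEMES[scheme]])
            break
    else:
        raise ValueError("unsupported scheme")
    # Known substrings compress to one byte; everything else is literal.
    while rest:
        for text, code in EXPANSIONS.items():
            if rest.startswith(text):
                out.append(code)
                rest = rest[len(text):]
                break
        else:
            out.append(ord(rest[0]))
            rest = rest[1:]
    if len(out) - 1 > MAX_ENCODED_BYTES:
        raise ValueError("URL too long to broadcast; use a shortener")
    return bytes(out)

print(len(encode_url("http://snee.com/beacontest/")))   # 17 bytes total
```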
&lt;p&gt;One experiment I tried was the use of &lt;a href=&#34;https://sourceforge.net/projects/tagtool/&#34;&gt;Audio Tag Tool&lt;/a&gt; to add every metadata field available to an MP3. I then configured my beacon to broadcast that MP3&amp;rsquo;s URL, but none of the metadata showed up on my phone&amp;rsquo;s display. I thought that the idea of location-specific audio might be interesting. (You could also implement location-specific audio with much older technology—for example, &lt;a href=&#34;http://www.victor-victrola.com/&#34;&gt;Victrolas&lt;/a&gt;—but the ability to control the audio from a central server could lead to interesting possibilities.)&lt;/p&gt;
&lt;p&gt;The museum use case for beacons is nice and cultured, but I wonder about the attraction of a technology whose real main use case for now is to pump ads at people. (When was the last time you scanned a &lt;a href=&#34;http://blog.hubspot.com/marketing/qr-codes-dead&#34;&gt;QR code&lt;/a&gt; with your phone?) I say &amp;ldquo;for now&amp;rdquo; because I remain hopeful that creative people will come up with more interesting things to do with these, especially if they dig into the &lt;a href=&#34;https://developers.google.com/beacons/proximity/guides&#34;&gt;Eddystone&lt;/a&gt;, &lt;a href=&#34;https://developer.apple.com/ibeacon/&#34;&gt;iBeacon&lt;/a&gt;, and &lt;a href=&#34;https://altbeacon.github.io/android-beacon-library/&#34;&gt;AltBeacon&lt;/a&gt; APIs. For example, you could add features to your own apps to check for or even act as beacons, communicating with other beacons and apps around your phone whether these devices had Internet connections or not. The Opera browser&amp;rsquo;s use of schema.org metadata stored in web pages referenced by beacons is also promising, and I know that Dan is putting more thought into what role schema.org can play.&lt;/p&gt;
&lt;p&gt;The idea of the broadcast URL showing up as a notification on your phone that you can follow or ignore is much simpler than starting up a special app on your phone and then pointing the phone at one corner of a poster, which the QR enthusiasts thought we&amp;rsquo;d be happier to do. The short article &lt;a href=&#34;http://unacast.com/5-common-misconceptions-about-beacons-and-proximity-marketing/&#34;&gt;5 Common Misconceptions About Beacons and Proximity Marketing&lt;/a&gt; gives a good perspective on where beacons can fit into the communications ecosystem in general and the world of marketing in particular. The article is from one of several companies building a business model around advertising via beacons, but like I said above, I hope that the APIs inspire other uses for them as well.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Adding custom menus to Google docs</title>
      <link>https://www.bobdc.com/blog/adding-custom-menus-to-google/</link>
      <pubDate>Sun, 20 Mar 2016 12:00:15 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/adding-custom-menus-to-google/</guid>
      
      
      <description><div>Using Google Apps Script, but unfortunately not in Google apps.</div><div>&lt;img id=&#34;idm45531756847456&#34; src=&#34;https://www.bobdc.com/img/main/googleappsmenu.png&#34; alt=&#34;Google apps menu&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;300&#34;/&gt;
&lt;p&gt;I&amp;rsquo;ve been using Google Docs more because at work it&amp;rsquo;s great for collaboration, and also, for shopping lists and notes to myself, I can easily edit the same documents from my phone, tablet, and laptop. I found out that it&amp;rsquo;s pretty easy to add menus that perform custom functions, so I created a few menu choices&amp;hellip; and then found out that they weren&amp;rsquo;t available on my phone or tablet. Still, it&amp;rsquo;s good to know how easy it is to automate a few things.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://developers.google.com/apps-script/guides/docs#the_basics&#34;&gt;Extending Google Docs&lt;/a&gt; is a good introduction to getting started. Picking &lt;strong&gt;Script Editor&lt;/strong&gt; from the &lt;strong&gt;Tools&lt;/strong&gt; menu puts you into this editor with an empty function waiting for you to fill it in or, more likely, to replace it with code you copied from web pages such as &amp;ldquo;Extending Google Docs.&amp;rdquo; &lt;a href=&#34;https://developers.google.com/apps-script/&#34;&gt;Google Apps Script&lt;/a&gt; is basically Javascript, and I had an easy time searching for any code that I wanted to plug in.&lt;/p&gt;
&lt;p&gt;For example, when writing a note about something, I sometimes want to add a date-time stamp to show exactly when I made a particular note, because if it&amp;rsquo;s ongoing research it&amp;rsquo;s easier to see my progress leading up to where I left off. (I&amp;rsquo;ve had my &lt;code&gt;.emacs&lt;/code&gt; file set up to let me add this with Alt+D for years.) To add a &lt;strong&gt;timestamp&lt;/strong&gt; menu choice to Google Docs, I replaced the blank function in the script editor with menu code based on what you see in &lt;a href=&#34;https://developers.google.com/apps-script/guides/menus#custom_menus_in_google_docs_sheets_or_forms&#34;&gt;Custom Menus in Google Apps&lt;/a&gt;, and then I added a line to insert the current date and time at the cursor using the format &amp;ldquo;Sun Mar 13 2016 10:40:33 GMT-0400 (EDT).&amp;rdquo; I&amp;rsquo;d prefer the terser ISO 8601 format, and I found a function to convert it, but the function wants to know what time zone you&amp;rsquo;re in, and the simpler &lt;code&gt;Date()&lt;/code&gt; function that creates the more verbose form already knows.&lt;/p&gt;
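&lt;p&gt;The timezone wrinkle is not unique to Apps Script. For comparison (this is Python rather than Apps Script, so it is an illustration and not a drop-in fix), here is how to get the terser ISO 8601 stamp in Python, where astimezone() with no argument discovers the local zone on its own:&lt;/p&gt;

```python
from datetime import datetime, timezone

# Local time with the zone discovered automatically, no hard-coded "EDT":
stamp = datetime.now().astimezone().isoformat(timespec="seconds")
print(stamp)       # e.g. 2016-03-13T10:40:33-04:00

# Or sidestep time zones entirely by stamping in UTC:
utc_stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
print(utc_stamp)
```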
&lt;p&gt;When I read something on my tablet and I&amp;rsquo;m taking notes, I often paste blocks of text into a Google docs document. To remember which parts are large verbatim blocks of someone else&amp;rsquo;s writing, I enclose them in &lt;code&gt;&amp;lt;blockquote&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;/code&gt; tags. My second new menu item inserts this string and then moves the cursor between those tags so that if I have something in my copy-paste buffer I can just paste it right there. The &amp;ldquo;utilities&amp;rdquo; menu that I added also demonstrates how to add a menu separator and a submenu that pops up a message box.&lt;/p&gt;
&lt;p&gt;The code is all shown below. If I want to share these features across multiple documents, to be honest, the simplest way I&amp;rsquo;ve found is to paste this code into the script editor for each of the other documents. This is not, if I may string together some buzzwords, a scalable code maintenance solution.&lt;/p&gt;
&lt;p&gt;These are known as &amp;ldquo;bound&amp;rdquo; scripts because they&amp;rsquo;re bound to specific documents. You can also create &lt;a href=&#34;https://developers.google.com/apps-script/guides/standalone&#34;&gt;standalone&lt;/a&gt; scripts, which I hoped would be a way to store shared code that could be referenced from multiple documents, but you actually run them independently of the documents to perform tasks that are not tied to any specific document such as, in the example on that page, searching Google Drive for documents meeting certain conditions.&lt;/p&gt;
&lt;p&gt;If you have a script that adds choices to a document and you want to use it from multiple documents, you must publish it. As the &lt;a href=&#34;https://developers.google.com/apps-script/add-ons/publish&#34;&gt;Publishing an Add-on&lt;/a&gt; web page says,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Publishing add-ons allows them to be used by other users in their own documents. Public add-ons require a review before publication, although if you are a member of a private Google Apps domain, you can publish just for users within your domain without a review. You can also publish an add-on for domain-wide installation, which lets a domain admins find &lt;em&gt;[sic]&lt;/em&gt;, authorize and install your add-on on behalf of all users within their domain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There&amp;rsquo;s even an add-on &lt;a href=&#34;https://docs.google.com/document/d/1FrqgxH_kh44rLj0374uYBlG1n1HxqEfFVL3XI_MhzJA/edit?addon_store%20&#34;&gt;store&lt;/a&gt; with offerings available from some recognizable brand names.&lt;/p&gt;
&lt;p&gt;I never did find a way to create a single script that I could share among my own documents without going through some approval process. In an even greater disappointment, I found that the menu I created was &lt;a href=&#34;https://www.quora.com/How-do-I-run-a-Google-Apps-Script-spreadsheet-on-a-mobile-phone&#34;&gt;not available when editing that same document on my phone or tablet&lt;/a&gt;, which was much of the point of creating them. In other words, this part of Google Apps script doesn&amp;rsquo;t work with Google apps.&lt;/p&gt;
&lt;p&gt;Still, skimming the &lt;a href=&#34;https://developers.google.com/apps-script/reference/calendar/&#34;&gt;Apps Script Reference&lt;/a&gt; for available methods to call when customizing for Google Docs, spreadsheets, calendars, and more shows that there&amp;rsquo;s a lot to play with, and I didn&amp;rsquo;t even try a standalone script. If this ever works on phones and tablets, I will definitely be digging back into the reference material again.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function onOpen() {
  var ui = DocumentApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu(&#39;utilities&#39;)
      .addItem(&#39;timestamp&#39;, &#39;insertTimestamp&#39;)
      .addItem(&#39;blockquote&#39;, &#39;insertBqTags&#39;)
      .addSeparator()
      .addSubMenu(ui.createMenu(&#39;Sub-menu&#39;)
          .addItem(&#39;Second item&#39;, &#39;menuItem2&#39;))
      .addToUi();
}


function insertTimestamp() {
  DocumentApp.getUi() ; 
  var doc = DocumentApp.getActiveDocument(); 
  var body = doc.getBody();
  // The following gives me ISO format, which I prefer, but unlike Date(), 
  // needs to be told the time zone 
  // var timestamp = Utilities.formatDate(new Date(), &amp;quot;EDT&amp;quot;, &amp;quot;yyyy-MM-dd&#39;T&#39;HH:mm:ss&amp;quot;); 
  var timestamp = new Date();
  // https://developers.google.com/apps-script/reference/document/document#getcursor
  // has error-checking code for the following that would make it more robust.
  var cursor = DocumentApp.getActiveDocument().getCursor();
  var element = cursor.insertText(timestamp);
}


function insertBqTags() {
  DocumentApp.getUi() ;
  var doc = DocumentApp.getActiveDocument(); 
  var body = doc.getBody();
  var cursor = DocumentApp.getActiveDocument().getCursor();
  var insertedText = cursor.insertText(&amp;quot;&amp;lt;blockquote&amp;gt;&amp;lt;/blockquote&amp;gt;&amp;quot;);
  var position = doc.newPosition(insertedText, 12);
  doc.setCursor(position); 
}


function menuItem2() {
  DocumentApp.getUi() // Or DocumentApp or FormApp.
     .alert(&#39;You clicked the second menu item!&#39;);
}
&lt;/code&gt;&lt;/pre&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>&#34;Readings in Database Systems&#34;: wisdom from Michael Stonebraker</title>
      <link>https://www.bobdc.com/blog/readings-in-database-systems-w/</link>
      <pubDate>Sat, 27 Feb 2016 11:03:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/readings-in-database-systems-w/</guid>
      
      
      <description><div>and two other guys--updated and free online.</div><div>&lt;img id=&#34;idm45348838695216&#34; src=&#34;https://www.bobdc.com/img/main/stonebraker.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Michael Stonebraker&#34;/&gt;
&lt;p&gt;As I &lt;a href=&#34;https://twitter.com/bobdc/status/621727213528948736&#34;&gt;tweeted&lt;/a&gt; last July, I always learn so much about both the past and future of database computing from recent Turing Award winner &lt;a href=&#34;https://en.wikipedia.org/wiki/Michael_Stonebraker&#34;&gt;Michael Stonebraker&lt;/a&gt;. I recently learned that the latest edition of &lt;a href=&#34;http://www.redbook.io/all-chapters.html&#34;&gt;Readings in Database Systems&lt;/a&gt;, also known as the &amp;ldquo;Red Book,&amp;rdquo; is available for free online under a Creative Commons license—or at least the introductions to the readings are. With most of these being by Stonebraker, and quite up-to-date, I consider these 43 pages required reading for anyone interested in database technology.&lt;/p&gt;
&lt;p&gt;The serious student should find and read the actual papers, but I learned plenty from the introductions by Stonebraker and his co-editors Peter Bailis and Joe Hellerstein. (Ben Lorica&amp;rsquo;s recent &lt;a href=&#34;https://www.oreilly.com/ideas/metadata-services-can-lead-to-performance-and-organizational-improvements&#34;&gt;podcast interview&lt;/a&gt; with Hellerstein is also worth a listen.) For example, after reading the introduction to chapter 4, I now have a much better understanding of the advantages of column stores over more traditional row stores, and chapter 12 helped me to understand the history of Data Warehouses and the role of &lt;a href=&#34;https://en.wikipedia.org/wiki/Extract,_transform,_load&#34;&gt;ETL&lt;/a&gt; much better.&lt;/p&gt;
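&lt;p&gt;The column-store advantage is easy to see even in a toy Python sketch: when records are stored column by column, an aggregate over one attribute scans one contiguous array instead of touching every record. (This leaves out compression, the other big column-store win the book covers.)&lt;/p&gt;

```python
# The same three records in the two layouts the Red Book contrasts.
rows = [                       # row store: each record stored together
    {"id": 1, "price": 9.99, "qty": 3},
    {"id": 2, "price": 4.50, "qty": 1},
    {"id": 3, "price": 12.00, "qty": 7},
]
columns = {                    # column store: each attribute stored together
    "id":    [1, 2, 3],
    "price": [9.99, 4.50, 12.00],
    "qty":   [3, 1, 7],
}

# Row layout: every record must be visited to read one field.
total_row = sum(r["qty"] for r in rows)
# Column layout: one contiguous array is scanned, nothing else is touched.
total_col = sum(columns["qty"])
print(total_row, total_col)    # 11 11
```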
&lt;p&gt;This is the fifth edition of the book, published in 2015, so it is very current, as you can see from the way it treats MapReduce as past history. They published the &lt;a href=&#34;http://redbook.cs.berkeley.edu/bib1.html&#34;&gt;first edition&lt;/a&gt; in 1988, so this has clearly been a long-term project, and it&amp;rsquo;s interesting to see which twentieth century papers appear in the new fifth edition—for example, Sergey Brin and Larry Page&amp;rsquo;s 1998 classic &lt;a href=&#34;http://infolab.stanford.edu/~backrub/google.html&#34;&gt;The Anatomy of a Large-scale Hypertextual Web Search Engine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Several of Stonebraker&amp;rsquo;s more opinionated assertions were enough fun to read that they tempted me to start a fake Twitter account, modeled on the hilarious &lt;a href=&#34;https://twitter.com/boredelonmusk&#34;&gt;@boredElonMusk&lt;/a&gt;, that I would call @crankyMikeStonebraker. It would feature real quotes from the Red Book such as these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&amp;ldquo;SQL will be the COBOL of 2020, a language we are stuck with that everybody will complain about.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;ldquo;[JSON] is a disaster in the making as a general hierarchical data format.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;“I consider ODBC among the worst interfaces on the planet.”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;ldquo;The rest of the world is seeing what Google figured out earlier; Map-Reduce is not an architecture with any broad scale applicability.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;ldquo;The MapReduce crowd has turned into a SQL crowd and Map-Reduce, as an interface, is history.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;ldquo;Just because Google thinks something is a good idea does not mean you should adopt it.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;ldquo;We begin with a sad truth. Most data science platforms are file-based and have nothing to do with DBMSs.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;ldquo;the new buzzword is master data management (MDM)&amp;hellip; MDM is the opposite of business agility.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While the very title of &amp;ldquo;Readings in Database Systems&amp;rdquo; will make some people&amp;rsquo;s eyes glaze over, bits like these make it much more fun to read than many would expect, especially if you care at all about the role that database systems play in modern applications.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Photo of Michael Stonebraker by &lt;a href=&#34;https://www.flickr.com/photos/dcoetzee/&#34;&gt;D Coetzee&lt;/a&gt; via &lt;a href=&#34;https://www.flickr.com/photos/dcoetzee/4673939138/in/photolist-882agy-87XXa8-ocTrpE-bo2GTX&#34;&gt;flickr&lt;/a&gt; (CC0)&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/nosql">NoSQL</category>
      
      <category domain="https://www.bobdc.com//categories/data-science">data science</category>
      
      <category domain="https://www.bobdc.com//categories/technology-future">technology, future</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>The past and present of hypertext</title>
      <link>https://www.bobdc.com/blog/the-past-and-present-of-hypert/</link>
      <pubDate>Sun, 17 Jan 2016 10:58:51 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-past-and-present-of-hypert/</guid>
      
      
      <description><div>You know, links in the middle of sentences.</div><div>&lt;p&gt;I&amp;rsquo;ve been thinking lately about the visionary optimism of the days when people dreamed of the promise of large-scale hypertext systems. I&amp;rsquo;m pretty sure they didn&amp;rsquo;t mean linkless content down the middle of a screen with columns of ads to the left and right of it, which is much of what we read off of screens these days. I certainly don&amp;rsquo;t want to start one of those rants of &amp;ldquo;the World Wide Web is deficient because it&amp;rsquo;s missing features X and Y, which by golly we had in the HyperThingie™ system that I helped design back in the 80s, and the W3C should have paid more attention to us&amp;rdquo; because I&amp;rsquo;ve seen too many of those. The web got so popular because Tim Berners-Lee found such an excellent balance between which features to incorporate and which (for example, central link management) to skip.&lt;/p&gt;
&lt;p&gt;The idea of &lt;a href=&#34;https://en.wikipedia.org/wiki/Inline_linking&#34;&gt;inline links&lt;/a&gt;, in which words and phrases in the middle of sentences link to other documents related to those words and phrases, was considered an exciting thing back when we got most of our information from printed paper. A hypertext system had links between the documents stored in that system, and the especially exciting thing about a &amp;ldquo;world wide&amp;rdquo; hypertext system was that any document could link to any other document in the world.&lt;/p&gt;
&lt;p&gt;But who does that in 2016? The reason I&amp;rsquo;ve been thinking more about the past and present of hypertext (a word that, sixteen years into the twenty-first century, is looking a bit quaint) is that since adding a few links to something I was writing at work recently, I&amp;rsquo;ve been more mindful of which major web sites include how many inline links and how many of those links go to other sites. For example, while reading the article &lt;a href=&#34;http://blogs.scientificamerican.com/cross-check/bayes-s-theorem-what-s-the-big-deal/&#34;&gt;Bayes&amp;rsquo;s Theorem: What&amp;rsquo;s the Big Deal?&lt;/a&gt; on &lt;a href=&#34;http://www.scientificamerican.com/&#34;&gt;Scientific American&amp;rsquo;s site&lt;/a&gt; recently, I found myself thinking &amp;ldquo;good for you guys, with all those useful links to other web sites right in the body of your article!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;To get some idea of relative proportions of internal links, external links, and linkless text on today&amp;rsquo;s successful websites, I went to a &lt;a href=&#34;http://www.ebizmba.com/articles/blogs&#34;&gt;top 15 most popular blogs list&lt;/a&gt; and did some random checking of articles on these sites. (An exercise for the reader to make up for my haphazard skimming: write some scripts to scrape some editorial content from each site, count the internal and external links, and produce a bar chart.) Because these are professionally managed sites, I imagine that management at some of them encourages links to other articles on the same site and discourages links to other sites as a matter of policy, because they want to keep their readers looking at their advertisers&amp;rsquo; ads.&lt;/p&gt;
&lt;p&gt;There is a gray area between internal and external links: linking to other sites that are part of the same organization, such as the many links in a &lt;a href=&#34;http://www.businessinsider.com/who-won-900-million-powerball-2016-1&#34;&gt;Business Insider article&lt;/a&gt; to &lt;a href=&#34;http://www.techinsider.io/&#34;&gt;Tech Insider&lt;/a&gt; articles, or the many links between members of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Gawker_Media#List_of_Gawker_Media_weblogs&#34;&gt;Gawker Media&lt;/a&gt; stable, which is heavily represented in the top 15.&lt;/p&gt;
&lt;p&gt;Of those top 15:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.huffingtonpost.com/&#34;&gt;Huffington Post&lt;/a&gt;: a mix of internal and external links, but their number of external links fits with their business model of being a hub of other sites&amp;rsquo; content.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All about the internal links: &lt;a href=&#34;http://www.tmz.com/&#34;&gt;TMZ&lt;/a&gt;, &lt;a href=&#34;http://mashable.com/&#34;&gt;Mashable&lt;/a&gt;, &lt;a href=&#34;http://gawker.com/&#34;&gt;Gawker&lt;/a&gt;, &lt;a href=&#34;http://www.thedailybeast.com/&#34;&gt;The Daily Beast&lt;/a&gt;, &lt;a href=&#34;http://www.engadget.com/&#34;&gt;Engadget&lt;/a&gt;, &lt;a href=&#34;http://jezebel.com/&#34;&gt;Jezebel&lt;/a&gt; (where most external links are to their Gawker Media sibling &lt;a href=&#34;http://gawker.com/&#34;&gt;Gawker&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://deadspin.com/&#34;&gt;Deadspin&lt;/a&gt;: a reasonable percentage of external links.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Gawker Media&amp;rsquo;s video game site &lt;a href=&#34;http://kotaku.com/&#34;&gt;Kotaku&lt;/a&gt;: long stretches of text with no links, and others with both internal and external links.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://techcrunch.com/&#34;&gt;TechCrunch&lt;/a&gt;: mostly internal links, with several to Gizmodo, even though TechCrunch is an AOL site and Gizmodo a Gawker Media site.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Gawker Media&amp;rsquo;s &lt;a href=&#34;http://www.lifehacker.com&#34;&gt;lifehacker&lt;/a&gt;, which is probably the site I visit most of all those listed here: external links if an article describes the external site&amp;rsquo;s article, company, or product, but otherwise, internal links.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://perezhilton.com/&#34;&gt;Perez Hilton&lt;/a&gt;: mostly internal links; external links tend to be redirected via goo.gl, I suppose so that Mr. Hilton&amp;rsquo;s people can track which external links get clicked.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Gawker Media&amp;rsquo;s &lt;a href=&#34;http://gizmodo.com/&#34;&gt;Gizmodo&lt;/a&gt;: plenty of external links, even to non-Gawker sites, for a gadget site that I assume is mostly interested in helping advertisers sell gadgets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.cheezburger.com/&#34;&gt;Cheezburger&lt;/a&gt;: textual content not much of an issue here.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m guessing that there is no policy across all of Gawker Media about the use of links, but that each of their major properties has some sort of policy in place. (For an interesting, explicit enumeration of one carefully managed site&amp;rsquo;s linking policy, see the guidelines at &lt;a href=&#34;http://www.ibm.com/developerworks/library/styleguidelines/#links&#34;&gt;IBM Developer Works&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;One particularly link-rich bit of content that I read regularly is &lt;a href=&#34;http://tinyletter.com/data-is-plural/letters/data-is-plural-2016-01-06-edition&#34;&gt;Data is Plural&lt;/a&gt;, which ironically is delivered via email—a technology that had a firm foothold in the Internet before Berners-Lee came up with the Web, and which most young people today only use to communicate with us old people.&lt;/p&gt;
&lt;p&gt;Who even thinks about hypertext as hypertext anymore? A quick look at the former &lt;a href=&#34;https://en.wikipedia.org/wiki/Usenet&#34;&gt;Usenet newsgroup&lt;/a&gt; (and now &lt;a href=&#34;https://groups.google.com/forum/#!overview&#34;&gt;Google Group&lt;/a&gt;) &lt;a href=&#34;https://groups.google.com/forum/#!forum/alt.hypertext&#34;&gt;alt.hypertext&lt;/a&gt; shows an average of about one new message or comment per month for the last few years, including spam. (Compare January of 1998, when the newsgroup had 39 topics with one or more postings in that one month.) The most recent topic shown is titled &amp;ldquo;NCSA Mosaic for X 0.10 available&amp;rdquo; from Marc Andreessen, posted—I thought—last month, making me think &amp;ldquo;isn&amp;rsquo;t he a bit busy for Mosaic these days?&amp;rdquo; It turned out that last month someone added a comment to his original 1993 post. A relatively recent new topic is &lt;a href=&#34;https://medium.com/@ftrain&#34;&gt;Paul Ford&amp;rsquo;s&lt;/a&gt; &lt;a href=&#34;https://groups.google.com/forum/#!topic/alt.hypertext/H5c9yfm-t3A&#34;&gt;January 2014 query&lt;/a&gt; &amp;ldquo;Do documents have a chance? Or is the future more and smarter optimized applications?&amp;rdquo; Actually, that makes a solid answer to my question that began this paragraph: Paul Ford, and I&amp;rsquo;m really looking forward to his &lt;a href=&#34;http://www.amazon.com/Secret-Lives-Web-Pages/dp/0374261113/&#34;&gt;upcoming book&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/afternoonAStory.jpg&#34;&gt;&lt;img id=&#34;idm140153782618704&#34; src=&#34;https://www.bobdc.com/img/main/afternoonAStory.jpg&#34; border=&#34;0&#34; width=&#34;320&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;&#39;Afternoon, A Story&#39; package&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The hypertext &amp;ldquo;novel&amp;rdquo; I bought in 1994 for $25&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2016">2016</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>My new job</title>
      <link>https://www.bobdc.com/blog/my-new-job-1/</link>
      <pubDate>Sun, 20 Dec 2015 09:27:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-new-job-1/</guid>
      
      
      <description><div>Lots of cutting edge technologies, 18 minutes from my home.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.ccri.com&#34;&gt;&lt;img id=&#34;idm140447919716000&#34; src=&#34;https://www.bobdc.com/img/main/ccrilogo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;CCRi logo&#34; width=&#34;180&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently began a new full-time position as a technical writer at Commonwealth Computer Research, Inc., more commonly known as &lt;a href=&#34;http://www.ccri.com/&#34;&gt;CCRi&lt;/a&gt;. CCRi was doing large-scale data science long before the term &amp;ldquo;data science&amp;rdquo; became so popular; &lt;a href=&#34;http://www.ccri.com/about/#panel-29-2-0-3&#34;&gt;one company founder&lt;/a&gt; also directs the University of Virginia&amp;rsquo;s &lt;a href=&#34;https://dsi.virginia.edu/&#34;&gt;Data Science Institute&lt;/a&gt;. They also do a lot of work with distributed machine learning and other cutting edge technologies, especially in the area of geospatial analytics. The chance to work with so many different interesting new technologies and smart people—engineering and math PhD&amp;rsquo;s tend to be the norm instead of the exception—right here in Charlottesville, after telecommuting for over eight years, was just too good to pass up.&lt;/p&gt;
&lt;p&gt;Having recently grown to over 80 employees, CCRi has gotten large enough that it&amp;rsquo;s become difficult for everyone there to know about all the technology and projects going on in other parts of the company. Part of my role will be to help with that, documenting these things so that it&amp;rsquo;s easier for people to find connections between the different existing and new efforts underway. I&amp;rsquo;ll also be helping them with marketing and business development.&lt;/p&gt;
&lt;p&gt;RDF and SPARQL do play a role in some of the projects there, mostly using the &lt;a href=&#34;https://wiki.apache.org/incubator/RyaProposal&#34;&gt;Rya&lt;/a&gt; triplestore because of its use of Apache &lt;a href=&#34;https://accumulo.apache.org/&#34;&gt;Accumulo&lt;/a&gt; for storage. Accumulo is a key-value pair NoSQL database built on Hadoop whose design is based on Google&amp;rsquo;s BigTable database, and it plays an important part in several CCRi projects.&lt;/p&gt;
&lt;p&gt;One of the biggest projects at CCRi is GeoMesa, which is described by its &lt;a href=&#34;http://www.ccri.com/case-studies/geomesa/&#34;&gt;product page&lt;/a&gt; as &amp;ldquo;an open-source solution maintained and supported by CCRi for storing, indexing, querying, transforming, and visualizing spatio-temporal data at scale in Accumulo.&amp;rdquo; For a start, it adds to Accumulo what &lt;a href=&#34;http://postgis.net/&#34;&gt;PostGIS&lt;/a&gt; adds to &lt;a href=&#34;http://www.postgresql.org/&#34;&gt;PostgreSQL&lt;/a&gt;: datatypes, functions, and more features that make it easy to store and query geospatial data. Going beyond that, GeoMesa lets you store spatio-temporal data, so that event timestamps can play a role in applications that use GeoMesa. &lt;a href=&#34;http://kafka.apache.org/&#34;&gt;Apache Kafka&lt;/a&gt; provides GeoMesa with some nice infrastructure for handling real time streaming data. For example, it was used to create this animated U.S. map of tweets over the 2015 Super Bowl week.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=ugZdvnFrh4Q&#34;&gt;&lt;img id=&#34;idm140447919702816&#34; src=&#34;https://www.bobdc.com/img/main/superbowltweets.png&#34; width=&#34;420&#34; height=&#34;236&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As alternatives to using Accumulo for storage, GeoMesa can also use &lt;a href=&#34;https://hbase.apache.org/&#34;&gt;Apache HBase&lt;/a&gt; and &lt;a href=&#34;https://cloud.google.com/bigtable/docs/&#34;&gt;Google Cloud BigTable&lt;/a&gt;, the public version of Google&amp;rsquo;s internal Bigtable storage system. After Google heard about this, they contacted CCRi about a partnership, which was exciting enough in this town for a local TV station to run the news story shown below. That video is fun, but if you only have a minute and a half to watch a video about GeoMesa, I recommend the &lt;a href=&#34;https://www.youtube.com/watch?v=S7VTAoP0bu4&#34;&gt;GeoMesa on Google BigTable&lt;/a&gt; one, which shows off some of the excellent visualizations that are possible.&lt;/p&gt;
&lt;p&gt;In addition to products like GeoMesa and &lt;a href=&#34;http://www.ccri.com/case-studies/&#34;&gt;others&lt;/a&gt; that you can see on the website, the company does applied research, often for government agencies. (I&amp;rsquo;m learning a lot about those—did you know that the U.S. has an &lt;a href=&#34;http://www.iarpa.gov/index.php/about-iarpa/anticipating-surprise&#34;&gt;Office for Anticipating Surprise&lt;/a&gt;?) In this era of Big Data, the question sometimes comes up of how to best make use of all this data now that tools for working with such large quantities of it have become more easily available. CCRi&amp;rsquo;s &lt;a href=&#34;http://www.ccri.com/capabilities/&#34;&gt;capabilities&lt;/a&gt; such as predictive analytics, optimization, and text analysis are helping customers get more out of this data in settings ranging from international sales patterns to battlefields. If anyone wants to contact me to learn more, I&amp;rsquo;d be happy to set them up with the right people to tell them about the kinds of services CCRi offers.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.newsplex.com/home/headlines/Local-Company-Earns-Partnership-with-Google--303871131.html&#34;&gt;&lt;img id=&#34;idm140447919694288&#34; src=&#34;https://www.bobdc.com/img/main/googlenewsplex.png&#34; width=&#34;420&#34; height=&#34;236&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
      <category domain="https://www.bobdc.com//categories/gis">GIS</category>
      
    </item>
    
    <item>
      <title>13 ways to make your writing look more professional</title>
      <link>https://www.bobdc.com/blog/13-ways-to-make-your-writing-l/</link>
      <pubDate>Tue, 17 Nov 2015 14:35:41 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/13-ways-to-make-your-writing-l/</guid>
      
      
      <description><div>Simple copyediting things.</div><div>&lt;blockquote id=&#34;idm139759380511536&#34; class=&#34;pullquote&#34;&gt;The nice thing about these is that, unlike with truly good writing, no skill and very little work is required to put them into practice. They’re all just a matter of paying attention.&lt;/blockquote&gt;
&lt;p&gt;I’ve done some copyediting as part of my job, especially with marketing material. Certain basic mistakes come up so often that I made a list that I’ve been tempted to give to whoever gave me the original content and say “please make sure that it doesn’t have any of these problems first!” I didn’t, but for those who are interested, following these simple rules will make your writing look more professional. The nice thing about these is that, unlike with truly good writing, no skill and very little work is required to put them into practice. They’re all just a matter of paying attention.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Never give someone something to read that you haven’t spell checked. If it has typos that a spell checker would have caught, it’s like saying “my time is so much more valuable than yours that I couldn’t bother doing this simple, mechanical two-minute task before giving this to you.” If you’re writing with a tool that doesn’t have a spell checker, paste the text into Microsoft Word or LibreOffice and look for the red squiggly lines. If a spell checker doesn’t recognize a company name and you’re not 100% sure of its spelling, take ten seconds to check it on their website, especially if someone from that company may see the piece.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Only put one space after a period, question mark, or exclamation mark ending a sentence, not two. People used two in the days of manual typewriters for hard copy manuscripts that would be submitted to typesetters, but as with the carriage returns that we formerly added to the end of every single line on typewriters, we now leave it up to the computer to decide how much spacing is appropriate. If you put two spaces after a period, your word processor will put too much space there.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In something published by an American company, punctuation at the end of a quoted phrase goes inside the quotes, “like this,” not outside, “like this”. In the UK they do it outside. This is a stickier issue with technical writing, where you may be referring to specific strings of quoted text; for example, if I write that a password is “swordfish”, I don’t want readers thinking that the comma is part of the password. The important thing is to be consistent within a document.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In a bulleted or numbered list, either end all the bullets with punctuation that treats the bullets as complete sentences or end none of them that way. Don’t do this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Go out the front door&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pull the mail out of the mailbox.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bring the mail back inside&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leave the mail on the dining room table.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The items of a list like that should be grammatically consistent: all complete sentences or all grammatically consistent phrases (for example, all noun phrases) with no complete sentences. For example, if the first item says “Easier setup and installation” and the second says “Wide choice of reports,” then no other items in that list should be complete sentences.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Put consistent spacing around em dashes and don&amp;rsquo;t confuse them with hyphens. A hyphen is the keyboard character that usually connects words being used together as a single adjective as in user-friendly interface or in-memory database. An em dash (named for being the width of the letter &amp;ldquo;m&amp;rdquo;) is used for appositive phrases. It&amp;rsquo;s often written with two hyphens&lt;code&gt;--&lt;/code&gt;like this&lt;code&gt;--&lt;/code&gt;which Microsoft Word and LibreOffice will convert to an em dash character. In HTML, you can enter &lt;code&gt;&amp;amp;mdash;&lt;/code&gt; or just paste the character from somewhere else. (An en dash is a bit narrower and used for date ranges. Handy hint when you&amp;rsquo;re unloading your last few tiles at the end of a Scrabble game: both em and en are legal words.) Em dashes should either have a space on both sides — like that, or on neither side—like that. Pick one spacing convention and make sure that all the em dashes in a given document are spaced consistently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Some phrases may or may not use initial caps, like Artificial Intelligence. If you do, capitalize it consistently throughout a document. Don’t refer to Artificial Intelligence in the first paragraph and artificial intelligence in the fourth. Also, with phrases that may or may not be written as one word, pick one and be consistent; don’t write “filename” in one paragraph and “file name” further on in the document. (Early drafts of this blog post made this mistake with “spellcheck.”)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We use apostrophes to stand in for a missing letter in a contraction (such as standing in for the “o” from “is not” in “isn’t”) or for the possessive, as in Jim’s car, so never ever use “it’s” as a possessive—“it’s” can only be used as a contraction for “it is.” Don&amp;rsquo;t use an apostrophe and an “s” to indicate a plural. (Some people make exceptions for numbers like 1990’s and abbreviations such as M.D.&amp;rsquo;s.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use English instead of Latin abbreviations: “for example” instead of “e.g.” and “that is” instead of “i.e.” Instead of saying “etc.,” introduce a list with “such as” to indicate that the list is incomplete and that there are probably more entries. For example, say “baseball teams such as the Mets, Yankees, and Red Sox” and not “Mets, Yankees, Red Sox, etc.”.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In the age of the web, underlining means hypertext link. Don&amp;rsquo;t use it for anything else because it clutters a layout. (In the old days, it was an indication to a typesetter to italicize text.) For emphasis, use bold or italics. For example: &lt;em&gt;Never use an apostrophe and an “s” to indicate a plural.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check that all the links work. As with spell checking, this is best done (or redone) just before sending a document off to someone, because if you do it and then make many other edits, those edits may introduce new problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If a product name is trademarked, only put the trademark symbol after the first mention of the product in a document. Here is what &lt;a href=&#34;http://www.forbes.com/sites/work-in-progress/2014/03/12/when-and-how-do-i-have-to-use-trademark-symbols/&#34;&gt;one intellectual property attorney&lt;/a&gt; tells us:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In written documents — in articles, press releases, promotional materials, and the like — it is only necessary to use a symbol with the first instance of the mark, or with the most prominent placement of the mark. It is a common misconception that each and every instance of the mark should bear a trademark symbol. Overuse creates visual clutter and may detract from the aesthetic appeal of the piece. Provided there is at least one conspicuous use of the TM, SM, or ® on the face of the writing, do not be afraid to eliminate superfluous markings.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Don&amp;rsquo;t say &amp;ldquo;and/or.&amp;rdquo; In general, the use of slashes to indicate indecision is a bad idea. Decide on one or the other, or rewrite the sentence.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;2016-12-23 update: In her article &lt;a href=&#34;https://medium.freecodecamp.com/lessons-from-a-years-worth-of-hiring-data-dacf4e7668d4&#34;&gt;Lessons from a year&amp;rsquo;s worth of hiring data&lt;/a&gt;, Aline Lerner demonstrates her surprising finding that the fewer grammatical and spelling mistakes software developers made on their resumes, the more likely they were to be worth hiring. See her &lt;a href=&#34;https://medium.freecodecamp.com/lessons-from-a-years-worth-of-hiring-data-dacf4e7668d4#228d&#34;&gt;Number of errors&lt;/a&gt; example in which many of the rules I&amp;rsquo;ve listed above are broken.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>Data wrangling, feature engineering, and dada</title>
      <link>https://www.bobdc.com/blog/data-wrangling-feature-enginee/</link>
      <pubDate>Sat, 17 Oct 2015 09:58:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/data-wrangling-feature-enginee/</guid>
      
      
      <description><div>And surrealism, and impressionism...</div><div>&lt;img id=&#34;id125024&#34; src=&#34;https://www.bobdc.com/img/main/object2bedestroyed.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Man Ray assemblage&#34; width=&#34;300&#34;/&gt;
&lt;p&gt;In my &lt;a href=&#34;http://www.datascienceglossary.org&#34;&gt;data science glossary&lt;/a&gt;, the entry for &lt;a href=&#34;http://datascienceglossary.org/#datawrangling&#34;&gt;data wrangling&lt;/a&gt; gives this example: &amp;ldquo;If you have 900,000 &lt;em&gt;birthYear&lt;/em&gt; values of the format yyyy-mm-dd and 100,000 of the format mm/dd/yyyy and you write a Perl script to convert the latter to look like the former so that you can use them all together, you&amp;rsquo;re doing data wrangling.&amp;rdquo; Data wrangling isn&amp;rsquo;t always cleanup of messy data, but can also be more creative, downright fun work that qualifies as what machine learning people call &amp;ldquo;feature engineering,&amp;rdquo; which Charles L. Parker &lt;a href=&#34;http://blog.bigml.com/2013/02/21/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-part-two/&#34;&gt;described&lt;/a&gt; as &amp;ldquo;when you use your knowledge about the data to create fields that make machine learning algorithms work better.&amp;rdquo; In other words, you&amp;rsquo;re creating new fields (or features, or properties, or attributes, depending on your modeling frame of mind) from existing data to let systems do more with that data.&lt;/p&gt;
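&lt;p&gt;The glossary&amp;rsquo;s &lt;em&gt;birthYear&lt;/em&gt; example imagines a Perl script; a minimal Python sketch of the same wrangling step (the function name and sample values here are mine, not from the glossary) might look like this:&lt;/p&gt;

```python
import re

# Rewrite mm/dd/yyyy values as yyyy-mm-dd so the whole column is uniform.
def normalize_date(value):
    """Convert an mm/dd/yyyy string to yyyy-mm-dd; pass other values through."""
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", value)
    if m:
        month, day, year = m.groups()
        return f"{year}-{month}-{day}"
    return value

mixed = ["1978-03-14", "07/04/1976"]
print([normalize_date(v) for v in mixed])
# ['1978-03-14', '1976-07-04']
```

&lt;p&gt;The point is the shape of the task, not the language: detect the minority format, rewrite it to match the majority one, and leave everything else alone.&lt;/p&gt;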
&lt;p&gt;New York&amp;rsquo;s &lt;a href=&#34;http://www.moma.org/&#34;&gt;Museum of Modern Art&lt;/a&gt; released metadata about their complete collection on &lt;a href=&#34;https://github.com/MuseumofModernArt/collection&#34;&gt;github&lt;/a&gt;, and I recently had a great time doing some data wrangling with it. I managed to transform the data so that it could answer interesting questions such as &amp;ldquo;who are the youngest painters in MoMA&amp;rsquo;s collection?&amp;rdquo; and &amp;ldquo;on average, which country&amp;rsquo;s painters make the biggest paintings?&amp;rdquo; Neither of these questions could be answered with a query against their original data.&lt;/p&gt;
&lt;p&gt;I enjoyed working with this data so much because I went to MoMA pretty regularly during my years in New York City. In addition to iconic paintings such as Picasso&amp;rsquo;s &lt;a href=&#34;http://www.moma.org/collection/works/79766?locale=en&#34;&gt;Demoiselles d&amp;rsquo;Avignon&lt;/a&gt;, Dalí&amp;rsquo;s &lt;a href=&#34;http://www.moma.org/collection/works/79018?locale=en&#34;&gt;Persistence of Memory&lt;/a&gt;, and van Gogh&amp;rsquo;s &lt;a href=&#34;http://www.moma.org/collection/works/79802?locale=en&#34;&gt;The Starry Night&lt;/a&gt;, they have many key works by my own favorites such as Marcel Duchamp and Man Ray. My wife and I were members there for several years, which let us go to the members&amp;rsquo; special openings of some exhibits, and through a friend of hers we sometimes got to go to the more exclusive pre-members&amp;rsquo; openings where we&amp;rsquo;d see celebrities such as Chuck Close and David Bowie.&lt;/p&gt;
&lt;h2 id=&#34;id124724&#34;&gt;The data&lt;/h2&gt;
&lt;p&gt;The data on github is a comma-separated value file with 123,920 rows and 14 columns that have labels across the top such as &amp;ldquo;ArtistBio&amp;rdquo;, &amp;ldquo;Medium&amp;rdquo;, and &amp;ldquo;Dimensions&amp;rdquo;. The feature engineering fun comes from looking in the more descriptive fields to find patterns that identify pieces of data that can be stored on their own with more structure so that they&amp;rsquo;re easier to query. For example, the smaller of their two Monet &lt;a href=&#34;http://www.moma.org/collection/works/80298?locale=en&#34;&gt;Water Lilies&lt;/a&gt; paintings has a &amp;ldquo;Dimensions&amp;rdquo; value of &amp;ldquo;6&amp;rsquo; 6 1/2&amp;rdquo; x 19&amp;rsquo; 7 1/2&amp;quot; (199.5 x 599 cm)&amp;quot; and Man Ray&amp;rsquo;s assemblage &lt;a href=&#34;http://www.moma.org/collection/works/81209?locale=en&#34;&gt;Indestructible Object (or Object to Be Destroyed)&lt;/a&gt; has a value of &amp;ldquo;8 7/8 x 4 3/8 x 4 5/8&amp;rdquo; (22.5 x 11 x 11.6 cm)&amp;quot;. Along with that optional third dimension, other variations in this column include the use of the symbol &amp;ldquo;×&amp;rdquo; instead of the letter &amp;ldquo;x&amp;rdquo; and descriptive additions such as &amp;ldquo;Approx.&amp;rdquo; (174 works) or &amp;ldquo;irregular&amp;rdquo; (101).&lt;/p&gt;
&lt;p&gt;I wrote a Python script that churned through this data and used regular expressions to pull individual pieces of information from several different fields. (&lt;a href=&#34;https://en.wikipedia.org/wiki/Regular_expression&#34;&gt;Regular expressions&lt;/a&gt;, also known as regexes, offer ways to look for patterns in data such as &amp;ldquo;four numeric digits followed by optional space, a hyphen, optional space, and then either two or four digits&amp;rdquo;. O&amp;rsquo;Reilly has a &lt;a href=&#34;http://shop.oreilly.com/product/0636920012337.do&#34;&gt;whole book&lt;/a&gt; about them.) For the Dimensions field, my script pulled out the metric width, height, and, if included, the depth and descriptive note. My script, available with the resulting data on &lt;a href=&#34;https://github.com/bobdc/momacsv2rdf&#34;&gt;github&lt;/a&gt;, converts all the input fields and new data to RDF so that I could query it with SPARQL. For example, when writing the previous paragraph, I knocked out some quick SPARQL queries to find that the script had pulled &amp;ldquo;Approx.&amp;rdquo; from the Dimensions data 174 times and &amp;ldquo;irregular&amp;rdquo; 101 times.&lt;/p&gt;
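The extraction performed on the Dimensions column can be sketched in a few lines of Python. This is a simplified illustration rather than the actual code from the github repository; the pattern, the field names, and the assumption that the first metric number is the height are mine:

```python
import re

# A simplified sketch of the kind of regex used on the Dimensions
# column; the pattern and field names here are my own, not the actual
# code from the github repository. The cm values sit inside
# parentheses, separated by "x" or the multiplication sign, with an
# optional third number for depth.
DIM_RE = re.compile(r'\(([\d.]+)\s*[x×]\s*([\d.]+)(?:\s*[x×]\s*([\d.]+))?\s*cm\)')

def parse_dimensions(value):
    m = DIM_RE.search(value)
    if m is None:
        return None
    height, width, depth = m.groups()
    # assuming MoMA lists height before width, as these examples suggest
    return {'height_cm': float(height),
            'width_cm': float(width),
            'depth_cm': float(depth) if depth else None}

print(parse_dimensions('8 7/8 x 4 3/8 x 4 5/8 (22.5 x 11 x 11.6 cm)'))
```

A fuller version would also pull descriptive additions such as "Approx." and "irregular" into their own properties.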
&lt;p&gt;I considered also outputting the results to a new CSV table with additional columns for the extracted properties, but when an artist like Elizabeth Catlett is listed as both American and Mexican, I wanted to output these two separate facts about her, which would require two columns or a separate artist nationality table to handle artists with multiple values for this field. This would be a pain with table-based data, but of course, it&amp;rsquo;s not an issue with RDF.&lt;/p&gt;
&lt;p&gt;Artist nationalities came from the CSV file&amp;rsquo;s ArtistBio column, which had simple descriptions such as &amp;ldquo;(Swiss, born 1943)&amp;rdquo; and more complex ones such as &amp;ldquo;(French and Swiss, born Switzerland 1944)&amp;rdquo; and &amp;ldquo;(American, born Germany. 1886-1969)&amp;rdquo;. For each work&amp;rsquo;s artist, my Python script&amp;rsquo;s regular expressions pulled out nationality values, where they were born if specified, their birth years, and their death years (if specified) into separate RDF triples.&lt;/p&gt;
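The ArtistBio parsing can be sketched the same way. Again, this is a hypothetical simplification, not the repository's actual regular expressions:

```python
import re

# Hypothetical sketch, not the repository's actual code: pull
# nationality, optional birthplace, birth year, and optional death year
# out of ArtistBio values like "(American, born Germany. 1886-1969)".
BIO_RE = re.compile(
    r'\((?P<nationality>[^,)]+)'                         # "American", "French and Swiss"
    r'(?:,\s*born(?:\s+(?P<birthplace>[A-Za-z ]+?))?)?'  # optional birthplace
    r'[.,]?\s*(?P<born>\d{4})'                           # birth year
    r'(?:\s*-\s*(?P<died>\d{4}))?\)'                     # optional death year
)

def parse_bio(bio):
    m = BIO_RE.search(bio)
    if m is None:
        return None
    fields = m.groupdict()
    # "French and Swiss" becomes two separate nationality values, each
    # of which would then become its own RDF triple
    fields['nationality'] = fields['nationality'].split(' and ')
    return fields

print(parse_bio('(French and Swiss, born Switzerland 1944)'))
```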
&lt;p&gt;Not counting the header row and blank cells, the MoMA CSV file has 1,625,710 pieces of information in it. The resulting RDF has 2,364,277 triples; most of the difference comes from the new facts extracted from the descriptive fields, which is what makes the converted data richer.&lt;/p&gt;
&lt;h2 id=&#34;id127395&#34;&gt;Queries to play with the new data&lt;/h2&gt;
&lt;p&gt;I could make many interesting queries against the original CSV values that were converted to triples with no manipulation, but the value of this feature engineering is clearer if we look at queries that take advantage of the new, extracted data. (For those interested in the geekier details, each bullet below links to the actual SPARQL query and results.) You&amp;rsquo;ll see that a common theme among the queries is doing a bit of arithmetic with numeric values extracted from the more descriptive CSV values, such as multiplying height by width to determine a work&amp;rsquo;s area.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/momaqueries.html#q1&#34;&gt;What&amp;rsquo;s the single largest painting?&lt;/a&gt; At 798,972 square cm, James Rosenquist&amp;rsquo;s &lt;a href=&#34;&#34;&gt;F-111&lt;/a&gt;. I knew of and had seen this work, but didn&amp;rsquo;t realize until looking at his &lt;a href=&#34;https://en.wikipedia.org/wiki/James_Rosenquist&#34;&gt;Wikipedia page&lt;/a&gt; just now that F-111 was how this important sixties pop artist first came to the art world&amp;rsquo;s attention.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/momaqueries.html#q2&#34;&gt;What&amp;rsquo;s the largest photograph?&lt;/a&gt; Mariah Robertson&amp;rsquo;s &lt;a href=&#34;http://www.moma.org/collection/works/163921&#34;&gt;11&lt;/a&gt;, which uses a thirty-inch-wide one-hundred-foot roll of photographs as part of a three-dimensional work. (I might not consider this a &amp;ldquo;Photograph&amp;rdquo;, but that is its Classification value in the original CSV data.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/momaqueries.html#q3&#34;&gt;What&amp;rsquo;s the largest three-dimensional work?&lt;/a&gt; The 1994 installation &lt;a href=&#34;http://www.moma.org/collection/works/163921&#34;&gt;Stations&lt;/a&gt; by Bill Viola, who first became known as a video artist. (The piece includes five video projections.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/momaqueries.html#q4&#34;&gt;How many painters come from each country?&lt;/a&gt; No surprise that the U.S. leads with 494 artists, followed by France, Germany, Britain, and other European countries until you get to Argentina in seventh place and Japan in eighth. The full list has 52 countries, and I thought Argentina&amp;rsquo;s high placement was interesting; off the top of my head I can&amp;rsquo;t name a single artist from that country.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/momaqueries.html#q5&#34;&gt;What&amp;rsquo;s the average painting size by country?&lt;/a&gt; This query filters out countries with fewer than eleven paintings in the collection to increase the chance of getting a representative sampling, and again it&amp;rsquo;s not a surprise that the U.S. leads with an average painting size of 28,244 square cm. (I&amp;rsquo;m sure Rosenquist helped here.) The next few are Germany, Britain, Japan, and Italy, all with average sizes over 20,000 square cm. The Russians have the smallest paintings, with their 32 averaging 6,758 square cm. I&amp;rsquo;m sure that closer analysis would find smaller or larger sizes to be favored by particular artists who are well-represented in MoMA&amp;rsquo;s collection, skewing the averages for their countries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/momaqueries.html#q6&#34;&gt;What are the oldest pieces in the collection and who made them?&lt;/a&gt; Besides a brocade from 1600 by &amp;ldquo;unknown&amp;rdquo;, there are four &amp;ldquo;Black basalt with glazed interior&amp;rdquo; works dated 1768 such as this &lt;a href=&#34;http://www.moma.org/collection/works/91978&#34;&gt;sugar bowl&lt;/a&gt;. These are pretty old for a museum of modern art, but if you look at any of them you&amp;rsquo;ll see why they fit right into the collection. And, they&amp;rsquo;re credited to a familiar name: Josiah Wedgwood, founder of the &lt;a href=&#34;https://www.wedgwood.com/&#34;&gt;company that bears his name&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/momaqueries.html#q7&#34;&gt;Who are the five youngest painters with work in the collection?&lt;/a&gt; One work apparently co-credited to two artists gives us a total of six names, all born in the eighties, and none of whom I&amp;rsquo;ve heard of.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most of these queries focus on work in specific media because broader versions often ran into data anomalies that led to odd answers. For example, a query for the work in the collection that took the longest to create showed several photographs that apparently took over a hundred years to make. I assume that the elapsed time represented the span between the exposure of the negatives and the creation of the prints in MoMA&amp;rsquo;s collection. A query for the oldest living artist seemed simple enough&amp;ndash;just look for the earliest birth year with no corresponding death year&amp;ndash;but it turned out that there was no death date recorded for one artist born in 1731. (Sometimes the data has question marks as a birth or death date, but I didn&amp;rsquo;t want to store those in a property that I&amp;rsquo;d use to perform arithmetic.) A query about the youngest artist in the whole collection found that it was someone named &amp;ldquo;Technology will save us&amp;rdquo; born in 2012&amp;ndash;clearly a collective founded in that year and not a person. Also, since all artist names and information are properties of a &amp;ldquo;work&amp;rdquo;, an artist whose name is spelled two different ways will be treated as two different artists in the current setup.&lt;/p&gt;
&lt;p&gt;Other odd answers led to tweaks to the regular expressions and other logic in the data conversion and queries, but at some point, unless someone&amp;rsquo;s paying you otherwise, you&amp;rsquo;ve got to quit and make the best you can of what you have. (On this topic, I highly recommend Jeni Tennison&amp;rsquo;s classic &lt;a href=&#34;https://theodi.org/blog/five-stages-of-data-grief&#34;&gt;Five Stages of Data Grief&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Even if my script doesn&amp;rsquo;t create perfect data about every work in MoMA&amp;rsquo;s collection, the data it creates still offers plenty to query. I think it demonstrates pretty nicely how data wrangling techniques such as the use of regular expressions&amp;ndash;in addition to cleaning up messes such as badly formatted data&amp;ndash;can do the kind of feature engineering that improves a dataset and makes it even more useful.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Photo of Man Ray&amp;rsquo;s &amp;ldquo;Indestructible Object (or Object to be Destroyed)&amp;rdquo; by &lt;a href=&#34;https://www.flickr.com/photos/chrisjb/&#34;&gt;Chris Barker&lt;/a&gt; via &lt;a href=&#34;https://www.flickr.com/photos/chrisjb/3980831927&#34;&gt;Flickr&lt;/a&gt; (&lt;a href=&#34;https://creativecommons.org/licenses/by-nc-nd/2.0/&#34;&gt;CC BY-NC-ND 2.0&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/data-science">data science</category>
      
    </item>
    
    <item>
      <title>My data science glossary</title>
      <link>https://www.bobdc.com/blog/my-data-science-glossary/</link>
      <pubDate>Sat, 19 Sep 2015 10:23:04 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-data-science-glossary/</guid>
      
      
      <description><div>Complete with a dot org domain name.</div><div>&lt;img id=&#34;id128296&#34; src=&#34;https://www.bobdc.com/img/main/glossary.jpg&#34; border=&#34;0&#34; width=&#34;240&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;glossary in dictionary&#34;/&gt;
&lt;p&gt;Lately I&amp;rsquo;ve been studying up on the math and technology associated with data science because there are so many interesting things going on. Despite taking many notes, I found myself learning certain important terms, seeing them again later, and then thinking &amp;ldquo;What was that again? P-values? Huh?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;So, I turned a portion of my notes into a glossary to make these things easy to look up when I wanted to remember them. I decided that I may as well publish this glossary in case others found it helpful, or if they had suggestions or corrections. And, when I found that the domain name datascienceglossary.org wasn&amp;rsquo;t taken, I couldn&amp;rsquo;t resist grabbing it.&lt;/p&gt;
&lt;p&gt;Now it&amp;rsquo;s up and ready for the world: &lt;a href=&#34;http://www.datascienceglossary.org&#34;&gt;datascienceglossary.org&lt;/a&gt;. I also took the opportunity to try out &lt;a href=&#34;http://getbootstrap.com/&#34;&gt;Bootstrap&lt;/a&gt; to see how easily it might make my new little website look presentable on Android and Apple phones and tablets in addition to bigger screens. It was pretty easy, especially after I found their &lt;a href=&#34;http://getbootstrap.com/css/&#34;&gt;documentation page&lt;/a&gt;. (In the past, I&amp;rsquo;ve found that many CSS frameworks that are supposed to make your life easier have horrible documentation, if any&amp;ndash;&amp;ldquo;just look at our fabulous examples&amp;rdquo; isn&amp;rsquo;t enough; if the class values that we&amp;rsquo;re supposed to assign to our HTML elements are packed with cryptic little abbreviations, then &lt;em&gt;tell us what all the abbreviations stand for&lt;/em&gt;.)&lt;/p&gt;
&lt;p&gt;I hope my data science glossary is useful to some people. I know it will be useful to me, especially the next time I forget what &amp;ldquo;P-value&amp;rdquo; means.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
      <category domain="https://www.bobdc.com//categories/data-science">data science</category>
      
    </item>
    
    <item>
      <title>Querying machine learning movie ratings data with SPARQL</title>
      <link>https://www.bobdc.com/blog/querying-machine-learning-movi/</link>
      <pubDate>Sat, 22 Aug 2015 10:10:21 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-machine-learning-movi/</guid>
      
      
      <description><div>Well, movie ratings data popular with machine learning people.</div><div>&lt;blockquote id=&#34;id118468&#34; class=&#34;pullquote&#34;&gt;I hope that more people using R, pandas, and other popular tools associated with data science projects appreciate what a nice addition SPARQL can be to their tool box.&lt;/blockquote&gt;
&lt;p&gt;While watching an excellent &lt;a href=&#34;https://vimeo.com/59324550&#34;&gt;video&lt;/a&gt; about the &lt;a href=&#34;http://pandas.pydata.org/&#34;&gt;pandas&lt;/a&gt; python data analysis library recently, I learned about how the University of Minnesota&amp;rsquo;s &lt;a href=&#34;http://grouplens.org/&#34;&gt;grouplens&lt;/a&gt; project has made a large amount of movie rating data from the &lt;a href=&#34;https://movielens.org/&#34;&gt;movielens&lt;/a&gt; website available. Their &lt;a href=&#34;http://grouplens.org/datasets/movielens/&#34;&gt;download&lt;/a&gt; page lets you pull down 100,000, one million, ten million, or 100 million ratings, including data about the people doing the rating and the movies they rated.&lt;/p&gt;
&lt;p&gt;This dataset is popular in the machine learning world: a &lt;a href=&#34;https://www.google.com/search?q=movielens+%22machine+learning%22&#34;&gt;Google search on &amp;ldquo;movielens &amp;lsquo;machine learning&amp;rsquo;&amp;rdquo;&lt;/a&gt; gets over 33 thousand hits, with over ten percent being in scholarly articles. I thought it would be fun to query this data with SPARQL, so I downloaded the 1 million rating set, wrote some short perl scripts to convert the ratings, users, and movies &amp;ldquo;tables&amp;rdquo; to turtle RDF, and was off and running.&lt;/p&gt;
&lt;h2 id=&#34;id118281&#34;&gt;The data&lt;/h2&gt;
&lt;p&gt;I put &amp;ldquo;tables&amp;rdquo; in quotes above because while most people like to think of data in terms of tables, the data about the movies themselves was not strictly a normalized table. As the &lt;a href=&#34;http://files.grouplens.org/datasets/movielens/ml-1m-README.txt&#34;&gt;README&lt;/a&gt; file tells us, each line has the structure &amp;ldquo;MovieID::Title::Genres&amp;rdquo;, in which Genres is a pipe-delimited list of one or more genres selected from the list in the README file. Here&amp;rsquo;s one example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;3932::Invisible Man, The (1933)::Horror|Sci-Fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The potential presence of more than one genre value in that last column means that this table&amp;rsquo;s data is not fully normalized, but speaking as an RDF guy, we don&amp;rsquo;t need no stinkin&amp;rsquo; normalization. A short perl script converted that line into the following turtle:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;gldm:i3932 rdfs:label &amp;quot;The Invisible Man&amp;quot; ;
   a schema:Movie ;
   dcterms:type &amp;quot;Horror&amp;quot; ;
   dcterms:type &amp;quot;Sci-Fi&amp;quot; ;
   schema:datePublished &amp;quot;1933&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, my perl script also moved the word &amp;ldquo;The&amp;rdquo; in the film&amp;rsquo;s title back where it belonged and pulled the release date out into its own triple, which let me query for things like the effect of a movie&amp;rsquo;s age on its popularity among viewers. Although the 3,883 movies listed went back to 1919, most were from the 1990s.&lt;/p&gt;
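As a sketch of that conversion in Python (the actual scripts here are Perl, and the prefix declarations are assumed to be written elsewhere in the output):

```python
# A Python re-sketch, not the original Perl, of converting one MovieLens
# "MovieID::Title::Genres" line to Turtle.
def movie_to_turtle(line):
    movie_id, title, genres = line.rstrip('\n').split('::')
    # Pull the year out of "Invisible Man, The (1933)"...
    title, year = title.rsplit(' (', 1)
    year = year.rstrip(')')
    # ...and move a trailing ", The" back to the front of the title
    if title.endswith(', The'):
        title = 'The ' + title[:-5]
    triples = [f'gldm:i{movie_id} rdfs:label "{title}" ;',
               '   a schema:Movie ;']
    for genre in genres.split('|'):
        triples.append(f'   dcterms:type "{genre}" ;')
    triples.append(f'   schema:datePublished "{year}" .')
    return '\n'.join(triples)

print(movie_to_turtle('3932::Invisible Man, The (1933)::Horror|Sci-Fi'))
```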
&lt;p&gt;Something else from the 1990s was the movie file&amp;rsquo;s Latin 1 encoding, so I used the &lt;a href=&#34;http://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.13/iconv.1.html&#34;&gt;iconv&lt;/a&gt; utility to convert it to UTF-8 before running the script that turned it into turtle so that a title such as &amp;ldquo;Not Love, Just Frenzy (Más que amor, frenesí)&amp;rdquo; wouldn&amp;rsquo;t get mangled along the way.&lt;/p&gt;
&lt;p&gt;A simpler perl script converted user descriptions of the format &amp;ldquo;UserID::Gender::Age::Occupation::Zip-code&amp;rdquo; to triples like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;gldu:i48 a schema:Person ;
   schema:gender &amp;quot;M&amp;quot; ;
   glschema:age glda:i25 ;
   schema:jobTitle gldo:i4 ;
   schema:postalCode &amp;quot;92107&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I created a ratingsSchemaAndCodeLists.ttl file to assign the age range and job title labels shown in the README file to the age and jobTitle values with triples like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;glda:i25 rdfs:label &amp;quot;25-34&amp;quot; . 
gldo:i4 a schema:jobTitle ;
   rdfs:label &amp;quot;college/grad student&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, a third perl script converted ratings lines of the format &amp;ldquo;UserID::MovieID::Rating::Timestamp&amp;rdquo; to triples grouped together with blank nodes like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[
  a schema:Review ;
  schema:author gldu:i1 ;
  schema:about gldm:i661 ;
  schema:reviewRating 3 ;
  dcterms:date &amp;quot;2000-12-31&amp;quot; 
] .
&lt;/code&gt;&lt;/pre&gt;
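A Python sketch of this last conversion; it assumes the MovieLens timestamps are seconds since the Unix epoch rendered as a UTC date, which may differ in detail from the Perl original:

```python
import datetime

# Sketch (assumed, not the original Perl): one MovieLens rating line
# "UserID::MovieID::Rating::Timestamp" becomes a blank-node Review.
def rating_to_turtle(line):
    user, movie, rating, ts = line.strip().split('::')
    # interpret the timestamp as Unix seconds and keep only the date
    date = datetime.datetime.fromtimestamp(
        int(ts), datetime.timezone.utc).strftime('%Y-%m-%d')
    return ('[\n'
            '  a schema:Review ;\n'
            f'  schema:author gldu:i{user} ;\n'
            f'  schema:about gldm:i{movie} ;\n'
            f'  schema:reviewRating {rating} ;\n'
            f'  dcterms:date "{date}" \n'
            '] .')

print(rating_to_turtle('1::661::3::978300760'))
```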
&lt;p&gt;The scripts and the ratingsSchemaAndCodeLists.ttl file are available on &lt;a href=&#34;https://github.com/bobdc/movielens2rdf&#34;&gt;github&lt;/a&gt;, and you can see the queries described below and their results at &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html&#34;&gt;movieLensQueries.html&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;id120755&#34;&gt;The queries&lt;/h2&gt;
&lt;p&gt;I mentioned that most of the movies were from the 1990s; the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q1results&#34;&gt;results&lt;/a&gt; of &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q1&#34;&gt;query 1&lt;/a&gt; show the actual number of rated movies by release year.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q2&#34;&gt;Query 2&lt;/a&gt; listed the movie genres sorted by the average ratings they received. The &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q2results&#34;&gt;results&lt;/a&gt; put Film-Noir, Documentary, War, and Drama in the top four spots. Does that make these four genres the most popular? Perhaps, if you measure popularity by assigned ratings, but if you measure it by the movies that people actually choose to see (or, more accurately, to rate), as &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q3&#34;&gt;query 3&lt;/a&gt; does, the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q3results&#34;&gt;results&lt;/a&gt; reveal that the four most popular genres to see are Comedy, Drama, Action, and Thrillers, with Film-Noir and Documentary ranking in the bottom two spots.&lt;/p&gt;
&lt;p&gt;Breaking ratings down by age group makes things more interesting. &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q4&#34;&gt;Query 4&lt;/a&gt; asks for average ratings by age group, and the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q4results&#34;&gt;results&lt;/a&gt; show a strong correlation between age and ratings: while movie viewers aged 18-24 give slightly lower ratings than those under 18&amp;ndash;it is a cynical age to be&amp;ndash;from there on up, the older the viewers, the higher the average ratings.&lt;/p&gt;
&lt;p&gt;What are each age group&amp;rsquo;s favorite genres by rating and by attendance? &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q5&#34;&gt;Query 5&lt;/a&gt; asks for attendance figures and average ratings broken down by age group and genres. In the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q5aresults&#34;&gt;first version of these results&lt;/a&gt;, sorted by rating, we see that most age groups give the highest average ratings to Film-Noir, Documentary, and War movies, in that order, except the two oldest groups, who rate War movies higher than Documentaries, and the youngest group, whose average rating for Documentary films puts them behind Film-Noir, War, and Drama.&lt;/p&gt;
&lt;p&gt;With the same results &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q5bresults&#34;&gt;sorted by attendance&lt;/a&gt; within each age group, we see that the three age groups under 35 prefer to watch Comedy, Drama, and Action movies, in that order. Most people 35 and older would rather watch Drama than Comedies, with Action in third place for them as well.&lt;/p&gt;
&lt;p&gt;I was curious whether a movie&amp;rsquo;s age affected viewers&amp;rsquo; choices of what to see and their ratings&amp;ndash;for example, when watching a movie that you&amp;rsquo;ve heard about for a few years, are you more likely to assume that it&amp;rsquo;s good because it hasn&amp;rsquo;t faded away? &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q6&#34;&gt;Query 6&lt;/a&gt; lists the average ratings given to movies by movie type if the movie was seen more than five years after release. In &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q6results&#34;&gt;these results&lt;/a&gt;, Film-Noir is once again at the top, but the average rating of War movies puts them above Documentaries, and Mysteries climb from seventh to fourth place.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q7&#34;&gt;Query 7&lt;/a&gt; asks the same thing about movies that were ten years old when viewed. &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q7results&#34;&gt;These results&lt;/a&gt; show Mysteries climbing to third place and pushing Documentaries down to fourth, so it appears that Mysteries age better than Documentaries. (Nothing ages better than Film-Noir, whose average ratings go up with age, but remember that they&amp;rsquo;re not nearly as popular to watch as the other genres; people who like them just like them more.)&lt;/p&gt;
&lt;p&gt;Finally, &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q8&#34;&gt;Query 8&lt;/a&gt; asks for the average ratings and total attendance by age group for the movies that were more than ten years old when viewed. Comparing the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q8aresults&#34;&gt;results sorted by rating&lt;/a&gt; with the same figures calculated for all movies (the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q5aresults&#34;&gt;first query 5 results&lt;/a&gt;), we see that it&amp;rsquo;s the older movie viewers driving the higher ratings of older Mysteries over Documentaries&amp;ndash;the ratings of the 199 movie viewers aged 18-24 actually put Documentaries at the top of their list of older movies. The same results &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q8bresults&#34;&gt;sorted by attendance&lt;/a&gt; were remarkably similar to the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/movieLensQueries.html#q5bresults&#34;&gt;query 5 version&lt;/a&gt; that took all the movies into account.&lt;/p&gt;
&lt;h2 id=&#34;id121016&#34;&gt;And more queries&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s easy to think of more questions to ask; we haven&amp;rsquo;t even asked about specific movies and their roles in the ratings. For example, what were these older Documentaries that the 18-24 year-old viewers liked so much? Perhaps there was some breakout hit that skewed the averages by being more popular than Documentaries typically are. Do viewers&amp;rsquo; genders or job titles affect their choice of movies to see or the ratings they gave them? If you&amp;rsquo;re wondering, or thinking of new queries, you can download the data from the grouplens link above, convert it to turtle with my perl scripts, and query away.&lt;/p&gt;
&lt;p&gt;With more recent ratings and movies, these kinds of explorations of the data could be used to plan advertising budgets or a film festival program. I mostly found it fun as a way to use SPARQL to explore a set of data that was not designed to be represented in RDF, but was very easy to convert, and I hope that more people using R, pandas, and other popular tools associated with data science projects appreciate what a great addition SPARQL can be to their tool box.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Visualizing DBpedia geographic data</title>
      <link>https://www.bobdc.com/blog/visualizing-dbpedia-geographic/</link>
      <pubDate>Wed, 15 Jul 2015 08:34:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/visualizing-dbpedia-geographic/</guid>
      
      
      <description><div>With some help from SPARQL.</div><div>&lt;a href=&#39;https://www.bobdc.com/img/main/astrobirthplaces.png&#39;&gt;
  &lt;img id=&#34;id139196&#34; src=&#34;https://www.bobdc.com/img/main/astrobirthplaces.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;400&#34; alt=&#34;US astronaut birth places&#34;/&gt;
  &lt;/a&gt;
&lt;p&gt;I&amp;rsquo;ve been learning about &lt;a href=&#34;https://en.wikipedia.org/wiki/Geographic_information_system&#34;&gt;Geographical Information System&lt;/a&gt; (GIS) data lately. More and more projects and businesses are doing interesting things by associating new kinds of data with specific latitude/longitude pairs; this data might be about &lt;a href=&#34;http://geohealth.us/index.html&#34;&gt;air quality&lt;/a&gt;, &lt;a href=&#34;http://www.zillow.com&#34;&gt;real estate prices&lt;/a&gt;, or &lt;a href=&#34;https://www.uber.com/&#34;&gt;the make and model of the nearest Uber car&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://wiki.dbpedia.org/&#34;&gt;DBpedia&lt;/a&gt; has a lot of latitude and longitude data, and SPARQL queries let you associate it with other data. Because you can retrieve these query results as CSV files, and many GIS packages can read CSV data, you can do a lot of similar interesting things yourself.&lt;/p&gt;
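For example, DBpedia's endpoint (a Virtuoso server) accepts a "format" parameter for choosing a results format, so building a CSV request with Python's standard library looks roughly like this; the query here is a trivial stand-in, not the astronaut query below:

```python
import urllib.parse

# Sketch: build a request for SPARQL results from DBpedia as CSV. The
# endpoint URL is DBpedia's public one; the query is a placeholder.
ENDPOINT = 'http://dbpedia.org/sparql'
query = 'SELECT ?s WHERE { ?s a dbo:Astronaut } LIMIT 5'
url = ENDPOINT + '?' + urllib.parse.urlencode(
    {'query': query, 'format': 'text/csv'})
# urllib.request.urlopen(url).read() would then fetch the CSV bytes
print(url)
```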
&lt;p&gt;A &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A++%3Fastronaut+dcterms%3Asubject+category%3AAmerican_astronauts+%3B%0D%0A+++++++++++++dbpedia-owl%3AbirthYear+%3FbirthYear+%3B+%0D%0A++++++++++++++dbpedia2%3Anationality+%3AUnited_States+.++%0D%0A%7D%0D%0AORDER+BY+%3FbirthYear%0D%0A&#34;&gt;query of DBpedia data about American astronauts&lt;/a&gt; shows that the oldest one was born in 1918 and the youngest one was born in 1979. I wondered whether, over time, there were any patterns in what part of the country they came from, and I managed to combine a DBpedia SPARQL query with an open-source GIS visualization package to create the map shown here.&lt;/p&gt;
&lt;p&gt;The following query asks for the birth year and latitude and longitude of the birthplace of each American astronaut:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT (MAX(?latitude) AS ?maxlat) (MAX(?longitude) AS ?maxlong) 
       ?astronaut (substr(str(MAX(?birthYear)),1,4) AS ?by) 
  WHERE {
  ?astronaut dcterms:subject category:American_astronauts ;
             dbpedia-owl:birthPlace ?birthPlace ;
             dbpedia-owl:birthYear ?birthYear ; 
              dbpedia2:nationality :United_States .  
  ?birthPlace geo:lat ?latitude ;
              geo:long ?longitude . 
}
GROUP BY ?astronaut
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The query has no prefix declarations because it uses the ones built into DBpedia. Also, because some places have more than one pair of geo:lat and geo:long values, I found it simplest to just take the maximum value of each to get one pair for each astronaut.) The following shows the first few lines of the result when I asked for CSV:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;maxlat&amp;quot;,&amp;quot;maxlong&amp;quot;,&amp;quot;astronaut&amp;quot;,&amp;quot;by&amp;quot;
37.195,-93.2861,&amp;quot;http://dbpedia.org/resource/Janet_L._Kavandi&amp;quot;,&amp;quot;1959&amp;quot;
42.6461,-83.2925,&amp;quot;http://dbpedia.org/resource/Brent_W._Jett,_Jr.&amp;quot;,&amp;quot;1958&amp;quot;
40.1,-75.0997,&amp;quot;http://dbpedia.org/resource/John-David_F._Bartoe&amp;quot;,&amp;quot;1944&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&#34;http://www.qgis.org/en/site/&#34;&gt;QGIS&lt;/a&gt; Desktop is an open-source tool for working with GIS data that, among other things, lets you visualize data. The data can come from disk files or from several other sources, including the &lt;a href=&#34;http://postgis.net/&#34;&gt;PostGIS&lt;/a&gt; add-on to the &lt;a href=&#34;http://www.postgresql.org/&#34;&gt;PostgreSQL&lt;/a&gt; database, which lets you scale up pretty far in the amount of data you can work with.&lt;/p&gt;
&lt;p&gt;Using QGIS to create the image above, I first loaded the &lt;a href=&#34;https://en.wikipedia.org/wiki/Shapefile&#34;&gt;shapefile&lt;/a&gt; (actually a collection of files, including an old-fashioned dBase dbf file) from the &lt;a href=&#34;https://www.census.gov/cgi-bin/geo/shapefiles2014/main&#34;&gt;US Census website&lt;/a&gt; with outlines of the individual states of the United States.&lt;/p&gt;
&lt;p&gt;GIS visualization is often about layering of data such as state boundaries, altitude data, and roads to see the combined effects; &lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Uber_screenshot.PNG&#34;&gt;those little cars&lt;/a&gt; in your phone&amp;rsquo;s Uber app would look kind of silly if the roads and your current location weren&amp;rsquo;t shown with them. For my experiment, the census shapefile was my first layer, and QGIS Desktop&amp;rsquo;s &amp;ldquo;Add Delimited Text Layer&amp;rdquo; feature let me add the results of my SPARQL query about astronaut data as another layer. One tricky bit for us GIS novices is that these tools usually ask you to specify a &lt;a href=&#34;http://gis.stackexchange.com/questions/23690/is-wgs84-itself-a-coordinate-reference-system&#34;&gt;Coordinate Reference System&lt;/a&gt; for any set of data, typically as an &lt;a href=&#34;http://www.epsg-registry.org/&#34;&gt;EPSG&lt;/a&gt; number, and there are a lot of those out there. I used EPSG 4269.&lt;/p&gt;
&lt;p&gt;At first, QGIS added in all the astronaut birthplace locations as little black circles filled with the same shade of green. It had also set the default fill color of the US map to green, so I reset that to white in the dialog box for configuring that layer&amp;rsquo;s properties. Then, in the astronaut data layer&amp;rsquo;s properties, I found that instead of using identical symbols to represent each point on the map, I could pick &amp;ldquo;Graduated&amp;rdquo; and specify a &amp;ldquo;color ramp&amp;rdquo; that QGIS would use to assign color values according to the values in the property that I selected for this: &lt;code&gt;by&lt;/code&gt;, or birth year, which you&amp;rsquo;ll recognize from the fourth column of the sample CSV output above. QGIS looked at the lowest and highest of these values and offered to assign the following colors to &lt;code&gt;by&lt;/code&gt; values in the ranges shown, and I just accepted the default:&lt;/p&gt;
&lt;img id=&#34;id141520&#34; src=&#34;https://www.bobdc.com/img/main/qgiscolors.png&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;400&#34; alt=&#34;QGIS color configuration&#34;/&gt;
&lt;p&gt;(While the earlier query showed a few astronauts born in 1978 and 1979, the range here only goes up to 1977 because I now see that some geographic coordinates in DBpedia are specified with &lt;code&gt;dbpprop:latitude&lt;/code&gt; and &lt;code&gt;dbpprop:longitude&lt;/code&gt; instead of &lt;code&gt;geo:lat&lt;/code&gt; and &lt;code&gt;geo:long&lt;/code&gt;, so if I was redoing this I&amp;rsquo;d revise the query to take those into account.)&lt;/p&gt;
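&lt;p&gt;A revised query could match either vocabulary with a &lt;code&gt;UNION&lt;/code&gt;. The fragment below is only a sketch: the variable names are placeholders, and the prefix declarations and the rest of the query are assumed to match the original query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Match coordinates recorded with either vocabulary.
{ ?birthplace geo:lat ?latitude ;
              geo:long ?longitude . }
UNION
{ ?birthplace dbpprop:latitude ?latitude ;
              dbpprop:longitude ?longitude . }
&lt;/code&gt;&lt;/pre&gt;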
&lt;p&gt;If you look at a &lt;a href=&#39;https://www.bobdc.com/img/main/astrobirthplaces.png&#39;&gt;larger image of the map above&lt;/a&gt;, you&amp;rsquo;ll see that many early astronauts came from the midwest, and then over time, they gradually came from the four corners of the continental US. Why so many from the New York City area and none from Wyoming? Is there something in New York more conducive to producing astronauts than the wide-open spaces of Wyoming? Yes: there are more people there, so the odds are that more astronauts will come from there. See &lt;a href=&#34;https://xkcd.com/1138/&#34;&gt;this excellent xkcd cartoon&lt;/a&gt; for more on this principle.&lt;/p&gt;
&lt;p&gt;I only scratched the surface of what QGIS can do. I found &lt;a href=&#34;https://www.youtube.com/watch?v=WAbOR_E2xtI&#34;&gt;this video from the Vermont Center for Geographic Info&lt;/a&gt; to be an excellent introduction. I learned from it and the book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1617291390/bobducharmeA/&#34;&gt;PostGIS in Action&lt;/a&gt; that an important set of features that GIS systems such as QGIS add is the automation of some of the math involved in computing distances and areas, which is not simple geometry because it takes place on the curved surface of the earth. A package like PostGIS adds specialized datatypes and functions to a general-purpose database like PostgreSQL to do the more difficult parts of the geography math. This lets your SQL queries do proximity analysis and other GIS tasks as well as hand such data off to a visualization tool such as QGIS. (The open-source &lt;a href=&#34;http://www.geomesa.org/&#34;&gt;GeoMesa&lt;/a&gt; database adds similar features to &lt;a href=&#34;https://accumulo.apache.org/&#34;&gt;Apache Accumulo&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/BigTable&#34;&gt;Google BigTable&lt;/a&gt; for more Hadoop-scale applications.)&lt;/p&gt;
&lt;p&gt;The great news for SPARQL users is that a GIS extension called &lt;a href=&#34;https://en.wikipedia.org/wiki/GeoSPARQL&#34;&gt;GeoSPARQL&lt;/a&gt; does something similar. You can try it out at the &lt;a href=&#34;http://geosparql.org/&#34;&gt;geosparql.org&lt;/a&gt; website. For example, entering the following query there will list all the airports within 10 miles of New York City:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX spatial:&amp;lt;http://jena.apache.org/spatial#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX geo:&amp;lt;http://www.w3.org/2003/01/geo/wgs84_pos#&amp;gt;
PREFIX gn:&amp;lt;http://www.geonames.org/ontology#&amp;gt;


SELECT ?name
WHERE {
  ?object spatial:nearby(40.712700 -74.005898 10 &#39;mi&#39;).
  ?object a &amp;lt;http://www.lotico.com/ontology/Airport&amp;gt; ;
  gn:name ?name 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The data uses a fairly broad definition of &amp;ldquo;airport,&amp;rdquo; including heliports and seaplane bases.) I have not played with any GeoSPARQL implementations outside of geosparql.org, but the &lt;a href=&#34;http://parliament.semwebcentral.org/&#34;&gt;Parliament&lt;/a&gt; one mentioned on the GeoSPARQL wikipedia page looks interesting. I have not played much with the &lt;a href=&#34;http://sisinflab.poliba.it/semanticweb/lod/losm/&#34;&gt;Linked Open Street Map SPARQL endpoint&lt;/a&gt;, but it also looks great for people who are interested in GIS and SPARQL.&lt;/p&gt;
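&lt;p&gt;For comparison, engines that implement the OGC GeoSPARQL standard express this kind of proximity test with filter functions over geometry literals instead of a single magic property like &lt;code&gt;spatial:nearby&lt;/code&gt;. The following is a hedged sketch, not something I have run; it assumes the airport data models geometries with &lt;code&gt;geo:hasGeometry&lt;/code&gt; and &lt;code&gt;geo:asWKT&lt;/code&gt;, which may not match any particular endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX geo:  &amp;lt;http://www.opengis.net/ont/geosparql#&amp;gt;
PREFIX geof: &amp;lt;http://www.opengis.net/def/function/geosparql/&amp;gt;
PREFIX uom:  &amp;lt;http://www.opengis.net/def/uom/OGC/1.0/&amp;gt;
PREFIX gn:   &amp;lt;http://www.geonames.org/ontology#&amp;gt;

SELECT ?name
WHERE {
  ?airport a &amp;lt;http://www.lotico.com/ontology/Airport&amp;gt; ;
           gn:name ?name ;
           geo:hasGeometry/geo:asWKT ?wkt .
  # 10 miles is roughly 16,093 metres.
  FILTER(geof:distance(?wkt,
                       &amp;quot;POINT(-74.005898 40.712700)&amp;quot;^^geo:wktLiteral,
                       uom:metre) &amp;lt; 16093)
}
&lt;/code&gt;&lt;/pre&gt;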
&lt;p&gt;Whether you try out GeoSPARQL or not, combining DBpedia&amp;rsquo;s ability to associate such a broad range of data with geographic coordinates and the ability of GIS visualization tools like QGIS to work with that data (especially the ability to visualize the associated data—in my case, the color coding of astronaut birth years) gives you a vast new category of cool things you can do with SPARQL.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Artificial Intelligence, then (1960) and now</title>
      <link>https://www.bobdc.com/blog/artificial-intelligence-then-1/</link>
      <pubDate>Sat, 20 Jun 2015 10:50:25 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/artificial-intelligence-then-1/</guid>
      
      
      <description><div>Especially machine learning.</div><div>&lt;blockquote id=&#34;id104247&#34; class=&#34;pullquote&#34;&gt;It&#39;s fascinating how relevant much of [this 1960 paper] still is today, especially considering the limited computing power available 55 years ago.&lt;/blockquote&gt;
&lt;p&gt;Earlier this month I &lt;a href=&#34;https://twitter.com/bobdc/status/606169477496061953&#34;&gt;tweeted&lt;/a&gt; &amp;ldquo;When people write about AI like it&amp;rsquo;s this brand new thing, should I be amused, feel old, or both?&amp;rdquo; The tweet linked to a recent Harvard Business Review article called &lt;a href=&#34;https://hbr.org/2015/05/data-scientists-dont-scale&#34;&gt;Data Scientists Don&amp;rsquo;t Scale&lt;/a&gt; about the things that Artificial Intelligence is currently doing, which just happened to be the things that the author of the article&amp;rsquo;s automated prose-generation company is doing.&lt;/p&gt;
&lt;p&gt;The article provided absolutely no historical context to this phrase that has thrilled, annoyed, and fascinated people since the term was first coined by John McCarthy in 1955. (For a little historical context, this was two years after Dwight Eisenhower succeeded Harry Truman as President of the United States. Three years later, McCarthy invented Lisp—a programming language that, besides providing the basis of other popular languages such as Scheme and the currently very hot Clojure, is still used today.) I recently came across a link to the seminal 1960 paper &lt;a href=&#34;https://web.media.mit.edu/~minsky/papers/steps.html&#34;&gt;Steps Toward Artificial Intelligence&lt;/a&gt; by AI pioneer Marvin Minsky, who was there at the beginning in 1955, and so I read it on a long plane ride. It&amp;rsquo;s fascinating how relevant much of it still is today, especially when you take into account the limited computing power available 55 years ago.&lt;/p&gt;
&lt;p&gt;After enumerating the five basic categories of &amp;ldquo;making computers solve really difficult problems&amp;rdquo; (search, pattern-recognition, learning, planning, and induction), the paper mentions several algorithms that are still considered to be basic tools in Machine Learning toolboxes: &lt;a href=&#34;https://www.youtube.com/watch?v=kOFBnKDGtJM&#34;&gt;hill climbing&lt;/a&gt;, &lt;a href=&#34;http://en.wikipedia.org/wiki/Naive_Bayes_classifier&#34;&gt;naive Bayesian classification&lt;/a&gt;, &lt;a href=&#34;http://en.wikipedia.org/wiki/Perceptron&#34;&gt;perceptrons&lt;/a&gt;, &lt;a href=&#34;https://www.udacity.com/course/machine-learning-reinforcement-learning--ud820&#34;&gt;reinforcement learning&lt;/a&gt;, and &lt;a href=&#34;https://en.wikipedia.org/wiki/Artificial_neural_network&#34;&gt;neural nets&lt;/a&gt;. He mentions that one part of Bayesian classification &amp;ldquo;can be made by a simple network device&amp;rdquo; that he illustrates with this diagram:&lt;/p&gt;
&lt;img id=&#34;id104104&#34; src=&#34;https://www.bobdc.com/img/main/minskybayes.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34;/&gt;
&lt;p&gt;It&amp;rsquo;s wild to consider that the software possibilities were so limited at the time that implementing some of these ideas was easier by just building specialized hardware. Minsky also describes the implementation of a certain math game by a network of resistors as designed by Claude Shannon (who I was happy to hear mentioned in the season 1 finale of &lt;a href=&#34;http://www.imdb.com/title/tt2575988/&#34;&gt;Silicon Valley&lt;/a&gt;):&lt;/p&gt;
&lt;img id=&#34;id104068&#34; src=&#34;https://www.bobdc.com/img/main/shannonresistors.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34;/&gt;
&lt;p&gt;Minsky&amp;rsquo;s paper also references the work of B.F. Skinner, of &lt;a href=&#34;https://en.wikipedia.org/wiki/Operant_conditioning_chamber&#34;&gt;Skinner box&lt;/a&gt; fame, when describing reinforcement learning, and it cites Noam Chomsky when describing inductive learning. I mention these two together because this past week I also read an interview that took place just three years ago titled &lt;a href=&#34;http://www.theatlantic.com/technology/archive/2012/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/?single_page=true&#34;&gt;Noam Chomsky on Where Artificial Intelligence Went Wrong&lt;/a&gt;. Describing those early days of AI research, the interview&amp;rsquo;s introduction tells us how&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Some of McCarthy&amp;rsquo;s colleagues in neighboring departments, however, were more interested in how intelligence is implemented in humans (and other animals) first. Noam Chomsky and others worked on what became cognitive science, a field aimed at uncovering the mental representations and rules that underlie our perceptual and cognitive abilities. Chomsky and his colleagues had to overthrow the then-dominant paradigm of behaviorism, championed by Harvard psychologist B.F. Skinner, where animal behavior was reduced to a simple set of associations between an action and its subsequent reward or punishment. The undoing of Skinner&amp;rsquo;s grip on psychology is commonly marked by Chomsky&amp;rsquo;s 1959 critical review of Skinner&amp;rsquo;s book &lt;a href=&#34;http://www.chomsky.info/articles/1967----.htm&#34;&gt;Verbal Behavior&lt;/a&gt;, a book in which Skinner attempted to explain linguistic ability using behaviorist principles.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The introduction goes on to describe a 2011 symposium at MIT on &amp;ldquo;Brains, Minds and Machines,&amp;rdquo; which &amp;ldquo;was meant to inspire multidisciplinary enthusiasm for the revival of the scientific question from which the field of artificial intelligence originated: how does intelligence work?&amp;rdquo;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Noam Chomsky, speaking in the symposium, wasn&amp;rsquo;t so enthused. Chomsky critiqued the field of AI for adopting an approach reminiscent of behaviorism, except in more modern, computationally sophisticated form. Chomsky argued that the field&amp;rsquo;s heavy use of statistical techniques to pick regularities in masses of data is unlikely to yield the explanatory insight that science ought to offer. For Chomsky, the &amp;ldquo;new AI&amp;rdquo; — focused on using statistical learning techniques to better mine and predict data — is unlikely to yield general principles about the nature of intelligent beings or about cognition.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The whole interview is worth reading. I&amp;rsquo;m not saying that I completely agree with Chomsky or completely disagree (as Google&amp;rsquo;s Peter Norvig has in an essay that has the excellent URL &lt;a href=&#34;http://norvig.com/chomsky.html&#34;&gt;http://norvig.com/chomsky.html&lt;/a&gt; but gets a little ad hominem when he starts comparing Chomsky to Bill O&amp;rsquo;Reilly), only that Minsky&amp;rsquo;s 1960 paper and Chomsky&amp;rsquo;s 2012 interview, taken together, provide a good perspective on where AI came from and the path it took to the roles it plays today.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll close with this nice quote from a discussion in Minsky&amp;rsquo;s paper of what exactly &amp;ldquo;intelligence&amp;rdquo; is and whether machines are capable of it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Programmers, too, know that there is never any &amp;ldquo;heart&amp;rdquo; in a program. There are high-level routines in each program, but all they do is dictate that &amp;ldquo;if such-and-such, then transfer to such-and-such a subroutine.&amp;rdquo; And when we look at the low-level subroutines, which &amp;ldquo;actually do the work,&amp;rdquo; we find senseless loops and sequences of trivial operations, merely carrying out the dictates of their superiors. The intelligence in such a system seems to be as intangible as becomes the meaning of a single common word when it is thoughtfully pronounced over and over again.&lt;/p&gt;
&lt;/blockquote&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
      <category domain="https://www.bobdc.com//categories/ai-and-machine-learning">AI and machine learning</category>
      
    </item>
    
    <item>
      <title>SPARQL: the video</title>
      <link>https://www.bobdc.com/blog/sparql-the-video/</link>
      <pubDate>Sun, 03 May 2015 16:15:07 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-the-video/</guid>
      
      
      <description><div>Well, a video, but a lot of important SPARQL basics in a short period of time.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=FvGndkpa4K0&#34;&gt;&lt;img id=&#34;id146827&#34; src=&#34;https://www.bobdc.com/img/main/sparqlvideo1still.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;SPARQL in 11 minutes&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;While doing training for a TopQuadrant customer recently, the schedule led to my having ten minutes to explain the basics of writing SPARQL queries. I think I did OK, but on the plane home I thought harder about what to put in those ten minutes, which led to my making the video &lt;a href=&#34;https://www.youtube.com/watch?v=FvGndkpa4K0&#34;&gt;SPARQL in 11 minutes&lt;/a&gt;. While the video is 11 minutes and 14 seconds long, between the opening part about RDF and the plug for &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; at the end, the SPARQL introduction is less than eight minutes.&lt;/p&gt;
&lt;p&gt;After explaining what RDF triples are and how they&amp;rsquo;re represented in Turtle, the video walks through some simple SELECT queries and how they work with the data. This leads up to a CONSTRUCT query and a list of other things that people will find useful if they learn more about SPARQL. I had a lot of fun making the video&amp;rsquo;s &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/KorgMonotronPanningStereo.mp3&#34;&gt;SPARQL engine noise&lt;/a&gt; with my &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=B00684KFFW/bobducharmeA/&#34;&gt;Korg Monotron synthesizer&lt;/a&gt; and also making more traditional music for the introduction and ending.&lt;/p&gt;
&lt;p&gt;I hope this video is helpful for people who are new to SPARQL. The other SPARQL videos on YouTube are mostly real-time classroom lectures. My favorite is an ad for what seems like a Dutch cable TV provider that has nothing to do with the query language but has the excellent domain name &lt;a href=&#34;http://www.sparql.nl&#34;&gt;sparql.nl&lt;/a&gt;. If you skip ahead to 1:03 of &lt;a href=&#34;https://www.youtube.com/watch?v=BZxMJ5s3WiU#t=1m03s&#34;&gt;this ad&lt;/a&gt; for the company, you&amp;rsquo;ll see a finger snap turn into a swirl of flames and then their shining &amp;ldquo;sparql&amp;rdquo; logo, all with the most dramatic music possible. My production values were not quite that high, but higher than most of the other SPARQL videos you&amp;rsquo;ll find on YouTube.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=BZxMJ5s3WiU#t=1m03s&#34;&gt;&lt;img id=&#34;id146706&#34; src=&#34;https://www.bobdc.com/img/main/sparqlnl.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; width=&#34;240&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
    </item>
    
    <item>
      <title>Running Spark GraphX algorithms on Library of Congress subject heading SKOS</title>
      <link>https://www.bobdc.com/blog/running-spark-graphx-algorithm/</link>
      <pubDate>Sun, 12 Apr 2015 09:55:45 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/running-spark-graphx-algorithm/</guid>
      
      
      <description><div>Well, one algorithm, but a very cool one.</div><div>&lt;img id=&#34;id116247&#34; src=&#34;https://www.bobdc.com/img/main/GraphXLoCSKOS.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;160&#34; alt=&#34;GraphX LoC SKOS logos&#34;/&gt;
&lt;p&gt;&lt;em&gt;(This blog entry has also been published on the &lt;a href=&#34;https://databricks.com/blog/2015/04/14/running-spark-graphx-algorithms-on-library-of-congress-subject-heading-skos.html&#34;&gt;databricks company blog&lt;/a&gt;.)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Last month, in &lt;a href=&#34;https://www.bobdc.com/blog/spark-and-sparql-rdf-graphs-an&#34;&gt;Spark and SPARQL; RDF Graphs and GraphX&lt;/a&gt;, I described how Apache Spark has emerged as a more efficient alternative to MapReduce for distributing computing jobs across clusters. I also described how Spark&amp;rsquo;s GraphX library lets you do this kind of computing on graph data structures and how I had some ideas for using it with RDF data. My goal was to use RDF technology on GraphX data and vice versa to demonstrate how they could help each other, and I demonstrated the former with a Scala program that output some GraphX data as RDF and then showed some SPARQL queries to run on that RDF.&lt;/p&gt;
&lt;p&gt;Today I&amp;rsquo;m demonstrating the latter by reading in a well-known RDF dataset and executing GraphX&amp;rsquo;s Connected Components algorithm on it. This algorithm collects nodes into groupings that connect to each other but not to any other nodes. In classic Big Data scenarios, this helps applications perform tasks such as the identification of subnetworks of people within larger networks, giving clues about which products or cat videos to suggest to those people based on what their friends liked.&lt;/p&gt;
&lt;p&gt;The US Library of Congress has been working on their &lt;a href=&#34;http://id.loc.gov/authorities/subjects.html&#34;&gt;Subject Headings&lt;/a&gt; metadata since 1898, and it&amp;rsquo;s available in SKOS RDF. Many of the subjects include &amp;ldquo;related&amp;rdquo; values; for example, you can see that the subject &lt;a href=&#34;http://id.loc.gov/authorities/subjects/sh85027617.html&#34;&gt;Cocktails&lt;/a&gt; has related values of &lt;a href=&#34;http://id.loc.gov/authorities/subjects/sh85027615.html&#34;&gt;Cocktail parties&lt;/a&gt; and &lt;a href=&#34;http://id.loc.gov/authorities/subjects/sh2009010761.html&#34;&gt;Happy hours&lt;/a&gt;, and that Happy hours has related values of &lt;a href=&#34;http://id.loc.gov/authorities/subjects/sh93000452.html&#34;&gt;Bars (Drinking establishments)&lt;/a&gt;, &lt;a href=&#34;http://id.loc.gov/authorities/subjects/sh85113249.html&#34;&gt;Restaurants&lt;/a&gt;, and Cocktails. So, while it includes skos:related triples that indirectly link Cocktails to Restaurants, it has none that link these to the subject of &lt;a href=&#34;http://id.loc.gov/authorities/subjects/sh85125961.html&#34;&gt;Space stations&lt;/a&gt;, so the Space stations subject is not part of the same Connected Components subgraph as the Cocktails subject.&lt;/p&gt;
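&lt;p&gt;These &lt;code&gt;skos:related&lt;/code&gt; links are ordinary triples, so before bringing GraphX into it you can explore them with plain SPARQL. For example, a query along these lines (assuming the downloaded SKOS data has been loaded into some SPARQL engine) lists the subjects directly related to Cocktails:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;

SELECT ?relatedLabel
WHERE {
  # sh85027617 is the &amp;quot;Cocktails&amp;quot; subject heading.
  &amp;lt;http://id.loc.gov/authorities/subjects/sh85027617&amp;gt;
      skos:related ?related .
  ?related skos:prefLabel ?relatedLabel .
}
&lt;/code&gt;&lt;/pre&gt;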
&lt;p&gt;After reading the Library of Congress Subject Header RDF into a GraphX graph and running the Connected Components algorithm on the skos:related connections, here are some of the groupings I found near the beginning of the output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;Hiding places&amp;quot; 
&amp;quot;Secrecy&amp;quot; 
&amp;quot;Loneliness&amp;quot; 
&amp;quot;Solitude&amp;quot; 
&amp;quot;Privacy&amp;quot; 
--------------------------
&amp;quot;Cocktails&amp;quot; 
&amp;quot;Bars (Drinking establishments)&amp;quot; 
&amp;quot;Cocktail parties&amp;quot; 
&amp;quot;Restaurants&amp;quot; 
&amp;quot;Happy hours&amp;quot; 
--------------------------
&amp;quot;Space stations&amp;quot; 
&amp;quot;Space colonies&amp;quot; 
&amp;quot;Large space structures (Astronautics)&amp;quot; 
&amp;quot;Extraterrestrial bases&amp;quot; 
--------------------------
&amp;quot;Inanna (Sumerian deity)&amp;quot; 
&amp;quot;Ishtar (Assyro-Babylonian deity)&amp;quot; 
&amp;quot;Astarte (Phoenician deity)&amp;quot; 
--------------------------
&amp;quot;Cross-cultural orientation&amp;quot; 
&amp;quot;Cultural competence&amp;quot; 
&amp;quot;Multilingual communication&amp;quot; 
&amp;quot;Intercultural communication&amp;quot; 
&amp;quot;Technical assistance--Anthropological aspects&amp;quot; 
--------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(You can find the &lt;a href=&#34;http://snee.com/bobdc.blog/files/readLoCSH.out&#34;&gt;complete output here&lt;/a&gt;, a 565K file.) People working with RDF-based applications already know that this kind of data can help to enhance search. For example, someone searching for media about &amp;ldquo;Space stations&amp;rdquo; will probably also be interested in media filed under &amp;ldquo;Space colonies&amp;rdquo; and &amp;ldquo;Extraterrestrial bases&amp;rdquo;. This data can also help other applications, and now, it can help distributed applications that use Spark.&lt;/p&gt;
&lt;h2 id=&#34;id116235&#34;&gt;Storing RDF in GraphX data structures&lt;/h2&gt;
&lt;p&gt;First, as I mentioned in the earlier blog entry, GraphX development currently means coding with the Scala programming language, so I have been learning Scala. My old friend from XML days &lt;a href=&#34;http://www.contakt.org/&#34;&gt;Tony Coates&lt;/a&gt; wrote &lt;a href=&#34;http://www.contakt.org/Blog/Post/13/A-Scala-API-for-RDF-Processing&#34;&gt;A Scala API for RDF Processing&lt;/a&gt;, which takes better advantage of native Scala data structures than I ever could, and the &lt;a href=&#34;https://github.com/w3c/banana-rdf&#34;&gt;banana-rdf Scala library&lt;/a&gt; also looks interesting, but although I was using Scala, my main interest was storing RDF in Spark GraphX data structures, not in Scala particularly.&lt;/p&gt;
&lt;p&gt;The basic Spark data structure is the Resilient Distributed Dataset, or RDD. The graph data structure used by GraphX is a combination of an RDD for vertices and one for edges. Each of these RDDs can have additional information; the Spark website&amp;rsquo;s &lt;a href=&#34;https://spark.apache.org/docs/1.1.1/graphx-programming-guide.html#example-property-graph&#34;&gt;Example Property Graph&lt;/a&gt; includes (name, role) pairs with its vertices and descriptive property strings with its edges. The obvious first step for storing RDF in a GraphX graph would be to store predicates in the edges RDD, subjects and resource objects in the vertices RDD, and literal properties as extra information in these RDDs like the (name, role) pairs and edge description strings in the Spark website&amp;rsquo;s Example Property Graph.&lt;/p&gt;
&lt;p&gt;But, as I also wrote last time, a hardcore RDF person would ask &lt;a href=&#34;https://www.bobdc.com/blog/spark-and-sparql-rdf-graphs-an#id106263&#34;&gt;these questions&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;What about properties of edges? For example, what if I wanted to say that an &lt;code&gt;xp:advisor&lt;/code&gt; property was an &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; the Dublin Core property &lt;code&gt;dc:contributor&lt;/code&gt;?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The ability to assign properties such as a name of &amp;ldquo;rxin&amp;rdquo; and a role of &amp;ldquo;student&amp;rdquo; to a node like 3L is nice, but what if I don&amp;rsquo;t have a consistent set of properties that will be assigned to every node—for example, if I&amp;rsquo;ve aggregated person data from two different sources that don&amp;rsquo;t use all the same properties to describe these persons?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Example Property Graph can store these (name, role) pairs with the vertices because that RDD is declared as &lt;code&gt;RDD[(VertexId, (String, String))]&lt;/code&gt;. Each vertex will have two strings stored with it; no more and no less. It&amp;rsquo;s a data structure, but you can also think of it as a prescriptive schema, and the second bullet above is asking how to get around that.&lt;/p&gt;
&lt;p&gt;I got around both issues by storing the data in three data structures—the two RDDs described above and one more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For the vertex RDD, along with the required long integer that must be stored as each vertex&amp;rsquo;s identifier, I only stored one extra piece of information: the URI associated with that RDF resource. I did this for the subjects, the predicates (which may not be &amp;ldquo;vertices&amp;rdquo; in the GraphX sense of the word, but damn it, they&amp;rsquo;re resources that can be the subjects or objects of triples if I want them to), and the relevant objects. After reading the triple { &lt;code&gt;&amp;lt;http://id.loc.gov/authorities/subjects/sh85027617&amp;gt; &amp;lt;http://www.w3.org/2004/02/skos/core#related&amp;gt; &amp;lt;http://id.loc.gov/authorities/subjects/sh2009010761&amp;gt;&lt;/code&gt;} from the Library of Congress data, the program will create three vertices in this RDD whose node identifiers might be 1L, 2L, and 3L, with each of the triple&amp;rsquo;s URIs stored with one of these RDD vertices.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For the edge RDD, along with the required two long integers identifying the vertices at the start and end of the edge, each of my edges also stores the URI of the relevant predicate as the &amp;ldquo;description&amp;rdquo; of the edge. The edge for the triple above would be (1L, 3L, &lt;code&gt;http://www.w3.org/2004/02/skos/core#related&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To augment the graph data structure created from the two RDDs above, I created a third RDD to store literal property values. Each entry stores the long integer representing the vertex of the resource that has the property, a long integer representing the property (the integer assigned to that property in the vertex RDD), and a string representing the property value. For the triple { &lt;code&gt;&amp;lt;http://id.loc.gov/authorities/subjects/sh2009010761&amp;gt; &amp;lt;http://www.w3.org/2004/02/skos/core#prefLabel&amp;gt; &amp;quot;Happy hours&amp;quot;&lt;/code&gt;} it might store (3L, 4L, &amp;ldquo;Happy hours&amp;rdquo;), assuming that 4L had been stored as the internal identifier for the skos:prefLabel property. To run the Connected Components algorithm and then output the preferred label of each member of each subgraph, I didn&amp;rsquo;t need this RDD, but it does open up many possibilities for what you can do with RDF in a Spark GraphX program.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;id118737&#34;&gt;Creating a report on Library of Congress Subject Heading connected components&lt;/h2&gt;
&lt;p&gt;After loading up these data structures (plus another one that allows quick lookups of preferred labels) my program below applies the GraphX Connected Components algorithm to the subset of the graph that uses the skos:related property to connect vertices such as &amp;ldquo;Cocktails&amp;rdquo; and &amp;ldquo;Happy hours&amp;rdquo;. Iterating through the results, it uses them to load a hash map with a list for each subgraph of connected components. Then, it goes through each of these lists, printing the label associated with each member of each subgraph and a string of hyphens to show where each list ends, as you can see in the excerpt above.&lt;/p&gt;
&lt;p&gt;I won&amp;rsquo;t go into more detail about what&amp;rsquo;s in my program because I commented it pretty heavily. (I do have to thank my friend Tony, mentioned above, for helping me past one point where I was stuck on a Scala scoping issue. Also, as I&amp;rsquo;ve warned before, my coding style will probably make experienced Scala programmers choke on their Red Bull. I&amp;rsquo;d be happy to hear about suggested improvements.)&lt;/p&gt;
&lt;p&gt;After getting the program to run properly with a small subset of the data, I ran it on the 1 GB subjects-skos-2014-0306.nt file that I downloaded from the Library of Congress with its 7,705,147 triples. Spark lets applications scale up by giving you an infrastructure to distribute program execution across multiple machines, but the 8GB on my single machine wasn&amp;rsquo;t enough to run this, so I used two grep commands to create a version of the data that only had the skos:related and skos:prefLabel triples. At this point I had a total of 439,430 triples. Because my code didn&amp;rsquo;t account for blank nodes, I removed the 385 triples that used them, leaving 439,045 to work with in a 60MB file. This ran successfully and you can follow the link shown earlier to see the complete output.&lt;/p&gt;
&lt;h2 id=&#34;id118755&#34;&gt;Other GraphX algorithms to run on your RDF data&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/graphx-programming-guide.html#graph-algorithms&#34;&gt;Other GraphX algorithms&lt;/a&gt; besides Connected Components include Page Rank and Triangle Counting. &lt;a href=&#34;http://en.wikipedia.org/wiki/Graph_theory&#34;&gt;Graph theory&lt;/a&gt; is an interesting world, in which my favorite phrase so far is &amp;ldquo;&lt;a href=&#34;http://en.wikipedia.org/wiki/Strangulated_graph&#34;&gt;strangulated graph&lt;/a&gt;&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;One of the greatest things about RDF and Linked Data technology is the &lt;a href=&#34;http://linkeddata.org/&#34;&gt;growing amount&lt;/a&gt; of interesting data being made publicly available, and with new tools such as these algorithms to work with this data—tools that can be run on inexpensive, scalable clusters faster than typical Hadoop MapReduce jobs—there are a lot of great possibilities.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;//////////////////////////////////////////////////////////////////
// readLoCSH.scala: read Library of Congress Subject Headings into
// Spark GraphX graph and apply connectedComponents algorithm to those
// connected by skos:related property.


import scala.io.Source 
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer
import scala.collection.mutable.HashMap


object readLoCSH {


    val componentLists = HashMap[VertexId, ListBuffer[VertexId]]()
    val prefLabelMap =  HashMap[VertexId, String]()


    def main(args: Array[String]) {
        val sc = new SparkContext(&amp;quot;local&amp;quot;, &amp;quot;readLoCSH&amp;quot;, &amp;quot;127.0.0.1&amp;quot;)


        // regex pattern for end of triple
        val tripleEndingPattern = &amp;quot;&amp;quot;&amp;quot;\s*\.\s*$&amp;quot;&amp;quot;&amp;quot;.r    
        // regex pattern for language tag
        val languageTagPattern = &amp;quot;@[\\w-]+&amp;quot;.r    


        // Parameters of GraphX Edge are subject, object, and predicate
        // identifiers. RDF traditionally does (s, p, o) order but in GraphX
        // it&#39;s (edge start node, edge end node, edge description).


        // Scala beginner hack: I couldn&#39;t figure out how to declare an empty
        // array of Edges and then append Edges to it (or how to declare it
        // as a mutable ArrayBuffer, which would have been even better), but I
        // can append to an array started like the following, and will remove
        // the first Edge when creating the RDD.


        var edgeArray = Array(Edge(0L,0L,&amp;quot;http://dummy/URI&amp;quot;))
        var literalPropsTriplesArray = new Array[(Long,Long,String)](0)
        var vertexArray = new Array[(Long,String)](0)


        // Read the Library of Congress n-triples file
        //val source = Source.fromFile(&amp;quot;sampleSubjects.nt&amp;quot;,&amp;quot;UTF-8&amp;quot;)  // shorter for testing
        val source = Source.fromFile(&amp;quot;PrefLabelAndRelatedMinusBlankNodes.nt&amp;quot;,&amp;quot;UTF-8&amp;quot;)


        val lines = source.getLines.toArray


        // When parsing the data we read, use this map to check whether each
        // URI has come up before.
        var vertexURIMap = new HashMap[String, Long];


        // Parse the data into triples.
        var triple = new Array[String](3)
        var nextVertexNum = 0L
        for (i &amp;lt;- 0 until lines.length) {
            // Space in next line needed for line after that. 
            lines(i) = tripleEndingPattern.replaceFirstIn(lines(i),&amp;quot; &amp;quot;)  
            triple = lines(i).mkString.split(&amp;quot;&amp;gt;\\s+&amp;quot;)       // split on &amp;quot;&amp;gt; &amp;quot;
            // Variables have the word &amp;quot;triple&amp;quot; in them because &amp;quot;object&amp;quot; 
            // by itself is a Scala keyword.
            val tripleSubject = triple(0).substring(1)   // substring() call
            val triplePredicate = triple(1).substring(1) // to remove &amp;quot;&amp;lt;&amp;quot;
            if (!(vertexURIMap.contains(tripleSubject))) {
                vertexURIMap(tripleSubject) = nextVertexNum
                nextVertexNum += 1
            }
            if (!(vertexURIMap.contains(triplePredicate))) {
                vertexURIMap(triplePredicate) = nextVertexNum
                nextVertexNum += 1
            }
            val subjectVertexNumber = vertexURIMap(tripleSubject)
            val predicateVertexNumber = vertexURIMap(triplePredicate)


            // If the first character of the third part is a &amp;lt;, it&#39;s a URI;
            // otherwise, a literal value. (Needs more code to account for
            // blank nodes.)
            if (triple(2)(0) == &#39;&amp;lt;&#39;) { 
                val tripleObject = triple(2).substring(1)   // Lose that &amp;lt;.
                if (!(vertexURIMap.contains(tripleObject))) {
                    vertexURIMap(tripleObject) = nextVertexNum
                    nextVertexNum += 1
                }
                val objectVertexNumber = vertexURIMap(tripleObject)
                edgeArray = edgeArray :+
                    Edge(subjectVertexNumber,objectVertexNumber,triplePredicate)
            }
            else {
                literalPropsTriplesArray = literalPropsTriplesArray :+
                    (subjectVertexNumber,predicateVertexNumber,triple(2))
            }
        }


        // Switch value and key for vertexArray that we&#39;ll use to create the
        // GraphX graph.
        for ((k, v) &amp;lt;- vertexURIMap) vertexArray = vertexArray :+  (v, k)   


        // We&#39;ll be looking up a lot of prefLabels, so create a hashmap for them. 
        for (i &amp;lt;- 0 until literalPropsTriplesArray.length) {
            if (literalPropsTriplesArray(i)._2 ==
                vertexURIMap(&amp;quot;http://www.w3.org/2004/02/skos/core#prefLabel&amp;quot;)) {
                // Lose the language tag.
                val prefLabel =
                    languageTagPattern.replaceFirstIn(literalPropsTriplesArray(i)._3,&amp;quot;&amp;quot;)
                prefLabelMap(literalPropsTriplesArray(i)._1) = prefLabel;
            }
        }


        // Create RDDs and Graph from the parsed data.


        // vertexRDD Long: the GraphX longint identifier. String: the URI.
        val vertexRDD: RDD[(Long, String)] = sc.parallelize(vertexArray)


        // edgeRDD String: the URI of the triple predicate. Trimming off the
        // first Edge in the array because it was only used to initialize it.
        val edgeRDD: RDD[Edge[(String)]] =
            sc.parallelize(edgeArray.slice(1,edgeArray.length))


        // literalPropsTriples Long, Long, and String: the subject and predicate
        // vertex numbers and the literal value that the predicate is
        // associating with the subject.
        val literalPropsTriplesRDD: RDD[(Long,Long,String)] =
            sc.parallelize(literalPropsTriplesArray)


        val graph: Graph[String, String] = Graph(vertexRDD, edgeRDD)


        // Create a subgraph based on the vertices connected by SKOS &amp;quot;related&amp;quot;
        // property.
        val skosRelatedSubgraph =
            graph.subgraph(t =&amp;gt; t.attr ==
                           &amp;quot;http://www.w3.org/2004/02/skos/core#related&amp;quot;)


        // Find connected components  of skosRelatedSubgraph.
        val ccGraph = skosRelatedSubgraph.connectedComponents() 


        // Fill the componentLists hashmap.
        skosRelatedSubgraph.vertices.leftJoin(ccGraph.vertices) {
        case (id, u, comp) =&amp;gt; comp.get
        }.foreach
        { case (id, startingNode) =&amp;gt; 
          {
              // Add id to the list of components with a key of comp.get
              if (!(componentLists.contains(startingNode))) {
                  componentLists(startingNode) = new ListBuffer[VertexId]
              }
              componentLists(startingNode) += id
          }
        }


        // Output a report on the connected components. 
        println(&amp;quot;------  connected components in SKOS \&amp;quot;related\&amp;quot; triples ------\n&amp;quot;)
        for ((component, componentList) &amp;lt;- componentLists){
            if (componentList.size &amp;gt; 1) { // don&#39;t bother with lists of only 1
                for(c &amp;lt;- componentList) {
                    println(prefLabelMap(c));
                }
                println(&amp;quot;--------------------------&amp;quot;)
            }
        }


        sc.stop
    }
}
&lt;/code&gt;&lt;/pre&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
    </item>
    
    <item>
      <title>Spark and SPARQL; RDF Graphs and GraphX</title>
      <link>https://www.bobdc.com/blog/spark-and-sparql-rdf-graphs-an/</link>
      <pubDate>Sun, 29 Mar 2015 12:24:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/spark-and-sparql-rdf-graphs-an/</guid>
      
      
      <description><div>Some interesting possibilities for working together.</div><div>&lt;img id=&#34;id104233&#34; src=&#34;https://www.bobdc.com/img/main/graphxrdf.png&#34; width=&#34;160&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;GraphX and RDF logos&#34;/&gt;
&lt;p&gt;In &lt;a href=&#34;http://ibmdatamag.com/2015/03/spark-is-the-new-black/&#34;&gt;Spark Is the New Black&lt;/a&gt; in IBM Data Magazine, I recently wrote about how popular the Apache Spark framework is for both Hadoop and non-Hadoop projects these days, and how for many people it goes so far as to replace one of Hadoop&amp;rsquo;s fundamental components: MapReduce. (I still have trouble writing &amp;ldquo;Spar&amp;rdquo; without writing &amp;ldquo;ql&amp;rdquo; after it.) While waiting for that piece to be copyedited, I came across &lt;a href=&#34;http://svds.com/post/5-reasons-why-spark-matters-business&#34;&gt;5 Reasons Why Spark Matters to Business&lt;/a&gt; by my old XML.com editor Edd Dumbill and &lt;a href=&#34;http://www.infoworld.com/article/2897287/big-data/5-reasons-to-turn-to-spark-for-big-data-analytics.html&#34;&gt;5 reasons to turn to Spark for big data analytics&lt;/a&gt; in InfoWorld, giving me a total of 10 reasons that Spark&amp;hellip; is getting hotter.&lt;/p&gt;
&lt;p&gt;I originally became interested in Spark because one of its key libraries is &lt;a href=&#34;http://spark.apache.org/graphx/&#34;&gt;GraphX&lt;/a&gt;, Spark&amp;rsquo;s API for working with graphs of nodes and arcs. The &amp;ldquo;GraphX: Unifying Data-Parallel and Graph-Parallel Analytics&amp;rdquo; paper by GraphX&amp;rsquo;s inventors (&lt;a href=&#34;https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf&#34;&gt;pdf&lt;/a&gt;) has a whole section on RDF as related work, saying &amp;ldquo;we adopt some of the core ideas from the RDF work including the triples view of graphs.&amp;rdquo; The possibility of using such a hot new Big Data technology with RDF was intriguing, so I decided to look into it.&lt;/p&gt;
&lt;p&gt;I thought it would be interesting to output a typical GraphX graph as RDF so that I could perform SPARQL queries on it that were not typical of GraphX processing, and then to go the other way: read a good-sized RDF dataset into GraphX and do things with it that would not be typical of SPARQL processing. I have had some success at both, so I think that RDF and GraphX systems have much to offer each other.&lt;/p&gt;
&lt;p&gt;This wouldn&amp;rsquo;t have been very difficult if I hadn&amp;rsquo;t been learning the Scala programming language as I went along, but GraphX libraries are not available for Python or Java yet, so what you see below is essentially my first Scala program. A huge help in my attempts to learn Scala, Spark, and GraphX were the &lt;a href=&#34;https://www.sics.se/~amir/files/download/dic/&#34;&gt;class handouts&lt;/a&gt; of Swedish Institute of Computer Science senior researcher Amir H. Payberah. I just stumbled across them in some web searches while trying to get a Scala GraphX program to compile, and his PDFs introducing &lt;a href=&#34;https://www.sics.se/~amir/files/download/dic/scala.pdf&#34;&gt;Scala&lt;/a&gt;, &lt;a href=&#34;https://www.sics.se/~amir/files/download/dic/spark.pdf&#34;&gt;Spark&lt;/a&gt;, and &lt;a href=&#34;https://www.sics.se/~amir/files/download/dic/graph_processing.pdf&#34;&gt;graph processing&lt;/a&gt; (especially the GraphX parts) lit a lot of &amp;ldquo;a-ha&amp;rdquo; lightbulbs for me, even though I had already looked through several introductions to Scala and Spark. He has since encouraged me to share the link to course materials for his &lt;a href=&#34;https://www.sics.se/~amir/cloud14/&#34;&gt;current course on cloud computing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While I had a general idea of &lt;em&gt;how&lt;/em&gt; &lt;a href=&#34;http://en.wikipedia.org/wiki/Functional_programming&#34;&gt;functional programming languages&lt;/a&gt; worked, one of the lightbulbs that Dr. Payberah&amp;rsquo;s work lit for me was &lt;em&gt;why&lt;/em&gt; they&amp;rsquo;re valuable, at least in the case of using Spark from Scala: Spark provides higher-order functions that can hand off your own functions and data to structures that can be stored in distributed memory. This allows the kinds of interactive and iterative (for example, machine learning) tasks that generally don&amp;rsquo;t work well with Hadoop&amp;rsquo;s batch-oriented MapReduce model. Apparently, for tasks that would work fine with MapReduce, Spark versions also run much faster because their better use of memory lets them avoid all the disk I/O that is typical of MapReduce jobs.&lt;/p&gt;
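Plain Scala collections make the higher-order-function idea concrete. This little stand-in example is mine, not from any Spark code, but it has the same shape as Spark&amp;rsquo;s RDD operations, which apply your functions across a cluster instead of a local list:

```scala
// Higher-order functions: map and reduce take our own functions as
// arguments and apply them across the data. Spark's RDDs offer the
// same operations, distributed across a cluster.
object HigherOrderSketch {
  // Our own function, handed off to the higher-order map function.
  def double(n: Int): Int = n * 2

  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4)
    val doubled = data.map(double)    // List(2, 4, 6, 8)
    val total = doubled.reduce(_ + _) // 20
    println(s"doubled: $doubled total: $total")
  }
}
```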
&lt;p&gt;Spark lets you use this distributed memory by providing a data structure called a Resilient Distributed Dataset, or RDD. When you store your data in RDDs, you can let Spark take care of their distribution across a computing cluster. GraphX lets you store a set of nodes, arcs, and—crucially for us RDF types—extra information about each in RDDs. To output a &amp;ldquo;typical&amp;rdquo; GraphX graph structure as RDF, I took the &lt;a href=&#34;https://spark.apache.org/docs/1.1.1/graphx-programming-guide.html#example-property-graph&#34;&gt;Example Property Graph&lt;/a&gt; example in the Apache Spark &lt;a href=&#34;https://spark.apache.org/docs/1.1.1/graphx-programming-guide.html&#34;&gt;GraphX Programming Guide&lt;/a&gt; and expanded it a bit. (If experienced Scala programmers don&amp;rsquo;t gag when they see my program, they will in my next installment, where I show how I read RDF into GraphX RDDs. Corrections welcome.)&lt;/p&gt;
&lt;p&gt;My Scala program below, like the Example Property Graph mentioned above, creates an RDD called &lt;code&gt;users&lt;/code&gt; of nodes about people at a university and an RDD called &lt;code&gt;relationships&lt;/code&gt; that stores information about edges that connect the nodes. RDDs use long integers such as the 3L and 7L values shown below as identifiers for the nodes, and you&amp;rsquo;ll see that they can store additional information about nodes—for example, that node 3L is named &amp;ldquo;rxin&amp;rdquo; and has the title &amp;ldquo;student&amp;rdquo;—as well as additional information about edges—for example, that the user represented by 5L has an &amp;ldquo;advisor&amp;rdquo; relationship to user 3L. I added a few extra nodes and edges to give the eventual SPARQL queries a little more to work with.&lt;/p&gt;
&lt;p&gt;Once the node and edge RDDs are defined, the program creates a graph from them. After that, I added code to output RDF triples about node relationships to other nodes (or, in RDF parlance, object property triples) using a base URI that I defined at the top of the program to convert identifiers to URIs when necessary. This produced triples such as &lt;code&gt;&amp;lt;http://snee.com/xpropgraph#istoica&amp;gt; &amp;lt;http://snee.com/xpropgraph#colleague&amp;gt; &amp;lt;http://snee.com/xpropgraph#franklin&amp;gt;&lt;/code&gt; in the output. Finally, the program outputs non-relationship values (literal properties), producing triples such as &lt;code&gt;&amp;lt;http://snee.com/xpropgraph#rxin&amp;gt; &amp;lt;http://snee.com/xpropgraph#role&amp;gt; &amp;quot;student&amp;quot;&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD


object ExamplePropertyGraph {
    def main(args: Array[String]) {
        val baseURI = &amp;quot;http://snee.com/xpropgraph#&amp;quot;
    val sc = new SparkContext(&amp;quot;local&amp;quot;, &amp;quot;ExamplePropertyGraph&amp;quot;, &amp;quot;127.0.0.1&amp;quot;)


        // Create an RDD for the vertices
        val users: RDD[(VertexId, (String, String))] =
            sc.parallelize(Array(
                (3L, (&amp;quot;rxin&amp;quot;, &amp;quot;student&amp;quot;)),
                (7L, (&amp;quot;jgonzal&amp;quot;, &amp;quot;postdoc&amp;quot;)),
                (5L, (&amp;quot;franklin&amp;quot;, &amp;quot;prof&amp;quot;)),
                (2L, (&amp;quot;istoica&amp;quot;, &amp;quot;prof&amp;quot;)),
                // Following lines are new data
                (8L, (&amp;quot;bshears&amp;quot;, &amp;quot;student&amp;quot;)),
                (9L, (&amp;quot;nphelge&amp;quot;, &amp;quot;student&amp;quot;)),
                (10L, (&amp;quot;asmithee&amp;quot;, &amp;quot;student&amp;quot;)),
                (11L, (&amp;quot;rmutt&amp;quot;, &amp;quot;student&amp;quot;)),
                (12L, (&amp;quot;ntufnel&amp;quot;, &amp;quot;student&amp;quot;))
            ))
        // Create an RDD for edges
        val relationships: RDD[Edge[String]] =
            sc.parallelize(Array(
                Edge(3L, 7L, &amp;quot;collab&amp;quot;),
                Edge(5L, 3L, &amp;quot;advisor&amp;quot;),
                Edge(2L, 5L, &amp;quot;colleague&amp;quot;),
                Edge(5L, 7L, &amp;quot;pi&amp;quot;),
                // Following lines are new data
                Edge(5L, 8L, &amp;quot;advisor&amp;quot;),
                Edge(2L, 9L, &amp;quot;advisor&amp;quot;),
                Edge(5L, 10L, &amp;quot;advisor&amp;quot;),
                Edge(2L, 11L, &amp;quot;advisor&amp;quot;)
            ))
        // Build the initial Graph
        val graph = Graph(users, relationships)


        // Output object property triples
        graph.triplets.foreach( t =&amp;gt; println(
            s&amp;quot;&amp;lt;$baseURI${t.srcAttr._1}&amp;gt; &amp;lt;$baseURI${t.attr}&amp;gt; &amp;lt;$baseURI${t.dstAttr._1}&amp;gt; .&amp;quot;
        ))


        // Output literal property triples
        users.foreach(t =&amp;gt; println(
            s&amp;quot;&amp;quot;&amp;quot;&amp;lt;$baseURI${t._2._1}&amp;gt; &amp;lt;${baseURI}role&amp;gt; \&amp;quot;${t._2._2}\&amp;quot; .&amp;quot;&amp;quot;&amp;quot;
        ))


        sc.stop


    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The program writes out the RDF with full URIs for every resource, but I&amp;rsquo;m showing a Turtle version here that uses prefixes to help it fit on this page better:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix xp: &amp;lt;http://snee.com/xpropgraph#&amp;gt; . 


xp:istoica  xp:colleague xp:franklin .
xp:istoica  xp:advisor   xp:nphelge .
xp:istoica  xp:advisor   xp:rmutt .
xp:rxin     xp:collab    xp:jgonzal .
xp:franklin xp:advisor   xp:rxin .
xp:franklin xp:pi        xp:jgonzal .
xp:franklin xp:advisor   xp:bshears .
xp:franklin xp:advisor   xp:asmithee .
xp:rxin     xp:role      &amp;quot;student&amp;quot; .
xp:jgonzal  xp:role      &amp;quot;postdoc&amp;quot; .
xp:franklin xp:role      &amp;quot;prof&amp;quot; .
xp:istoica  xp:role      &amp;quot;prof&amp;quot; .
xp:bshears  xp:role      &amp;quot;student&amp;quot; .
xp:nphelge  xp:role      &amp;quot;student&amp;quot; .
xp:asmithee xp:role      &amp;quot;student&amp;quot; .
xp:rmutt    xp:role      &amp;quot;student&amp;quot; .
xp:ntufnel  xp:role      &amp;quot;student&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My first SPARQL query of the RDF asked this: for each person with advisees, how many do they have?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX xp: &amp;lt;http://snee.com/xpropgraph#&amp;gt;


SELECT ?person (COUNT(?advisee) AS ?advisees)
WHERE {
  ?person xp:advisor ?advisee
}
GROUP BY ?person
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is the result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;--------------------------
| person      | advisees |
==========================
| xp:franklin | 3        |
| xp:istoica  | 2        |
--------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next query asks about the roles of rxin&amp;rsquo;s collaborators:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX xp: &amp;lt;http://snee.com/xpropgraph#&amp;gt;


SELECT ?collaborator ?role
WHERE {
  xp:rxin xp:collab ?collaborator . 
  ?collaborator xp:role ?role . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As it turns out, there&amp;rsquo;s only one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;----------------------------
| collaborator | role      |
============================
| xp:jgonzal   | &amp;quot;postdoc&amp;quot; |
----------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Does nphelge have a relationship to any prof, and if so, who and what relationship?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX xp: &amp;lt;http://snee.com/xpropgraph#&amp;gt;


SELECT ?person ?relationship
WHERE {


  ?person xp:role &amp;quot;prof&amp;quot; . 


  { xp:nphelge ?relationship ?person }
  UNION
  { ?person ?relationship xp:nphelge }


}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here is our answer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-----------------------------
| person     | relationship |
=============================
| xp:istoica | xp:advisor   |
-----------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A hardcore RDF person will have two questions about the sample data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;What about properties of edges? For example, what if I wanted to say that an &lt;code&gt;xp:advisor&lt;/code&gt; property was an &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; the Dublin Core property &lt;code&gt;dc:contributor&lt;/code&gt;?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The ability to assign properties such as a name of &amp;ldquo;rxin&amp;rdquo; and a role of &amp;ldquo;student&amp;rdquo; to a node like 3L is nice, but what if I don&amp;rsquo;t have a consistent set of properties that will be assigned to every node—for example, if I&amp;rsquo;ve aggregated person data from two different sources that don&amp;rsquo;t use all the same properties to describe these persons?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Neither of those was difficult with GraphX, and next month I&amp;rsquo;ll show my approach. I&amp;rsquo;ll also show how I applied that approach to let a GraphX program read in any RDF and then perform GraphX operations on it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
    </item>
    
    <item>
      <title>Driving Hadoop data integration with standards-based models instead of code</title>
      <link>https://www.bobdc.com/blog/driving-hadoop-data-integratio/</link>
      <pubDate>Fri, 13 Feb 2015 13:43:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/driving-hadoop-data-integratio/</guid>
      
      
      <description><div>RDFS models!</div><div>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Note:&lt;/strong&gt; I wrote this blog entry to accompany the IBM Data Magazine piece mentioned in the first paragraph, so for people following the link from there this goes into a little more detail on what RDF, triples, and SPARQL are than I normally would on this blog. I hope that readers already familiar with these standards will find the parts about doing the inferencing on a Hadoop cluster interesting.&lt;/em&gt;&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/rdfhadoop.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;RDF and Hadoop logos&#34; /&gt;
&lt;p&gt;In a short piece in &lt;a href=&#34;http://ibmdatamag.com/&#34;&gt;IBM Data Magazine&lt;/a&gt; (migrated, since then, to the &lt;a href=&#34;http://www.ibmbigdatahub.com/&#34;&gt;IBM Big Data &amp;amp; Analytics Hub&lt;/a&gt;) titled &lt;a href=&#34;http://www.ibmbigdatahub.com/blog/scale-data-integration-data-models-and-inferencing&#34;&gt;Scale up Your Data Integration with Data Models and Inferencing&lt;/a&gt;, I give a high-level overview of why the use of W3C standards-based models can provide a more scalable alternative to using code-driven transformations when integrating data from multiple sources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When driving this process with code generated from models (instead of with the models themselves), the code grows more brittle as it evolves, and the original models turn into out-of-date system documentation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Mature commercial and open-source tools are available to infer, for example, that a &lt;code&gt;LastName&lt;/code&gt; value from one database and a &lt;code&gt;last_name&lt;/code&gt; value from another can both be treated as values of &lt;code&gt;FamilyName&lt;/code&gt; from a central canonical data model.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;After running such a conversion with these models, modifying the conversion to accommodate additional input data often means simply expanding the unifying model, with no need for new code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It can work on a Hadoop cluster with little more than a brief Python script to drive it all.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here, we&amp;rsquo;ll look at an example of &lt;em&gt;how&lt;/em&gt; this can work. I&amp;rsquo;m going to show how I used these techniques to integrate data from the SQL Server sample &lt;a href=&#34;http://northwinddatabase.codeplex.com/&#34;&gt;Northwind&lt;/a&gt; database&amp;rsquo;s &amp;ldquo;Employees&amp;rdquo; table with data from the Oracle sample &lt;a href=&#34;https://docs.oracle.com/cd/E11882_01/server.112/e40540/tablecls.htm#CBBJICEB&#34;&gt;HR&lt;/a&gt; database&amp;rsquo;s &amp;ldquo;EMPLOYEES&amp;rdquo; table. These use different names for similar properties, and we&amp;rsquo;ll identify the relationships between those properties in a model that uses a W3C standard modeling language. Next, a Python script will use this model to combine data from the two different employee tables into one dataset that conforms to a common model. Finally, we&amp;rsquo;ll see that a small addition to the model, with no new code added to the Python script, lets the script integrate additional data from the different databases. And, we&amp;rsquo;ll do this all on a Hadoop cluster.&lt;/p&gt;
&lt;h2 id=&#34;the-data-and-the-model&#34;&gt;The data and the model&lt;/h2&gt;
&lt;p&gt;RDF represents facts in three-part {entity, property name, property value} statements known as triples. We could, for example, say that employee 4 has a FirstName value of &amp;ldquo;Margaret&amp;rdquo;, but RDF requires that the entity and property name identifiers be URIs to ensure that they&amp;rsquo;re completely unambiguous. URIs usually look like URLs, but instead of being Uniform Resource Locators, they&amp;rsquo;re Uniform Resource Identifiers, merely identifying resources instead of naming a location for them. This means that while some of them might look like web addresses, pasting them into a web browser&amp;rsquo;s address bar won&amp;rsquo;t necessarily get you a web page. (RDF also encourages you to represent property values as URIs, making it easier to connect triples into graphs that can be traversed and queried. Doing this to connect triples from different sources is another area where RDF shines in data integration work.)&lt;/p&gt;
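The three-part shape of a triple is easy to picture as a record with three fields. This Scala sketch is purely illustrative (the case class and its names are mine, not part of any RDF tooling); it holds the employee 4 FirstName fact as full URIs:

```scala
// Illustrative only: a triple as a three-part record. The subject and
// predicate must be URIs; the object may be a URI or a literal value.
case class Triple(subject: String, predicate: String, obj: String)

object TripleExample {
  // The "employee 4 has a FirstName of Margaret" fact from the text.
  val margaret = Triple(
    "http://snee.com/vocab/SQLServerNorthwind#employees_4",
    "http://snee.com/vocab/schema/SQLServerNorthwind#employees_FirstName",
    "Margaret"
  )

  def main(args: Array[String]): Unit = println(margaret)
}
```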
&lt;p&gt;The use of domain names in URIs, as with Java package names, lets an organization control the naming conventions around their resources. When I used &lt;a href=&#34;http://d2rq.org/d2r-server&#34;&gt;D2R&lt;/a&gt;—an open source middleware tool that can extract data from popular relational database packages—to pull the employees tables from the Northwind and HR databases, I had it build identifiers around my own snee.com domain name. Doing this, it created entity-name-value triples such as &lt;code&gt;{&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_4&amp;gt; &amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_FirstName&amp;gt; &amp;quot;Margaret&amp;quot;}&lt;/code&gt;. A typical fact pulled out of the HR database was &lt;code&gt;{&amp;lt;http://snee.com/vocab/OracleHR#employees_191&amp;gt; &amp;lt;http://snee.com/vocab/schema/OracleHR#employees_first_name&amp;gt; &amp;quot;Randall&amp;quot;}&lt;/code&gt;, which tells us that employee 191 in that database has a first_name value of &amp;ldquo;Randall&amp;rdquo;. If the HR database also had an employee number 4 or used a column name of first_name, the use of the URIs would leave no question as to which employee or property was being referenced by each triple.&lt;/p&gt;
&lt;p&gt;It was simplest to have D2R pull the entire tables, so in addition to the first and last names of each employee, I had it pull all the other data in the Northwind and HR employee tables. To integrate this data, we&amp;rsquo;ll start with just the first and last names, and then we&amp;rsquo;ll see how easy it is to broaden the scope of our data integration.&lt;/p&gt;
&lt;p&gt;RDF offers several syntaxes for recording triples. &lt;a href=&#34;http://www.w3.org/TR/rdf-syntax-grammar/&#34;&gt;RDF/XML&lt;/a&gt; was the first to become standardized, but has fallen from popularity as simpler alternatives became available. The simplest syntax, called &lt;a href=&#34;http://www.w3.org/TR/n-triples/&#34;&gt;N-Triples&lt;/a&gt;, spells out one triple per line with full URIs and a period at the end, just like a sentence stating a fact would end with a period. Below you can see some of the data about employee 122 from the HREmployees.nt file that I pulled from the HR database&amp;rsquo;s employees table. (For this and the later N-Triples examples, I&amp;rsquo;ve added carriage returns to each line to more easily fit them here.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://snee.com/vocab/OracleHR#employees_122&amp;gt; 
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_department_id&amp;gt; 
&amp;lt;http://snee.com/vocab/OracleHR#departments_50&amp;gt; .

&amp;lt;http://snee.com/vocab/OracleHR#employees_122&amp;gt; 
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_first_name&amp;gt; &amp;quot;Payam&amp;quot; .

&amp;lt;http://snee.com/vocab/OracleHR#employees_122&amp;gt; 
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_hire_date&amp;gt; 
&amp;quot;1995-05-01&amp;quot;^^&amp;lt;http://www.w3.org/2001/XMLSchema#date&amp;gt; .

&amp;lt;http://snee.com/vocab/OracleHR#employees_122&amp;gt;
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_last_name&amp;gt; &amp;quot;Kaufling&amp;quot; .

&amp;lt;http://snee.com/vocab/OracleHR#employees_122&amp;gt; 
&amp;lt;http://snee.com/vocab/schema/OracleHR#employees_phone_number&amp;gt; &amp;quot;650.123.3234&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The NorthwindEmployees.nt file pulled by D2R represents the Northwind employees with the same syntax as the HREmployees.nt file but uses URIs appropriate for that data, with &amp;ldquo;SQLServerNorthwind&amp;rdquo; in their base URI instead of &amp;ldquo;OracleHR&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;For a target canonical integration model, I chose the &lt;a href=&#34;http://schema.org/docs/schema_org_rdfa.html&#34;&gt;schema.org&lt;/a&gt; model designed by a consortium of major search engines for the embedding of machine-readable data into web pages. The following shows the schemaOrgPersonSchema.ttl file, where I&amp;rsquo;ve stored an excerpt of the schema.org model describing the Person class using the W3C standard RDF Schema (RDFS) language. I&amp;rsquo;ve added carriage returns to some of the &lt;code&gt;rdfs:comment&lt;/code&gt; values to fit them here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix schema: &amp;lt;http://schema.org/&amp;gt; .
@prefix rdfs:   &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix dc:     &amp;lt;http://purl.org/dc/terms/&amp;gt; .
@prefix owl:    &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
@prefix rdf:    &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .

schema:Person a             rdfs:Class;
        rdfs:label          &amp;quot;Person&amp;quot;;
        dc:source           &amp;lt;http://www.w3.org/wiki/WebSchemas/SchemaDotOrgSources#source_rNews&amp;gt;;
        rdfs:comment        &amp;quot;A person (alive, dead, undead, or fictional).&amp;quot;;
        rdfs:subClassOf     schema:Thing;
        owl:equivalentClass &amp;lt;http://xmlns.com/foaf/0.1/Person&amp;gt; .

schema:familyName a           rdf:Property ;
        rdfs:comment          &amp;quot;Family name. In the U.S., the last name of an Person. 
          This can be used along with givenName instead of the Name property.&amp;quot; ;
        rdfs:label            &amp;quot;familyName&amp;quot; ;
        schema:domainIncludes schema:Person ;
        schema:rangeIncludes  schema:Text .

schema:givenName a           rdf:Property ;
       rdfs:comment          &amp;quot;Given name. In the U.S., the first name of a Person. 
         This can be used along with familyName instead of the Name property.&amp;quot; ;
       rdfs:label            &amp;quot;givenName&amp;quot; ;
       schema:domainIncludes schema:Person ;
       schema:rangeIncludes  schema:Text .

schema:telephone a           rdf:Property ;
       rdfs:comment          &amp;quot;The telephone number.&amp;quot; ;
       rdfs:label            &amp;quot;telephone&amp;quot; ;
       schema:domainIncludes schema:ContactPoint , schema:Organization , 
                             schema:Person , schema:Place ;
       schema:rangeIncludes  schema:Text .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the RDFS &amp;ldquo;language&amp;rdquo; is really just a set of properties and classes to use in describing data models, not a syntax. I could have done this with the N-Triples syntax mentioned earlier, but this excerpt from schema.org uses RDF&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/TR/turtle/&#34;&gt;Turtle&lt;/a&gt; syntax to describe the class and properties. Turtle is similar to N-Triples but offers a few shortcuts to reduce verbosity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You can declare prefixes to stand in for common parts of URIs, so that &lt;code&gt;rdfs:label&lt;/code&gt; means the same thing as &lt;code&gt;&amp;lt;http://www.w3.org/2000/01/rdf-schema#label&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A semicolon means &amp;ldquo;here comes another triple with the same subject as the last one&amp;rdquo;, letting you list multiple facts about a particular resource without repeating the resource&amp;rsquo;s URI or prefixed name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The keyword &amp;ldquo;a&amp;rdquo; stands in for the prefixed name &lt;code&gt;rdf:type&lt;/code&gt;, so that the first line after the prefix declarations above says that the resource &lt;code&gt;schema:Person&lt;/code&gt; has a type of &lt;code&gt;rdfs:Class&lt;/code&gt; (that is, that it&amp;rsquo;s an instance of the &lt;code&gt;rdfs:Class&lt;/code&gt; class and is therefore a class itself). The first line about &lt;code&gt;schema:familyName&lt;/code&gt; says that it has an &lt;code&gt;rdf:type&lt;/code&gt; of &lt;code&gt;rdf:Property&lt;/code&gt;, and so forth.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
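To see how much work the prefix shortcut saves, here is a tiny plain-Python sketch of the expansion. The expand_qname helper is purely illustrative; an RDF library such as RDFLib handles this for you.

```python
# Minimal sketch of Turtle-style prefix expansion.
# The prefix table mirrors the declarations in schemaOrgPersonSchema.ttl.
PREFIXES = {
    "schema": "http://schema.org/",
    "rdfs":   "http://www.w3.org/2000/01/rdf-schema#",
}

def expand_qname(qname, prefixes=PREFIXES):
    """Expand a prefixed name like 'rdfs:label' to its full URI."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local

print(expand_qname("rdfs:label"))
# -> http://www.w3.org/2000/01/rdf-schema#label
print(expand_qname("schema:Person"))
# -> http://schema.org/Person
```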
&lt;p&gt;Although Turtle is now the most popular syntax for representing RDF, I used N-Triples for the employee instance data because the use of one line per triple, with no dependencies on prefix declarations or anything else on previous lines, means that a Hadoop system can split up an N-Triples file at any line breaks that it wants to without hurting the integrity of the data.&lt;/p&gt;
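Here is a quick plain-Python illustration of that property, with a few made-up triples (shorthand prefixed names stand in for the full bracketed URIs to keep the lines short): because every line stands alone, a split at any line break produces two chunks that together still contain every statement.

```python
# Each N-Triples line is an independent statement, so splitting a file
# at any line break and processing the pieces separately loses nothing.
ntriples = """\
ex:a ex:p "one" .
ex:b ex:p "two" .
ex:c ex:p "three" .
ex:d ex:p "four" .
"""

lines = [line for line in ntriples.splitlines() if line.strip()]

# Split the file at an arbitrary line boundary, as Hadoop might.
chunk1, chunk2 = lines[:2], lines[2:]

# The union of the chunks' statements equals the original set.
assert set(chunk1) | set(chunk2) == set(lines)
```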
&lt;p&gt;What if schema.org couldn&amp;rsquo;t accommodate my complete canonical model? For example, it has no Employee class; what if I wanted to add one that has a hireDate property as well as the other properties shown above? I could simply add triples saying that Employee was a subclass of &lt;code&gt;schema:Person&lt;/code&gt; and that hireDate was a property associated with my new class.&lt;/p&gt;
&lt;p&gt;I wouldn&amp;rsquo;t add these modifications directly to the file storing the schema.org model, but instead put them in a separate file so that I could manage local customizations separately from the published standard. (The ability to combine different RDF datasets that use the same syntax—regardless of their respective data models—by just concatenating the files is another reason that RDF is popular for data integration.) This is the same strategy I used to describe my canonical model integration information, storing the following four triples in the integrationModel.ttl file to describe the relationship of the relevant HR and Northwind properties to the schema.org model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix rdfs:     &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; . 
@prefix schema:   &amp;lt;http://schema.org/&amp;gt; . 
@prefix oraclehr: &amp;lt;http://snee.com/vocab/schema/OracleHR#&amp;gt; .
@prefix nw:       &amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#&amp;gt; .

oraclehr:employees_first_name rdfs:subPropertyOf schema:givenName  . 
oraclehr:employees_last_name  rdfs:subPropertyOf schema:familyName . 
nw:employees_FirstName        rdfs:subPropertyOf schema:givenName  . 
nw:employees_LastName         rdfs:subPropertyOf schema:familyName . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that in RDF, any resource that can be represented by a URI can have properties assigned to it, including properties themselves. This file uses this ability to say that the two &lt;code&gt;oraclehr&lt;/code&gt; properties and the two &lt;code&gt;nw&lt;/code&gt; properties shown each have an &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; value.) At this point, with my schemaOrgPersonSchema.ttl file storing the excerpt of schema.org that models a Person and my integrationModel.ttl file modeling the relationships between schema:Person and the Northwind and HR input data, I have all the data modeling I need to drive a simple data integration.&lt;/p&gt;
&lt;h2 id=&#34;the-python-script-and-the-hadoop-cluster&#34;&gt;The Python script and the Hadoop cluster&lt;/h2&gt;
&lt;p&gt;Hadoop&amp;rsquo;s streaming interface lets you implement MapReduce logic in any programming language that can read from standard input and write to standard output. Because I knew of a Python library that could do RDFS inferencing, I wrote the following mapper routine in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/python

# employeeInferencing.py: read employee data and models relating it to 
# schema.org, then infer and output schema.org version of relevant facts.

# sample execution:
# cat NorthwindEmployees.nt HREmployees.nt | employeeInferencing.py &amp;gt; temp.ttl

# Reads ntriples from stdin and writes ntriples results to 
# stdout so that it can be used as a streaming Hadoop task. 

import sys
import rdflib
import RDFClosure

diskFileGraph = rdflib.Graph()        # Graph to store data and models

# Read the data from standard input
streamedInput = &amp;quot;&amp;quot;
for line in sys.stdin:
    streamedInput += line
diskFileGraph.parse(data=streamedInput,format=&amp;quot;nt&amp;quot;)

# Read the modeling information
diskFileGraph.parse(
  &amp;quot;http://snee.com/rdf/inferencingDataIntegration/schemaOrgPersonSchema.ttl&amp;quot;,
  format=&amp;quot;turtle&amp;quot;)
diskFileGraph.parse(
  &amp;quot;http://snee.com/rdf/inferencingDataIntegration/integrationModel.ttl&amp;quot;,
  format=&amp;quot;turtle&amp;quot;)

# Do the inferencing
RDFClosure.DeductiveClosure(RDFClosure.RDFS_Semantics).expand(diskFileGraph)

# Use a SPARQL query to extract the data that we want to return: any
# statements whose properties are associated with the schema:Person
# class. (Note that standard RDFS would use rdfs:domain for this, but
# schema.org uses schema:domainIncludes.)

queryForPersonData = &amp;quot;&amp;quot;&amp;quot;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt; 
CONSTRUCT { ?subject ?personProperty ?object }
WHERE { 
  ?personProperty schema:domainIncludes schema:Person .
  ?subject ?personProperty ?object .
}&amp;quot;&amp;quot;&amp;quot;

personData = diskFileGraph.query(queryForPersonData)

# Add the query results to a graph that we can output.
personDataGraph  = rdflib.Graph()
for row in personData:
    personDataGraph.add(row)

# Send the result to standard out.
personDataGraph.serialize(sys.stdout, format=&amp;quot;nt&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After importing the sys library to allow reading from standard input and writing to standard output, the script imports two more libraries: &lt;a href=&#34;https://github.com/RDFLib&#34;&gt;RDFLib&lt;/a&gt;, the most popular Python library for working with RDF, and RDFClosure from the related &lt;a href=&#34;https://github.com/RDFLib/OWL-RL&#34;&gt;OWL-RL&lt;/a&gt; project, which can do inferencing from RDFS modeling statements as well as inferencing that uses the Web Ontology Language (OWL), a more expressive superset of RDFS. (Other available tools for doing RDFS and OWL inferencing include TopQuadrant&amp;rsquo;s TopSPIN engine, Ontotext&amp;rsquo;s OWLIM, and Clark &amp;amp; Parsia&amp;rsquo;s Pellet.) After initializing &lt;code&gt;diskFileGraph&lt;/code&gt; as a graph to store the triples that the script will work with, the script reads any N-Triples data fed to it via standard input into this graph and then reads in the schemaOrgPersonSchema.ttl and integrationModel.ttl files of modeling data described above. These files are identified by &lt;a href=&#34;http://snee.com/rdf/inferencingDataIntegration/schemaOrgPersonSchema.ttl&#34;&gt;http://snee.com/rdf/inferencingDataIntegration/schemaOrgPersonSchema.ttl&lt;/a&gt; and &lt;a href=&#34;http://snee.com/rdf/inferencingDataIntegration/integrationModel.ttl&#34;&gt;http://snee.com/rdf/inferencingDataIntegration/integrationModel.ttl&lt;/a&gt;, which are not just URIs in the RDF sense but actual URLs: send your browser to either one and you&amp;rsquo;ll find copies of those files stored at those locations. That&amp;rsquo;s where the script reads them from.&lt;/p&gt;
&lt;p&gt;Next, the script computes the &lt;a href=&#34;http://en.wikipedia.org/wiki/Deductive_closure&#34;&gt;deductive closure&lt;/a&gt; of the triples aggregated from standard input and the modeling information. For example, when it sees the triple &lt;code&gt;{&amp;lt;http://snee.com/vocab/OracleHR#employees_122&amp;gt; &amp;lt;http://snee.com/vocab/schema/OracleHR#employees_last_name&amp;gt; &amp;quot;Kaufling&amp;quot;}&lt;/code&gt; and the triple &lt;code&gt;{oraclehr:employees_last_name  rdfs:subPropertyOf schema:familyName}&lt;/code&gt;, it infers the new triple &lt;code&gt;{&amp;lt;http://snee.com/vocab/OracleHR#employees_122&amp;gt; schema:familyName &amp;quot;Kaufling&amp;quot;}&lt;/code&gt;. Because the inference engine&amp;rsquo;s job is to infer new triples based on all the relevant ones it can find, newly inferred triples may make new inferences possible, so it continues inferencing until there is nothing new that it can infer from the existing set—it has achieved closure.&lt;/p&gt;
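A minimal sketch of that rule in plain Python (this is not how RDFClosure works internally, and the extra ex:name level is made up to show why the engine must loop until nothing new appears):

```python
RDFS_SUBPROP = "rdfs:subPropertyOf"

def rdfs_subproperty_closure(triples):
    """Repeatedly apply the rdfs:subPropertyOf rule until fixpoint."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        # Collect all (subproperty, superproperty) pairs currently known.
        subprops = [(s, o) for s, p, o in triples if p == RDFS_SUBPROP]
        for s, p, o in list(triples):
            for sub, sup in subprops:
                if p == sub and (s, sup, o) not in triples:
                    triples.add((s, sup, o))
                    changed = True
    return triples

data = {
    ("oraclehr:employees_122", "oraclehr:employees_last_name", "Kaufling"),
    ("oraclehr:employees_last_name", RDFS_SUBPROP, "schema:familyName"),
    # Hypothetical extra level, to show that inferred triples can
    # themselves trigger further inferences:
    ("schema:familyName", RDFS_SUBPROP, "ex:name"),
}
closed = rdfs_subproperty_closure(data)
assert ("oraclehr:employees_122", "schema:familyName", "Kaufling") in closed
assert ("oraclehr:employees_122", "ex:name", "Kaufling") in closed
```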
&lt;p&gt;At this point, the script will have all of the original triples that it read in plus the new ones that it inferred, but I&amp;rsquo;m going to assume that applications using data conforming to the canonical model are only interested in that data and not in all the other input. To extract the relevant subset, the script runs a query in SPARQL, the query language from the RDF family of W3C standards. As with SQL, it&amp;rsquo;s common to see SPARQL queries that begin with SELECT statements listing columns of data to return, but this Python script uses a CONSTRUCT query instead, which returns triples instead of columns of data. The query&amp;rsquo;s WHERE clause identifies the triples that the query wants by using &amp;ldquo;triple patterns&amp;rdquo;, or triples that include variables as wildcards to describe the kinds of triples to look for, and the CONSTRUCT part describes what should be in the triples that get returned.&lt;/p&gt;
&lt;p&gt;In this case, the triples to return are any whose predicate value has a &lt;code&gt;schema:domainIncludes&lt;/code&gt; value of &lt;code&gt;schema:Person&lt;/code&gt;—in other words, any property associated with the &lt;code&gt;schema:Person&lt;/code&gt; class. As the comment in the code says, it&amp;rsquo;s more common for RDFS and OWL models to use the standard &lt;code&gt;rdfs:domain&lt;/code&gt; property to associate properties with classes, but this can get messy when associating a particular property with multiple classes, so the schema.org project defined their own &lt;code&gt;schema:domainIncludes&lt;/code&gt; property for this.&lt;/p&gt;
&lt;p&gt;This SPARQL query could be extended to implement additional logic if necessary. For example, if one database had separate &lt;code&gt;lastName&lt;/code&gt; and &lt;code&gt;firstName&lt;/code&gt; fields and another had a single &lt;code&gt;name&lt;/code&gt; field with values of the form &amp;ldquo;Smith, John&amp;rdquo;, then string manipulation functions in the SPARQL query could concatenate the &lt;code&gt;lastName&lt;/code&gt; and &lt;code&gt;firstName&lt;/code&gt; values with a comma or split the &lt;code&gt;name&lt;/code&gt; value at the comma to create new values. This brings the script past strict model-based mapping to include transformation, but most independently-developed data models don&amp;rsquo;t line up neatly enough to describe their relationships with nothing but simple mappings.&lt;/p&gt;
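The transformation logic itself is simple; here it is sketched in plain Python with made-up field values. (In the actual SPARQL query, the same effect would come from string functions such as CONCAT, STRBEFORE, and STRAFTER.)

```python
def split_name(name):
    """Split a 'Smith, John' style value into (familyName, givenName)."""
    family, given = [part.strip() for part in name.split(",", 1)]
    return family, given

def join_name(family, given):
    """Combine separate fields into a single 'Smith, John' style value."""
    return f"{family}, {given}"

assert split_name("Smith, John") == ("Smith", "John")
assert join_name("Smith", "John") == "Smith, John"
```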
&lt;p&gt;The data returned by the query and stored in the &lt;code&gt;personData&lt;/code&gt; variable is not one of RDFLib&amp;rsquo;s &lt;code&gt;Graph()&lt;/code&gt; structures like the &lt;code&gt;diskFileGraph&lt;/code&gt; instance that it has been working with throughout the script, so the script creates a new instance called &lt;code&gt;personDataGraph&lt;/code&gt; and adds the data from &lt;code&gt;personData&lt;/code&gt; to it. Once this is done, all that&amp;rsquo;s left is to output this graph&amp;rsquo;s contents to standard out in the N-Triples format, identified as &amp;ldquo;nt&amp;rdquo; in the call to the &lt;code&gt;serialize&lt;/code&gt; method.&lt;/p&gt;
&lt;p&gt;In a typical Hadoop job, the data returned by the mapper routine is further processed by a reducer routine, but to keep this example simple I created a dummyReducer.py script that merely copied the returned data through unchanged:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/python
# dummyReducer.py: just copy stdin to stdout

import sys

for line in sys.stdin:
    sys.stdout.write(line)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;running-it-expanding-the-model-and-running-it-again&#34;&gt;Running it, expanding the model, and running it again&lt;/h2&gt;
&lt;p&gt;With my two Python scripts, my two modeling files, and one file of data from each of the two databases&amp;rsquo; employee tables, I had everything I needed to have Hadoop integrate the data to the canonical model using RDFS inferencing. I set up a four-node Hadoop cluster using the steps described in &lt;a href=&#34;http://letsdobigdata.wordpress.com/2014/01/13/setting-up-hadoop-multi-node-cluster-on-amazon-ec2-part-1/&#34;&gt;part 1&lt;/a&gt; and &lt;a href=&#34;http://letsdobigdata.wordpress.com/2014/01/13/setting-up-hadoop-1-2-1-multi-node-cluster-on-amazon-ec2-part-2/&#34;&gt;part 2&lt;/a&gt; of Hardik Pandya&amp;rsquo;s &amp;ldquo;Setting up Hadoop multi-node cluster on Amazon EC2&amp;rdquo;, formatted the distributed file system, and copied the NorthwindEmployees.nt and HREmployees.nt files to the &lt;code&gt;/data/employees&lt;/code&gt; directory on that file system. Because the employeeInferencing.py script would be passed to the slave nodes to run on the subsets of input data sent to those nodes, I also installed the RDFLib and OWL-RL Python modules that this script needed on the slave nodes. Then, with the Python scripts stored in &lt;code&gt;/home/ubuntu/dataInt/&lt;/code&gt; on the cluster&amp;rsquo;s master node, I was ready to run the job with the following command (split over six lines here to fit on this page) on the master node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar 
  -file /home/ubuntu/dataInt/employeeInferencing.py 
  -mapper /home/ubuntu/dataInt/employeeInferencing.py 
  -file /home/ubuntu/dataInt/dummyReducer.py 
  -reducer /home/ubuntu/dataInt/dummyReducer.py 
  -input /data/employees/* -output /data/myOutputDir
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After running that, the following copied the result from the distributed file system to a run1.nt file in my local filesystem:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hadoop dfs -cat /data/myOutputDir/part-00000 &amp;gt; outputCopies/run1.nt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are a few typical lines from run1.nt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://snee.com/vocab/OracleHR#employees_100&amp;gt; 
&amp;lt;http://schema.org/familyName&amp;gt; &amp;quot;King&amp;quot; .   

&amp;lt;http://snee.com/vocab/OracleHR#employees_100&amp;gt; 
&amp;lt;http://schema.org/givenName&amp;gt; &amp;quot;Steven&amp;quot; .  

&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://schema.org/familyName&amp;gt; &amp;quot;Fuller&amp;quot; . 

&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://schema.org/givenName&amp;gt; &amp;quot;Andrew&amp;quot; .  
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The entire file is all &lt;code&gt;schema:givenName&lt;/code&gt; and &lt;code&gt;schema:familyName&lt;/code&gt; triples about the resources from the Oracle HR and SQL Server Northwind databases.&lt;/p&gt;
&lt;p&gt;This isn&amp;rsquo;t much so far: the output has only the first and last name values from the two source databases. But here&amp;rsquo;s where it gets more interesting. I added the following two lines to the copy of integrationModel.ttl stored on the snee.com server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;oraclehr:employees_phone_number rdfs:subPropertyOf schema:telephone .  
nw:employees_HomePhone          rdfs:subPropertyOf schema:telephone . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, with no changes to the Python scripts or anything else, re-running the same command on the Hadoop master node (with a new output directory parameter) produces a result with lines like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://snee.com/vocab/OracleHR#employees_100&amp;gt; 
&amp;lt;http://schema.org/familyName&amp;gt; &amp;quot;King&amp;quot; .

&amp;lt;http://snee.com/vocab/OracleHR#employees_100&amp;gt; 
&amp;lt;http://schema.org/givenName&amp;gt; &amp;quot;Steven&amp;quot; .

&amp;lt;http://snee.com/vocab/OracleHR#employees_100&amp;gt; 
&amp;lt;http://schema.org/telephone&amp;gt; &amp;quot;515.123.4567&amp;quot; .

&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://schema.org/givenName&amp;gt; &amp;quot;Andrew&amp;quot; .

&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://schema.org/familyName&amp;gt; &amp;quot;Fuller&amp;quot; .

&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://schema.org/telephone&amp;gt; &amp;quot;(206) 555-9482&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Expanding the scope of the data integration required no new coding in the Python script—just an expansion of the integration model. The integration is truly being driven by the model, and not by procedural transformation code. And, adding a completely new data source wouldn&amp;rsquo;t be any more trouble than adding the phone data was above; you only need to identify which properties of the new data source correspond to which properties of the canonical data model.&lt;/p&gt;
&lt;h2 id=&#34;modeling-more-complex-relationships-for-more-complex-mapping&#34;&gt;Modeling more complex relationships for more complex mapping&lt;/h2&gt;
&lt;p&gt;All the inferencing so far has been done with just one property from the RDFS standard: &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt;. RDFS offers additional modeling constructs that let you do more. As I mentioned earlier, schema.org does not define an Employee class, but if my application needs one, I can use RDFS to define it in my own namespace as a subclass of &lt;code&gt;schema:Person&lt;/code&gt;. Also, the Northwind employee data has an &lt;code&gt;nw:employees_HireDate&lt;/code&gt; property that I&amp;rsquo;d like to associate with my new class. I can do both of these by adding these two triples to integrationModel.ttl, shown here with a prefix declaration to make the triples shorter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix emp: &amp;lt;http://snee.com/vocab/employees#&amp;gt; .
emp:Employee rdfs:subClassOf schema:Person . 
nw:employees_HireDate rdfs:domain emp:Employee .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The SPARQL query in employeeInferencing.py only looked for properties associated with instances of &lt;code&gt;schema:Person&lt;/code&gt;, so after expanding that a bit to request the Employee and class membership triples as well, running the inferencing script shows us that the RDFClosure engine has inferred these new triples about Andrew Fuller:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt; 
&amp;lt;http://snee.com/vocab/employees#Employee&amp;gt; .

&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt; 
&amp;lt;http://schema.org/Person&amp;gt; .

&amp;lt;http://snee.com/vocab/SQLServerNorthwind#employees_2&amp;gt; 
&amp;lt;http://snee.com/vocab/schema/SQLServerNorthwind#employees_HireDate&amp;gt; 
&amp;quot;1992-08-14T00:00:00&amp;quot;^^&amp;lt;http://www.w3.org/2001/XMLSchema#dateTime&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In other words, because he has an &lt;code&gt;nw:employees_HireDate&lt;/code&gt; value, it inferred that he is an instance of the class &lt;code&gt;emp:Employee&lt;/code&gt;, and because that&amp;rsquo;s a subclass of &lt;code&gt;schema:Person&lt;/code&gt;, we see that he is also a member of that class.&lt;/p&gt;
&lt;p&gt;The W3C&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/2001/sw/wiki/OWL&#34;&gt;OWL&lt;/a&gt; standard adds additional properties beyond those defined by RDFS to further describe your data, as well as special classes and the ability to define your own classes to use in describing your data. For example, if the HR database&amp;rsquo;s departments table had a &lt;code&gt;related&lt;/code&gt; property so that you could specify that the shipping department is related to the receiving department, then specifying in our integration model that &lt;code&gt;{oraclehr:related rdf:type owl:SymmetricProperty}&lt;/code&gt; would tell the RDFClosure engine that this property is symmetric and that it should infer that the receiving department is related to the shipping department. (When telling RDFClosure&amp;rsquo;s DeductiveClosure method to do OWL inferencing in addition to RDFS inferencing, pass it an RDFS_OWLRL_Semantics parameter instead of RDFS_Semantics.)&lt;/p&gt;
&lt;p&gt;OWL also includes an &lt;code&gt;owl:inverseOf&lt;/code&gt; property that can help with data integration. For example, imagine that the Northwind database had an &lt;code&gt;nw:manages&lt;/code&gt; property that let you say things like &lt;code&gt;{emp:jack nw:manages emp:shippingDepartment}&lt;/code&gt;, but the HR database identified the relationship in the opposite direction with an &lt;code&gt;oraclehr:managedBy&lt;/code&gt; relationship used in triples of the form &lt;code&gt;{emp:receivingDepartment oraclehr:managedBy emp:jill}&lt;/code&gt;. When you tell an OWL engine that these two properties are the inverse of each other with the triple &lt;code&gt;{oraclehr:managedBy owl:inverseOf nw:manages}&lt;/code&gt;, it will infer from the triples above that &lt;code&gt;{emp:shippingDepartment oraclehr:managedBy emp:jack}&lt;/code&gt; and that &lt;code&gt;{emp:jill nw:manages emp:receivingDepartment}&lt;/code&gt;.&lt;/p&gt;
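What that inference amounts to, sketched in plain Python using the hypothetical triples from the paragraph above:

```python
def apply_inverse_of(triples, inverse_pairs):
    """For each (p, q) declared as inverses, infer (o, q, s) from
    (s, p, o) and (o, p, s) from (s, q, o)."""
    inferred = set(triples)
    for p, q in inverse_pairs:
        for s, pred, o in triples:
            if pred == p:
                inferred.add((o, q, s))
            elif pred == q:
                inferred.add((o, p, s))
    return inferred

data = {
    ("emp:jack", "nw:manages", "emp:shippingDepartment"),
    ("emp:receivingDepartment", "oraclehr:managedBy", "emp:jill"),
}
# The declaration {oraclehr:managedBy owl:inverseOf nw:manages}:
inferred = apply_inverse_of(data, [("oraclehr:managedBy", "nw:manages")])
assert ("emp:shippingDepartment", "oraclehr:managedBy", "emp:jack") in inferred
assert ("emp:jill", "nw:manages", "emp:receivingDepartment") in inferred
```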
&lt;p&gt;When processing of the input is distributed over multiple nodes, as with a Hadoop cluster, this inferencing has some limitations. For example, the &lt;code&gt;owl:TransitiveProperty&lt;/code&gt; class lets me say that an &lt;code&gt;ex:locatedIn&lt;/code&gt; property is transitive by using a triple such as &lt;code&gt;{ex:locatedIn rdf:type owl:TransitiveProperty}&lt;/code&gt;. Then, when an OWL engine sees that &lt;code&gt;{ex:chair38 ex:locatedIn ex:room47}&lt;/code&gt; and that &lt;code&gt;{ex:room47 ex:locatedIn ex:building6}&lt;/code&gt;, it can infer that &lt;code&gt;{ex:chair38 ex:locatedIn ex:building6}&lt;/code&gt;. When distributing the processing across a Hadoop cluster, however, the &lt;code&gt;{ex:chair38 ex:locatedIn ex:room47}&lt;/code&gt; triple may get sent to one node and the &lt;code&gt;{ex:room47 ex:locatedIn ex:building6}&lt;/code&gt; triple to another, so neither will have enough information to infer which building the chair is in. So, when you review the RDFS and OWL standards for properties and classes that you can use to describe the data that you want to integrate on a distributed Hadoop system, keep in mind which of these can do their inferencing based on a single triple of instance data input and which require multiple triples. (The Reduce step of a MapReduce job, where above I just put a dummy script to copy the data through, would be a potential place to do additional inferencing based on the output of the mapping steps done on the distributed Hadoop nodes.)&lt;/p&gt;
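The problem is easy to demonstrate with a toy transitive-closure function in plain Python: run over the whole dataset it finds the chair's building, but run over two single-triple partitions separately it cannot.

```python
def transitive_closure(triples, prop):
    """Infer (a, prop, c) whenever (a, prop, b) and (b, prop, c) hold."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(triples):
            for b2, p2, c in list(triples):
                if p1 == p2 == prop and b == b2 and (a, prop, c) not in triples:
                    triples.add((a, prop, c))
                    changed = True
    return triples

t1 = ("ex:chair38", "ex:locatedIn", "ex:room47")
t2 = ("ex:room47", "ex:locatedIn", "ex:building6")
goal = ("ex:chair38", "ex:locatedIn", "ex:building6")

# Processed together, the inference succeeds...
assert goal in transitive_closure({t1, t2}, "ex:locatedIn")

# ...but if each triple goes to a different Hadoop node, neither node
# can make the inference on its own.
partial = (transitive_closure({t1}, "ex:locatedIn")
           | transitive_closure({t2}, "ex:locatedIn"))
assert goal not in partial
```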
&lt;h2 id=&#34;other-tools-for-working-with-rdf-on-hadoop&#34;&gt;Other tools for working with RDF on Hadoop&lt;/h2&gt;
&lt;p&gt;There have been other projects for taking advantage of the RDF data model on Hadoop before I tried this, and there are more coming along. At ApacheCon Europe in 2012, Cloudera&amp;rsquo;s Paolo Castagna (formerly of Kasabi, Talis, and HP Labs in Bristol, which is quite an RDF pedigree) gave a talk titled &amp;ldquo;Handling RDF data with tools from the Hadoop ecosystem&amp;rdquo; (&lt;a href=&#34;http://archive.apachecon.com/eu2012/presentations/07-Wednesday/L2L-Web_Infra/aceu-2012-handling-rdf-data-with-tools-from-the-hadoop-ecosystem.pdf&#34;&gt;slides PDF&lt;/a&gt;) where he mostly covered the application of popular Hadoop tools to N-Triples files, but he also described his &lt;a href=&#34;https://github.com/castagna/jena-grande&#34;&gt;jena-grande&lt;/a&gt; project to mix the &lt;a href=&#34;https://jena.apache.org/&#34;&gt;Apache Jena&lt;/a&gt; RDF library with these tools. At the 2014 ApacheCon, YarcData&amp;rsquo;s Rob Vesse gave a talk titled &amp;ldquo;Quadrupling Your Elephants: RDF and The Hadoop Ecosystem&amp;rdquo; (&lt;a href=&#34;http://events.linuxfoundation.org/sites/events/files/slides/Quadrupling%20your%20Elephants_0.pdf&#34;&gt;slides PDF&lt;/a&gt;), which reviewed tools for using RDF on Hadoop and described the Jena Hadoop RDF tools project, which has since been renamed as &lt;a href=&#34;http://jena.apache.org/documentation/hadoop/&#34;&gt;Jena Elephas&lt;/a&gt;. (Rob &lt;a href=&#34;https://twitter.com/RobVesse/status/535452136025628672&#34;&gt;described&lt;/a&gt; Paolo&amp;rsquo;s jena-grande as a &amp;ldquo;useful reference &amp;amp; inspiration in developing the new stuff&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;The kind of scripting that I did with Hadoop&amp;rsquo;s streaming interface is a great way to get Hadoop tasks up and running quickly, but more serious Hadoop applications are typically written in Java, as I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/hadoop#id120676&#34;&gt;described&lt;/a&gt; in a recent blog entry, and by bringing the full power of Jena to this kind of development, Elephas will open up some great new possibilities for taking advantage of the RDF data model (and SPARQL, and RDFS, and OWL) on Hadoop. I&amp;rsquo;m definitely looking forward to seeing where that leads.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
    </item>
    
    <item>
      <title>R (and SPARQL), part 2</title>
      <link>https://www.bobdc.com/blog/r-and-sparql-part-2/</link>
      <pubDate>Tue, 20 Jan 2015 08:32:54 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/r-and-sparql-part-2/</guid>
      
      
      <description><div>Retrieve data from a SPARQL endpoint, graph it and more, then automate it.</div><div>&lt;blockquote id=&#34;id104228&#34; class=&#34;pullquote&#34;&gt;In the future whenever I use SPARQL to retrieve numeric data I&#39;ll have some much more interesting ideas about what I can do with that data. &lt;/blockquote&gt;
&lt;p&gt;In &lt;a href=&#34;https://www.bobdc.com/blog/r-and-sparql-part-1&#34;&gt;part 1&lt;/a&gt; of this series, I discussed the history of &lt;a href=&#34;http://www.r-project.org/&#34;&gt;R&lt;/a&gt;, the programming language and environment for statistical computing and graph generation, and why it&amp;rsquo;s become so popular lately. The many libraries that people have contributed to it are a key reason for its popularity, and the SPARQL one inspired me to learn some R to try it out. Part 1 showed how to load this library, retrieve a SPARQL result set, and perform some basic statistical analysis of the numbers in the result set. After I published it, it was good to see its &lt;a href=&#34;https://plus.google.com/101006505484718936507/posts/ETSaKzd6hQe&#34;&gt;comments section&lt;/a&gt; fill up with a nice list of projects out there that combine R and SPARQL.&lt;/p&gt;
&lt;p&gt;If you executed the sample commands from Part 1 and saved your session when quitting out of R (or in the case of what I was doing last week, RGui), all of the variables set in that session will be available for the commands described here. Today we&amp;rsquo;ll look at a few more commands for analyzing the data, how to plot points and regression lines, and how to automate it all so that you can quickly perform the same analysis on different SPARQL result sets. Again, corrections welcome.&lt;/p&gt;
&lt;p&gt;My original goal was to find out how closely the number of employees in the companies making up the Dow Jones Industrial Average &lt;a href=&#34;http://en.wikipedia.org/wiki/Correlation_and_dependence&#34;&gt;correlated&lt;/a&gt; with the net income, which we can find out with R&amp;rsquo;s &lt;code&gt;cor()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; cor(queryResult$netIncome,queryResult$numEmployees)
[1] 0.1722887
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A correlation figure close to 1 or -1 indicates a strong correlation (a negative correlation indicates that one variable&amp;rsquo;s values tend to go in the opposite direction of the other&amp;rsquo;s—for example, if incidence of a certain disease goes down as the use of a particular vaccine goes up) and 0 indicates no correlation. The correlation of 0.1722887 is much closer to 0 than it is to 1 or -1, so we see very little correlation here. (Once we automate this series of steps, we&amp;rsquo;ll find stronger correlations when we focus on specific industries.)&lt;/p&gt;
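If you're curious what cor() is computing, here is the Pearson correlation coefficient sketched in plain Python; R of course does all of this (and much more) for you.

```python
import math

def pearson_cor(xs, ys):
    """Pearson correlation coefficient, as computed by R's cor()."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# A perfectly linear relationship has correlation 1.0...
assert math.isclose(pearson_cor([1, 2, 3, 4], [2, 4, 6, 8]), 1.0)
# ...and reversing the direction gives -1.0.
assert math.isclose(pearson_cor([1, 2, 3, 4], [8, 6, 4, 2]), -1.0)
```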
&lt;h2 id=&#34;id104120&#34;&gt;More graphing&lt;/h2&gt;
&lt;p&gt;We&amp;rsquo;re going to graph the relationship between the employee and net income figures, and then we&amp;rsquo;ll tell R to draw a straight line that fits as closely as possible to the pattern created by the plotted values. This is called a &lt;a href=&#34;https://en.wikipedia.org/wiki/Linear_regression&#34;&gt;linear regression model&lt;/a&gt;, and before we do that we tell R to calculate some data necessary for this task with the &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-patched/library/stats/html/lm.html&#34;&gt;lm()&lt;/a&gt; (&amp;ldquo;linear model&amp;rdquo;) function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; myLinearModelData &amp;lt;- lm(queryResult$numEmployees~queryResult$netIncome) 
&lt;/code&gt;&lt;/pre&gt;
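For a single predictor, the model data that lm() computes boils down to an intercept and a slope from ordinary least squares. Here is a plain-Python sketch of that calculation:

```python
import math

def least_squares(xs, ys):
    """Ordinary least-squares fit: return (intercept, slope), like the
    coefficients of R's lm(y ~ x) for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Points lying exactly on y = 2x + 1 recover that slope and intercept.
intercept, slope = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
assert math.isclose(slope, 2.0)
assert math.isclose(intercept, 1.0)
```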
&lt;p&gt;Next, we draw the graph:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; plot(queryResult$netIncome,queryResult$numEmployees,xlab=&amp;quot;net income&amp;quot;,
   ylab=&amp;quot;# of employees&amp;quot;, main=&amp;quot;Dow Jones Industrial Average companies&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with the histogram that we saw in Part 1, R offers many ways to control the graph&amp;rsquo;s appearance, and add-in libraries let you do even more. (Try a &lt;a href=&#34;https://www.google.com/search?q=fancy+r+plots&amp;amp;source=lnms&amp;amp;tbm=isch&amp;amp;sa=X&#34;&gt;Google image search on &amp;ldquo;fancy R plots&amp;rdquo;&lt;/a&gt; to get a feel for the possibilities.) In the call to &lt;code&gt;plot()&lt;/code&gt; I included three parameters to set a main title and labels for the X and Y axes, and we see these in the result:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/djiasansregression.jpg&#34;&gt;&lt;img id=&#34;id104180&#34; src=&#34;https://www.bobdc.com/img/main/djiasansregression.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;DJIA plot&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We can see more intuitively what the &lt;code&gt;cor()&lt;/code&gt; function already told us: there is minimal correlation between employee counts and net income among the companies comprising the Dow Jones Industrial Average.&lt;/p&gt;
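&lt;p&gt;As a reminder, that correlation figure comes from a call along these lines, assuming the &lt;code&gt;queryResult&lt;/code&gt; data frame from earlier is still loaded:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; cor(queryResult$netIncome,queryResult$numEmployees)
&lt;/code&gt;&lt;/pre&gt;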
&lt;p&gt;Let&amp;rsquo;s put the data that we stored in &lt;code&gt;myLinearModelData&lt;/code&gt; to use. The &lt;a href=&#34;http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/abline.html&#34;&gt;abline()&lt;/a&gt; function can use it to add a regression line to our plot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; abline(myLinearModelData)  
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/djiawithregrline.jpg&#34;&gt;&lt;img id=&#34;id106625&#34; src=&#34;https://www.bobdc.com/img/main/djiawithregrline.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;DJIA plot with regression line&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When you type in function calls such as &lt;code&gt;sd(queryResult$numEmployees)&lt;/code&gt; and &lt;code&gt;cor(queryResult$netIncome,queryResult$numEmployees)&lt;/code&gt;, R prints the return values as output, but you can also use those return values in other operations. In the following, I&amp;rsquo;ve replotted the graph with the &lt;code&gt;cor()&lt;/code&gt; function call&amp;rsquo;s result used in a subtitle for the graph, concatenated onto the string &amp;ldquo;correlation: &amp;rdquo; with R&amp;rsquo;s &lt;code&gt;paste()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; plot(queryResult$netIncome,queryResult$numEmployees,xlab=&amp;quot;net income&amp;quot;,
   ylab=&amp;quot;# of employees&amp;quot;, main=&amp;quot;Dow Jones Industrial Average companies&amp;quot;,
   sub=paste(&amp;quot;correlation: &amp;quot;,cor(queryResult$numEmployees,
             queryResult$netIncome),sep=&amp;quot;&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The &lt;code&gt;paste()&lt;/code&gt; function&amp;rsquo;s &lt;code&gt;sep&lt;/code&gt; argument here shows that we don&amp;rsquo;t want any separator between our concatenated pieces. I&amp;rsquo;m guessing that &lt;code&gt;paste()&lt;/code&gt; is more typically used to create delimited data files.) R puts the subtitle at the image&amp;rsquo;s bottom:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/djiawithsubtitle.jpg&#34;&gt;&lt;img id=&#34;id106685&#34; src=&#34;https://www.bobdc.com/img/main/djiawithsubtitle.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;DJIA plot with subtitle&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Instead of plotting the graph on the screen, we can tell R to send it to a JPEG, BMP, PNG, or TIFF file. Calling a &lt;a href=&#34;http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/png.html&#34;&gt;graphics device&lt;/a&gt; function such as &lt;code&gt;jpeg()&lt;/code&gt; before doing the plot tells R to send the results to a file, and &lt;code&gt;dev.off()&lt;/code&gt; turns off the &amp;ldquo;device&amp;rdquo; that writes to the image file.&lt;/p&gt;
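&lt;p&gt;As a minimal sketch (the output path here is just an example), the pattern looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; jpeg(&amp;quot;c:/temp/myplot.jpg&amp;quot;)   # open a JPEG graphics device
&amp;gt; hist(queryResult$netIncome)      # drawn into the file, not on the screen
&amp;gt; dev.off()                        # close the device, completing the file
&lt;/code&gt;&lt;/pre&gt;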
&lt;h2 id=&#34;id106712&#34;&gt;Automating it&lt;/h2&gt;
&lt;p&gt;Now we know nearly enough commands to create a useful script. The remainder are just string manipulation functions that I found easy enough to look up when I needed them, although having a string concatenation command called &lt;code&gt;paste()&lt;/code&gt; is another example of the odd R terminology that I warned about last week. Here is my script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(SPARQL) 


category &amp;lt;- &amp;quot;Companies_in_the_Dow_Jones_Industrial_Average&amp;quot;
#category &amp;lt;- &amp;quot;Electronics_companies_of_the_United_States&amp;quot;
#category &amp;lt;- &amp;quot;Financial_services_companies_of_the_United_States&amp;quot;


query &amp;lt;- &amp;quot;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt;
PREFIX dbo: &amp;lt;http://dbpedia.org/ontology/&amp;gt;
PREFIX dbpprop: &amp;lt;http://dbpedia.org/property/&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
SELECT ?label ?numEmployees ?netIncome  
WHERE {
  ?s dcterms:subject &amp;lt;http://dbpedia.org/resource/Category:DUMMY-CATEGORY-NAME&amp;gt; ;
     rdfs:label ?label ;
     dbo:netIncome ?netIncomeDollars ;
     dbpprop:numEmployees ?numEmployees . 
     BIND(replace(?numEmployees,&#39;,&#39;,&#39;&#39;) AS ?employees)  # lose commas
     FILTER ( lang(?label) = &#39;en&#39; )
     FILTER(contains(?netIncomeDollars,&#39;E&#39;))
     # Following because DBpedia types them as dbpedia:datatype/usDollar
     BIND(xsd:float(?netIncomeDollars) AS ?netIncome)
     # Original query on following line had two 
     # slashes, but R needed both escaped.
     FILTER(!(regex(?numEmployees,&#39;\\\\d+&#39;)))
}
ORDER BY ?numEmployees&amp;quot;


query &amp;lt;- sub(pattern=&amp;quot;DUMMY-CATEGORY-NAME&amp;quot;,replacement=category,x=query)


endpoint &amp;lt;- &amp;quot;http://dbpedia.org/sparql&amp;quot;
resultList &amp;lt;- SPARQL(endpoint,query)
queryResult &amp;lt;- resultList$results 
correlationLegend=paste(&amp;quot;correlation: &amp;quot;,cor(queryResult$numEmployees,
                         queryResult$netIncome),sep=&amp;quot;&amp;quot;)
myLinearModelData &amp;lt;- lm(queryResult$numEmployees~queryResult$netIncome) 
plotTitle &amp;lt;- chartr(old=&amp;quot;_&amp;quot;,new=&amp;quot; &amp;quot;,x=category)
outputFilename &amp;lt;- paste(&amp;quot;c:/temp/&amp;quot;,category,&amp;quot;.jpg&amp;quot;,sep=&amp;quot;&amp;quot;)
jpeg(outputFilename)
plot(queryResult$netIncome,queryResult$numEmployees,xlab=&amp;quot;net income&amp;quot;,
     ylab=&amp;quot;number of employees&amp;quot;, main=plotTitle,cex.main=.9,
     sub=correlationLegend)
abline(myLinearModelData) 
dev.off()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Instead of hardcoding the URI of the industry category whose data I wanted, my script has DUMMY-CATEGORY-NAME, a placeholder string that it replaces with the &lt;code&gt;category&lt;/code&gt; value assigned at the script&amp;rsquo;s beginning. The category value here is &amp;ldquo;Companies_in_the_Dow_Jones_Industrial_Average&amp;rdquo;, with two other potential &lt;code&gt;category&lt;/code&gt; values commented out so that we can easily try them later. (R, like SPARQL, uses the # character for comments.) I also used the &lt;code&gt;category&lt;/code&gt; value to create the output filename.&lt;/p&gt;
&lt;p&gt;An additional embellishment to the sequence of commands that we entered manually is that the script stores the plot title in a &lt;code&gt;plotTitle&lt;/code&gt; variable, replacing the underscores in the category name with spaces. Because this sometimes resulted in titles that were too wide for the plot image, I added &lt;code&gt;cex.main=.9&lt;/code&gt; as a &lt;code&gt;plot()&lt;/code&gt; argument to reduce the title&amp;rsquo;s size.&lt;/p&gt;
&lt;p&gt;With the script stored in /temp/myscript.R, entering the following at the R prompt runs it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;source(&amp;quot;/temp/myscript.R&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I don&amp;rsquo;t have an R interpreter up and running, I can run the script from the operating system command line by calling Rscript, which is included with R:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Rscript /temp/myscript.R
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After it runs, my /temp directory has this Companies_in_the_Dow_Jones_Industrial_Average.jpg file in it:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/Companies_in_the_Dow_Jones_Industrial_Average.jpg&#34;&gt;&lt;img id=&#34;id106837&#34; src=&#34;https://www.bobdc.com/img/main/Companies_in_the_Dow_Jones_Industrial_Average.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;DJIA plot from script&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When I uncomment the script&amp;rsquo;s second &lt;code&gt;category&lt;/code&gt; assignment line instead of the first and run the script again, it creates the file Electronics_companies_of_the_United_States.jpg:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/Electronics_companies_of_the_United_States.jpg&#34;&gt;&lt;img id=&#34;id106858&#34; src=&#34;https://www.bobdc.com/img/main/Electronics_companies_of_the_United_States.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;data on U.S. electronics companies&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s better correlation this time: almost .5. Two outliers stretch the plot&amp;rsquo;s scale, so R crowds enough points into the lower left to make a bit of a blotch; I found with experimentation that the &lt;code&gt;plot()&lt;/code&gt; command offers parameters (&lt;code&gt;xlim&lt;/code&gt; and &lt;code&gt;ylim&lt;/code&gt;) to display only the points within a particular range of values on the horizontal or vertical axis, making it easier to show a zoomed view.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what we get when querying about Financial_services_companies_of_the_United_States:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/Financial_services_companies_of_the_United_States.jpg&#34;&gt;&lt;img id=&#34;id106922&#34; src=&#34;https://www.bobdc.com/img/main/Financial_services_companies_of_the_United_States.jpg&#34; width=&#34;320&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;data on U.S. electronics companies&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We see the strongest correlation yet: over .84. I suppose that at financial services companies, hiring more people is more likely to increase revenue than in other sectors because you can provide (and charge for) a higher volume of services. This is only a theory, but that&amp;rsquo;s why people use statistical analysis packages: to look for patterns that can suggest theories. It&amp;rsquo;s great to know that such a powerful open-source package can do this with data retrieved from SPARQL endpoints.&lt;/p&gt;
&lt;p&gt;If I were going to run this script from the operating system command line regularly, then instead of setting the &lt;code&gt;category&lt;/code&gt; value at the beginning of the script, I would &lt;a href=&#34;https://cwcode.wordpress.com/2013/04/16/the-joys-of-rscript/&#34;&gt;pass it to Rscript as an argument with the script name&lt;/a&gt;.&lt;/p&gt;
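&lt;p&gt;A minimal sketch of that approach, assuming the rest of the script stays the same, replaces the script&amp;rsquo;s first assignment with a call to R&amp;rsquo;s &lt;code&gt;commandArgs()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;args &amp;lt;- commandArgs(trailingOnly=TRUE)  # just the arguments after the script name
category &amp;lt;- args[1]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then something like &lt;code&gt;Rscript /temp/myscript.R Electronics_companies_of_the_United_States&lt;/code&gt; would pass the category in.&lt;/p&gt;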
&lt;h2 id=&#34;id106949&#34;&gt;Learning more about R&lt;/h2&gt;
&lt;p&gt;Because of R&amp;rsquo;s age and academic roots, there is a lot of stray documentation around, often in LaTeXish-looking PDFs from several years ago. Many introductions to R are aimed at people in a specific field, and I suppose my blog entries here fall in this category.&lt;/p&gt;
&lt;p&gt;The best short, modern tour of R that I&amp;rsquo;ve found recently is Sharon Machlis&amp;rsquo;s six-part series beginning at &lt;a href=&#34;http://www.computerworld.com/article/2497143/business-intelligence-beginner-s-guide-to-r-introduction.html&#34;&gt;Beginner&amp;rsquo;s Guide to R: Introduction&lt;/a&gt;. Part six points to many other places to learn about R, ranging from blog entries to complete books to videos; reviewing the list now, I see more entries that I hadn&amp;rsquo;t noticed before and that look worth investigating.&lt;/p&gt;
&lt;p&gt;Her list is where I learned about Jeffrey M. Stanton&amp;rsquo;s &lt;a href=&#34;https://itunes.apple.com/us/book/introduction-to-data-science/id529088127?mt=11&#34;&gt;Introduction to Data Science&lt;/a&gt;, an excellent introduction both to data science and to the use of R for common data science analysis tasks. The link here goes to an iTunes version of the book, but there&amp;rsquo;s also a &lt;a href=&#34;http://surface.syr.edu/cgi/viewcontent.cgi?article=1165&amp;amp;context=istpub&#34;&gt;PDF&lt;/a&gt; version, which I read from beginning to end.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://en.wikibooks.org/wiki/R_Programming&#34;&gt;R Programming Wikibook&lt;/a&gt; makes a good quick reference work, especially when you need a particular function for something; see the table of contents down its right side. I found myself going back to the &lt;a href=&#34;https://en.wikibooks.org/wiki/R_Programming/Text_Processing&#34;&gt;Text Processing&lt;/a&gt; page there several times. The four-page &amp;ldquo;R Reference Card&amp;rdquo; (&lt;a href=&#34;http://cran.r-project.org/doc/contrib/Short-refcard.pdf&#34;&gt;pdf&lt;/a&gt;) by Tom Short is also worth printing out.&lt;/p&gt;
&lt;p&gt;Last week I mentioned John D. Cook&amp;rsquo;s &lt;a href=&#34;http://www.johndcook.com/blog/r_language_for_programmers/&#34;&gt;R language for programmers&lt;/a&gt;, a blog entry that will help anyone familiar with typical modern programming languages get over a few initial small humps more quickly when learning R.&lt;/p&gt;
&lt;p&gt;I described Machlis&amp;rsquo;s six-part series as &amp;ldquo;short&amp;rdquo; because there are so many full-length books on R out there, including free ones like Stanton&amp;rsquo;s and several offerings from O&amp;rsquo;Reilly and Manning. I&amp;rsquo;ve read the first few chapters of Manning&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1935182390/bobducharmeA&#34;&gt;R in Action&lt;/a&gt; by Robert Kabacoff and find it very helpful so far. Apparently a new edition is coming out in March, so if you&amp;rsquo;re thinking of buying it you may want to wait or else get the &lt;a href=&#34;http://www.manning.com/kabacoff2/&#34;&gt;early access edition&lt;/a&gt;. Manning&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1617291560/bobducharmeA&#34;&gt;Practical Data Science with R&lt;/a&gt; also looks good, but it assumes a bit of R background (in fact, it recommends &amp;ldquo;R in Action&amp;rdquo; as a starting point), so a real beginner to this area would be better off starting with Stanton&amp;rsquo;s free book mentioned above.&lt;/p&gt;
&lt;p&gt;O&amp;rsquo;Reilly has several books on R, including an &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0596809158/bobducharmeA&#34;&gt;R Cookbook&lt;/a&gt; whose very task-oriented table of contents is worth skimming, as well as an accompanying &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1449316956/bobducharmeA&#34;&gt;R Graphics Cookbook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I know that I&amp;rsquo;ll be going back to several of these books and web pages, because in the future whenever I use SPARQL to retrieve numeric data I&amp;rsquo;ll have some much more interesting ideas about what I can do with that data.&lt;/p&gt;
&lt;img id=&#34;id107116&#34; src=&#34;https://www.bobdc.com/img/main/fancyrplots.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;fancy R plots on Google&#34; width=&#34;480&#34;/&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
    </item>
    
    <item>
      <title>R (and SPARQL), part 1</title>
      <link>https://www.bobdc.com/blog/r-and-sparql-part-1/</link>
      <pubDate>Tue, 13 Jan 2015 08:26:20 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/r-and-sparql-part-1/</guid>
      
      
<description><div>Or, R for RDF people.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.r-project.org/&#34;&gt;R&lt;/a&gt; is a programming language and environment for statistical computing and graph generation that, despite roots reaching back to the 1970s, has gotten hot lately because it&amp;rsquo;s an open-source, cross-platform tool that brings a lot to the world of Data Science, a recently popular field often associated with the &lt;a href=&#34;https://www.bobdc.com/blog/hadoop#id123170&#34;&gt;analytics&lt;/a&gt; aspect of the drive towards Big Data. The large, active community around R has developed many add-on libraries, including one for working with data retrieved from SPARQL endpoints, so I thought I&amp;rsquo;d get to know R well enough to try that library. I first learned about this library from &lt;a href=&#34;http://www.programmingr.com/content/sparql-with-r/&#34;&gt;SPARQL with R in Less than 5 Minutes&lt;/a&gt;, which describes Semantic Web and Linked Data concepts to people familiar with R in order to demonstrate what they can do together; my goal here is to explain R to people familiar with RDF for the same reason. (Corrections to any misuse of statistical terminology are welcome.)&lt;/p&gt;
&lt;blockquote id=&#34;id119548&#34; class=&#34;pullquote&#34;&gt;an open-source, cross-platform tool that brings a lot to the world of Data Science&lt;/blockquote&gt;
&lt;p&gt;R has also been called &amp;ldquo;GNU S,&amp;rdquo; and first appeared in 1993 as an implementation of a statistical programming language developed at Bell Labs in 1976 known as &lt;a href=&#34;http://en.wikipedia.org/wiki/S_%28programming_language%29&#34;&gt;S&lt;/a&gt;. (This is cuter if you know that the C programming language was also developed at Bell Labs as a &lt;a href=&#34;http://en.wikipedia.org/wiki/C_%28programming_language%29#History&#34;&gt;successor&lt;/a&gt; to a language called B.) Its commercial competition includes &lt;a href=&#34;http://www.stata.com/&#34;&gt;Stata&lt;/a&gt;, &lt;a href=&#34;http://www.sas.com&#34;&gt;SAS&lt;/a&gt;, and &lt;a href=&#34;http://www-01.ibm.com/software/analytics/spss/&#34;&gt;SPSS&lt;/a&gt;, all of which have plenty to fear from R as its power and reputation grow while its cost stays at zero. According to a &lt;a href=&#34;http://www.nature.com/news/programming-tools-adventures-with-r-1.16609&#34;&gt;recent article in Nature&lt;/a&gt; on R&amp;rsquo;s growing popularity among scientists, &amp;ldquo;In the past decade, R has caught up with and overtaken the market leaders.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Downloading and installing R on a Windows machine gave me an icon that opened up the RGui windowed environment, which contains a console window where you enter commands that add other windows within RGui as needed for graphics. (The distribution also includes an executable that you can run from your operating system command line; as we&amp;rsquo;ll see next week, you can use this to run scripts as well.) Most discussions of R recommend the open source &lt;a href=&#34;http://www.rstudio.com/&#34;&gt;RStudio&lt;/a&gt; as a more serious IDE for R development, but RGui was enough for me to play around.&lt;/p&gt;
&lt;p&gt;Some of R&amp;rsquo;s syntax is a bit awkward in places, possibly because of its age—some of its source code is written in Fortran, and it actually &lt;a href=&#34;http://www.r-bloggers.com/fortran-and-r-speed-things-up/&#34;&gt;lets you call Fortran subroutines&lt;/a&gt;. I found some of its terminology to be awkward as well, but probably because it was designed for statisticians and not for programmers accustomed to typical modern programming languages. I highly recommend the quick tour of syntax quirks in &lt;a href=&#34;http://www.johndcook.com/blog/r_language_for_programmers/&#34;&gt;R language for programmers&lt;/a&gt; by John D. Cook for such people when they&amp;rsquo;re getting started with R.&lt;/p&gt;
&lt;p&gt;For example, where I think of a table or a spreadsheet as consisting of rows and columns, R describes a data frame of observations and variables, meaning essentially the same thing. Of the simpler structures that come up in R, a vector is a one-dimensional set (I almost said &amp;ldquo;array&amp;rdquo; or &amp;ldquo;list&amp;rdquo; instead of &amp;ldquo;set&amp;rdquo; but these have different, specific meanings in R) of values of the same type, a matrix is a two-dimensional version, and an array extends this to three or more dimensions. A data frame looks like a matrix but &amp;ldquo;columns can be different modes&amp;rdquo; (that is, different properties and types), as described on the &lt;a href=&#34;http://www.statmethods.net/input/datatypes.html&#34;&gt;Data types&lt;/a&gt; page of the Quick-R website. The same page says that &amp;ldquo;data frames are the main structures you&amp;rsquo;ll use to store datasets,&amp;rdquo; which makes sense when you consider their similarity to spreadsheets, relational database tables, and, in the RDF world, SPARQL result sets.&lt;/p&gt;
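&lt;p&gt;A few lines at the R prompt make these distinctions concrete (a toy sketch with made-up values):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; v &amp;lt;- c(10, 20, 30)         # vector: one dimension, one mode
&amp;gt; m &amp;lt;- matrix(1:6, nrow=2)   # matrix: two dimensions, one mode
&amp;gt; df &amp;lt;- data.frame(label=c(&amp;quot;a&amp;quot;,&amp;quot;b&amp;quot;), n=c(1,2))  # columns with different modes
&lt;/code&gt;&lt;/pre&gt;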
&lt;p&gt;I don&amp;rsquo;t want to make too much of what may look like quirky terminology and syntax to people accustomed to other modern programming languages. I have come to appreciate the way R makes the most popular statistical operations so easy to carry out—even easier than Excel or &lt;a href=&#34;http://www.libreoffice.org/discover/calc/&#34;&gt;LibreOffice Calc&lt;/a&gt;, which have a surprising number of basic statistical operations built in.&lt;/p&gt;
&lt;h2 id=&#34;id119458&#34;&gt;Retrieving data from a SPARQL endpoint&lt;/h2&gt;
&lt;p&gt;Below I&amp;rsquo;ve walked through a session of commands entered at an R command line; you can paste them into an R session yourself, omitting the &amp;gt; prompt shown before each command. Let&amp;rsquo;s say that, using data retrieved from DBpedia, I&amp;rsquo;m wondering if there&amp;rsquo;s a correlation between the number of employees and the amount of net income in a given set of companies. (I only used U.S. companies to make it easier to compare income figures.) Typically, companies with more employees have more net income, but do the two figures correlate more closely in some industries than in others? R lets you quantify and graph this correlation very easily, and along the way we&amp;rsquo;ll see a few other things that it can do.&lt;/p&gt;
&lt;p&gt;To start, I install the SPARQL package with this command, which starts up a wizard that loads it from a remote mirror:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; install.packages(&amp;quot;SPARQL&amp;quot;)  
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After R installed the package, I loaded it for use in this session. The &lt;code&gt;help()&lt;/code&gt; function can tell us more about an installed package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; library(SPARQL)
&amp;gt; help(package=&amp;quot;SPARQL&amp;quot;) 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;help()&lt;/code&gt; function pops up a browser window with documentation of the topic passed as an argument. You can pass any function name to &lt;code&gt;help()&lt;/code&gt; as well, so you can enter something like &lt;code&gt;help(library)&lt;/code&gt; or even &lt;code&gt;help(help)&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;id121918&#34;&gt;Analyzing the result&lt;/h2&gt;
&lt;p&gt;The next command uses R&amp;rsquo;s &lt;code&gt;&amp;lt;-&lt;/code&gt; assignment operator to assign a big multi-line string to the variable &lt;code&gt;query&lt;/code&gt;. The string holds a SPARQL query that will be sent to DBpedia; you can &lt;a href=&#34;http://dbpedia.org/snorql/?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+dcterms%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+dbo%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0D%0APREFIX+dbpprop%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2F%3E%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E%0D%0ASELECT+%3Flabel+%3FnumEmployees+%3FnetIncome++%0D%0AWHERE+%7B%0D%0A++%3Fs+dcterms%3Asubject+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3ACompanies_in_the_Dow_Jones_Industrial_Average%3E+%3B%0D%0A+++++rdfs%3Alabel+%3Flabel+%3B%0D%0A+++++dbo%3AnetIncome+%3FnetIncomeDollars+%3B%0D%0A+++++dbpprop%3AnumEmployees+%3FnumEmployees+.+%0D%0A+++++BIND%28replace%28%3FnumEmployees%2C%27%2C%27%2C%27%27%29+AS+%3Femployees%29++%23+lose+commas%0D%0A+++++FILTER+%28+lang%28%3Flabel%29+%3D+%27en%27+%29%0D%0A+++++FILTER%28contains%28%3FnetIncomeDollars%2C%27E%27%29%29%0D%0A+++++%23+Following+because+DBpedia+types+them+as+dbpedia%3Adatatype%2FusDollar%0D%0A+++++BIND%28xsd%3Afloat%28%3FnetIncomeDollars%29+AS+%3FnetIncome%29%0D%0A+++++%23+original+query+on+following+line+had+two+slashes%2C+but+%0D%0A+++++%23+R+needed+both+escaped%0D%0A+++++FILTER%28!%28regex%28%3FnumEmployees%2C%27%5C%5Cd%2B%27%29%29%29%0D%0A%7D%0D%0AORDER+BY+%3FnumEmployees&#34;&gt;run the same query on DBpedia&amp;rsquo;s SNORQL interface&lt;/a&gt; to get a preview of the data (the query sent by that link is slightly different—see the last SPARQL comment in the query below):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; query &amp;lt;- &amp;quot;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt;
PREFIX dbo: &amp;lt;http://dbpedia.org/ontology/&amp;gt;
PREFIX dbpprop: &amp;lt;http://dbpedia.org/property/&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
SELECT ?label ?numEmployees ?netIncome  
WHERE {
  ?s dcterms:subject &amp;lt;http://dbpedia.org/resource/Category:Companies_in_the_Dow_Jones_Industrial_Average&amp;gt; ;
     rdfs:label ?label ;
     dbo:netIncome ?netIncomeDollars ;
     dbpprop:numEmployees ?numEmployees . 
     BIND(replace(?numEmployees,&#39;,&#39;,&#39;&#39;) AS ?employees)  # lose commas
     FILTER ( lang(?label) = &#39;en&#39; )
     FILTER(contains(?netIncomeDollars,&#39;E&#39;))
     # Following because DBpedia types them as dbpedia:datatype/usDollar
     BIND(xsd:float(?netIncomeDollars) AS ?netIncome)
     # original query on following line had two slashes, but 
     # R needed both escaped
     FILTER(!(regex(?numEmployees,&#39;\\\\d+&#39;)))
}
ORDER BY ?numEmployees&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The query asks for the net income and employee count figures for companies that comprise the Dow Jones Industrial Average. The SPARQL comments within the query describe the query&amp;rsquo;s steps in more detail.&lt;/p&gt;
&lt;p&gt;Next, we assign the endpoint&amp;rsquo;s URL to the &lt;code&gt;endpoint&lt;/code&gt; variable and call the SPARQL package&amp;rsquo;s &lt;code&gt;SPARQL()&lt;/code&gt; function to send the query to that endpoint, storing the result in a &lt;code&gt;resultList&lt;/code&gt; variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; endpoint &amp;lt;- &amp;quot;http://dbpedia.org/sparql&amp;quot;
&amp;gt; resultList &amp;lt;- SPARQL(endpoint,query)
&amp;gt; typeof(resultList)
[1] &amp;quot;list&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The third command there, and R&amp;rsquo;s output, show that &lt;code&gt;resultList&lt;/code&gt; has a type of list, which is described on the &lt;a href=&#34;http://www.statmethods.net/input/datatypes.html&#34;&gt;Data types&lt;/a&gt; page mentioned earlier as an &amp;ldquo;ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.&amp;rdquo; (Compare this with a vector, where everything must have the same type, or in R-speak, the same mode.)&lt;/p&gt;
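&lt;p&gt;A quick toy illustration of the difference (my own example, not part of the query results): a vector coerces everything to a single mode, while a list lets each component keep its own type:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; c(1, &amp;quot;two&amp;quot;)       # vector: the number gets coerced to a character string
[1] &amp;quot;1&amp;quot;   &amp;quot;two&amp;quot;
&amp;gt; typeof(list(1, &amp;quot;two&amp;quot;))  # a list keeps each element&#39;s own type
[1] &amp;quot;list&amp;quot;
&lt;/code&gt;&lt;/pre&gt;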
&lt;p&gt;The next command uses the very handy &lt;code&gt;summary()&lt;/code&gt; function to learn more about what the &lt;code&gt;SPARQL()&lt;/code&gt; function put into the &lt;code&gt;resultList&lt;/code&gt; variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; summary(resultList)
           Length Class      Mode
results    3      data.frame list
namespaces 0      -none-     NULL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It shows a list of two things: our query results and an empty list of namespaces. Because we don&amp;rsquo;t care about the empty list of namespaces, we&amp;rsquo;ll make it easier to work with the &lt;code&gt;results&lt;/code&gt; part by pulling it out and storing it in its own &lt;code&gt;queryResult&lt;/code&gt; variable using the &lt;code&gt;$&lt;/code&gt; operator to identify the part of &lt;code&gt;resultList&lt;/code&gt; that we want. Then, we use the &lt;code&gt;str()&lt;/code&gt; function to learn more about what&amp;rsquo;s in there:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; queryResult &amp;lt;- resultList$results 
&amp;gt; str(queryResult)
&#39;data.frame&#39;:   27 obs. of  3 variables:
 $ label       : chr  &amp;quot;\&amp;quot;Visa Inc.\&amp;quot;@en&amp;quot; &amp;quot;\&amp;quot;The Travelers Companies\&amp;quot;@en&amp;quot; ...
 $ numEmployees: int  8500 30500 32900 44000 62800 64600 70000 ...
 $ netIncome   : num  2.14e+09 2.47e+09 8.04e+09 2.22e+09 5.36e+09 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output tells us that it&amp;rsquo;s a data frame, mentioned earlier as &amp;ldquo;the main structures you&amp;rsquo;ll use to store datasets,&amp;rdquo; with 27 obs[ervations] and 3 variables (that is, rows and columns).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;summary()&lt;/code&gt; function tells us some great stuff about a data frame—a set of information that would be much more work to retrieve if the same data was loaded into a spreadsheet program:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; summary(queryResult)
    label            numEmployees       netIncome        
 Length:27          Min.   :   8500   Min.   :2.144e+09  
 Class :character   1st Qu.:  72500   1st Qu.:4.863e+09  
 Mode  :character   Median : 107600   Median :8.040e+09  
                    Mean   : 205227   Mean   :1.050e+10  
                    3rd Qu.: 171711   3rd Qu.:1.530e+10  
                    Max.   :2200000   Max.   :3.258e+10  
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The SPARQL query&amp;rsquo;s SELECT statement asked for the label, numEmployees, and netIncome values, and we see some interesting information about the values returned for these, especially the numeric ones: the minimum, maximum, and mean (average) values of each, as well as the boundary values you get if you split the returned values as evenly as possible into four groups, known in statistics as &lt;a href=&#34;https://en.wikipedia.org/wiki/Quartile&#34;&gt;quartiles&lt;/a&gt;. The first quartile value marks the boundary between the bottom quarter and the next quarter, the median splits the values in half, and the third quartile splits the top quarter from the third one.&lt;/p&gt;
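&lt;p&gt;If you want those quartile boundaries on their own, R&amp;rsquo;s &lt;code&gt;quantile()&lt;/code&gt; function returns them directly (assuming the &lt;code&gt;queryResult&lt;/code&gt; data frame from above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; quantile(queryResult$numEmployees)
&lt;/code&gt;&lt;/pre&gt;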
&lt;p&gt;We can very easily ask for the variance—a measure of how far apart all the values are spread from the mean—as well as the standard deviation, a useful measurement for describing how far any specific value is from the mean:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; var(queryResult$numEmployees)
[1] 167791342395
&amp;gt; sd(queryResult$numEmployees)
[1] 409623.4
&lt;/code&gt;&lt;/pre&gt;
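&lt;p&gt;The standard deviation is just the square root of the variance, which is easy to confirm at the prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; sqrt(var(queryResult$numEmployees))
[1] 409623.4
&lt;/code&gt;&lt;/pre&gt;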
&lt;h2 id=&#34;id122364&#34;&gt;Our first plot: a histogram&lt;/h2&gt;
&lt;p&gt;For our first step into graphics, we&amp;rsquo;ll create a histogram, which illustrates the distribution of values. As with all R graphics, there are plenty of parameters available to control the image&amp;rsquo;s appearance, but we can get a pretty useful histogram by sticking with the defaults:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hist(queryResult$netIncome) 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When running this interactively, RGui opens up a new window and displays the image there:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/rguihistogram.jpg&#34;&gt;&lt;img id=&#34;id122382&#34; src=&#34;https://www.bobdc.com/img/main/rguihistogram.jpg&#34; width=&#34;600&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;histogram generated with R&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/r-and-sparql-part-2&#34;&gt;Next week&lt;/a&gt; we&amp;rsquo;ll learn how to plot the specific points in the data, how to make the graph titles look nicer, and how to quantify the correlation between the two sets of values. (If you&amp;rsquo;ve been entering the commands shown here, then when you quit R with the &lt;code&gt;quit()&lt;/code&gt; command or by picking Exit from RGui&amp;rsquo;s File menu, it offers to save your workspace image; all of the variables set in a session like this will then still be available the next time you start it up.) We&amp;rsquo;ll also see how to automate this series of steps to make it easier to generate a graph, with the correlation figure included, as a JPEG file. This automation will make it easier to graph the results and find the correlation figures for different industries. Finally, I&amp;rsquo;ll list the best resources I found for learning R—there are a lot of them out there, of wildly varying quality.&lt;/p&gt;
&lt;p&gt;Meanwhile, you can gaze at this R plot of a Mandelbrot set from &lt;a href=&#34;http://en.wikipedia.org/wiki/R_%28programming_language%29&#34;&gt;R&amp;rsquo;s Wikipedia page&lt;/a&gt;, which includes all the commands necessary to generate it:&lt;/p&gt;
&lt;img id=&#34;id122437&#34; src=&#34;https://www.bobdc.com/img/main/Mandelbrot_Creation_Animation.gif&#34; width=&#34;320&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Mandelbrot image generated with R&#34;/&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2015">2015</category>
      
    </item>
    
    <item>
      <title>Hadoop</title>
      <link>https://www.bobdc.com/blog/hadoop/</link>
      <pubDate>Sat, 13 Dec 2014 09:13:36 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/hadoop/</guid>
      
      
      <description><div>What it is and how people use it: my own summary.</div><div>&lt;p&gt;&lt;a href=&#34;http://hadoop.apache.org/&#34;&gt;&lt;img id=&#34;id120623&#34; src=&#34;https://www.bobdc.com/img/main/220px-Hadoop_logo.svg.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Hadoop logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The web offers plenty of introductions to what &lt;a href=&#34;http://hadoop.apache.org/&#34;&gt;Hadoop&lt;/a&gt; is about. After reading up on it and trying it out a bit, I wanted to see if I could sum up what I see as the main points as concisely as possible. Corrections welcome.&lt;/p&gt;
&lt;p&gt;Hadoop is an open source Apache project consisting of several modules. The key ones are the Hadoop Distributed File System (whose acronym is &lt;a href=&#34;http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F&#34;&gt;trademarked&lt;/a&gt;, apparently) and MapReduce. The HDFS lets you distribute storage across multiple systems and MapReduce lets you distribute processing across multiple systems by performing your &amp;ldquo;Map&amp;rdquo; logic on the distributed nodes and then the &amp;ldquo;Reduce&amp;rdquo; logic to gather up the results of the map processes on the master node that&amp;rsquo;s driving it all.&lt;/p&gt;
&lt;p&gt;This ability to spread out storage and processing makes it easier to do large-scale processing without requiring large-scale hardware. You can spread the processing across whatever boxes you have lying around or across virtual machines on a cloud platform that you spin up for only as long as you need them. This ability to inexpensively scale up has made Hadoop one of the most popular technologies associated with the buzzphrase &amp;ldquo;Big Data.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;id120673&#34;&gt;Writing Hadoop applications&lt;/h2&gt;
&lt;p&gt;Hardcore Hadoop usage often means writing the map and reduce tasks in Java programs that must import special Hadoop libraries and play by Hadoop rules; see the source of the Apache Hadoop Wiki&amp;rsquo;s &lt;a href=&#34;http://wiki.apache.org/hadoop/WordCount&#34;&gt;Word Count&lt;/a&gt; program for an example. (Word count programs are ubiquitous in Hadoop primers.) Then, once you&amp;rsquo;ve started up the Hadoop &lt;a href=&#34;http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#starting-your-single-node-cluster&#34;&gt;background processes&lt;/a&gt;, you can use Hadoop command line utilities to indicate the JAR file with your map and reduce logic and where on the HDFS to look for input and to put output. While your program runs, you can check on its progress with web interfaces to the various background processes.&lt;/p&gt;
&lt;p&gt;Instead of coding and compiling your own JAR file, one nice option is to use the hadoop-streaming-*.jar one that comes with the Hadoop distribution to hand off the processing to scripts you&amp;rsquo;ve written in just about any language that can read from standard input and write to standard output. There&amp;rsquo;s no need for these scripts to import any special Hadoop libraries. I found it very easy to go through Michael G. Noll&amp;rsquo;s &lt;a href=&#34;http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/&#34;&gt;Writing an Hadoop MapReduce Program in Python&lt;/a&gt; tutorial (creating yet another word count program) after first doing his &lt;a href=&#34;http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/&#34;&gt;Running Hadoop on Ubuntu Linux (Single-Node Cluster)&lt;/a&gt; tutorial to set up a small Hadoop environment. (If you try one of the many Hadoop tutorials you can find on the web, make sure to run the same version of Hadoop that the tutorial&amp;rsquo;s author did. The 2.* Hadoop releases are different enough from the 1.* ones that if you try to set up a distributed file system and share processing across it using a recent release while following instructions written using a 1.* release, there are more opportunities for problems. I had good luck with Hardik Pandya&amp;rsquo;s &amp;ldquo;How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2,&amp;rdquo; split into &lt;a href=&#34;http://java.dzone.com/articles/how-set-multi-node-hadoop&#34;&gt;Part 1&lt;/a&gt; and &lt;a href=&#34;http://letsdobigdata.wordpress.com/2014/01/13/setting-up-hadoop-1-2-1-multi-node-cluster-on-amazon-ec2-part-2/&#34;&gt;Part 2&lt;/a&gt;, when I used the same release that he did.)&lt;/p&gt;
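&lt;p&gt;To make the streaming model concrete, here is a minimal Python sketch of the usual word count pair of scripts. The function names and sample text are my own, not from the tutorials; in a real job, each function would be a standalone script reading standard input and printing to standard output, with Hadoop handling the sort between the two phases.&lt;/p&gt;

```python
# Minimal word count sketch of the two scripts hadoop-streaming expects.
# In a real job, mapper() and reducer() would each be a standalone script
# reading sys.stdin and printing to stdout; Hadoop sorts the mapper output
# by key before the reduce phase begins.
from itertools import groupby

def mapper(lines):
    # Map phase: emit one "word<TAB>1" record per word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Reduce phase: input arrives sorted by key, so consecutive records
    # for the same word can be summed with groupby.
    parsed = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda rec: rec[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Simulate Hadoop's map -> shuffle/sort -> reduce pipeline locally.
    sample = ["the quick brown fox", "the lazy dog"]
    for record in reducer(sorted(mapper(sample))):
        print(record)
```

&lt;p&gt;Simulating the pipeline locally like this (map, then sort, then reduce) is also a handy way to debug streaming scripts before submitting them to a cluster.&lt;/p&gt;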
&lt;h2 id=&#34;id120450&#34;&gt;Hadoop&amp;rsquo;s native scripting environments&lt;/h2&gt;
&lt;p&gt;Instead of writing your own applications, you can take advantage of the increasing number of native Hadoop scripting languages that shield you from the lower-level parts. Several popular ones build on &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/HCatalog&#34;&gt;HCatalog&lt;/a&gt;, a layer built on top of the HDFS. As the Hortonworks Hadoop tutorial &lt;a href=&#34;http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/&#34;&gt;Hello World! – An introduction to Hadoop with Hive and Pig&lt;/a&gt; puts it, &amp;ldquo;The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata like the schema. Additionally since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between tools.&amp;rdquo; You can work with HCatalog &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/HCatalog+CLI&#34;&gt;directly&lt;/a&gt;, but it&amp;rsquo;s more common to use these other tools that are built on top of it, and you&amp;rsquo;ll often see HCatalog mentioned in discussions of those tools. (For example, the same tutorial refers to the need to register a file with HCatalog before Hive or Pig can use it.)&lt;/p&gt;
&lt;p&gt;Apache &lt;a href=&#34;https://hive.apache.org/&#34;&gt;Hive&lt;/a&gt;, according to its home page, &amp;ldquo;facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.&amp;rdquo; You can &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHive&#34;&gt;start up Hive&lt;/a&gt; and enter HiveQL commands at its prompt or you can &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli&#34;&gt;pass it scripts&lt;/a&gt; instead of using it interactively. If you know the basics of SQL, you&amp;rsquo;ll be off and running pretty quickly. The 4:33 video &lt;a href=&#34;https://www.youtube.com/watch?v=iaZcgAwBbS0&#34;&gt;Demonstration of Apache Hive&lt;/a&gt; by Rob Kerr gives a nice short introduction to writing and running Hive scripts.&lt;/p&gt;
&lt;p&gt;Apache &lt;a href=&#34;http://pig.apache.org/&#34;&gt;Pig&lt;/a&gt; is another Hadoop utility that takes advantage of HCatalog. The &amp;ldquo;Pig Latin&amp;rdquo; scripting language is less SQL-like (but straightforward enough) and lets you create data structures on the fly so that you can pipeline data through a series of steps. You can run its commands &lt;a href=&#34;http://pig.apache.org/docs/r0.14.0/start.html#interactive-mode&#34;&gt;interactively&lt;/a&gt; at its grunt shell or in &lt;a href=&#34;http://pig.apache.org/docs/r0.14.0/start.html#batch-mode&#34;&gt;batch mode&lt;/a&gt; from the operating system command line.&lt;/p&gt;
&lt;p&gt;When should you use Hive and when should you use Pig? It&amp;rsquo;s a common topic of discussion; a Google search for &lt;a href=&#34;https://www.google.com/search?q=%22pig+vs.+hive%22&#34;&gt;&amp;ldquo;pig vs. hive&amp;rdquo;&lt;/a&gt; gets over 2,000 hits. Sometimes it&amp;rsquo;s just a matter of convention at a particular shop. The stackoverflow thread &lt;a href=&#34;http://stackoverflow.com/questions/3356259/difference-between-pig-and-hive-why-have-both/6924095&#34;&gt;Difference between Pig and Hive? Why have both?&lt;/a&gt; has some good points as well as pointers to more detailed discussions, including a &lt;a href=&#34;https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html&#34;&gt;Yahoo developer network discussion&lt;/a&gt; that doesn&amp;rsquo;t mention Hive by name but has a good description of the basics of Pig and how it compares to an SQL approach.&lt;/p&gt;
&lt;blockquote id=&#34;id123024&#34; class=&#34;pullquote&#34;&gt;You know what would be cool? A Hive adapter for D2R.&lt;/blockquote&gt;
&lt;p&gt;Hive and Pig are both very big in the Hadoop world, but plenty of other such tools are coming along. The home page of Apache &lt;a href=&#34;https://storm.apache.org/&#34;&gt;Storm&lt;/a&gt; tells us that it &amp;ldquo;makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.&amp;rdquo; Apache &lt;a href=&#34;https://spark.apache.org/&#34;&gt;Spark&lt;/a&gt; provides Java, Scala, and Python APIs and promises greater speed and an ability to layer on top of many different classes of data sources as its main advantages. There are other tools, but I mention these two because according to the recent &lt;a href=&#34;http://www.oreilly.com/data/free/files/2014-data-science-salary-survey.pdf&#34;&gt;O&amp;rsquo;Reilly 2014 Data Science Salary Survey&lt;/a&gt;, &amp;ldquo;Storm and Spark users earn the highest median salary&amp;rdquo; of all the data science tools they surveyed. Neither is restricted to use with Hadoop, but the big players described below advertise support for one or both as advantages of their Hadoop distributions.&lt;/p&gt;
&lt;p&gt;Another popular tool in the Hadoop ecosystem is Apache &lt;a href=&#34;http://hbase.apache.org/&#34;&gt;HBase&lt;/a&gt;, the most well-known of the column-oriented NoSQL databases. It can sit on top of HDFS, and its tables can host both input and output for MapReduce jobs.&lt;/p&gt;
&lt;h2 id=&#34;id123126&#34;&gt;The big players&lt;/h2&gt;
&lt;p&gt;The companies &lt;a href=&#34;http://www.cloudera.com&#34;&gt;Cloudera&lt;/a&gt;, &lt;a href=&#34;http://hortonworks.com&#34;&gt;Hortonworks&lt;/a&gt;, and &lt;a href=&#34;https://www.mapr.com/&#34;&gt;MapR&lt;/a&gt; have gotten famous and made plenty of money selling and supporting packaged Hadoop distributions that include additional tools to make them easier to set up and use than the Apache downloads. After hearing that Hortonworks stayed closer to the open source philosophy than the others, I tried their distribution and found that it includes many additional web-based tools to shield you from the command line. For example, it lets you enter Hive and Pig Latin commands into IDE-ish windows designed around these tools, and it includes a graphical drag-and-drop file browser interface to the HDFS. I found the tutorials in the &amp;ldquo;Hello World&amp;rdquo; section of their &lt;a href=&#34;http://hortonworks.com/tutorials/&#34;&gt;Tutorials&lt;/a&gt; page to be very helpful. I have no experience with the other two companies, but a Google search on &lt;a href=&#34;https://www.google.com/search?q=cloudera+hortonworks+mapr&#34;&gt;cloudera hortonworks mapr&lt;/a&gt; finds a &lt;em&gt;lot&lt;/em&gt; of discussions out there comparing the three.&lt;/p&gt;
&lt;p&gt;Pre-existing big IT names such as IBM and Microsoft have also jumped into the Hadoop market; when you do a Google search for just &lt;a href=&#34;https://www.google.com/search?q=hadoop&#34;&gt;hadoop&lt;/a&gt;, it&amp;rsquo;s interesting to see how much various companies have apparently paid for Google AdWords placement.&lt;/p&gt;
&lt;h2 id=&#34;id123170&#34;&gt;Hadoop&amp;rsquo;s future&lt;/h2&gt;
&lt;p&gt;One of Hadoop&amp;rsquo;s main uses so far has been to batch process large amounts of data (usually data that fits into one giant table, such as server or transaction logs) to harvest summary data that can be handed off to analytics packages. This is why &lt;a href=&#34;http://www.sas.com&#34;&gt;SAS&lt;/a&gt; and &lt;a href=&#34;http://www.pentaho.com&#34;&gt;Pentaho&lt;/a&gt;, who do not have their own Hadoop distributions, have paid for good Google AdWords placement when you search for &amp;ldquo;hadoop&amp;rdquo;—they want you to use their products for the analytics part.&lt;/p&gt;
&lt;p&gt;A hot area of growth seems to be the promise of using Hadoop for more real-time processing, which is driving the escalation in Storm and Spark&amp;rsquo;s popularity. Even in batch processing, there are still plenty of new opportunities in the Hadoop world as people adapt more kinds of data for use with the growing tool set. The &amp;ldquo;one giant table&amp;rdquo; representation is usually necessary to ease the splitting up of your data for distribution across multiple nodes; with my RDF hat on, I think there are some interesting possibilities for representing complex data structures in Hadoop using the &lt;a href=&#34;http://www.w3.org/TR/n-triples/&#34;&gt;N-Triples&lt;/a&gt; RDF syntax, which will still look like one giant three- (or four-) column table to Hadoop.&lt;/p&gt;
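&lt;p&gt;Here is a rough Python sketch (with invented sample triples) of why N-Triples splits so cleanly: each line is a complete subject/predicate/object statement, so a mapper can treat the file as a three-column table just as it would a log file.&lt;/p&gt;

```python
# A rough sketch (invented sample data) of treating N-Triples as a
# three-column table: every line is a complete statement, so the file
# can be split across Hadoop nodes as easily as a server log.
ntriples = [
    '<urn:emp1> <http://xmlns.com/foaf/0.1/name> "Jane Smith" .',
    '<urn:emp1> <http://example.org/worksFor> <urn:dept3> .',
    '<urn:emp2> <http://example.org/worksFor> <urn:dept3> .',
]

def columns(line):
    # The subject and predicate never contain spaces; the object may,
    # so split at most twice and trim the trailing " ." terminator.
    s, p, o = line.split(" ", 2)
    return s, p, o.rstrip(" .")

# A map-style pass over the "table": count employees per department.
dept_counts = {}
for s, p, o in (columns(line) for line in ntriples):
    if p == "<http://example.org/worksFor>":
        dept_counts[o] = dept_counts.get(o, 0) + 1

print(dept_counts)
```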
&lt;p&gt;Cloudera&amp;rsquo;s Paolo Castagna has done some work in this direction, as described in his presentation &amp;ldquo;Handling RDF data with tools from the Hadoop ecosystem&amp;rdquo; (&lt;a href=&#34;http://archive.apachecon.com/eu2012/presentations/07-Wednesday/L2L-Web_Infra/aceu-2012-handling-rdf-data-with-tools-from-the-hadoop-ecosystem.pdf&#34;&gt;pdf&lt;/a&gt;). A more recent presentation &lt;a href=&#34;http://www.slideshare.net/RobVesse/quadrupling-your-elephants-rdf-and-the-hadoop-ecosystem&#34;&gt;Quadrupling your Elephants: RDF and the Hadoop Ecosystem&lt;/a&gt; by YarcData&amp;rsquo;s Rob Vesse shows some interesting work as well, including the beginnings of some Jena-based tools for processing RDF with Hadoop. There has been some work at the University of Freiburg on SPARQL query processing using Hadoop (&lt;a href=&#34;https://github.com/lidingpku/iswc2014/blob/master/paper/87960161-sempala-interactive-sparql-query-processing-on-hadoop.pdf?raw=true&#34;&gt;pdf&lt;/a&gt;), and &lt;a href=&#34;http://sparqlcity.com/&#34;&gt;SPARQL City&lt;/a&gt; also offers a SPARQL front end to Hadoop-based storage. (If anyone&amp;rsquo;s looking for a semantic web project idea, you know what would be cool? A Hive adapter for &lt;a href=&#34;http://d2rq.org/d2r-server&#34;&gt;D2R&lt;/a&gt;.) I think there&amp;rsquo;s a very bright future for the cross-pollination of all of these tools.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/big-data">big data</category>
      
    </item>
    
    <item>
      <title>Querying aggregated Walmart and BestBuy data with SPARQL</title>
      <link>https://www.bobdc.com/blog/querying-aggregated-walmart-an/</link>
      <pubDate>Sun, 09 Nov 2014 09:35:56 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-aggregated-walmart-an/</guid>
      
      
<description><div>From structured data in their web pages!</div><div>&lt;img id=&#34;id124993&#34; src=&#34;https://www.bobdc.com/img/main/walmartbestbuysparql.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Walmart and BestBuy product data queried with SPARQL&#34;/&gt;
&lt;p&gt;The combination of &lt;a href=&#34;http://www.w3.org/TR/microdata/&#34;&gt;microdata&lt;/a&gt; and &lt;a href=&#34;http://schema.org/&#34;&gt;schema.org&lt;/a&gt; seems to have hit a sweet spot that has helped both to get a lot of traction. I&amp;rsquo;ve been learning more about microdata recently, but even before I did, I found that the W3C&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/2012/pyMicrodata/&#34;&gt;Microdata to RDF Distiller&lt;/a&gt; written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product&amp;rsquo;s web page, some interesting queries become possible for comparing prices and other information from the two vendors.&lt;/p&gt;
&lt;p&gt;I extracted the data describing six external USB drives from both walmart.com and bestbuy.com, limiting myself to models that were available on both websites. (Instead of pulling it separately from the twelve individual web pages, it would have been nice to automate this a bit more. I did sign up for &lt;a href=&#34;https://developer.walmartlabs.com/&#34;&gt;Walmart&amp;rsquo;s API program&lt;/a&gt;, which was easy to try out, but the part of the API that lets you query products by category is &amp;ldquo;restricted, and is available on a request basis&amp;rdquo; according to their &lt;a href=&#34;https://developer.walmartlabs.com/docs/read/Home&#34;&gt;Data Feed API home page&lt;/a&gt;, so I didn&amp;rsquo;t bother. If I was going to pursue this further I would enroll in &lt;a href=&#34;https://developer.bestbuy.com/apis&#34;&gt;BestBuy&amp;rsquo;s Developer Program&lt;/a&gt; as well.) After using the Distiller form to do this several times, I downloaded its Python script from the &lt;a href=&#34;https://github.com/RDFLib/pymicrodata&#34;&gt;pymicrodata github page&lt;/a&gt; and found it easy to run locally.&lt;/p&gt;
&lt;p&gt;You can see a Turtle file of aggregated &lt;a href=&#34;http://snee.com/bobdc.blog/files/walmartplusbestbuy.ttl&#34;&gt;Walmart plus Bestbuy data here&lt;/a&gt;. Because of some slight differences in how they treated certain bits of data, I was tempted to clean up the aggregated data before querying it, but I really wanted to write queries that would work on the data in its native form, so I put the cleanup steps right in the queries.&lt;/p&gt;
&lt;p&gt;The various queries that I wrote led up to this one, which lists all the products by model number and price for easy comparison:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX schema: &amp;lt;http://schema.org/&amp;gt; 
PREFIX xsd:    &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; 


SELECT ?productName ?modelNumber ?price ?sellerName 
WHERE {
   ?product a schema:Product . 
   ?product schema:name ?productNameVal . 
   # str() to strip any language tags
   BIND(str(?productNameVal) AS ?productName)
   ?product schema:model ?modelNumberVal . 
   BIND(str(?modelNumberVal) AS ?modelNumber)
   ?product schema:offers ?offer . 
   ?offer a schema:Offer . 
   ?offer schema:price ?priceVal . 
   # Remove $ and cast to decimal
   BIND(xsd:decimal(replace(?priceVal,&amp;quot;\$&amp;quot;,&amp;quot;&amp;quot;)) AS ?price)
   ?offer schema:seller ?seller. 
   # In case there&#39;s a level of indirection for seller name
   OPTIONAL {
    ?seller schema:name ?sellerSchemaName . 
   }
   BIND(str(coalesce(?sellerSchemaName,?seller)) AS ?sellerName )
}
ORDER BY ?modelNumber ?price
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each comment in the query describes how it accounts for some difference between the Walmart microdata and the BestBuy microdata—for example, the BestBuy data included a dollar sign with prices, but the Walmart data did not.&lt;/p&gt;
&lt;p&gt;After running the query, requesting &lt;a href=&#34;http://www.w3.org/TR/rdf-sparql-XMLres/&#34;&gt;XML&lt;/a&gt; output, and then running a little XSLT on that output, I ended up with the table shown below.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Product Name                                                                                                    Model Number         Price    Seller Name
---------------------------------------------------------------------------------------------------------------------------------------------------------
Buffalo - DriveStation Axis Velocity 2TB External USB 3.0/2.0 Hard Drive                                        HD-LX2.0TU3          106.99   BestBuy
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black   HD-LX2.0TU3          108.25   Walmart.com
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black   HD-LX2.0TU3          129.45   pcRUSH
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black   HD-LX2.0TU3          143.69   Tonzof

Toshiba - Canvio Basics 1 TB External Hard Drive                                                                HDTB210XK3BA         68.60    Buy.com
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive                                                           HDTB210XK3BA         73.84    pcRUSH
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive                                                           HDTB210XK3BA         99.0     Walmart.com

Toshiba Canvio Basics 2TB USB 3.0 External Hard Drive                                                           HDTB220XK3CA         103.14   Walmart.com
Toshiba - Canvio Basics Hard Drive                                                                              HDTB220XK3CA         108.57   Buy.com

Seagate - Backup Plus Slim 1TB External USB 3.0/2.0 Portable Hard Drive - Black                                 STDR1000100          69.99    BestBuy
Seagate Backup Plus 1TB Slim Portable External Hard Drive, Black                                                STDR1000100          89.99    Walmart.com

WD - My Book 3TB External USB 3.0 Hard Drive - Black                                                            WDBFJK0030HBK-NESN   128.99   BestBuy
WD My Book 3TB USB 3.0 External Hard Drive                                                                      WDBFJK0030HBK-NESN   129.99   Walmart.com

WD - My Book 4TB External USB 3.0 Hard Drive - Black                                                            WDBFJK0040HBK-NESN   149.99   BestBuy
WD My Book 4TB USB 3.0 External Hard Drive                                                                      WDBFJK0040HBK-NESN   169.99   Walmart.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Vendors other than Walmart and BestBuy on the list were included in the Walmart data.&lt;/p&gt;
&lt;p&gt;Unfortunately, since I pulled the data that I was working with on October 15th, Walmart seems to have changed their web pages so that the W3C Microdata to RDF Distiller doesn&amp;rsquo;t find the data in them anymore. I still see schema.org microdata in the source of a page like &lt;a href=&#34;http://www.walmart.com/ip/Toshiba-Retail-Hard-Drives-HDTB210XK3BA-1tb-Canvio-Basics-Usb-3.-0/36467233&#34;&gt;this Walmart page for an external hard drive&lt;/a&gt;, but I guess it&amp;rsquo;s arranged differently. Perhaps they didn&amp;rsquo;t want people using standards-based technology to automate the process of finding out that BestBuy&amp;rsquo;s external hard drives usually cost less, or at least did in mid-October. A random check of products on other websites showed that the Distiller could pull useful data out of pages on &lt;a href=&#34;http://www.target.com&#34;&gt;target.com&lt;/a&gt;, &lt;a href=&#34;http://www.llbean.com&#34;&gt;llbean.com&lt;/a&gt;, and &lt;a href=&#34;http://www.marksandspencer.com/&#34;&gt;marksandspencer.com&lt;/a&gt;, so plenty of other major retailers are providing schema.org microdata in their product web pages.&lt;/p&gt;
&lt;p&gt;The important thing is that, even before I knew anything about the structure and syntax of microdata, a publicly available open source program let me pull and aggregate data from different big box stores&amp;rsquo; web sites so that I could query the combination with SPARQL. As more and more brand-name retailers make such data available, some very interesting applications will become possible.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Dropping OPTIONAL blocks from SPARQL CONSTRUCT queries</title>
      <link>https://www.bobdc.com/blog/dropping-optional-blocks-from/</link>
      <pubDate>Mon, 06 Oct 2014 19:42:01 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/dropping-optional-blocks-from/</guid>
      
      
      <description><div>And retrieving those triples much, much faster.</div><div>&lt;img id=&#34;id123899&#34; src=&#34;https://www.bobdc.com/img/main/animals.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;animals taxonomy&#34;/&gt;
&lt;p&gt;While preparing a demo for the upcoming &lt;a href=&#34;http://www.taxonomybootcamp.com/2014/Tuesday.aspx&#34;&gt;Taxonomy Boot Camp&lt;/a&gt; conference, I hit upon a trick for revising SPARQL CONSTRUCT queries so that they don&amp;rsquo;t need OPTIONAL blocks. As I wrote in the new &amp;ldquo;Query Efficiency and Debugging&amp;rdquo; chapter in the second edition of &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt;, &amp;ldquo;Academic papers on SPARQL query optimization agree: OPTIONAL is the guiltiest party in slowing down queries, adding the most complexity to the job that the SPARQL processor must do to find the relevant data and return it.&amp;rdquo; My new trick not only made the retrieval much faster; it also made it possible to retrieve a lot more data from a remote endpoint.&lt;/p&gt;
&lt;p&gt;First, let&amp;rsquo;s look at a simple version of the use case. DBpedia has a lot of &lt;a href=&#34;http://www.w3.org/TR/2009/REC-skos-reference-20090818/&#34;&gt;SKOS&lt;/a&gt; taxonomy data in it, and at Taxonomy Boot Camp I&amp;rsquo;m going to show how you can pull down and use that data. Now, imagine that a little animal taxonomy like the one shown in the illustration here is stored on an endpoint and I want to write a query to retrieve all the triples showing preferred labels and &amp;ldquo;has broader&amp;rdquo; values up to three levels down from the Mammal concept, assuming that the taxonomy uses SKOS to represent its structure (the &lt;code&gt;v:&lt;/code&gt; prefix in the queries that follow stands for the taxonomy&amp;rsquo;s own vocabulary namespace). The following query asks for all three levels of the taxonomy below Mammal, but it won&amp;rsquo;t get the whole taxonomy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label .
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with any SPARQL query, it&amp;rsquo;s only going to return triples for which &lt;em&gt;all&lt;/em&gt; the triple patterns in the WHERE clause match. While Horse may have a broader value of Mammal and therefore match the triple pattern &lt;code&gt;{?level1 skos:broader v:Mammal}&lt;/code&gt;, there are no nodes that have Horse as a broader value, so there will be no match for &lt;code&gt;{?level2 skos:broader v:Horse}&lt;/code&gt;. So, the Horse triples won&amp;rsquo;t be in the output. The same thing will happen with the Cat triples; only the Dog ones, which go down three levels below Mammal, will match the graph pattern in the WHERE clause above.&lt;/p&gt;
&lt;p&gt;If we want a CONSTRUCT query that retrieves all the triples of the subtree under Mammal, we need a way to retrieve the Horse and Cat concepts and any descendants they have, even if they have no descendants, and OPTIONAL makes this possible. The following will do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  OPTIONAL {
    ?level2 skos:broader ?level1 ;
            skos:prefLabel ?level2label .
  }
  OPTIONAL {
    ?level3 skos:broader ?level2 ;
            skos:prefLabel ?level3label . 
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The problem: this doesn&amp;rsquo;t scale. When I sent a nearly identical query to DBpedia to ask for the triples representing the hierarchy three levels down from &amp;lt;&lt;a href=&#34;http://dbpedia.org/page/Category:Mammals&#34;&gt;http://dbpedia.org/resource/Category:Mammals&lt;/a&gt;&amp;gt;, it timed out after 20 minutes, because the two OPTIONAL graph patterns gave DBpedia too much work to do.&lt;/p&gt;
&lt;p&gt;As a review, let&amp;rsquo;s restate the problem: we want the identified concept and the preferred labels and broader values of concepts up to three levels down from that concept, but without using the OPTIONAL keyword. How can we do this?&lt;/p&gt;
&lt;p&gt;By asking for each level in a separate query. When I split the DBpedia version of the query above into the following three queries, each retrieved its data in under a second, retrieving a total of 2,597 triples representing a taxonomy of 1,107 concepts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# query 1
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
CONSTRUCT {
  &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; ;
          skos:prefLabel ?level1label .  
}
WHERE {
  ?level1 skos:broader &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; ;
          skos:prefLabel ?level1label .  
}


# query 2
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
CONSTRUCT {
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
}
WHERE {
  ?level1 skos:broader &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; .
  ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .  
}


# query 3
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
CONSTRUCT {
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
WHERE {
  ?level2 skos:broader/skos:broader &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; .
  ?level3 skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Going from timing out after 20 minutes to successful execution in under 3 seconds is quite a performance improvement. Below, you can see how the beginning of a small piece of this taxonomy looks in TopQuadrant&amp;rsquo;s &lt;a href=&#34;http://www.topquadrant.com/products/topbraid-enterprise-vocabulary-net/&#34;&gt;TopBraid EVN&lt;/a&gt; vocabulary manager. At the first level down, you can only see &lt;a href=&#34;https://en.wikipedia.org/wiki/Category:Afrosoricida&#34;&gt;Afrosoricida&lt;/a&gt;, &lt;a href=&#34;https://en.wikipedia.org/wiki/Category:Australosphenida&#34;&gt;Australosphenida&lt;/a&gt;, and &lt;a href=&#34;https://en.wikipedia.org/wiki/Category:Bats&#34;&gt;Bats&lt;/a&gt; in the picture; I then drilled down three more levels from there to show that &lt;a href=&#34;https://en.wikipedia.org/wiki/Category:Fictional_bats&#34;&gt;Fictional bats&lt;/a&gt; has the single subcategory &lt;a href=&#34;https://en.wikipedia.org/wiki/Category:Silverwing&#34;&gt;Silverwing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As you can tell from the Mammals URI in the queries above, these taxonomy concepts are categories, and each category has at least one member (for example, &lt;a href=&#34;http://en.wikipedia.org/wiki/Category:Bats_as_food&#34;&gt;Bats as food&lt;/a&gt;) in Wikipedia and is therefore represented as triples in DBpedia, ready for you to retrieve with SPARQL CONSTRUCT queries. I didn&amp;rsquo;t retrieve any instance triples here, but it&amp;rsquo;s great to know that they&amp;rsquo;re available, and that this technique for avoiding CONSTRUCT graph patterns will serve me for much more than SKOS taxonomy work.&lt;/p&gt;
&lt;p&gt;There has been plenty of talk lately on Twitter and in blogs about how it&amp;rsquo;s not a good idea for important applications to have serious dependencies on public SPARQL endpoints such as DBpedia. (Orri Erling has one of the most level-headed discussions of this that I&amp;rsquo;ve seen in &lt;a href=&#34;http://www.openlinksw.com/weblog/oerling/?id=1815&#34;&gt;SEMANTiCS 2014 (part 3 of 3): Conversations&lt;/a&gt;; in my posting &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-journal-article-o&#34;&gt;Semantic Web Journal article on DBpedia&lt;/a&gt; on this blog I described a great article that lists other options.) There&amp;rsquo;s all this great data to use in DBpedia, and besides spinning up an Amazon Web Services image with your own copy of DBpedia, as Orri suggests, you can pull down the data you need and store it locally while the endpoint is up. If you&amp;rsquo;re unsure about the structure and connections of the data you&amp;rsquo;re pulling down, OPTIONAL graph patterns seem like an obvious fix, but this trick for splitting up CONSTRUCT queries to avoid OPTIONAL graph patterns means that you can pull down a lot more data a lot more efficiently.&lt;/p&gt;
&lt;h2 id=&#34;i1&#34;&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=eAzhz29eVec&#34;&gt;Stickin&amp;rsquo; to the UNION&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;October 16th update:&lt;/em&gt; Once I split out the pieces of the original query into separate files, it should have occurred to me to at least try joining them back up into a single query with UNION instead of OPTIONAL, but it didn&amp;rsquo;t. Luckily for me, &lt;a href=&#34;https://twitter.com/wohnjalker&#34;&gt;John Walker&lt;/a&gt; suggested in the &lt;a href=&#34;https://plus.google.com/u/1/101006505484718936507/posts/R73sqAwMPzk&#34;&gt;comments&lt;/a&gt; for this blog entry that I try this, so I did. It worked great, with the benefit of being simpler to read and maintain than using a collection of queries to retrieve a single set of triples. This version only took three seconds to retrieve the triples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
CONSTRUCT {
  &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; ;
          skos:prefLabel ?level1label .  
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .
}
WHERE {
  ?level1 skos:broader &amp;lt;http://dbpedia.org/resource/Category:Mammals&amp;gt; ;
          skos:prefLabel ?level1label .  
  {
    ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .
  }
  UNION
  {
    ?level2 skos:broader ?level1 .
    ?level3 skos:broader ?level2 ;  
            skos:prefLabel ?level3label .  
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are two lessons here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If you&amp;rsquo;ve figured out a way to do something better, don&amp;rsquo;t be too satisfied too quickly—keep trying to make it even better.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;UNION is going to be useful in more situations than I originally thought it would.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;img id=&#34;id126274&#34; src=&#34;https://www.bobdc.com/img/main/mammalsDbpedia.png&#34; border=&#34;0&#34; alt=&#34;animals taxonomy&#34;/&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>A schemaless computer database in 1965</title>
      <link>https://www.bobdc.com/blog/a-schemaless-computer-database/</link>
      <pubDate>Sat, 13 Sep 2014 11:09:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-schemaless-computer-database/</guid>
      
      
      <description><div>To enable flexible metadata aggregation, among other things.</div><div>&lt;img id=&#34;id117341&#34; src=&#34;https://www.bobdc.com/img/main/varLengthRecordLayout.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;figure 3&#34; width=&#34;360&#34;/&gt;
&lt;p&gt;I&amp;rsquo;ve been reading up on America&amp;rsquo;s post-war attempt to keep up the accelerated pace of R&amp;amp;D that began during World War II. This effort led to an infrastructure that made accomplishments such as the moon landing and the Internet possible; it also led to some very dry literature, and I&amp;rsquo;m mostly interested in what new metadata-related techniques were developed to track and share the products of the research as they led to development.&lt;/p&gt;
&lt;p&gt;One dry bit of literature is the proceedings of the 1965 &lt;a href=&#34;http://www.worldcat.org/title/toward-a-national-information-system-second-annual-national-colloquium-on-information-retrieval-april-23-24-1965-philadelphia-pennsylvania/oclc/675171&#34;&gt;Toward a National Information System: Second Annual National Colloquium On Information Retrieval&lt;/a&gt;. The conference was sponsored by the American Documentation Institute, which had a big role in the post-war information sharing work, as well as by the University of Pennsylvania&amp;rsquo;s Moore School of Electrical Engineering (where Eckert and Mauchly built &lt;a href=&#34;http://en.wikipedia.org/wiki/ENIAC&#34;&gt;ENIAC&lt;/a&gt; and its successor &lt;a href=&#34;http://en.wikipedia.org/wiki/EDVAC&#34;&gt;EDVAC&lt;/a&gt;) and some ACM chapters.&lt;/p&gt;
&lt;p&gt;In a chapter on how the North American Aviation company (now part of Boeing) revamped their practices for sharing information among divisions, I came across this description of some very flexible metadata storage:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All bibliographic information contained in both the corporate and divisional Electronic Data Processing (EDP) subsystems is retained permanently on magnetic tape in the form of variable length records containing variable length fields. Each field, with the exception of sort keys, consists of three adjacent field parts: field character count, field identification, and field text (see Figure 3). There are several advantages to this format: it is extremely compact, thereby reducing computer read-write time; it provides for definition and consequent addition of new types of fields of bibliographic information without reformatting extant files; and its flexibility allows conversion of files from other indexing abstracting services.&lt;/p&gt;
&lt;/blockquote&gt;
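&lt;p&gt;The layout that the quotation describes is easy to model. This little Python sketch is my own reconstruction; the chapter gives no exact widths, so the three-digit character count and two-character field IDs are guesses for illustration:&lt;/p&gt;

```python
# A sketch of the self-describing field layout from the 1965 paper:
# each field is (character count, field identification, field text).
# The 3-digit count and 2-character IDs are illustrative guesses;
# the chapter doesn't give the exact encoding.

def encode_field(field_id: str, text: str) -> str:
    # 3-digit field character count, 2-character field ID, then the text.
    return f"{len(text):03d}{field_id}{text}"

def decode_record(record: str) -> dict:
    fields = {}
    pos = 0
    while pos < len(record):
        count = int(record[pos:pos + 3])              # field character count
        field_id = record[pos + 3:pos + 5]            # field identification
        fields[field_id] = record[pos + 5:pos + 5 + count]  # field text
        pos += 5 + count
    return fields

# A record can gain a brand-new field type without reformatting old data.
record = (encode_field("TI", "Toward a National Information System")
          + encode_field("AU", "North American Aviation"))
print(decode_record(record))
```

Because each field carries its own identification and length, a new field type can be added to new records without touching any extant files, which is exactly the advantage the quotation claims.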
&lt;p&gt;I especially like that &amp;ldquo;it provides for definition and consequent addition of new types of fields of bibliographic information without reformatting extant files.&amp;rdquo; This reminds me of &lt;a href=&#34;http://www.slideshare.net/bobdc/semantic-web-standards-and-the-variety-v-of-big-data/7&#34;&gt;one slide&lt;/a&gt; in my presentation last month at the &lt;a href=&#34;http://semtechbizsj2014.semanticweb.com&#34;&gt;Semantic Technology and Business&lt;/a&gt; / &lt;a href=&#34;http://nosql2014.dataversity.net/&#34;&gt;NoSQL Now!&lt;/a&gt; conferences, where my talk was on a track shared by both, about how a key advantage of schemaless NoSQL databases is the ability to add a new value for a new property to a data set with no need for the schema evolution steps that can be so painful in a relational database.&lt;/p&gt;
&lt;p&gt;Moore&amp;rsquo;s law has led to less of a reliance on arranging data in tables to allow the efficient retrieval of that data. The various NoSQL options have explored new ways to do this, and it was great to see that one aerospace company was doing it 49 years ago. Of course, retrieving data from &lt;a href=&#34;http://en.wikipedia.org/wiki/Magnetic_tape_data_storage&#34;&gt;magnetic tape&lt;/a&gt; is less efficient than modern alternatives, but it was a big step past the use of piles of punched cards, and pretty modern for its time, as you can see from the tape spools in the picture of EDVAC&amp;rsquo;s gleaming successor below. Tabular representation of data long predates relational databases (&lt;a href=&#34;http://en.wikipedia.org/wiki/Hierarchical_database_model&#34;&gt;hierarchical&lt;/a&gt; and &lt;a href=&#34;http://en.wikipedia.org/wiki/Network_model&#34;&gt;network&lt;/a&gt; databases also stored sets of entities as tables, but with much less flexibility), so I thought it was cool that someone had implemented such a flexible model so long ago, especially to represent metadata, with a use case that we often see now with RDF: to allow &amp;ldquo;conversion of files from other indexing abstracting services&amp;rdquo;—in other words, to accommodate the aggregation of metadata from other sources that may not have structured their data the same way that yours is structured.&lt;/p&gt;
&lt;img id=&#34;id117169&#34; src=&#34;https://www.bobdc.com/img/main/Univac_9400.jpg&#34; border=&#34;0&#34; vspace=&#34;30px&#34; alt=&#34;Univac 9400&#34; width=&#34;640&#34;/&gt;
&lt;p&gt;Univac photo by &lt;a href=&#34;http://www.technikum29.de/en/computer/univac9400&#34;&gt;H. Müller&lt;/a&gt; &lt;a href=&#34;http://creativecommons.org/licenses/by-sa/2.5&#34;&gt;CC-BY-SA-2.5&lt;/a&gt;, &lt;a href=&#34;http://commons.wikimedia.org/wiki/File%3AUnivac_9400.jpg&#34;&gt;via Wikimedia Commons&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>Exploring a SPARQL endpoint</title>
      <link>https://www.bobdc.com/blog/exploring-a-sparql-endpoint/</link>
      <pubDate>Sun, 24 Aug 2014 13:03:27 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/exploring-a-sparql-endpoint/</guid>
      
      
      <description><div>In this case, semanticweb.org.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/iswcsparqlpapers.png&#34;&gt;&lt;img id=&#34;id124989&#34; src=&#34;https://www.bobdc.com/img/main/iswcsparqlpapers.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;300&#34; alt=&#34;graph of ISWC SPARQL papers&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In the second edition of my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;, a new chapter titled &amp;ldquo;A SPARQL Cookbook&amp;rdquo; includes a section called &amp;ldquo;Exploring the Data,&amp;rdquo; which features useful queries for looking around a dataset that you know little or nothing about. I was recently wondering about the data available at the SPARQL endpoint &lt;a href=&#34;http://data.semanticweb.org/sparql&#34;&gt;http://data.semanticweb.org/sparql&lt;/a&gt;, so to explore it I put several of the queries from this section of the book to work.&lt;/p&gt;
&lt;p&gt;An important lesson here is how easy SPARQL and RDF make it to explore a dataset that you know nothing about. If you don&amp;rsquo;t know about the properties used, or whether any schema or schemas were used and how much they were used, you can just query for this information. Most hypertext links below will execute the queries they describe using semanticweb.org&amp;rsquo;s &lt;a href=&#34;http://data.semanticweb.org/snorql&#34;&gt;SNORQL interface&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I started with what is generally my favorite query, &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+DISTINCT+%3Fp+WHERE+%7B%0D%0A++%3Fs+%3Fp+%3Fo%0D%0A%7D%0D%0A&#34;&gt;listing which predicates are used in the data&lt;/a&gt;, because that&amp;rsquo;s the quickest way to get a flavor for what kind of data is available. Several of the predicates that got listed immediately told me some interesting things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rdfs:subClassOf&lt;/code&gt; shows me that there&amp;rsquo;s probably some structure worth exploring.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;dcterms:subject&lt;/code&gt; (and &lt;code&gt;dc:subject&lt;/code&gt;) shows that things have probably been tagged with keywords.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;iCal properties such as &lt;code&gt;dtstart&lt;/code&gt; show that events are recorded.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;FOAF properties show that there is probably information about people.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;dcterms:title&lt;/code&gt;, &lt;code&gt;swrc:booktitle&lt;/code&gt;, &lt;code&gt;dc:title&lt;/code&gt;, &lt;code&gt;swrc:title&lt;/code&gt;, and &lt;code&gt;swrc:subtitle&lt;/code&gt; show me that works are covered.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An RDF dataset may or may not have explicit structure, and the use of &lt;code&gt;rdfs:subClassOf&lt;/code&gt; in this data showed me that there was, so my next query asked &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A++%3Fsubclass+rdfs%3AsubClassOf+%3Fsuperclass%0D%0A%7D%0D%0A&#34;&gt;what classes were subclasses of what classes&lt;/a&gt; so that I could get an overview of how much structure the dataset included. The result showed me that the ontology seemed to be mostly in the swc namespace, which turns out to be the semanticweb.org conference ontology. The site does include &lt;a href=&#34;http://data.semanticweb.org/ns/swc/swc_2009-05-09.html&#34;&gt;nice documentation for this ontology&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The use of the FOAF vocabulary showed me that there are probably people described, but if the properties &lt;code&gt;foaf:name&lt;/code&gt;, &lt;code&gt;foaf:lastName&lt;/code&gt;, &lt;code&gt;foaf:familyName&lt;/code&gt;, &lt;code&gt;foaf:family_name&lt;/code&gt;, and &lt;code&gt;foaf:surname&lt;/code&gt; are all in there, which should I try first? A quick &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+%3Fs+%3Fp+WHERE+%7B%0D%0A++%3Fs+%3Fp+%22DuCharme%22%0D%0A%7D%0D%0A&#34;&gt;ego search&lt;/a&gt; showed &lt;code&gt;foaf:family_name&lt;/code&gt; being used. It also showed that the URI used to represent me is &lt;a href=&#34;http://data.semanticweb.org/person/bob-ducharme&#34;&gt;http://data.semanticweb.org/person/bob-ducharme&lt;/a&gt;, and because they&amp;rsquo;ve published this data as linked data, sending a browser to that URL showed that it described me as a member of the &lt;a href=&#34;http://iswc2010.semanticweb.org/&#34;&gt;2010 ISWC&lt;/a&gt; program committee.&lt;/p&gt;
&lt;p&gt;It also showed me to be a proud instance of the &lt;code&gt;foaf:Person&lt;/code&gt; class, so I did a query to find out &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+%28COUNT%28DISTINCT+%3Fperson%29+AS+%3Fcount%29+WHERE+%7B%0D%0A++%3Fperson+a+foaf%3APerson.+%0D%0A%7D%0D%0A&#34;&gt;how many persons there were in all&lt;/a&gt;: 10,982.&lt;/p&gt;
&lt;p&gt;Given the domain of the ontology and the reason that I was listed, I guessed that it was all about ISWC conferences, so I listed the &lt;code&gt;dc:title&lt;/code&gt; values to see what would show up. The query took long enough that I added a LIMIT keyword to create a &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A++%3Fwork+dc%3Atitle+%3Ftitle%0D%0A%7D%0D%0ALIMIT+100%0D%0A&#34;&gt;politer version&lt;/a&gt; of that query. Looking at the &lt;a href=&#34;http://data.semanticweb.org/snorql/?describe=http%3A%2F%2Fdata.semanticweb.org%2Fconference%2Fiswc-aswc%2F2007%2Ftracks%2Fdoctoral-consortium%2Fpapers%2F905&#34;&gt;complete data&lt;/a&gt; for one work showed all kinds of interesting information, including an &lt;code&gt;swrc:year&lt;/code&gt; value to indicate the year of this paper&amp;rsquo;s conference. A &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+DISTINCT+%3Fyear+WHERE+%7B%0D%0A+++%3Fs+swrc%3Ayear+%3Fyear%0D%0A%7D%0D%0A&#34;&gt;list of all year values&lt;/a&gt; showed a range from 2001 right up to 2014, so it&amp;rsquo;s nice to see that they&amp;rsquo;re keeping the data up to date.&lt;/p&gt;
&lt;p&gt;Next, I &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+DISTINCT+%3Fyear+%3Ftitle+%7B%0D%0A+++%3Fpaper+dc%3Atitle+%3Ftitle+.+%0D%0A+++FILTER%28contains%28%3Ftitle%2C%22SPARQL%22%29%29%0D%0A+++%3Fpaper+swrc%3Ayear+%3Fyear+.+%0D%0A%7D%0D%0AORDER+BY+%3Fyear%0D%0A&#34;&gt;listed all papers that mention &amp;ldquo;SPARQL&amp;rdquo; in their title&lt;/a&gt;, with their years. After listing &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+%3Fyear+%28count%28%3Fyear%29+AS+%3FSPARQLPapers%29+%7B%0D%0A+++%3Fpaper+dc%3Atitle+%3Ftitle+.+%0D%0A+++FILTER%28contains%28%3Ftitle%2C%22SPARQL%22%29%29%0D%0A+++%3Fpaper+swrc%3Ayear+%3Fyear+.+%0D%0A%7D%0D%0AGROUP+BY+%3Fyear&#34;&gt;the number of papers with SPARQL in their title each year&lt;/a&gt;, I used &lt;a href=&#34;http://dev.data2000.no/sgvizler/&#34;&gt;sgvizler&lt;/a&gt; (which I &lt;a href=&#34;https://www.bobdc.com/blog/making-charts-out-of-sparql-qu&#34;&gt;described here last September&lt;/a&gt;) to create the chart of these figures shown above.&lt;/p&gt;
&lt;p&gt;The use of &lt;code&gt;dcterms:subject&lt;/code&gt; and &lt;code&gt;dc:subject&lt;/code&gt; was interesting because these add some pretty classic metadata for navigating content. Listing triples that used either, I included LIMIT 100 to be polite to the server in case these properties were used a lot. They are. &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A++%3Fs+dc%3Asubject+%3Fsubject%0D%0A%7D%0D%0ALIMIT+100&#34;&gt;Doing this with &lt;code&gt;dc:subject&lt;/code&gt;&lt;/a&gt; shows subjects such as &amp;ldquo;ontology alignment&amp;rdquo; and &amp;ldquo;controlled natural language&amp;rdquo; assigned to articles. &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A++%3Fs+dcterms%3Asubject+%3Fsubject%0D%0A%7D%0D%0ALIMIT+100&#34;&gt;Doing it with dcterms:subject&lt;/a&gt; showed it used more the way I might use &lt;code&gt;rdf:type&lt;/code&gt;, indicating that something is an instance of a particular class: for example, &lt;code&gt;swc:Chair&lt;/code&gt; and &lt;code&gt;swc:Delegate&lt;/code&gt; each have &lt;code&gt;dcterms:subject&lt;/code&gt; values of &lt;code&gt;http://dbpedia.org/resource/Role&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;My interest in taxonomies (spurred by my work with TopQuadrant&amp;rsquo;s &lt;a href=&#34;http://www.topquadrant.com/products/topbraid-enterprise-vocabulary-net/&#34;&gt;TopBraid EVN&lt;/a&gt;) led me to look harder at the &lt;code&gt;dc:subject&lt;/code&gt; values. They&amp;rsquo;re string values, and not instances of something like &lt;a href=&#34;http://www.w3.org/TR/2009/REC-skos-reference-20090818/#concepts&#34;&gt;&lt;code&gt;skos:Concept&lt;/code&gt;&lt;/a&gt;, so they have no hierarchical relationship or other metadata themselves. I&amp;rsquo;m guessing that this is because key phrases assigned to conference papers are more of a folksonomy, in which people can make up their own key phrases as they wish. Still, either some people must have been aware of other key phrases in use or some were added automatically: while &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+%28count%28DISTINCT+%3Fsubject%29+AS+%3FsubjectCount%29+WHERE+%7B%0D%0A++%3Fs+dc%3Asubject+%3Fsubject%0D%0A%7D%0D%0A%0D%0A&#34;&gt;counting how many different ones there were&lt;/a&gt; came up with 3,594, a &lt;a href=&#34;http://data.semanticweb.org/snorql/?query=SELECT+%28count%28%3Fsubject%29+AS+%3FsubjectCount%29+%3Fsubject+WHERE+%7B%0D%0A++%3Fs+dc%3Asubject+%3Fsubject%0D%0A%7D%0D%0AGROUP+BY+%3Fsubject%0D%0AHAVING+%28%3FsubjectCount+%3E+100%29%0D%0AORDER+BY+DESC%28%3FsubjectCount%29%0D%0A%0D%0A&#34;&gt;query to see which were the most popular&lt;/a&gt; showed that &amp;ldquo;Corpus (creation, annotation, etc.)&amp;rdquo; was far and away the most used, with 506 papers having that subject.&lt;/p&gt;
&lt;p&gt;I could go on. Call me a SPARQL geek, but I really enjoy looking around a data set like this, especially when (as the presence of the papers for ISWC 2014 shows) the data is kept up to date. For people interested in any aspect of semantic web technology, the ability to look around this particular dataset and count up which data falls into which patterns is a great resource.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>When did linking begin?</title>
      <link>https://www.bobdc.com/blog/when-did-linking-begin/</link>
      <pubDate>Sun, 20 Jul 2014 09:40:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/when-did-linking-begin/</guid>
      
      
      <description><div>Pointing somewhere with a dereferenceable address, in the twelfth (or maybe fifth) century.</div><div>&lt;p&gt;&lt;a href=&#34;http://2008.igem.org/Team:Bologna/Team&#34;&gt;&lt;img id=&#34;id138094&#34; src=&#34;https://www.bobdc.com/img/main/Uniantica.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;300&#34; alt=&#34;University of Bologna woodcut&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;As I have &lt;a href=&#34;https://www.bobdc.com/blog/a-nineteenth-century-linking-a&#34;&gt;once before&lt;/a&gt;, I&amp;rsquo;m republishing an entry from an O&amp;rsquo;Reilly blog I had from 2003 to 2005 on topics related to linking. I&amp;rsquo;ve been reading up on early concepts of metadata lately—I particularly recommend Ann Blair&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0300165390/bobducharmeA/&#34;&gt;Too Much to Know: Managing Scholarly Information before the Modern Age&lt;/a&gt;—and have recently found another interesting reference to the &amp;ldquo;Regulae Iuris&amp;rdquo; book mentioned below. When I wrote this, I was more interested in hypertext issues, and if I was going to change anything to update this piece, I would change the word &amp;ldquo;traverse&amp;rdquo; to &amp;ldquo;dereference,&amp;rdquo; but all the points are still meaningful.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Works about linking often claim that it&amp;rsquo;s been around for thousands of years, and then they give examples that are no more than a few centuries old. I can only find one reference to something more than a thousand years old that qualifies as a link: Peter Stein&amp;rsquo;s 1966 work &amp;ldquo;Regulae Iuris: from Juristic Rules to Legal Maxims&amp;rdquo; describes some late fifth-century lecture notes on a commentary by the legal scholar Ulpian. The notes mention that confirmation of a particular point can be found in the Regulae (&amp;ldquo;Rules&amp;rdquo;) of the third-century Roman jurist (and student of Ulpian) Modestinus, &amp;ldquo;seventeen regulae from the end, in the regula beginning &amp;lsquo;Dotis&amp;rsquo;&amp;hellip;&amp;rdquo;. The citation&amp;rsquo;s explicit identification of the point in the cited work where the material could be found makes it the earliest link that I know of.&lt;/p&gt;
&lt;p&gt;Other than Stein&amp;rsquo;s tantalizing example, all of my research points to the 12th century as the beginning of linking. In a 1938 work on the medieval scholars of Bologna, Italy, who studied what remained of ancient Roman law, Hermann Kantorowicz wrote that in &amp;ldquo;the eleventh century&amp;hellip;titles of law books are cited without indicating the passage, books of the Code are numbered, and the name of the law book is considered a sufficient reference.&amp;rdquo; He uses this to build his argument that a particular work described in his essay is from the eleventh century and not the twelfth, as other scholars had argued. Apparently, it was common knowledge in Kantorowicz&amp;rsquo;s field that twelfth century Bolognese scholars would reference a written law using the name of the law book, the rubric heading, and the first few words of the law itself. (Referencing of particular chapters and sections by their first few words was common at the time; the use of chapter, section, and page numbers didn&amp;rsquo;t begin until the following century.)&lt;/p&gt;
&lt;p&gt;Italian legal scholars trying to organize and make sense of the massive amounts of accumulated Roman law contributed a great deal to the mechanics of the cross-referencing that provide many of the earliest examples of linking. The medievalist husband and wife team Richard and Mary Rouse also found some in their research into evolving scholarship techniques in the great universities of England and France (that is, Oxford, Cambridge, and the Sorbonne), and they described Gilbert of Poitiers&amp;rsquo;s innovative twelfth-century mechanism for addressing specific parts of his work on the psalms: he added a selection of Greek letters and other symbols down the side of each page to identify concepts such as the Penitential Psalms or the Passion and Resurrection. If you found the symbol for the Passion and Resurrection in the margin of Psalm 2 with a little 8 next to it (actually, a little &amp;ldquo;viii&amp;rdquo;—they weren&amp;rsquo;t using Arabic numerals quite yet), it would tell you that the next discussion of this concept appeared in Psalm 8. Once you found the same symbol on one of the eighth psalm&amp;rsquo;s pages, you might find a little &amp;ldquo;xii&amp;rdquo; with it to show that the next discussion of the same concept was in Psalm 12. This addressing system made it possible for someone preparing a sermon on the Passion and Resurrection to easily find the relevant material in the Psalms. (In fact, the demand for aids to sermon preparation was one of the main forces behind the development of new research tools, as clergymen were encouraged to go out and compete with the burgeoning heretic movements for the hearts and minds of the people.)&lt;/p&gt;
&lt;p&gt;The use of information addressing systems really got rolling in the thirteenth-century English and French universities, as scholarly monks developed concordances, subject indexes, and page numbers for both Christian religious works and the classic ancient Greek works that they learned about from their contact with the Arabic world. In fact, this is where Arabic numerals begin to appear in Europe; page numbering was one of the early drivers of their adoption.&lt;/p&gt;
&lt;p&gt;Quoting of one work by another was certainly around long before the twelfth century, but if an author doesn&amp;rsquo;t identify an address for his source, his reference can&amp;rsquo;t be traversed, so it&amp;rsquo;s not really a link. Before the twelfth century, religious works had a long tradition of quoting and discussing other works, but in many traditions (for example, Islam, Theravada Buddhism, and Vedic Hinduism) memorization of complete religious works was so common that telling someone where to look within a work was unnecessary. If one Muslim scholar said to another &amp;ldquo;In the words of the Prophet&amp;hellip;&amp;rdquo; he didn&amp;rsquo;t need to name the sura of the Qur&amp;rsquo;an that the quoted words came from; he could assume that his listener already knew. Describing such allusions as &amp;ldquo;links&amp;rdquo; adds heft to claims that linking is thousands of years old, but a link that doesn&amp;rsquo;t provide an address for its destination can&amp;rsquo;t be traversed, and a link that can&amp;rsquo;t be traversed isn&amp;rsquo;t much of a link. Such claims also diminish the tremendous achievements of the 12th-century scholars who developed new techniques to navigate the accumulating amounts of recorded information they were studying.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/legal-publishing">legal publishing</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>Integrating hiphop vocabulary scores with other relevant data—then querying it</title>
      <link>https://www.bobdc.com/blog/integrating-hiphop-vocabulary/</link>
      <pubDate>Tue, 10 Jun 2014 08:41:16 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/integrating-hiphop-vocabulary/</guid>
      
      
      <description><div>With a little JSON + DBpedia integration.</div><div>&lt;p&gt;&lt;a href=&#34;http://rappers.mdaniels.com.s3-website-us-east-1.amazonaws.com/&#34;&gt;&lt;img id=&#34;id117356&#34; src=&#34;https://www.bobdc.com/img/main/rappervocabs.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;rapper vocabularies chart&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;About a month ago, media outlets ranging from &lt;a href=&#34;http://www.npr.org/2014/05/05/309840473/yeezy-or-the-bard-whos-the-best-wordsmith-in-hip-hop&#34;&gt;NPR&lt;/a&gt; to &lt;a href=&#34;http://www.rollingstone.com/music/news/aesop-rock-gza-have-largest-vocabularies-in-hip-hop-says-new-study-20140505&#34;&gt;Rolling Stone&lt;/a&gt; to Britain&amp;rsquo;s &lt;a href=&#34;http://www.dailymail.co.uk/sciencetech/article-2621331/Wu-Tang-Clan-bigger-vocabulary-SHAKESPEARE-Infographic-ranks-rappers-use-English-language.html&#34;&gt;Daily Mail&lt;/a&gt; reported on how a &amp;ldquo;designer, coder, and data scientist&amp;rdquo; named Matt Daniels had analyzed the number of unique words in samples of work by Shakespeare, Herman Melville, and 85 rappers. He then published a &lt;a href=&#34;http://rappers.mdaniels.com.s3-website-us-east-1.amazonaws.com/&#34;&gt;chart and article&lt;/a&gt; about how their scores related to each other. The highest score went to &lt;a href=&#34;http://aesoprock.com/&#34;&gt;Aesop Rock&lt;/a&gt;, who I thought I&amp;rsquo;d heard of but hadn&amp;rsquo;t—I was confusing him with &lt;a href=&#34;http://www.asvpxrocky.com/&#34;&gt;A$AP Rocky&lt;/a&gt;, who was not included in the survey.&lt;/p&gt;
&lt;p&gt;The chart and discussion were interesting, but what I really wanted to see was the complete list of subjects with their scores, and after searching around the web a bit I found that it was under my nose the whole time—the chart is dynamically generated from JSON embedded in his web page. So, I converted that JSON to RDF, then used some SPARQL to retrieve additional data about each rapper from DBpedia such as their record labels, the years their careers began, any subject keywords assigned to them, and the abstracts that summarize their careers. (You&amp;rsquo;ll find more details on the procedure for doing this below; the resulting integrated data is available for you to query &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/rapperDataIntegrated.ttl&#34;&gt;here&lt;/a&gt; as a Turtle file.) Combining this additional data with the vocabulary scores let me do some interesting queries and provided an excellent example of how RDF and SPARQL let you perform ad hoc data integration, combining different data sets into aggregates that let you identify new patterns and other information.&lt;/p&gt;
&lt;p&gt;For example, of all record labels with more than four rappers associated with them, I found that MCA&amp;rsquo;s roster had the highest average vocabulary score at 5472.5, well above the overall average of 4624. Who are these artists? Another simple query showed their names and scores:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GZA              6426
The Roots        5803
Killah Priest    5737
Blackalicious    5480
Big Daddy Kane   4768
Rakim            4621
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(As Daniels pointed out, members of the Wu-Tang Clan tend to have higher scores, so GZA and Killah Priest are a big help to MCA&amp;rsquo;s average score.)&lt;/p&gt;
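&lt;p&gt;A GROUP BY query along these lines can produce such by-label averages. This is only a sketch of the general shape; the property names &lt;code&gt;d:recordLabel&lt;/code&gt; and &lt;code&gt;d:danielsScore&lt;/code&gt; are placeholders for illustration, not necessarily the ones in the actual data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt;

SELECT ?label (AVG(?score) AS ?avgScore)
WHERE {
  ?rapper d:recordLabel ?label ;
          d:danielsScore ?score .
}
GROUP BY ?label
HAVING (COUNT(?rapper) &amp;gt; 4)
ORDER BY DESC(?avgScore)
&lt;/code&gt;&lt;/pre&gt;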
&lt;p&gt;The &lt;code&gt;dcterms:subject&lt;/code&gt; values assigned to the rappers in DBpedia provide the most interesting opportunities for exploration. In fact, it turned out that I didn&amp;rsquo;t even need to pull down the record label values, because the labels each have corresponding &lt;code&gt;dcterms:subject&lt;/code&gt; values. For example, each of the artists listed above has a &lt;code&gt;dcterms:subject&lt;/code&gt; value of &lt;a href=&#34;http://dbpedia.org/resource/Category:MCA_Records_artists&#34;&gt;http://dbpedia.org/resource/Category:MCA_Records_artists&lt;/a&gt; along with their other &lt;code&gt;dcterms:subject&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;Of the subject categories with more than four rappers, here are several interesting ones with high average scores, ranked by number of members in the category:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                                           count   avg score
Members of the Nation of Gods and Earths   13      5117
Underground rappers                        8       5849
People from Brooklyn                       7       5323
MCA Records artists                        7       5401
Rappers from Long Island                   6       5160
Alternative hip hop groups                 5       5286
Wu-Tang Clan members                       5       5611
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I hadn&amp;rsquo;t heard of the Nation of Gods and Earths, also known as the &lt;a href=&#34;https://en.wikipedia.org/wiki/Five-Percent_Nation&#34;&gt;Five-Percent Nation&lt;/a&gt;; again, we have Wu-Tang skewing the numbers up. After I saw the high averages for &amp;ldquo;People from Brooklyn&amp;rdquo; and &amp;ldquo;Rappers from Long Island&amp;rdquo; but no mention of Staten Island, I clicked around and found out that only about half of Wu-Tang came from the borough in which they were based, which I never knew before.&lt;/p&gt;
&lt;p&gt;Here are some interesting low scoring categories. Again, remember that the overall average score is 4624:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                                                     count   avg score
Participants in American reality television series   8       4108
People convicted of drug offenses                    7       3741
American philanthropists                             6       4022
American shooting survivors                          5       4025
American fashion businesspeople                      5       4110
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course, the data collection itself isn&amp;rsquo;t very scientific; what constitutes an &amp;ldquo;alternative&amp;rdquo; rapper? A less successful artist popular with music nerds? &amp;ldquo;People convicted of drug offenses&amp;rdquo; seems like a more cut and dried category, but remember that data from a Wikipedia page is not an authoritative source for such facts.&lt;/p&gt;
&lt;p&gt;As with the list of MCA artists above, a simple query of the data can tell you who falls in each of these categories, so pull down the data from the link above and have fun querying it. If you&amp;rsquo;re interested in how I did the integration, read on.&lt;/p&gt;
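&lt;p&gt;Such a category membership query might look something like the following sketch, with &lt;code&gt;d:danielsName&lt;/code&gt; and &lt;code&gt;d:danielsScore&lt;/code&gt; standing in as placeholders for the actual property names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:       &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt;

SELECT ?name ?score
WHERE {
  ?rapper dcterms:subject
      &amp;lt;http://dbpedia.org/resource/Category:Underground_rappers&amp;gt; ;
          d:danielsName ?name ;
          d:danielsScore ?score .
}
ORDER BY DESC(?score)
&lt;/code&gt;&lt;/pre&gt;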
&lt;h2 id=&#34;id119932&#34;&gt;Integrating the data&lt;/h2&gt;
&lt;p&gt;Upon seeing that Daniels includes a score for Ghostface Killah, it&amp;rsquo;s easy to ask DBpedia for all the &lt;code&gt;{ &amp;lt;http://dbpedia.org/resource/Ghostface_Killah&amp;gt; ?p ?o }&lt;/code&gt; triples. It&amp;rsquo;s not as simple for many other artists, though, for several reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Some rappers use stage names that are &lt;a href=&#34;https://en.wikipedia.org/wiki/Common_(entertainer)&#34;&gt;common&lt;/a&gt; phrases and words, so putting that name at the end of &amp;ldquo;http://dbpedia.org/resource/&amp;rdquo; won&amp;rsquo;t necessarily get you data about them.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tricky spellings and punctuation are pretty common in hip-hop names. For example, Jay Z originally spelled his name &lt;a href=&#34;http://rapgenius.com/Jay-z-dead-presidents-ii-lyrics#note-7214&#34;&gt;with a hyphen&lt;/a&gt; but later &lt;a href=&#34;http://www.newsday.com/entertainment/pop-cult-1.811972/jay-z-drops-hyphen-from-his-name-reports-say-1.5724389&#34;&gt;dropped it&lt;/a&gt;, much as LexisNexis did &lt;a href=&#34;http://newsbreaks.infotoday.com/NewsBreaks/LexisNexis-Undertakes-Modest-Rebranding-Effort-17590.asp&#34;&gt;twelve years earlier&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Daniels sometimes included qualifications in names (&amp;ldquo;GZA (only solo albums)&amp;rdquo;), included or didn&amp;rsquo;t include the word &amp;ldquo;The&amp;rdquo; that was in the DBpedia name (&amp;ldquo;Roots&amp;rdquo; vs. &amp;ldquo;The Roots&amp;rdquo;) or just spelled their names wrong, such as omitting the final &amp;ldquo;t&amp;rdquo; from &amp;ldquo;Missy Elliott.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dropping parenthesized qualifications was easy enough. Even better, DBpedia often has the data necessary to find the page based on a slightly wrong name, and the techniques I described in &lt;a href=&#34;https://www.bobdc.com/blog/normalizing-company-names-with&#34;&gt;Normalizing company names with SPARQL and DBpedia&lt;/a&gt; worked for most of them. This is not a minor point: &lt;em&gt;even when the names aren&amp;rsquo;t quite right, sending the right SPARQL queries to DBpedia can still retrieve valuable data about them.&lt;/em&gt; This has applications in all kinds of domains.&lt;/p&gt;
&lt;p&gt;You can find the scripts and queries mentioned below in &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/rapperrdf.zip&#34;&gt;rapperrdf.zip&lt;/a&gt;. The rapperdata.js file is taken directly from the source of Daniels&amp;rsquo; web page, and loads his data into an array. Another JavaScript file, rappervocab.js, loads rapperdata.js and outputs Turtle RDF of the rappers&amp;rsquo; scores and the Daniels versions of their names. (If you&amp;rsquo;re using the &lt;a href=&#34;http://www.topquadrant.com/technology/topbraid-platform-overview/&#34;&gt;TopBraid platform&lt;/a&gt; and working with JSON, there&amp;rsquo;s an excellent SPARQLMotion module to automate the conversion of any JSON to RDF.) I used Rhino to run the JavaScript, as I described in &lt;a href=&#34;https://www.bobdc.com/blog/javascript-from-the-command-li&#34;&gt;Javascript from the command line&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Another short script called rapperValuesList.js reads the same data and creates the list of names that I inserted as a VALUES list into the retrieveRapperData.rq SPARQL query that actually retrieves the relevant data from DBpedia. (VALUES is a great SPARQL technique for saying &amp;ldquo;I need data about this list of specific things,&amp;rdquo; as I&amp;rsquo;ve written &lt;a href=&#34;https://www.bobdc.com/blog/using-values-to-map-values-in&#34;&gt;here before&lt;/a&gt;.) This SPARQL query uses the SERVICE keyword to send the request off to DBpedia and does a CONSTRUCT to save the triples. It uses the &amp;ldquo;Normalizing company names&amp;rdquo; trick mentioned above to see if the Daniels name with the parenthesized part stripped out is either the &amp;ldquo;official&amp;rdquo; &lt;code&gt;rdfs:label&lt;/code&gt; value for a resource or otherwise attached to something that gets redirected to that.&lt;/p&gt;
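&lt;p&gt;Greatly condensed, the retrieval query&amp;rsquo;s shape looks something like this sketch, with a two-name VALUES list standing in for the full one and &lt;code&gt;dbo:wikiPageRedirects&lt;/code&gt; handling the &amp;ldquo;redirected to that&amp;rdquo; case:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX dbo:  &amp;lt;http://dbpedia.org/ontology/&amp;gt;

CONSTRUCT { ?rapper rdfs:label ?name }
WHERE {
  SERVICE &amp;lt;http://dbpedia.org/sparql&amp;gt; {
    VALUES ?name { &amp;quot;GZA&amp;quot;@en &amp;quot;Rakim&amp;quot;@en }
    { ?rapper rdfs:label ?name }
    UNION
    { ?redirect rdfs:label ?name ;
                dbo:wikiPageRedirects ?rapper . }
  }
}
&lt;/code&gt;&lt;/pre&gt;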
&lt;p&gt;Of the 81 artists in Daniels&amp;rsquo; list, there were 12 whose names couldn&amp;rsquo;t be looked up even with the redirect trick in retrieveRapperData.rq. To account for these, I created extraRapperDanielsNames.ttl with a text editor to link Daniels&amp;rsquo; names for these 12 extra rappers to their DBpedia resource URIs such as &lt;a href=&#34;http://dbpedia.org/resource/Common_(entertainer)&#34;&gt;http://dbpedia.org/resource/Common_(entertainer)&lt;/a&gt;, which I had to look up manually. The retrieveExtraRapperData.rq query then uses that to retrieve the same data about those 12.&lt;/p&gt;
&lt;p&gt;The queries only retrieve the start year, record label, abstract, and subjects about the artists because they all had those values. Retrieving data that only some of them have (such as the birth year, which you don&amp;rsquo;t have for bands like The Roots) would mean using the OPTIONAL keyword, and DBpedia said that my query would take too long when I tried that—I&amp;rsquo;m sure the big VALUES part has a lot to do with that.&lt;/p&gt;
&lt;p&gt;The integrateRapperData.rq query reads the extraRapperDanielsNames.ttl data and the data created by rappervocab.js, retrieveRapperData.rq, and retrieveExtraRapperData.rq, and then creates the final product: rapperDataIntegrated.ttl.&lt;/p&gt;
&lt;h2 id=&#34;id120061&#34;&gt;Querying the data&lt;/h2&gt;
&lt;p&gt;Next was the fun part: executing queries to explore that integrated data. The zip file includes queries to find the following information from rapperDataIntegrated.ttl:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;averageScore.rq&lt;/strong&gt;: the overall average Daniels score&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;averageScoreByLabel.rq&lt;/strong&gt;: the average score by record label for labels with more than four artists associated with them&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;subjectReport.rq&lt;/strong&gt;: the average score by subject associated with the rappers for all subjects (like &amp;ldquo;Underground rappers&amp;rdquo; and &amp;ldquo;American philanthropists&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCAArtists.rq&lt;/strong&gt;: the MCA artists&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JamaicanDescent.rq&lt;/strong&gt;: the name, Daniels score, and abstract of &amp;ldquo;American rappers of Jamaican descent&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last one can provide a template for the creation of other queries about who falls into which subject categories.&lt;/p&gt;
&lt;p&gt;Linking this data with other data about the artists from some of the blue parts of the &lt;a href=&#34;http://lod-cloud.net/versions/2011-09-19/lod-cloud_colored.html&#34;&gt;Linked Data Cloud&lt;/a&gt; such as DBTune or the BBC would provide some even more interesting possibilities. As one taste, &lt;a href=&#34;http://linkedbrainz.org/sparql?default-graph-uri=&amp;amp;query=SELECT+*+WHERE+%7B%0D%0A%23++%3Fs+foaf%3Aname+%22Missy+Elliott%22+.+%0D%0A%3Chttp%3A%2F%2Fmusicbrainz.org%2Fartist%2Fa0b8cb9e-7532-45fe-a74c-30e7c4009a39%23_%3E+%3Fp+%3Fo+.+%0D%0A%7D%0D%0A&amp;amp;format=text%2Fhtml&amp;amp;timeout=0&amp;amp;debug=on&#34;&gt;this link&lt;/a&gt; has a SPARQL query that retrieves all the MusicBrainz data about Missy Elliott.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>&#34;Experience in SPARQL a plus&#34;</title>
      <link>https://www.bobdc.com/blog/experience-in-sparql-a-plus/</link>
      <pubDate>Fri, 09 May 2014 09:06:17 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/experience-in-sparql-a-plus/</guid>
      
      
      <description><div>The long tail story of SPARQL success: appearances in job postings.</div><div>&lt;img id=&#34;id116278&#34; src=&#34;https://www.bobdc.com/img/main/sparqlcompanies.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;logos of companies hiring SPARQL talent&#34;/&gt;
&lt;p&gt;When people talk about semantic web or linked data success stories, they usually talk about the big, well-known projects such as those at BestBuy, the BBC, NASA, life sciences companies, the whole vocabulary and taxonomy management industry, and the growing use of DBpedia by a range of companies. I&amp;rsquo;ve always found that a company&amp;rsquo;s job postings provide interesting clues about their potential technology directions, and the increasing references to SPARQL in these postings is another positive trend. These fly further under most radars than the projects mentioned earlier, but their volume adds up to a real long tail, in the &lt;a href=&#34;https://en.wikipedia.org/wiki/Long_tail&#34;&gt;Chris Anderson&lt;/a&gt; sense of the word.&lt;/p&gt;
&lt;p&gt;For a while now, I&amp;rsquo;ve had a &lt;a href=&#34;http://www.indeed.com/jobs?q=sparql=&#34;&gt;saved search for appearances of &amp;ldquo;SPARQL&amp;rdquo;&lt;/a&gt; on the job posting site indeed.com so that I could occasionally mention companies looking for SPARQL experience in the &lt;a href=&#34;https://twitter.com/learningsparql&#34;&gt;Twitter feed&lt;/a&gt; for my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;. I&amp;rsquo;ve mostly limited it to high-profile, brand name companies because there are really too many to mention them all; last Saturday&amp;rsquo;s email from indeed.com listed six positions ranging from Xerox (who has two positions open) to Axius Technologies in Hoboken, who I&amp;rsquo;ve never heard of.&lt;/p&gt;
&lt;p&gt;I thought it would be fun to review the names of the companies I&amp;rsquo;ve tweeted about and make a list of the most well-known ones, and as you can see below, it&amp;rsquo;s an impressive list. Sometimes these companies bury the mention of SPARQL deep down in their descriptions of duties for the Java developer or &amp;ldquo;solution architect&amp;rdquo; that they seek, but others, like Xerox, say right in the job title that they want a &lt;a href=&#34;http://www.indeed.com/viewjob?jk=b35171606a45dbd3&#34;&gt;Database-SPARQL Developer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What does this mean? It means that the use of RDF and SPARQL is really getting traction at the grass roots level, as large and small companies move beyond side projects for investigating the technology to projects that require RDF and SPARQL enough to influence their hiring budgets. That&amp;rsquo;s some nice progress.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Accenture                            Morgenthaler Life Science
Amazon                               NBC Entertainment
AstraZeneca                          Nokia
Bank of America                      Northrop Grumman
Boeing                               Orbis
Boston Public Library                Pearson
Children&amp;rsquo;s Hospital of Los Angeles   Pitney Bowes
Columbia University&amp;rsquo;s Lamont-        Reed Elsevier
Doherty Earth Observatory
Comcast                              SAIC
Craig Venter Institute               SAP
Deloitte                             Sears
Elsevier                             Siemens
Eli Lilly                            Socrata
Goldman Sachs                        Sony
Google                               Stanford University
Harvard Medical School               Thomson Reuters
IBM Global Business Services         Turner Broadcasting
JP Morgan Chase                      Vodafone
Lockheed Martin                      Xerox
Los Alamos National Laboratory       Yahoo
Mayo Clinic                          Yale University library
Microsoft                            Zoominfo
&lt;/code&gt;&lt;/pre&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>RDF lists and SPARQL</title>
      <link>https://www.bobdc.com/blog/rdf-lists-and-sparql/</link>
      <pubDate>Mon, 21 Apr 2014 08:35:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdf-lists-and-sparql/</guid>
      
      
<description><div>Not great, but not terrible, and a bit better with SPARQL 1.1</div><div>&lt;blockquote id=&#34;id117366&#34; class=&#34;pullquote&#34;&gt;I have yet to ever say to myself &#34;what I need here is an RDF collection, which I will implement with lots of &lt;code&gt;rdf:first&lt;/code&gt; and &lt;code&gt;rdf:rest&lt;/code&gt; triples!&#34;&lt;/blockquote&gt;
&lt;p&gt;The fact that RDF expresses everything using the same simple three-part data structure has usually been a great strength, but in the case of ordered lists (or &lt;a href=&#34;http://www.w3.org/TR/rdf11-mt/#rdf-collections&#34;&gt;RDF collections&lt;/a&gt;) it&amp;rsquo;s pretty messy. The &lt;a href=&#34;http://www.w3.org/TR/rdf11-mt/#rdf-collections&#34;&gt;specification&lt;/a&gt; defines a LISP-like way of using triples to identify, for each position in a list, what the first member is and what list holds the rest of the members after that. When saying &amp;ldquo;and here are the rest&amp;rdquo; for every member of the list, you don&amp;rsquo;t want to have to come up with a unique URI for each one, so datasets typically use blank nodes for these placeholders, and you can end up with a lot of them.&lt;/p&gt;
&lt;p&gt;Putting all this together, you could represent the list (&amp;ldquo;one&amp;rdquo;, &amp;ldquo;two&amp;rdquo;, &amp;ldquo;three&amp;rdquo;, &amp;ldquo;four&amp;rdquo;, &amp;ldquo;five&amp;rdquo;) with these triples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; . 
@prefix d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .


d:myList d:contents _:b1 .


_:b1 rdf:first &amp;quot;one&amp;quot; .
_:b1 rdf:rest _:b2 .


_:b2 rdf:first &amp;quot;two&amp;quot; .
_:b2 rdf:rest _:b3 .


_:b3 rdf:first &amp;quot;three&amp;quot; .
_:b3 rdf:rest _:b4 .


_:b4 rdf:first &amp;quot;four&amp;quot; .
_:b4 rdf:rest _:b5 .


_:b5 rdf:first &amp;quot;five&amp;quot; .
_:b5 rdf:rest rdf:nil .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Turtle and SPARQL include syntax that lets you write out a more human-readable version without explicit blank nodes and with the list represented as, well, a list. The following is the equivalent of the example above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .


d:myList d:contents (&amp;quot;one&amp;quot; &amp;quot;two&amp;quot; &amp;quot;three&amp;quot; &amp;quot;four&amp;quot; &amp;quot;five&amp;quot;) 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To do much with these lists, though, especially in SPARQL, you still have to think in terms of &lt;code&gt;rdf:first&lt;/code&gt; and &lt;code&gt;rdf:rest&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To be honest, I&amp;rsquo;ve never found much need to do anything with RDF lists, but after seeing recent references to them—or, in Manu Sporny&amp;rsquo;s case, &lt;a href=&#34;http://manu.sporny.org/2014/json-ld-origins-2/&#34;&gt;the lack of them&lt;/a&gt;—I thought I&amp;rsquo;d play around a bit to see how difficult it was in SPARQL to do four basic list tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Retrieve the Nth member of a list&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Retrieve all the members of a list&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Insert a new member at a specified position&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Delete a member from a specified position&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Update after posting my original entry: Andy Seaborne pointed me to his 2011 blog entry &lt;a href=&#34;http://seaborne.blogspot.co.uk/2011/03/updating-rdf-lists-with-sparql.html&#34;&gt;Updating RDF Lists with SPARQL&lt;/a&gt;, which includes SPARQL queries covering several additional cases. Also, more from Joshua Taylor at &lt;a href=&#34;http://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql/17530689#17530689&#34;&gt;stackoverflow&lt;/a&gt;, thanks to Paula Gearon.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I found that SPARQL 1.1&amp;rsquo;s property paths made it easier to concisely address a specific list member without lots of triple patterns, and of course without SPARQL 1.1 update there would be no insertion or deletion of list members. (I&amp;rsquo;m happy to take suggestions on improving the queries.)&lt;/p&gt;
&lt;h2 id=&#34;id117303&#34;&gt;Retrieving the Nth member&lt;/h2&gt;
&lt;p&gt;The following query retrieves the third member from the list defined above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;


SELECT ?item
WHERE {
  d:myList d:contents/rdf:rest{2}/rdf:first ?item
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you think of it as zero-based counting, it&amp;rsquo;s simple: you just plug the number of the member you&amp;rsquo;re interested in into the curly braces. Using ARQ, the query returns this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-----------
| item    |
===========
| &amp;quot;three&amp;quot; |
-----------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But&amp;hellip; after writing and testing that, I remembered that the ability to specify a specific number of repeated property path steps by putting a number between curly braces was dropped in the &lt;a href=&#34;http://www.w3.org/TR/2012/WD-sparql11-query-20120724&#34;&gt;24 July 2012&lt;/a&gt; Working Draft of the SPARQL 1.1 Query spec, so it&amp;rsquo;s not proper SPARQL. It works just the same when you replace &lt;code&gt;rdf:rest{2}&lt;/code&gt; with &lt;code&gt;rdf:rest/rdf:rest&lt;/code&gt;, which is a minor change, but specifying every step like that will be a pain if you want to retrieve the twenty-third member of the list.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve replaced the &lt;code&gt;rdf:rest{2}&lt;/code&gt; that was in the original draft of the insert and delete queries below with &lt;code&gt;rdf:rest/rdf:rest&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;id119694&#34;&gt;Retrieving all the members&lt;/h2&gt;
&lt;p&gt;The following retrieves all of the list items. As an added bonus, ARQ displayed them in order, but that was just luck, and not something to count on, because stored triples have no order.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;


SELECT ?item
WHERE {
  d:myList d:contents/rdf:rest*/rdf:first ?item
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course, changing the first line to &lt;code&gt;SELECT (count(?item) AS ?items)&lt;/code&gt; would give you the number of members in the list, which is also handy.&lt;/p&gt;
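&lt;p&gt;In full, that counting version looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;

SELECT (COUNT(?item) AS ?items)
WHERE {
  d:myList d:contents/rdf:rest*/rdf:first ?item
}
&lt;/code&gt;&lt;/pre&gt;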
&lt;h2 id=&#34;id119712&#34;&gt;Inserting a new member at a specific position&lt;/h2&gt;
&lt;p&gt;The main work is breaking the link where the insertion will take place and then linking the new member in.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
DELETE {
  ?insertionPoint rdf:rest ?rest . 
}
INSERT {
  _:b1 rdf:first &amp;quot;threePointFive&amp;quot; ; rdf:rest ?rest . 
  ?insertionPoint rdf:rest _:b1 . 
}
WHERE {
  d:myList d:contents/rdf:rest/rdf:rest/rdf:first ?item .
  ?insertionPoint rdf:first ?item ; rdf:rest ?rest . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is how the dataset looks after using TopBraid Composer to run this query on the data above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
d:myList
  d:contents (
      &amp;quot;one&amp;quot;
      &amp;quot;two&amp;quot;
      &amp;quot;three&amp;quot;
      &amp;quot;threePointFive&amp;quot;
      &amp;quot;four&amp;quot;
      &amp;quot;five&amp;quot;
    ) ;
.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;id119730&#34;&gt;Deleting a member from a specified position&lt;/h2&gt;
&lt;p&gt;The following deletes the third item from the list. As with the previous query, the main work is breaking the link and creating a new one across the gap where the deleted item was:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:   &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
DELETE {
  ?previousMember rdf:rest ?deletionPoint .
  ?deletionPoint rdf:rest ?rest .
  ?s ?p ?item .
  ?item ?p2 ?o .
}
INSERT {
  ?previousMember rdf:rest ?rest .
}
WHERE {
  d:myList d:contents/rdf:rest/rdf:rest/rdf:first ?item .
  ?deletionPoint rdf:first ?item ; rdf:rest ?rest .
  ?previousMember rdf:rest ?deletionPoint .
  ?s ?p ?item .
  OPTIONAL { ?item ?p2 ?o . }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this update request after running the insertion one before it results in a dataset that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
d:myList
  d:contents (
      &amp;quot;one&amp;quot;
      &amp;quot;two&amp;quot;
      &amp;quot;threePointFive&amp;quot;
      &amp;quot;four&amp;quot;
      &amp;quot;five&amp;quot;
    ) ;
.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So we know it worked.&lt;/p&gt;
&lt;h2 id=&#34;id119818&#34;&gt;Taking it further&lt;/h2&gt;
&lt;p&gt;I won&amp;rsquo;t remember the syntax of these queries without reviewing them as written here, but I know that I can copy them from here and paste them elsewhere with minor modifications to perform these basic list manipulation goals.&lt;/p&gt;
&lt;p&gt;On the other hand, in the work I&amp;rsquo;ve done with RDF and SPARQL, I have yet to say to myself &amp;ldquo;what I need here is an RDF collection, which I will implement with lots of &lt;code&gt;rdf:first&lt;/code&gt; and &lt;code&gt;rdf:rest&lt;/code&gt; triples!&amp;rdquo; So, the exercise above seems a bit academic. (In fact, my original goals above look like a homework assignment; for extra credit, modify the queries so that the targets can be specified based on their values and not their positions.) If I need to order some instances in RDF, I&amp;rsquo;m more likely to give them some property I can use to sort them. I&amp;rsquo;d love to hear pointers from anyone about places where using &lt;code&gt;rdf:first&lt;/code&gt; and &lt;code&gt;rdf:rest&lt;/code&gt; addressed a data modeling issue better than any alternative would.&lt;/p&gt;
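&lt;p&gt;For example, instead of a collection, I might just model the data like the following (with made-up property names) and then ORDER BY the position values in queries:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .

d:member1 d:itemValue &amp;quot;one&amp;quot; ;   d:position 1 .
d:member2 d:itemValue &amp;quot;two&amp;quot; ;   d:position 2 .
d:member3 d:itemValue &amp;quot;three&amp;quot; ; d:position 3 .
&lt;/code&gt;&lt;/pre&gt;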
&lt;p&gt;Still, the queries above show that maybe RDF collections are not as bad as I originally thought, and that SPARQL 1.1 property paths can make certain tasks more straightforward to achieve.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Easier querying of strings with RDF 1.1</title>
      <link>https://www.bobdc.com/blog/easier-querying-of-strings-wit/</link>
      <pubDate>Sat, 08 Mar 2014 10:09:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/easier-querying-of-strings-wit/</guid>
      
      
      <description><div>In which a spoonful of syntactic sugar makes the string querying go down a bit easier.</div><div>&lt;blockquote id=&#34;id115168&#34; class=&#34;pullquote&#34;&gt;If it looks and walks and talks like a string...&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href=&#34;http://www.w3.org/blog/news/archives/3701&#34;&gt;recent publication of RDF 1.1 specifications&lt;/a&gt; fifteen years and three days after &lt;a href=&#34;http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/&#34;&gt;RDF 1.0&lt;/a&gt; became a Recommendation has not added many new features to RDF, although it has made a few new syntaxes official, and there were no new documents about the SPARQL query language. The new Recommendations did clean up a few odds and ends, and one bit of cleanup officially removes an annoying impediment to straightforward querying of strings.&lt;/p&gt;
&lt;p&gt;Near the beginning of chapter 5 of my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;, I wrote&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Discussions are currently underway at the W3C about potentially doing away with the concept of the plain literal and just making &lt;code&gt;xsd:string&lt;/code&gt; the default datatype, so that &lt;code&gt;&amp;quot;this&amp;quot;&lt;/code&gt; and &lt;code&gt;&amp;quot;this&amp;quot;^^xsd:string&lt;/code&gt; would mean the same thing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When dealing with the difference between simple literals and those that were explicitly cast as &lt;code&gt;xsd:string&lt;/code&gt; values, casting in one direction or the other with the &lt;code&gt;str()&lt;/code&gt; and &lt;code&gt;xsd:string()&lt;/code&gt; functions gave us a workaround, but once all the query engines catch up with RDF 1.1 we won&amp;rsquo;t have to work around this anymore.&lt;/p&gt;
&lt;p&gt;The 2011 document &lt;a href=&#34;http://www.w3.org/2011/rdf-wg/wiki/StringLiterals/LanguageTaggedLiteralDatatypeProposal&#34;&gt;StringLiterals/LanguageTaggedStringDatatypeProposal&lt;/a&gt; describes the problem in more detail, but here&amp;rsquo;s a short example. Imagine that you want to query for the author of one of the works listed in these triples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix dc:  &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; .
@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix ls:  &amp;lt;http://learningsparql.com/id#&amp;gt; . 


ls:i1001 dc:creator &amp;quot;Jane Austen&amp;quot; ;
         dc:title &amp;quot;Persuasion&amp;quot; .
ls:i1002 dc:creator &amp;quot;Nathaniel Hawthorne&amp;quot; ;
         dc:title &amp;quot;The Scarlet Letter&amp;quot;^^xsd:string .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, let&amp;rsquo;s say you want to know who wrote &amp;ldquo;The Scarlet Letter&amp;rdquo; and you enter this query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc:  &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; 
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; 


SELECT ?author WHERE { 
  ?work  dc:title &amp;quot;The Scarlet Letter&amp;quot; ; 
         dc:creator ?author . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using a SPARQL engine that was strictly compliant with RDF 1.0, this query wouldn&amp;rsquo;t find anything, because the &lt;code&gt;dc:title&lt;/code&gt; value of &lt;code&gt;ls:i1002&lt;/code&gt; is the typed literal &lt;code&gt;&amp;quot;The Scarlet Letter&amp;quot;^^xsd:string&lt;/code&gt; and not the untyped string that the query was looking for. If a similar query asked for the author of &lt;code&gt;&amp;quot;Persuasion&amp;quot;^^xsd:string&lt;/code&gt;, it wouldn&amp;rsquo;t find anything, because the query is looking for a string that has been explicitly typed as an &lt;code&gt;xsd:string&lt;/code&gt;, and in the data the value is an untyped literal.&lt;/p&gt;
&lt;p&gt;This, in fact, is what happens with release 2.6.4 of Sesame, the version currently on my hard disk. Sesame is now up to &lt;a href=&#34;http://www.openrdf.org/news.jsp#sesame-2.7.10&#34;&gt;2.7.10&lt;/a&gt;, and, seeing the change coming, may have accounted for it by now. ARQ and the TopBraid platform stopped distinguishing between simple literals and typed string literals several years ago.&lt;/p&gt;
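&lt;p&gt;For engines that still make the distinction, the &lt;code&gt;str()&lt;/code&gt; workaround mentioned above lets one query match both the plain and the typed form:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; 

SELECT ?author WHERE { 
  ?work dc:title ?title ; 
        dc:creator ?author . 
  FILTER (str(?title) = &amp;quot;The Scarlet Letter&amp;quot;)
}
&lt;/code&gt;&lt;/pre&gt;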
&lt;p&gt;Treating the simple literal and typed string versions of a string as the same thing is now officially what&amp;rsquo;s supposed to happen. According to &lt;a href=&#34;http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-Graph-Literal&#34;&gt;section 3.3&lt;/a&gt; of the new RDF 1.1 Concepts and Abstract Syntax Recommendation, &amp;ldquo;Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI &lt;code&gt;http://www.w3.org/2001/XMLSchema#string&lt;/code&gt;&amp;rdquo;. In other words, if it looks and walks and talks like a string, treat it like a string.&lt;/p&gt;
&lt;p&gt;With this update, there&amp;rsquo;s nothing to hold back other SPARQL engines from treating simple literals and typed string literals the same way. This is going to make the development of a lot of SPARQL queries a little bit simpler.&lt;/p&gt;
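&lt;p&gt;The RDF 1.1 rule boils down to a small normalization step at the level of individual literals. Here is an illustrative Python sketch of the idea; it is not code from Sesame, ARQ, or any other engine mentioned above, and the function name is my own:&lt;/p&gt;

```python
# Illustrative sketch of the RDF 1.1 rule: a simple literal is syntactic
# sugar for a literal typed as xsd:string, so an engine can normalize every
# literal to a (lexical form, datatype IRI) pair before comparing them.
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

def normalize_literal(lexical, datatype=None):
    """Return a comparable (lexical, datatype) pair, defaulting to xsd:string."""
    return (lexical, datatype if datatype is not None else XSD_STRING)

# The untyped literal in a query now matches the typed literal in the data:
assert normalize_literal("The Scarlet Letter") == \
       normalize_literal("The Scarlet Letter", XSD_STRING)
```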
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Querying my own MP3, image, and other file metadata with SPARQL</title>
      <link>https://www.bobdc.com/blog/querying-my-own-mp3-image-and/</link>
      <pubDate>Sun, 09 Feb 2014 11:31:20 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-my-own-mp3-image-and/</guid>
      
      
      <description><div>And a standard part of Ubuntu.</div><div>&lt;p&gt;Ubuntu has a utility called &lt;a href=&#34;https://wiki.ubuntu.com/Tracker&#34;&gt;Tracker&lt;/a&gt; that makes it easy to search your hard disk, a bit like the old &lt;a href=&#34;https://en.wikipedia.org/wiki/Google_Desktop&#34;&gt;Google Desktop&lt;/a&gt; with a few extra features. One extra feature ranks among the coolest SPARQL applications I&amp;rsquo;ve ever seen: the ability to execute SPARQL queries against data extracted from files on your hard disk.&lt;/p&gt;
&lt;img id=&#34;id115177&#34; src=&#34;https://www.bobdc.com/img/main/IMG_5257.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Anarchy paper lantern&#34; width=&#34;220&#34;/&gt;
&lt;p&gt;To install it, I did a &lt;code&gt;sudo apt-get install&lt;/code&gt; of &lt;code&gt;tracker-gui&lt;/code&gt; to get the base parts of tracker and then did a similar installation of &lt;code&gt;tracker-utils&lt;/code&gt; to get the SPARQL query utility. Next, I added the Ubuntu applications &amp;ldquo;Desktop search&amp;rdquo; and &amp;ldquo;Search and indexing&amp;rdquo; and used the latter to index 94 GB of MP3s and some image files. The indexing took a few hours. (&lt;code&gt;tracker-control -S&lt;/code&gt; was a handy command for checking on the indexing progress.) The worldofgnome.org page &lt;a href=&#34;http://worldofgnome.org/indexing-preferences-in-gnome-3-8/&#34;&gt;Indexing preferences in GNOME 3.8&lt;/a&gt; was helpful for understanding the indexing options.&lt;/p&gt;
&lt;p&gt;Once the file metadata is indexed, the &lt;code&gt;tracker-sparql&lt;/code&gt; command-line utility lets you query it. For example, the following runs the query stored in bea.spq against the metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;tracker-sparql -f bea.spq
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The &lt;a href=&#34;http://manpages.ubuntu.com/manpages/natty/man1/tracker-sparql.1.html&#34;&gt;tracker-sparql help&lt;/a&gt; said that I was also supposed to include &lt;code&gt;-q&lt;/code&gt; to show that it was a SPARQL query, but it seemed to work fine without this command line switch.) The following shows bea.spq, a query for artist names that begin with &amp;ldquo;Bea&amp;rdquo;, allowing for an optional &amp;ldquo;The &amp;rdquo; before that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX nmm: &amp;lt;http://www.tracker-project.org/temp/nmm#&amp;gt;
SELECT DISTINCT ?artistName WHERE {
  ?artist a nmm:Artist .
  ?artist nmm:artistName ?artistName .
  FILTER(regex(?artistName,&amp;quot;^(The )?Bea&amp;quot;))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is the output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Results:
  Beachwood Sparks
  Beastie Boys/Beck/Dust Brothers
  Beastie Boys/Dust Brothers
  Beatles
  The Beach Boys
  The Beastie Boys
  The Beatles
  The Beatniks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One frustrating thing about tracker-sparql is that it rejects certain queries because, as it tells us, &amp;ldquo;Unrestricted predicate variables not supported.&amp;rdquo; In my experience, this meant that a triple pattern couldn&amp;rsquo;t have a variable in the predicate position if it also had one in the subject position. For example, I know that the Dust Brothers have worked with the Beastie Boys and with Beck separately, but I&amp;rsquo;d never heard of all of them working together, and I couldn&amp;rsquo;t enter a query to find out which work was created by the artist with a nmm:artistName value of &amp;ldquo;Beastie Boys/Beck/Dust Brothers&amp;rdquo;. I did try dc:contributor, nmm:performer, and some other properties used to connect an artist to a work, but had no luck. (My guess: it was some sort of remix that combined a few Dust Brothers works.)&lt;/p&gt;
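&lt;p&gt;The following sketch shows the general shape of the kind of query that gets rejected; the variable &lt;code&gt;?p&lt;/code&gt; in the predicate position, combined with the variable subject &lt;code&gt;?work&lt;/code&gt;, is the problem (this illustrates the pattern, not the exact query I ran):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX nmm: &amp;lt;http://www.tracker-project.org/temp/nmm#&amp;gt;
SELECT ?work WHERE {
  ?artist nmm:artistName &amp;quot;Beastie Boys/Beck/Dust Brothers&amp;quot; .
  ?work ?p ?artist .
}
&lt;/code&gt;&lt;/pre&gt;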
&lt;p&gt;This was a fun query, asking what values of &amp;ldquo;genre&amp;rdquo; were stored in my MP3s:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?genre WHERE
{
  ?work nfo:genre ?genre
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The results:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Results:
  Jazz
  Rock
  Classical
  New Wave
  Avantgarde
  Pop
  Salsa
  Blues
  Soundtrack
  RETRO SWING
  Swing
  Country
  Other
  Sound Clip
  jazz
  Latin
  Lo-Fi
  Rock &amp;amp; Roll
  Hip-Hop
  Techno-Industrial
  Euro-Techno
  Booty Bass
  Alternative
  Reggae
  Indian
  Podcast
  Electronic
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This can lead to a real rabbit hole of additional queries as I wonder &amp;ldquo;what do I have in &lt;em&gt;that&lt;/em&gt; category?&amp;rdquo; but I&amp;rsquo;ll spare you that part.&lt;/p&gt;
&lt;p&gt;tracker-sparql has a few command line options that are shortcuts to common queries for exploring a dataset. For example, &lt;code&gt;-c&lt;/code&gt; lists classes, and gave me a list of 230. A query for distinct &lt;code&gt;rdf:type&lt;/code&gt; values showed only 67 being used in my file metadata, so I assume that &lt;code&gt;-c&lt;/code&gt; refers to classes that are declared in an internal schema. The tracker-stats utility shows how many instances each class has. (The &amp;ldquo;SEE ALSO&amp;rdquo; section of the &lt;a href=&#34;http://manpages.ubuntu.com/manpages/oneiric/man1/tracker-store.1.html&#34;&gt;help page for tracker-store&lt;/a&gt; had the best list I could find of the various tracker utilities.)&lt;/p&gt;
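&lt;p&gt;A query along these lines is enough for that &lt;code&gt;rdf:type&lt;/code&gt; comparison:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type WHERE {
  ?resource a ?type .
}
&lt;/code&gt;&lt;/pre&gt;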
&lt;p&gt;The tracker indexer also pulls fairly typical metadata out of image files. Unfortunately, it doesn&amp;rsquo;t pull latitude and longitude data out when present, but it does let you add and query tag values in images. I played with this using the image file above, which shows a paper lantern with the anarchy symbol that I saw in San Francisco&amp;rsquo;s Chinatown during the 2010 Semantic Technologies conference. Using the &lt;code&gt;tracker-tag&lt;/code&gt; utility, I added a tag to the image like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;tracker-tag --add=anarchy /my/path/semtech/2010/pics/IMG_5257.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This added the following triples to the dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix nao:  &amp;lt;http://www.semanticdesktop.org/ontologies/2007/08/15/nao#&amp;gt; . 
@prefix tr:   &amp;lt;http://www.tracker-project.org/ontologies/tracker#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; . 
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; . 
@prefix nao:  &amp;lt;http://www.semanticdesktop.org/ontologies/2007/08/15/nao#&amp;gt; . 


&amp;lt;urn:uuid:5aa32bbc-7f08-da08-3bbd-8ae6650411fb&amp;gt; nao:hasTag  
  &amp;lt;urn:uuid:a49c693c-d439-529b-8e27-296d589e905c&amp;gt; . 


&amp;lt;urn:uuid:a49c693c-d439-529b-8e27-296d589e905c&amp;gt;
  tr:added &amp;quot;2014-01-18T22:31:44Z&amp;quot; ;
  tr:modified 7170 ;
  rdf:type rdfs:Resource ;
  rdf:type  nao:Tag ;
  nao:prefLabel &amp;quot;anarchy&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first triple says that the image resource has a particular tag, and the remaining triples tell us about that tag. It was nice to see that the tag is a resource and not just a string, so it can be renamed without losing its relationships with tagged resources. It also means that the tag itself can have additional metadata assigned to it such as &lt;code&gt;skos:broader&lt;/code&gt; values to create a taxonomy hierarchy. And of course, there are all kinds of possibilities for SPARQL queries about what is tagged with what. (It would be fun to pull a set of nao:Tag resource triples into &lt;a href=&#34;http://www.topquadrant.com/products/topbraid-enterprise-vocabulary-net/&#34;&gt;TopBraid EVN&lt;/a&gt; and really turn them into a proper SKOS taxonomy.)&lt;/p&gt;
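&lt;p&gt;For example, a query along these lines (a sketch built from the NAO properties shown above, not a query taken from the Tracker documentation) would list everything tagged &amp;ldquo;anarchy&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX nao: &amp;lt;http://www.semanticdesktop.org/ontologies/2007/08/15/nao#&amp;gt;
SELECT ?resource WHERE {
  ?resource nao:hasTag ?tag .
  ?tag nao:prefLabel &amp;quot;anarchy&amp;quot; .
}
&lt;/code&gt;&lt;/pre&gt;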
&lt;p&gt;A few random closing notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I tried a few SPARQL 1.1 features like BIND and contains() with no luck, but the tracker-sparql help page does show that the count() function and SPARQL UPDATE are supported. I tried adding a triple with an UPDATE request, but I didn&amp;rsquo;t get it to work. If it was possible to add arbitrary triples about existing resources, we could store additional data about them such as the &lt;code&gt;skos:broader&lt;/code&gt; values mentioned above and triples about the latitude and longitude where the picture was taken, which &lt;a href=&#34;http://www.sno.phy.queensu.ca/~phil/exiftool/&#34;&gt;ExifTool&lt;/a&gt; can extract from image files. Apache Tika, which I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/pull-rdf-metadata-out-of-jpegs&#34;&gt;written about here before&lt;/a&gt;, would also be great to throw into the mix.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s interesting that the resources were identified with URNs instead of URLs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Adrian Perez blog post &lt;a href=&#34;http://perezdecastro.org/2012/some-tracker-sparql-bits.html&#34;&gt;Some Tracker + SPARQL bits&lt;/a&gt; has some good tips, and it points to two blog entries by Adrien Bustany that describe some nice predicate functions built into Tracker&amp;rsquo;s SPARQL engine.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It was nice to see the &lt;a href=&#34;http://www.semanticdesktop.org/ontologies/&#34;&gt;Nepomuk&lt;/a&gt; ontology used here. Talk about a semantic desktop! (Since writing the first draft of this, I have learned that the next generation of Nepomuk is &lt;a href=&#34;http://community.kde.org/Baloo#Why_change_Nepomuk.3F&#34;&gt;not using RDF&lt;/a&gt;, which I was sorry to hear.) It would be nice to see a schema for the Tracker-specific classes and properties; the &lt;a href=&#34;http://www.tracker-project.org/ontologies&#34;&gt;http://www.tracker-project.org/ontologies&lt;/a&gt; base URI used for some of the namespaces currently doesn&amp;rsquo;t go anywhere. (If someone can point me to such a schema, I&amp;rsquo;d be happy to update this.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The metadata that the indexer pulled from a PDF on my hard disk included the complete text of the PDF stored using the &lt;code&gt;nie:plainTextContent&lt;/code&gt; property. That could be very useful for searches and text extraction.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If I limited myself to SPARQL queries about my own MP3s, this dataset could keep me busy for hours. Assigning, querying, and curating tags would also be a lot of fun to play with; while I assigned one to a JPEG file above, they can be assigned to any resource. For example, imagine running some text analytics on &lt;code&gt;nie:plainTextContent&lt;/code&gt; values to come up with tag values to assign to that PDF. And, if music files have an artist property and PDFs have a plainTextContent property, there are probably plenty of other properties that are specific to certain file types and reveal interesting things about them, especially when queried with SPARQL to find patterns among the values of the files in your own collection.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Storing and querying RDF in Neo4j</title>
      <link>https://www.bobdc.com/blog/storing-and-querying-rdf-in-ne/</link>
      <pubDate>Tue, 07 Jan 2014 08:56:28 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/storing-and-querying-rdf-in-ne/</guid>
      
      
<description><div>Hands-on experience with another NoSQL database manager.</div><div>&lt;p&gt;In the &lt;a href=&#34;https://en.wikipedia.org/wiki/NoSQL#Taxonomy&#34;&gt;typical classification&lt;/a&gt; of NoSQL databases, the &amp;ldquo;graph&amp;rdquo; category is one that was not covered in the &amp;ldquo;NoSQL Databases for RDF: An Empirical Evaluation&amp;rdquo; paper that I described in my &lt;a href=&#34;https://www.bobdc.com/blog/storing-and-querying-rdf-in-no&#34;&gt;last blog entry&lt;/a&gt;. (Several were &amp;ldquo;column-oriented&amp;rdquo; databases, which I always thought sounded like triple stores—the &amp;ldquo;table&amp;rdquo; part of the way people describe these always sounded to me like a stretched metaphor designed to appeal to relational database developers.) A triplestore is a graph database, and Brazilian software developer &lt;a href=&#34;http://www.linkedin.com/pub/paulo-roberto-costa-leite/24/154/749&#34;&gt;Paulo Roberto Costa Leite&lt;/a&gt; has developed a &lt;a href=&#34;http://neo4j-contrib.github.io/sparql-plugin/&#34;&gt;SPARQL plugin&lt;/a&gt; for Neo4j, the most popular of the NoSQL graph databases. This gave me enough incentive to install Neo4j and play with it and the SPARQL plugin.&lt;/p&gt;
&lt;blockquote id=&#34;id118461&#34; class=&#34;pullquote&#34;&gt;While this plugin has a ways to go before people can get serious work done with it, it&#39;s still a great start and fun to play with.&lt;/blockquote&gt;
&lt;p&gt;To quote Neo4j&amp;rsquo;s &lt;a href=&#34;http://www.neo4j.org/&#34;&gt;home page&lt;/a&gt;, it&amp;rsquo;s &amp;ldquo;a robust (fully ACID) transactional property graph database. Due to its graph data model, Neo4j is highly agile and blazing fast. For connected data operations, Neo4j runs a thousand times faster than relational databases.&amp;rdquo; According to the popular NoSQL introduction &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1934356921/bobducharmeA/&#34;&gt;Seven Databases in Seven Weeks&lt;/a&gt;, Neo4j &amp;ldquo;can store tens of billions of nodes and as many edges.&amp;rdquo; The ability to distribute a database across a cluster is another thing that makes Neo4j popular.&lt;/p&gt;
&lt;p&gt;From what I can tell, at least on Windows, you don&amp;rsquo;t want the installer version of Neo4j on its &lt;a href=&#34;http://www.neo4j.org/download&#34;&gt;download&lt;/a&gt; page, because that doesn&amp;rsquo;t create a plugins directory where you can add the SPARQL one, so get the zip version. I got release 1.9.5 of that one.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know much about Neo4j except some basics that I read in the &amp;ldquo;Seven Databases&amp;rdquo; book, so please forgive any basic misunderstandings or big deviations from standard Neo4j practices. Once I installed it and started it up with bin\neo4j.bat, I sent a browser to the main screen at http://localhost:7474 to make sure that I had installed it properly. This all worked fine; installation was really just a matter of unzipping, once I determined the right distribution to unzip.&lt;/p&gt;
&lt;p&gt;To install the SPARQL plugin, I downloaded the distribution zip file from its &lt;a href=&#34;https://github.com/paulrocost/sparqlPlugin-Neo4j&#34;&gt;github page&lt;/a&gt; (not to be confused with the project&amp;rsquo;s &lt;a href=&#34;https://github.com/neo4j-contrib/sparql-plugin&#34;&gt;github page&lt;/a&gt;, which has the source), unzipped that inside of the neo4j-community-1.9.5\plugins folder, and restarted neo4j (that is, I shut it down with a ^C in the terminal window that it created when I started it up, then started it again the same way I did originally).&lt;/p&gt;
&lt;h2 id=&#34;id118284&#34;&gt;Inserting data&lt;/h2&gt;
&lt;p&gt;I like to use &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;curl&lt;/a&gt; to test RESTful (or REST-ish) interfaces, and found that I had better luck interacting with Neo4j by using curl from the &lt;a href=&#34;http://www.cygwin.com/&#34;&gt;cygwin&lt;/a&gt; sh shell under Windows than using it with the native Windows command line prompt. Following some examples in the SPARQL plugin&amp;rsquo;s documentation, I tried the following, which successfully inserted some data. (Assume that all curl command lines shown here were actually executed as a single line.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -H Content-Type:application/json -H Accept:application/json 
  --data-binary @sampledata.txt 
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The sampledata.txt file named in that command line had this in it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{ 
  &amp;quot;s&amp;quot; : &amp;quot;http://neo4j.org#jim&amp;quot;,  
  &amp;quot;p&amp;quot; : &amp;quot;http://neo4j.org#knows&amp;quot;,  
  &amp;quot;o&amp;quot; : &amp;quot;http://neo4j.org#mitch&amp;quot;,  
  &amp;quot;c&amp;quot; : &amp;quot;http://neo4j.org&amp;quot; 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that it&amp;rsquo;s inserting a quad, not a triple, with &amp;ldquo;c&amp;rdquo; being a named graph. I&amp;rsquo;m guessing that the &amp;ldquo;c&amp;rdquo; stands for &amp;ldquo;context&amp;rdquo; because the plugin uses a lot of &lt;a href=&#34;http://www.openrdf.org/&#34;&gt;Sesame&lt;/a&gt; jar files.&lt;/p&gt;
&lt;p&gt;The following successfully performed a similar insertion, with the quad specified directly on the command line:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -H Content-Type:application/json -H 
   Accept:application/json 
   http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad  
   -d &#39;{  &amp;quot;s&amp;quot; : &amp;quot;http://neo4j.org#joe&amp;quot;,  &amp;quot;p&amp;quot; : &amp;quot;http://neo4j.org#knows&amp;quot;,  
   &amp;quot;o&amp;quot; : &amp;quot;http://neo4j.org#sara&amp;quot;,  &amp;quot;c&amp;quot; : &amp;quot;http://neo4j.org&amp;quot;}&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This worked to insert a literal string,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -H Content-Type:application/json -H Accept:application/json 
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad -d 
  &#39;{  &amp;quot;s&amp;quot; : &amp;quot;http://neo4j.org#joe&amp;quot;,  &amp;quot;p&amp;quot; : &amp;quot;http://learningsparql.com/ns/data#lastName&amp;quot;, 
  &amp;quot;o&amp;quot; : &amp;quot;\&amp;quot;Schmoe\&amp;quot;&amp;quot;,  &amp;quot;c&amp;quot; : &amp;quot;http://learningsparql.com/ns/data#test1/&amp;quot;}&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and this inserted a value with an explicit type:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -H Content-Type:application/json -H Accept:application/json 
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad  -d 
  &#39;{  &amp;quot;s&amp;quot; : &amp;quot;http://neo4j.org#joe&amp;quot;,  &amp;quot;p&amp;quot; : &amp;quot;http://learningsparql.com/ns/data#hireDate&amp;quot;, 
  &amp;quot;o&amp;quot; : &amp;quot;\&amp;quot;2012-11-09\&amp;quot;^^&amp;lt;http://www.w3.org/2001/XMLSchema#date&amp;gt;&amp;quot;,  &amp;quot;c&amp;quot; : 
  &amp;quot;http://learningsparql.com/ns/data#test1/&amp;quot;}&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;id120764&#34;&gt;Querying&lt;/h2&gt;
&lt;p&gt;With this SPARQL query stored in neo4jquery1.json,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &amp;quot;query&amp;quot; : &amp;quot;SELECT * WHERE { ?s &amp;lt;http://neo4j.org#knows&amp;gt; ?o .}&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I entered this at the cygwin sh prompt,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -H Content-Type:application/json -H Accept:application/json  
   --data-binary @neo4jquery1.json 
   http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/execute_sparql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and got this result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[ {
  &amp;quot;s&amp;quot; : &amp;quot;http://neo4j.org#jane&amp;quot;,
  &amp;quot;o&amp;quot; : &amp;quot;http://neo4j.org#jim&amp;quot;
}, {
  &amp;quot;s&amp;quot; : &amp;quot;http://neo4j.org#joe&amp;quot;,
  &amp;quot;o&amp;quot; : &amp;quot;http://neo4j.org#sara&amp;quot;
} ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I found it best to execute queries from a stored file like that, because although JSON won&amp;rsquo;t let me spread a string (in this case, the query itself) across multiple lines, it was still a little easier than packing it into a curl command line with the other parameters.&lt;/p&gt;
&lt;p&gt;A similar command line executed this query, which specifies the named graph whose triples should be returned:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &amp;quot;query&amp;quot; : &amp;quot;SELECT * WHERE { GRAPH &amp;lt;http://neo4j.org&amp;gt;  {?s ?p ?o }  }&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a few random SPARQL 1.1 features such as BIND and COUNT, and they worked fine. Because most of the Sesame JAR files say &amp;ldquo;2.6.10,&amp;rdquo; which is only a little more than a year old, I&amp;rsquo;m guessing that the support of the SPARQL 1.1 query language is pretty complete.&lt;/p&gt;
&lt;p&gt;The plugin currently does not support the &lt;a href=&#34;http://www.w3.org/TR/sparql11-update/&#34;&gt;SPARQL UPDATE&lt;/a&gt; language. Deleting the data inserted above would require the use of &lt;a href=&#34;http://components.neo4j.org/neo4j-server/1.4/rest.html&#34;&gt;native Neo4j commands&lt;/a&gt;, which would require you to know the internal Neo4j identifiers used for the nodes and edges that represent RDF resources and predicates. Perhaps a bit ironically to RDF people, these identifiers are URIs, but they will rarely be universally unique; for example, my URI &lt;a href=&#34;http://neo4j.org&#34;&gt;http://neo4j.org&lt;/a&gt;#mitch was actually stored with the URI http://localhost:7474/db/data/node/7, a URI that very likely refers to other resources on other Neo4j installations that use the default system name and port number of localhost:7474. (I assume that much of Paulo&amp;rsquo;s work in building the query plugin was mapping from the SPARQL URI references to the internal Neo4j references.)&lt;/p&gt;
&lt;h2 id=&#34;id120927&#34;&gt;The plugin, JSON, and the future&lt;/h2&gt;
&lt;p&gt;You&amp;rsquo;ve probably noticed that all input and output for this SPARQL plugin is JSON: you send data and queries to Neo4j embedded in JSON, and your results come back as JSON, though not in &lt;a href=&#34;http://www.w3.org/TR/sparql11-results-json/&#34;&gt;the W3C SPARQL Query Results JSON Format&lt;/a&gt;. This use of JSON isn&amp;rsquo;t specific to Paulo&amp;rsquo;s plugin, but a default of the &lt;a href=&#34;http://docs.neo4j.org/chunked/stable/rest-api.html&#34;&gt;Neo4j REST API&lt;/a&gt;, which currently provides the context for all SPARQL-oriented communication with a Neo4j server. While the plugin&amp;rsquo;s &lt;a href=&#34;http://neo4j-contrib.github.io/sparql-plugin/&#34;&gt;documentation&lt;/a&gt; refers to an endpoint, it&amp;rsquo;s not a SPARQL endpoint in the sense of supporting the &lt;a href=&#34;http://www.w3.org/TR/sparql11-protocol/&#34;&gt;SPARQL Protocol&lt;/a&gt; (the &amp;ldquo;P&amp;rdquo; in &amp;ldquo;SPARQL&amp;rdquo;); for now, it&amp;rsquo;s an endpoint with its own interface for accepting SPARQL queries and delivering results.&lt;/p&gt;
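&lt;p&gt;Converting the plugin&amp;rsquo;s output to the W3C result format wouldn&amp;rsquo;t be hard. Here is a hypothetical Python sketch of such a conversion (the function name is mine); it assumes that every binding value is an IRI, because the plugin&amp;rsquo;s JSON, as shown above, doesn&amp;rsquo;t say whether a value is an IRI or a literal:&lt;/p&gt;

```python
# Hypothetical sketch: convert the plugin's JSON array of variable bindings
# into the W3C SPARQL Query Results JSON Format. Assumes every value is an
# IRI, since the plugin's output doesn't distinguish IRIs from literals.
def to_w3c_results(rows):
    variables = sorted({var for row in rows for var in row})
    bindings = [
        {var: {"type": "uri", "value": value} for var, value in row.items()}
        for row in rows
    ]
    return {"head": {"vars": variables}, "results": {"bindings": bindings}}

# The first result row from the query above:
plugin_rows = [{"s": "http://neo4j.org#jane", "o": "http://neo4j.org#jim"}]
w3c = to_w3c_results(plugin_rows)
```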
&lt;p&gt;The insert_quad and execute_sparql methods shown above are currently the only two that the plugin offers, and as you might guess from the singular form of &amp;ldquo;insert_quad,&amp;rdquo; it can only insert one at a time. For now, inserting multiple quads will mean either multiple calls to this method or digging down into the lower levels of the plugin.&lt;/p&gt;
&lt;p&gt;So, while this plugin has a ways to go before people can get serious work done with it, it&amp;rsquo;s still a great start and fun to play with. I don&amp;rsquo;t want to finish this with a discussion of the RDF features that it&amp;rsquo;s missing, but instead with some mentions of the cool Neo4j things that would be great to try with RDF. I&amp;rsquo;ve already mentioned the ease with which data can apparently be distributed across clusters; another is Neo4j&amp;rsquo;s built-in &lt;a href=&#34;http://docs.neo4j.org/chunked/stable/rest-api-graph-algos.html&#34;&gt;shortest path&lt;/a&gt; algorithm(s), something I&amp;rsquo;ve always wanted for an RDF store.&lt;/p&gt;
&lt;p&gt;I look forward to Paulo&amp;rsquo;s future work, and I&amp;rsquo;d like to thank him for helping this Neo4j neophyte get this far with Neo4j and with his plugin.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;6/23/14 update: I have just discovered Michael B&amp;rsquo;s &lt;a href=&#34;http://michaelbloggs.blogspot.de/2013/05/importing-ttl-turtle-ontologies-in-neo4j.html&#34;&gt;Importing ttl (Turtle) ontologies in Neo4j&lt;/a&gt; from over a year ago. It describes things mostly in terms of Java source code, so I&amp;rsquo;m not about to jump on it and try it out right away, but it will make a good resource for people interested in using RDF in Neo4j. And, the fact that he&amp;rsquo;s an IBM employee makes it more interesting.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;1/19/2015 update: You may also be interested in my recent &lt;a href=&#34;https://twitter.com/jimwebber/status/555332116352098305&#34;&gt;Twitter exchange&lt;/a&gt; with Neo4j&amp;rsquo;s chief scientist after he said that Neo4j supports SPARQL and pointed to Paulo&amp;rsquo;s library and this blog entry.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2014">2014</category>
      
      <category domain="https://www.bobdc.com//categories/nosql">NoSQL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Storing (and querying) RDF in NoSQL database managers</title>
      <link>https://www.bobdc.com/blog/storing-and-querying-rdf-in-no/</link>
      <pubDate>Wed, 04 Dec 2013 08:36:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/storing-and-querying-rdf-in-no/</guid>
      
      
      <description><div>Interesting progress, carefully measured.</div><div>&lt;blockquote id=&#34;id134844&#34; class=&#34;pullquote&#34;&gt; &#34;...we are confident that NoSQL databases will present an ever growing opportunity to store and manage RDF data in the cloud.&#34;&lt;/blockquote&gt;
&lt;p&gt;A little over a year ago, in a blog entry titled &lt;a href=&#34;https://www.bobdc.com/blog/sparql-and-big-data-and-nosql&#34;&gt;SPARQL and Big Data (and NoSQL)&lt;/a&gt;, I wrote this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What I&amp;rsquo;d love to see, and have heard about tentative steps toward, would be SPARQL endpoints for some of these NoSQL database systems. The &lt;a href=&#34;http://d2rq.org/&#34;&gt;D2RQ&lt;/a&gt; and &lt;a href=&#34;http://www.w3.org/TR/r2rml/&#34;&gt;R2RML&lt;/a&gt; work have accomplished things that should be easier for graph-oriented NoSQL databases like Neo4J and, if I understand the quote above [from Edd Dumbill&amp;rsquo;s &lt;a href=&#34;http://shop.oreilly.com/product/0636920025559.do&#34;&gt;Planning for Big Data&lt;/a&gt;] correctly, for column-oriented NoSQL databases as well. Google searches on SPARQL and either Hadoop, Neo4J, HBase, or Cassandra show that some people have been discussing and even doing a bit of coding on several of these.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Discussions and bits of coding are nice, but I recently found something much better in a paper titled &amp;ldquo;NoSQL Databases for RDF: An Empirical Evaluation&amp;rdquo; (&lt;a href=&#34;http://ribs.csres.utexas.edu/nosqlrdf/nosqlrdf_iswc2013.pdf&#34;&gt;pdf&lt;/a&gt;)—a methodical comparison of the storage and querying of RDF in different NoSQL systems. This &lt;a href=&#34;http://iswc2013.semanticweb.org/&#34;&gt;ISWC 2013&lt;/a&gt; paper, written by ten authors from four universities in four countries, included this in its abstract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This work is, to the best of our knowledge, the first systematic attempt at characterizing and comparing NoSQL stores for RDF processing. In the following, we describe four different NoSQL stores and compare their key characteristics when running standard RDF benchmarks on a popular cloud infrastructure using both single-machine and distributed deployments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper then describes the storage and querying of RDF using &lt;a href=&#34;http://hbase.apache.org/&#34;&gt;HBase&lt;/a&gt; with &lt;a href=&#34;http://jena.apache.org/&#34;&gt;Jena&lt;/a&gt; for querying, HBase with &lt;a href=&#34;http://hive.apache.org/&#34;&gt;Hive&lt;/a&gt; as the query engine (with Jena&amp;rsquo;s ARQ to parse the queries before converting them to HiveQL), &lt;a href=&#34;https://code.google.com/p/cumulusrdf/&#34;&gt;CumulusRDF&lt;/a&gt; (&lt;a href=&#34;http://cassandra.apache.org/&#34;&gt;Cassandra&lt;/a&gt; with Sesame), and &lt;a href=&#34;http://www.couchbase.com/&#34;&gt;Couchbase&lt;/a&gt;. The study also includes the &lt;a href=&#34;http://4store.org/&#34;&gt;4store&lt;/a&gt; triplestore so that the authors could compare their NoSQL storage benchmarks with those of a native RDF triplestore. (As you might guess from its name, 4store is actually a quad store—and speaking of quads, while adding links to this paragraph, I found that fully four technologies listed here are their own separate Apache projects.)&lt;/p&gt;
&lt;p&gt;The benchmarks and testing environments are all rigorously documented in the paper. You can read these details yourself, so I&amp;rsquo;ll skip ahead to the end of their conclusion: &amp;ldquo;we are confident that NoSQL databases will present an ever growing opportunity to store and manage RDF data in the cloud.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t recognize many of the authors&amp;rsquo; names, but I certainly recognized the name of &lt;a href=&#34;http://www.juansequeda.com/&#34;&gt;Juan Sequeda&lt;/a&gt; of the University of Texas and Capsenta. His PhD work at UT that led to Capsenta&amp;rsquo;s Ultrawrap product makes Juan about the most qualified person I can think of to perform this kind of methodical review of the potential value of NoSQL database managers for storing and querying RDF, so I&amp;rsquo;m glad that he and his co-authors on the paper are doing this. Additional good news is that they&amp;rsquo;ve made &amp;ldquo;all results, as well as [their] source code, how-to guides, and EC2 images to rerun [their] experiments&amp;rdquo; available on their project&amp;rsquo;s &lt;a href=&#34;http://ribs.csres.utexas.edu/nosqlrdf&#34;&gt;web site&lt;/a&gt; for others to build on, and it looks like they have continued that work since publishing the paper. I look forward to further reports from them as efforts to store RDF in NoSQL database managers move forward.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Using SPARQL queries from native Android apps</title>
      <link>https://www.bobdc.com/blog/using-sparql-queries-from-nati/</link>
      <pubDate>Sat, 09 Nov 2013 09:13:54 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-sparql-queries-from-nati/</guid>
      
      
      <description><div>With a free, kid-friendly development kit.</div><div>&lt;img id=&#34;id123906&#34; src=&#34;https://www.bobdc.com/img/main/appinventorrdf.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;App Inventor and RDF logos&#34;/&gt;
&lt;p&gt;Google once developed a simple environment called Google App Inventor for easy development of native Android apps. After they announced that they would discontinue support and open source it in 2011, the MIT Center for Mobile Learning picked it up, so it&amp;rsquo;s now the &lt;a href=&#34;http://appinventor.mit.edu/&#34;&gt;MIT App Inventor&lt;/a&gt;. (Its &lt;a href=&#34;http://en.wikipedia.org/wiki/App_Inventor_for_Android&#34;&gt;Wikipedia page&lt;/a&gt; has a nice summary of its history.) I played with it a bit and found it pretty easy to build apps for my phone, even an app that used an RDFS model to drive a user interface. My simple experiments only scratched the surface of what was possible using SPARQL and RDF as part of a mobile app, and much more sophisticated work is on the way from our friends at the &lt;a href=&#34;http://tw.rpi.edu/&#34;&gt;Tetherless World Constellation&lt;/a&gt; group.&lt;/p&gt;
&lt;p&gt;To get a flavor of how application development works with this toolkit, flip through some of the &lt;a href=&#34;http://appinventor.mit.edu/explore/tutorials.html&#34;&gt;tutorials&lt;/a&gt;, especially the &lt;a href=&#34;http://appinventor.mit.edu/explore/content/hellopurr.html&#34;&gt;Hello Purr&lt;/a&gt; one, where they recommend that you start. After installing an App Inventor tool on your phone, you log in with a Google ID to a web-based application that lets you design your screens by dragging in and configuring various components. From there, you download a Java application called the blocks editor, where you configure programming logic. With a wi-fi connection from the machine running those to your phone, you can try out your app on your phone as you work on it with the screen designer and blocks editor. The &lt;a href=&#34;http://beta.appinventor.mit.edu/learn/reference/&#34;&gt;documentation&lt;/a&gt; tells you more about the available components and blocks.&lt;/p&gt;
&lt;p&gt;The screen designer lets you add pick lists to your app, and because their choices can be configured dynamically (and because an App Inventor web component lets you do HTTP GETs, and because there are plenty of string manipulation functions), I wrote an app that sets the pick list choices with the results of a SPARQL query. It&amp;rsquo;s a nice example of model-driven development—something we talk about a lot at TopQuadrant—in which an application&amp;rsquo;s behavior is driven by a model stored in an RDFS schema or an OWL ontology.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve written more about how my little phone app works below, but first I wanted to say a little more about the Tetherless World Constellation&amp;rsquo;s work on new App Inventor semantic web components for use in these applications, because these will allow much more sophisticated use of RDF than my demo does. Others at MIT have already built a &lt;a href=&#34;http://web.mit.edu/newsoffice/2013/building-disaster-relief-phone-apps-0930.html&#34;&gt;Disaster relief phone app&lt;/a&gt; with it.&lt;/p&gt;
&lt;p&gt;Instead of logging in to the web-based screen design app mentioned above, using these new semantic web components currently requires the use of a specialized version of this application hosted at tw.rpi.edu. Apparently these new components are on their way to inclusion in &lt;a href=&#34;http://appinventor.mit.edu/explore/tutorial-version/app-inventor-2.html&#34;&gt;App Inventor 2&lt;/a&gt;, so I&amp;rsquo;m really looking forward to that. You can learn more about how these extensions work from a YouTube video of a presentation by Tetherless World&amp;rsquo;s Evan Patton titled &lt;a href=&#34;http://www.youtube.com/watch?v=lR8s6AtO24Q&#34;&gt;Extending the MIT AppInventor with Semantic Web Techno&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My little application lets you pick an item of clothing, the size, and the color, and then it sends a string of the selected data off to a script on another server.&lt;/p&gt;
&lt;img id=&#34;id123736&#34; src=&#34;https://www.bobdc.com/img/main/appinventor1.png&#34; height=&#34;400&#34;/&gt;
&lt;p&gt;The choice of items, sizes, and colors comes from the model below, stored on a SPARQL endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix ps:   &amp;lt;http://snee.com/ns/demos/productSchema#&amp;gt; . 
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; . 

ps:Product a rdfs:Class . 

ps:Color   a rdfs:Class . 

ps:Size    a rdfs:Class . 

ps:color a rdf:Property ;
         rdfs:domain ps:Product ;
         rdfs:range ps:Color .

ps:size a rdf:Property ;
        rdfs:domain ps:Product ;
        rdfs:range ps:Size .

ps:tshirt  a ps:Product ; rdfs:label &amp;quot;T-shirt&amp;quot; . 
ps:sweater a ps:Product ; rdfs:label &amp;quot;sweater&amp;quot; . 
ps:pants   a ps:Product ; rdfs:label &amp;quot;pants&amp;quot; . 

ps:black a ps:Color ; rdfs:label &amp;quot;black&amp;quot; . 
ps:blue  a ps:Color ; rdfs:label &amp;quot;blue&amp;quot; . 
ps:white a ps:Color ; rdfs:label &amp;quot;white&amp;quot; . 

ps:small  a ps:Size ; rdfs:label &amp;quot;small&amp;quot; . 
ps:medium a ps:Size ; rdfs:label &amp;quot;medium&amp;quot; . 
ps:large  a ps:Size ; rdfs:label &amp;quot;large&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you touch the color button on the interface, the app displays the choices from the model:&lt;/p&gt;
&lt;img id=&#34;id123860&#34; src=&#34;https://www.bobdc.com/img/main/appinventor2.png&#34; height=&#34;400&#34;/&gt;
&lt;p&gt;Selecting one displays that value on the button:&lt;/p&gt;
&lt;img id=&#34;id123843&#34; src=&#34;https://www.bobdc.com/img/main/appinventor3.png&#34; height=&#34;400&#34;/&gt;
&lt;p&gt;After you select an item, color, and size, touching the Submit button sends the selected data off to another web server with an HTTP GET.&lt;/p&gt;
&lt;p&gt;The interesting part of this app (at least, to RDF geeks) is clearer when we change the model that specifies the interface details—for example, by adding a new instance to the data model&amp;rsquo;s Color class on the server with the SPARQL endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ps:red a ps:Color ; rdfs:label &amp;quot;red&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After clicking the app&amp;rsquo;s Refresh button (or shutting down and restarting the app), the next time you press the Select Color button you&amp;rsquo;ll see the new choice of colors reflected:&lt;/p&gt;
&lt;img id=&#34;id126289&#34; src=&#34;https://www.bobdc.com/img/main/appinventor4.png&#34; height=&#34;400&#34;/&gt;
&lt;p&gt;Here&amp;rsquo;s how it works: upon startup or when pressing the Refresh button, the app sends the following query to the SPARQL endpoint to find out how instances of the Product class are modeled, requesting the result as comma-separated values. To do this, the query asks for all the properties associated with the Product class (that is, which properties have an &lt;code&gt;rdfs:domain&lt;/code&gt; of &lt;code&gt;ps:Product&lt;/code&gt;) and what their potential values are (that is, what the instances of the class specified as each property&amp;rsquo;s &lt;code&gt;rdfs:range&lt;/code&gt; value are):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX ps:   &amp;lt;http://snee.com/ns/demos/productSchema#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 

SELECT ?list ?listItem
WHERE 
{
  {
    ?property rdfs:domain ps:Product ;
              rdfs:range ?range . 
    ?propertyValue a ?range ;
                   rdfs:label ?listItem .
    BIND(strafter(str(?property),&amp;quot;#&amp;quot;) AS ?list)
  }
  UNION
  { 
    ?item a ps:Product ; rdfs:label ?listItem .
    BIND(&amp;quot;item&amp;quot; AS ?list)
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;App Inventor blocks offer plenty of options for parsing out the CSV, so my app uses some of these to find the values it needs in the query result and then uses those values to set each list picker widget&amp;rsquo;s choices.&lt;/p&gt;
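&lt;p&gt;With the model above, the comma-separated result that comes back would look something like this (an illustrative reconstruction; the actual row order depends on the endpoint):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;list,listItem
color,black
color,blue
color,white
size,small
size,medium
size,large
item,T-shirt
item,sweater
item,pants
&lt;/code&gt;&lt;/pre&gt;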
&lt;a name=&#34;multilanguage&#34;/&gt;
&lt;p&gt;If the data had included the value labels in multiple languages, like this, the query above would need a small addition to retrieve only the English versions of the labels:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ps:black a ps:Color ; rdfs:label &amp;quot;black&amp;quot;@en ; rdfs:label &amp;quot;negro&amp;quot;@es . 
ps:blue  a ps:Color ; rdfs:label &amp;quot;blue&amp;quot;@en  ; rdfs:label &amp;quot;azul&amp;quot;@es .  
ps:white a ps:Color ; rdfs:label &amp;quot;white&amp;quot;@en ; rdfs:label &amp;quot;blanco&amp;quot;@es .  
&lt;/code&gt;&lt;/pre&gt;
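&lt;p&gt;The addition could be as small as one FILTER at the end of the query&amp;rsquo;s WHERE clause, where it applies to the labels matched in both branches of the UNION (a sketch that assumes every label carries a language tag):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  FILTER(lang(?listItem) = &amp;quot;en&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;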
&lt;p&gt;A slight change to that version of the query could have it retrieve the Spanish labels instead, making it easy to create an Android app with configurable multi-language support. (I could use the same technique of using RDF and SPARQL to change the rest of the text on the form—for example, to change the button label &amp;ldquo;Select Size&amp;rdquo; to say &amp;ldquo;Selecciona Tamaño.&amp;rdquo;) Because App Inventor lets you dynamically assemble the URL containing the SPARQL query before sending it to the endpoint, the app could modify the query to retrieve either English or Spanish labels based on whether the user picked &amp;ldquo;English&amp;rdquo; or &amp;ldquo;español&amp;rdquo; from a new &amp;ldquo;Select language&amp;rdquo; button that would be easy to add.&lt;/p&gt;
&lt;p&gt;It would have been even nicer if, instead of hardcoding my form with Item, Color, and Size fields, those could have been auto-generated in the form based on which properties the query found that had an &lt;code&gt;rdfs:domain&lt;/code&gt; value of &lt;code&gt;ps:Product&lt;/code&gt;. This is the sort of thing that the Tetherless World extension will allow. (TopQuadrant&amp;rsquo;s TopBraid platform has always made this possible, but native phone apps are not a current target.)&lt;/p&gt;
&lt;p&gt;I want to reiterate two key points about App Inventor: first, it&amp;rsquo;s very easy to use, drawing a lot on MIT research into programming for kids using environments such as &lt;a href=&#34;http://scratch.mit.edu/&#34;&gt;Scratch&lt;/a&gt;. (Young people continue to be a &lt;a href=&#34;http://appinventor.mit.edu/explore/news/teens-show-app-inventor-apps-technovation-world-pitch-competition.html&#34;&gt;big target&lt;/a&gt; for App Inventor developer evangelism.) Second—and this is especially impressive considering that we&amp;rsquo;re talking about a programming environment that&amp;rsquo;s so easy to use—it&amp;rsquo;s creating &lt;em&gt;native apps&lt;/em&gt;. You&amp;rsquo;re not creating scripts that require some runtime thing to execute; you can create .apk files that anyone with an Android phone can install and use. I think that this is pretty exciting, and the ability to work RDF-based technology into the mix makes it even more exciting.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Lou Reed</title>
      <link>https://www.bobdc.com/blog/lou-reed/</link>
      <pubDate>Mon, 28 Oct 2013 08:35:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/lou-reed/</guid>
      
      
      <description><div>And New York City.</div><div>&lt;img id=&#34;id122829&#34; src=&#34;https://www.bobdc.com/img/main/loureed.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Lou Reed&#34; width=&#34;200&#34;/&gt;
&lt;p&gt;(To listen to while you read this: &lt;a href=&#34;http://www.youtube.com/watch?v=vAEbOdnRUc0&#34;&gt;The Blue Mask&lt;/a&gt;.) New York City helped to define who Lou Reed was, but since I first became aware of him in the mid-seventies, Lou Reed played a big part in defining what New York City was to me. It&amp;rsquo;s difficult for me to picture the city without him.&lt;/p&gt;
&lt;p&gt;The possibility of actually seeing him in public there was part of the fun. Once, while my wife and I were attending the lesser-known Shakespeare play &amp;ldquo;Coriolanus&amp;rdquo; at the BAM Harvey theater, I went to the men&amp;rsquo;s room during the intermission, and on my way out there was a line of guys waiting to get in, with Lou halfway back in the line.&lt;/p&gt;
&lt;p&gt;My better Lou story took place at Rudy&amp;rsquo;s Music Stop sometime in the mid-eighties. 48th Street was known for its music stores, and while there were big famous ones like Sam Ash and Manny&amp;rsquo;s, when I worked near that neighborhood and was in a band I usually went to Rudy&amp;rsquo;s Music Stop, a smaller one, to get guitar strings and so forth. They specialized in Schecter guitars, one of which I had (and still have), and Reed was Schecter&amp;rsquo;s most famous customer. One day on my lunch break I went there to look at some pickups I was thinking of adding to my guitar, and Reed was sitting on a fold-up chair, alone in a store that would have been crowded with seven people in it. You know how sometimes you see a celebrity in day-to-day life and you&amp;rsquo;re not sure whether it&amp;rsquo;s really the person you think it is? With Lou Reed, there was absolutely no question who was sitting there with a black T-shirt and black jeans faded to two different shades of gray.&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t want to be a gushing fan boy and tried to act like a cool New York musician guy, with plenty of inspiration for this three feet away from me. There was an empty, open guitar case across the counter of the store&amp;rsquo;s main glass case, and I had to lean down to peer under it at the pickups that I was interested in, and Reed jumped up and said &amp;ldquo;Oh, let me move that for you.&amp;rdquo; I guess he was waiting for them to do some work on the case&amp;rsquo;s contents in the back room; I took the opportunity to say that I had seen him the previous April at the Ritz (after it moved to the former Studio 54—it was his Blue Mask tour) and that he had an amazing band with him: &lt;a href=&#34;http://en.wikipedia.org/wiki/Robert_Quine&#34;&gt;Robert Quine&lt;/a&gt;, another hero of mine, on lead guitar; Fred Maher on drums, and Fernando Saunders on bass. Lou said &amp;ldquo;Thanks, man&amp;rdquo; and I left it at that.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve certainly heard stories of him being an asshole to people, but I&amp;rsquo;ll never forget him jumping up to move his guitar case for me. I&amp;rsquo;ll also never forget how I found &amp;ldquo;White Light White Heat&amp;rdquo; in a local record store cutout bin when I was 16 and thought &amp;ldquo;this is Lou Reed&amp;rsquo;s old band, before he put out &amp;lsquo;Walk on the Wild Side&amp;rsquo; and &amp;lsquo;Rock and Roll Animal&amp;rsquo;&amp;rdquo; and how I brought it home, put it on, and learned—as I learned from William Burroughs around the same time—that there was a much bigger world out there than I had imagined. And I&amp;rsquo;ll never forget how, since then, Reed put out enough great music to guarantee his historical importance even if there had never been a Velvet Underground: Street Hassle, The Blue Mask, Magic and Loss, and the solo albums before I discovered the Velvet Underground: Transformer, Coney Island Baby, Berlin, and all the ones in between. I saw him perform live four times, and he was always mesmerizing and always rocked very, very hard.&lt;/p&gt;
&lt;p&gt;In the simplified history of rock and roll, Bob Dylan showed everyone that lyrics could be about more than cars, girls, and school. From his study with Delmore Schwartz at Syracuse University, Reed had already figured that much out, and when that background got paired with John Cale&amp;rsquo;s La Monte Young and John Cage influences in a loud dissonant band playing at Andy Warhol parties, it made people of that decade and every decade since rethink the possibilities of what rock and roll could be. Reed continued to produce great songs, lyrics, music, and guitar playing in each of those decades, and it&amp;rsquo;s sad to think that we in general and New York City in particular won&amp;rsquo;t have him anymore.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Lou Reed picture by &lt;a href=&#34;http://www.flickr.com/photos/streamofconsciousness/&#34;&gt;Mike McGrath&lt;/a&gt;, &lt;a href=&#34;http://creativecommons.org/licenses/by-nc-nd/2.0/&#34;&gt;Creative Commons CC BY-NC-ND 2.0&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Linked Open Data Cloud: The Animated GIF!</title>
      <link>https://www.bobdc.com/blog/linked-open-data-cloud-the-ani/</link>
      <pubDate>Thu, 17 Oct 2013 08:52:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/linked-open-data-cloud-the-ani/</guid>
      
      
<description><div>My first animated GIF.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/LODCloud.gif&#34;&gt;&lt;img id=&#34;id128268&#34; src=&#34;https://www.bobdc.com/img/main/LODCloud.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;animated GIF of Linked Open Data cloud diagrams&#34; width=&#34;300&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve had a new respect for animated GIFs since reading Anil Dash&amp;rsquo;s blog posting &lt;a href=&#34;http://dashes.com/anil/2011/07/animated-gifs-triumphant.html&#34;&gt;Animated GIFs triumphant&lt;/a&gt;. When I found that I could create them with &lt;a href=&#34;http://www.gimp.org/&#34;&gt;gimp&lt;/a&gt;, a program on my top five list of software to install on a brand new machine, I couldn&amp;rsquo;t resist trying to make one. I have plenty of PowerPoint presentations where a series of slides show the growth of the &lt;a href=&#34;http://lod-cloud.net/&#34;&gt;Linked Data Cloud&lt;/a&gt;, so I made the animated GIF you see here of the &lt;a href=&#34;http://lod-cloud.net/#history&#34;&gt;available diagrams&lt;/a&gt;. Click it to see the full-sized version.&lt;/p&gt;
&lt;p&gt;While the hilarious &lt;a href=&#34;http://www.youtube.com/show/yousuckatphotoshop&#34;&gt;You Suck at Photoshop&lt;/a&gt; videos are a deliberate parody, the YouTube video &lt;a href=&#34;http://www.youtube.com/watch?v=4yhc1-I0CIY&#34;&gt;Animated Gif Tutorial GIMP 2.6&lt;/a&gt; seems like an unintentional parody, with someone using a jigsaw and other crashing noises in the background, but it did show me what I had to do. (For one thing, I had to get more comfortable using gimp layers.)&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re wondering why there haven&amp;rsquo;t been any new diagrams since September of 2011, the answer is good news: the network of available linked open data sites just got too big to fit into such a diagram. People are making new linked data sources available all the time, both small and experimental and &lt;a href=&#34;https://www.ebi.ac.uk/rdf/&#34;&gt;large and robust-looking&lt;/a&gt;. Lately a lot of people have been complaining about the existence of unreliable public SPARQL endpoints out there; I prefer to concentrate on the ones that work, like the new EMBL-EBI one. (The regular web has plenty of dead sites too. And, I don&amp;rsquo;t build applications with dynamic dependencies on public endpoints. If they have data that will be useful to me, I use SPARQL queries to pull that data down where I can store it locally. I mean, duh.)&lt;/p&gt;
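&lt;p&gt;Pulling data down for local use typically means a SPARQL CONSTRUCT query, because its result is itself RDF triples that can be loaded into a local triplestore. Here is a sketch of the idea (the particular slice of DBpedia data that it asks for is just an illustrative example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dbo:  &amp;lt;http://dbpedia.org/ontology/&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

# copy some city names and populations for local storage
CONSTRUCT { ?city rdfs:label ?name ;
                  dbo:populationTotal ?population . }
WHERE {
  ?city a dbo:City ;
        rdfs:label ?name ;
        dbo:populationTotal ?population .
  FILTER(lang(?name) = &amp;quot;en&amp;quot;)
}
LIMIT 1000
&lt;/code&gt;&lt;/pre&gt;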
&lt;p&gt;My animation shows a very exciting period in the history of the growth of linked data. And, for an added bonus, I&amp;rsquo;m sure its transition from black and white to color reminds you of the &lt;a href=&#34;http://www.youtube.com/watch?v=x6D8PAGelN8&#34;&gt;corresponding moment&lt;/a&gt; in &amp;ldquo;The Wizard of Oz,&amp;rdquo; right?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/linked-data">linked-data</category>
      
    </item>
    
    <item>
      <title>Making charts out of SPARQL query results with sgvizler</title>
      <link>https://www.bobdc.com/blog/making-charts-out-of-sparql-qu/</link>
      <pubDate>Sun, 22 Sep 2013 09:21:28 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/making-charts-out-of-sparql-qu/</guid>
      
      
      <description><div>Embed a query in your HTML, name an endpoint, and pick a chart type.</div><div>&lt;p&gt;I finally got around to trying sgvizler, and I wish I&amp;rsquo;d done so earlier. Once your HTML page references the sgvizler JavaScript and CSS, you can specify a query to send to any SPARQL endpoint you want and then see a chart of the query results on that web page. Scroll down a bit on sgvizler&amp;rsquo;s &lt;a href=&#34;https://code.google.com/p/sgvizler/&#34;&gt;Google code home page&lt;/a&gt; and you&amp;rsquo;ll see a nice range of available chart types.&lt;/p&gt;
&lt;p&gt;After I &lt;a href=&#34;https://code.google.com/p/sgvizler/downloads/list&#34;&gt;downloaded&lt;/a&gt; and unzipped the sgvizler distribution (a file that, before unzipping, was all of 72K in size) I had a directory with a few files and an &lt;code&gt;example&lt;/code&gt; subdirectory. One of the files was sgvizler.html, which displays a &lt;a href=&#34;http://data.semanticweb.org/snorql/&#34;&gt;SNORQL&lt;/a&gt;-like form where you can enter queries to send off to the &lt;a href=&#34;http://sws.ifi.uio.no/sparql/world&#34;&gt;http://sws.ifi.uio.no/sparql/world&lt;/a&gt; endpoint. The page includes a lot of JavaScript code where you can change the endpoint and other parameters; I had some trouble figuring this out and was happy to find that the files in the &lt;code&gt;example&lt;/code&gt; subdirectory were much more minimal and only required the setting of attributes on an HTML &lt;code&gt;div&lt;/code&gt; element to configure.&lt;/p&gt;
&lt;p&gt;Based on those examples, I created a simple web page using sgvizler that creates a &lt;a href=&#34;http://snee.com/sparql/sgvizler/examples/USComputerCompanies.html&#34;&gt;Revenue and operating income of US computer companies&lt;/a&gt; chart using data from DBpedia. If you follow the link in the previous sentence, you&amp;rsquo;ll see the image dynamically generated. Here&amp;rsquo;s a screen shot:&lt;/p&gt;
&lt;img id=&#34;id118496&#34; width=&#34;600&#34; src=&#34;https://www.bobdc.com/img/main/sgvizlerdbp.png&#34; alt=&#34;sgvizler graph using TopBraid Composer as an endpoint&#34;/&gt;
&lt;p&gt;If you follow that link and do a View Source you&amp;rsquo;ll see that very little was required besides the URL of the endpoint and the query itself. The HTML &lt;code&gt;head&lt;/code&gt; element has some &lt;code&gt;link&lt;/code&gt; and &lt;code&gt;script&lt;/code&gt; elements that point to appropriate sgvizler JavaScript and CSS files, and then the actual chart is specified with an empty &lt;code&gt;div&lt;/code&gt; element that enumerates details with various attributes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; &amp;lt;div id=&amp;quot;sgvzl_example_query&amp;quot; 
   data-sgvizler-endpoint=&amp;quot;http://dbpedia.org/sparql&amp;quot;
   data-sgvizler-chart=&amp;quot;gColumnChart&amp;quot;
   data-sgvizler-chart-options=&amp;quot;title=Revenue and Operating Income of US Computer Companies (revenue &amp;gt; $1B)|
                                vAxis.title=US Dollars|chartArea.left=150&amp;quot;
   data-sgvizler-loglevel=&amp;quot;2&amp;quot;
   style=&amp;quot;width:1200px; height:400px;&amp;quot;         
   data-sgvizler-query=&#39;
     PREFIX dbo: &amp;lt;http://dbpedia.org/ontology/&amp;gt;
     PREFIX dct: &amp;lt;http://purl.org/dc/terms/&amp;gt;
     SELECT ?name ?revenue ?operatingIncome
     WHERE {
       ?company rdfs:label ?taggedName ;
               dct:subject &amp;lt;http://dbpedia.org/resource/Category:Computer_companies_of_the_United_States&amp;gt; ;
               dbo:revenue ?revenueFloat ;
               dbo:operatingIncome ?opIncomeFloat . 
        BIND(xsd:integer(?revenueFloat) AS ?revenue) . 
        BIND(xsd:integer(?opIncomeFloat) AS ?operatingIncome) .
        BIND(str(?taggedName)AS ?name)
        FILTER(lang(?taggedName) = &amp;quot;en&amp;quot;)
        FILTER(?revenue &amp;gt; 1000000000)
     }
     ORDER BY ?revenue
&#39;&amp;gt;&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The use of the different attributes is documented on their &lt;a href=&#34;https://code.google.com/p/sgvizler/wiki/UsingSgvizler&#34;&gt;UsingSgvizler&lt;/a&gt; Google code page. Note how the actual SPARQL query is just the value of one more attribute: &lt;code&gt;data-sgvizler-query&lt;/code&gt;. (Also note that each &amp;lt; character in the query must be represented as the entity reference &lt;code&gt;&amp;amp;lt;&lt;/code&gt; because it&amp;rsquo;s inside of an attribute value.)&lt;/p&gt;
&lt;p&gt;Because sgvizler builds on &lt;a href=&#34;https://developers.google.com/chart/?csw=1&#34;&gt;Google charts&lt;/a&gt;, the options for the &lt;code&gt;data-sgvizler-chart-options&lt;/code&gt; attribute depend on which chart type you select in the &lt;code&gt;data-sgvizler-chart&lt;/code&gt; attribute; see the sgvizler home page for named examples of the options. I picked gColumnChart for image above and found options like &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;vAxis.title&lt;/code&gt; at the Google Charts &lt;a href=&#34;https://google-developers.appspot.com/chart/interactive/docs/gallery/columnchart&#34;&gt;Visualization: Column Chart&lt;/a&gt; page. (I haven&amp;rsquo;t tried any of the animation or interactivity options, and I&amp;rsquo;m not sure which sgvizler supports, but they sound like fun.)&lt;/p&gt;
&lt;p&gt;With my USComputerCompanies.html file sitting on my hard disk and the appropriate sgvizler files in its parent directory, I could do a File/Open from Chrome or Firefox and the generated image displayed just fine. It turns out that the image doesn&amp;rsquo;t necessarily display when opening an HTML page like this if the SPARQL endpoint that it references is local, as opposed to being remote like DBpedia. I think this is because of the browser&amp;rsquo;s same-origin restrictions, which limit the HTTP requests that scripts on a page loaded from a file:/// URL (such as the AJAX calls that sgvizler makes through jQuery) are allowed to make. To make it work with a local endpoint, the key is to point your browser at a local web server and open the HTML file with an http:// URL instead of using File/Open to open it as if it were a file:/// URL. As far as I could tell, this was necessary with any HTML file that used sgvizler to send a SPARQL query to an endpoint at http://localhost.&lt;/p&gt;
&lt;p&gt;For example, &lt;a href=&#34;http://www.topquadrant.com/products/TB_Composer.html&#34;&gt;TopBraid Composer&lt;/a&gt; Maestro Edition can act as a local SPARQL endpoint, but sgvizler wouldn&amp;rsquo;t create a chart of results retrieved from http://localhost:8083/tbl/sparql when I displayed my tbtest.html file by selecting File/Open in my browser. When I put tbtest.html in a tbl-www\sgvtest\sample subdirectory of a project called sandbox and the appropriate sgvizler files in the tbl-www\sample directory, sgvizler had no problem when I sent a browser to http://localhost:8083/tbl/data/sandbox/sgvtest/sample/tbtest.html, and displayed this graph of which schools had more than one attendee from the Kennedys extended family:&lt;/p&gt;
&lt;img id=&#34;id120783&#34; width=&#34;600&#34; src=&#34;https://www.bobdc.com/img/main/sgvizlertbc.png&#34; alt=&#34;sgvizler graph using TopBraid Composer as an endpoint&#34;/&gt;
&lt;p&gt;A different example: using &lt;a href=&#34;http://www.openrdf.org/&#34;&gt;Sesame&lt;/a&gt; as a local endpoint, I created a sesameTest.html file that sent a SPARQL query to the Sesame endpoint http://localhost:8080/openrdf-sesame/repositories/myRepo. When I stored the HTML file in webapps\ROOT\sgvtest\samples in the Tomcat directory where I was running Sesame (ROOT being where you&amp;rsquo;d store files being delivered by Tomcat acting as a regular web server outside of the Sesame servlet), I opened up the web page as http://localhost:8080/sgvtest/samples/sesameTest.html and sgvizler generated the test graph from the results of that page&amp;rsquo;s SPARQL query just fine, unlike when I opened the same file with File/Open.&lt;/p&gt;
&lt;p&gt;All my tests showed one chart at a time, but some examples on the sgvizler web site show how easy it is to display &lt;a href=&#34;http://sgvizler.googlecode.com/svn/release/0.5/example/exNPD2.html&#34;&gt;multiple graphs at once&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Between the range of available charts, the extra attributes available to customize each chart type&amp;rsquo;s appearance, the fact that I eventually got it to work with every SPARQL endpoint that I tried it with, and the ability to set everything up by merely entering values in &lt;code&gt;div&lt;/code&gt; attributes in an HTML page with no JavaScript wrangling necessary, I am very impressed with sgvizler. I look forward to using it more.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Semantic Web Journal article on DBpedia</title>
      <link>https://www.bobdc.com/blog/semantic-web-journal-article-o/</link>
      <pubDate>Sun, 25 Aug 2013 10:31:15 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-journal-article-o/</guid>
      
      
      <description><div>DBpedia: more impressive all the time.</div><div>&lt;p&gt;&lt;a href=&#34;http://dbpedia.org/About&#34;&gt;&lt;img id=&#34;id140279&#34; src=&#34;https://www.bobdc.com/img/main/dbpedia_logo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;DBpedia logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It took me a while to finally sit down and read the &lt;a href=&#34;http://www.semantic-web-journal.net/&#34;&gt;Semantic Web Journal&lt;/a&gt; paper &amp;ldquo;DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia&amp;rdquo; &lt;a href=&#34;http://www.semantic-web-journal.net/system/files/swj499.pdf&#34;&gt;(pdf)&lt;/a&gt;, but I&amp;rsquo;m glad I did, and I wanted to summarize a few things I learned from it.&lt;/p&gt;
&lt;p&gt;Near the beginning the paper has a good summary of what DBpedia is working from:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That&amp;rsquo;s a rich knowledge base that is represented in RDF so that we can query it with SPARQL and treat it as Linked Data.&lt;/p&gt;
&lt;p&gt;According to the article, the DBpedia project began in 2006, and four years later began an effort to develop &amp;ldquo;an ontology schema and mappings from Wikipedia infobox properties to this ontology&amp;hellip; This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.&amp;rdquo; Like DBpedia (and Wikipedia), this development is a community-based effort. The people working on it use the &lt;a href=&#34;http://mappings.dbpedia.org/index.php/Main_Page&#34;&gt;DBpedia Mappings Wiki&lt;/a&gt;, a set of tools that includes a Mapping Validator, an Extraction Tester, and a Mapping Tool.&lt;/p&gt;
&lt;p&gt;I always described DBpedia as an RDF representation of Wikipedia infobox data, but this ontology work is only one example of how it does more than just provide SPARQL access to infobox data. As the infobox data evolves, the work of mapping it to an ontology is never done, so the available properties reflect the differences. For example, I had wondered about the difference between the properties &lt;a href=&#34;http://dbpedia.org/property/birthPlace&#34;&gt;http://dbpedia.org/property/birthPlace&lt;/a&gt; and &lt;a href=&#34;http://dbpedia.org/ontology/birthPlace&#34;&gt;http://dbpedia.org/ontology/birthPlace&lt;/a&gt;, and these two excerpts from the paper&amp;rsquo;s bulleted list about URI schemes explain it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://dbpedia.org/property/&#34;&gt;http://dbpedia.org/property/&lt;/a&gt; (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://dbpedia.org/ontology/&#34;&gt;http://dbpedia.org/ontology/&lt;/a&gt; (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, while there has been work to develop a DBpedia ontology, if some infobox field doesn&amp;rsquo;t fit the ontology, they don&amp;rsquo;t throw it out; they define a property for it in the &lt;a href=&#34;http://dbpedia.org/property/&#34;&gt;http://dbpedia.org/property/&lt;/a&gt; namespace. Of course, this doesn&amp;rsquo;t completely answer my original question, because if the ontology includes a &lt;a href=&#34;http://dbpedia.org/ontology/birthPlace&#34;&gt;http://dbpedia.org/ontology/birthPlace&lt;/a&gt; property, that sounds like a good place to store the value that had been stored using &lt;a href=&#34;http://dbpedia.org/property/birthPlace&#34;&gt;http://dbpedia.org/property/birthPlace&lt;/a&gt;. However, comparing the ontology/birthPlace values with the property/birthPlace values for some resources reveals that they don&amp;rsquo;t always line up perfectly, and the alignment can&amp;rsquo;t always be automated—just because the two URIs have the same local name doesn&amp;rsquo;t mean that they refer to the same thing—so the project stores all the values until a human can get to each resource to review these issues.&lt;/p&gt;
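&lt;p&gt;For example, to compare the two sets of values yourself, a query along these lines (the choice of resource is just an illustration, and the results depend on the data loaded at the time) can be sent to the DBpedia SPARQL endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dbp: &amp;lt;http://dbpedia.org/property/&amp;gt;
PREFIX dbo: &amp;lt;http://dbpedia.org/ontology/&amp;gt;

SELECT ?rawValue ?ontologyValue
WHERE {
  &amp;lt;http://dbpedia.org/resource/Edgar_Allan_Poe&amp;gt; dbp:birthPlace ?rawValue ;
                                                 dbo:birthPlace ?ontologyValue . 
}
&lt;/code&gt;&lt;/pre&gt;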
&lt;p&gt;I also didn&amp;rsquo;t realize just how much modeling has been done. The diagram in Figure 3 of the paper illustrates some subclass, domain, and range relationships between various classes and properties such as the PopulatedPlace class. &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A++%3Fs+%3Fp+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FPopulatedPlace%3E%0D%0A++FILTER%28+%3Fp+%21%3D+rdf%3Atype%29%0D%0A%7D&#34;&gt;This SPARQL query&lt;/a&gt; shows not only that this class has six subclasses, but also that many properties have it as a domain or range. When I downloaded the T-BOX ontology that contains this modeling from DBpedia&amp;rsquo;s &lt;a href=&#34;http://dbpedia.org/Ontology&#34;&gt;DBpedia Ontology&lt;/a&gt; page and brought it up in TopBraid Composer, it looked great:&lt;/p&gt;
&lt;img id=&#34;id140232&#34; src=&#34;https://www.bobdc.com/img/main/dbpediapaper1.png&#34; alt=&#34;dbpedia_3.8.owl in TopBraid Composer&#34; width=&#34;640&#34;/&gt;
&lt;p&gt;(Apparently, the property aircraftHelicopterAttack has a domain of MilitaryUnit and a range of MeanOfTransportation.) Another interesting point about this ontology appears later in the paper: &amp;ldquo;The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to &lt;a href=&#34;http://schema.org&#34;&gt;http://schema.org&lt;/a&gt; terms,&amp;rdquo; so it can enhance the value of collections of data using this increasingly popular vocabulary.&lt;/p&gt;
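&lt;p&gt;For reference, the PopulatedPlace query linked above decodes to this (the rdf prefix, which snorql supplies automatically, is declared here to make the query self-contained):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;

SELECT * WHERE {
  ?s ?p &amp;lt;http://dbpedia.org/ontology/PopulatedPlace&amp;gt;
  FILTER( ?p != rdf:type )
}
&lt;/code&gt;&lt;/pre&gt;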
&lt;p&gt;In RDF, object property values are more valuable than literal values because they can lead to additional data (hence the first &lt;a href=&#34;http://www.w3.org/DesignIssues/LinkedData.html&#34;&gt;principle&lt;/a&gt; of Linked Data: &amp;ldquo;Use URIs as names for things&amp;rdquo;), so it was nice to read about this step in DBpedia&amp;rsquo;s data preparation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was also interesting to see how DBpedia makes changesets available to mirrors:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Whenever a Wikipedia article is processed, we get two disjoint sets of triples. A set for the added triples, and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Section 6.5 of the paper explains how popular this practice is, complete with a graph of synchronization requests.&lt;/p&gt;
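&lt;p&gt;As a sketch of what such a changeset pair might look like (the file names and the triples here are made up for illustration), each of the two sets is just an N-Triples file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 000001.added.nt
&amp;lt;http://dbpedia.org/resource/Lyon&amp;gt; &amp;lt;http://dbpedia.org/ontology/populationTotal&amp;gt; &amp;quot;479803&amp;quot; .

# 000001.removed.nt
&amp;lt;http://dbpedia.org/resource/Lyon&amp;gt; &amp;lt;http://dbpedia.org/ontology/populationTotal&amp;gt; &amp;quot;474946&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A mirror applying these files deletes the triples in the &amp;ldquo;removed&amp;rdquo; set and asserts the ones in the &amp;ldquo;added&amp;rdquo; set.&lt;/p&gt;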
&lt;p&gt;DBpedia also uses a lot more Natural Language Processing techniques than I realized, providing some nice connections between the two different senses of the term &amp;ldquo;semantic web.&amp;rdquo; Section 2.6 of the paper (&amp;ldquo;NLP Extraction&amp;rdquo;) describes some fascinating additional work done beyond the straight mapping of infobox fields to RDF. Natural Language Processing technology is used to create datasets of topic signatures, grammatical gender, localizations, and thematic concepts based on analysis of the Wikipedia unstructured free text paragraphs. The thematic concepts one is especially interesting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The thematic concepts data set relies on Wikipedia’s category system to capture the idea of a ‘theme’, a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are ‘thematic’, that is, they are the center of discussion for a category.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried to find an example of such a theme tying together some entities and concepts, but had no luck; I&amp;rsquo;d be happy to list a few here if someone can point me in the right direction.&lt;/p&gt;
&lt;p&gt;Section 7.1 describes further NLP work such as the use of &lt;a href=&#34;http://wiki.dbpedia.org/Datasets/NLP&#34;&gt;specialized NLP data sets&lt;/a&gt; &amp;ldquo;to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples.&amp;rdquo; It also describes &lt;a href=&#34;https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki&#34;&gt;DBpedia Spotlight&lt;/a&gt;, which is&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;an open source tool including a free web service that detects mentions of DBpedia resources in text&amp;hellip; The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the first significant free tool I&amp;rsquo;ve heard of that can annotate free text with RDF metadata based on analysis of that text since Reuters Calais&amp;rsquo; free service became available &lt;a href=&#34;https://www.bobdc.com/blog/having-fun-with-reuters-calais&#34;&gt;over five years ago&lt;/a&gt;. I definitely look forward to playing with that.&lt;/p&gt;
&lt;p&gt;A few more fun facts from the paper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I had wondered about DBpedia&amp;rsquo;s relationship to &lt;a href=&#34;http://meta.wikimedia.org/wiki/Wikidata&#34;&gt;Wikidata&lt;/a&gt;, so I was happy to read that in &amp;ldquo;future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I had heard that DBpedia was one of the datasets used when IBM&amp;rsquo;s Watson system won the quiz show Jeopardy, but seeing it in this paper made it a little more official for me.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In 2010, the DBpedia team replaced the PHP-based extraction framework with one written in Scala, the functional, object-oriented JVM language developed at the École Polytechnique Fédérale de Lausanne.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I won&amp;rsquo;t summarize it here, but the paper includes information on usage of DBpedia by spoken language as well as the hardware in use and the maximum number and amount of requests allowed from a given IP address.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The conclusion of the paper says that it &amp;ldquo;demonstrated that DBpedia matured and improved significantly in the last years in particular also in terms of coverage, usability, and data quality.&amp;rdquo; I agree!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
    </item>
    
    <item>
      <title>Using VALUES to map values in a SPARQL query</title>
      <link>https://www.bobdc.com/blog/using-values-to-map-values-in/</link>
      <pubDate>Mon, 01 Jul 2013 19:07:37 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-values-to-map-values-in/</guid>
      
      
<description><div>The VALUES keyword: even better than I thought.</div><div>&lt;p&gt;&lt;em&gt;Note: Ebook versions of the &amp;ldquo;raw, unedited&amp;rdquo; version of the &lt;a href=&#34;https://www.bobdc.com/blog/coming-soon-new-expanded-editi&#34;&gt;new expanded edition&lt;/a&gt; of my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; are &lt;a href=&#34;http://shop.oreilly.com/product/0636920030829.do&#34;&gt;now available&lt;/a&gt; on O&amp;rsquo;Reilly&amp;rsquo;s website, and the cooked, edited version (not much different, really) should be available in all formats within a few days. While this edition adds coverage of the VALUES keyword, I came up with the example below too late to include it.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote id=&#34;id111923&#34; class=&#34;pullquote&#34;&gt;&#34;if I could just define a little mapping table...&#34;&lt;/blockquote&gt;
&lt;p&gt;I recently had to map a few values within a SPARQL query. I didn&amp;rsquo;t want to do a heavily nested IF() function, and thought &amp;ldquo;if I could just define a little mapping table&amp;hellip;&amp;rdquo; and then realized that I can, using a new SPARQL 1.1 keyword that I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/sparql-11s-new-values-keyword&#34;&gt;already written about&lt;/a&gt;: VALUES.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say I want to output the names of the people in the following data, not with their associated airport codes, but with the names of those airports&amp;rsquo; cities.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .


d:i0432 dm:firstName &amp;quot;Richard&amp;quot; ;
        dm:airport &amp;quot;CHO&amp;quot; . 


d:i9771 dm:firstName &amp;quot;Cindy&amp;quot; ;
        dm:airport &amp;quot;RIC&amp;quot; . 


d:i8301 dm:firstName &amp;quot;Craig&amp;quot; ;
        dm:airport &amp;quot;LYH&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What I would really do is use triples associating airport codes with city names to drive the lookup, but storing the lookup information within the query gives some sense of how powerful the VALUES keyword can be for something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; 


SELECT ?first ?city
WHERE {


  ?person dm:firstName ?first ;
          dm:airport ?airport . 


  VALUES (?airport ?city) {
    ( &amp;quot;CHO&amp;quot; &amp;quot;Charlottesville&amp;quot; )
    ( &amp;quot;RIC&amp;quot; &amp;quot;Richmond&amp;quot; )
    ( &amp;quot;LYH&amp;quot; &amp;quot;Lynchburg&amp;quot; )
  }


}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running that query on the data above produces this result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------
| first     | city              |
=================================
| &amp;quot;Richard&amp;quot; | &amp;quot;Charlottesville&amp;quot; |
| &amp;quot;Cindy&amp;quot;   | &amp;quot;Richmond&amp;quot;        |
| &amp;quot;Craig&amp;quot;   | &amp;quot;Lynchburg&amp;quot;       |
---------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each row in my VALUES table had only two values, and they were both strings, but you can have any number you like—with any types, including URIs. This makes for a lot of possibilities.&lt;/p&gt;
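&lt;p&gt;For example, a row can mix literal and URI values, and SPARQL 1.1&amp;rsquo;s UNDEF keyword leaves a position unbound (the example.org URI below is made up for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;VALUES (?airport ?city ?cityResource) {
  ( &amp;quot;CHO&amp;quot; &amp;quot;Charlottesville&amp;quot; &amp;lt;http://example.org/place/Charlottesville&amp;gt; )
  ( &amp;quot;RIC&amp;quot; &amp;quot;Richmond&amp;quot;        UNDEF )
}
&lt;/code&gt;&lt;/pre&gt;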
&lt;p&gt;So, the next time you&amp;rsquo;re thinking of adding a heavily nested IF() function to your SPARQL query to account for several possible values of something, consider using SPARQL&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/TR/2012/WD-sparql11-query-20120724/&#34;&gt;newest&lt;/a&gt; keyword. A VALUES table is easier to create, use, and read.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Coming soon: new, expanded edition of &#34;Learning SPARQL&#34;</title>
      <link>https://www.bobdc.com/blog/coming-soon-new-expanded-editi/</link>
      <pubDate>Sun, 02 Jun 2013 19:44:41 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/coming-soon-new-expanded-editi/</guid>
      
      
      <description><div>55% more pages! 23% fewer mentions of the semantic web!</div><div>&lt;img id=&#34;id110795&#34; src=&#34;https://www.bobdc.com/img/main/ls2nded.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;cover of Learning SPARQL&#39;s 2nd ed.&#34;/&gt;
&lt;p&gt;I&amp;rsquo;m very pleased to announce that O&amp;rsquo;Reilly will make the second, expanded edition of my book &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; available sometime in late June or early July. The &lt;a href=&#34;http://shop.oreilly.com/product/0636920030829.do&#34;&gt;early release&lt;/a&gt; &amp;ldquo;raw and unedited&amp;rdquo; version should be available this week.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve updated the book to account for the final version of the SPARQL 1.1 specs, but the main additions are four new chapters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Query Efficiency and Debugging&lt;/em&gt;: Things to keep in mind that can help your queries run more efficiently as you work with growing volumes of data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Working with SPARQL Query Result Formats&lt;/em&gt;: How your applications can take advantage of the XML, JSON, CSV, and TSV formats defined by the W3C for SPARQL processors to return query results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;RDF Schema, OWL, and Inferencing&lt;/em&gt;: How SPARQL can take advantage of the metadata that RDF Schemas, OWL ontologies, and SPARQL rules can add to your data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;A SPARQL Cookbook&lt;/em&gt;: A set of SPARQL queries and update requests that can be useful in a wide variety of situations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;ve also expanded the Application Development chapter quite a bit.&lt;/p&gt;
&lt;p&gt;Preliminary reviewers have especially liked the cookbook chapter, and I learned a great deal researching, writing, and having the query efficiency chapter tech reviewed. I&amp;rsquo;m eager for others to see all the new chapters.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also made some &lt;a href=&#34;http://oreilly.com/catalog/errata.csp?isbn=0636920020547&#34;&gt;corrections&lt;/a&gt;, improved the index, and many passive sentences were converted to the active voice (or rather, I converted many passive sentences&amp;hellip;).&lt;/p&gt;
&lt;p&gt;Having a lot more to it, the new edition will cost a little more, but if you bought an electronic version of the first edition, you can get the second edition in the same format for 40% off. At this week&amp;rsquo;s semtech conference in San Francisco, I&amp;rsquo;ll have some &lt;a href=&#34;http://us.moo.com/&#34;&gt;moo cards&lt;/a&gt; that give you 40% off the printed book or 50% off the ebook, so if you see me just ask for one.&lt;/p&gt;
&lt;p&gt;I joke about the book&amp;rsquo;s 23% reduction in mentions of the semantic web (and incremental reduction in mentions of &amp;ldquo;linked data&amp;rdquo;), as contrasted with the page count going up 53%, because of my &lt;a href=&#34;https://www.bobdc.com/blog/selling-rdf-technology-to-big&#34;&gt;recent belief&lt;/a&gt; that SPARQL and other RDF-related technologies can be sold on their own merits instead of being sold as the implementation of a vision that people must first buy into. Let people select the technology that they feel is best—even if it has a strange name like Hadoop or MongoDB or SPARQL—to implement the visionary buzzphrase that is getting their project funded, whether it&amp;rsquo;s &amp;ldquo;Big Data&amp;rdquo; or &amp;ldquo;Semantic Web&amp;rdquo; or whatever new buzzphrase will be hot two years from now and first noticed by Gartner two years after that. I think SPARQL and the associated standards have a huge amount to offer all of these new visions of ways to do more with data and metadata.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>A nineteenth-century linking application</title>
      <link>https://www.bobdc.com/blog/a-nineteenth-century-linking-a/</link>
      <pubDate>Wed, 01 May 2013 08:38:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-nineteenth-century-linking-a/</guid>
      
      
      <description><div>An encore presentation.</div><div>&lt;p&gt; 
&lt;a href=&#39;https://www.bobdc.com/img/main/AlabamaShepards.jpg&#39;&gt;&lt;img id=&#34;id115156&#34; src=&#34;https://www.bobdc.com/img/main/AlabamaShepards.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Alabama Shepards&#34; width=&#34;200&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;From early 2003 to late 2005 I wrote a blog on oreillynet.com that I called &lt;a href=&#34;http://www.snee.com/xml/tal.html&#34;&gt;Thinking About Linking&lt;/a&gt;. The &lt;a href=&#34;http://www.oreillynet.com/xml/blog/2005/11/beyond_linking.html&#34;&gt;last entry&lt;/a&gt; summarizes what I covered and my experiences with that blog, but today I wanted to republish my favorite entry from that blog on the tenth anniversary of its original publication. It&amp;rsquo;s the same as the 2003 version except that I updated one link. On the right: a page from Shepard&amp;rsquo;s 1902 &amp;ldquo;Shepard&amp;rsquo;s Alabama Citations,&amp;rdquo; which I bought on ebay. (My comment below about link typing would certainly need updating now, given my experience with RDF.)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Frank Shepard was a salesman for a Chicago legal publisher. Shortly after the American Civil War, he noticed that when one court case overruled, criticized, or otherwise cited another, lawyers often jotted a note about it in the margin of the reporter volume with the cited case&amp;rsquo;s text. For example, upon learning that the judge in the case known as “La Bourgogne” (210 U.S. 95) made a negative reference to the “Moore v. American Transportation Company” (65 U.S. 1) case, a lawyer might turn to page 1 in volume 65 of the U.S. Supreme Court case reporter and write “210 U.S. 95, negative” in the margin next to the Moore case. This way, if the Moore case ever came up in court, the lawyer would have a better idea of its exact value.&lt;/p&gt;
&lt;p&gt;Shepard had an idea: if he printed gummed labels for each case listing the cases that cited it, he could save the lawyers the trouble of writing in these references by hand. He built a business out of selling these inter-case links to the legal profession and named the company after himself: &lt;a href=&#34;http://www.lexisnexis.com/shepards/&#34;&gt;Shepard&amp;rsquo;s&lt;/a&gt;. (Full disclosure: since Reed Elsevier acquired Shepard&amp;rsquo;s in the mid-1990s, Shepard&amp;rsquo;s Citations has been a product of my employer, LexisNexis. Other than some occasional XSLT advice to the folks in Colorado Springs, where Shepard&amp;rsquo;s has been based since 1947, I don&amp;rsquo;t do any work on that particular product.) In one sense, the stickers they produced in 1873 were already more sophisticated than web links, because if more than one case had cited the same case, the sticker for that case added a one-to-many link to it.&lt;/p&gt;
&lt;p&gt;To help the lawyers quickly learn why one case had been cited by another, Shepard&amp;rsquo;s started including &lt;a href=&#34;http://www.informationliteracy.org/builder/view/1344/16060&#34;&gt;one-letter codes&lt;/a&gt; to show that the citing case had overruled, criticized, modified, or applied some other treatment to the cited case. Now their links had link types: indications about the nature of the links to give a clue about why they might be worth traversing.&lt;/p&gt;
&lt;p&gt;The stickers, or “Adhesive Annotations,” became very popular. While sitting on the Massachusetts Supreme Judicial Court, future United States Supreme Court Justice Oliver Wendell Holmes Jr. wrote “I regard Shepard&amp;rsquo;s Massachusetts Annotations as the most thorough labor-saving device that has ever been brought to my attention. No one owning a set of reports can afford to be without one.”&lt;/p&gt;
&lt;p&gt;Before the nineteenth century came to a close, the company began producing alternatives to the sticker collections: bound books that listed, for each case, the cases that cited it and codes describing the citing case&amp;rsquo;s treatment. Today, we call this separation of the links from the linked resources “out-of-line links.”&lt;/p&gt;
&lt;p&gt;The books became so popular that their inventor&amp;rsquo;s last name became a verb. Any lawyer or law student knows that to &lt;a href=&#34;http://www.lectlaw.com/files/lwr17.htm&#34;&gt;Shepardize&lt;/a&gt; a case is to find out all relevant cases that cite it. Of course, automating the storage and lookup of these links is much easier with software, and it&amp;rsquo;s all online now. When you view a case using LexisNexis, clicking the “Shepardize” link displays a list of citing cases with links to the full text of those cases. This saves a lot of running around a law library, which was how the links were followed for the first century of their existence. (LexisNexis&amp;rsquo;s chief competitor, WestLaw, has a competing on-line product called KeyCite.)&lt;/p&gt;
&lt;p&gt;The success of Frank Shepard&amp;rsquo;s invention tells us several things about linking:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Link typing can add real value to a linking application.&lt;/strong&gt; If a lawyer who&amp;rsquo;s going to bring up a case in court Shepardizes it and sees only codes for positive treatment, there&amp;rsquo;s little need to look up the citing cases. If other cases criticized the case to be cited, however, it&amp;rsquo;s his job to find out why. (Too bad it&amp;rsquo;s &lt;a href=&#34;http://www.oreillynet.com/pub/wlg/3094&#34;&gt;so difficult&lt;/a&gt; to find other examples of link typing adding obvious value!)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Out-of-line links can sometimes be more useful than in-line links.&lt;/strong&gt; The web and other hypertext systems leading up to it have conditioned many to think of a link as something that connects the resource they&amp;rsquo;re looking at to a single other resource somewhere else, but links can be more than that. Shepard&amp;rsquo;s customers found that having all the citation links in a single set of books instead of as a set of stickers to be spread around hundreds of volumes can make the research go much more quickly, especially with the treatment codes added to the link identifiers to give clues about whether the links are worth traversing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;It&amp;rsquo;s not about the technology, but about the information.&lt;/strong&gt; Just as a well-written song can work well when performed by different bands, a good linking application can still have value when implemented using different technologies.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/legal-publishing">legal publishing</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>Appreciating SPARQL property paths more</title>
      <link>https://www.bobdc.com/blog/appreciating-sparql-property-p/</link>
      <pubDate>Wed, 17 Apr 2013 08:46:59 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/appreciating-sparql-property-p/</guid>
      
      
      <description><div>More and more useful.</div><div>&lt;blockquote id=&#34;id140291&#34; class=&#34;pullquote&#34;&gt;I had been thinking of property paths as something that could slow down queries, and Paul&#39;s experience was that the property path version was more efficient.&lt;/blockquote&gt;
&lt;p&gt;I have &lt;a href=&#34;https://www.bobdc.com/blog/playing-more-with-sparql-11-pr&#34;&gt;played with&lt;/a&gt; SPARQL 1.1&amp;rsquo;s new property paths features and described them in &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;my book&lt;/a&gt;, and I&amp;rsquo;ve felt that I understood them for a while, but two recent occasions have helped me to appreciate them even more.&lt;/p&gt;
&lt;p&gt;First, to prepare for the talk I&amp;rsquo;m giving at the Semantic Technology &amp;amp; Business Conference on &lt;a href=&#34;http://semtechbizsf2013.semanticweb.com/sessionPop.cfm?confid=70&amp;amp;proposalid=5096&#34;&gt;Enhancing Searches with Semantic Technology&lt;/a&gt;, at one point my demo app needed to find a SKOS concept that has either a skos:prefLabel or a skos:hiddenLabel value of a particular string. At first I thought I&amp;rsquo;d need a UNION query, like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
SELECT ?c
WHERE {
 ?c a skos:Concept .
 {?c skos:prefLabel &amp;quot;motrin&amp;quot;@en }
 UNION
 {?c  skos:hiddenLabel &amp;quot;motrin&amp;quot;@en }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;but then I realized that the alternative path operator could make it much terser: just two triple patterns in the query, with the second one&amp;rsquo;s predicate expression essentially saying &amp;ldquo;a predicate of &lt;code&gt;skos:prefLabel&lt;/code&gt; or of &lt;code&gt;skos:hiddenLabel&lt;/code&gt;&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
SELECT  ?c
WHERE {
 ?c a skos:Concept .
 ?c skos:prefLabel|skos:hiddenLabel &amp;quot;motrin&amp;quot;@en . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second occasion for appreciating property paths more was reading the recent Paul Groth blog posting &lt;a href=&#34;http://thinklinks.wordpress.com/2013/04/03/5-heuristics-for-writing-better-sparql-queries/&#34;&gt;5 heuristics for writing better SPARQL queries&lt;/a&gt;, which recommended that we &amp;ldquo;use property paths to replace connected triple patterns where the object of one triple pattern is the subject of another.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d seen examples of the XPath-like property paths, like the &lt;code&gt;foaf:knows/foaf:name&lt;/code&gt; one in the &lt;a href=&#34;http://www.w3.org/TR/sparql11-query/#propertypath-examples&#34;&gt;SPARQL 1.1 Query Recommendation&lt;/a&gt;, but I hadn&amp;rsquo;t realized their value for replacing triple patterns where the object of one triple pattern is the subject of another that has a different predicate, and I&amp;rsquo;ve written a lot of those. For example, to find the four-step connection between &lt;code&gt;d:a&lt;/code&gt; and &lt;code&gt;d:e&lt;/code&gt; in the following,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .
@prefix dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt; .


d:a dm:prop1 d:b . 
d:b dm:prop2 d:c . 
d:c dm:prop3 d:d . 
d:d dm:prop4 d:e . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I would have written a SPARQL graph pattern that looked pretty much like the four triples that you see there, but with variables substituted for &lt;code&gt;d:b&lt;/code&gt;, &lt;code&gt;d:c&lt;/code&gt;, and &lt;code&gt;d:d&lt;/code&gt;. Paul&amp;rsquo;s blog entry made me realize that I could simply write this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?s ?o
WHERE
{ ?s dm:prop1/dm:prop2/dm:prop3/dm:prop4 ?o }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What makes this interesting is that I had been thinking of property paths as something that could slow down queries, and Paul&amp;rsquo;s experience was that the property path version was more efficient. Of course, I was generalizing too much—the property path &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;+&lt;/code&gt; operators, while very handy, essentially say &amp;ldquo;and then keep looking for more,&amp;rdquo; which can really increase the search space and execution time. I suppose I was also still hearing the ringing in my ears of the alarm sounded by the paper &lt;a href=&#34;http://users.dcc.uchile.cl/~jperez/papers/www2012.pdf&#34;&gt;Counting Beyond a Yottabyte, or how SPARQL 1.1 Property Paths will Prevent Adoption of the Standard&lt;/a&gt; (pdf), but that too was focusing on a subset of property paths options unrelated to the path format that Paul was discussing. (After the release of that paper and before SPARQL 1.1&amp;rsquo;s ascent to Recommendation status, the SPARQL Working Group &lt;a href=&#34;http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Apr/0003.html&#34;&gt;did make adjustments&lt;/a&gt; to certain property path features to address the paper&amp;rsquo;s concerns.)&lt;/p&gt;
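&lt;p&gt;With the little dataset above, for example, this one-triple-pattern query uses &lt;code&gt;+&lt;/code&gt; over an alternative path to ask for everything reachable from &lt;code&gt;d:a&lt;/code&gt; in one or more steps, which should bind &lt;code&gt;?o&lt;/code&gt; to &lt;code&gt;d:b&lt;/code&gt;, &lt;code&gt;d:c&lt;/code&gt;, &lt;code&gt;d:d&lt;/code&gt;, and &lt;code&gt;d:e&lt;/code&gt;—the processor keeps traversing until it finds no more matches:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt;

SELECT ?o
WHERE { d:a (dm:prop1|dm:prop2|dm:prop3|dm:prop4)+ ?o }
&lt;/code&gt;&lt;/pre&gt;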
&lt;p&gt;In my formerly extensive use of XSLT, I never got to the point where I couldn&amp;rsquo;t stand being limited to XSLT 1.0, even though 2.0 became a Recommendation in 2007. (I know that Jeni Tennison got to that point &lt;a href=&#34;http://www.jenitennison.com/blog/node/57&#34;&gt;about 2007&lt;/a&gt;, if not earlier.) Now that it&amp;rsquo;s been almost four weeks since the SPARQL 1.1 specs became Recommendations, I already have a difficult time being limited to SPARQL 1.0, which is still the case with some endpoints; there&amp;rsquo;s just so much great stuff in 1.1.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
    </item>
    
    <item>
      <title>In publishing? Listen to WFMU&#39;s &#34;Radio Free Culture&#34; podcast</title>
      <link>https://www.bobdc.com/blog/in-publishing-listen-to-wfmus/</link>
      <pubDate>Thu, 21 Mar 2013 09:04:05 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/in-publishing-listen-to-wfmus/</guid>
      
      
      <description><div>A new radio show (and podcast) has some great observations about the future of content creation and distribution.</div><div>&lt;img id=&#34;id159940&#34; src=&#34;https://www.bobdc.com/img/main/WFMU-logo-blog.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34; width=&#34;200&#34;/&gt;
&lt;p&gt;People who listen to Jersey City freeform radio station &lt;a href=&#34;http://wfmu.org/&#34;&gt;WFMU&lt;/a&gt; tend to be a bit fanatical about it. The &lt;a href=&#34;http://en.wikipedia.org/wiki/WFMU&#34;&gt;Wikipedia page&lt;/a&gt; on the station quotes the New York Times referring to them as &amp;ldquo;a station whose name has become like a secret handshake among a certain tastemaking cognoscenti.&amp;rdquo; It&amp;rsquo;s not only because of the range of their musical eclecticism, which is an easy game for college and other non-profit radio stations to play; the depth of their commitment and their role in the music and art scenes of New York City and beyond has been impressive for over forty years.&lt;/p&gt;
&lt;p&gt;They have a new show called &lt;a href=&#34;http://wfmu.org/playlists/FC&#34;&gt;Radio Free Culture&lt;/a&gt; which is also available as a &lt;a href=&#34;http://wfmu.org/podcast/FC.xml&#34;&gt;podcast&lt;/a&gt;. Different FMU hosts from different shows take turns with this one, so it&amp;rsquo;s apparently on at different, unpredictable times, but the MP3s of past shows are all sitting there waiting for you. Some of the hosts are better than others, but as with the rest of the station, the unpredictability is part of the fun.&lt;/p&gt;
&lt;p&gt;The discussions are often about music, but not exclusively so, and still—the music industry has already been through stages that the movie and &amp;ldquo;print&amp;rdquo; publishing industries are only now sliding into, and distribution of files of content is distribution of files of content. Roles in creating and publicizing that content, the potential value of redistributors (for example, record companies or publishers), and especially issues about who pays who for what, at which stage of creation or distribution, are topics that the podcast returns to regularly.&lt;/p&gt;
&lt;p&gt;The most recent show, on &lt;a href=&#34;http://wfmu.org/playlists/shows/49593&#34;&gt;February 25th&lt;/a&gt;, interviewed several people involved with the Future of Music Coalition&amp;rsquo;s &lt;a href=&#34;http://money.futureofmusic.org/&#34;&gt;Artist Revenue Streams&lt;/a&gt; project, which gathered data on how musicians make money today, with statistics about how 25 categories of musicians make money from 48 potential income streams. Of course, the number and relative importance of the different categories and streams have evolved over time, and the reaction to the FMC&amp;rsquo;s work shows that there hasn&amp;rsquo;t been much serious data gathering in this area before: a lot of organizations are very interested in using their work.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://wfmu.org/playlists/shows/48835&#34;&gt;December 31st&lt;/a&gt; show has a fascinating interview with MIT PhD candidate &lt;a href=&#34;http://mako.cc/&#34;&gt;Benjamin Mako Hill&lt;/a&gt; about the implications for our culture of the fact that the song &amp;ldquo;Happy Birthday to You&amp;rdquo; is copyrighted. Did you know that if a group of people sitting around a table in a restaurant or a bunch of kids in a summer camp sing this song without getting permission first, they are technically violating U.S. copyright laws? Warner Music Group collects literally millions of dollars every year from higher-profile performances of the song. (The second half of this particular show is people calling in to talk about their worst birthday ever, and I didn&amp;rsquo;t make it all the way through that.)&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://wfmu.org/playlists/shows/48729&#34;&gt;December 24th&lt;/a&gt; show features talks from the FMU-sponsored Radiovision festival about &amp;ldquo;piracy&amp;rdquo; in its many meanings, with a particularly good talk by Anna Troberg, who once fought against content bootlegging but eventually became the leader of Sweden&amp;rsquo;s quite successful &lt;a href=&#34;http://en.wikipedia.org/wiki/Pirate_Party_(Sweden)&#34;&gt;Pirate political party&lt;/a&gt; after getting to understand their values better.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://wfmu.org/playlists/shows/47985&#34;&gt;October 29th&lt;/a&gt; discussion about live streaming public protests with independent journalist and video broadcaster Tim Pool, as well as several other more music-focused shows, often return to the issues of how new technology makes it easier to create and distribute content, but how larger infrastructures are necessary to build an audience for that content—infrastructures once only provided by traditional publishing companies but now involving social media networks as well. Of course, the roles, relationships, and relative need for publishers and social media networks are further fuel for discussion.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve known people involved in diverse aspects of many kinds of publishing, and I think that WFMU&amp;rsquo;s &amp;ldquo;Radio Free Culture&amp;rdquo; can teach all of us a lot about the range, direction, and magnitude of many of the current forces affecting how people create, distribute, pay for, and get paid for content now and in the future. (Further discussions of these topics as they relate to specific episodes are available at the show&amp;rsquo;s &lt;a href=&#34;http://freemusicarchive.org/tag/radio_free_culture/&#34;&gt;Free Music Archive&lt;/a&gt; page.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>&#34;RDF and SPARQL&#34; article published in &#34;Big Data&#34; journal</title>
      <link>https://www.bobdc.com/blog/rdf-and-sparql-article-publish/</link>
      <pubDate>Fri, 15 Feb 2013 08:53:45 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdf-and-sparql-article-publish/</guid>
      
      
      <description><div>Or: RDF, SPARQL, and Big Data, part 3.</div><div>&lt;p&gt;&lt;a href=&#34;http://online.liebertpub.com/toc/big/1/1&#34;&gt;&lt;img id=&#34;id118426&#34; src=&#34;https://www.bobdc.com/img/main/bigdatacover.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[Big Data cover]&#34; width=&#34;120&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A few months ago here I wrote &lt;a href=&#34;https://www.bobdc.com/blog/sparql-and-big-data-and-nosql&#34;&gt;SPARQL and Big Data (and NoSQL): How to pursue the common ground?&lt;/a&gt; followed by &lt;a href=&#34;http://www.snee.com/bobdc.blog/2012/11/selling-rdf-technology-to-big.html&#34;&gt;Selling RDF technology to Big Data&lt;/a&gt;, where I put forth some theories about how to describe the value of RDF technology to people who may or may not have heard of it but were clearly interested in the hot buzz phrase &amp;ldquo;Big Data.&amp;rdquo; To practice what I preached—or perhaps to just preach to a new audience—I submitted an article to the new academic journal &lt;a href=&#34;http://www.liebertpub.com/overview/big-data/611/&#34;&gt;Big Data&lt;/a&gt;. They&amp;rsquo;ve just published their &lt;a href=&#34;http://online.liebertpub.com/toc/big/1/1&#34;&gt;first issue&lt;/a&gt;, which includes my article &amp;ldquo;What Do RDF and SPARQL bring to Big Data Projects?&amp;rdquo; (&lt;a href=&#34;http://online.liebertpub.com/doi/pdfplus/10.1089/big.2012.0004&#34;&gt;pdf&lt;/a&gt;). The same issue also has an interesting article by journal editor Edd Dumbill and the first of what will be a regular column by Jim Hendler; I haven&amp;rsquo;t checked out the other articles yet.&lt;/p&gt;
&lt;p&gt;My article provides a basic introduction to RDF and SPARQL, playing up the RDF support by IBM&amp;rsquo;s DB2, Oracle&amp;rsquo;s Spatial product, and Cray&amp;rsquo;s uRiKA, because these companies are well-known brand names in the world of large-scale data processing. And, except for a reference to the Linked Open Data Cloud and the possibility of private linked data clouds, there is no mention of Linked Data and no use of the phrase &amp;ldquo;semantic web,&amp;rdquo; in keeping with my ideas described in &amp;ldquo;Selling RDF technology to Big Data.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The paperwork I filled out on the way to having the article published led me to believe that this would be one of those expensive, tightly-controlled academic journals, but it looks like it&amp;rsquo;s being published with a Creative Commons CC-BY license, which was great to see. I look forward to their future issues.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Finding Europeana audio with SPARQL</title>
      <link>https://www.bobdc.com/blog/finding-europeana-audio-with-s/</link>
      <pubDate>Sun, 13 Jan 2013 11:06:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/finding-europeana-audio-with-s/</guid>
      
      
      <description><div>And video!</div><div>&lt;blockquote id=&#34;id115175&#34; class=&#34;pullquote&#34;&gt;As a SPARQL geek&#39;s alternative to YouTube, the 166,872 video resources with an &lt;code&gt;edm:type&lt;/code&gt; value of &#34;VIDEO&#34; look like a tempting way to kill some time.&lt;/blockquote&gt;
&lt;p&gt;When I first heard about the &lt;a href=&#34;http://europeana.ontotext.com/&#34;&gt;SPARQL endpoint&lt;/a&gt; for the &lt;a href=&#34;http://www.europeana.eu/portal/&#34;&gt;Europeana&lt;/a&gt; aggregation of data about European cultural artifacts, the first example I heard about was an MP3 audio file of a &lt;a href=&#34;http://www.europeana.eu/portal/record/92056/BD9D5C6C6B02248F187238E9D7CC09EAF17BEA59.html&#34;&gt;Slovenian version of O sole mio&lt;/a&gt;. I happened to be in the middle of packing for a family visit over Christmas and immediately &lt;a href=&#34;https://twitter.com/bobdc/status/282920371085271040&#34;&gt;tweeted&lt;/a&gt; &amp;ldquo;Lots of holiday stuff to do, but the new Ontotext Europeana SPARQL endpoint points to MP3s! So tempting&amp;hellip;&amp;rdquo; This past Sunday morning I finally made some time to explore it more, and I found 6,219 audio files.&lt;/p&gt;
&lt;p&gt;The following query pulls down data about 100 of them (which 100 you pull depends on the OFFSET value), and &lt;a href=&#34;http://snee.com/bobdc.blog/files/EuropeanaSPARQL2HTML.xsl&#34;&gt;this XSLT stylesheet&lt;/a&gt; converts a SPARQL XML query result version of the results to a simple HTML file that shows the title, creator, and source of each one, with the title being a hypertext link to the audio file itself. Following some of these links, I found folk music, classical music, interviews, and plenty of Finnish spoken word material where I had no idea what they were saying.&lt;/p&gt;
&lt;p&gt;Here is the query itself:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX edm: &amp;lt;http://www.europeana.eu/schemas/edm/&amp;gt;
PREFIX ore: &amp;lt;http://www.openarchives.org/ore/terms/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; 


SELECT ?title ?mediaURL ?creator ?source WHERE {
  ?resource edm:type &amp;quot;SOUND&amp;quot; ;
            ore:proxyIn ?proxy ;
            dc:title ?title ;
            dc:creator ?creator ;
            dc:source ?source . 
  ?proxy edm:isShownBy ?mediaURL . 
 }
OFFSET 600
LIMIT 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&#34;http://europeana.ontotext.com/sparql?query=PREFIX+edm%3A+%3Chttp%3A%2F%2Fwww.europeana.eu%2Fschemas%2Fedm%2F%3E%0D%0APREFIX+ore%3A+%3Chttp%3A%2F%2Fwww.openarchives.org%2Fore%2Fterms%2F%3E%0D%0APREFIX+dc%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E+%0D%0A%0D%0ASELECT+%3Ftitle+%3FmediaURL+%3Fcreator+%3Fsource+WHERE+%7B%0D%0A++%3Fresource+edm%3Atype+%22SOUND%22+%3B%0D%0A++++++++++++ore%3AproxyIn+%3Fproxy+%3B%0D%0A++++++++++++dc%3Atitle+%3Ftitle+%3B%0D%0A++++++++++++dc%3Acreator+%3Fcreator+%3B%0D%0A++++++++++++dc%3Asource+%3Fsource+.+%0D%0A++%3Fproxy+edm%3AisShownBy+%3FmediaURL+.+%0D%0A+%7D%0D%0AOFFSET+3000%0D%0ALIMIT+100&amp;amp;_implicit=false&amp;amp;implicit=true&amp;amp;_equivalent=false&amp;amp;equivalent=true&amp;amp;_form=%2Fsparql&#34;&gt;This link&lt;/a&gt; runs the query with an offset of 3000, and &lt;a href=&#34;http://snee.com/bobdc.blog/files/europeana.html&#34;&gt;this web page&lt;/a&gt; shows the result of running the stylesheet on the query results when run with an offset of 600 as above. As you&amp;rsquo;ll see and hear by following that page&amp;rsquo;s links, that batch seems to be mostly Norwegian folk music.&lt;/p&gt;
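&lt;p&gt;If XSLT isn&amp;rsquo;t handy, the stylesheet&amp;rsquo;s job—turning the SPARQL Query Results XML Format into a linked HTML list—can be sketched with Python&amp;rsquo;s standard library. This is my own rough illustration, not the stylesheet itself, and the sample result document here is made up:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
SRX = "{http://www.w3.org/2005/sparql-results#}"

def results_to_html(results_xml):
    """Render each result row as a list item whose title links to the media URL."""
    root = ET.fromstring(results_xml)
    items = []
    for result in root.iter(SRX + "result"):
        # Map each variable name to the text of its uri/literal child.
        row = {b.get("name"): b[0].text
               for b in result.iter(SRX + "binding")}
        items.append('<li><a href="%s">%s</a> (%s)</li>'
                     % (row["mediaURL"], row["title"], row.get("creator", "?")))
    return "<ul>\n%s\n</ul>" % "\n".join(items)

# A tiny hand-written sample in the same format the endpoint returns.
sample = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <results>
    <result>
      <binding name="title"><literal>O sole mio</literal></binding>
      <binding name="creator"><literal>Unknown</literal></binding>
      <binding name="mediaURL"><uri>http://example.org/osolemio.mp3</uri></binding>
    </result>
  </results>
</sparql>"""

print(results_to_html(sample))
```

&lt;p&gt;Each &lt;code&gt;binding&lt;/code&gt; element&amp;rsquo;s name attribute matches a variable from the query&amp;rsquo;s SELECT list, so the same approach extends to whatever other variables you ask the endpoint for.&lt;/p&gt;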
&lt;p&gt;A few notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;As I mentioned in the tweet, it&amp;rsquo;s running Ontotext&amp;rsquo;s OWLIM triplestore. This made it the first large public endpoint that I&amp;rsquo;ve seen with SPARQL 1.1 support, which was great to see. I didn&amp;rsquo;t need any 1.1 features for the query above, but did for others on my way there—for example, to find out that there were 6,219 audio files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;About half of the audio URLs had &amp;ldquo;mp3&amp;rdquo; at the end. When I tried some of the audio URLs that didn&amp;rsquo;t, they seemed to play audio just fine, but there may be some that don&amp;rsquo;t link to playable audio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The proxy parts of the query deal with a level of indirection that was necessary because the site federates data from other sites. &lt;a href=&#34;http://pro.europeana.eu/edm-documentation&#34;&gt;Documentation&lt;/a&gt; of the data model is available (well, it isn&amp;rsquo;t as of the morning of January 13th, but Google has a cached copy), but I got to the query above by various hit-and-miss experiments starting with one that looked for resources whose names ended with &amp;ldquo;.mp3&amp;rdquo;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;http://europeana.ontotext.com/sparql&#34;&gt;web-based front end&lt;/a&gt; to the Europeana SPARQL endpoint did some nice parentheses matching and color-coding of syntax as I entered queries. It doesn&amp;rsquo;t compare with TopBraid Composer&amp;rsquo;s SPARQL view, which has command completion and other IDE-oriented features, but it was impressive for a field on a web form.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is plenty more metadata available in addition to the title, creator, and source that my query requests for each resource; I encourage you to try variations on the query to explore it. Other possible &lt;code&gt;edm:type&lt;/code&gt; values are TEXT, IMAGE, VIDEO, and 3D. (The two 3D resources were a 70-meg two-page PDF and a 59-meg eight-page one, each showing a church in Cyprus. Viewed with Adobe Reader, some of the images could be rotated, I think.)&lt;/p&gt;
&lt;p&gt;As a SPARQL geek&amp;rsquo;s alternative to YouTube, the 166,872 resources with an &lt;code&gt;edm:type&lt;/code&gt; value of &amp;ldquo;VIDEO&amp;rdquo; are a tempting way to kill some time. Just substitute &amp;ldquo;VIDEO&amp;rdquo; for &amp;ldquo;SOUND&amp;rdquo; in the query above and you&amp;rsquo;ll be off and running. (Don&amp;rsquo;t forget that LIMIT keyword, though—be polite and don&amp;rsquo;t ask for too much at once.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2013">2013</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Normalizing company names with SPARQL and DBpedia</title>
      <link>https://www.bobdc.com/blog/normalizing-company-names-with/</link>
      <pubDate>Wed, 05 Dec 2012 07:46:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/normalizing-company-names-with/</guid>
      
      
      <description><div>Wikipedia page redirection data, waiting for you to query it.</div><div>&lt;p&gt;&lt;a href=&#34;http://ww2.odu.edu/ao/news/index.php?todo=details&amp;amp;id=23751&#34;&gt;&lt;img id=&#34;id143548&#34; src=&#34;https://www.bobdc.com/img/main/bigblue.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[ODU mascot]&#34; width=&#34;160&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you send your browser to &lt;a href=&#34;http://en.wikipedia.org/wiki/Big_Blue&#34;&gt;http://en.wikipedia.org/wiki/Big_Blue&lt;/a&gt;, you&amp;rsquo;ll end up at IBM&amp;rsquo;s page, because Wikipedia knows that this nickname usually refers to this company. (&lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A%3Fs+%3Fp+%22Big+Blue%22%40en+.+%0D%0A%7D&#34;&gt;Apparently&lt;/a&gt;, it&amp;rsquo;s also a nickname for several high schools and universities.) This data pointing from nicknames to official names is also stored in DBpedia, which means that we can use SPARQL queries to normalize company names. You can use the same technique to normalize other kinds of names—for example, trying to send your browser to &lt;a href=&#34;http://en.wikipedia.org/wiki/Bobby_Kennedy&#34;&gt;http://en.wikipedia.org/wiki/Bobby_Kennedy&lt;/a&gt; will actually send it to &lt;a href=&#34;http://en.wikipedia.org/wiki/Robert_F._Kennedy&#34;&gt;http://en.wikipedia.org/wiki/Robert_F._Kennedy&lt;/a&gt;—but a query that sticks to one domain will have a simpler job. &lt;a href=&#34;https://www.bobdc.com/blog/the-dl-in-owl-dl&#34;&gt;Description Logics&lt;/a&gt; and all that.&lt;/p&gt;
&lt;p&gt;The query below can be run with any SPARQL client that supports 1.1. I wanted it to cover these three cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run it with an unofficial company name such as Big Blue, Apple Computer, or Kodak, and it should return the official company name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run it with an official company name such as IBM, Apple, Inc., or Eastman Kodak, and it should return that name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run it with something that isn&amp;rsquo;t a company, such as Snee, and it shouldn&amp;rsquo;t return anything.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The query&amp;rsquo;s first BIND statement sets the name to check (including a language tag, because DBpedia is pretty consistent about using those) in the &lt;code&gt;?inputName&lt;/code&gt; variable, and the SERVICE keyword sends the part of the query enclosed in the braces that follow it off to DBpedia&amp;rsquo;s SPARQL endpoint.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX dbpo: &amp;lt;http://dbpedia.org/ontology/&amp;gt;
SELECT ?name 
WHERE {
  BIND(&amp;quot;Big Blue&amp;quot;@en AS ?inputName) 
  SERVICE &amp;lt;http://dbpedia.org/sparql&amp;gt; 
  {
    ?s rdfs:label ?inputName .
    {
      ?s dbpo:wikiPageRedirects ?actualResource .
      ?actualResource a dbpo:Company . 
      ?actualResource rdfs:label ?redirectsTo . 
      FILTER ( lang(?redirectsTo) = &amp;quot;en&amp;quot; )
    }
    UNION
    { ?s a dbpo:Company . }
  }
  BIND(STR(COALESCE(?redirectsTo,?inputName)) AS ?name)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After finding a resource (&lt;code&gt;?s&lt;/code&gt;) that has the bound value as an rdfs:label value, DBpedia returns the UNION of two graph patterns. The first checks whether this resource is supposed to redirect to another &lt;code&gt;dbpo:Company&lt;/code&gt; resource, and if so, stores the English &lt;code&gt;rdfs:label&lt;/code&gt; of that resource in the variable &lt;code&gt;?redirectsTo&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If that graph pattern doesn&amp;rsquo;t return anything because &lt;code&gt;?s&lt;/code&gt; doesn&amp;rsquo;t have a &lt;code&gt;dbpo:wikiPageRedirects&lt;/code&gt; property, but DBpedia does know that it&amp;rsquo;s a &lt;code&gt;dbpo:Company&lt;/code&gt;, the graph pattern after the UNION keyword will match.&lt;/p&gt;
&lt;p&gt;After DBpedia returns any bound variables, the local client uses the COALESCE function to bind &lt;code&gt;?redirectsTo&lt;/code&gt; to the &lt;code&gt;?name&lt;/code&gt; variable if &lt;code&gt;?redirectsTo&lt;/code&gt; got bound, and otherwise binds &lt;code&gt;?inputName&lt;/code&gt; to it. (Because COALESCE is a new SPARQL 1.1 feature and DBpedia doesn&amp;rsquo;t support any of 1.1 that I know of yet, this part has to be done locally.) If nothing got bound, then there was no such company listed in DBpedia.&lt;/p&gt;
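&lt;p&gt;To summarize the logic of the three cases, here is a toy Python version of the same redirect-then-coalesce idea. The dictionaries are hand-entered stand-ins for DBpedia&amp;rsquo;s redirect and company-type data (my own illustration, not the real triples):&lt;/p&gt;

```python
# Toy stand-ins for dbpo:wikiPageRedirects triples and for
# resources typed as dbpo:Company (sample entries only).
redirects = {"Big Blue": "IBM",
             "Apple Computer": "Apple Inc.",
             "Kodak": "Eastman Kodak"}
companies = {"IBM", "Apple Inc.", "Eastman Kodak"}

def normalize_company_name(name):
    """Mirror the query's UNION + COALESCE logic: follow a redirect
    to a company, accept an official company name as-is, and return
    None for anything that is neither."""
    if name in redirects and redirects[name] in companies:
        return redirects[name]   # the ?redirectsTo branch
    if name in companies:
        return name              # the branch after UNION
    return None                  # no such company found

print(normalize_company_name("Big Blue"))   # IBM
print(normalize_company_name("IBM"))        # IBM
print(normalize_company_name("Snee"))       # None
```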
&lt;p&gt;I tested this with both ARQ and TopBraid Composer. With TBC (including the free version), it was fun to put the whole query into a SPIN function that I called normalizeCompanyName, so that I could make calls such as normalizeCompanyName(&amp;ldquo;Kodak&amp;rdquo;) or normalizeCompanyName(&amp;ldquo;Apple, Inc.&amp;rdquo;) in the middle of other SPARQL queries.&lt;/p&gt;
&lt;p&gt;It took me a lot of tweaking to get the query above to work the way I wanted it to, and I wouldn&amp;rsquo;t be surprised if it can be improved. I&amp;rsquo;d love to hear any suggestions.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Selling RDF technology to Big Data</title>
      <link>https://www.bobdc.com/blog/selling-rdf-technology-to-big/</link>
      <pubDate>Mon, 12 Nov 2012 08:51:59 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/selling-rdf-technology-to-big/</guid>
      
      
      <description><div>A clue: what we&#39;re selling is just that—RDF technology.</div><div>&lt;p&gt;I think I&amp;rsquo;ve figured it out. (This is a follow-up to my previous post &lt;a href=&#34;https://www.bobdc.com/blog/sparql-and-big-data-and-nosql&#34;&gt;SPARQL and Big Data (and NoSQL): How to pursue the common ground?&lt;/a&gt;) Here&amp;rsquo;s how to sell the Semantic Web and Linked Data visions to the Big Data folk: don&amp;rsquo;t. Sell them on RDF technology.&lt;/p&gt;
&lt;blockquote id=&#34;id104242&#34; class=&#34;pullquote&#34;&gt;Instead of telling these people about the Semantic Web or Linked Data visions, we should show them how we have technology that fulfills the vision that&#39;s apparently captured their imaginations.&lt;/blockquote&gt;
&lt;p&gt;The process of selling a set of technologies usually means selling a vision, getting people psyched about that vision, and then telling them about the technology that implements that vision. For RDF technology (by which I mean RDF, SPARQL, and optionally, RDFS and OWL), the vision for many years was the Semantic Web. Some people in that community eventually decided that an easier vision to sell was Linked Data. (Linked Data may not always include RDF technology—when Tim Berners-Lee added &amp;ldquo;(RDF*, SPARQL)&amp;rdquo; to his &lt;a href=&#34;http://www.w3.org/DesignIssues/LinkedData.html&#34;&gt;list of Linked Data principles&lt;/a&gt;, it became the &lt;a href=&#34;http://en.wikipedia.org/wiki/Filioque&#34;&gt;filioque&lt;/a&gt; controversy of the Linked Data community—but the boundaries of this or other sets of technologies I&amp;rsquo;m discussing are not the issue here. The point is, it&amp;rsquo;s very common to use the Linked Data vision to sell people on the value of using URIs, triples, and SPARQL together.)&lt;/p&gt;
&lt;p&gt;Big Data is itself a vision. Note how it&amp;rsquo;s spelled in initial caps, like &amp;ldquo;Semantic Web&amp;rdquo; and &amp;ldquo;Linked Data,&amp;rdquo; and features prominently in sales pitches from large and small system vendors. The 166-page IBM educational/marketing PDF &amp;ldquo;Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data&amp;rdquo; (available &lt;a href=&#34;http://www-01.ibm.com/software/info/rte/bdig/bdwa-7-post.html&#34;&gt;here&lt;/a&gt; with registration) is mostly about the Big Data vision: the issues, the common use cases that can now be handled, and in general, the possibilities. Instead of trying to sell Big Data people on one or two of our overlapping visions, we should be showing them the connections between our technology and the vision that they&amp;rsquo;re already sold on.&lt;/p&gt;
&lt;p&gt;Hadoop and NoSQL are currently the technologies being used to implement this vision. Hadoop is a software framework for certain kinds of distributed applications; its &lt;a href=&#34;http://en.wikipedia.org/wiki/MapReduce&#34;&gt;MapReduce&lt;/a&gt; algorithm is also implemented by several of the NoSQL database managers. &amp;ldquo;NoSQL&amp;rdquo; is a blanket term for a family of database management technologies that were developed independently of each other with no particular standards or organization to coordinate between them (other than not being SQL), so a new addition to the family is not going to look like some odd appendage to a seamless whole. In the book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1934356921/bobducharmeA/&#34;&gt;Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement&lt;/a&gt; that I just read, the example database managers are PostgreSQL, Riak, HBase, MongoDB, CouchDB, Neo4J, and Redis. (While some people and organizations do work RDF into their NoSQL discussions, it&amp;rsquo;s not mentioned anywhere in this book.)&lt;/p&gt;
&lt;p&gt;Besides PostgreSQL, the other database managers covered by the book are all considered to be NoSQL systems. Each has techniques for addressing the Big Data vision, which Edd Dumbill, IBM, and many others summarize by discussing the three Vs: Volume, Velocity, and Variety. For reasons described in my last blog entry, RDF technology is excellent for addressing all of the Variety issues, and reading &amp;ldquo;Seven Databases&amp;rdquo; has further convinced me of this. Velocity, and to some extent Volume (more on this one below), are issues for a platform to address, not a set of standards, so for that you need to talk to an RDF-related platform vendor such as &lt;a href=&#34;http://www.topquadrant.com&#34;&gt;TopQuadrant&lt;/a&gt;. (We&amp;rsquo;d be happy to discuss your requirements.)&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a cliché in engineering-related sales that you have to focus on customer requirements. It&amp;rsquo;s also a sales cliché that talking about technical details will bore the people who write the checks. IBM and other such companies are putting big money into marketing Big Data solutions because they&amp;rsquo;ve found suit-wearing, check-writing managers who feel that their requirements line up with the promises of these solutions. Instead of telling these people about the Semantic Web or Linked Data visions, we should show them how we have (standardized!) technology that fulfills the vision that&amp;rsquo;s apparently captured their imaginations.&lt;/p&gt;
&lt;p&gt;Reading the &amp;ldquo;7 Databases&amp;rdquo; book, I realized that the &lt;a href=&#34;http://en.wikipedia.org/wiki/CAP_theorem&#34;&gt;CAP theorem&lt;/a&gt;, although based on some technical issues, is also part of the Big Data vision. If I understand it correctly, the basic idea is this: database administrators have always wanted Consistency, Availability, and Partition tolerance in their databases, but a distributed database can only do two of these well at a time. By deciding that you can work around subpar performance for one if you get great performance from the other two, new possibilities emerge—possibilities that wouldn&amp;rsquo;t have occurred to earlier generations of database administrators who strained to optimize all three. For example, if you give up the need to have all data on all nodes be consistent with all the data on the other nodes all the time (and you include steps to have it eventually become consistent, just not all the time), you can get increased availability (as long as one server is running, the database will return something) and partition tolerance (loss of communication between nodes won&amp;rsquo;t affect the system).&lt;/p&gt;
&lt;p&gt;One point I didn&amp;rsquo;t make in my last posting is that the ease with which you can distribute and aggregate RDF triples in any combination gives you a lot more flexibility in how you implement your own two-out-of-three CAP theorem tradeoff. This should make it easier to store triples using one of the distributed NoSQL platforms; I don&amp;rsquo;t know of any definitive steps in this direction yet, but as I said before, Google searches show bits of work here and there.&lt;/p&gt;
&lt;p&gt;This potential good fit of triples to the new possibilities opened up by the CAP theorem hadn&amp;rsquo;t occurred to me when I wrote my last blog entry, but by further study of the vision associated with the hot new data processing goals, I found another connection between that vision and the technology that we RDF types are offering. Which is just what we should be doing: identifying connections between what our technology can do and what these customers need. Especially customers who are pumped up about the latest big technology vision.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>SPARQL and Big Data (and NoSQL)</title>
      <link>https://www.bobdc.com/blog/sparql-and-big-data-and-nosql/</link>
      <pubDate>Wed, 24 Oct 2012 22:15:13 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-and-big-data-and-nosql/</guid>
      
      
      <description><div>How to pursue the common ground?</div><div>&lt;p&gt;I think it&amp;rsquo;s obvious that SPARQL and other RDF-related technologies have plenty to offer to the overlapping worlds of Big Data and NoSQL, but this doesn&amp;rsquo;t seem as obvious to people who focus on those areas. For example, the program for this week&amp;rsquo;s &lt;a href=&#34;http://strataconf.com/stratany2012/public/schedule/full/public&#34;&gt;Strata&lt;/a&gt; conference makes no mention of RDF or SPARQL. The more I look into it, the more I see that this flexible, standardized data model and query language align very well with what many of those people are trying to do.&lt;/p&gt;
&lt;blockquote id=&#34;id115166&#34; class=&#34;pullquote&#34;&gt;If there&#39;s just enough structure to get a toehold and build from there, your data is minimally structured.&lt;/blockquote&gt;
&lt;p&gt;But, we semantic web types can&amp;rsquo;t blame them for not noticing. If you build a better mouse trap, the world won&amp;rsquo;t necessarily beat a path to your door, because they have to find out about your mouse trap and what it does better. This requires marketing, which requires talking to those people in language that they understand, so I&amp;rsquo;ve been reading up on Big Data and NoSQL in order to better appreciate what they&amp;rsquo;re trying to do and how.&lt;/p&gt;
&lt;p&gt;A great place to start is the excellent (free!) booklet &lt;a href=&#34;http://shop.oreilly.com/product/0636920025559.do&#34;&gt;Planning for Big Data&lt;/a&gt; by &lt;a href=&#34;http://eddology.com/&#34;&gt;Edd Dumbill&lt;/a&gt;. (Others contributed a few chapters.) For a start, he describes data that &amp;ldquo;doesn&amp;rsquo;t fit the strictures of your database architectures&amp;rdquo; as a good candidate for Big Data approaches. That&amp;rsquo;s a good start for us. Here are a few longer quotes that I found interesting, starting with these two paragraphs from the section titled &amp;ldquo;Ingesting and Cleaning&amp;rdquo; after a discussion about collecting data from multiple different sources (something else that RDF and SPARQL are good at):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Once the data is collected, it must be ingested. In traditional business intelligence (BI) parlance, this is known as Extract, Transform, and Load (ETL): the act of putting the right information into the correct tables of a database schema and manipulating certain fields to make them easier to work with.&lt;/p&gt;
&lt;p&gt;One of the distinguishing characteristics of big data, however, is that the data is often unstructured. That means we don’t know the inherent schema of the information before we start to analyze it. We may still transform the information — replacing an IP address with the name of a city, for example, or anonymizing certain fields with a one-way hash function — but we may hold onto the original data and only define its structure as we analyze it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With my long history as an XML guy (which is how I know Edd, the former editor of XML.com), I know that ideas about &amp;ldquo;structured&amp;rdquo; vs. &amp;ldquo;unstructured&amp;rdquo; data are very relative—one person&amp;rsquo;s structured data is another person&amp;rsquo;s unstructured data, especially if the first person is an XML guy and the second is an RDBMS person—and that the term &amp;ldquo;semi-structured&amp;rdquo; becomes the compromise adjective. I&amp;rsquo;ll coin a new term that seems to get no relevant Google hits: &amp;ldquo;minimally structured&amp;rdquo;—if there&amp;rsquo;s just enough structure to get a toehold and build from there, your data is minimally structured. And RDFS is excellent if we want to &amp;ldquo;define [data&amp;rsquo;s] structure as we analyze it&amp;rdquo;. This can be done very incrementally, and OWL can take you many increments further.&lt;/p&gt;
&lt;p&gt;Some of that minimal structure can be inferred and made explicit; for example, if you have data about people&amp;rsquo;s genders and about who is the parent of whom, you can infer father and mother relationships (and grandfather, and aunt, and&amp;hellip;) and even classes by defining a Grandfather class as the set of instances that have a gender of male and have children who have children. I might say that this is creating new information, but a relational database person would say that it&amp;rsquo;s not—it&amp;rsquo;s just making implicit information explicit. Relational database people put &lt;a href=&#34;http://en.wikipedia.org/wiki/Database_normalization&#34;&gt;a lot of effort&lt;/a&gt; into avoiding the explicit storage of information that can otherwise be inferred, but a relational database is a very closed world, so new possibilities of things to infer within a given set of data don&amp;rsquo;t come up often. Accumulation of RDF from multiple sources can be much more dynamic, making it easier to create new wholes that are greater than the sum of their parts (made greater by this kind of inferencing), which opens up new possibilities for patterns to find in different combinations of data.&lt;/p&gt;
&lt;p&gt;Another quote from Edd&amp;rsquo;s book:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Even where there’s not a radical data type mismatch, a disadvantage of the relational database is the static nature of its schemas. In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So do triplestores, which give you the best of both worlds: with no need for a schema, you can accumulate data and query it using a standardized query language, and then, if you want, incrementally add schema metadata (often based on query results) to aid further queries.&lt;/p&gt;
&lt;p&gt;Another quote on this topic:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NoSQL databases are frequently called “schemaless,” because they don’t have the formal schema associated with relational databases. The lack of a formal schema, which typically has to be designed before any code is written, means that schemaless databases are a better fit for current software development practices, such as agile development. Starting from the simplest thing that could possibly work and iterating quickly in response to customer input doesn’t fit well with designing an all-encompassing data schema at the start of the project. It’s impossible to predict how data will be used, or what additional data you’ll need as the project unfolds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Again, all very easy with RDF-based technology, where in addition to the choices of &amp;ldquo;assemble a big schema before you start developing&amp;rdquo; and &amp;ldquo;just blow off schemas, because they impair flexibility&amp;rdquo; you can work with a middle ground of little bits of schema metadata added when you need them as you go along.&lt;/p&gt;
&lt;p&gt;From what I&amp;rsquo;ve heard of the various classes of NoSQL databases, graph-oriented ones like &lt;a href=&#34;http://neo4j.org/&#34;&gt;Neo4J&lt;/a&gt; sound the closest to triplestores, which are also storing &lt;a href=&#34;http://www.w3.org/TR/rdf-mt/#graphdefs&#34;&gt;graphs&lt;/a&gt;. This description of another class of NoSQL databases really caught my attention, though:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cassandra and HBase are usually called column-oriented databases, though a better term is a “sparse row store.” In these databases, the equivalent to a relational “table” is a set of rows, identified by a key. Each row consists of an unlimited number of columns; columns are essentially keys that let you look up values in the row. Columns can be added at any time, and columns that are unused in a given row don’t occupy any storage. NULLs don’t exist.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the &amp;ldquo;equivalent to a relational &amp;lsquo;table&amp;rsquo;&amp;rdquo;? It sounds more like the equivalent to a set of triples grouped by subject. Properties (predicates) are essentially keys that let you look up values associated with subjects; you can add property name/value pairs to a subject at any time, because they don&amp;rsquo;t depend on some schema, and properties that aren&amp;rsquo;t used for a given resource don&amp;rsquo;t occupy any storage. (And NULLs don&amp;rsquo;t exist.)&lt;/p&gt;
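&lt;p&gt;(A toy sketch of the parallel, in plain Python rather than anything Cassandra- or HBase-specific: subjects as row keys, predicates as column names, and no storage at all for absent properties.)&lt;/p&gt;

```python
# Toy illustration, plain Python only: triples grouped by subject behave
# like "sparse rows". Subjects are row keys, predicates are column names,
# and a property that is absent for a subject takes up no storage at all.
triples = [
    ("m40392", "description", "breakfast"),
    ("m40392", "amount", 6.53),
    ("m40393", "description", "lunch"),
    # m40393 has no amount: nothing stored, and no NULL placeholder either
]

rows = {}
for subject, predicate, obj in triples:
    rows.setdefault(subject, {})[predicate] = obj

print(rows["m40392"]["amount"])    # 6.53
print("amount" in rows["m40393"])  # False
```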
&lt;p&gt;What I&amp;rsquo;d love to see, and have heard about tentative steps toward, would be SPARQL endpoints for some of these NoSQL database systems. The &lt;a href=&#34;http://d2rq.org/&#34;&gt;D2RQ&lt;/a&gt; and &lt;a href=&#34;http://www.w3.org/TR/r2rml/&#34;&gt;R2RML&lt;/a&gt; work has accomplished things that should be easier for graph-oriented NoSQL databases like Neo4J and, if I understand the quote above correctly, for column-oriented NoSQL databases as well. Google searches on SPARQL and either Hadoop, Neo4J, HBase, or Cassandra show that some people have been discussing and even doing a bit of coding on several of these. (In addition to the column- and graph-oriented NoSQL databases, another category is the &amp;ldquo;document-oriented&amp;rdquo; ones, so AllegroGraph&amp;rsquo;s &lt;a href=&#34;http://www.franz.com/agraph/support/documentation/4.7/mongo-interface.html&#34;&gt;interface to MongoDB&lt;/a&gt; is an excellent sign of progress in this direction.) What can we do to encourage more of this kind of interaction?&lt;/p&gt;
&lt;p&gt;I have a lot more research to do, so I just started reading Eric Redmond and Jim Wilson&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1934356921/bobducharmeA/&#34;&gt;Seven Databases in Seven Weeks&lt;/a&gt;. I will report back on further ideas I have. Meanwhile I&amp;rsquo;d appreciate hearing anyone else&amp;rsquo;s opinions on how Big Data and NoSQL technology and standards-based semantic technology can better take advantage of what each other has to offer.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>SPARQL 1.1&#39;s new VALUES keyword</title>
      <link>https://www.bobdc.com/blog/sparql-11s-new-values-keyword/</link>
      <pubDate>Sat, 29 Sep 2012 16:42:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-11s-new-values-keyword/</guid>
      
      
      <description><div>New ways to filter search results.</div><div>&lt;p&gt;&lt;a href=&#34;http://youtu.be/1ajUJNm2oFM&#34;&gt;&lt;img id=&#34;id82399&#34; src=&#34;https://www.bobdc.com/img/main/IggyNewValues.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;15px&#34; alt=&#34;[Iggy New Values cover]&#34; width=&#34;160&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;SPARQL 1.1&amp;rsquo;s new BIND keyword lets you assign a value to a variable, and the &lt;a href=&#34;http://www.w3.org/TR/2012/WD-sparql11-query-20120724/&#34;&gt;even newer&lt;/a&gt; VALUES keyword lets you create tables of values, giving you new options when filtering query results. As the July 24th draft of the SPARQL 1.1 query spec (where the keyword first appeared) tells us, VALUES &amp;ldquo;replaces and generalizes BINDINGS&amp;rdquo;, a keyword from earlier drafts of the SPARQL 1.1 spec. The &lt;a href=&#34;https://repository.apache.org/content/groups/snapshots/org/apache/jena/apache-jena/2.7.4-SNAPSHOT/&#34;&gt;ARQ 2.7.4 snapshot&lt;/a&gt; supports the VALUES keyword, so I played with it a bit.&lt;/p&gt;
&lt;p&gt;The following query ignores any input you pass to it (make sure to pass some anyway if you&amp;rsquo;re using command line ARQ, which complains if you don&amp;rsquo;t include a &lt;code&gt;--data&lt;/code&gt; parameter) and demonstrates how you can create a table of values. This example populates the table with qnames and literal values, but you can use any kinds of RDF values you want:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt;


SELECT * WHERE { 
     VALUES (?color ?direction) {
     ( dm:red  &amp;quot;north&amp;quot; )
     ( dm:blue  &amp;quot;west&amp;quot; )
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here&amp;rsquo;s the result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-----------------------
| color   | direction |
=======================
| dm:red  | &amp;quot;north&amp;quot;   |
| dm:blue | &amp;quot;west&amp;quot;    |
-----------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This result isn&amp;rsquo;t particularly exciting, but it shows how simple it is to create a two-dimensional table in a SPARQL query. To see what VALUES can add to our queries, we&amp;rsquo;ll use the following dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix e: &amp;lt;http://learningsparql.com/ns/expenses#&amp;gt; .
@prefix d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt; .


d:m40392 e:description &amp;quot;breakfast&amp;quot; ;
         e:date &amp;quot;2011-10-14&amp;quot; ;
         e:amount 6.53 . 


d:m40393 e:description &amp;quot;lunch&amp;quot; ;
         e:date &amp;quot;2011-10-14&amp;quot; ;
         e:amount 11.13 . 


d:m40394 e:description &amp;quot;dinner&amp;quot; ;
         e:date &amp;quot;2011-10-14&amp;quot; ;
         e:amount 28.30 . 


d:m40395 e:description &amp;quot;breakfast&amp;quot; ;
         e:date &amp;quot;2011-10-15&amp;quot; ;
         e:amount 4.32 . 


d:m40396 e:description &amp;quot;lunch&amp;quot; ;
         e:date &amp;quot;2011-10-15&amp;quot; ;
         e:amount 9.45 . 


d:m40396 e:description &amp;quot;lunch&amp;quot; ;
         e:date &amp;quot;2011-10-15&amp;quot; ;
         e:amount 6.20 . 


d:m40397 e:description &amp;quot;dinner&amp;quot; ;
         e:date &amp;quot;2011-10-15&amp;quot; ;
         e:amount 31.45 . 


d:m40398 e:description &amp;quot;breakfast&amp;quot; ;
         e:date &amp;quot;2011-10-16&amp;quot; ;
         e:amount 6.65 . 


d:m40399 e:description &amp;quot;lunch&amp;quot; ;
         e:date &amp;quot;2011-10-16&amp;quot; ;
         e:amount 10.00 . 


d:m40400 e:description &amp;quot;dinner&amp;quot; ;
         e:date &amp;quot;2011-10-16&amp;quot; ;
         e:amount 25.05 . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a baseline, we&amp;rsquo;ll start with a simple query that asks for the values of all the dataset&amp;rsquo;s properties without using the VALUES keyword:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# filename: values1.rq


PREFIX e: &amp;lt;http://learningsparql.com/ns/expenses#&amp;gt; 


SELECT ?description ?date ?amount
WHERE
{
  ?meal e:description ?description ;
        e:date ?date ;
        e:amount ?amount . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When run with the dataset above, this query lists all the description, date, and amount values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------------
| description | date         | amount |
=======================================
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-16&amp;quot; | 25.05  |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-16&amp;quot; | 10.00  |
| &amp;quot;breakfast&amp;quot; | &amp;quot;2011-10-16&amp;quot; | 6.65   |
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-15&amp;quot; | 31.45  |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 6.20   |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 9.45   |
| &amp;quot;breakfast&amp;quot; | &amp;quot;2011-10-15&amp;quot; | 4.32   |
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-14&amp;quot; | 28.30  |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-14&amp;quot; | 11.13  |
| &amp;quot;breakfast&amp;quot; | &amp;quot;2011-10-14&amp;quot; | 6.53   |
---------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This next version of the query adds a VALUES clause saying that we&amp;rsquo;re only interested in results that have &amp;ldquo;lunch&amp;rdquo; or &amp;ldquo;dinner&amp;rdquo; in the &lt;code&gt;?description&lt;/code&gt; value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# filename: values2.rq


PREFIX e: &amp;lt;http://learningsparql.com/ns/expenses#&amp;gt; 


SELECT ?description ?date ?amount
WHERE
{
  ?meal e:description ?description ;
        e:date ?date ;
        e:amount ?amount . 
  VALUES ?description { &amp;quot;lunch&amp;quot; &amp;quot;dinner&amp;quot; }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that, in this case, the VALUES data structure being created is one-dimensional, not two; this is still a step up from the BIND keyword&amp;rsquo;s ability to assign only a single value to a variable at a time.) With the same meal expense data, this new query&amp;rsquo;s output is the same as that of the preceding query minus the &amp;ldquo;breakfast&amp;rdquo; result rows, although in a different order:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------------
| description | date         | amount |
=======================================
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-16&amp;quot; | 10.00  |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 6.20   |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 9.45   |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-14&amp;quot; | 11.13  |
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-16&amp;quot; | 25.05  |
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-15&amp;quot; | 31.45  |
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-14&amp;quot; | 28.30  |
---------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query&amp;rsquo;s VALUES clause could go after the SELECT clause&amp;rsquo;s closing curly brace, instead of before it, and it wouldn&amp;rsquo;t affect the results. (This won&amp;rsquo;t always be the case with the VALUES clause in GROUP BY and federated queries.)&lt;/p&gt;
&lt;p&gt;This next query of the same data creates a two-dimensional table to use for filtering output results:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# filename: values3.rq


PREFIX e: &amp;lt;http://learningsparql.com/ns/expenses#&amp;gt; 


SELECT ?description ?date ?amount
WHERE
{
  ?meal e:description ?description ;
        e:date ?date ;
        e:amount ?amount . 


  VALUES (?date ?description) {
         (&amp;quot;2011-10-15&amp;quot; &amp;quot;lunch&amp;quot;) 
         (&amp;quot;2011-10-16&amp;quot; &amp;quot;dinner&amp;quot;)
  } 


}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After retrieving all the meal data, this query only passes along the results that have either a &lt;code&gt;?date&lt;/code&gt; value of &amp;ldquo;2011-10-15&amp;rdquo; and a &lt;code&gt;?description&lt;/code&gt; value of &amp;ldquo;lunch&amp;rdquo; or a &lt;code&gt;?date&lt;/code&gt; value of &amp;ldquo;2011-10-16&amp;rdquo; and a &lt;code&gt;?description&lt;/code&gt; value of &amp;ldquo;dinner&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------------
| description | date         | amount |
=======================================
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 6.20   |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 9.45   |
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-16&amp;quot; | 25.05  |
---------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(It looks like someone had two lunches on October 15th.)&lt;/p&gt;
&lt;p&gt;When you use VALUES to create a data table, you don&amp;rsquo;t have to assign a value to every position. The UNDEF keyword acts as a wildcard, accepting any value that may come up there. The following variation on the preceding query asks for any result rows with &amp;ldquo;lunch&amp;rdquo; as the &lt;code&gt;?description&lt;/code&gt; value, regardless of the &lt;code&gt;?date&lt;/code&gt; value, and also for any result rows with a &lt;code&gt;?date&lt;/code&gt; value of &amp;ldquo;2011-10-16&amp;rdquo;, regardless of the &lt;code&gt;?description&lt;/code&gt; value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# filename: values4.rq


PREFIX e: &amp;lt;http://learningsparql.com/ns/expenses#&amp;gt; 


SELECT ?description ?date ?amount
WHERE
{
  ?meal e:description ?description ;
        e:date ?date ;
        e:amount ?amount . 


  VALUES (?date ?description) {
         (UNDEF &amp;quot;lunch&amp;quot;) 
         (&amp;quot;2011-10-16&amp;quot; UNDEF) 
  }


}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of this query has more rows than the previous query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------------
| description | date         | amount |
=======================================
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-16&amp;quot; | 10.00  |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 6.20   |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-15&amp;quot; | 9.45   |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-14&amp;quot; | 11.13  |
| &amp;quot;dinner&amp;quot;    | &amp;quot;2011-10-16&amp;quot; | 25.05  |
| &amp;quot;lunch&amp;quot;     | &amp;quot;2011-10-16&amp;quot; | 10.00  |
| &amp;quot;breakfast&amp;quot; | &amp;quot;2011-10-16&amp;quot; | 6.65   |
---------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you saw the descriptions of what each of these queries did, it may have occurred to you that all of these query conditions could have been specified without the VALUES keyword (for example, with a FILTER IN clause in the values2.rq query, although that would only work to replace a one-dimensional VALUES setting). That&amp;rsquo;s true, but I was using a small amount of data to demonstrate different ways to use the new keyword. When you work with larger amounts of data and especially with more complex filtering conditions, VALUES offers an extra layer of result filtering that can give you more control over your final search results with very little extra code in your query.&lt;/p&gt;
&lt;p&gt;(Thanks to Andy Seaborne for reviewing this before publication.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>IBM&#39;s DB2 as a triplestore</title>
      <link>https://www.bobdc.com/blog/ibms-db2-as-a-triplestore/</link>
      <pubDate>Wed, 29 Aug 2012 19:45:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/ibms-db2-as-a-triplestore/</guid>
      
      
      <description><div>Surprisingly easy to set up and use, but requiring lots of Java coding for any real application development.</div><div>&lt;p&gt;I thought it was pretty big news for the semantic web world when IBM announced that release 10.1 of their venerable &lt;a href=&#34;http://www-01.ibm.com/software/data/db2/&#34;&gt;DB2&lt;/a&gt; database manager could function as an RDF triplestore, but it seems that few others—not even, apparently, IBM staff responsible for marketing semantic technology—agreed with me. More on this below.&lt;/p&gt;
&lt;img id=&#34;id104266&#34; src=&#34;https://www.bobdc.com/img/main/db2rdf.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;RDF on DB2&#34; width=&#34;180&#34;/&gt;
&lt;p&gt;IBM &lt;a href=&#34;http://en.wikipedia.org/wiki/Edgar_F._Codd&#34;&gt;invented relational databases&lt;/a&gt;, and DB2 has been their main relational database product for almost twenty years. It runs on mainframes, PCs, Linux, the iSeries (descendants of the AS/400) and other platforms. (Although DB2 has also worked as an XML repository since 2006, with support for XQuery and XPath, I have not been aware of any shops using it for that instead of, say, &lt;a href=&#34;http://www.marklogic.com/&#34;&gt;MarkLogic&lt;/a&gt; or &lt;a href=&#34;http://www.existsolutions.com/&#34;&gt;eXist&lt;/a&gt;. I assume it&amp;rsquo;s used for more transaction-oriented XML as opposed to content for publishing.) In addition to functioning as a triplestore, DB2 10.1 supports SPARQL 1.0 and &lt;a href=&#34;http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.swg.im.dbclient.rdf.doc/doc/c0060566.html&#34;&gt;a few of the more SQL-friendly features&lt;/a&gt; of SPARQL 1.1.&lt;/p&gt;
&lt;p&gt;I found the &lt;a href=&#34;http://www-01.ibm.com/software/data/db2/express/download.html&#34;&gt;free version of DB2&lt;/a&gt; for Windows to be fairly easy to download and install. I didn&amp;rsquo;t have to do anything special to get my downloaded copy to support RDF; after I finished the default installation, my hard disk had a &lt;code&gt;\Program Files\IBM\SQLLIB\rdf&lt;/code&gt; directory with a &lt;code&gt;lib&lt;/code&gt; subdirectory full of jar files and a set of batch files that call the jar files in a &lt;code&gt;bin&lt;/code&gt; subdirectory.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.swg.im.dbclient.rdf.doc/doc/c0059661.html&#34;&gt;RDF application development for IBM data servers&lt;/a&gt; appears to be the main documentation page for DB2&amp;rsquo;s RDF support, but I used the developerWorks tutorial &lt;a href=&#34;http://www.ibm.com/developerworks/data/tutorials/dm-1205rdfdb210/section3.html&#34;&gt;Resource description framework application development in DB2 10 for Linux, UNIX, and Windows&lt;/a&gt; as my guide to getting started—in particular, to find out about the Jena and ARQ jar files to add to the &lt;code&gt;rdf/lib&lt;/code&gt; directory to make everything work properly.&lt;/p&gt;
&lt;p&gt;The tutorial has you using &amp;ldquo;IBM Data Studio&amp;rdquo;, their Eclipse-based DB2 administration interface, after you finish your initial setup, and I couldn&amp;rsquo;t get certain menu choices described by the article to show up in the copy of Data Studio that I downloaded, but with some generous email help from the article&amp;rsquo;s lead author, Mario Briggs, I ultimately managed to do everything I wanted to without Data Studio.&lt;/p&gt;
&lt;p&gt;(The developerWorks article is actually just Part 1, and I look forward to Part 2. Remember, though, that the article is more oriented toward explaining RDF to DB2 users than vice versa, and it also assumes that your main use of DB2&amp;rsquo;s RDF storage will be from Java code that you write yourself. I limited myself to the batch files in the &lt;code&gt;bin&lt;/code&gt; directory and two that Mario sent, and did manage to load and query some data.)&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;Prerequisites for creating RDF stores&amp;rdquo; section of the tutorial article lists some very technical setup details to perform, but step 2 after that describes a script that takes care of these steps for you—for example, by creating the DB2 database RDFSAMPL that each of my command line examples below refer to. (Note that the script is called dbsetup.sql, not setup.sql, as the article currently says. Also, in Windows 7, you can&amp;rsquo;t do this in just any command line window, but must do it from one opened by right-clicking a command line window icon and picking &amp;ldquo;Run as administrator&amp;rdquo;.) That was not the first time that I did something specified by the article, saw that it didn&amp;rsquo;t work, and then read in the paragraphs after that about changes to make to the displayed command to make it work with my configuration. So, if you get stuck in the tutorial, read ahead a little before you get too frustrated.&lt;/p&gt;
&lt;p&gt;If you run a batch file from &lt;code&gt;\Program Files\IBM\SQLLIB\rdf\bin&lt;/code&gt; with no parameters, it displays help about the available parameters, so that will tell you more details about the steps that I executed below.&lt;/p&gt;
&lt;p&gt;Once I had the RDFSAMPL database defined using the dbsetup.sql script, running the following command from the &lt;code&gt;bin&lt;/code&gt; directory mentioned above created an RDF store in RDFSAMPL named myrdfstore (I had set the password values when I first installed DB2):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;createrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;bin&lt;/code&gt; directory includes a createrdfstoreandloader.bat batch file to create and load data at once, but I usually used the loadrdfstore.bat batch file (available &lt;a href=&#34;http://snee.com/bobdc.blog/files/loadrdfstore.bat.txt&#34;&gt;here&lt;/a&gt; with &amp;ldquo;.txt&amp;rdquo; added to the filename for easier downloading) that Mario sent me. For example, this next command loaded some data into that RDF store and gave a report about how many triples were loaded:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;loadrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password \temp\ex029.rdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Right now, DB2 can load RDF/XML and ntriples files, but not Turtle. As far as I can tell, without custom Java coding there is currently no way to add triples to an RDF store that already has triples in it or to add triples to named graphs. See the &lt;a href=&#34;http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.swg.im.dbclient.rdf.doc%2Fdoc%2Ft0059630.html&#34;&gt;documentation&lt;/a&gt; for more on the relevant Java libraries and calls.&lt;/p&gt;
&lt;p&gt;Another short yet crucial batch file that Mario sent me was queryrdfstore (available &lt;a href=&#34;http://snee.com/bobdc.blog/files/queryrdfstore.bat.txt&#34;&gt;here&lt;/a&gt;). This next command uses it to run the query shown and displays the results along with a count of the milliseconds it took:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;queryrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password &amp;quot;SELECT DISTINCT ?p WHERE { ?s ?p ?o }&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Keep in mind that the files that Mario sent me may not work with future versions of DB2&amp;rsquo;s RDF support; that&amp;rsquo;s why they were left out of the basic distribution. I&amp;rsquo;m sure they&amp;rsquo;ll have some sort of equivalent.) Instead of a quoted query, you can supply the name of a file with the SPARQL query stored in it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;queryrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password myquery1.rq
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For now it looks like IBM isn&amp;rsquo;t that interested in selling DB2 and its RDF triplestore features to the semantic web crowd. For example, shortly before the big &lt;a href=&#34;http://semtechbizsf2012.semanticweb.com/&#34;&gt;Semantic Technologies Conference&lt;/a&gt; last June in San Francisco, semanticweb.com&amp;rsquo;s Eric Franzon interviewed IBM Director of Strategy and Marketing for Database Software and Systems Bernie Spang in an article titled &lt;a href=&#34;http://semanticweb.com/rdf-support-in-ibms-db2_b28098&#34;&gt;RDF Support in IBM’s DB2&lt;/a&gt;. Spang talked more in big picture terms, which is his job, and the article concludes by pointing out that IBM is a Gold Sponsor of the San Francisco conference. However, when I went to the IBM booth at the conference to ask about the RDF triplestore support in DB2, the two guys in the booth were genuinely surprised to hear that this had been added to DB2. (They were there to sell IBM&amp;rsquo;s &lt;a href=&#34;http://www-01.ibm.com/software/ecm/&#34;&gt;Enterprise Content Management&lt;/a&gt; product.) They did give me some excellent wind-up IBM robots, though.&lt;/p&gt;
&lt;p&gt;When I see a title of &amp;ldquo;&lt;a href=&#34;https://www.ibm.com/developerworks/mydeveloperworks/blogs/nlp/entry/db2_rdf_nosql_graph_support13?lang=en&#34;&gt;DB2-RDF (NoSQL Graph) Support in DB2 LUW 10.1&lt;/a&gt;&amp;rdquo; on another page on the developerWorks site, I can better see the logic of IBM&amp;rsquo;s approach: they&amp;rsquo;re saying &amp;ldquo;Hey, we can do NoSQL&amp;rdquo;, a message that can appeal to a bigger audience than a marketing effort focused on us semantic web geeks, especially when you consider the huge base of existing DB2 users who are wondering about the new database technologies getting the most buzz lately.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m still very happy that IBM chose to go with a W3C standards-based approach to supporting NoSQL graph databases. I especially appreciate this direction because a lot of the NoSQL crowd seems unaware of what RDF and SPARQL technology can offer them. (Why, and what can we do about it? That&amp;rsquo;s another blog entry, but feel free to add comments here with your own theories.) I just think it&amp;rsquo;s great that I can store and query RDF on my laptop using one of the most respected database management packages without spending a dime, and that if I really want to scale up, I can do it with the same software on an IBM mainframe.&lt;/p&gt;
&lt;img id=&#34;id106749&#34; src=&#34;https://www.bobdc.com/img/main/ibmrobots.png&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;windup IBM schwag robots&#34; width=&#34;320&#34;/&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Properties</title>
      <link>https://www.bobdc.com/blog/properties/</link>
      <pubDate>Tue, 31 Jul 2012 09:05:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/properties/</guid>
      
      
      <description><div>Children&#39;s edition.</div><div>&lt;p&gt;Going through some old files, I found a homework assignment that my younger daughter did seven or eight years ago. When doing RDF-related data modeling you put a lot of thought into properties, and I remember getting a kick out of this introduction to the concept when she brought it home.&lt;/p&gt;
&lt;img id=&#34;id129378&#34; src=&#34;https://www.bobdc.com/img/main/properties.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; alt=&#34;Properties homework&#34;/&gt;
&lt;p&gt;The smiley face shows that on a later page she did well on the worksheet that evaluated how well she understood this. By now, I think she would understand &lt;a href=&#34;http://www.w3.org/TR/owl-ref/#Property&#34;&gt;datatype vs. object properties&lt;/a&gt; if she was interested, but so far she&amp;rsquo;s not.&lt;/p&gt;
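&lt;p&gt;(For the curious: the short version of that distinction is that a datatype property has a literal as its value, while an object property points at another resource. A two-triple Turtle sketch, with an &lt;code&gt;ex:&lt;/code&gt; namespace made up for illustration:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix owl: &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
@prefix ex:  &amp;lt;http://example.com/ns#&amp;gt; .

ex:weight a owl:DatatypeProperty . # e.g. ex:rock ex:weight &amp;quot;2.2&amp;quot; .
ex:partOf a owl:ObjectProperty .   # e.g. ex:leg ex:partOf ex:chair .
&lt;/code&gt;&lt;/pre&gt;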
&lt;p&gt;The second page, not shown here, is titled &amp;ldquo;Properties Can Change,&amp;rdquo; a topic that continues to vex data modelers. In my own future data modeling, in addition to asking myself things like &amp;ldquo;is there a popular subproperty of rdfs:label that I should be using here?&amp;rdquo; I will also make a point of asking myself &amp;ldquo;Is it hard as a rock or as soft as a dream?&amp;rdquo;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Reclaiming my picture metadata from flickr</title>
      <link>https://www.bobdc.com/blog/reclaiming-my-picture-metadata/</link>
      <pubDate>Tue, 26 Jun 2012 19:54:44 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/reclaiming-my-picture-metadata/</guid>
      
      
      <description><div>Surprise: by converting multiple sources of data to triples and then running a SPARQL query.</div><div>&lt;blockquote id=&#34;id128256&#34; class=&#34;pullquote&#34;&gt;...a pretty nice example of how triples and SPARQL can make quick and dirty data integration easy even when the data in question isn&#39;t necessarily stored as triples.&lt;/blockquote&gt;
&lt;p&gt;We should give &lt;a href=&#34;http://www.flickr.com/photos/bobdc&#34;&gt;flickr&lt;/a&gt; some credit for providing an API that lets us download the metadata we&amp;rsquo;ve entered about our pictures (for example, titles, descriptions, and membership in custom sets such as &lt;a href=&#34;http://www.flickr.com/photos/bobdc/sets/72157627751552582/&#34;&gt;XML Summer School 2011&lt;/a&gt; or &lt;a href=&#34;http://www.flickr.com/photos/bobdc/sets/72157594523962443/&#34;&gt;Artsier Stuff&lt;/a&gt;) but that metadata all refers to pictures on flickr&amp;rsquo;s servers. What if I want to use &lt;a href=&#34;http://www.blurb.com/&#34;&gt;blurb.com&lt;/a&gt; to print a hardcopy album of one of these sets? Do I have to download that set&amp;rsquo;s pictures from flickr, even though I already have them on a hard disk, because I don&amp;rsquo;t know which ones on my hard disk correspond to the ones in that set on the flickr server?&lt;/p&gt;
&lt;p&gt;As it turns out, no. The general question is this: how do I connect metadata that I&amp;rsquo;ve entered on flickr.com with the files on my local hard disk? Assuming that I never took two different pictures in the same millisecond, I can use the date-time stamp stored inside of each JPEG image file as a unique ID (or, in more OWLish terms, as an inverse functional property, although I didn&amp;rsquo;t actually use owl:InverseFunctionalProperty anywhere and just let SPARQL do the work), so here&amp;rsquo;s what I did:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;I used the flickr API to download the metadata about all the pictures that I have stored there, including set membership. This data was all in XML, so I then used some XSLT to convert that to Turtle RDF.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I used Apache &lt;a href=&#34;http://tika.apache.org/&#34;&gt;Tika&lt;/a&gt; (an open source toolkit I&amp;rsquo;ve written about here &lt;a href=&#34;https://www.bobdc.com/blog/pull-rdf-metadata-out-of-jpegs&#34;&gt;before&lt;/a&gt;) to pull out metadata about all the pictures on my hard disk as JSON. (I could have asked Tika for &lt;a href=&#34;http://www.snee.com/bobdc.blog/2008/10/new-xmp-spec.html&#34;&gt;XMP&lt;/a&gt;, which would give me RDF, but asking for JSON gets you more data.) I then used some JavaScript to convert this JSON to Turtle RDF. For the file \My Pictures\2012-01-12\IMG_5907.jpg, I created an IMG_5907.jpg.ttl file where the subject of all the triples is the URI &lt;a href=&#34;http://www.snee.com/bob/pics/id/2012-01-12/IMG_5907.jpg&#34;&gt;http://www.snee.com/bob/pics/id/2012-01-12/IMG_5907.jpg&lt;/a&gt;.&lt;/p&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I loaded all this RDF into a triplestore and then ran the query shown below, which (in this case) showed me the URIs for the image files on my hard disk that corresponded to each picture stored in my flickr &amp;ldquo;Artsier Stuff&amp;rdquo; set:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;!-- --&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc:   &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX bf:   &amp;lt;http://snee.com/ns/flickr#&amp;gt;
PREFIX exif: &amp;lt;http://www.w3.org/2003/12/exif/ns#&amp;gt; 
SELECT * WHERE {
  ?ps a bf:Photoset ;
      dc:title &amp;quot;Artsier Stuff&amp;quot; ;
      rdfs:member ?memberPic .
      ?memberPic dc:title ?picTitle ;
      bf:dateTaken ?flickrDate. 
   OPTIONAL { ?diskPic exif:dateTimeOriginal ?flickrDate . }
}
ORDER BY ?flickrDate
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The query finds the bf:dateTaken value of each picture from that set, then looks for a local disk file with that same date-time stamp. I put that last bit in an OPTIONAL pattern because I wasn&amp;rsquo;t sure whether it would successfully find local versions of all the files, and wanted to see which ones it had trouble with. As it turned out, it didn&amp;rsquo;t have trouble with any of them, which was great to see.&lt;/p&gt;
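&lt;p&gt;If some flickr pictures had lacked a matching disk file, a small variation on the query would have listed just those problem cases: keep the OPTIONAL pattern, but filter for solutions where it found nothing to bind to &lt;code&gt;?diskPic&lt;/code&gt;. (A hypothetical variation on the query above, using the same prefixes and data:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc:   &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX bf:   &amp;lt;http://snee.com/ns/flickr#&amp;gt;
PREFIX exif: &amp;lt;http://www.w3.org/2003/12/exif/ns#&amp;gt; 
SELECT ?picTitle ?flickrDate WHERE {
  ?ps a bf:Photoset ;
      dc:title &amp;quot;Artsier Stuff&amp;quot; ;
      rdfs:member ?memberPic .
  ?memberPic dc:title ?picTitle ;
      bf:dateTaken ?flickrDate . 
  OPTIONAL { ?diskPic exif:dateTimeOriginal ?flickrDate . }
  FILTER (!bound(?diskPic))
}
ORDER BY ?flickrDate
&lt;/code&gt;&lt;/pre&gt;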
&lt;p&gt;Finding those URIs was handy for gathering up local copies of pictures from a given set. Other queries could retrieve the title, description, and other data associated with any set of flickr pictures and show the disk files that they went with.&lt;/p&gt;
&lt;p&gt;The whole thing was a nice example of how triples and SPARQL can make quick and dirty data integration easy even when the data in question isn&amp;rsquo;t necessarily stored as triples. As an added bonus, the metadata remains meaningful even if I stop paying my subscription fee to flickr and lose access to metadata for all but 200 pictures, which is what happens when you scale back to a free flickr account.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Trying out SPARQL 1.1&#39;s COPY and MOVE operations</title>
      <link>https://www.bobdc.com/blog/trying-out-sparql-11s-copy-and/</link>
      <pubDate>Sun, 03 Jun 2012 11:13:04 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/trying-out-sparql-11s-copy-and/</guid>
      
      
      <description><div>Copying and moving triples between graphs, named or otherwise.</div><div>&lt;p&gt;SPARQL 1.1 Update&amp;rsquo;s COPY and MOVE operations let you copy and move triples between named graphs or between the default graph and a named graph. These operations first appeared in the May 2011 &lt;a href=&#34;http://www.w3.org/TR/2011/WD-sparql11-update-20110512/&#34;&gt;SPARQL 1.1 Update&lt;/a&gt; draft, but with the recent &lt;a href=&#34;https://repository.apache.org/content/repositories/snapshots/org/apache/jena/jena-fuseki/0.2.2-SNAPSHOT/&#34;&gt;0.2.2 snapshot release of Fuseki&lt;/a&gt; I find I can try their full range of capabilities a little more than I could with the 0.2.1 incubating release of Fuseki.&lt;/p&gt;
&lt;p&gt;The spec&amp;rsquo;s descriptions of &lt;a href=&#34;http://www.w3.org/TR/sparql11-update/#copy&#34;&gt;COPY&lt;/a&gt; and &lt;a href=&#34;http://www.w3.org/TR/sparql11-update/#move&#34;&gt;MOVE&lt;/a&gt; show that neither truly adds anything to the SPARQL Update language; each is a shortcut for a wordier update request that combines DROP and INSERT operations. I still think these shortcuts will be handy.&lt;/p&gt;
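&lt;p&gt;For example, by my reading of the spec, an operation like &lt;code&gt;COPY DEFAULT TO &amp;lt;g&amp;gt;&lt;/code&gt; amounts to a DROP and INSERT pair along these lines (a sketch of the equivalence, not a request you would normally type yourself):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# roughly what COPY DEFAULT TO &amp;lt;g&amp;gt; expands to:
DROP SILENT GRAPH &amp;lt;g&amp;gt; ;
INSERT { GRAPH &amp;lt;g&amp;gt; { ?s ?p ?o } }
WHERE  { ?s ?p ?o }
&lt;/code&gt;&lt;/pre&gt;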
&lt;p&gt;To try them, I first ran the update request at &lt;a href=&#34;http://www.learningsparql.com/examples/ex338.ru&#34;&gt;http://www.learningsparql.com/examples/ex338.ru&lt;/a&gt; in Fuseki to create some data to serve as a baseline. This update request inserts two triples in the default graph, two in the named graph &lt;code&gt;d:g1&lt;/code&gt;, and two in the named graph &lt;code&gt;d:g2&lt;/code&gt;. Running the &lt;a href=&#34;http://www.learningsparql.com/examples/ex332.rq&#34;&gt;query that lists all triples in all graphs&lt;/a&gt; got me this (with prefixes substituted for the original base URIs to more easily fit the output on this page):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------
| g    | s   | p      | o       |
=================================
|      | d:x | dm:tag | &amp;quot;one&amp;quot;   |
|      | d:x | dm:tag | &amp;quot;two&amp;quot;   |
| d:g1 | d:x | dm:tag | &amp;quot;three&amp;quot; |
| d:g1 | d:x | dm:tag | &amp;quot;four&amp;quot;  |
| d:g2 | d:x | dm:tag | &amp;quot;five&amp;quot;  |
| d:g2 | d:x | dm:tag | &amp;quot;six&amp;quot;   |
---------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The COPY operation copies triples from one graph into another, replacing any existing triples in the destination graph. (To quote the spec, &amp;ldquo;If the destination graph does not exist, it will be created.&amp;rdquo;) The following update request copies the triples from the default graph to graph &lt;code&gt;d:g2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
COPY DEFAULT TO d:g2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty simple. After running it, the query that lists all graphs and triples shows that the &amp;ldquo;five&amp;rdquo; and &amp;ldquo;six&amp;rdquo; triples are gone from the &lt;code&gt;d:g2&lt;/code&gt; graph, which now holds copies of the &amp;ldquo;one&amp;rdquo; and &amp;ldquo;two&amp;rdquo; triples that also remain in the default graph:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------
| g    | s   | p      | o       |
=================================
|      | d:x | dm:tag | &amp;quot;one&amp;quot;   |
|      | d:x | dm:tag | &amp;quot;two&amp;quot;   |
| d:g1 | d:x | dm:tag | &amp;quot;three&amp;quot; |
| d:g1 | d:x | dm:tag | &amp;quot;four&amp;quot;  |
| d:g2 | d:x | dm:tag | &amp;quot;one&amp;quot;   |
| d:g2 | d:x | dm:tag | &amp;quot;two&amp;quot;   |
---------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The MOVE operation moves triples from one graph to another, also replacing existing triples in the destination graph. Again, if the destination graph doesn&amp;rsquo;t exist, it will be created. The following update request moves the triples in graph &lt;code&gt;d:g2&lt;/code&gt; to graph &lt;code&gt;d:g1&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
MOVE d:g2 TO d:g1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When run against the result of the COPY update request above, the result shows that there&amp;rsquo;s nothing left in &lt;code&gt;d:g2&lt;/code&gt; and that &lt;code&gt;d:g1&lt;/code&gt; has the triples that used to be in &lt;code&gt;d:g2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-------------------------------
| g    | s   | p      | o     |
===============================
|      | d:x | dm:tag | &amp;quot;one&amp;quot; |
|      | d:x | dm:tag | &amp;quot;two&amp;quot; |
| d:g1 | d:x | dm:tag | &amp;quot;one&amp;quot; |
| d:g1 | d:x | dm:tag | &amp;quot;two&amp;quot; |
-------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see more variations on these two operations in the &lt;a href=&#34;http://www.w3.org/2009/sparql/docs/tests/summary.html&#34;&gt;SPARQL 1.1 Test Suite&lt;/a&gt;, but the tests are basically different combinations of moving triples between default and named graphs, pre-existing or otherwise.&lt;/p&gt;
&lt;p&gt;Like I said, neither of these operations adds anything to SPARQL Update that couldn&amp;rsquo;t be done without them, but I would venture a guess that by making it easier to manipulate the relationships between triples and named graphs, the SPARQL Working Group is encouraging developers to use named graphs more as part of their application architectures. I look forward to asking some of them about this at the &lt;a href=&#34;http://semtechbizsf2012.semanticweb.com/&#34;&gt;Semantic Technology and Business conference&lt;/a&gt; in San Francisco this week. And now it&amp;rsquo;s off to the airport&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Reuse? Ha!</title>
      <link>https://www.bobdc.com/blog/reuse-ha/</link>
      <pubDate>Mon, 28 May 2012 13:10:35 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/reuse-ha/</guid>
      
      
      <description><div>Reuse is Good, especially when you reuse my work.</div><div>&lt;img id=&#34;id116264&#34; src=&#34;https://www.bobdc.com/img/main/re-use-ha.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;reusable container&#34; width=&#34;300&#34;/&gt;
&lt;p&gt;I laughed when I found the container shown here in our house, because it demonstrates an all-too-common attitude about software reuse, right down to the sanctimonious tone: everyone agrees that reuse and recycling are good, &lt;em&gt;so you should reuse this thing that we custom-designed for our particular needs.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now, maybe the container is built of recycled plastic, in which case it has some practice behind its preaching, but the idea of demonstrating your commitment to the lofty principle that Reuse is Good (friendly to Planet Earth! With a capital &amp;ldquo;P&amp;rdquo; and &amp;ldquo;E&amp;rdquo;!) by insisting that others reuse your bespoke work is common in software development. There&amp;rsquo;s a good reason for this: finding code that suits your needs is often more work than just writing the code that does the job you need done. The developers who insist that others should reuse their fabulous code often didn&amp;rsquo;t think beyond their own specific needs when designing it, skipping the steps of generalizing the tasks performed to a wider range of related needs and of course documenting their work so that people understand &lt;em&gt;how&lt;/em&gt; to fit their work to other needs.&lt;/p&gt;
&lt;p&gt;This has been an issue since people began preaching about source code reuse over thirty years ago, and it drove much of the popularity of object-oriented analysis and design. It&amp;rsquo;s interesting that this is less of a problem with semantic web technology, for two reasons that I see:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Reuse of pieces of other work is much easier. If I just want to use a little bit of your ontology or schema or vocabulary, I can, and with RDF at the bottom layer of all of this, aggregation of pieces from multiple otherwise uncoordinated sources is much easier than it is with other technologies that advocate reuse, particularly programming and markup languages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We can do retroactive reuse. Let&amp;rsquo;s say I declare and use my own &lt;code&gt;bd:photographer&lt;/code&gt; property for image metadata. Later, I notice that Dublin Core has a creator property, and only then decide that mine should have been a subproperty of that. I can just add the triple &lt;code&gt;bd:photographer rdfs:subPropertyOf dc:creator&lt;/code&gt; to my data and still reap the benefits of reuse: applications that don&amp;rsquo;t know about my &lt;code&gt;bd:photographer&lt;/code&gt; property but do know about the more famous Dublin Core one will have a clue about the semantics of my property (that is, that a photographer is a kind of creator of an image) and can treat my property as a stand-in for the &lt;code&gt;dc:creator&lt;/code&gt; property.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
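&lt;p&gt;That second kind of reuse really is just one extra triple. A minimal Turtle sketch (the &lt;code&gt;bd:&lt;/code&gt; namespace and &lt;code&gt;bd:pic123&lt;/code&gt; here are made up for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix dc:   &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; .
@prefix bd:   &amp;lt;http://www.snee.com/ns/demo#&amp;gt; .

# my existing data, using my own property:
bd:pic123 bd:photographer &amp;quot;Bob DuCharme&amp;quot; .

# the one retroactive triple that connects it to Dublin Core:
bd:photographer rdfs:subPropertyOf dc:creator .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An application that understands &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; can now treat &lt;code&gt;bd:pic123&lt;/code&gt; as having a &lt;code&gt;dc:creator&lt;/code&gt; of &amp;ldquo;Bob DuCharme&amp;rdquo; without ever having heard of my property.&lt;/p&gt;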
&lt;p&gt;Essentially, the ease with which we can loosely join small pieces of ontologies and RDF schema makes it much easier to use semantic technology to form a semantic web.&lt;/p&gt;
&lt;p&gt;(A side note on object-oriented work: today everyone thinks that &lt;a href=&#34;http://en.wikipedia.org/wiki/Paradigm_shift&#34;&gt;paradigm shift&lt;/a&gt; just means &amp;ldquo;big change.&amp;rdquo; When science historian Thomas Kuhn coined the term, he was describing a gradual fading away of research in a given problem area as people worked on new areas and the other one got left behind. Has object-oriented analysis and design faded away as an active area for computer science researchers since its heyday in the eighties? I think so. Try going to the Association for Computing Machinery&amp;rsquo;s &lt;a href=&#34;http://www.oopsla.org&#34;&gt;www.oopsla.org&lt;/a&gt; website and you&amp;rsquo;ll be redirected to the &amp;ldquo;Systems, Programming, Languages and Applications: Software for Humanity&amp;rdquo; conference that has folded it in. That conference has an &lt;a href=&#34;http://splashcon.org/2011/program/oopsla-research-papers&#34;&gt;OOPSLA track&lt;/a&gt;, but in the titles of the 61 papers presented in that track in 2011, the word &amp;ldquo;object&amp;rdquo; only comes up three times. Of course, object-oriented principles drive the code development of Java and several currently popular programming languages, so these principles are still going very strong, but I find it interesting that active research in the area has faded to such a small fraction of where it used to be.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Simple federated queries with RDF</title>
      <link>https://www.bobdc.com/blog/simple-federated-queries-with/</link>
      <pubDate>Sun, 29 Apr 2012 18:22:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/simple-federated-queries-with/</guid>
      
      
      <description><div>A few more triples to identify some relationships, and you&#39;re all set.</div><div>&lt;blockquote id=&#34;id128269&#34; class=&#34;pullquote&#34;&gt;Easy aggregation without conversion is where semantic web technology shines the brightest.&lt;/blockquote&gt;
&lt;p&gt;Once, at an &lt;a href=&#34;http://xmlsummerschool.com/&#34;&gt;XML Summer School&lt;/a&gt; session, I was giving a talk about semantic web technology to a group that included several presenters from other sessions. This included &lt;a href=&#34;http://www.ltg.ed.ac.uk/~ht/&#34;&gt;Henry Thompson&lt;/a&gt;, who I&amp;rsquo;ve known since the SGML days. He was still a bit skeptical about RDF, and said that RDF was in the same situation as XML—that if he and I stored similar information using different vocabularies, we&amp;rsquo;d still have to convert his to use the same vocabulary as mine or vice versa before we could use our data together. I told him he was wrong—that easy aggregation without conversion is where semantic web technology shines the brightest.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve finally put together an example. Let&amp;rsquo;s say that I want to query across his address book and my address book together for the first name, last name, and email address of anyone whose email address ends with &amp;ldquo;.org&amp;rdquo;. Imagine that his address book uses the &lt;a href=&#34;http://www.w3.org/TR/vcard-rdf/&#34;&gt;vCard&lt;/a&gt; vocabulary and the Turtle syntax and looks like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# addressBookA.ttl


@prefix v:   &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix aba: &amp;lt;http://learningsparql.com/ns/abookA/data#&amp;gt; .        


aba:rick v:given-name &amp;quot;Richard&amp;quot; ;
         v:family-name &amp;quot;Mutt&amp;quot; ; 
         v:email &amp;quot;rick@selavy.org&amp;quot; . 


aba:al   v:given-name &amp;quot;Alan&amp;quot; ;
         v:family-name &amp;quot;Smithee&amp;quot; ; 
         v:email &amp;quot;alan@paramount.com&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and mine uses the &lt;a href=&#34;http://xmlns.com/foaf/spec/&#34;&gt;FOAF&lt;/a&gt; vocabulary and looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# addressBookB.ttl 


@prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; .
@prefix abb: &amp;lt;http://learningsparql.com/ns/abookB/data#&amp;gt; .        


abb:bill   foaf:givenName &amp;quot;Billy&amp;quot; ;
           foaf:familyName &amp;quot;Shears&amp;quot; ; 
           foaf:mbox &amp;quot;bill@northernsongs.org&amp;quot; . 


abb:nate foaf:givenName &amp;quot;Nanker&amp;quot; ;
           foaf:familyName &amp;quot;Phelge&amp;quot; ; 
           foaf:mbox &amp;quot;nate@abkco.com&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that, in addition to the property names being different in the two address books, his properties, my properties, his data, and my data come from four different namespaces.&lt;/p&gt;
&lt;p&gt;A simple CONSTRUCT query would convert one address book to use the same vocabulary that the other uses—my book &lt;a href=&#34;http://www.learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt; includes a &lt;a href=&#34;http://www.learningsparql.com/examples/ex194.rq&#34;&gt;query that does this&lt;/a&gt; to convert an address book from the book&amp;rsquo;s demo namespace to vCard—but to address Henry&amp;rsquo;s question, I wanted to show how we can query across the two address books with no need for conversion. The key is a little bit of &lt;a href=&#34;http://www.w3.org/TR/rdf-schema/&#34;&gt;RDFS&lt;/a&gt; to define appropriate relationships between the properties used by the two address books:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# mapping.ttl


@prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; .
@prefix v:    &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix ab:   &amp;lt;http://learningsparql.com/ns/addressbook#&amp;gt; .


foaf:givenName  rdfs:subPropertyOf ab:firstName . 
v:given-name    rdfs:subPropertyOf ab:firstName . 


foaf:familyName rdfs:subPropertyOf ab:lastName . 
v:family-name   rdfs:subPropertyOf ab:lastName . 


foaf:mbox       rdfs:subPropertyOf ab:email . 
v:email         rdfs:subPropertyOf ab:email . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could have used this mapping.ttl file to say that the FOAF properties were subproperties of the vCard ones (or vice versa) and gotten a similar result, but because these are two independent standards that I had nothing to do with, I didn&amp;rsquo;t feel right making assertions about their relationship, even if it was for a specialized local application. Instead, I declared properties from both to be subproperties of similar ones in an address book namespace that I created myself. With these &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; triples added to the mix, a &lt;code&gt;foaf:givenName&lt;/code&gt; value and a &lt;code&gt;v:given-name&lt;/code&gt; value are both &lt;code&gt;ab:firstName&lt;/code&gt; values, so I can just query for that, and the same goes for the values of the other properties:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#dotorg.rq


PREFIX ab: &amp;lt;http://learningsparql.com/ns/addressbook#&amp;gt; 


SELECT ?email ?fn ?ln WHERE { 
?s ab:firstName ?fn ;
   ab:lastName ?ln ;
   ab:email ?email . 
   FILTER (regex(?email, &amp;quot;\\.org$&amp;quot;)) .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is a catch: the query will only find those values if I query for them with a tool that knows what &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; means. One such tool is the OWL reasoner &lt;a href=&#34;http://clarkparsia.com/pellet&#34;&gt;Pellet&lt;/a&gt;. Pellet&amp;rsquo;s command line interface only accepts one data file as an argument, and I needed to combine the two address book files and the mapping file, so I executed the query with a two-line script that first concatenated the three files together (did I mention that RDF is easy to aggregate?):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat addressBookA.ttl addressBookB.ttl mapping.ttl &amp;gt; combo.ttl
pellet query -q dotorg.rq combo.ttl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is Pellet&amp;rsquo;s answer. It found one email address in each of the two address books that ended with &amp;ldquo;org&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Query Results (2 answers):
email                    | fn        | ln
===============================================
&amp;quot;rick@selavy.org&amp;quot;        | &amp;quot;Richard&amp;quot; | &amp;quot;Mutt&amp;quot;
&amp;quot;bill@northernsongs.org&amp;quot; | &amp;quot;Billy&amp;quot;   | &amp;quot;Shears&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;a href=&#34;http://topquadrant.com/products/TB_Composer.html&#34;&gt;TopBraid Composer&lt;/a&gt;, including the free edition, the simplest way to combine these data files is to create another one that imports the ones you want to query together. I created one called addressbooks.ttl and dragged the three relevant files into the &lt;strong&gt;Imports&lt;/strong&gt; view for that file:&lt;/p&gt;
&lt;img id=&#34;id130657&#34; src=&#34;https://www.bobdc.com/img/main/tbcfeabook1.png&#34; alt=&#34;TopBraid Composer import view&#34; width=&#34;500&#34;/&gt;
&lt;p&gt;(Before I explain the fourth included file: the saved addressbooks.ttl file imports the others using the standard &lt;code&gt;owl:imports&lt;/code&gt; property. Because of this, Pellet can do the same query as above on that &amp;ldquo;single&amp;rdquo; addressbooks.ttl file, because Pellet certainly knows what &lt;code&gt;owl:imports&lt;/code&gt; means. It&amp;rsquo;s always nice to work with a set of tools that play nice together because they conform to the same standards.)&lt;/p&gt;
&lt;p&gt;In order to infer the extra triples implied by the relationships specified in mapping.ttl, such as that &lt;code&gt;aba:rick&lt;/code&gt; has an &lt;code&gt;ab:firstName&lt;/code&gt; value of &amp;ldquo;Richard&amp;rdquo;, TopBraid Composer can use several different inference engines. The TopSPIN inferencing engine is included in all editions, including the free one, and does inferencing based on &lt;a href=&#34;http://www.w3.org/Submission/2011/02/&#34;&gt;SPARQL Inferencing Notation&lt;/a&gt; rules. The fourth file imported above, rdfsplus.ttl, contains rules (stored as triples) that implement RDFS Plus, a superset of RDFS developed by &lt;a href=&#34;http://workingontologist.org/&#34;&gt;Jim Hendler and Dean Allemang&lt;/a&gt; that has a few extra OWL constructs thrown in. (Other SPIN rule sets are available, such as one that implements &lt;a href=&#34;http://www.w3.org/TR/owl2-profiles/#OWL_2_RL&#34;&gt;OWL RL&lt;/a&gt;.) Once you run TopSPIN inferencing on addressbooks.ttl&amp;rsquo;s complete set of triples, running the query above in TopBraid Composer&amp;rsquo;s SPARQL view returns the same result as the Pellet command line query earlier:&lt;/p&gt;
&lt;img id=&#34;id130699&#34; src=&#34;https://www.bobdc.com/img/main/tbcfeabook2.png&#34; alt=&#34;TopBraid Composer SPARQL query and results&#34; width=&#34;500&#34;/&gt;
&lt;p&gt;&lt;a href=&#34;http://answers.semanticweb.com/questions/3534/rdfs-reasoning-support-and-sparql&#34;&gt;Other tools with inferencing support&lt;/a&gt; tend to be triple stores such as AllegroGraph, OWLIM (whose reasoning engine is another option in some versions of TopBraid Composer), Stardog, and Virtuoso. The use of a triplestore with this approach instead of three files loaded into memory together will obviously let you scale up to do it with larger amounts of data.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a nice little trick that builds on the SPIN principle of letting SPARQL do the work: although &lt;a href=&#34;http://incubator.apache.org/jena/documentation/query/&#34;&gt;ARQ&lt;/a&gt; can&amp;rsquo;t do any inferencing, SPARQL 1.1 lets you build a form of inferencing right into your query. This revision of the original query uses &lt;a href=&#34;http://www.w3.org/TR/sparql11-query/#propertypaths&#34;&gt;property paths&lt;/a&gt; to find first and last name and email address values specified with any subproperties of &lt;code&gt;ab:firstName&lt;/code&gt;, &lt;code&gt;ab:lastName&lt;/code&gt;, and &lt;code&gt;ab:email&lt;/code&gt; among the triples at hand:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# arqdotorg.rq


PREFIX ab: &amp;lt;http://learningsparql.com/ns/addressbook#&amp;gt; 
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; 


SELECT ?email ?fn ?ln WHERE { 
?firstNameProp rdfs:subPropertyOf* ab:firstName . 
?lastNameProp rdfs:subPropertyOf* ab:lastName . 
?emailProp rdfs:subPropertyOf* ab:email . 
?s ?firstNameProp ?fn ;
   ?lastNameProp ?ln ;
   ?emailProp ?email . 
   FILTER (regex(?email, &amp;quot;\\.org$&amp;quot;)) .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following command line gets the same result set that the earlier arrangements got:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;arq.bat --query arqdotorg.rq --data addressBookA.ttl --data addressBookB.ttl --data mapping.ttl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Implementing the inferencing logic as part of your query like this is only going to scale up so far, but it can still be handy pretty often.&lt;/p&gt;
&lt;p&gt;Overall, there are two important lessons here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In terms of work, the setups I&amp;rsquo;ve described may look comparable to building and running a simple conversion routine, but once the mapping setup is done, it&amp;rsquo;s done. If Henry or I add a new entry with &lt;a href=&#34;mailto:stigohara@rutles.org&#34;&gt;stigohara@rutles.org&lt;/a&gt; as the email address to either address book, rerunning the query with any of these setups will find it. A big bonus is that we can each continue to use and edit our address books just as we did before, and we can still run these cross-address-book queries with no need to convert anything to anything else.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A little RDFS was all it took. Years ago &lt;a href=&#34;https://www.bobdc.com/blog/rdfs-without-rdfowl&#34;&gt;I wondered&lt;/a&gt; if anyone used RDFS without OWL, and lately the answer is a more and more emphatic Yes. The &lt;code&gt;owl:imports&lt;/code&gt; trick above was one approach to aggregating the necessary triples, but it played no role in the mapping between the two address books that made the query of the two together possible.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So Henry: RDF and related technologies can be very useful, and the list of well-known XML people who have come to realize this is very impressive. In fact, several of them are giving XML and/or RDF presentations at this year&amp;rsquo;s &lt;a href=&#34;http://xmlsummerschool.com&#34;&gt;XML Summer School&lt;/a&gt; in Oxford this September!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Playing with SPARQL Graph Store HTTP Protocol</title>
      <link>https://www.bobdc.com/blog/playing-with-sparql-graph-stor/</link>
      <pubDate>Sat, 31 Mar 2012 10:16:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/playing-with-sparql-graph-stor/</guid>
      
      
      <description><div>GETting, POSTing, PUTting, and DELETEing named graphs.</div><div>&lt;p&gt;One of the new SPARQL 1.1 specifications is the &lt;a href=&#34;http://www.w3.org/TR/sparql11-http-rdf-update/&#34;&gt;SPARQL 1.1 Graph Store HTTP Protocol&lt;/a&gt;, which is currently still a W3C Working Draft. According to its abstract, it &amp;ldquo;describes the use of HTTP operations for the purpose of managing a collection of graphs in the REST architectural style.&amp;rdquo; Recent releases of &lt;a href=&#34;http://www.openrdf.org/&#34;&gt;Sesame&lt;/a&gt; support it, so I used that to try out some of the operations described by this spec. I managed to do GET, PUT, POST, and DELETE operations with individual named graphs, so that was fun, in an RDF geek kind of way.&lt;/p&gt;
&lt;blockquote id=&#34;id133719&#34; class=&#34;pullquote&#34;&gt;Adding and deleting triples at the named graph level of granularity (as opposed to the triple level) will also make more sense for data publishing workflows where sets of data are added and deleted as a unit.&lt;/blockquote&gt;
&lt;p&gt;As this Working Draft often points out, you can also perform most if not all of these operations with a query sent to a SPARQL endpoint. Hardcore RESTafarians will prefer the new HTTP protocol way, though, because it uses basic HTTP operations with URIs that name resources (in this case, graphs of triples) and the operations to perform on them, instead of the more implementation-detail-oriented practice of embedding queries in URLs.&lt;/p&gt;
&lt;p&gt;Adding and deleting triples at the named graph level of granularity (as opposed to the triple level) will also make more sense for data publishing workflows in which sets of data—probably with their own metadata about things like &lt;a href=&#34;http://www.w3.org/2011/prov/wiki/ProvenanceRDFNamedGraph&#34;&gt;provenance&lt;/a&gt;—are added and deleted as a unit. For example, if you&amp;rsquo;re a data publisher and I&amp;rsquo;m one of your providers, I would send you a set of data to replace the current set that you&amp;rsquo;re offering from my organization, which you may have distinguished from your other data offerings in your triplestore by keeping the data from my company in its own named graph.&lt;/p&gt;
&lt;p&gt;Then again, maybe not enough people will agree, finding instead that UPDATE queries are good enough to achieve their goals. Ultimately, support for the Graph Store HTTP Protocol across the spectrum of semantic web tools will probably be tied to the extent of customer demand for it. At the very least I would expect all triplestores to support it shortly after it becomes a Recommendation, if not before.&lt;/p&gt;
&lt;p&gt;To test drive these operations, I used &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;cURL&lt;/a&gt; from the command line. cURL is part of Linux and Mac OS, and a free version for Windows is available. Your favorite programming language should also offer ways to perform GET, PUT, POST, and DELETE operations—if not natively, then with some add-in library.&lt;/p&gt;
&lt;p&gt;Everything below works, but not necessarily in the best way possible. I went back and forth between the W3C specification document and the &lt;a href=&#34;http://www.openrdf.org/doc/sesame2/system/ch08.html#d0e638&#34;&gt;Sesame documentation&lt;/a&gt; on the topic a lot (with plenty of searches about cURL command line syntax in between) and I had plenty of both hits and misses. I probably missed better ways to do several of these, and I&amp;rsquo;m open to any suggestions.&lt;/p&gt;
&lt;p&gt;Also, I have no idea what role authorization could play in all of this—you don&amp;rsquo;t want to let just anyone with HTTP access change and delete your data—but this seemed like a nice start at getting to know this new part of the SPARQL standard.&lt;/p&gt;
&lt;h2 id=&#34;id133776&#34;&gt;Setup&lt;/h2&gt;
&lt;p&gt;To start, I created a new repository (a Sesame term, not a W3C standard term) called &lt;code&gt;updatetest&lt;/code&gt;. This will be important below, because the URLs to pass to Sesame must specify the name of the repository to act on.&lt;/p&gt;
&lt;p&gt;Then, on Sesame&amp;rsquo;s SPARQL Update screen, I entered the following to insert some starter data into the &lt;code&gt;updatetest&lt;/code&gt; repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:  &amp;lt;http://learningsparql.com/ns/data#&amp;gt;
PREFIX dm: &amp;lt;http://learningsparql.com/ns/demo#&amp;gt;


INSERT DATA
{
  d:x dm:tag &amp;quot;one&amp;quot; . 
  d:x dm:tag &amp;quot;two&amp;quot; . 


  GRAPH d:g1
  { 
    d:x dm:tag &amp;quot;three&amp;quot; . 
    d:x dm:tag &amp;quot;four&amp;quot; . 
  }


  GRAPH d:g2
  { 
    d:x dm:tag &amp;quot;five&amp;quot; . 
    d:x dm:tag &amp;quot;six&amp;quot; . 
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It adds two triples to the repository&amp;rsquo;s default graph, creates named graphs called &lt;code&gt;d:g1&lt;/code&gt; and &lt;code&gt;d:g2&lt;/code&gt;, and puts two triples in each of those. (If you&amp;rsquo;re new to the use of named graphs or &lt;a href=&#34;http://www.w3.org/TR/sparql11-query/&#34;&gt;SPARQL 1.1 Update&lt;/a&gt;, which is also still in Working Draft status, see my O&amp;rsquo;Reilly book &lt;a href=&#34;http://learningsparql.com/&#34;&gt;Learning SPARQL&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;To check that this update query above had the desired effect, and to see the results of the operations described below, I entered the following query on Sesame&amp;rsquo;s Query screen. It lists all the triples currently in the dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?g ?s ?p ?o
WHERE
{
  { ?s ?p ?o }
  UNION
  { GRAPH ?g { ?s ?p ?o } }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After you execute this query you&amp;rsquo;ll see a URL-escaped version of it embedded in the URL in your browser&amp;rsquo;s address bar. (Don&amp;rsquo;t call that RESTful, though, or the &lt;a href=&#34;https://www.bobdc.com/blog/restful-sparql-queries-of-rdfa#comments&#34;&gt;RESTafarians&lt;/a&gt; will come after you!) If you&amp;rsquo;re going to try many of the examples below, you might want to bookmark the result of this query or keep it in its own browser tab so that you can reload it after trying each command line below to see the command&amp;rsquo;s effect on the data in the &lt;code&gt;updatetest&lt;/code&gt; repository.&lt;/p&gt;
&lt;h2 id=&#34;id136087&#34;&gt;GET&lt;/h2&gt;
&lt;p&gt;The GET examples should work when pasted as the URL into any browser, because a web browser that doesn&amp;rsquo;t support GET isn&amp;rsquo;t much of a browser. I did it with cURL anyway to be consistent with the rest of my examples. The following asks for everything in the &lt;code&gt;updatetest&lt;/code&gt; dataset&amp;rsquo;s default graph, and gets the &amp;ldquo;one&amp;rdquo; and &amp;ldquo;two&amp;rdquo; triples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?default
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Several command lines that I&amp;rsquo;ve pasted here may reach off to the right where you can&amp;rsquo;t see them because they&amp;rsquo;re too long. I chose not to break them up with carriage returns to make them easier to copy and paste if you want to try them.)&lt;/p&gt;
&lt;p&gt;Sesame returns the triples in the Turtle format, but in true RESTful fashion, you can ask for the result in one of the other &lt;a href=&#34;http://www.openrdf.org/doc/sesame2/system/ch08.html#table-rdf-formats&#34;&gt;formats that Sesame supports&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept: application/rdf+xml&amp;quot; http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?default
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next request asks for all the triples in named graph &lt;code&gt;http://learningsparql.com/ns/data#g1&lt;/code&gt;. Note that graph name characters that might cause problems in URL parameters are escaped in the request:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?graph=http%3A%2F%2Flearningsparql.com%2Fns%2Fdata%23g1
&lt;/code&gt;&lt;/pre&gt;
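&lt;p&gt;If you&amp;rsquo;d rather script these requests than type cURL command lines, the percent-escaping is easy to get from a standard library. Here&amp;rsquo;s a short Python sketch (the helper function name is my own invention) that builds the same request URLs shown above:&lt;/p&gt;

```python
# Build Graph Store Protocol URLs with the graph name percent-encoded,
# matching the escaped cURL examples above.
from urllib.parse import quote

def graph_store_url(service, graph=None):
    """URL for a named graph, or for the default graph when graph is None."""
    if graph is None:
        return service + "?default"
    # safe="" makes quote() escape ':', '/', and '#' as well
    return service + "?graph=" + quote(graph, safe="")

service = "http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service"
print(graph_store_url(service, "http://learningsparql.com/ns/data#g1"))
```

&lt;p&gt;A GET of that URL with your language&amp;rsquo;s HTTP client should return the same Turtle that the cURL version does.&lt;/p&gt;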
&lt;h2 id=&#34;id136130&#34;&gt;PUT&lt;/h2&gt;
&lt;p&gt;An HTTP PUT is a request to put a resource at a particular URL. The idea is to create a new resource at that URL; if something already exists there, the existing resource gets replaced.&lt;/p&gt;
&lt;p&gt;Our first PUT example puts the triples from the file test.ttl into the &lt;code&gt;http://learningsparql.com/ns/data#g2&lt;/code&gt; named graph, replacing any existing ones that may be there. (Note how the command line uses the cURL &lt;code&gt;-X&lt;/code&gt; switch to indicate the operation to perform, the @ character to point to the file with the triples to send, and the &lt;code&gt;-H&lt;/code&gt; switch to send a custom header indicating the MIME type of the data being sent.) If the &lt;code&gt;http://learningsparql.com/ns/data#g2&lt;/code&gt; graph didn&amp;rsquo;t exist, the PUT operation would create it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X PUT -d @test.ttl -H &amp;quot;Content-Type: application/x-turtle&amp;quot; http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?graph=http%3A%2F%2Flearningsparql.com%2Fns%2Fdata%23g2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(For the remainder of these commands, I changed something in test.ttl each time to make sure that I could see, when querying Sesame, that the latest version of the data really had been sent to the repository.) For the next command, I wanted to completely replace all of the &lt;code&gt;updatetest&lt;/code&gt; repository&amp;rsquo;s triples with the ones in test.ttl. Based on the other working examples and the correspondences between the &lt;a href=&#34;http://www.openrdf.org/doc/sesame2/system/ch08.html#d0e638&#34;&gt;Sesame documentation&lt;/a&gt; and the standard, I thought this would work, but it didn&amp;rsquo;t:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X PUT -d @test.ttl  -H &amp;quot;Content-Type: application/x-turtle&amp;quot; http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?default
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This more Sesame-ish URL syntax did work to replace all of the update test repository&amp;rsquo;s triples with the ones in test.ttl:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X PUT -d @test.ttl -H &amp;quot;Content-Type: application/x-turtle&amp;quot; http://localhost:8080/openrdf-sesame/repositories/updatetest/statements
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(After running it, you may want to rerun the INSERT DATA update query above to more easily see the effect of the remaining operations.)&lt;/p&gt;
&lt;h2 id=&#34;id136228&#34;&gt;POST&lt;/h2&gt;
&lt;p&gt;While a PUT command replaces any existing triples at the named URL with the ones being sent, a POST command adds the new ones to the existing ones. The following adds the test.ttl triples to the &lt;code&gt;http://learningsparql.com/ns/data#g2&lt;/code&gt; named graph:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -d @test.ttl -H &amp;quot;Content-Type: application/x-turtle&amp;quot; http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?graph=http%3A%2F%2Flearningsparql.com%2Fns%2Fdata%23g2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could not put together a command line that POSTed triples to the default graph, and I didn&amp;rsquo;t see any examples of this in the Sesame documentation.&lt;/p&gt;
&lt;h2 id=&#34;id136249&#34;&gt;DELETE&lt;/h2&gt;
&lt;p&gt;When applied to a named graph, this command&amp;rsquo;s effect is pretty obvious. The following deletes the &lt;code&gt;http://learningsparql.com/ns/data#g2&lt;/code&gt; named graph and all of its triples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X DELETE http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?graph=http%3A%2F%2Flearningsparql.com%2Fns%2Fdata%23g2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This last command deletes the default graph&amp;rsquo;s triples, leaving named graphs and their triples intact:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X DELETE http://localhost:8080/openrdf-sesame/repositories/updatetest/rdf-graphs/service?default
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Have you tried this new specification&amp;rsquo;s operations with other tools? Does anyone see clear-cut cases where they&amp;rsquo;d rather use this than send the corresponding queries to a SPARQL endpoint, or vice versa? Let me know at &lt;a href=&#34;https://plus.google.com/u/1/101006505484718936507/posts/789QKstz7Z8&#34;&gt;this Google+ post&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Pull RDF metadata out of JPEGs, MP3s, and more</title>
      <link>https://www.bobdc.com/blog/pull-rdf-metadata-out-of-jpegs/</link>
      <pubDate>Thu, 23 Feb 2012 08:54:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/pull-rdf-metadata-out-of-jpegs/</guid>
      
      
      <description><div>With open source Apache software.</div><div>&lt;p&gt;&lt;a href=&#34;http://tika.apache.org/1.0/index.html&#34;&gt;&lt;img id=&#34;id103339&#34; src=&#34;http://tika.apache.org/tika.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Tika logo&#34; width=&#34;200&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been having some fun with Apache &lt;a href=&#34;http://tika.apache.org/&#34;&gt;Tika&lt;/a&gt; lately. According to its homepage, the &amp;ldquo;Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.&amp;rdquo; It has managed to pull some sort of metadata out of just about any file I&amp;rsquo;ve pointed it at; the &lt;a href=&#34;http://tika.apache.org/1.0/formats.html&#34;&gt;list of formats&lt;/a&gt; it can handle includes PDF, JPEG, MP3, ePUB, Flash video files, Microsoft Office files, OpenOffice files, and more. What&amp;rsquo;s especially cool to me is that it can extract RDF.&lt;/p&gt;
&lt;p&gt;Tika can run as a server, or as a GUI window that you drag files onto, but I&amp;rsquo;ve mostly played with the command line version. Running it with an argument of &lt;code&gt;--help&lt;/code&gt; lists the available output options:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -jar tika-app-1.0.jar --help
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-y&lt;/code&gt; option tells it to output XMP data, which comes out as RDF/XML. I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/new-xmp-spec&#34;&gt;written here&lt;/a&gt; about XMP several times before. It&amp;rsquo;s basically an Adobe spec for media metadata expressed as RDF/XML. Being RDF/XML, any semantic web tool should be able to read it. The bad news is that, by explicitly targeting XMP, this Tika output only includes metadata defined in the relevant Adobe namespaces. Specifying the &lt;code&gt;-j&lt;/code&gt; switch instead tells Tika to give you JSON output, and you get a lot more metadata. It would be nice if Tika included an &lt;code&gt;-r&lt;/code&gt; switch to output all the metadata it can find—the same that it outputs when you request JSON output—as RDF/XML or Turtle. They&amp;rsquo;ve obviously already done the hard parts.&lt;/p&gt;
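&lt;p&gt;Until something like that &lt;code&gt;-r&lt;/code&gt; switch exists, converting the JSON output yourself is not much work. The following Python sketch shows the idea; the &lt;code&gt;meta:&lt;/code&gt; namespace and the sample property names are placeholders of my own, not anything Tika actually emits:&lt;/p&gt;

```python
# Turn a flat metadata dictionary (like one parsed from Tika's -j JSON
# output) into Turtle triples describing the source file.
def metadata_to_turtle(file_uri, metadata):
    header = "@prefix meta: <http://example.com/ns/filemeta#> .\n"
    lines = []
    for name, value in sorted(metadata.items()):
        local = "".join(c for c in name if c.isalnum())  # crude QName cleanup
        escaped = str(value).replace('\\', '\\\\').replace('"', '\\"')
        lines.append('<%s> meta:%s "%s" .' % (file_uri, local, escaped))
    return header + "\n".join(lines)

sample = {"Content-Type": "audio/mpeg", "title": "Cheese and Onions"}
print(metadata_to_turtle("file:///music/cheese.mp3", sample))
```

&lt;p&gt;Load the result into any triplestore and you can query across whatever properties each file happened to have.&lt;/p&gt;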
&lt;p&gt;Why is Tika&amp;rsquo;s ability to output media metadata in RDF so interesting to me, especially if it could someday output all the same properties in RDF that it can now output in JSON? Because different media have different metadata properties (for example, an MP3 file has different metadata from a JPEG file) and one of the greatest strengths of the RDF data model is the way it lets you accumulate property-value pairs for resources without knowing which properties you&amp;rsquo;re going to gather in advance. So, let&amp;rsquo;s say I wanted to create an application around a single set of metadata that describes a particular collection of images, music files, and related documentation. Tika plus a few selections from the wide variety of standards-compliant semantic web software out there, such as TopQuadrant&amp;rsquo;s TopBraid platform, would make this almost trivial. Of course, some extra RDFS modeling around the stored properties would add more value, but Tika, a triplestore, and very little else would give you enough to be off and running with a very powerful application.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>A brief, opinionated history of XML</title>
      <link>https://www.bobdc.com/blog/a-brief-opinionated-history-of/</link>
      <pubDate>Wed, 25 Jan 2012 09:01:51 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-brief-opinionated-history-of/</guid>
      
      
      <description><div>From someone who had a front row seat.</div><div>&lt;p&gt;There are a few histories of XML out there, but I still find myself explaining certain points to people surprisingly often, so I thought I&amp;rsquo;d write them down. If you don&amp;rsquo;t want to read this whole thing, I&amp;rsquo;ll put the moral of the story right at the top:&lt;/p&gt;
&lt;blockquote id=&#34;id103361&#34; class=&#34;pullquote&#34;&gt;They didn&#39;t understand that it wasn&#39;t designed to meet their needs. It was designed to make electronic publishing in multiple media easier.&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;XML was designed as a simplified subset of SGML to make electronic publishing in multiple media easier. People found it useful for other things. When some people working on those other things found that XML wasn&amp;rsquo;t perfect for their needs, they complained and complained about how badly designed XML was. They didn&amp;rsquo;t understand that it wasn&amp;rsquo;t designed to meet their needs. It was designed to make electronic publishing in multiple media easier.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;id103388&#34;&gt;Automated typesetting and page layout&amp;hellip;&lt;/h2&gt;
&lt;p&gt;In the 1970s, computerized typesetting made automated page layout much easier, but three guys at IBM named Goldfarb, Mosher, and Lorie got tired of the proprietary nature of the typesetting codes used in these systems, so they came up with a nonproprietary, generic way to store content for automated publishing that would make it easier to convert this content for publication on multiple systems. This became the ISO standard &lt;a href=&#34;http://en.wikipedia.org/wiki/SGML&#34;&gt;SGML&lt;/a&gt;, and the standardized nonproprietary part made it popular among U.S. defense contractors, legal publishers, and other organizations that did large-scale automated publishing.&lt;/p&gt;
&lt;p&gt;When I first got involved, SGML was gaining popularity among publishers creating CD-ROMs and bound books from the same content, because they could create and edit an SGML version and then run scripts to publish that content in the various media. The structure of an SGML document type (for example, the available text elements and element relationships in a set of legal court cases, or the elements and element relationships that you could use in a set of aircraft repair manuals) was specified in something called a &lt;a href=&#34;http://en.wikipedia.org/wiki/Document_Type_Definition&#34;&gt;DTD&lt;/a&gt;, which had its own syntax and was part of the SGML standard. The scripts to convert SGML documents were usually written using a language and engine called Omnimark, which was a proprietary product, but a perl-based alternative was also available.&lt;/p&gt;
&lt;p&gt;When Tim Berners-Lee was wondering how exactly to specify that one of his new hypertext documents had a title here, a subtitle there, and a link in the middle of a paragraph that led to another document, SGML was a logical choice—it was a text-based, flexible, non-proprietary, standardized way to specify document structure with various tools available to help you work with those documents. That&amp;rsquo;s why HTML tags are delimited with angle brackets: because SGML elements were (nearly always) delimited with angle brackets. Dan Connolly sketched out the &lt;a href=&#34;http://lists.w3.org/Archives/Public/www-talk/1992MayJun/0020.html&#34;&gt;first HTML DTD&lt;/a&gt; in 1992.&lt;/p&gt;
&lt;p&gt;SGML&amp;rsquo;s designers couldn&amp;rsquo;t see into the future, so they deliberately made it very flexible. For example, you could use other delimiters for element tags besides angle brackets, but everyone used angle brackets. SGML parsing programs were still required to account for the possibility that a document used other delimiters, and the possibility that many other options had been reset, so these parsers were large and complex, and few were available to choose from. By the mid-90s, enough best practices had developed that Sun Microsystems&amp;rsquo; Jon Bosak had the idea for a simplified, slimmer version of SGML that assumed a lot of default settings and could be parsed by a smaller program—maybe even a program written in Sun&amp;rsquo;s new Java language—and that could be transmitted over the web when necessary. The documents themselves would be easier to share over the web than typical SGML documents, following the example of HTML documents.&lt;/p&gt;
&lt;p&gt;Around this time SGML was considered a niche technology in the electronic publishing industry, and I worked at several jobs where I wrote and modified DTDs and Omnimark scripts to create and maintain document conversion systems. I also went to the relevant SGML conferences, where I got to know several of the people who eventually joined Jon to create the simplified version of SGML. (Many are still friends.) At first this group called their new spec WebSGML, but eventually they named it XML.&lt;/p&gt;
&lt;p&gt;&lt;span id=&#34;whydtds&#34; /&gt;Many people would fail to appreciate the value of one key design decision: as a valid subset of SGML, XML documents could still be processed with Omnimark and other existing SGML tools. This meant that on that day in 1998 when XML became an official W3C standard, we already had plenty of software out there, including programs like Adobe&amp;rsquo;s special SGML edition of FrameMaker, that could process XML documents right away. This gave the new standard a running start, and XML may not have gotten anywhere without it, because those of us using the existing tools didn&amp;rsquo;t have to wait around for new tools for the new standard and then work out how to incorporate them into our publishing workflows. We already had tools and workflows that could take advantage of the new standard.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve heard some people describe certain things that SGML specialists didn&amp;rsquo;t like about XML, but these people don&amp;rsquo;t understand that XML was invented by and for SGML specialists, and it made SGML peoples&amp;rsquo; lives much easier. For one thing, we weren&amp;rsquo;t so dependent on Omnimark anymore; at least one of my former employers switched from SGML to XML just so they could ditch Omnimark. XML&amp;rsquo;s companion standard &lt;a href=&#34;http://en.wikipedia.org/wiki/XSLT&#34;&gt;XSLT&lt;/a&gt; let us convert XML to a variety of formats using robust, free, standardized software, and as the web became a bigger publishing medium we found ourselves writing XSLT stylesheets to convert the same XML documents to print, CD-ROM, and HTML. Electronic publishing had never been so easy.&lt;/p&gt;
&lt;h2 id=&#34;id103576&#34;&gt;&amp;hellip;and beyond&amp;hellip;&lt;/h2&gt;
&lt;p&gt;Then along came the dot com boom. People got excited about how &amp;ldquo;seamless e-commerce&amp;rdquo; would change everything. People would save money as obsolete middlemen were removed from old-fashioned transactions, and people would make lots of money by taking part in this streamlining (selling pick axes during a gold rush) or by automating the buying and selling of products.&lt;/p&gt;
&lt;p&gt;Orders would be transmitted over this fabulous free network known as The Internet instead of over the expensive, proprietary &lt;a href=&#34;http://en.wikipedia.org/wiki/Electronic_Data_Interchange&#34;&gt;EDI&lt;/a&gt; networks. But when my computer sent an order to yours, how exactly would this order be represented? XML provided a good syntax: it was plain text, easy to transmit and parse, and could group labeled pieces of information in fairly arbitrary structures while remaining an open, straightforward standard. (When I say &amp;ldquo;straightforward&amp;rdquo;, I&amp;rsquo;m talking about the &lt;a href=&#34;http://www.w3.org/TR/1998/REC-xml-19980210&#34;&gt;original spec&lt;/a&gt; here, not the collection of related specs that most people are referring to when they complain about the complexity of XML. More on this &lt;a href=&#34;#schema&#34;&gt;below&lt;/a&gt;.) This let people send any combination of information back and forth, regardless of the potential lack of compatibility between the back end systems that the different parties were using.&lt;/p&gt;
&lt;p&gt;So, as an important technology of the dot com boom, XML became trendy, and it was a heady feeling to suddenly be an expert in a trendy technology. I&amp;rsquo;ll never forget hearing it mentioned in a Microsoft ad on a prime time network TV show; sure, it was spoken by the character of a geek who normal people weren&amp;rsquo;t supposed to understand, but still, this subset of a niche technology that my friends helped to invent was mentioned on prime time network TV. Three different XML conference series were running, and they were much better attended than the &lt;a href=&#34;http://www.idealliance.org/events/xtech-2012&#34;&gt;single one&lt;/a&gt; that&amp;rsquo;s left now. The best part was that there was enough money behind some of those conferences to fly most speakers in and put them up in hotels, which got me my first trips to London and Silicon Valley.&lt;/p&gt;
&lt;p&gt;XML wasn&amp;rsquo;t really a perfect fit for ecommerce systems, though. The elements vs. attributes distinction, which publishing systems used to distinguish between content to publish and metadata about that content, didn&amp;rsquo;t have a clear role when describing transactions that weren&amp;rsquo;t content for publishing. XML had some odd data types (NMTOKEN? CDATA?) that only applied to attribute values, instead of traditional data types like integer, string, and boolean that could be applied to content as well as attributes.&lt;/p&gt;
&lt;p&gt;And then there was that strange DTD syntax: if XML was so good at describing structure, why wasn&amp;rsquo;t XML used to describe the structure of a set of documents? The answer is &lt;a href=&#34;#whydtds&#34;&gt;above&lt;/a&gt;, but it didn&amp;rsquo;t get publicized very well, so many people complained about DTD syntax. Everyone agreed that an XML-based schema syntax that provided for traditional data types would be a Good Thing, so various groups came up with &lt;a href=&#34;http://docstore.mik.ua/orelly/xml/schema/appa_03.htm#xmlschema-APP-A-SECT-3.2&#34;&gt;proposals&lt;/a&gt; and the W3C convened a Working Group to review these proposals and come up with a single standard.&lt;/p&gt;
&lt;p&gt;But, in the words of Cyndi Lauper, &lt;a href=&#34;http://www.youtube.com/watch?v=3aK-UjR3Oj4&#34;&gt;money changes everything&lt;/a&gt;. XML itself was assembled by eleven specialists in a niche technology, SGML, who wanted to make standardized electronic publishing simpler, and they managed to stay under most radar systems and come out with something &lt;a href=&#34;http://www.w3.org/TR/1998/REC-xml-19980210&#34;&gt;simple and lean&lt;/a&gt;. However, when the XML Schema Working Group convened, many big and small companies were smelling lots of money and wanted to influence the results. Of the 31 companies that sent representatives to this Working Group (31!), many had little or nothing to do with publishing, electronic or otherwise. There were database vendors such as Microsoft, Informix, Software AG, IBM and Oracle (to be fair, large software companies have always been up there with legal publishers and defense contractors as believers in automated publishing technology; note where SGML got its start). There were successful or aspiring B2B ecommerce vendors such as CommerceOne, Progress Software, and webMethods. Microsoft, Xerox, CommerceOne, IBM, Oracle, Progress Software, and Sun were each interested enough to send two representatives to the committee, so there were a lot of cooks working on this broth.&lt;/p&gt;
&lt;p&gt;The result was a &lt;a href=&#34;http://www.w3.org/TR/#tr_XML_Schema&#34;&gt;three-part specification&lt;/a&gt;: Part 0 was a primer, Part 1 specified how to define document structures, and Part 2 described basic data types and how to extend them. Part 2 is pretty good, and also provides the basis for RDF data typing. Part 1, in my opinion, ended up being an ugly, complicated mess in its attempt to serve so many powerful masters.&lt;/p&gt;
&lt;p&gt;Two members of the original eleven-member XML team, James Clark and Makoto Murata, developed &lt;a href=&#34;http://relaxng.org/&#34;&gt;RELAX NG&lt;/a&gt;, an alternative to Part 1 that was both simpler and more powerful. Clark had written the only open source SGML parser, and the first XSLT processor, and came up with the name &amp;ldquo;XML,&amp;rdquo; among his many other achievements; he&amp;rsquo;s also written some &lt;a href=&#34;http://code.google.com/p/jing-trang/&#34;&gt;great software&lt;/a&gt; to implement RELAX NG and convert between schema formats. RELAX NG never became as popular as XML Schema, because it didn&amp;rsquo;t have the big industry names behind it, and because it was optimized around the original XML use case: describing content for publication.&lt;/p&gt;
&lt;p&gt;Despite a complex syntax, incompatibilities among parsers, an often inscrutable spec, and less expressive power than RELAX NG, the W3C XML Schema specification has become popular because it&amp;rsquo;s a W3C standard that addresses the original main problems of XML for ecommerce: it specifies document structures using XML, it lets you use traditional datatypes, and it has the added bonus for many developers of making it easier to round-trip XML elements to Java data structures. (After railing against the influence of this last part for years, I learned that it was primarily the work of Matthew Fuchs, an old friend I&amp;rsquo;ve known since he was finishing up his Ph.D. in computer science at NYU&amp;rsquo;s &lt;a href=&#34;http://cims.nyu.edu/&#34;&gt;Courant Institute&lt;/a&gt; when I was doing my masters there in the mid-nineties. He was the only other person there who even knew what SGML was.) So, XML Schema continues to be used by many large organizations to store data that doesn&amp;rsquo;t fit neatly into relational tables. In fact, &lt;a href=&#34;http://www.topquadrant.com&#34;&gt;TopQuadrant&lt;/a&gt; has been adding more and more features to the TopBraid platform to make it easier to incorporate such data into a system that uses semantic web standards.&lt;/p&gt;
&lt;h2 id=&#34;id103216&#34;&gt;&amp;hellip;and back.&lt;/h2&gt;
&lt;p&gt;Getting back to the topic of leaner, simpler alternatives for representing information of potentially arbitrary structure, the JavaScript-based &lt;a href=&#34;http://json.org/&#34;&gt;JSON&lt;/a&gt; format started getting popular around 2006. The third paragraph of its &lt;a href=&#34;http://en.wikipedia.org/wiki/JSON&#34;&gt;Wikipedia page&lt;/a&gt; flatly states that &amp;ldquo;it is used primarily to transmit data between a server and web application, serving as an alternative to XML.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;A Google search for &lt;a href=&#34;https://www.google.com/search?q=%22json+replace+xml%22&#34;&gt;&amp;ldquo;json replace xml&amp;rdquo;&lt;/a&gt; gets over 5,000 hits. (That&amp;rsquo;s with the quotes around the search terms, to make Google search for the exact phrase. Without the quotes, it gets almost five million hits.) I like JSON, and see how it can replace many of the uses of XML that have been around since the dot com boom days, but anyone who thinks it can completely replace XML doesn&amp;rsquo;t understand what XML was designed for. Documents with inline markup (or, in XML geekspeak, &amp;ldquo;mixed content&amp;rdquo;—for example, the way the HTML &lt;code&gt;a&lt;/code&gt; element can be in the middle of a sentence within a &lt;code&gt;p&lt;/code&gt; element) would theoretically work fine in JSON, but in practice, it would be too easy to screw it up when editing it with a text editor by accidentally adding or removing a single curly brace. Tools to hide the syntax behind a more intuitive interface may address the issue, but dependence on such tools was something that the original XML designers wanted to avoid. And frankly, when I picture a complex prose document stored in JSON, I hear the ghost of Microsoft&amp;rsquo;s &lt;a href=&#34;http://en.wikipedia.org/wiki/Rich_Text_Format&#34;&gt;RTF&lt;/a&gt; dragging chains through the attic.&lt;/p&gt;
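To make that mixed-content fragility concrete, here is one hypothetical way an inline link in the middle of a sentence might be shoehorned into JSON. The property names and nesting scheme are invented for illustration; there is no standard mapping:

```javascript
// A made-up JSON encoding of the mixed-content sentence
// 'Visit [this site] for more.' where 'this site' is an inline link.
// An ordered array carries the mix of text and child elements; add or
// delete one brace while hand-editing something like this and the
// whole document fails to parse.
var paragraph = {
  p: [
    "Visit ",
    { a: { href: "http://example.com/", children: ["this site"] } },
    " for more."
  ]
};

// Round-tripping through the parser shows the structure survives intact:
var copy = JSON.parse(JSON.stringify(paragraph));
```

The equivalent markup is just an a element sitting inside a p element's text, which is exactly the kind of content the original XML designers had in mind.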
&lt;p&gt;Between JSON&amp;rsquo;s growing role as an inter-computer data format and RELAX NG&amp;rsquo;s foothold in schemas like DocBook and companies like LexisNexis, I see the XML infrastructure getting back to its original use cases, which makes good sense to me. Each year at the &lt;a href=&#34;http://xmlsummerschool.com/&#34;&gt;XML Summer School&lt;/a&gt; in Oxford, it&amp;rsquo;s been very interesting to see the new things people are doing with XML, especially as XQuery-based XML databases like &lt;a href=&#34;http://www.marklogic.com/&#34;&gt;MarkLogic&lt;/a&gt; and &lt;a href=&#34;http://exist.sourceforge.net/&#34;&gt;eXist&lt;/a&gt; grow in power. I&amp;rsquo;ve been chairing the semantic web track at the summer school for the past few years and hardly been involved in XML at all, but it&amp;rsquo;s always great to hear what my old friends are up to. Especially when there&amp;rsquo;s great beer available.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bob/sgmlfree/&#34;&gt;&lt;img id=&#34;id104074&#34; height=&#34;150&#34; src=&#34;https://www.bobdc.com/img/main/sgmlcdsmall.jpg&#34; border=&#34;0&#34; alt=&#34;SGML CD cover&#34;/&gt;&lt;/a&gt;     &lt;a href=&#34;http://www.snee.com/bob/xmlann/&#34;&gt;&lt;img id=&#34;id104092&#34; height=&#34;150&#34; src=&#34;https://www.bobdc.com/img/main/xmlasbig.gif&#34; border=&#34;0&#34; alt=&#34;XML Annotated Spec cover&#34;/&gt;&lt;/a&gt;     &lt;a href=&#34;http://www.snee.com/bob/xsltquickly/&#34;&gt;&lt;img id=&#34;id104109&#34; height=&#34;150&#34; src=&#34;https://www.bobdc.com/img/main/XQcoverSmall.jpg&#34; border=&#34;0&#34; alt=&#34;XSLT Quickly cover&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2012">2012</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Having a Blue Ridge Christmas</title>
      <link>https://www.bobdc.com/blog/having-a-blue-ridge-christmas/</link>
      <pubDate>Fri, 16 Dec 2011 09:56:01 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/having-a-blue-ridge-christmas/</guid>
      
      
      <description><div>They&#39;re playing my song!</div><div>&lt;p&gt;A few months ago I saw a &lt;a href=&#34;http://cvillechristmascd.wordpress.com/about/&#34;&gt;call for contributions&lt;/a&gt; of recordings of original holiday songs for a CD to be called &amp;ldquo;A Charlottesville Songwriters Christmas&amp;rdquo; to benefit a &lt;a href=&#34;http://www.kidpanalley.org/&#34;&gt;local charity&lt;/a&gt;. Around here there seems to be a law that when you name a business you have to name it either Jefferson (whatever), Piedmont (whatever), or Blue Ridge (whatever), so I decided to write a song whose name is a variation on &amp;ldquo;Blue Christmas&amp;rdquo; called &amp;ldquo;Blue Ridge Christmas.&amp;rdquo; I thought about trying to put together a band to record it, but some friends who I&amp;rsquo;ve &lt;a href=&#34;https://www.facebook.com/pages/Jazz-Collective-9/155717025518&#34;&gt;played jazz&lt;/a&gt; with are also in a &lt;a href=&#34;http://soultransitband.com/&#34;&gt;local soul band&lt;/a&gt; with a really great singer (note his &lt;a href=&#34;http://www.jerusalemchurchva.org/&#34;&gt;day job&lt;/a&gt;), so I offered it to them, and they made a great recording of it.&lt;/p&gt;
&lt;p&gt;For the holiday season, the Charlottesville Downtown Business Association made a &lt;a href=&#34;http://youtu.be/uMvhYX63ds4&#34;&gt;video&lt;/a&gt; to encourage people to shop on the downtown mall and they chose this recording as the music. It was fun for me to see it, and it&amp;rsquo;s nice to know that letting my friends hear the song won&amp;rsquo;t mean ripping it from a charity CD and putting it where people can download it. This doesn&amp;rsquo;t quite compare with my &lt;a href=&#34;http://www.mcylinder.com/&#34;&gt;brother&amp;rsquo;s&lt;/a&gt; work for &lt;a href=&#34;http://www.youtube.com/watch?v=Pa0oA5IxwJk&#34;&gt;VW&lt;/a&gt; or &lt;a href=&#34;http://www.youtube.com/watch?v=h5n7bQdW0CQ&#34;&gt;Wendy&amp;rsquo;s&lt;/a&gt;, but it&amp;rsquo;s fun to know that it came out well and that lots of people can see the video—and that the song has had a bit of &lt;a href=&#34;https://www.facebook.com/permalink.php?story_fbid=319707264724842&amp;amp;id=115486258480278&#34;&gt;airplay&lt;/a&gt; on WNRN!&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/uMvhYX63ds4&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Javascript from the command line</title>
      <link>https://www.bobdc.com/blog/javascript-from-the-command-li/</link>
      <pubDate>Mon, 21 Nov 2011 08:46:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/javascript-from-the-command-li/</guid>
      
      
      <description><div>In Linux and Windows. (Goodbye Cscript!)</div><div>&lt;p&gt;&lt;a href=&#34;http://www.mozilla.org/rhino/&#34;&gt;&lt;img id=&#34;id103337&#34; src=&#34;https://www.bobdc.com/img/main/rhino.jpg&#34; width=&#34;200&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Mozilla Rhino&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A few years ago I wrote about &lt;a href=&#34;https://www.bobdc.com/blog/windows-command-line-text-proc&#34;&gt;Windows command line text processing with Javascript&lt;/a&gt; using Microsoft&amp;rsquo;s &lt;a href=&#34;http://technet.microsoft.com/en-us/library/bb490887.aspx&#34;&gt;Cscript&lt;/a&gt; utility. I was surprised to find no Linux equivalent, and while I&amp;rsquo;d heard of &lt;a href=&#34;http://www.mozilla.org/rhino/&#34;&gt;Mozilla Rhino&lt;/a&gt;, I had the vague idea that using it meant integrating it into other applications.&lt;/p&gt;
&lt;p&gt;After some hunting, I learned that Rhino includes a jar file that makes it easy to run a script from the command line. Once you have it, running a script named myscript.js is as simple as this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -jar js.jar myscript.js
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&amp;rsquo;re really interested in text processing, you can pipe and redirect the output.&lt;/p&gt;
&lt;p&gt;After I downloaded Rhino and got this to work I searched my hard disk and found that js.jar was already there in several places: with OpenOffice, with Swoop, and with Eclipse (and therefore with TopBraid Composer), so I&amp;rsquo;ve had it right under my nose for years. &lt;a href=&#34;http://www.mcylinder.com/&#34;&gt;My brother&lt;/a&gt; checked his Mac and found that js.jar came with an &lt;a href=&#34;http://cmusphinx.sourceforge.net/sphinx4/&#34;&gt;open source speech recognizer&lt;/a&gt; that he had installed.&lt;/p&gt;
&lt;p&gt;One neat part was that some fairly complex JavaScript scripts that I had run with Cscript ran with js.jar after one minor change that actually improved the scripts: instead of a &lt;code&gt;print()&lt;/code&gt; function for basic text output, Cscript has a &lt;code&gt;WScript.Echo()&lt;/code&gt; call (WScript is a more Windows-oriented version of Cscript), so I had put the following function in my command-line JavaScript scripts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function print(OutString) {
  WScript.Echo(OutString);
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because js.jar supports a native &lt;code&gt;print()&lt;/code&gt; function, the only change necessary to any of my scripts was to comment out the three lines above, and js.jar then happily ran my existing scripts.&lt;/p&gt;
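A safer variation on that shim would define print() only when the engine lacks one, so the same script runs unchanged everywhere without commenting anything out. This is just a sketch that assumes console.log() as the fallback; under Cscript the function body would call WScript.Echo() instead:

```javascript
// Hypothetical portability shim: Rhino's js.jar already supplies a global
// print(), so this definition only takes effect on engines that lack one.
if (typeof print === "undefined") {
  var print = function (outString) {
    console.log(outString); // would be WScript.Echo(outString) under Cscript
  };
}
print("hello from the command line");
```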
&lt;p&gt;If you start up js.jar without providing a script name as an argument, you get a js command line. Enter &lt;code&gt;help()&lt;/code&gt; there to see some interesting commands that you can add to your scripts—for example, &lt;code&gt;readUrl()&lt;/code&gt;. (Note that these commands are case-sensitive.)&lt;/p&gt;
&lt;p&gt;I mostly tested this on a Windows machine, but it all worked fine on a machine running the latest Ubuntu.&lt;/p&gt;
&lt;p&gt;The reason I got interested in this recently was that I had just pulled a ton of menu definition JavaScript off a website, with the majority of it being JSON definitions of the website&amp;rsquo;s menu structure. I wanted to store all these definitions in SKOS RDF. Once I added and redefined a few functions in the JavaScript code that I had downloaded, I ran it all and redirected the output to RDF files pretty easily. I&amp;rsquo;m definitely going to have some more fun with this.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Publishing academic research data</title>
      <link>https://www.bobdc.com/blog/publishing-academic-research-d/</link>
      <pubDate>Mon, 17 Oct 2011 13:54:58 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/publishing-academic-research-d/</guid>
      
      
      <description><div>My geeky perspective and some broader perspectives.</div><div>&lt;p&gt;&lt;a href=&#34;http://opencitations.wordpress.com/2011/10/17/the-five-stars-of-online-journal-articles-3/&#34;&gt;&lt;img id=&#34;id103344&#34; src=&#34;https://www.bobdc.com/img/main/5stars.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;200&#34; alt=&#34;David Shotton&#39;s 5 stars of academic publishing&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Along with Jo Rabin&amp;rsquo;s talk that I mentioned here &lt;a href=&#34;https://www.bobdc.com/blog/displaying-sparql-results-on-a&#34;&gt;earlier this month&lt;/a&gt;, another inspirational talk in the recent &lt;a href=&#34;http://xmlsummerschool.com/&#34;&gt;XML Summer School&lt;/a&gt; &lt;a href=&#34;http://xmlsummerschool.com/curriculum-2011/trends-and-transients-2011/&#34;&gt;Trends and Transients&lt;/a&gt; track was &amp;ldquo;Applying XML and semantic technologies to liberate infectious disease data&amp;rdquo; by Oxford University zoology professor &lt;a href=&#34;http://www.zoo.ox.ac.uk/staff/academics/shotton_dm.htm&#34;&gt;David Shotton&lt;/a&gt;. He described how, while assembling a paper on leptospira infection in urban slums, he used data and metadata from the project to create the version described in a separate paper, &lt;a href=&#34;http://ora.ox.ac.uk/objects/uuid%3A3e39b4ec-8cdd-40d6-8648-a5d7b2946bb9&#34;&gt;Semantically enhanced version of a research article from PLoS Neglected Tropical Diseases&lt;/a&gt;. (Note the bottom of that page, where it lets you pull down bibliographic data in your choice of RDF serializations. Also, don&amp;rsquo;t miss the &lt;a href=&#34;http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest/&#34;&gt;semantically enhanced paper&lt;/a&gt; itself, and make sure to click around in it.)&lt;/p&gt;
&lt;p&gt;After his presentation one audience member asked how an academic department with limited resources and technical background could move in this same direction without attempting to reproduce the full infrastructure, and Professor Shotton suggested that they start by putting their research data on the web along with some metadata about it. This got me thinking about Tim Berners-Lee&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/DesignIssues/LinkedData.html&#34;&gt;Linked Data 5 Stars&lt;/a&gt;, a series of incremental steps toward publishing open linked data in machine-readable standardized formats. I raised my hand and suggested to Shotton that, building on his answer to that question, an alternative version of the five stars for academic researchers could provide a valuable guideline for others interested in following in his footsteps. And he&amp;rsquo;s done it! He just published &lt;a href=&#34;http://opencitations.wordpress.com/2011/10/17/the-five-stars-of-online-journal-articles-3/&#34;&gt;The Five Stars of Online Journal Articles&lt;/a&gt; on his blog, which points to a longer version of the article that he&amp;rsquo;s submitted to &lt;a href=&#34;http://www.nature.com/&#34;&gt;Nature&lt;/a&gt;. My original idea was more of a revision of Berners-Lee&amp;rsquo;s original five stars, but Shotton drew on his extensive academic publishing experience to bring in a lot of bigger-picture issues such as peer review and specific repositories that could host such data.&lt;/p&gt;
&lt;p&gt;I had been thinking about the potential of academic researchers publishing data using Linked Data principles before this year&amp;rsquo;s XML Summer School; one reason I started the &lt;a href=&#34;http://www.meetup.com/cvillesemweb/&#34;&gt;Charlottesville Semantic Web Meetup&lt;/a&gt; was to find people at the University of Virginia who were interested in pursuing this. I recently learned about someone else who&amp;rsquo;s been thinking hard about issues around publication of research data: UCLA&amp;rsquo;s &lt;a href=&#34;http://polaris.gseis.ucla.edu/cborgman/Chriss_Site/Welcome.html&#34;&gt;Christine Borgman&lt;/a&gt;, whose paper &lt;a href=&#34;http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1869155&#34;&gt;The Conundrum of Sharing Research Data&lt;/a&gt; appeared in the June issue of the Journal of the American Society for Information Science and Technology. (Click &amp;ldquo;One-Click Download&amp;rdquo; on that page to retrieve the paper itself.)&lt;/p&gt;
&lt;p&gt;As I realized when I read David Shotton&amp;rsquo;s article, I&amp;rsquo;ve been focused on the technical issues, but there are many others to consider. Here are a few quotes from Borgman&amp;rsquo;s abstract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This article explores the complexities of data, research practices, innovation, incentives, economics, intellectual property, and public policy associated with the data sharing conundrum.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Rationales for sharing data vary along two dimensions: whether motivated by research concerns or by leveraging public investments, and whether intended to serve the interests of researchers who produce data or the interests of potential re-users of data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Four rationales for sharing research data are identified and positioned on these dimensions. Researchers’ incentives to share their data depend not only on these rationales, but on characteristics of their data and research practices, funding agency policies, and resources for data management. Much more is understood about why researchers do not share data than about when, why, and how researchers do share data, or about when, how, and why researchers or the public reuse data. The model and research agenda are illustrated with examples from the sciences, social sciences, and humanities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here&amp;rsquo;s one quote from the main body of the article:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If the rewards of big data are to be reaped, then researchers who produce those data must share them, and do so in such a way that the data are interpretable and reusable by others. Underlying this simple statement are thick layers of complexity about the nature of data, research, innovation, and scholarship, incentives and rewards, economics and intellectual property, and public policy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Her paper goes on to describe these layers. And, I have to love any academic paper that refers to a &amp;ldquo;dirty little secret.&amp;rdquo; I&amp;rsquo;ll let you find that part yourself. While Borgman&amp;rsquo;s paper doesn&amp;rsquo;t get down to the level of data models and serializations for sharing data, if you&amp;rsquo;re at all interested in how Linked Data may benefit the academic research world, her paper is really worth reading.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/linked-data">linked-data</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Displaying SPARQL results on a mobile phone</title>
      <link>https://www.bobdc.com/blog/displaying-sparql-results-on-a/</link>
      <pubDate>Tue, 04 Oct 2011 09:31:53 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/displaying-sparql-results-on-a/</guid>
      
      
      <description><div>Nicely.</div><div>&lt;blockquote id=&#34;id103334&#34; class=&#34;pullquote&#34;&gt;The ability to create mobile-native web apps with SPARQL and simple XSLT stylesheets should open up a lot of possibilities.&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&#34;http://xmlsummerschool.com/faculty-2011/#rabin&#34;&gt;Jo Rabin&lt;/a&gt;&amp;rsquo;s &amp;ldquo;Mobile is not The Future (It’s Now)&amp;rdquo; presentation in the &lt;a href=&#34;http://xmlsummerschool.com/curriculum-2011/trends-and-transients-2011/&#34;&gt;Trends and Transients&lt;/a&gt; portion of this year&amp;rsquo;s &lt;a href=&#34;http://xmlsummerschool.com/&#34;&gt;XML Summer School&lt;/a&gt; (and the reading he suggested, such as &lt;a href=&#34;http://communities-dominate.blogs.com/brands/2011/09/22-percent-changed-their-mind-while-in-the-store-why-every-retailer-needs-a-mobile-strategy.html&#34;&gt;this Tomi Ahonen blog post&lt;/a&gt;) got me thinking much harder about mobile delivery. One of my first ideas was how easy the &lt;a href=&#34;http://jquerymobile.com/&#34;&gt;jQuery Mobile&lt;/a&gt; Javascript library could make it to deliver SPARQL query results, and in less than 30 minutes I wrote an &lt;a href=&#34;http://snee.com/sparql/xslt/SPARQLMobileResults.xsl&#34;&gt;XSLT stylesheet&lt;/a&gt; that can take the &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-XMLres-20080115/&#34;&gt;SPARQL Query Results XML Format&lt;/a&gt; version of any SPARQL query result and use this library to render the results nicely for mobile phones.&lt;/p&gt;
&lt;p&gt;A SPARQL query that SELECTs more than one variable returns a two-dimensional grid of information, but a more one-dimensional display works better on phones, so the initial display created by my stylesheet is a series of buttons that show the values of the first selected variable. Clicking one displays the values that go with it—the values that would have been the rest of its row in a two-dimensional display. Below, on both an LG Ally running Android and on an iPhone, you can see the stylesheet&amp;rsquo;s rendering of DBpedia&amp;rsquo;s results from a query for the name, artist, release date, and URI of albums produced by Timbaland. Below that you can see the same thing on the Ally after I turned the phone sideways. (Click either image to see a larger version.) You can see the results of the query in your own browser, formatted for mobile, &lt;a href=&#34;http://snee.com/sparql/m/timbaland.html&#34;&gt;here&lt;/a&gt;; for context (and to see the actual query) see the &lt;a href=&#34;http://dbpedia.org/snorql/?query=PREFIX+dbpedia-owl%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E+%0D%0APREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0ASELECT+%3FalbumName+%3FartistName+%3FreleaseDate++%3FalbumURL+WHERE+%0D%0A%7B+%3FalbumURL+dbpedia-owl%3Aproducer+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTimbaland%3E+%3B++++++++%0D%0A++++++++++++dbpedia-owl%3Aartist+%3Fartist+%3B++++++++%0D%0A++++++++++++dbpedia-owl%3AreleaseDate+%3FreleaseDate+%3B++++++++%0D%0A++++++++++++foaf%3Aname+%3FalbumName+.++%0D%0A++%3Fartist+foaf%3Aname+%3FartistName.++++%0D%0A++FILTER+%28+lang%28%3FartistName%29+%3D+%27en%27+%29++++%0D%0A++FILTER+%28+lang%28%3FalbumName%29+%3D+%27en%27+%29%0D%0A%7D%0D%0AORDER+BY+%3FreleaseDate%0D%0A&#34;&gt;DBpedia default display&lt;/a&gt; of the results.&lt;/p&gt;
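The stylesheet does this grouping in XSLT over the XML results format; as a rough sketch of the same idea in JavaScript, here is how the grouping by first variable might look over the analogous SPARQL JSON results layout. The function name and the sample row below are made up, though the row is shaped like a real binding:

```javascript
// Collapse a two-dimensional SPARQL result set into one entry per row,
// keyed by the first selected variable, with the remaining bindings
// stored as the details to reveal when the row's button is pressed.
function groupByFirstVariable(results) {
  var vars = results.head.vars;
  var first = vars[0];
  return results.results.bindings.map(function (row) {
    var details = {};
    vars.slice(1).forEach(function (v) {
      if (row[v]) { details[v] = row[v].value; }
    });
    return { label: row[first] ? row[first].value : "", details: details };
  });
}

// A made-up row shaped like one binding from the Timbaland query:
var demo = {
  head: { vars: ["albumName", "artistName", "releaseDate"] },
  results: { bindings: [
    { albumName: { type: "literal", value: "Shock Value" },
      artistName: { type: "literal", value: "Timbaland" },
      releaseDate: { type: "literal", value: "2007-04-03" } }
  ] }
};
var grouped = groupByFirstVariable(demo);
// grouped[0].label holds the button text; grouped[0].details holds the rest.
```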
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/AndroidAndIPhone.jpg&#34;&gt;&lt;img id=&#34;id103496&#34; src=&#34;https://www.bobdc.com/img/main/AndroidAndIPhone.jpg&#34; border=&#34;0&#34; alt=&#34;Android LG Ally and iPhone showing SPARQL results&#34; width=&#34;300&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/AndroidHorz.jpg&#34;&gt;&lt;img id=&#34;id103515&#34; src=&#34;https://www.bobdc.com/img/main/AndroidHorz.jpg&#34; border=&#34;0&#34; alt=&#34;Horizontal Android LG Ally&#34; width=&#34;300&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For another demo query, I asked DBpedia for the names, revenue figures, foundation year, and descriptions of CRM vendors. Compare the &lt;a href=&#34;http://snee.com/sparql/m/crm.html&#34;&gt;version formatted for mobiles&lt;/a&gt; with &lt;a href=&#34;http://dbpedia.org/snorql/?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+%0D%0APREFIX+dbo%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0D%0ASELECT+%3Fname+%3Ffounded+%3Frevenue+%3Fdescription+WHERE+%7B%0D%0A++%3Fco+dct%3Asubject+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3ACRM_software_companies%3E+%3B%0D%0A++++++rdfs%3Alabel+%3Fname+%3B%0D%0A++++++dbo%3Arevenue+%3Frevenue+%3B%0D%0A++++++rdfs%3Acomment+%3Fdescription+%3B%0D%0A++++++dbo%3AformationYear+%3Ffounded+.%0D%0A++FILTER+%28+lang%28%3Fname%29+%3D+%27en%27+%29%0D%0A++FILTER+%28+lang%28%3Fdescription%29+%3D+%27en%27+%29%0D%0A%7D%0D%0AORDER+BY+%3Fname%0D%0A%0D%0A%0D%0A&#34;&gt;the default DBpedia display&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A few issues to keep in mind:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The display includes variable names with each value to show what that value represents (for example, albumName and releaseDate in the pictures above), but you could customize the stylesheet to display the text any way you like, especially if you planned on using it with a specific dataset. For example, you could omit the variable names or have your query provide &lt;code&gt;rdfs:label&lt;/code&gt; versions of them to use instead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Long strings of text with no spaces to wrap, like the album URLs in the Timbaland query results, may not look great, but I included the albumURL one in that query just to make sure that my stylesheet would render them as working hypertext links.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your first variable represents a resource URI instead of a literal value, it won&amp;rsquo;t be a hypertext link in the displayed page, because pressing the button with each result row&amp;rsquo;s first value expands or contracts the display of the rest of the row&amp;rsquo;s values. It makes more sense to have human-readable values and not URIs on the initial display&amp;rsquo;s buttons anyway.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your query retrieves a lot of data, the stylesheet creates a big HTML file, and the button response may be slow on your phone, especially if the model is as old as my LG Ally.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;ve read a little about jQuery, but I didn&amp;rsquo;t need any of what I learned from that reading to create this stylesheet. If you&amp;rsquo;re happy with the effects of a particular jQuery library, using it may mean no more than creating some simple HTML (typically, some &lt;code&gt;ul&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt;, and &lt;code&gt;div&lt;/code&gt; elements) with specific attributes set for them so that the right jQuery code affects the right elements. To design the pages created by my stylesheet, I just viewed the source and followed the model on the &lt;a href=&#34;http://jquerymobile.com/demos/1.0b3/docs/content/content-collapsible-set.html&#34;&gt;collapsible content&lt;/a&gt; page of the jQuery Mobile site.&lt;/p&gt;
&lt;p&gt;The SPARQL Query Results XML Format is a model of elegant simplicity compared with RDF/XML. (Granted, it has a much simpler job to do.) Writing code to process it in any language is usually easy. If you&amp;rsquo;re new to XSLT, then with some bias I can recommend &lt;a href=&#34;http://www.snee.com/bob/xsltquickly/index.html&#34;&gt;a book on XSLT&lt;/a&gt; that has helped many people I know learn it quickly.&lt;/p&gt;
&lt;p&gt;The ability to create mobile-native web apps with SPARQL and simple XSLT stylesheets should open up a lot of possibilities, because semantic web and linked data application architectures ranging from simple batch files to TopBraid&amp;rsquo;s &lt;a href=&#34;http://topquadrant.com/products/SPARQLMotion.html&#34;&gt;SPARQLMotion&lt;/a&gt; let you hand off XML format SPARQL query results to an XSLT processor. (It should work with the SNORQL interface to Linked Data Cloud datasets such as DBpedia, where the input form lets you specify your own XSLT stylesheet to run, but &lt;a href=&#34;http://sourceforge.net/mailarchive/forum.php?thread_name=4E889FF0.40908%40openlinksw.com&amp;amp;forum_name=dbpedia-discussion&#34;&gt;this feature is currently disabled on the DBpedia Virtuoso instance&lt;/a&gt;. It will be great if they enable it or include a similar stylesheet among the installed choices; meanwhile, you can retrieve the XML results and run the XSLT on your own system.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2011-10-05 update:&lt;/strong&gt; with Kingsley Idehen&amp;rsquo;s help, I now know how to query DBpedia with my own (or any other) XSLT stylesheet. Remove the carriage returns from the following and replace the &amp;amp;query parameter value as described:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org
&amp;amp;query=REPLACE-WITH-ESCAPED-QUERY
&amp;amp;format=application%2Fsparql-results%2Bxml&amp;amp;save=display&amp;amp;fname=
&amp;amp;xslt-uri=http://snee.com/sparql/xslt/SPARQLMobileResults.xsl
&lt;/code&gt;&lt;/pre&gt;
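A short JavaScript sketch of assembling a URL on that pattern, letting encodeURIComponent() handle the percent-escaping; the query itself is an arbitrary example:

```javascript
// Build the DBpedia request URL from the template above; each value
// that needs escaping goes through encodeURIComponent().
var query = "SELECT ?name WHERE { ?s rdfs:label ?name } LIMIT 10";
var url = "http://dbpedia.org/sparql" +
  "?default-graph-uri=" + encodeURIComponent("http://dbpedia.org") +
  "&query=" + encodeURIComponent(query) +
  "&format=" + encodeURIComponent("application/sparql-results+xml") +
  "&save=display&fname=" +
  "&xslt-uri=http://snee.com/sparql/xslt/SPARQLMobileResults.xsl";
```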
&lt;p&gt;For example, &lt;a href=&#34;http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&amp;amp;query=PREFIX%20dbo%3A%20%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0ASELECT%20%3Fname%20%3Faliases%20%3Fborn%20%3Fdied%20WHERE%20%7B%0A%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FThe_Beatles%3E%20dbo%3AbandMember%20%3FbeatleURL%20.%0A%3FbeatleURL%20%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FalternativeNames%3E%20%3Faliases%20%3B%0Ardfs%3Alabel%20%3Fname%20%3B%0Adbo%3AbirthDate%20%3Fborn%20.%0AOPTIONAL%20%7B%3FbeatleURL%20dbo%3AdeathDate%20%3Fdied%20.%20%7D%0AFILTER%20(%20lang(%3Fname)%20%3D%20%22en%22%20)%0A%7D%0A&amp;amp;format=application%2Fsparql-results%2Bxml&amp;amp;save=display&amp;amp;fname=&amp;amp;xslt-uri=http://snee.com/sparql/xslt/SPARQLMobileResults.xsl&#34;&gt;this query&lt;/a&gt; asks DBpedia for the Beatles&amp;rsquo; names, aliases, birth dates, and death dates, and formats the results with the stylesheet described above.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note on comments&lt;/strong&gt;: after turning off comments on this blog for a few days because of comment spam, turning them back on seems to have no effect. So, inspired by &lt;a href=&#34;http://www.jenitennison.com/blog/&#34;&gt;Jeni Tennison&lt;/a&gt;, I&amp;rsquo;ll ask you to add any comments to &lt;a href=&#34;https://plus.google.com/101006505484718936507/posts/DDX3fjABLSf&#34;&gt;this Google+ post&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>RDFa can be so simple</title>
      <link>https://www.bobdc.com/blog/rdfa-can-be-so-simple/</link>
      <pubDate>Tue, 16 Aug 2011 08:19:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdfa-can-be-so-simple/</guid>
      
      
      <description><div>Despite claims to the contrary.</div><div>&lt;blockquote id=&#34;id103350&#34; class=&#34;pullquote&#34;&gt;You can write simple, parsable RDFa with very little syntax and trouble. Really.&lt;/blockquote&gt;
&lt;p&gt;I got so tired of hearing people complain about how confusing RDFa is that while I was on hold during a recent phone call I threw together a &lt;a href=&#34;http://rdfdata.org/dat/rdfademo.html&#34;&gt;demo&lt;/a&gt; of just how simple it can be. The document has the two basic kinds of triples: one with a literal for an object, with data typing thrown in for good measure, and one with a resource URI as its object. A View Source of that document will show this in its &lt;code&gt;head&lt;/code&gt; element (namespaces are declared earlier):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    &amp;lt;meta about=&amp;quot;http://www.snee.com/bob/foaf.rdf#bob&amp;quot;
          property=&amp;quot;foaf:givenName&amp;quot;
          content=&amp;quot;Bob&amp;quot;
          datatype=&amp;quot;xsd:string&amp;quot;/&amp;gt;


    &amp;lt;meta about=&amp;quot;http://www.snee.com/bob/foaf.rdf#bob&amp;quot;
          rel=&amp;quot;foaf:homePage&amp;quot;
          href=&amp;quot;http://www.snee.com/bob&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&#34;http://www.w3.org/2007/08/pyRdfa/extract?uri=http%3A%2F%2Frdfdata.org%2Fdat%2Frdfademo.html&#34;&gt;This link&lt;/a&gt; will show you the triples as extracted by the W3C&amp;rsquo;s RDFa Distiller and Parser service.&lt;/p&gt;
&lt;p&gt;My little demo doesn&amp;rsquo;t take into account all the swirling attempts to innovate, accommodate, and disassociate various ideas about embedding machine-readable markup that are currently out there (if you want to stay on top of this, read &lt;a href=&#34;http://www.jenitennison.com/blog/&#34;&gt;Jeni Tennison&amp;rsquo;s blog&lt;/a&gt;), but it highlights a principle that is probably older than FORTRAN: parsing data in a particular syntax can be a big job, because the parser must understand the full language, but writing data in a particular language can be simple because you can pick the subset that you prefer to work with.&lt;/p&gt;
&lt;p&gt;RDFa gives you many more options for embedding triples—especially if you want to embed metadata about content that is already part of an HTML page, which seems to be a key original use case, or about the page itself—but you can write simple, parsable RDFa with very little syntax and trouble. Really.&lt;/p&gt;
&lt;p&gt;(&lt;strong&gt;Note on comments&lt;/strong&gt;: after turning off comments on this blog for a few days because of comment spam, turning them back on seems to have no effect. If you send me an email about what I&amp;rsquo;ve written at snee.com (bob), I&amp;rsquo;ll add it and any response here.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>&#34;Learning SPARQL&#34; now available</title>
      <link>https://www.bobdc.com/blog/learning-sparql-now-available/</link>
      <pubDate>Wed, 27 Jul 2011 08:12:53 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/learning-sparql-now-available/</guid>
      
      
      <description><div>In print and ebook formats.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.learningsparql.com&#34;&gt;&lt;img id=&#34;id103338&#34; src=&#34;http://www.learningsparql.com/img/cover.jpg&#34; width=&#34;200&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Learning SPARQL cover&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m very happy to announce that the ebook and print editions of &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;Learning SPARQL&lt;/a&gt; are now &lt;a href=&#34;http://oreilly.com/catalog/0636920020547/&#34;&gt;available from O&amp;rsquo;Reilly&lt;/a&gt;. Print editions are also available from &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1449306594/bobducharmeA/&#34;&gt;amazon.com&lt;/a&gt;, &lt;a href=&#34;http://www.amazon.co.uk/Learning-SPARQL-Bob-DuCharme/dp/1449306594&#34;&gt;amazon.co.uk&lt;/a&gt;, maybe some more Amazons, and &lt;a href=&#34;http://www.barnesandnoble.com/w/learning-sparql-bob-ducharme/1103138225&#34;&gt;Barnes and Noble&lt;/a&gt;. (&lt;a href=&#34;http://www.borders.com/online/store/TitleDetail?sku=1449306594&#34;&gt;Borders&lt;/a&gt; says that it&amp;rsquo;s on backorder, but I wouldn&amp;rsquo;t hold your breath for that.) You can read more about how I came to write the book in an &lt;a href=&#34;https://www.bobdc.com/blog/my-upcoming-oreilly-book-learn&#34;&gt;earlier blog posting&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Right now it&amp;rsquo;s the only complete book on the W3C standard query language for linked data and the semantic web, and as far as I know the only book at all that covers the full range of SPARQL 1.1 features such as the ability to update data. The book steps you through simple examples that can all be performed with free software, and all sample queries, data, and output are available on the book&amp;rsquo;s website. In the words of &lt;a href=&#34;http://datypic.com/&#34;&gt;Priscilla Walmsley&lt;/a&gt;, &amp;ldquo;It&amp;rsquo;s excellent—very well organized and written, a completely painless read. I not only feel like I understand SPARQL now, but I have a much better idea why RDF is useful (I was a little skeptical before!)&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I will continue to post news about the book and about SPARQL on the book&amp;rsquo;s twitter account at &lt;a href=&#34;http://twitter.com/#!/learningsparql&#34;&gt;@LearningSPARQL&lt;/a&gt;. I&amp;rsquo;m not starting a separate blog for the book, so I will continue to blog about SPARQL &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/rdf/sparql/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Linking linked data to U.S. law</title>
      <link>https://www.bobdc.com/blog/linking-linked-data-to-us-law/</link>
      <pubDate>Fri, 08 Jul 2011 08:29:08 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/linking-linked-data-to-us-law/</guid>
      
      
      <description><div>Automating conversion of citations into URLs.</div><div>&lt;p&gt;At a recent &lt;a href=&#34;http://www.w3.org/2011/gld/wiki/F2F1&#34;&gt;W3C Government Linked Data Working Group meeting&lt;/a&gt;, I started thinking more about the role in linked data of laws that are published online. To summarize, you don&amp;rsquo;t want to publish the laws themselves as triples, because they&amp;rsquo;re a bad fit for the triples data model, but as online resources relevant to a lot of issues out there, they make an excellent set of resources to point to, although you may not always get the granularity you want.&lt;/p&gt;
&lt;blockquote id=&#34;id103354&#34; class=&#34;pullquote&#34;&gt;Plenty of government data references laws and related materials.&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m discussing U.S. Federal law here, but similar principles should apply both in individual states and in other countries. The main sets of laws here are legislation, code, regulations, and court decisions. (&amp;ldquo;Code&amp;rdquo; refers to laws passed by legislatures, arranged by topic; for example, laws passed about taxes are gathered into the Internal Revenue Code.) If you really want to learn about the various forms of legal material and their relationship, I highly recommend the book &lt;a href=&#34;http://www.amazon.com/Finding-Law-12th-American-Casebooks/dp/0314145796/bobducharmeA/&#34;&gt;Finding the Law&lt;/a&gt;, which I found indispensable when I worked at LexisNexis.&lt;/p&gt;
&lt;p&gt;Most law consists of narrative sentences arranged as paragraphs, often with metadata assigned to certain blocks of it. It&amp;rsquo;s such a good fit for XML that legal publishers were among the first users of XML&amp;rsquo;s predecessor, SGML. (Their use of XML and SGML accounts for a large chunk of my career, and I know that some old XML friends like &lt;a href=&#34;http://seanmcgrath.blogspot.com/&#34;&gt;Sean McGrath&lt;/a&gt; and Dale Waldt continue to make great contributions in this area.) So, while you wouldn&amp;rsquo;t get much benefit splitting these sentences and paragraphs into subjects, predicates, and objects and publishing them as triples, plenty of government data references laws and related materials, and it&amp;rsquo;s more helpful if they can reference them with URLs that lead to the actual laws. To add these URLs with any kind of scalability, you need to find out the common format for citing a document (or, if possible, a point within a document) and an online source of those legal documents whose URLs can be built from that citation format with a regular expression or some other automated tool.&lt;/p&gt;
&lt;p&gt;When creating links to any specific bits of U.S. law, the most valuable book is &lt;a href=&#34;http://www.amazon.com/Bluebook-Uniform-System-Citation/dp/0615361161/bobducharmeA&#34;&gt;The Bluebook: A Uniform System of Citation&lt;/a&gt;. As the subtitle implies, the book describes the normalized way to refer to legal documents and their components. Once you know these, a regular expression can often turn them into a URL that leads a browser right to the part you want. For example, while people often refer to the Supreme Court case outlawing school segregation as &amp;ldquo;Brown v. Board of Education&amp;rdquo;, its official citation is &amp;ldquo;347 U.S. 483&amp;rdquo;, which means &amp;ldquo;the case beginning on page 483 of volume 347 of the official publication of U.S. Supreme Court decisions&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;While there are several sites hosting Supreme Court decisions out there, notably Cornell Law School&amp;rsquo;s &lt;a href=&#34;http://www.law.cornell.edu/supct/&#34;&gt;Legal Information Institute&lt;/a&gt;, the one whose URLs are easiest to construct from a proper Supreme Court citation is justia.com, where the URL for Brown v. Board of Education is &lt;a href=&#34;http://supreme.justia.com/us/347/483/case.html&#34;&gt;http://supreme.justia.com/us/347/483/case.html&lt;/a&gt;. (See also my favorite case, Campbell aka Skyywalker et al v. Acuff Rose Music, Inc. at &lt;a href=&#34;http://supreme.justia.com/us/510/569/case.html&#34;&gt;http://supreme.justia.com/us/510/569/case.html&lt;/a&gt;. Make sure to listen to the relevant work &lt;a href=&#34;http://www.youtube.com/watch?v=65GQ70Rf_8Y&#34;&gt;on YouTube&lt;/a&gt; while you review it.) If you&amp;rsquo;re really interested in linked data and U.S. Supreme Court cases, DBpedia has lots of great metadata for many important cases, as I wrote about in &lt;a href=&#34;https://www.bobdc.com/blog/court-decision-metadata-and-db&#34;&gt;Court decision metadata and DBpedia&lt;/a&gt;.&lt;/p&gt;
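The volume/page pattern makes the conversion mechanical. A rough sketch of the idea (the regular expression and function name are my own guesses at the Bluebook "volume U.S. page" form, not anyone's production code):

```python
import re

def scotus_url(citation):
    """Turn a Bluebook-style Supreme Court citation ("347 U.S. 483")
    into the corresponding justia.com case URL, or None if no match."""
    m = re.search(r"(\d+)\s+U\.S\.\s+(\d+)", citation)
    if not m:
        return None
    volume, page = m.groups()
    return f"http://supreme.justia.com/us/{volume}/{page}/case.html"

print(scotus_url("347 U.S. 483"))
# http://supreme.justia.com/us/347/483/case.html
```

Because it uses a search rather than a full match, the same sketch also picks the citation out of running text like "Brown v. Board of Education, 347 U.S. 483 (1954)".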
&lt;p&gt;To create a URL for other U.S. court systems, you&amp;rsquo;ll have to look up the proper way to cite them in a resource like the Bluebook and then look for versions of that court&amp;rsquo;s cases online with URLs that reflect the citation in a manner that lets you automate the creation of the URL. This is a theme for linking to any kind of law on the web, and you can be sure that developers at the Legal Information Institute, LexisNexis, WestLaw, and other legal publishers have put plenty of time into developing regular expressions to make this happen so that they can turn plain text citations into hypertext links. (It would be great if the LII made their regular expressions public. LexisNexis and WestLaw never would, although they&amp;rsquo;re more interested in keeping such proprietary work away from each other than from us.)&lt;/p&gt;
&lt;p&gt;Legislation can be more complicated, but two excellent resources make it remarkably simple: the Library of Congress&amp;rsquo;s &lt;a href=&#34;http://thomas.loc.gov/home/thomas.php&#34;&gt;THOMAS&lt;/a&gt; system lets you create persistent URLs for legislation using the &lt;a href=&#34;http://thomas.loc.gov/home/handles/help.html&#34;&gt;handle system&lt;/a&gt; (see also &lt;a href=&#34;http://www.handle.net/factsheet.html&#34;&gt;its inventor&amp;rsquo;s web page on it&lt;/a&gt;), which I hadn&amp;rsquo;t heard of before the Government Linked Data meeting. The Law Librarian Blog has a &lt;a href=&#34;http://lawprofessors.typepad.com/law_librarian_blog/2008/10/lc-thomas-imple.html&#34;&gt;nice entry&lt;/a&gt; showing examples of how to use it. &lt;a href=&#34;http://legislink.org/&#34;&gt;LegisLink&lt;/a&gt; is another way to link to legislation, and looks simpler to me. A Legal Information Institute &lt;a href=&#34;http://blog.law.cornell.edu/voxpop/tag/persistent-urls-for-legal-information/&#34;&gt;blog entry&lt;/a&gt; has a good explanation of this, and LegisLink provides an excellent &lt;a href=&#34;http://legislink.org/us&#34;&gt;form&lt;/a&gt; to construct the URLs. These even let you construct links to a specific section of a piece of legislation.&lt;/p&gt;
&lt;p&gt;Granularity is an even bigger issue when linking to code and regulations, which are often broken down into numbered and lettered pieces of pieces of pieces. Ever since I worked at the grandly named &lt;a href=&#34;http://ria.thomsonreuters.com/&#34;&gt;Research Institute of America&lt;/a&gt; (a publisher of hyperlinked U.S. tax law and related information), it&amp;rsquo;s always irked me to see people refer to a pension plan as a 401K, because as subsection k of section 401 of the U.S. Tax Code (title 26 of the U.S. Code), it&amp;rsquo;s more properly written 401(k), or, to use its full name, 26 USC 401(k). The Government Printing Office lets you link directly to section 401, if not subsection k, with the URL &lt;a href=&#34;http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&amp;amp;docid=Cite:+26USC401&#34;&gt;http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&amp;amp;docid=Cite:+26USC401&lt;/a&gt;, and the LII lets you link to it with &lt;a href=&#34;http://www.law.cornell.edu/uscode/26/usc_sec_26_00000401----000-.html&#34;&gt;http://www.law.cornell.edu/uscode/26/usc_sec_26_00000401----000-.html&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the US Code, which arranges the laws by topic. Regulations are arranged by topic in the CFR, or Code of Federal Regulations. For example, the legal definition of bourbon is in title 27 of the CFR (Alcohol, Tobacco Products and Firearms), Part 5 (Labeling and Advertising of Distilled Spirits), section 22 (The standards of identity), subsection b (Class 2; whisky) subsubsubsection (1)(i). The full citation would be 27 CFR 5.22(b)(1)(i), but I know of no way to link to anything more specific than 27 CFR 5.22: &lt;a href=&#34;http://edocket.access.gpo.gov/cfr_2010/aprqtr/27cfr5.22.htm&#34;&gt;http://edocket.access.gpo.gov/cfr_2010/aprqtr/27cfr5.22.htm&lt;/a&gt;. (Bookmark that on your phone&amp;rsquo;s browser and then bet a Maker&amp;rsquo;s Mark with the next barroom loudmouth that you hear insisting that bourbon must legally be made in Bourbon County, Kentucky. He&amp;rsquo;s wrong. It can be made anywhere in the United States.)&lt;/p&gt;
&lt;p&gt;As you can see, there&amp;rsquo;s some work involved in creating URLs for links to laws, but research for this blog entry led me to new resources like LegisLink that I hadn&amp;rsquo;t heard of before, so I encourage you to let me know if there&amp;rsquo;s anything important that I&amp;rsquo;m missing.&lt;/p&gt;
&lt;p&gt;It was also interesting to see that the LII is involved in &lt;a href=&#34;http://topics.law.cornell.edu/wiki/lexcraft/urn_lex&#34;&gt;efforts&lt;/a&gt; to create an international standard for legal document URIs proposed by some Italian legal researchers. (This is particularly interesting when you consider that Italian legal researchers basically &lt;a href=&#34;http://www.oreillynet.com/xml/blog/2003/05/when_did_linking_begin.html&#34;&gt;invented the concept of linking&lt;/a&gt; 900 years ago.)&lt;/p&gt;
&lt;p&gt;A comment from Frank Bennett of Nagoya University&amp;rsquo;s Faculty of Law:&lt;/p&gt;
&lt;p&gt;These are indeed important developments. The systematic linking of case law and statutory data promise to have a large and positive impact on our access to legal resources. The only point I would take issue with is the reliance on Bluebook citation forms as the rosetta stone for identifying resources. Parsing cites out of plain text is a necessary kludge, given the general absence of meaningful structured metadata from online legal resources (thank you Lexis, thank you WestLaw), but it should be recognized as a kludge.&lt;/p&gt;
&lt;p&gt;To get a lively set of service layers running on top of legal data, the metadata contained in or relevant to a particular case, statutory provision or regulatory provision needs to be readily accessible to calling applications. While it is true that string parsing machinery can be written to a good standard, assuming perfectly regular citation forms and uniform document formats, neither of those constraints applies in the wild. The Bluebook shares the field in North America with the ALWD and the McGill Guide. To make matters worse, the Bluebook specifies citation forms for some foreign legal resources that vary significantly from the native citation forms of the target jurisdictions. Document formats vary as well, so getting an accurate string parse may require special-purpose serialization of the document before applying a string parser to the text &amp;ndash; which may be hundreds of pages in length. Although certainly better than nothing, string parsing is a fragile strategy that would be very cumbersome to standardize and does not scale well.&lt;/p&gt;
&lt;p&gt;Matching rendered cites to URLs is an important prospect, but we won&amp;rsquo;t see significant progress at the application level until the intervening step of producing true structured metadata &amp;ndash; and embedding it in our online resources &amp;ndash; is covered.&lt;/p&gt;
&lt;p&gt;A comment from Augusto Herrmann:&lt;/p&gt;
&lt;p&gt;I just read your interesting article entitled &amp;ldquo;Linking linked data to U.S. law&amp;rdquo;. I&amp;rsquo;d like to point you to a quite successful government project that uses URN for Brazilian legislation. The portal where you can search for legislation is at &lt;a href=&#34;http://www.lexml.gov.br&#34;&gt;http://www.lexml.gov.br&lt;/a&gt; and information about the project can be found on &lt;a href=&#34;http://projeto.lexml.gov.br&#34;&gt;http://projeto.lexml.gov.br&lt;/a&gt; . There you can find the document &lt;a href=&#34;http://projeto.lexml.gov.br/documentacao/Parte-2-LexML-URN.pdf&#34;&gt;&amp;ldquo;Parte 2: LEXML URN&amp;rdquo;&lt;/a&gt; which describes the rules to construct official URN for legislation and court decisions (it&amp;rsquo;s in Portuguese, though). The project started circa 2004 and closely followed the footsteps of the Italian Norme in Rete project. If you aren&amp;rsquo;t yet familiar with it, it&amp;rsquo;s worth a look (see also akomantoso.org and metalex.eu).&lt;/p&gt;
&lt;p&gt;(&lt;strong&gt;Note on comments:&lt;/strong&gt; after turning off comments on this blog for a few days because of comment spam, turning them back on seems to have no effect. If you send me an email about what I&amp;rsquo;ve written at snee.com (bob), I&amp;rsquo;ll add it and any response here.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/legal-publishing">legal publishing</category>
      
    </item>
    
    <item>
      <title>My upcoming O&#39;Reilly book: &#34;Learning SPARQL&#34;</title>
      <link>https://www.bobdc.com/blog/my-upcoming-oreilly-book-learn/</link>
      <pubDate>Wed, 01 Jun 2011 10:07:13 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-upcoming-oreilly-book-learn/</guid>
      
      
      <description><div>Querying and Updating with SPARQL 1.1.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.learningsparql.com&#34;&gt;&lt;img id=&#34;id103339&#34; src=&#34;http://www.learningsparql.com/img/cover.jpg&#34; width=&#34;200&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Learning SPARQL cover&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;51 weeks ago at &lt;a href=&#34;http://semtech2010.semanticuniverse.com/&#34;&gt;last year&amp;rsquo;s semtech&lt;/a&gt; I couldn&amp;rsquo;t believe that there was still no book about SPARQL available. I had accumulated notes for such a book, and by that point I&amp;rsquo;d learned enough about SPARQL as a TopQuadrant employee that I decided to start studying the specifications (and especially the 1.1 update) more systematically and write the book myself. (This explains why I&amp;rsquo;ve been writing less on my blog in the last year and &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/rdf/sparql/&#34;&gt;writing about SPARQL&lt;/a&gt; more when I do.)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m proud to announce that I&amp;rsquo;m publishing the book with O&amp;rsquo;Reilly. Print and electronic versions will be available in July at the latest, and we&amp;rsquo;re already planning on releasing an expanded edition with additional new material and any necessary updates once SPARQL 1.1 becomes a Recommendation. Anyone who buys the ebook version of the first edition will get the expanded edition on SPARQL 1.1 at no extra cost.&lt;/p&gt;
&lt;p&gt;As you can tell from the book&amp;rsquo;s cover on the right, the O&amp;rsquo;Reilly animal for this one is the anglerfish—the one with the light that hangs off the front of its head, for the pun on &amp;ldquo;sparkle&amp;rdquo;. (I should really pick up the &lt;a href=&#34;http://www.neatoshop.com/product/Deep-Sea-Anglerfish-LED-Light?tag=2302&#34;&gt;nightlight version&lt;/a&gt; of this lovely fish.)&lt;/p&gt;
&lt;p&gt;From what I&amp;rsquo;ve seen so far, the only coverage of SPARQL in any existing books is a chapter or two in more general books on the semantic web, and I haven&amp;rsquo;t seen any coverage of SPARQL 1.1 in those books just yet. (The second edition of Dean Allemang and Jim Hendler&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0123859654/bobducharmeA/&#34;&gt;Semantic Web for the Working Ontologist&lt;/a&gt;, which is available on Amazon today, covers some SPARQL 1.1 query features, but not SPARQL Update.) &amp;ldquo;Learning SPARQL&amp;rdquo; is the first complete book on SPARQL, and covers both 1.0 and 1.1—including &lt;a href=&#34;http://www.w3.org/TR/sparql11-update/&#34;&gt;SPARQL Update&lt;/a&gt;—with working sample queries and data that you can try yourself with free software.&lt;/p&gt;
&lt;p&gt;I parked the domain name &lt;a href=&#34;http://www.learningsparql.com&#34;&gt;learningsparql.com&lt;/a&gt; some time ago, and now there&amp;rsquo;s a full web site about the book there. For up-to-date information about the book&amp;rsquo;s availability and SPARQL news in general, subscribe to the twitter feed &lt;a href=&#34;http://twitter.com/#!/LearningSPARQL&#34;&gt;@LearningSPARQL&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Semantic web technology at NASA: lower costs and greater productivity</title>
      <link>https://www.bobdc.com/blog/semantic-web-at-nasa-lower-cos/</link>
      <pubDate>Fri, 27 May 2011 17:54:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-at-nasa-lower-cos/</guid>
      
      
      <description><div>An inspiring story.</div><div>&lt;p&gt;Ian Jacobs&amp;rsquo;s recent &lt;a href=&#34;http://www.w3.org/QA/2011/05/semantic_web_its_not_rocket_sc.html&#34;&gt;interview with NASA&amp;rsquo;s Jeanne Holm&lt;/a&gt; on the W3C website is an excellent case study of semantic web technology. It&amp;rsquo;s not a long article, so I recommend that you read the whole thing. Here are a few points that caught my eye:&lt;/p&gt;
&lt;img id=&#34;id103366&#34; src=&#34;http://humbabe.arc.nasa.gov/MarsDustWorkshop/NASA_Logo.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;NASA logo&#34; width=&#34;140&#34;/&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;She gives nice hard numbers about money spent and money saved, and notes a downward trend in the costs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;They used publication data to infer social networks and shared expertise and found other related ways to reduce the need for staff data entry.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The use of service agreements encouraged people to share data more easily.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This sharing led to demonstrated serendipitous reuse of data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;They plan to network the vocabularies (she doesn&amp;rsquo;t use this term literally—I know it from a &lt;a href=&#34;http://www.topquadrant.com/solutions/ent_vocab_net.html&#34;&gt;TopQuadrant context&lt;/a&gt;—but she&amp;rsquo;s clearly talking about the same thing).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It was nice to see the credit that she gave to Kendall Clark. With my TopQuadrant hat on, I wish she&amp;rsquo;d mentioned some of the &lt;a href=&#34;http://www.scribd.com/doc/25387652/NASA-Constellation-Program-Ontologies-Ralph-Hodgson-20080320&#34;&gt;extensive work&lt;/a&gt; that Ralph Hodgson has done there, but NASA is a big organization.&lt;/p&gt;
&lt;p&gt;After reading Danny Ayers&amp;rsquo; &lt;a href=&#34;http://dannyayers.com/2011/05/27/Smell-the-coffee&#34;&gt;Smell the coffee&lt;/a&gt; blog post this morning, which wasn&amp;rsquo;t very hopeful about recent progress in the semantic web, I &lt;a href=&#34;http://twitter.com/#!/bobdc/status/74163734284742656&#34;&gt;hoped that&lt;/a&gt; Ian&amp;rsquo;s interview with Jeanne would cheer him up.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Using SPARQL to find the right DBpedia URI</title>
      <link>https://www.bobdc.com/blog/using-sparql-to-find-the-right/</link>
      <pubDate>Tue, 17 May 2011 08:40:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-sparql-to-find-the-right/</guid>
      
      
      <description><div>Even with the wrong name.</div><div>&lt;img id=&#34;id103352&#34; src=&#34;https://www.bobdc.com/img/main/BobMarly.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Bob Marley&#34;/&gt;
&lt;p&gt;In &lt;a href=&#34;https://www.bobdc.com/blog/pulling-skos-preflabel-and-alt&#34;&gt;Pulling SKOS prefLabel and altLabel values out of DBpedia&lt;/a&gt;, I described how Wikipedia and DBpedia store useful data about alternative names for resources described on Wikipedia, and I showed how you can use these to populate a SKOS dataset&amp;rsquo;s alternative and preferred label properties. Today I want to show how to use these as part of an application that lets you retrieve data even when you don&amp;rsquo;t necessarily have the right name for something—for example, retrieving a picture of Bob Marley using the misspelled version of his name &amp;ldquo;Bob Marly&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://dbpedia.org/page/Bob_Marley&#34;&gt;DBpedia page for Bob Marley&lt;/a&gt; shows that dbpedia:Bob_Marly is one of the dbpedia-owl:wikiPageRedirects values of &lt;a href=&#34;http://dbpedia.org/page/Bob_Marley&#34;&gt;http://dbpedia.org/page/Bob_Marley&lt;/a&gt;. This means that if you send your browser to &lt;a href=&#34;http://en.wikipedia.org/wiki/Bob_Marly&#34;&gt;http://en.wikipedia.org/wiki/Bob_Marly&lt;/a&gt;, you&amp;rsquo;ll end up on &lt;a href=&#34;http://en.wikipedia.org/wiki/Bob_Marley&#34;&gt;http://en.wikipedia.org/wiki/Bob_Marley&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It doesn&amp;rsquo;t show that this redirect URI has the rdfs:label value &amp;ldquo;Bob Marly&amp;rdquo;@en associated with it, and this is the really handy part for retrieving data based on not-quite-right values. Because of this, the following SPARQL query will return the URI &lt;a href=&#34;http://dbpedia.org/resource/Bob_Marley&#34;&gt;http://dbpedia.org/resource/Bob_Marley&lt;/a&gt; whether the quoted literal value is &amp;ldquo;Bob Marly&amp;rdquo; or &amp;ldquo;Bob Marley&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# First two PREFIX declarations unnecessary on SNORQL
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX owl: &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt;
PREFIX dbo: &amp;lt;http://dbpedia.org/ontology/&amp;gt;


SELECT ?s WHERE {
  {
    ?s rdfs:label &amp;quot;Bob Marly&amp;quot;@en ;
       a owl:Thing .       
  }
  UNION
  {
    ?altName rdfs:label &amp;quot;Bob Marly&amp;quot;@en ;
             dbo:wikiPageRedirects ?s .
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The graph pattern before the UNION keyword checks whether there is an actual Wikipedia page for the quoted value, and the part after checks whether it&amp;rsquo;s a redirect of something else. Effectively, it will be one or the other; there are only about a dozen labels in DBpedia that can be both.&lt;/p&gt;
&lt;p&gt;To use this in a simple application, I created a &lt;a href=&#34;http://www.snee.com/sparqlforms/findWikipediaImage.html&#34;&gt;form&lt;/a&gt; that, after you enter a name on it, attempts to display a picture of what you entered. Because the redirect data includes common misspellings as well as nicknames, entering &amp;ldquo;Bob Marly&amp;rdquo; will get you a picture of Marley and the URL of the actual resource, as shown beneath the picture above. Other interesting nicknames and misspellings to try are Bob Dillan, Mary Casat, Prince Billy, Big Blue, and Proctor and Gamble. (Warning: DBpedia image data is incorrect for some very well-known people, like Abraham Lincoln and Barack Obama, even when the Wikipedia page has a picture, so you may see the symbol for a broken image link. I had hoped to give the picture above a title of &amp;ldquo;&lt;a href=&#34;http://en.wikipedia.org/wiki/Abe_Lincon&#34;&gt;Abe Lincon&lt;/a&gt;&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;Because the output creates a specialized web page, I used the technique I described in &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-wikiquery/&#34;&gt;Build Wikipedia query forms with semantic technology&lt;/a&gt; (which can be used with any SPARQL endpoint, not just DBpedia): a CGI Python script stores a SPARQL query, replaces a string in that query with whatever was entered in the form, sends the query off to the endpoint, and then sends HTML based on the result back to the browser. You can see the source &lt;a href=&#34;http://www.snee.com/sparqlforms/findWikipediaImage.txt&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
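The substitute-and-send part of that pattern is easy to sketch on its own, without the CGI wrapper. The placeholder token, function name, and JSON output format below are my own choices for illustration, not necessarily what the linked script uses:

```python
from urllib.parse import urlencode

# Stored query with a placeholder; the rdfs/owl/dbo prefixes are assumed
# to be predeclared by the endpoint.
QUERY_TEMPLATE = """SELECT ?s WHERE {
  { ?s rdfs:label "NAME_GOES_HERE"@en ; a owl:Thing . }
  UNION
  { ?alt rdfs:label "NAME_GOES_HERE"@en ;
         dbo:wikiPageRedirects ?s . }
}"""

def build_request_url(name, endpoint="http://dbpedia.org/sparql"):
    """Substitute the form input into the stored query and build the
    GET URL to send to the SPARQL endpoint."""
    query = QUERY_TEMPLATE.replace("NAME_GOES_HERE", name)
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

url = build_request_url("Bob Marly")
```

Fetching that URL (for example with urllib) returns the query results; the CGI script then wraps whatever comes back in HTML for the browser.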
&lt;p&gt;It&amp;rsquo;s safe to say that this ability to find the right information based on a nickname or common misspelling could add a lot to many applications. Once again, while the most important part of the semantic web is the data—in this case, DBpedia&amp;rsquo;s &lt;a href=&#34;http://dbpedia.org/ontology/wikiPageRedirects&#34;&gt;wikiPageRedirects&lt;/a&gt; values—and not the standards and technologies used to get at the data, the existence of so much useful SPARQL-accessible data should make the SPARQL query language look more and more appealing to people who might have doubted before.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>SKOS overview article on IBM developerWorks</title>
      <link>https://www.bobdc.com/blog/skos-overview-article-on-ibm-d/</link>
      <pubDate>Wed, 11 May 2011 10:04:44 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/skos-overview-article-on-ibm-d/</guid>
      
      
      <description><div>SKOS, vocabulary management, the semantic web, and more</div><div>&lt;p&gt;&lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-skostaxonomy/index.html&#34;&gt;&lt;img id=&#34;id103341&#34; src=&#34;http://www.ibm.com/developerworks/i/dwwordmark.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;developerWorks logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;2021-09-30: The article referenced below has since been taken off of the IBM developerWorks site, so I republished it &lt;a href=&#34;../skosibm&#34;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been interested in the SKOS standard for vocabulary management for several years (and written about it &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/rdf/skos/&#34;&gt;here&lt;/a&gt; several times), but since we at TopQuadrant first began planning out the &lt;a href=&#34;http://www.topquadrant.com/solutions/ent_vocab_net.html&#34;&gt;Enterprise Vocabulary Net&lt;/a&gt; product, I&amp;rsquo;ve learned a lot more about the theory and practice of using SKOS. I&amp;rsquo;ve recently written up an overview of SKOS and where it fits into vocabulary management and the semantic web, and IBM developerWorks has just published this as &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-skostaxonomy/index.html&#34;&gt;Improve your taxonomy management using the W3C SKOS standard&lt;/a&gt;. I hope it proves useful to people who want to learn more about SKOS.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-2858&#34;&gt;May 11, 2011 1:50 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Mutt&amp;rdquo; has a specific meaning, so it&amp;rsquo;s a bad example. Lassie is a Rough Collie, Old Yeller is a mutt (a Labrador Retriever / Mastiff cross). &amp;ldquo;Mutt&amp;rdquo; should be an en-US alternative label for the concept whose preferred labels are &amp;ldquo;mongrel&amp;rdquo; (en-GB) and &amp;ldquo;mixed-breed dog&amp;rdquo; (en-US), which would be a hyponym of &amp;ldquo;dog&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;But there&amp;rsquo;s no need to go into all that. Instead, you can just fix the article by replacing &amp;ldquo;mutt&amp;rdquo; with &amp;ldquo;pooch&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ibiblio.org/fred2.0/&#34; title=&#34;http://www.ibiblio.org/fred2.0/&#34;&gt;Simon Spero&lt;/a&gt; on &lt;a href=&#34;#comment-2859&#34;&gt;May 11, 2011 5:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;So what can you reliably infer about the relationship between [Bulldog] and [Mammal]?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2860&#34;&gt;May 11, 2011 5:44 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I was going to say that Mammal is a broader term for Bulldog, but &lt;a href=&#34;http://www.w3.org/TR/2009/REC-skos-reference-20090818/#L2810&#34;&gt;http://www.w3.org/TR/2009/REC-skos-reference-20090818/#L2810&lt;/a&gt; says that &amp;quot; the properties skos:broader and skos:narrower are not declared as transitive properties&amp;quot; and that the skos:broaderTransitive property is provided to indicate such a relationship. I could use that if I wanted to more explicitly set up a taxonomy that I was defining to make it clear that I wanted Mammal to be seen as broader than Bulldog.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ibiblio.org/fred2.0/&#34; title=&#34;http://www.ibiblio.org/fred2.0/&#34;&gt;Simon Spero&lt;/a&gt; on &lt;a href=&#34;#comment-2861&#34;&gt;May 11, 2011 6:22 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;But broaderTransitive is not supposed to be asserted, and is not guaranteed to be valid (see the SKOS Primer)&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;https://github.com/innoq/iqvoc/wiki&#34; title=&#34;https://github.com/innoq/iqvoc/wiki&#34;&gt;Thomas Bandholtz&lt;/a&gt; on &lt;a href=&#34;#comment-2868&#34;&gt;May 13, 2011 3:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;in this article you mention iQvoc with a link to a German Web page which only describes this SKOS tool. Meanwhile iQvoc 3.0 is available under an Apache 2.0 license at &lt;a href=&#34;https://github.com/innoq/iqvoc/wiki&#34;&gt;https://github.com/innoq/iqvoc/wiki&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Best regards,&lt;br /&gt;
Thomas&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
    </item>
    
    <item>
      <title>Quick and dirty linked data content negotiation</title>
      <link>https://www.bobdc.com/blog/quick-and-dirty-linked-data-co/</link>
      <pubDate>Mon, 09 May 2011 10:32:08 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/quick-and-dirty-linked-data-co/</guid>
      
      
      <description><div>Not even that dirty.</div><div>&lt;p&gt;I&amp;rsquo;ve managed to fill a key gap in the world&amp;rsquo;s supply of Linked Open Data by publishing triples that connect Mad Magazine film parody titles to the DBpedia URIs of the actual films. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://dbpedia.org/resource/Judge_Dredd_%28film%29&amp;gt;
      mad:FilmParody
              [ prism:CoverDate &amp;quot;1995-08-00&amp;quot; ;
                prism:issueIdentifier
                        &amp;quot;338&amp;quot; ;
                dc:title &amp;quot;Judge Dreck&amp;quot;
              ] .


&amp;lt;http://dbpedia.org/resource/2001:_A_Space_Odyssey_%28film%29&amp;gt;
      mad:FilmParody
              [ prism:CoverDate &amp;quot;1969-03-00&amp;quot; ;
                prism:issueIdentifier &amp;quot;125&amp;quot; ;
                dc:title &amp;quot;201 Minutes of a Space Idiocy&amp;quot;
              ] .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(To prepare the data, I scraped a &lt;a href=&#34;http://en.wikipedia.org/wiki/List_of_Mad&#39;s_movie_spoofs&#34;&gt;Wikipedia list&lt;/a&gt;, tested the URIs, then hand-corrected a few.) To really make this serious RESTful linked open data, I wanted to make it available as both RDF/XML and Turtle depending on the &lt;code&gt;Accept&lt;/code&gt; value in the header of the HTTP request. All this took was a few lines in the &lt;code&gt;.htaccess&lt;/code&gt; file (which I&amp;rsquo;ve been learning &lt;a href=&#34;https://www.bobdc.com/blog/form-driven-sparql-queries-wit&#34;&gt;more about lately&lt;/a&gt;) in the directory storing the RDF/XML and Turtle versions of the data.&lt;/p&gt;
&lt;p&gt;For example, either of the following two commands retrieves the Turtle version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget --header=&amp;quot;Accept: text/turtle&amp;quot; http://www.rdfdata.org/dat/MadFilmParodies/
curl --header &amp;quot;Accept: text/turtle&amp;quot; -L http://www.rdfdata.org/dat/MadFilmParodies/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Substituting &lt;code&gt;application/rdf+xml&lt;/code&gt; for &lt;code&gt;text/turtle&lt;/code&gt; in either command gets you the RDF/XML version, and omitting the &lt;code&gt;--header&lt;/code&gt; parameter altogether gets you an HTML version.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the complete &lt;code&gt;.htaccess&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RewriteEngine on


RewriteCond %{HTTP_ACCEPT} ^.*text/turtle.*
RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.ttl [L]
# no luck:
#RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.ttl [R=303,L]


RewriteCond %{HTTP_ACCEPT} ^.*application/rdf\+xml.*
RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.rdf [L]


RewriteRule ^index.html$ http://en.wikipedia.org/wiki/List_of_Mad&#39;s_movie_spoofs
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Apache web server where I have this hosted is configured to look for an index.html file in a directory if the requested URL doesn&amp;rsquo;t mention a specific filename, so the three rules here each modify that &amp;ldquo;request&amp;rdquo; to look for something else, depending on what the &lt;code&gt;RewriteCond&lt;/code&gt; line finds in the &lt;code&gt;HTTP_ACCEPT&lt;/code&gt; value. If it finds &amp;ldquo;text/turtle&amp;rdquo;, it sends the Turtle version of my data, and the &lt;code&gt;L&lt;/code&gt; directive tells the Apache mod_rewrite module that is processing these instructions not to look at any more of them.&lt;/p&gt;
&lt;p&gt;The next rule performs the corresponding &lt;code&gt;HTTP_ACCEPT&lt;/code&gt; check and file delivery for an RDF/XML request, and the default behavior if neither of those happens is to deliver an HTML version of the data. (I took the lazy way out and just redirected to the appropriate Wikipedia page instead of creating a new HTML file.) As you can see from the two commented-out lines, I &lt;a href=&#34;http://www.qc4blog.com/?p=934&#34;&gt;had the impression&lt;/a&gt; that adding &lt;code&gt;R=303&lt;/code&gt; in the brackets with the &lt;code&gt;L&lt;/code&gt; would send an HTTP return code of &lt;a href=&#34;http://en.wikipedia.org/wiki/HTTP_303&#34;&gt;303&lt;/a&gt; back to the requester, overriding the default code of &lt;a href=&#34;http://en.wikipedia.org/wiki/HTTP_302&#34;&gt;302&lt;/a&gt;, but never got that to work. If anyone has any suggestions about how to fix this, or whether 303 is even the most appropriate return code, please let me know.&lt;/p&gt;
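The dispatch these rules implement is just first-match pattern testing against the `Accept` header value. Here is a rough sketch of the same logic in Python (hypothetical, not part of the actual setup; the file names match the `.htaccess` above, but the HTML fallback is simplified to a local file name rather than the Wikipedia redirect):

```python
import re

# First-match-wins dispatch, mirroring the RewriteCond/RewriteRule pairs.
# The [L] flag's "stop processing rules" behavior becomes an early return.
RULES = [
    (re.compile(r"text/turtle"), "MadFilmParodies.ttl"),
    (re.compile(r"application/rdf\+xml"), "MadFilmParodies.rdf"),
]

def negotiate(accept_header):
    """Map an HTTP Accept header value to the file to serve."""
    for pattern, filename in RULES:
        if pattern.search(accept_header):
            return filename
    return "index.html"  # default: fall through to the HTML version

negotiate("text/turtle, */*;q=0.8")  # selects the Turtle file
```

Like Apache's substring-style `^.*text/turtle.*` patterns, this ignores `q=` quality values, which is part of what makes the approach quick and a bit dirty.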
&lt;p&gt;From what I&amp;rsquo;ve read on how the syntax of these instructions works, I shouldn&amp;rsquo;t have needed the full URLs for the Turtle and RDF/XML versions of the Mad Film Parody data, because they were in the same directory as the &lt;code&gt;.htaccess&lt;/code&gt; file, but that was the only way I could get this to work.&lt;/p&gt;
&lt;p&gt;Now that I know how to do this, I can do it again for other resources pretty quickly. It took me about five minutes to do it for the little &lt;a href=&#34;http://www.snee.com/ns/madMag/MadFilmParody&#34;&gt;http://www.snee.com/ns/madMag/MadFilmParody&lt;/a&gt; ontology that the data points to. I consider this solution quick and a bit dirty because it requires the maintenance of two copies of the data, but the XML guy in me knows that it would be wrong to perform parallel edits on the two copies, and that I should instead pick one as a master, edit it when necessary, and generate the other from it. If I had to do this on a larger scale, I learned from Brian Sletten at &lt;a href=&#34;http://semtech2010.semanticuniverse.com/sessionPop.cfm?confid=42&amp;amp;proposalid=3065&#34;&gt;last year&amp;rsquo;s semtech&lt;/a&gt; that I should look into &lt;a href=&#34;http://www.1060research.com/netkernel/&#34;&gt;NetKernel&lt;/a&gt;, but it was a good exercise to do it this way to learn what was really going on.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m going to try to get into the habit of doing this for data and ontologies that I create, so I&amp;rsquo;d appreciate any suggestions about tweaking details before any suboptimal aspects of this become habits.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.dccomics.com/mad/about/?action=timeline&#34;&gt;&lt;img id=&#34;id103639&#34; src=&#34;http://www.dccomics.com/mad/i/timeline/jul1990.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;MAD cover&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://themantics.wordpress.com&#34; title=&#34;http://themantics.wordpress.com&#34;&gt;Ryan&lt;/a&gt; on &lt;a href=&#34;#comment-2854&#34;&gt;May 9, 2011 3:40 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;To help maintain a master copy of your RDF and transform into other formats through the command line, I&amp;rsquo;d recommend the rdfcat utility distributed with Jena: &lt;a href=&#34;http://jena.sourceforge.net/javadoc/jena/rdfcat.html&#34;&gt;http://jena.sourceforge.net/javadoc/jena/rdfcat.html&lt;/a&gt; . Personally, I&amp;rsquo;d make Turtle my master format language due to readability and file size, and transform that into XML after editing. Something like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;java jena.rdfcat MadFilmParodies.ttl -in TTL &amp;gt; parody.rdf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2855&#34;&gt;May 9, 2011 4:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Ryan! I&amp;rsquo;ve used jena.rdfcopy, but never noticed rdfcat before.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Data providers</title>
      <link>https://www.bobdc.com/blog/data-providers/</link>
      <pubDate>Mon, 02 May 2011 08:31:42 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/data-providers/</guid>
      
      
      <description><div>RDF or otherwise.</div><div>&lt;p&gt;While beta testing Talis&amp;rsquo;s Kasabi, I got to wondering about the data publishing market: who out there is hosting raw data, potentially charging for it and passing money along to the data&amp;rsquo;s providers? Poking around, I learned who the key names are. (Corrections welcome.) I accidentally stumbled across a few more when I followed a &lt;a href=&#34;http://twitter.com/#!/xmlgrrl/status/62701417810509824&#34;&gt;tweet&lt;/a&gt; from @xmlgrrl (a.k.a. Eve Maler, a friend of mine in the XML world since it was the SGML world) and started looking at her husband Eli&amp;rsquo;s blog. His posting &lt;a href=&#34;http://www.eliasisrael.com/2011/04/05/ten-services-to-get-your-cloud-startup-off-the-ground-now/&#34;&gt;Ten services to get your cloud startup off the ground now&lt;/a&gt; mentioned a few more companies that provide raw data—one that even provides free RDF. I tagged a few with a &lt;a href=&#34;http://www.delicious.com/bobdc/data&#34;&gt;delicious.com&lt;/a&gt; bookmark, but wanted to write out notes about a few here in order of how interesting they are to a semantic web geek.&lt;/p&gt;
&lt;p&gt;Some general notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The more I studied, the more I found, but I didn&amp;rsquo;t want to spend more than an afternoon on this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;These sites all let you download data directly. I didn&amp;rsquo;t include sites like &lt;a href=&#34;http://www.data.gov/&#34;&gt;Data.gov&lt;/a&gt; that function more as directories that link to data sources on other sites.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Most of these providers have boosted their numbers of available datasets by including small datasets with as few as 100 records, and by hosting copies of data from the well-known names in the &lt;a href=&#34;http://richard.cyganiak.de/2007/10/lod/&#34;&gt;Linked Data Cloud&lt;/a&gt;. The advertised added value is typically the ease of programmatic access to that data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Despite the title of this blog entry (I was tempted to call it &amp;ldquo;Data resellers&amp;rdquo;, but many make the data available for free), I focused on a narrower case of data providers: the redistributors that gather data from specific, identified places and then make it available publicly with attribution, not actual data sources themselves such as government agencies, university projects, media making their metadata available, and various other circles on the Linked Data Cloud diagram.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If I&amp;rsquo;ve quoted some companies&amp;rsquo; websites more than others, it&amp;rsquo;s because they had &amp;ldquo;About&amp;rdquo; and &amp;ldquo;FAQ&amp;rdquo; pages that were easy to find and actually answered the questions I was wondering about.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most interesting thing about &lt;a href=&#34;http://blog.kasabi.com/&#34;&gt;&lt;strong&gt;Kasabi&lt;/strong&gt;&lt;/a&gt; in this field is their commitment to providing data according to Linked Data principles, giving you SPARQL endpoints for data sources and the ability to define new APIs around each data source. The current data selection is interesting, considering that Kasabi is still in beta. For now it all looks like data that is freely available elsewhere, but the advantages of retrieving it from them go beyond the ability to use the SPARQL query language. For example, with BestBuy&amp;rsquo;s RDFa spread out across many different dynamically generated pages on bestbuy.com, querying this data from BestBuy&amp;rsquo;s server has a lot of limitations. Kasabi seems to have the BestBuy data aggregated so that their customers have more flexibility in how they query it.&lt;/p&gt;
&lt;blockquote id=&#34;id103435&#34; class=&#34;pullquote&#34;&gt;While disintermediation was a big buzzword of the dot com boom, intermediation is now getting bigger.&lt;/blockquote&gt;
&lt;p&gt;I list &lt;strong&gt;&lt;a href=&#34;http://www.socrata.com/&#34;&gt;Socrata&lt;/a&gt;&lt;/strong&gt; right after Kasabi because RDF is one of their export formats, along with XML, JSON, CSV, XLS, and more. In a business that depends on finding both data providers and data users, their home page makes the clearest case about why someone should work with them as a data provider: they&amp;rsquo;re clearly targeting government agencies who need to fulfill data transparency mandates. (Other providers are certainly targeting this market; just not as clearly.) The &lt;a href=&#34;http://www.socrata.com/company-info/&#34;&gt;company info&lt;/a&gt; page calls them &amp;ldquo;The Leader in Open Data Services for Government&amp;rdquo;. Another paragraph on the homepage makes a nice case for why developers should be interested in their data, and upcoming webinar titles of &amp;ldquo;Launch your own Data.Gov&amp;rdquo; and &amp;ldquo;Open Data as a Service Delivery Platform&amp;rdquo; are also pretty catchy to someone interested in this market.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://www.factual.com/&#34;&gt;Factual&lt;/a&gt;&lt;/strong&gt; targets data users more than data providers on their current home page, telling developers &amp;ldquo;Access great data for your web and mobile apps&amp;rdquo;. The only download format I could find was CSV, but with their emphasis on helping developers build apps, they focus more on data delivery through their RESTful &lt;a href=&#34;http://wiki.developer.factual.com/w/page/29670788/Server-API&#34;&gt;API&lt;/a&gt;. According to their &lt;a href=&#34;http://www.factual.com/FAQ&#34;&gt;FAQ&lt;/a&gt;, &amp;ldquo;Factual, Inc. is an open data platform for application developers that leverages large scale aggregation and community exchange&amp;hellip; Factual&amp;rsquo;s hosted data comes from our community of users, developers and partners, and from our powerful data mining tools&amp;hellip; Factual offers several hundred thousand datasets across a variety of topics (with a deep focus in Local) aggregated from multiple sources, made easily accessible for developers to build web and mobile apps&amp;hellip; Our APIs are free to everyone—if you want SLAs or have certain performance requirements, we would charge you a fee based on usage volume. Our downloads are free for smaller developers&amp;rdquo;. A &lt;a href=&#34;http://semantifi.wordpress.com/2010/02/11/data-is-the-future-of-web-latest-validation-from-prominent-investors/&#34;&gt;press release&lt;/a&gt; on Semantifi&amp;rsquo;s web site shows that some big names and big money are behind Factual.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://www.infochimps.com&#34;&gt;Infochimps&lt;/a&gt;&lt;/strong&gt; seems to be one of the better-known (and memorable) names in the field. From their &lt;a href=&#34;http://www.infochimps.com/faq&#34;&gt;FAQ&lt;/a&gt;: &amp;ldquo;Infochimps is a place for people to find, share and sell formatted data. Both users and Infochimps employees scrape, parse and format data so that it&amp;rsquo;s easily accessible to you. We take the chimp work out of working with data so you can literally start building cool stuff in minutes&amp;hellip; There is no sign up fee to use Infochimps. Some of the data sets available on our site are free. Some require attribution, and others are available for purchase. The first 100,000 data API calls are free. We offer subscriptions if you would like to use more&amp;hellip; The data sets available through our API are 1.) hosted for you and 2.) scraped on a regular basis. &amp;hellip; Most of our data comes in tsv, csv or yaml format&amp;rdquo;. The part about users scraping, parsing, and formatting highlights another aspect of the business model of some of these companies: crowd-sourcing the labor whenever possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://www.aggdata.com&#34;&gt;AggData&lt;/a&gt;&lt;/strong&gt; sells CSV files, typically of locations of all the stores in a particular chain. For example, a complete list of &lt;a href=&#34;http://www.cinnabon.com/&#34;&gt;Cinnabon&lt;/a&gt; locations, with 454 records, costs $29. The &lt;a href=&#34;http://www.aggdata.com/locations/cinnabon&#34;&gt;description page&lt;/a&gt; for each data set lists the fields and lets you download a sample. Prices that I saw ranged from $9 to $49. According to their &lt;a href=&#34;http://www.aggdata.com/faq&#34;&gt;FAQ&lt;/a&gt;, you order a dataset, and when payment is confirmed they email you a URL for the data that is good for 5 downloads or 120 hours. Being founded in 2006 and therefore the oldest of these companies, AggData is the most low-tech (no APIs here) but it&amp;rsquo;s a lot easier to look at their lists of franchise locations and churches and imagine that data being useful to someone than it is for many of the other data providers. Infochimps lists AggData as a &amp;ldquo;featured data provider&amp;rdquo;, but lists the same prices for the same datasets, so I&amp;rsquo;m not sure whether they&amp;rsquo;re just routing you to the same batches of data or making it available through their own APIs. (I got an Infochimps ID, clicked through for an AggData dataset until it asked me for credit card information, and stopped there.)&lt;/p&gt;
&lt;p&gt;According to their &lt;a href=&#34;http://www.semantifi.com/SemantifiPortal.html&#34;&gt;About&lt;/a&gt; page, &lt;strong&gt;&lt;a href=&#34;http://www.semantifi.com/semantifiHome.action?type=SI&#34;&gt;Semantifi&lt;/a&gt;&lt;/strong&gt; &amp;ldquo;developed a meaning based search platform to search both structured and unstructured content and filed multiple patents&amp;rdquo;. Along with the platform, they say that they have an &amp;ldquo;App Store like marketplace for a community of publishers to build data search apps&amp;rdquo; and that &amp;ldquo;Both Socrata and Factual are quite similar in concept and both lack the technology to search datasets like Semantifi&amp;rdquo;. As far as I could tell, Socrata and Factual have a lot more datasets than Semantifi; the first three Semantifi links that I clicked to look into specific data sets went to an &lt;a href=&#34;http://wiki.semantifi.com/index.php/100_Best_Places_To_Live&#34;&gt;empty wiki page&lt;/a&gt;. (If I was clicking in the wrong place, that&amp;rsquo;s not a great reflection on their site design. Also, with all of the people with hardcore financial markets experience on Semantifi&amp;rsquo;s &lt;a href=&#34;http://www.semantifi.com/Management.htm&#34;&gt;management&lt;/a&gt; page, why do they need Google ads on their home page?) Perhaps Semantifi is less like data providers Socrata and Factual than they think and more like &lt;a href=&#34;http://open.mflask.com/&#34;&gt;Open Data Directory&lt;/a&gt;, which doesn&amp;rsquo;t provide actual data but instead a search engine for data spread out across other sites that they index.&lt;/p&gt;
&lt;p&gt;I wanted to mention one other interesting source of fairly large-scale data to use in applications—when I learned how to add a volume for more disk space to an Amazon EC2 cloud image, I found that some of the volumes I could choose from included data from a choice of public data sets: DBpedia and Freebase dumps, the &lt;a href=&#34;http://www.cs.cmu.edu/~enron/&#34;&gt;Enron email&lt;/a&gt;, US Census, Labor, and economic data, various biological data collections, and more. There is a &lt;a href=&#34;http://aws.amazon.com/datasets?_encoding=UTF8&amp;amp;jiveRedirect=1&#34;&gt;list of such data&lt;/a&gt; on Amazon&amp;rsquo;s website, but it doesn&amp;rsquo;t show all the choices; additional data sets include &lt;a href=&#34;http://ods.openlinksw.com/wiki/main/Main/VirtAWSBBCMusicProgs&#34;&gt;BBC Music and programs data&lt;/a&gt;. If you were going to jump into the data reseller market with the various companies described above, an Amazon image with some of this data would be one logical place to start your company.&lt;/p&gt;
&lt;p&gt;A local friend &lt;a href=&#34;http://twitter.com/#!/dep4b&#34;&gt;Eric Pugh&lt;/a&gt; was recently pointing out to me the irony of how, while disintermediation was a big buzzword of the dot com boom, intermediation is now getting bigger. These data resellers are a good example. If you&amp;rsquo;re going to insert yourself as a middleman between a data provider and a data user, it&amp;rsquo;s a compelling case for either side to use your service if you have a lot of customers on the other side, but before you get there, you need to make your own compelling case to each side. Some of the companies listed above are better at doing this than others, and it will be interesting to see which of them are in business in five years and why they lasted.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://publishmydata.com&#34; title=&#34;http://publishmydata.com&#34;&gt;Bill Roberts&lt;/a&gt; on &lt;a href=&#34;#comment-2845&#34;&gt;May 2, 2011 9:06 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob&lt;/p&gt;
&lt;p&gt;If I may add ourselves to your list, as another company working in the data publishing market, Swirrl has a product &lt;a href=&#34;http://publishmydata.com&#34;&gt;PublishMyData&lt;/a&gt;. Rather than being a data aggregator or marketplace, we are aiming at enabling the data owners to publish it themselves as Linked Data. (We only do Linked Data in the full RDF/SPARQL sense of the term).&lt;/p&gt;
&lt;p&gt;At the moment we&amp;rsquo;re offering a hosted/full-service approach, but will be introducing do-it-yourself options in future. We&amp;rsquo;re currently concentrating on the public sector and on open data - so the data is all free to use and the business model is that the data owner pays to publish. So the data provided via our service could be used directly, or could be picked up and re-offered through one of these intermediary sites.&lt;/p&gt;
&lt;p&gt;But our philosophy is that the best way to get high quality Linked Data online (and so highly re-usable data) is for the data owners to take responsibility for doing it themselves.&lt;/p&gt;
&lt;p&gt;Cheers&lt;/p&gt;
&lt;p&gt;Bill&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2847&#34;&gt;May 2, 2011 7:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Bill, looks very cool!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.aggdata.com&#34; title=&#34;http://www.aggdata.com&#34;&gt;Chris Hathaway&lt;/a&gt; on &lt;a href=&#34;#comment-2848&#34;&gt;May 2, 2011 8:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hey Bill,&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m the CEO &amp;amp; Founder of AggData. Thanks for the mention in your overview; it&amp;rsquo;s a great review. I did want to clear up a few uncertainties for our description. First, you are correct, Infochimps is currently just a reseller of our data, though we may integrate further with them in the future. And while our main aggdata.com site is pretty straightforward so people can easily get the data they are looking for, we have some broader and more technically involved options for interested clients, and we&amp;rsquo;re planning a public launch of an API very soon. Overall, it&amp;rsquo;s great to see the field growing and the excitement around providing quality data.&lt;/p&gt;
&lt;p&gt;Thanks,&lt;br /&gt;
Chris&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
    </item>
    
    <item>
      <title>Inserting data from a SPARQL endpoint into a relational database</title>
      <link>https://www.bobdc.com/blog/inserting-data-from-a-sparql-e/</link>
      <pubDate>Wed, 27 Apr 2011 09:28:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/inserting-data-from-a-sparql-e/</guid>
      
      
<description><div>Via XML.</div><div>&lt;p&gt;Retrieval of triples from relational databases is a popular topic in the semantic web world, but I was recently wondering how much trouble it would be to go in the opposite direction: to retrieve data from a SPARQL endpoint and load it into a relational database. It wasn&amp;rsquo;t much trouble at all. When you retrieve the results in the &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-XMLres-20080115/&#34;&gt;SPARQL query results XML format&lt;/a&gt;, a straightforward XSLT stylesheet can convert it into the necessary SQL INSERT statements. I was able to automate the data retrieval, conversion to INSERT statements, and actual insertion into a MySQL database with a three-line batch file that used no Windows-specific tricks, so I&amp;rsquo;m sure it would work on Linux just as well.&lt;/p&gt;
&lt;p&gt;I used the following SPARQL query to retrieve the name, founding year, and equity, revenue, net income, and operating income figures of companies listed on the New York Stock Exchange according to DBpedia. I used &lt;a href=&#34;http://jena.sourceforge.net/ARQ/&#34;&gt;ARQ&lt;/a&gt; to execute the query, so that after the inner query retrieved the raw data from the &lt;a href=&#34;http://DBpedia.org/sparql&#34;&gt;http://DBpedia.org/sparql&lt;/a&gt; SPARQL endpoint service, the outer query could use ARQ&amp;rsquo;s SPARQL 1.1 support to format the data a bit—mostly, by using the &lt;code&gt;str()&lt;/code&gt; function to strip language and datatype tags.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX do: &amp;lt;http://dbpedia.org/ontology/&amp;gt;


SELECT (str(?name) as ?coName) 
       (substr(str(?formationYearTyped),1,4) as ?formationYear)
       (str(?equityTyped) as ?equity) 
       (str(?revenueTyped) as ?revenue) 
       (str(?netIncomeTyped) as ?netIncome) 
       (str(?operatingIncomeTyped) as ?operatingIncome) 
  WHERE {
  SERVICE &amp;lt;http://DBpedia.org/sparql&amp;gt;
  {
    SELECT * WHERE {
     ?company &amp;lt;http://purl.org/dc/terms/subject&amp;gt; 
     &amp;lt;http://dbpedia.org/resource/Category:Companies_listed_on_the_New_York_Stock_Exchange&amp;gt; .
     ?company rdfs:label ?name . 
    FILTER ( lang(?name) = &amp;quot;en&amp;quot; )
      OPTIONAL { ?company do:formationYear ?formationYearTyped . } 
      OPTIONAL { ?company do:equity ?equityTyped . }
      OPTIONAL { ?company do:revenue ?revenueTyped . } 
      OPTIONAL { ?company do:netIncome  ?netIncomeTyped . } 
      OPTIONAL { ?company do:operatingIncome ?operatingIncomeTyped . } 
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following command line told ARQ to put the results of this query in an XML file called companyData.xml. (Because the query doesn&amp;rsquo;t have the FROM keyword, ARQ needs an input dataset specified, so the command names dummy.ttl as this input even though the query above ignores this file and gets its data from DBpedia using the SERVICE keyword.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;arq --results XML --query getCompanyData.spq --data dummy.ttl &amp;gt; companyData.xml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, I ran the following command to apply an XSLT stylesheet to the ARQ output using libxslt&amp;rsquo;s &lt;a href=&#34;http://xmlsoft.org/XSLT/xsltproc2.html&#34;&gt;xsltproc&lt;/a&gt; XSLT processor. (You could use Saxon or Xalan just as easily.) This generated the SQL statements that would add the data to a MySQL database and stored them in the file insertCompanyData.sql:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;xsltproc SPARQLXMLtoSQL.xsl companyData.xml &amp;gt; insertCompanyData.sql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The XSLT stylesheet is not particularly brief, but there&amp;rsquo;s no customized logic to process the output of the query above other than the use of the query&amp;rsquo;s variable names and the quotes that it adds around the &lt;code&gt;coName&lt;/code&gt; values. (The potential need for quotes depends on whether you&amp;rsquo;re inserting the value into the SQL database as a string.) The trickiest part was having the stylesheet output the string &amp;ldquo;NULL&amp;rdquo; when a value was missing; I used a named template, so it wasn&amp;rsquo;t too tricky.&lt;/p&gt;
&lt;p&gt;If I had many different query results to convert to SQL INSERT statements, I&amp;rsquo;d write a more generalized version of this stylesheet (for example, setting the name of the database and table to receive the data in variables at the top), but if I only had two or three sets of SPARQL query results to deal with, I could adapt this one for each of those pretty quickly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version=&amp;quot;1.0&amp;quot;
                xmlns:s=&amp;quot;http://www.w3.org/2005/sparql-results#&amp;quot;
                xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;&amp;gt;


  &amp;lt;xsl:strip-space elements=&amp;quot;*&amp;quot;/&amp;gt;
  &amp;lt;xsl:output method=&amp;quot;text&amp;quot;/&amp;gt;




  &amp;lt;xsl:template match=&amp;quot;s:sparql&amp;quot;&amp;gt;
    USE testdb;
    &amp;lt;xsl:apply-templates/&amp;gt;
  &amp;lt;/xsl:template&amp;gt;




  &amp;lt;xsl:template match=&amp;quot;text()&amp;quot;/&amp;gt; &amp;lt;!-- all values output with xsl:value-of --&amp;gt;




  &amp;lt;xsl:template match=&amp;quot;s:result&amp;quot;&amp;gt;


  &amp;lt;!-- Typical line for this template rule to create 
       (with carriage return added here):
       INSERT INTO company VALUES(
       &amp;quot;Protective Life&amp;quot;,1907,NULL,3.06E9,2.71E8,4.16E8);
   --&amp;gt;
    &amp;lt;xsl:text&amp;gt;INSERT INTO company VALUES(&amp;quot;&amp;lt;/xsl:text&amp;gt;
    &amp;lt;xsl:value-of select=&amp;quot;s:binding[@name=&#39;coName&#39;]/s:literal&amp;quot;/&amp;gt;
    &amp;lt;xsl:text&amp;gt;&amp;quot;,&amp;lt;/xsl:text&amp;gt;


    &amp;lt;xsl:call-template name=&amp;quot;valueOrNULL&amp;quot;&amp;gt;
      &amp;lt;xsl:with-param name=&amp;quot;value&amp;quot;
                      select=&amp;quot;s:binding[@name=&#39;formationYear&#39;]/s:literal&amp;quot;/&amp;gt;
    &amp;lt;/xsl:call-template&amp;gt;
    &amp;lt;xsl:text&amp;gt;,&amp;lt;/xsl:text&amp;gt;


    &amp;lt;xsl:call-template name=&amp;quot;valueOrNULL&amp;quot;&amp;gt;
      &amp;lt;xsl:with-param name=&amp;quot;value&amp;quot;
                      select=&amp;quot;s:binding[@name=&#39;equity&#39;]/s:literal&amp;quot;/&amp;gt;
    &amp;lt;/xsl:call-template&amp;gt;
    &amp;lt;xsl:text&amp;gt;,&amp;lt;/xsl:text&amp;gt;


    &amp;lt;xsl:call-template name=&amp;quot;valueOrNULL&amp;quot;&amp;gt;
      &amp;lt;xsl:with-param name=&amp;quot;value&amp;quot;
                      select=&amp;quot;s:binding[@name=&#39;revenue&#39;]/s:literal&amp;quot;/&amp;gt;
    &amp;lt;/xsl:call-template&amp;gt;
    &amp;lt;xsl:text&amp;gt;,&amp;lt;/xsl:text&amp;gt;


    &amp;lt;xsl:call-template name=&amp;quot;valueOrNULL&amp;quot;&amp;gt;
      &amp;lt;xsl:with-param name=&amp;quot;value&amp;quot;
                      select=&amp;quot;s:binding[@name=&#39;netIncome&#39;]/s:literal&amp;quot;/&amp;gt;
    &amp;lt;/xsl:call-template&amp;gt;
    &amp;lt;xsl:text&amp;gt;,&amp;lt;/xsl:text&amp;gt;


    &amp;lt;xsl:call-template name=&amp;quot;valueOrNULL&amp;quot;&amp;gt;
      &amp;lt;xsl:with-param name=&amp;quot;value&amp;quot;
                      select=&amp;quot;s:binding[@name=&#39;operatingIncome&#39;]/s:literal&amp;quot;/&amp;gt;
    &amp;lt;/xsl:call-template&amp;gt;


    &amp;lt;xsl:text&amp;gt;);&amp;amp;#10;&amp;lt;/xsl:text&amp;gt;
  &amp;lt;/xsl:template&amp;gt;

 


  &amp;lt;xsl:template name=&amp;quot;valueOrNULL&amp;quot;&amp;gt;
    &amp;lt;xsl:param name=&amp;quot;value&amp;quot;/&amp;gt;
    &amp;lt;xsl:choose&amp;gt;
      &amp;lt;xsl:when test=&amp;quot; $value != &#39;&#39; &amp;quot;&amp;gt;
        &amp;lt;xsl:value-of select=&amp;quot;$value&amp;quot;/&amp;gt;
      &amp;lt;/xsl:when&amp;gt;
      &amp;lt;xsl:otherwise&amp;gt;NULL&amp;lt;/xsl:otherwise&amp;gt;
    &amp;lt;/xsl:choose&amp;gt;
  &amp;lt;/xsl:template&amp;gt;




&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To run the created INSERT statements with a MySQL database table, I just did this, substituting my own MySQL username and password:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mysql -u myusername --password=mypass &amp;lt; insertCompanyData.sql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course, the created set of INSERT statements assumes that a database named testdb with a table named company already exists, and that the appropriate columns have been declared for that table.&lt;/p&gt;
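&lt;p&gt;For reference, a minimal table declaration that would accept these INSERT statements might look something like the following sketch. The column types here are guesses based on the sample row shown earlier (an unquoted integer year and financial figures in scientific notation); adjust them to your own needs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE DATABASE IF NOT EXISTS testdb;
USE testdb;
CREATE TABLE company (
  coName          VARCHAR(100),  -- company name, quoted in the INSERT statements
  formationYear   INT,           -- first four characters of the typed value
  equity          DOUBLE,        -- these four may be NULL when DBpedia had
  revenue         DOUBLE,        -- no value for the query's OPTIONAL patterns
  netIncome       DOUBLE,
  operatingIncome DOUBLE
);
&lt;/code&gt;&lt;/pre&gt;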
&lt;p&gt;After combining the command line calls to arq, xsltproc, and mysql in a three-line batch file, it was fun to see it all happen unattended. For a more serious implementation, you&amp;rsquo;d want to look into using the APIs of the various tools as a more efficient alternative to this kind of scripting, but it&amp;rsquo;s nice to see how much can be done with a little scripting.&lt;/p&gt;
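&lt;p&gt;The batch file itself is nothing more than the three commands shown above, one per line, so that a single invocation runs the whole pipeline (same file names as earlier; your paths and credentials may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;arq --results XML --query getCompanyData.spq --data dummy.ttl &amp;gt; companyData.xml
xsltproc SPARQLXMLtoSQL.xsl companyData.xml &amp;gt; insertCompanyData.sql
mysql -u myusername --password=mypass &amp;lt; insertCompanyData.sql
&lt;/code&gt;&lt;/pre&gt;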
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Form-driven SPARQL queries without scripting</title>
      <link>https://www.bobdc.com/blog/form-driven-sparql-queries-wit/</link>
      <pubDate>Wed, 20 Apr 2011 08:39:27 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/form-driven-sparql-queries-wit/</guid>
      
      
      <description><div>Just two lines in an .htaccess file.</div><div>&lt;p&gt;In a &lt;a href=&#34;http://www.wfmu.org/playlists/shows/39765&#34;&gt;podcast&lt;/a&gt; of a radio show I was listening to recently, the host asserted that 80s rapper Schoolly D had scored most of director Abel Ferrara&amp;rsquo;s films. I was curious about this, so I went to IMDB&amp;rsquo;s &lt;a href=&#34;http://www.imdb.com/name/nm0001206/&#34;&gt;page for Ferrara&lt;/a&gt;, clicked on the first film title, scrolled down, clicked &amp;ldquo;Full cast and crew&amp;rdquo;, checked the music credit, returned to Ferrara&amp;rsquo;s main page, and repeated the last few steps&amp;hellip; until I realized that one SPARQL query could create a single list of Ferrara&amp;rsquo;s films with the film score credit next to each one.&lt;/p&gt;
&lt;p&gt;The following query, when entered on DBpedia&amp;rsquo;s &lt;a href=&#34;http://dbpedia.org/snorql/&#34;&gt;snorql&lt;/a&gt; form, shows that Mr. D is credited with two films, and that Joe Delia is credited with many more:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?title ?scorer WHERE 
{   
  ?director rdfs:label &amp;quot;Abel Ferrara&amp;quot;@en .    
  ?film &amp;lt;http://dbpedia.org/ontology/director&amp;gt; ?director .    
  ?film rdfs:label ?title .   
  FILTER ( lang(?title) = &amp;quot;en&amp;quot; )   
  ?film &amp;lt;http://dbpedia.org/property/music&amp;gt; ?scorer .  
} 
ORDER BY ?scorer
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Further research showed that Delia brought in D to contribute to many of the films for which he is credited. Also, I could have done this with the &lt;a href=&#34;http://www.linkedmdb.org/&#34;&gt;Linked Movie Database&lt;/a&gt; SPARQL endpoint, as I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/sparql-at-the-movies&#34;&gt;written about before&lt;/a&gt;, but I&amp;rsquo;ve been exploring DBpedia&amp;rsquo;s film data more lately.)&lt;/p&gt;
&lt;p&gt;A great way to spread the benefits of SPARQL and semantic web data while keeping the syntax parts under the covers is to create a web form for users to fill out and to insert the entered values into a SPARQL query. I thought that a form where you enter a director&amp;rsquo;s name and then see who scored his or her films would be a nice example of this. In the IBM developerWorks article &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-wikiquery/&#34;&gt;Build Wikipedia query forms with semantic technology&lt;/a&gt;, I described and linked to two such forms; the first listed all the actors who appeared in movies by the two directors whose names you entered in the form (for example, everyone who appeared in films by both Woody Allen and Martin Scorsese), and the other searched album and artist names for strings of text and displayed basic information about the albums it found.&lt;/p&gt;
&lt;p&gt;Both of those forms passed the entered values to python scripts that plugged the values into SPARQL queries before sending these queries off to the appropriate SPARQL endpoints. Recently, though, while reading Tom Heath and Christian Bizer&amp;rsquo;s book &lt;a href=&#34;http://linkeddatabook.com/editions/1.0/&#34;&gt;Linked Data: Evolving the Web into a Global Data Space&lt;/a&gt;, I had a better idea. I&amp;rsquo;ve used &lt;a href=&#34;http://httpd.apache.org/docs/1.3/howto/htaccess.html&#34;&gt;.htaccess&lt;/a&gt; files to redirect an Apache HTTP server from one requested URL to another (for example, when I&amp;rsquo;ve moved a file but don&amp;rsquo;t want to break links that point to it), but I didn&amp;rsquo;t know about the regular expression support in the Apache mod_rewrite module that carries out the .htaccess instructions. It turns out that, because of this feature, I don&amp;rsquo;t even need a script to execute a SPARQL query with values from a web form.&lt;/p&gt;
&lt;p&gt;A form that I put at &lt;a href=&#34;http://snee.com/sparqlforms/directors/filmscores.html&#34;&gt;http://snee.com/sparqlforms/directors/filmscores.html&lt;/a&gt; has a single field where you enter a director&amp;rsquo;s name. When you click the &amp;ldquo;go&amp;rdquo; button, the form&amp;rsquo;s action is &lt;a href=&#34;http://www.snee.com/sparqlforms/directors/composers&#34;&gt;http://www.snee.com/sparqlforms/directors/composers&lt;/a&gt;, so if you enter &amp;ldquo;John Ford&amp;rdquo; the form does an HTTP GET with the URL &lt;a href=&#34;http://www.snee.com/sparqlforms/directors/composers?director=John+Ford&#34;&gt;http://www.snee.com/sparqlforms/directors/composers?director=John+Ford&lt;/a&gt;.&lt;/p&gt;
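&lt;p&gt;The form itself needs nothing exotic; something along the lines of this minimal sketch (assuming a GET form with a single &lt;code&gt;director&lt;/code&gt; field, as described above) is enough:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;form action=&amp;quot;http://www.snee.com/sparqlforms/directors/composers&amp;quot; method=&amp;quot;get&amp;quot;&amp;gt;
  Director: &amp;lt;input type=&amp;quot;text&amp;quot; name=&amp;quot;director&amp;quot;/&amp;gt;
  &amp;lt;input type=&amp;quot;submit&amp;quot; value=&amp;quot;go&amp;quot;/&amp;gt;
&amp;lt;/form&amp;gt;
&lt;/code&gt;&lt;/pre&gt;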
&lt;p&gt;The .htaccess file in the same directory has the following three lines (everything from &amp;ldquo;RewriteRule&amp;rdquo; to the end is one line, split up for easier viewing here):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RewriteEngine on


RewriteCond %{QUERY_STRING} ^director=(.*)$


RewriteRule ^composers.*$ http://dbpedia.org/sparql?query=
PREFIX+rdfs:+&amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;+
SELECT+?title+?scorer+WHERE+{+?director+rdfs:label+&amp;quot;%1&amp;quot;@en+.+
?film+&amp;lt;http://dbpedia.org/ontology/director&amp;gt;+?director+.+
?film+rdfs:label+?title+.+FILTER+(+lang(?title)+=+&amp;quot;en&amp;quot;+)+
?film+&amp;lt;http://dbpedia.org/property/music&amp;gt;+?scorer+.++}+ORDER+BY+?scorer
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most of the third &amp;ldquo;line&amp;rdquo; is just an &lt;a href=&#34;http://www.xs4all.nl/~jlpoutre/BoT/Javascript/Utils/endecode.html&#34;&gt;escaped&lt;/a&gt; version of the SPARQL query about who scored Abel Ferrara films. I won&amp;rsquo;t go into details about the syntax of the rest of the three lines because &lt;a href=&#34;http://www.workingwith.me.uk/articles/scripting/mod_rewrite&#34;&gt;this tutorial&lt;/a&gt; explains the basics better than I could and &lt;a href=&#34;http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html&#34;&gt;this bit of Apache documentation&lt;/a&gt; is pretty comprehensive.&lt;/p&gt;
&lt;p&gt;To summarize, RewriteRule gets two expressions as arguments: what to look for and what to replace it with when redirecting your browser or other client. Regular expression matching in the first parameter can use parentheses, and the second expression can refer to these matched expressions with variable references like $1 and $2. HTTP GET parameters like &amp;ldquo;?director=John+Ford&amp;rdquo; are a special case, though—RewriteRule regular expressions won&amp;rsquo;t find them—which is why I have the RewriteCond line above. That matches the value of the director parameter, and the RewriteRule references that with %1 (as distinguished from $1, which would reference something matched in the RewriteRule). This inserts the value into the escaped version of the SPARQL query where I had &amp;ldquo;Abel Ferrara&amp;rdquo; in my original query. The query is part of a URL that executes the query on DBpedia&amp;rsquo;s endpoint, so the user who clicks &amp;ldquo;go&amp;rdquo; on the form will see the list of film titles and music credits. Try the form yourself, and make sure to use a director&amp;rsquo;s official name (for example, &amp;ldquo;Marty Scorsese&amp;rdquo; won&amp;rsquo;t get you anything).&lt;/p&gt;
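&lt;p&gt;As a minimal illustration of that $1/%1 distinction (hypothetical paths, not part of the form described above): %1 holds what the RewriteCond matched in the query string, while $1 holds whatever the RewriteRule&amp;rsquo;s own parentheses matched in the requested path:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RewriteEngine on
RewriteCond %{QUERY_STRING} ^director=(.*)$
RewriteRule ^films/(.*)$ /lookup/$1?who=%1
&lt;/code&gt;&lt;/pre&gt;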
&lt;p&gt;This kind of URL rewriting is an important technique in Linked Data publishing, where you want to assign sensible, &lt;a href=&#34;http://www.w3.org/Provider/Style/URI.html&#34;&gt;cool&lt;/a&gt; URIs to resources but may have some less cool details in how you actually serve up the resource data. For a larger, more complex application, it&amp;rsquo;s nice to know that I would only need to add two more lines to the .htaccess file for each new form/query combination. This can be a very valuable tool for semantic web application development. (I couldn&amp;rsquo;t get it to work with a local copy of the Apache HTTP server or with the &lt;a href=&#34;http://groups.google.com/group/urlrewrite/browse_thread/thread/6af6552e4d5a800c/&#34;&gt;Url Rewrite Filter&lt;/a&gt; designed to allow the same thing with Tomcat, though, so I may have to go back to the python CGI scripts for local applications.)&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.blogto.com/music/2006/09/schoolly_d_visits_the_t_dot/&#34;&gt;&lt;img id=&#34;id103641&#34; src=&#34;http://www.blogto.com/schoolly.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; hspace=&#34;30px&#34; width=&#34;240&#34; alt=&#34;Schoolly D poster&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.springboardseo.com&#34; title=&#34;http://www.springboardseo.com&#34;&gt;Matthew&lt;/a&gt; on &lt;a href=&#34;#comment-2832&#34;&gt;April 21, 2011 6:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;The first time I heard a Schooly D song in a Ferrara movie (King of New York) I flipped :)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been going through your post from 2007 on &lt;a href=&#34;https://www.bobdc.com/blog/querying-dbpedia&#34;&gt;querying DBpedia&lt;/a&gt; and see that the chalkboard query no longer works in snorql. Could you tell me why this is?&lt;/p&gt;
&lt;p&gt;thanks for an informative blog!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2833&#34;&gt;April 21, 2011 8:36 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Matthew,&lt;/p&gt;
&lt;p&gt;DBpedia has rearranged some of their vocabulary. I just fixed the queries and description in that post so that the query now works properly.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://vasiliy.faronov.name/&#34; title=&#34;http://vasiliy.faronov.name/&#34;&gt;Vasiliy Faronov&lt;/a&gt; on &lt;a href=&#34;#comment-2834&#34;&gt;April 21, 2011 12:15 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And the user can’t pass a double quote into that string, right?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2835&#34;&gt;April 21, 2011 12:30 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Vasiliy,&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t try that, but it makes sense.&lt;/p&gt;
&lt;p&gt;For more fine-grained control over things like that, a script probably would be better, but a regex guru might be able to work it right into the .htaccess code.&lt;/p&gt;
&lt;p&gt;By amit on &lt;a href=&#34;#comment-2837&#34;&gt;April 21, 2011 11:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice read. Please Try my webapp for querying the semantic data: &lt;a href=&#34;http://WWW.s3space.com&#34;&gt;http://WWW.s3space.com&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Getting started with SPARQL Update</title>
      <link>https://www.bobdc.com/blog/getting-started-with-sparql-up/</link>
      <pubDate>Mon, 04 Apr 2011 09:08:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started-with-sparql-up/</guid>
      
      
      <description><div>Using Fuseki.</div><div>&lt;p&gt;I&amp;rsquo;ve described in earlier postings (&lt;a href=&#34;https://www.bobdc.com/blog/trying-sparql-11-new-query-fea&#34;&gt;[1]&lt;/a&gt;,&lt;a href=&#34;http://www.snee.com/bobdc.blog/2010/10/playing-more-with-sparql-11-pr.html&#34;&gt;[2]&lt;/a&gt;) how I mostly use Jena &lt;a href=&#34;http://jena.sourceforge.net/ARQ/&#34;&gt;ARQ&lt;/a&gt; to play with SPARQL 1.1 queries. To try out the new &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-update-20101014/&#34;&gt;SPARQL Update&lt;/a&gt; commands, I wanted to use a simple triplestore where I could add, replace, and delete triples, and Jena &lt;a href=&#34;http://openjena.org/wiki/Fuseki&#34;&gt;Fuseki&lt;/a&gt; has turned out to be a very simple way to do this.&lt;/p&gt;
&lt;p&gt;Unzipping the file that I downloaded from the Fuseki &lt;a href=&#34;http://openjena.org/repo-dev/org/openjena/fuseki/0.2.0-SNAPSHOT/&#34;&gt;release 0.2 development snapshot&lt;/a&gt; created a directory with a jar file, some shell scripts, and a few other files. The shell scripts included with this zip file are all Linux-oriented, but looking at them I figured out how to start up the Fuseki server in Windows easily enough. Everything shown below worked in both Windows XP and Ubuntu.&lt;/p&gt;
&lt;p&gt;Running this command in the Fuseki directory lists your options:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -jar fuseki-sys.jar --help
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following command line worked great for me to start up the Fuseki server with a 1200 meg Java heap space, a figure I saw in the fuseki-server shell script included with the zip file. It allows users of the server to update data, stores data in a TDB database in the &lt;code&gt;dataDir&lt;/code&gt; subdirectory that I created in the Fuseki directory before running this command, and selects the myDataset dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -Xmx1200M -jar fuseki-sys.jar --update --loc=dataDir /myDataset
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After starting the server up, you can send your browser to the main Fuseki screen at &lt;a href=&#34;http://localhost:3030/&#34;&gt;http://localhost:3030&lt;/a&gt;. Click its Control Panel link, then click the &lt;strong&gt;Select&lt;/strong&gt; button to pick the /myDataset dataset, and you&amp;rsquo;ll be on the Fuseki Query screen:&lt;/p&gt;
&lt;img id=&#34;id103414&#34; src=&#34;https://www.bobdc.com/img/main/fusekimainscreen.jpg&#34; width=&#34;540&#34;/&gt;
&lt;p&gt;To insert a bit of data into the triplestore, paste the following into the box in the SPARQL Update part of the form and click the &lt;strong&gt;Perform update&lt;/strong&gt; button under the box:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:    &amp;lt;http://example.com/ns/data#&amp;gt; 
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;


INSERT DATA
{ 
  d:i1 rdfs:label &amp;quot;one&amp;quot; . 
  d:i2 rdfs:label &amp;quot;two&amp;quot; . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fuseki will show a screen saying that your &amp;ldquo;Update succeeded&amp;rdquo;. Click your browser&amp;rsquo;s Back button to return to the main Fuseki screen.&lt;/p&gt;
&lt;p&gt;Next, enter the following classic query in the SPARQL Query part of the Fuseki Query form and click the &lt;strong&gt;Get Results&lt;/strong&gt; button there:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT *
WHERE { ?s ?p ?o }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fuseki will show you all the triples in your triplestore&amp;rsquo;s default graph—both of them. Click your Back button again, and Fuseki&amp;rsquo;s query form will still have the two queries you entered. You can edit them, comment out certain lines, and do whatever you like as you experiment with inserting, changing, and deleting data in the triplestore.&lt;/p&gt;
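&lt;p&gt;For example, to experiment with deletion, pasting something like this into the same SPARQL Update box should remove one of the two triples inserted above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d:    &amp;lt;http://example.com/ns/data#&amp;gt; 
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

DELETE DATA
{ 
  d:i1 rdfs:label &amp;quot;one&amp;quot; . 
}
&lt;/code&gt;&lt;/pre&gt;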
&lt;p&gt;Because of the command line options that I used when starting up Fuseki above, the data persists. If you go to the command line window where you started up the Fuseki server and press Ctrl+C to shut it down, then start it up again and try the same SELECT query above, you&amp;rsquo;ll see that the data is still there.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a lot more you can do with Fuseki, but the steps above were all I needed to create an environment where I could try out the commands described in the SPARQL Update spec. As far as I can tell, all the important parts of the spec work in Fuseki, so it&amp;rsquo;s a fine way to get to know this great new addition to SPARQL.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By Scott Henninger on &lt;a href=&#34;#comment-2825&#34;&gt;April 6, 2011 3:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Also see &lt;a href=&#34;http://topquadrantblog.blogspot.com/search/label/SPARQL%20endpoint&#34;&gt;http://topquadrantblog.blogspot.com/search/label/SPARQL%20endpoint&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
    </item>
    
    <item>
      <title>From a Wikipedia page to the corresponding DBpedia page in one click</title>
      <link>https://www.bobdc.com/blog/from-a-wikipedia-page-to-the-c/</link>
      <pubDate>Fri, 25 Mar 2011 09:09:27 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/from-a-wikipedia-page-to-the-c/</guid>
      
      
      <description><div>And a similar link for Freebase.</div><div>&lt;p&gt;The following two links won&amp;rsquo;t do much if you click them now, but if you drag them to your bookmarks toolbar, clicking the first one there while viewing a Wikipedia page will take you to the corresponding DBpedia page, and clicking the second while viewing the Freebase page for a particular topic will take you to the page full of RDF for that topic.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;javascript:location.href=(location.href.replace(/en.wikipedia.org%5C/wiki/,%22dbpedia.org%5C/page%22).replace(%22(%22,%22%5C%2%5C8%22).replace(%22)%22,%22%5C%2%5C9%22))&#34;&gt;wp -&amp;gt; dbpedia&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;javascript:location.href=(location.href.replace(/http:%5C/%5C/www.freebase.com%5C/view%5C/en%5C//,%22http:%5C/%5C/rdf.freebase.com%5C/rdf%5C/en.%22))&#34;&gt;freebase rdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;They&amp;rsquo;re both scriptlets, or little bits of Javascript code embedded in links. Each reads the URL of the currently displayed page, does a bit of string manipulation on it, and sends your browser to the resulting URL. I&amp;rsquo;ve had the DBpedia one for a while, but I recently found that DBpedia URLs escape parentheses when Wikipedia URLs don&amp;rsquo;t, so I fixed the scriptlet to account for this. While I was at it, I created the Freebase one, which is much simpler.&lt;/p&gt;
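&lt;p&gt;Unescaped, the core of the DBpedia scriptlet is roughly a one-line string substitution like the following (simplified here to omit the parenthesis handling):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;javascript:location.href=location.href.replace(/en.wikipedia.org\/wiki/,&amp;quot;dbpedia.org/page&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;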
&lt;p&gt;If anyone&amp;rsquo;s interested, I also have scriptlets to go to a site&amp;rsquo;s home page, an &amp;ldquo;up&amp;rdquo; button (cd ..), and a backlink button that searches Google for webpages linking to the currently displayed one.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://tfmorris.blogspot.com&#34; title=&#34;http://tfmorris.blogspot.com&#34;&gt;Tom Morris&lt;/a&gt; on &lt;a href=&#34;#comment-2814&#34;&gt;March 27, 2011 12:43 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Your Freebase script doesn&amp;rsquo;t look like it&amp;rsquo;s using the proper namespace. There&amp;rsquo;s no guarantee that there will be a key in the /en namespace and, if there is, it may only be loosely related to the name of the Wikipedia article. You should reference the /wikipedia/en namespace and make sure the key is properly quoted.&lt;/p&gt;
&lt;p&gt;See my WTF (Wikipedia-to-Freebase) Chrome extension or Zak Dweil&amp;rsquo;s Greasemonkey script for a way to do this.&lt;br /&gt;
&lt;a href=&#34;https://chrome.google.com/extensions/detail/hgmjdmegeidmljpoilgmfeifmiepnbkn&#34;&gt;https://chrome.google.com/extensions/detail/hgmjdmegeidmljpoilgmfeifmiepnbkn&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2815&#34;&gt;March 27, 2011 12:49 PM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;it may only be loosely related to the name of the Wikipedia article&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The idea was to go from the Freebase article to the Freebase page, not from the Wikipedia article to the Freebase page. I was just taking advantage of the commonality I found in URIs when I clicked the RDF links at the bottom of Freebase pages.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.s3space.com&#34; title=&#34;http://www.s3space.com&#34;&gt;Amit&lt;/a&gt; on &lt;a href=&#34;#comment-2829&#34;&gt;April 10, 2011 11:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is very useful for creating SPARQL queries. I created a site s3space.com several months ago where users can learn, create and share the SPARQL queries.&lt;/p&gt;
&lt;p&gt;This conversion tool will help many to create more SPARQL queries easily.&lt;/p&gt;
&lt;p&gt;Thanks for awesome tool.&lt;/p&gt;
&lt;p&gt;Regards,&lt;br /&gt;
Amit&lt;br /&gt;
&lt;a href=&#34;http://www.s3space.com&#34;&gt;http://www.s3space.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2836&#34;&gt;April 21, 2011 12:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Amit, s3space.com looks pretty cool.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Pulling SKOS prefLabel and altLabel values out of DBpedia</title>
      <link>https://www.bobdc.com/blog/pulling-skos-preflabel-and-alt/</link>
      <pubDate>Tue, 22 Feb 2011 09:12:28 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/pulling-skos-preflabel-and-alt/</guid>
      
      
      <description><div>Or, using linked data to build a standards-compliant thesaurus with SPARQL.</div><div>&lt;p&gt;When my TopQuadrant colleague &lt;a href=&#34;http://dallemang.typepad.com/&#34;&gt;Dean Allemang&lt;/a&gt; referred to the use of &lt;a href=&#34;http://dbpedia.org/About&#34;&gt;DBpedia&lt;/a&gt; as a controlled vocabulary, I said &amp;ldquo;Huh?&amp;rdquo; He helped me to realize that if you and I want to refer to the same person, place, or thing, but there&amp;rsquo;s a chance that we might use different names for it, DBpedia&amp;rsquo;s URI for it might make the best identifier for us to both use. For example, if you refer to the nineteenth-century American president and Civil War general Ulysses S. Grant and I refer to him as Ulysses Grant, and then we find out that DBpedia&amp;rsquo;s URI for him is &lt;code&gt;http://dbpedia.org/resource/Ulysses_S._Grant&lt;/code&gt;, I&amp;rsquo;m not going to insist on leaving Grant&amp;rsquo;s middle initial out of the URI.&lt;/p&gt;
&lt;p&gt;Grant once had the nickname &amp;ldquo;Useless S. Grant&amp;rdquo;, and DBpedia can help us here, too. If you try to go to a Wikipedia page for &lt;a href=&#34;http://en.wikipedia.org/wiki/Useless_S._Grant&#34;&gt;http://en.wikipedia.org/wiki/Useless_S._Grant&lt;/a&gt;, instead of sending you an error message, Wikipedia will redirect you to the &lt;a href=&#34;http://en.wikipedia.org/wiki/Ulysses_S._Grant&#34;&gt;http://en.wikipedia.org/wiki/Ulysses_S._Grant&lt;/a&gt; page. DBpedia uses the &lt;code&gt;http://dbpedia.org/ontology/wikiPageRedirects&lt;/code&gt; property to track these redirect values, and a SPARQL query that uses it can list alternative names for things that have Wikipedia entries.&lt;/p&gt;
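&lt;p&gt;For example, this small query (runnable on DBpedia&amp;rsquo;s snorql form, which predeclares the rdfs prefix) lists the redirect names that point at Grant&amp;rsquo;s entry, &amp;ldquo;Useless S. Grant&amp;rdquo; among them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?altName 
WHERE 
{
  ?alt &amp;lt;http://dbpedia.org/ontology/wikiPageRedirects&amp;gt; 
       &amp;lt;http://dbpedia.org/resource/Ulysses_S._Grant&amp;gt; ;
       rdfs:label ?altName . 
  FILTER ( lang(?altName) = &amp;quot;en&amp;quot; )
}
&lt;/code&gt;&lt;/pre&gt;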
&lt;p&gt;I can use this and one of DBpedia&amp;rsquo;s &lt;a href=&#34;http://en.wikipedia.org/wiki/Wikipedia:Categorization&#34;&gt;Categories&lt;/a&gt; pages to drive a SPARQL query that selects preferred and alternative labels for a group of DBpedia entries at once. If you enter the following query on DBpedia&amp;rsquo;s &lt;a href=&#34;http://dbpedia.org/snorql/&#34;&gt;snorql&lt;/a&gt; form, it will give you a list of the preferred names of all the 19th-century presidents of the United States, as well as other names they might be known by.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?prefLabel ?altLabel 
WHERE 
{
  ?president dcterms:subject 
   &amp;lt;http://dbpedia.org/resource/Category:19th-century_presidents_of_the_United_States&amp;gt; ; 
         rdfs:label ?prefLabel  . 
   ?nickname &amp;lt;http://dbpedia.org/ontology/wikiPageRedirects&amp;gt; ?president ; 
         rdfs:label ?altLabel . 
   FILTER ( lang(?prefLabel) = &amp;quot;en&amp;quot; )
   FILTER ( lang(?altLabel) = &amp;quot;en&amp;quot; )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The variable names I used will give &lt;a href=&#34;http://en.wikipedia.org/wiki/SKOS&#34;&gt;SKOS&lt;/a&gt; fans a clue where I&amp;rsquo;m going with this: the creation of SKOS triples from this data. The following variation on the SELECT query above declares that the URL for each president on the list of 19th century presidents is a &lt;code&gt;skos:Concept&lt;/code&gt;, and it then assigns &lt;code&gt;skos:prefLabel&lt;/code&gt; and &lt;code&gt;skos:altLabel&lt;/code&gt; values based on the same logic used in the query above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CONSTRUCT 
{
  ?pres a skos:Concept;
        skos:prefLabel ?prefLabel ;
        skos:altLabel ?altLabel . 
}
WHERE 
{
  ?pres dcterms:subject 
   &amp;lt;http://dbpedia.org/resource/Category:19th-century_presidents_of_the_United_States&amp;gt; ; 
        rdfs:label ?prefLabel . 
   ?alt &amp;lt;http://dbpedia.org/ontology/wikiPageRedirects&amp;gt; ?pres; rdfs:label ?altLabel . 
   FILTER ( lang(?altLabel) = &amp;quot;en&amp;quot; )
   FILTER ( lang(?prefLabel) = &amp;quot;en&amp;quot; )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When running this query with DBpedia, it creates 300 triples. These include &lt;code&gt;skos:altLabel&lt;/code&gt; values such as &amp;ldquo;The Great Emancipator&amp;rdquo; and &amp;ldquo;Abe Lincoln&amp;rdquo; for Abraham Lincoln (or rather, for the concept &lt;code&gt;http://dbpedia.org/resource/Abraham_Lincoln&lt;/code&gt;, which has a &lt;code&gt;skos:prefLabel&lt;/code&gt; of &amp;ldquo;Abraham Lincoln&amp;rdquo;) as well as popular misspellings such as &amp;ldquo;Abraham Linkin&amp;rdquo; and &amp;ldquo;Presedent Lincon&amp;rdquo;. (If I was going to use this in a production application, I&amp;rsquo;d change the &lt;code&gt;skos:altLabel&lt;/code&gt; values based on misspellings to &lt;a href=&#34;http://www.w3.org/TR/2009/REC-skos-reference-20090818/#L2007&#34;&gt;&lt;code&gt;skos:hiddenLabel&lt;/code&gt;&lt;/a&gt; values.)&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s nice how a single query can pull data from DBpedia to populate a SKOS-based thesaurus with preferred and alternative labels, and it makes a good example of how SPARQL can add value to linked data (in this case, by reshaping the data to conform to a specialized standard).&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.brainwashed.com/brain/brainv08i08.html&#34;&gt;&lt;img id=&#34;id103513&#34; src=&#34;http://www.brainwashed.com/brain/images/ulyssessgrant.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-2800&#34;&gt;February 23, 2011 5:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice one Bob! Be sure and post anything else you might have on (quasi-) extracting domain vocabs from DBpedia. Seems a lot better than making up names/URIs from scratch - not only because there will be linkage already in place, but also it&amp;rsquo;ll save loads of work in looking for synonyms etc.&lt;/p&gt;
&lt;p&gt;But what I really want to know - what is that guitar he&amp;rsquo;s playing!? Does seem to suit his name (and hairstyle).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2801&#34;&gt;February 23, 2011 6:00 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Danny, and thanks! No idea about the guitar; I just found that with some searches.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>What SKOS-XL adds to SKOS</title>
      <link>https://www.bobdc.com/blog/what-skos-xl-adds-to-skos/</link>
      <pubDate>Tue, 08 Feb 2011 07:47:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/what-skos-xl-adds-to-skos/</guid>
      
      
<description><div>Extra flexibility for label metadata.</div><div>&lt;p&gt;In my first few glances at &lt;a href=&#34;http://www.w3.org/TR/skos-reference/skos-xl.html&#34;&gt;SKOS eXtension for Labels&lt;/a&gt;, I didn&amp;rsquo;t quite get it. Recently, though, while looking at a client&amp;rsquo;s requirements document at TopQuadrant, I saw that they wanted to attach metadata to individual terms. I started modeling this in my head and then realized I didn&amp;rsquo;t need to: SKOS-XL already had.&lt;/p&gt;
&lt;blockquote id=&#34;id103319&#34; class=&#34;pullquote&#34;&gt;&#34;Any problem in Computer Science can be solved by another level of indirection.&#34;&lt;/blockquote&gt;
&lt;p&gt;First, why can&amp;rsquo;t you attach metadata to specific terms with the &lt;a href=&#34;http://www.w3.org/TR/2009/REC-skos-reference-20090818/&#34;&gt;base SKOS standard&lt;/a&gt;? Because although SKOS is an ontology for managing controlled vocabularies (and taxonomies, and thesauri), the basic unit of what it manages is not a term, which is what taxonomy management software always managed before, but a concept. This is a Good Thing, because it makes internationalized vocabularies much easier to manage. I can have a single concept with a German preferred label of &amp;ldquo;Spirituosen&amp;rdquo;, a British English preferred label of &amp;ldquo;spirits&amp;rdquo;, an American English preferred label of &amp;ldquo;liquor&amp;rdquo;, and an American alternative label of &amp;ldquo;booze&amp;rdquo;, and they all refer to the same concept. The United Nations Food and Agriculture Organization&amp;rsquo;s &lt;a href=&#34;http://aims.fao.org/website/AGROVOC-Thesaurus/sub&#34;&gt;AGROVOC&lt;/a&gt; thesaurus is a good example of this practice, with dozens of preferred and alternate labels for some concepts.&lt;/p&gt;
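&lt;p&gt;In plain SKOS, all of those labels hang directly off of one concept as language-tagged strings, like this (the concept URI is made up, but the labels are the ones just described):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; .
@prefix :     &amp;lt;http://www.example.com/demo#&amp;gt; .

:concept234 a skos:Concept ;
  skos:prefLabel &amp;quot;Spirituosen&amp;quot;@de ;
  skos:prefLabel &amp;quot;spirits&amp;quot;@en-GB ;
  skos:prefLabel &amp;quot;liquor&amp;quot;@en-US ;
  skos:altLabel  &amp;quot;booze&amp;quot;@en-US .
&lt;/code&gt;&lt;/pre&gt;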
&lt;p&gt;SKOS&amp;rsquo;s extensibility means that you can attach all the metadata you want to a particular concept, but not to one of the terms defined as labels for that concept. This is because, being labels, they&amp;rsquo;re strings. (In spec talk, they&amp;rsquo;re &amp;ldquo;lexical entities&amp;rdquo;, which isn&amp;rsquo;t quite the same thing, but close enough for our purposes.) SKOS is built on RDF, and in RDF triples strings can only be the objects of triples, not the subjects. So how can we assign metadata about the labels themselves, such as the name of the person who added a particular label, or the date it was last updated?&lt;/p&gt;
&lt;p&gt;The Cambridge computer scientist &lt;a href=&#34;http://en.wikipedia.org/wiki/David_Wheeler_(computer_scientist)&#34;&gt;David Wheeler&lt;/a&gt;, who in 1951 became the first person ever to complete a PhD in the field, once said &amp;ldquo;Any problem in Computer Science can be solved by adding another level of indirection&amp;rdquo;. That&amp;rsquo;s what SKOS-XL does: it defines variations on the SKOS &lt;code&gt;skos:prefLabel&lt;/code&gt; and &lt;code&gt;skos:altLabel&lt;/code&gt; properties called &lt;code&gt;skosxl:prefLabel&lt;/code&gt; and &lt;code&gt;skosxl:altLabel&lt;/code&gt; (assuming, as always, that these prefixes have been properly declared). Instead of having strings as their values, these extension properties point to members of the &lt;code&gt;skosxl:Label&lt;/code&gt; class. Members of this class have a &lt;code&gt;skosxl:literalForm&lt;/code&gt; property to identify a string that serves as a label for the concept, and it can have all the additional properties you want.&lt;/p&gt;
&lt;p&gt;The following shows some Turtle syntax for a SKOS-XL representation of the concept described above, with extra &lt;code&gt;:lastEdited&lt;/code&gt; and &lt;code&gt;:myCustomProperty&lt;/code&gt; properties adding metadata to some of the labels:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix skos:   &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; .
@prefix skosxl: &amp;lt;http://www.w3.org/2008/05/skos-xl#&amp;gt; .
@prefix :       &amp;lt;http://www.example.com/demo#&amp;gt; .
@prefix rdf:    &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; . 
@prefix xsd:    &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .


:concept234 rdf:type skos:Concept ;
  skosxl:prefLabel :label1 ;
  skosxl:prefLabel :label2 ;
  skosxl:prefLabel :label3 ;
  skosxl:altLabel  :label4 .


:label1 rdf:type skosxl:Label ; 
  :lastEdited &amp;quot;2011-02-05T10:21:00&amp;quot;^^xsd:dateTime ;
  skosxl:literalForm &amp;quot;Spirituosen&amp;quot;@de .


:label2 rdf:type skosxl:Label ; 
  :lastEdited &amp;quot;2011-02-05T10:28:00&amp;quot;^^xsd:dateTime ;
  :myCustomProperty 2.71828 ;
  skosxl:literalForm &amp;quot;spirits&amp;quot;@en-GB .


:label3 rdf:type skosxl:Label ; 
  :lastEdited &amp;quot;2011-02-05T10:34:00&amp;quot;^^xsd:dateTime ;
  skosxl:literalForm &amp;quot;liquor&amp;quot;@en-US .


:label4 rdf:type skosxl:Label ; 
  :lastEdited &amp;quot;2011-02-05T10:42:00&amp;quot;^^xsd:dateTime ;
  :myCustomProperty 3.1415 ;
  skosxl:literalForm &amp;quot;booze&amp;quot;@en-US .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The general idea is pretty elegant, and having a standardized way to do it prevents me and others from developing our own variations that do the same thing. I&amp;rsquo;m glad I didn&amp;rsquo;t take that model in my head too far.&lt;/p&gt;
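&lt;p&gt;Another benefit of the standardization: when an application only understands base SKOS, a short SPARQL CONSTRUCT query (a sketch, assuming that the skos and skosxl prefixes shown above are declared) can derive the plain SKOS labels from the SKOS-XL ones:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CONSTRUCT
{
  ?concept skos:prefLabel ?literal .
}
WHERE
{
  ?concept skosxl:prefLabel ?label .
  ?label   skosxl:literalForm ?literal .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same pattern with the altLabel properties recovers the alternative labels.&lt;/p&gt;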
&lt;p&gt;How much use of SKOS-XL have people seen in the real world?&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.semanlink.net&#34; title=&#34;http://www.semanlink.net&#34;&gt;fps&lt;/a&gt; on &lt;a href=&#34;#comment-2791&#34;&gt;February 8, 2011 5:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hmm… nothing states that we can deduce the statements involving the standard skos props?&lt;/p&gt;
&lt;p&gt;&amp;ldquo;All problems in computer science can be solved by another level of indirection&amp;hellip; Except for the problem of too many layers of indirection.&amp;rdquo; - or too many indirections to deal with.&lt;/p&gt;
&lt;p&gt;So, couldn&amp;rsquo;t we have a global mechanism that would work for any property whose range is rdfs:Literal ? Something around :&lt;/p&gt;
&lt;p&gt;exlit:ExtendedLiteral a rdfs:Class.&lt;/p&gt;
&lt;p&gt;exlit:extendedLiteral&lt;br /&gt;
a rdfs:Property;&lt;br /&gt;
rdfs:range ExtendedLiteral.&lt;br /&gt;
dc:comment &amp;ldquo;from a statement using this property, one can deduce the statement whose subject is this statement&amp;rsquo;s subject , whose predicate is this statement&amp;rsquo;s object&amp;rsquo;s exlit:property value, and whose object is statement&amp;rsquo;s object&amp;rsquo;s exlit:value .&amp;rdquo;&lt;/p&gt;
&lt;p&gt;exlit:property a rdfs:Property;&lt;br /&gt;
rdfs:domain exlit:ExtendedLiteral;&lt;br /&gt;
rdfs:range rdfs:Property;&lt;br /&gt;
dc:comment &amp;ldquo;the &amp;rsquo;literal property&amp;rsquo;&amp;quot;@en.&lt;/p&gt;
&lt;p&gt;exlit:value a rdfs:Property;&lt;br /&gt;
rdfs:domain exlit:ExtendedLiteral;&lt;br /&gt;
rdfs:range rdfs:Literal;&lt;br /&gt;
dc:comment &amp;ldquo;the literal value&amp;rdquo;@en.&lt;/p&gt;
&lt;p&gt;So, from:&lt;br /&gt;
ex:Me extendedLiteral ex:MyGivenName.&lt;br /&gt;
ex:MyGivenName a ExtendedLiteral;&lt;br /&gt;
exlit:property foaf:givenName;&lt;br /&gt;
exlit:value &amp;ldquo;François-Paul&amp;rdquo;;&lt;br /&gt;
dc:comment &amp;ldquo;My parents first thought of calling me François, like one if my grand fathers, but they resolved to add Paul like the other one, for having them both happy&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;You could deduce ex:me foaf:givenName &amp;ldquo;François-Paul&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Does it make sense ?&lt;/p&gt;
&lt;p&gt;Best&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2792&#34;&gt;February 8, 2011 9:43 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sure, I guess. I&amp;rsquo;m just happy that a way to address this client&amp;rsquo;s needs didn&amp;rsquo;t require new modeling and non-standardized extensions, like I originally thought it would.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.kanzaki.com/&#34; title=&#34;http://www.kanzaki.com/&#34;&gt;masaka&lt;/a&gt; on &lt;a href=&#34;#comment-2793&#34;&gt;February 8, 2011 10:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi,&lt;/p&gt;
&lt;p&gt;NDLSH (National Diet Library Subject Heading) uses SKOS-XL heavily to give &amp;lsquo;yomi&amp;rsquo; (transcription) to each label. See for example,&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://id.ndl.go.jp/auth/ndlsh/00574798.ttl&#34;&gt;http://id.ndl.go.jp/auth/ndlsh/00574798.ttl&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2794&#34;&gt;February 9, 2011 8:55 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Masaka, those are very interesting.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
    </item>
    
    <item>
      <title>More streamlined communication</title>
      <link>https://www.bobdc.com/blog/more-streamlined-communication/</link>
      <pubDate>Thu, 06 Jan 2011 09:49:27 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/more-streamlined-communication/</guid>
      
      
<description><div>Teenagers now, Taylorites in 1910.</div><div>&lt;p&gt;In an &lt;a href=&#34;http://itc.conversationsnetwork.org/shows/detail4757.html&#34;&gt;ITConversations podcast&lt;/a&gt; interview with Tim O&amp;rsquo;Reilly and John Battelle, Mark Zuckerberg describes a recent conversation with a teenage relative of his girlfriend. Those of us with teenage kids know that they consider email a bit old-fashioned, and this girl explained to Zuckerberg why: because it&amp;rsquo;s so slow. He was puzzled, thinking that email is practically instantaneous; why was it slow? Because, the girl replied, it&amp;rsquo;s slow to create a message. You look up someone&amp;rsquo;s email address, you write out a subject line, you start your message with some sort of salutation, then you write it, then you sign off at the end, and so forth.&lt;/p&gt;
&lt;p&gt;Obviously, as The Facebook Guy, Zuckerberg is pretty tuned in to how modern teenagers communicate, and he was telling this story to describe the motivation behind whatever Facebook&amp;rsquo;s latest spin on IM is. The story got me thinking back 100 years, though (or 101, now that it&amp;rsquo;s 2011).&lt;/p&gt;
&lt;p&gt;JoAnne Yates&amp;rsquo; excellent 1989 book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0801846137/bobducharmeA/&#34;&gt;Control through Communication: The Rise of System in American Management&lt;/a&gt; covers part of a topic that I&amp;rsquo;ve been interested in for a while: the change in information management that must have accompanied the industrial revolution. The factories making all those new things had to efficiently keep track of what they made and the parts that went into it if they wanted to make a profit selling those things. Yates&amp;rsquo; book covers several things that could be considered early content management, and Zuckerberg&amp;rsquo;s story reminded me of one part in particular. To quote her book,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Further changes in form were designed to make internal correspondence cheaper and more efficient to type, handle, and file. Writing in 1910 about what he called &amp;ldquo;interhouse correspondence,&amp;rdquo; or correspondence between different locations of a single company, one author recommended several changes in form that would make these documents look less like letters and more like present-day memos. His discussion is worth quoting at length, for it sheds light on the underlying reasons for the changes.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the first place, all unnecessary courtesy, such as &amp;ldquo;Fred Brown &amp;amp; Co.,&amp;rdquo; &amp;ldquo;Gentlemen,&amp;rdquo; &amp;ldquo;yours very truly,&amp;rdquo; and other phrases are omitted entirely. In a business where hundreds and sometimes thousands of interhouse letters are written daily the saving of time is considerable. Next, an expensive letterhead is done away with, and this also is a factor in reducing expense. The blank is made with simply the words, &amp;ldquo;From Chicago,&amp;rdquo; &amp;ldquo;From Atlanta,&amp;rdquo; or whatever may be the name of the town where the letter is written, printed in the upper left-hand corner, and underneath the word, &amp;ldquo;Subject.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;The 1910 quote also recommends that internal letters include a serial number and that one letter replying to another should reference its serial number, or as I prefer to think of it, include a link to its unique ID. The book goes on to describe the origins of the memorandum—in later years, &amp;ldquo;memo&amp;rdquo;—which dispensed with the flowery niceties of traditional 19th-century correspondence because, in communication within a company, efficiency was more important than politeness conventions. Putting the message&amp;rsquo;s subject, date, and sender and recipient&amp;rsquo;s names in what we would now call a fielded metadata header made the information easier to digest, file, and receive. (Elsewhere, the book covers a 1902 recommendation that for easier filing and retrieval a piece of internal correspondence should cover no more than one topic—a century before DITA and over 60 years before Information Mapping.) The name of Frederick Taylor, who Dan Brickley mentioned in his &lt;a href=&#34;http://danbri.org/words/2011/01/01/650&#34;&gt;New Years blog posting&lt;/a&gt;, comes up often in Yates&amp;rsquo; book as a big influence on this thinking in general and Du Pont&amp;rsquo;s operations in particular.&lt;/p&gt;
&lt;p&gt;On the one hand, the way kids skip what they see as extraneous information seems to continue this trend. On the other hand, the things that I like about email that the kids don&amp;rsquo;t care about are the things that the Taylorites developed to help manage that content: clearly marked fields of information to make it easier to archive and retrieve the memos.&lt;/p&gt;
&lt;p&gt;Either way, it&amp;rsquo;s always interesting to look at long-term trends in information management by looking earlier than 1970, which computer scientists typically consider to be the stone age. I&amp;rsquo;d love any suggestions about related reading on the topic of information management during the industrial revolution.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://gromgull.net&#34; title=&#34;http://gromgull.net&#34;&gt;Gunnar&lt;/a&gt; on &lt;a href=&#34;#comment-2758&#34;&gt;January 7, 2011 5:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On the other hand, the things that I like about email that the kids don&amp;rsquo;t care about are the things that the Taylorites developed to help manage that content: clearly marked fields of information to make it easier to archive and retrieve the memos.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Surely they also like them - at least if they ever have to find a piece of communication again at a later stage. However, they are boring to enter manually, because you know the system could easily enter it for you. After you clicked &amp;ldquo;send message&amp;rdquo; on some facebook page only the subject requires minor thought to fill in - and this could be &amp;ldquo;fixed&amp;rdquo; even without being (very) clever by just taking the first line (like Word suggests the filename when you save a new document).&lt;/p&gt;
&lt;p&gt;I.e. what they do not like is Connolly&amp;rsquo;s Bane - probably just like you :)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/words/&#34; title=&#34;http://danbri.org/words/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2765&#34;&gt;January 11, 2011 10:12 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
Taylor would&amp;rsquo;ve loved Amazon&amp;rsquo;s Mechanical Turk. And see&lt;br /&gt;
&lt;a href=&#34;http://behind-the-enemy-lines.blogspot.com/2010/12/excerpts-from-communist-manifesto.html&#34;&gt;http://behind-the-enemy-lines.blogspot.com/2010/12/excerpts-from-communist-manifesto.html&lt;/a&gt; for some rather timely Marx quotes&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2011">2011</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>What REST is really about</title>
      <link>https://www.bobdc.com/blog/what-rest-is-really-about/</link>
      <pubDate>Fri, 19 Nov 2010 08:57:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/what-rest-is-really-about/</guid>
      
      
      <description><div>According to the primary source document.</div><div>&lt;p&gt;I had thought that &amp;ldquo;RESTful&amp;rdquo; meant &amp;ldquo;easily accessible with an HTTP GET, even when something isn&amp;rsquo;t HTML&amp;rdquo;. Shortly after a RESTafarian &lt;a href=&#34;https://www.bobdc.com/blog/restful-sparql-queries-of-rdfa#comment-2528&#34;&gt;pointed out&lt;/a&gt; that there was more to it than that, I went to Brian Sletten&amp;rsquo;s excellent presentation &lt;a href=&#34;http://semtech2010.semanticuniverse.com/sessionPop.cfm?confid=42&amp;amp;proposalid=3065&#34;&gt;REST: Information Architecture for the 21st Century&lt;/a&gt; at the Semantic Technologies conference and I learned a lot more about what being RESTful implies. During the presentation I asked Brian whether &lt;a href=&#34;http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm&#34;&gt;Roy Fielding&amp;rsquo;s 2000 doctoral thesis&lt;/a&gt; that originally laid out what REST was all about was readable, for a PhD thesis, and he assured me that it was.&lt;/p&gt;
&lt;blockquote id=&#34;id103359&#34; class=&#34;pullquote&#34;&gt;Anyone with a basic understanding of software architecture issues can and should read Fielding&#39;s thesis.&lt;/blockquote&gt;
&lt;p&gt;He was right. Anyone with a basic understanding of software architecture issues can and should read Fielding&amp;rsquo;s thesis. I wish I&amp;rsquo;d read it years ago. I&amp;rsquo;ve copied a few nice quotes here, starting with this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Software architecture research investigates methods for determining how best to partition a system, how components identify and communicate with each other, how information is communicated, how elements of a system can evolve independently, and how all of the above can be described using formal and informal notations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What the acronym &amp;ldquo;Representational State Transfer&amp;rdquo; really means (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;REST components perform actions on a resource by using a &lt;strong&gt;representation&lt;/strong&gt; to capture the current or intended &lt;strong&gt;state&lt;/strong&gt; of that resource and &lt;strong&gt;transferring&lt;/strong&gt; that representation between components. A representation is a sequence of bytes, plus representation metadata to describe those bytes. Other commonly used but less precise names for a representation include: document, file, and HTTP message entity, instance, or variant.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A component can select from a choice of representations, and I now better appreciate the important role that content negotiation plays in REST:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This abstract definition of a resource&amp;hellip; provides generality by encompassing many sources of information without artificially distinguishing them by type or implementation [and] allows late binding of the reference to a representation, enabling content negotiation to take place based on characteristics of the request. Finally, it allows an author to reference the concept rather than some singular representation of that concept, thus removing the need to change all existing links whenever the representation changes (assuming the author used the right identifier).&lt;/p&gt;
&lt;/blockquote&gt;
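&lt;p&gt;For example (a sketch with a made-up URL; real servers vary in the formats they offer), two requests can negotiate different representations of the same resource just by varying the Accept header:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /reports/2010-q3 HTTP/1.1
Host: www.example.com
Accept: text/html

GET /reports/2010-q3 HTTP/1.1
Host: www.example.com
Accept: application/rdf+xml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first request asks for a hypertext representation and the second for a data-oriented one, but both use the same identifier, so links to the resource never need to change.&lt;/p&gt;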
&lt;p&gt;With all the talk now of REST interfaces to services that are not necessarily delivering hypertext documents, it&amp;rsquo;s interesting how often the thesis talks about REST being designed around hypermedia. The thesis&amp;rsquo;s introduction refers to it as &amp;ldquo;REST, a novel architectural style for distributed hypermedia systems,&amp;rdquo; and also mentions this,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;REST is defined by four interface constraints: identification of resources; manipulation of resources through representations; self-descriptive messages; and, hypermedia as the engine of application state.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;and this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;REST was originally referred to as the &amp;ldquo;HTTP object model,&amp;rdquo; but that name would often lead to misinterpretation of it as the implementation model of an HTTP server. The name &amp;ldquo;Representational State Transfer&amp;rdquo; is intended to evoke an image of how a well-designed Web application behaves: a network of web pages (a virtual state-machine), where the user progresses through the application by selecting links (state transitions), resulting in the next page (representing the next state of the application) being transferred to the user and rendered for their use.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&amp;ldquo;Resource&amp;rdquo; is a pretty commonly used term, with its position as the &amp;ldquo;R&amp;rdquo; in &amp;ldquo;RDF&amp;rdquo; being only the tip of the iceberg. So what exactly is a resource?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The resource is not the storage object. The resource is not a mechanism that the server uses to handle the storage object. The resource is a conceptual mapping—the server receives the identifier (which identifies the mapping) and applies it to its current mapping implementation (usually a combination of collection-specific deep tree traversal and/or hash tables) to find the currently responsible handler implementation and the handler implementation then selects the appropriate action+response based on the request content. All of these implementation-specific issues are hidden behind the Web interface; their nature cannot be assumed by a client that only has access through the Web interface.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This note on MIME&amp;rsquo;s relationship to HTTP was interesting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;HTTP inherited its message syntax from MIME in order to retain commonality with other Internet protocols and reuse many of the standardized fields for describing media types in messages. Unfortunately, MIME and HTTP have very different goals, and the syntax is only designed for MIME&amp;rsquo;s goals.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why shouldn&amp;rsquo;t you treat HTTP as a way to do Remote Procedure Calls? (And what&amp;rsquo;s my new favorite adjective to put in front of &amp;ldquo;scalable&amp;rdquo;?)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What makes HTTP significantly different from RPC is that the requests are directed to resources using a generic interface with standard semantics that can be interpreted by intermediaries almost as well as by the machines that originate services. The result is an application that allows for layers of transformation and indirection that are independent of the information origin, which is very useful for an Internet-scale, multi-organization, anarchically scalable information system. RPC mechanisms, in contrast, are defined in terms of language APIs, not network-based applications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;More on his carefully chosen terms &amp;ldquo;representation&amp;rdquo; and &amp;ldquo;transfer&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;HTTP is not designed to be a transport protocol. It is a transfer protocol in which the messages reflect the semantics of the Web architecture by performing actions on resources through the transfer and manipulation of representations of those resources. It is possible to achieve a wide range of functionality using this very simple interface, but following the interface is required in order for HTTP semantics to remain visible to intermediaries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Keep in mind that this was published ten years ago, about a century in Internet time. It&amp;rsquo;s more relevant than ever, and I recommend that you put it high on your reading list.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://twitter.com/webr3&#34; title=&#34;http://twitter.com/webr3&#34;&gt;Nathan&lt;/a&gt; on &lt;a href=&#34;#comment-2711&#34;&gt;November 20, 2010 6:41 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Exactly Bob, if I had a company, the first tasks for any new starts would be to:&lt;/p&gt;
&lt;p&gt;1) Read the /full/ REST dissertation (and chapter 5&amp;amp;6 twice!).&lt;br /&gt;
2) Read the original design for the world wide web.&lt;br /&gt;
3) Read the early HTTP and HTML specs, and also as many Design Issues as possible.&lt;/p&gt;
&lt;p&gt;Regardless of whether they&amp;rsquo;re a junior developer or a time served senior architect.&lt;/p&gt;
&lt;p&gt;The only point I will add is to remember that each RFC and specification has its own definition of &amp;ldquo;resource&amp;rdquo;, with slight differences throughout; as TimBL recently pointed out, it&amp;rsquo;s not a universal term with universal meaning across all specs.&lt;/p&gt;
&lt;p&gt;Best,&lt;/p&gt;
&lt;p&gt;Nathan&lt;/p&gt;
&lt;p&gt;By Randy on &lt;a href=&#34;#comment-2712&#34;&gt;November 23, 2010 7:13 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you want to see how this line of thinking can be applied to software development in the smallest units of functionality, look at NetKernel (&lt;a href=&#34;http://www.1060research.com/netkernel/)&#34;&gt;http://www.1060research.com/netkernel/)&lt;/a&gt;. That software platform is based on a REST microkernel and allows you to build all of your software this way.&lt;/p&gt;
&lt;p&gt;&amp;ndash; Randy&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2713&#34;&gt;November 23, 2010 8:41 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As a matter of fact, in Brian&amp;rsquo;s talk he discussed NetKernel a lot. It looks pretty cool.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Playing more with SPARQL 1.1 property paths</title>
      <link>https://www.bobdc.com/blog/playing-more-with-sparql-11-pr/</link>
      <pubDate>Fri, 15 Oct 2010 08:58:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/playing-more-with-sparql-11-pr/</guid>
      
      
      <description><div>Some fun new features.</div><div>&lt;p&gt;I recently wrote about &lt;a href=&#34;https://www.bobdc.com/blog/trying-sparql-11-new-query-fea&#34;&gt;trying SPARQL 1.1 new query features with ARQ&lt;/a&gt;, and one thing I briefly tried was the new property paths feature. At the time, the query spec only had a placeholder for property paths, but the new version of it released yesterday has a &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20101014/#propertypaths&#34;&gt;detailed section&lt;/a&gt; on property paths with plenty of examples.&lt;/p&gt;
&lt;p&gt;I had seen the &lt;a href=&#34;http://www.w3.org/2009/sparql/docs/property-paths/Overview.xml&#34;&gt;separate document&lt;/a&gt; where this material was first drafted and tried out its examples (except for the &amp;ldquo;Subproperty&amp;rdquo; and &amp;ldquo;Elements in an RDF collection&amp;rdquo; ones), and they all worked fine with ARQ &lt;a href=&#34;http://sourceforge.net/projects/jena/files/ARQ/ARQ-2.8.5/&#34;&gt;2.8.5&lt;/a&gt;. If you want to try them yourself, &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/proppaths.zip&#34;&gt;this zip file&lt;/a&gt; has the sample data file that I mocked up and the 12 query files. (Thanks to Andy Seaborne for helping me straighten out my data and some of my tests.)&lt;/p&gt;
&lt;p&gt;They gave me more and more ideas for interesting queries that I can do with very little SPARQL code—for example, how to get a subtree of a hierarchy, or how to find nodes that have the same connection to the same nodes that a particular node has (for example, who likes the same bands that John likes, or who has the same friends that Jane has).&lt;/p&gt;
&lt;p&gt;If you didn&amp;rsquo;t see the separate property paths draft document and you&amp;rsquo;re interested in SPARQL, it&amp;rsquo;s definitely worth skimming &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20101014/#propertypaths&#34;&gt;section 9&lt;/a&gt; of the new query spec draft. There&amp;rsquo;s a lot of neat stuff there.&lt;/p&gt;
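&lt;p&gt;To give the flavor of that &amp;ldquo;who likes the same bands that John likes&amp;rdquo; idea, here&amp;rsquo;s an untested sketch (using a made-up ex: namespace, not one of the queries in the zip file above) that chains a forward path step with an inverse one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX ex: &amp;lt;http://www.example.org/ns#&amp;gt;
SELECT DISTINCT ?person
WHERE {
  # follow ex:likes from John to a band, then back from that band
  ex:John ex:likes/^ex:likes ?person .
  FILTER (?person != ex:John)
}
&lt;/code&gt;&lt;/pre&gt;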
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/words/&#34; title=&#34;http://danbri.org/words/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2654&#34;&gt;October 15, 2010 1:12 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Re hierarchies, I&amp;rsquo;ve not looked into this yet properly,&amp;hellip; but perhaps it is then a good fit for SKOS data, which builds hierarchies with skos:broader links?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt;{.commenter-profile} on &lt;a href=&#34;#comment-2655&#34;&gt;October 15, 2010 1:17 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Dan,&lt;/p&gt;
&lt;p&gt;Hell yeah! I&amp;rsquo;ve already used it for that on work-related projects, where I&amp;rsquo;ve been getting much deeper into SKOS.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://scardf.org&#34; title=&#34;http://scardf.org&#34;&gt;Hrvoje Simic&lt;/a&gt; on &lt;a href=&#34;#comment-2662&#34;&gt;October 27, 2010 11:39 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When you say &amp;ldquo;how to get a subtree of a hierarchy&amp;rdquo;, do you mean something like in the example:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;?ancestor (ex:motherOf|ex:fatherOf)+&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;or did you have something more complex in mind, like extracting a subgraph (CONSTRUCT form)?&lt;/p&gt;
&lt;p&gt;Hrvoje&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt;{.commenter-profile} on &lt;a href=&#34;#comment-2663&#34;&gt;October 27, 2010 12:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Something in between, I suppose, but I don&amp;rsquo;t think that a CONSTRUCT form would be that complex. See &lt;a href=&#34;http://lists.w3.org/Archives/Public/public-esw-thes/2010Oct/0015.html&#34;&gt;http://lists.w3.org/Archives/Public/public-esw-thes/2010Oct/0015.html&lt;/a&gt; for a few SKOS examples that I wrote out. A CONSTRUCT version of example 1 there would be something like&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CONSTRUCT { ?c ?p ?o }
WHERE {
   ?c skos:broader* i:VariableStars .
   ?c ?p ?o .
}
&lt;/code&gt;&lt;/pre&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Integrate disparate data sources with Semantic Web technology</title>
      <link>https://www.bobdc.com/blog/integrate-disparate-data-sourc/</link>
      <pubDate>Thu, 30 Sep 2010 08:34:47 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/integrate-disparate-data-sourc/</guid>
      
      
      <description><div>A new developerWorks article.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.ibm.com/developerworks/library/x-disprdf/index.html&#34;&gt;&lt;img id=&#34;id103324&#34; src=&#34;https://www.bobdc.com/img/main/dw-home2.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;developerWorks logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve given a presentation to both the &lt;a href=&#34;http://www.meetup.com/semweb-25/&#34;&gt;New York&lt;/a&gt; and &lt;a href=&#34;http://www.meetup.com/semweb-31/&#34;&gt;Washington D.C.&lt;/a&gt; semweb meetups about how useful semantic web technology can be even if your data isn&amp;rsquo;t stored as RDF. I showed a little app that pulls (fake) buy/sell/hold recommendations from an Excel spreadsheet, the latest stock quotes, and DBpedia data about the relevant companies from the appropriate sources, converts this data to RDF as necessary, and then combines it all into a nice-looking HTML report.&lt;/p&gt;
&lt;p&gt;The general architecture is more important than the specific implementation, and to make this clearer I implemented it with all free software and then again with TopQuadrant&amp;rsquo;s TopBraid Composer. I wrote an article about the free implementation that has just gone up on IBM developerWorks as &lt;a href=&#34;http://www.ibm.com/developerworks/library/x-disprdf/index.html&#34;&gt;Integrate disparate data sources with Semantic Web technology&lt;/a&gt;. It&amp;rsquo;s even a &lt;a href=&#34;http://www.ibm.com/developerworks/&#34;&gt;featured article&lt;/a&gt; for a few days.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.oilit.com&#34; title=&#34;http://www.oilit.com&#34;&gt;Neil McNaughton&lt;/a&gt; on &lt;a href=&#34;#comment-2644&#34;&gt;October 5, 2010 2:34 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A great article. Just one thing, I would have liked to see a few lines of the RDF generated from Excel - to see what is being captured and processed.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt;{.commenter-profile} on &lt;a href=&#34;#comment-2647&#34;&gt;October 5, 2010 9:46 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Neil,&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s pretty straightforward, e.g.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;rdf:RDF
    xmlns:anrecs=&#39;http://www.snee.com/ns/analystRatings#&#39;
    xmlns:rdf=&#39;http://www.w3.org/1999/02/22-rdf-syntax-ns#&#39;
    xml:base=&#39;http://www.snee.com/ns/analystRatings#&#39;&amp;gt;

    &amp;lt;rdf:Description rdf:about=&#39;anrecs:1&#39;&amp;gt;
      &amp;lt;anrecs:analyst&amp;gt;Nick Perkins&amp;lt;/anrecs:analyst&amp;gt;
      &amp;lt;anrecs:tickersymbol&amp;gt;CAT&amp;lt;/anrecs:tickersymbol&amp;gt;
      &amp;lt;anrecs:company&amp;gt;Caterpillar Inc.&amp;lt;/anrecs:company&amp;gt;
      &amp;lt;anrecs:recommendation&amp;gt;SELL&amp;lt;/anrecs:recommendation&amp;gt;
      &amp;lt;anrecs:date-time&amp;gt;2010-07-14T13:36:00&amp;lt;/anrecs:date-time&amp;gt;
      &amp;lt;anrecs:description&amp;gt;Caterpillar has had an interesting quarter. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent sed lectus augue. Suspendisse nisl nisl, pulvinar eu luctus non, sodales non magna. Sed in metus arcu, sit amet ornare nunc. Duis fermentum, nibh quis fermentum sagittis, mi eros porttitor magna, sed dictum tortor quam ut lectus. Praesent eu est augue.&amp;lt;/anrecs:description&amp;gt;
    &amp;lt;/rdf:Description&amp;gt;

  &amp;lt;!-- etc. --&amp;gt;
  &amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Les Frères DuCharmes</title>
      <link>https://www.bobdc.com/blog/les-freres-ducharmes/</link>
      <pubDate>Tue, 28 Sep 2010 09:15:06 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/les-freres-ducharmes/</guid>
      
      
      <description><div>Making music with dodgy electronics, in a damp basement, in 1983.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.mcylinder.com/happenings/les-freres-du-charmes/&#34;&gt;&lt;img id=&#34;id103326&#34; src=&#34;http://www.mcylinder.com/wp/wp-content/uploads/2010/08/reel-238x300.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; width=&#34;80px&#34; vspace=&#34;30px&#34; alt=&#34;tape deck graphic&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My brother Peter wrote a &lt;a href=&#34;http://www.mcylinder.com/happenings/les-freres-du-charmes/&#34;&gt;blog entry&lt;/a&gt; on some musical experiments that he and I did together 27 years ago. I&amp;rsquo;m confident that James &amp;ldquo;LCD Soundsystem&amp;rdquo; Murphy never heard our &amp;ldquo;Stop This Crazy Thing&amp;rdquo;, but his &lt;a href=&#34;http://www.youtube.com/watch?v=OzuFeXYbOOo&#34;&gt;Losing My Edge&lt;/a&gt; 19 years later might give that impression. I guess any fan of &lt;a href=&#34;http://en.wikipedia.org/wiki/99_Records&#34;&gt;99 Records&lt;/a&gt; bands who owned a &lt;a href=&#34;http://www.sonicstate.com/synth/casio_pt-20/&#34;&gt;Casio PT-20&lt;/a&gt; might have done something similar.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Fallback with SPARQL</title>
      <link>https://www.bobdc.com/blog/fallback-with-sparql/</link>
      <pubDate>Wed, 22 Sep 2010 09:09:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/fallback-with-sparql/</guid>
      
      
      <description><div>&#34;Use this term if available, else fall back to that one&#34;.</div><div>&lt;p&gt;Last April Richard Cyganiak &lt;a href=&#34;http://twitter.com/cygri/status/11896004026&#34;&gt;tweeted&lt;/a&gt; the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;@iand @ldodds &amp;ldquo;use this term if available, else fall back to that one&amp;rdquo; is common when consuming RDF, not well supported by SPARQL or RDFS&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I took this as a challenge (if not as a very pressing one, if I waited this long to follow through). I managed to write a SPARQL query that reads the following data and sets &lt;code&gt;?label&lt;/code&gt; to the &lt;code&gt;skos:prefLabel&lt;/code&gt; value if it&amp;rsquo;s available and otherwise to the &lt;code&gt;rdfs:label&lt;/code&gt; value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix : &amp;lt;http://rdfdata.org/whatever#&amp;gt; .

:thing1 rdfs:label &amp;quot;Robert&amp;quot;; skos:prefLabel &amp;quot;Bob&amp;quot; .
:thing2 rdfs:label &amp;quot;Jane&amp;quot;.
:thing3 skos:prefLabel &amp;quot;Frank&amp;quot;.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here&amp;rsquo;s the output, using &lt;a href=&#34;http://jena.sourceforge.net/ARQ/&#34;&gt;ARQ&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-----------
| label   |
===========
| &amp;quot;Frank&amp;quot; |
| &amp;quot;Bob&amp;quot;   |
| &amp;quot;Jane&amp;quot;  |
-----------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here&amp;rsquo;s a SPARQL 1.0 version of the query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT ?label    # Bind ?label to
WHERE {
  {              # skos:prefLabel if available
    ?s skos:prefLabel ?label .
  }
  UNION          # and rdfs:label if not.
  {
   ?s rdfs:label ?label .
   OPTIONAL { ?s skos:prefLabel ?prefLabel .}
   FILTER (!bound(?prefLabel)) .
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It asks for the union of any &lt;code&gt;skos:prefLabel&lt;/code&gt; values and any &lt;code&gt;rdfs:label&lt;/code&gt; values but to filter out any of the latter that have a &lt;code&gt;skos:prefLabel&lt;/code&gt; property for the same subject. The query is verbose, and the FILTER(!bound()) trick is non-intuitive enough to have inspired two nicer substitutes in SPARQL 1.1: MINUS and FILTER NOT EXISTS. Here&amp;rsquo;s the query with MINUS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT ?label    # Bind ?label to
WHERE {
  {              # skos:prefLabel if available
    ?s skos:prefLabel ?label .
  }
  UNION          # and rdfs:label if not.
  {
   ?s rdfs:label ?label .
   MINUS { ?s skos:prefLabel ?prefLabel }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You could substitute FILTER NOT EXISTS for MINUS there, and it would work the same way with a SPARQL engine that implements 1.1 &lt;a href=&#34;https://www.bobdc.com/blog/trying-sparql-11-new-query-fea&#34;&gt;such as ARQ&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s one less line than the SPARQL 1.0 version, and a bit easier to read, but it&amp;rsquo;s still a verbose way to assign &lt;code&gt;skos:prefLabel&lt;/code&gt; to &lt;code&gt;?label&lt;/code&gt; if it&amp;rsquo;s available and otherwise &lt;code&gt;rdfs:label&lt;/code&gt;. The important thing, though, is that it can be done with standard SPARQL, and that it&amp;rsquo;s a little easier with 1.1.&lt;/p&gt;
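&lt;p&gt;Another SPARQL 1.1 possibility, which I haven&amp;rsquo;t tested, so treat it as a sketch: bind each label to its own variable with OPTIONAL and let the new COALESCE function pick the first one that&amp;rsquo;s bound. Because the two OPTIONAL patterns bind different variables, nothing here depends on their order:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT DISTINCT ?s (COALESCE(?preferred, ?plain) AS ?label)
WHERE {
  ?s ?anyProperty ?anyValue .
  OPTIONAL { ?s skos:prefLabel ?preferred }
  OPTIONAL { ?s rdfs:label ?plain }
}
&lt;/code&gt;&lt;/pre&gt;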
&lt;p&gt;Can you improve on this query at all?&lt;/p&gt;
&lt;h2 id=&#34;8-comments&#34;&gt;8 Comments&lt;/h2&gt;
&lt;p&gt;By Damian on &lt;a href=&#34;#comment-2626&#34;&gt;September 22, 2010 10:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[Argh, captcha and validation are killing me]&lt;/p&gt;
&lt;p&gt;What about:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
?s ?p ?o . # or limit to a type
OPTIONAL { ?s skos:prefLabel ?label . }
OPTIONAL { ?s rdfs:label ?label . }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;COALESCE would be nicer, however.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2627&#34;&gt;September 22, 2010 11:23 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Seems unfortunate that you have to repeat a whole pattern to get this to work, as the pattern you want in a real-world case could be substantially more complicated than this one. Is there a way to get both labels and then LIMIT 1, inside a subquery?&lt;/p&gt;
&lt;p&gt;[In Thread that would be &amp;ldquo;Subject|(.prefLabel,Label:#1)&amp;rdquo;, although there&amp;rsquo;s also a built-in &amp;ldquo;otherwise&amp;rdquo; feature so this could be just &amp;ldquo;Subject|(.prefLabel;Label)&amp;rdquo;.]&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt;{.commenter-profile} on &lt;a href=&#34;#comment-2632&#34;&gt;September 22, 2010 12:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Damian,&lt;/p&gt;
&lt;p&gt;With ARQ, that gave me &amp;ldquo;Bob&amp;rdquo; twice, so I added ?s and ?p to the select statement and got this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-------------------------------------------------------------------
| s                                    | p              | label   |
===================================================================
| &amp;lt;http://rdfdata.org/whatever#thing3&amp;gt; | skos:prefLabel | &amp;quot;Frank&amp;quot; |
| &amp;lt;http://rdfdata.org/whatever#thing2&amp;gt; | rdfs:label     | &amp;quot;Jane&amp;quot;  |
| &amp;lt;http://rdfdata.org/whatever#thing1&amp;gt; | skos:prefLabel | &amp;quot;Bob&amp;quot;   |
| &amp;lt;http://rdfdata.org/whatever#thing1&amp;gt; | rdfs:label     | &amp;quot;Bob&amp;quot;   |
-------------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&amp;rsquo;s close to what I was looking for, but obviously there&amp;rsquo;s a problem&amp;ndash;I think rdfs:label bound ?label to &amp;ldquo;Robert&amp;rdquo; and then it got overwritten with &amp;ldquo;Bob&amp;rdquo; so that there are the two &amp;ldquo;Bob&amp;rdquo; results.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thefigtrees.net/lee/blog&#34; title=&#34;http://thefigtrees.net/lee/blog&#34;&gt;Lee Feigenbaum&lt;/a&gt; on &lt;a href=&#34;#comment-2633&#34;&gt;September 22, 2010 3:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Damian&amp;rsquo;s way is the standard way to do it. The only reason you&amp;rsquo;re getting the duplicates is because you&amp;rsquo;re selecting out the predicate as well and selecting for ?s ?p ?o before the optional.&lt;/p&gt;
&lt;p&gt;if you did:&lt;/p&gt;
&lt;p&gt;?s rdf:type &amp;lt;whatever&amp;gt;&lt;/p&gt;
&lt;p&gt;followed by the optionals, you&amp;rsquo;d get a single result for each as expected.&lt;/p&gt;
&lt;p&gt;By Damian on &lt;a href=&#34;#comment-2634&#34;&gt;September 22, 2010 3:42 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;The explanation for that is nothing to do with my trick, but rather the ?s ?p ?o business before it.&lt;/p&gt;
&lt;p&gt;For this trick to work you need ?s to be bound, so (for demo purposes) I added ?s ?p ?o. What you&amp;rsquo;re seeing is each triple, plus the (correct) label given the subject. There are two triples with thing1 as a subject, hence &amp;ldquo;Bob&amp;rdquo; is returned twice.&lt;/p&gt;
&lt;p&gt;If you add types to the subjects you can try:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select ?s ?label
{
?s a skos:Concept .
OPTIONAL { ?s skos:prefLabel ?label . }
OPTIONAL { ?s rdfs:label ?label . }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which gives the expected answer.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt;{.commenter-profile} on &lt;a href=&#34;#comment-2635&#34;&gt;September 22, 2010 3:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Lee,&lt;/p&gt;
&lt;p&gt;That works if I assign rdf:type values to each of the resources in the data file. I assume there&amp;rsquo;s no other way to do it with the data as shown?&lt;/p&gt;
&lt;p&gt;Also, if I do it like this (with the rdfs:label pattern first),&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OPTIONAL { ?s rdfs:label ?label . }
OPTIONAL { ?s skos:prefLabel ?label . }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;?label gets bound to &amp;ldquo;Robert&amp;rdquo;, not &amp;ldquo;Bob&amp;rdquo;, I assume because it was looking for an rdfs:label value first. I didn&amp;rsquo;t realize that the order could be used to control things this way. I just looked through section 6 of the 1.1. Query spec and didn&amp;rsquo;t see anything about this; where can I find something in the spec about the effect of ordering the OPTIONAL clauses?&lt;/p&gt;
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thefigtrees.net/lee/blog&#34; title=&#34;http://thefigtrees.net/lee/blog&#34;&gt;Lee Feigenbaum&lt;/a&gt; on &lt;a href=&#34;#comment-2636&#34;&gt;September 22, 2010 4:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The order dependence of OPTIONAL clauses is an artifact of the semantics of OPTIONAL (LeftJoin in the algebra).&lt;/p&gt;
&lt;p&gt;Lee&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt;{.commenter-profile} on &lt;a href=&#34;#comment-2637&#34;&gt;September 22, 2010 10:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And now I see, Lee, that your &lt;a href=&#34;http://www.thefigtrees.net/lee/sw/sparql-faq#alternative-predicates&#34;&gt;General SPARQL Discussion&lt;/a&gt; and &lt;a href=&#34;http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_using_spa.html&#34;&gt;this blog post&lt;/a&gt; that you wrote covered the very issue I was wondering about long ago!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Semantic technology: more than the tools</title>
      <link>https://www.bobdc.com/blog/semantic-technology-more-than/</link>
      <pubDate>Thu, 16 Sep 2010 09:42:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-technology-more-than/</guid>
      
      
      <description><div>Nice tools, though.</div><div>&lt;p&gt;At one point in the &lt;a href=&#34;http://xmlsummerschool.com/curriculum-2010/semantic-technologies-2010/&#34;&gt;semantic technologies&lt;/a&gt; track of last week&amp;rsquo;s &lt;a href=&#34;http://xmlsummerschool.com&#34;&gt;XML Summer School&lt;/a&gt;, I showed a little application I wrote where you enter the names of two film directors on a &lt;a href=&#34;http://www.snee.com/sparqlforms/commonActors.html&#34;&gt;form&lt;/a&gt;, click the search button, and then see a list of all actors who&amp;rsquo;ve been in movies by both directors. The form calls a CGI script that creates a short SPARQL query, runs it, and generates an HTML page of the results. You can read more about it in the developerWorks article &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-wikiquery/?ca=dgr-lnxw82%20SPARQL-DBpedia&amp;amp;S_TACT=105AGX59&amp;amp;S_CMP=grlnxw82&#34;&gt;Build Wikipedia query forms with semantic technology&lt;/a&gt;; this particular form doesn&amp;rsquo;t use Wikipedia data but &lt;a href=&#34;http://www.imdb.com&#34;&gt;IMDB&lt;/a&gt; data from the &lt;a href=&#34;http://www.linkedmdb.org/&#34;&gt;Linked Movie Database&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote id=&#34;id103360&#34; class=&#34;pullquote&#34;&gt;The semantic web is really about the combination of tools and data.&lt;/blockquote&gt;
&lt;p&gt;The point of the demonstration was that semantic technology isn&amp;rsquo;t about everyone learning SPARQL, but about SPARQL becoming another technology to put behind interfaces such as web forms, just like JavaScript and other scripting languages. After I demoed the form, &lt;a href=&#34;http://en.wikipedia.org/wiki/Michael_Kay_(software_engineer)&#34;&gt;Michael Kay&lt;/a&gt; (famous for XSLT and XQuery work in general and the &lt;a href=&#34;http://saxon.sourceforge.net/&#34;&gt;Saxon&lt;/a&gt; processor in particular) told me that if the data had been available in XML he could have done the same thing with XQuery, and he wondered what exactly SPARQL added. I don&amp;rsquo;t remember exactly what I said, but with excellent hindsight I thought of a much better answer the following day.&lt;/p&gt;
&lt;p&gt;Michael and several speakers from other tracks had come to the semantic web track with a reasonable question on their minds: what does this set of tools offer them that other sets of tools don&amp;rsquo;t? A flippant response to his question about XQuery would be &amp;ldquo;if the data had been available in XML? That&amp;rsquo;s an awfully big if!&amp;rdquo; A better answer would be that while some people focus on the tools, and others on the (linked) data, the semantic web is really about the combination of tools and data. If IMDB offered an SQL interface to their data, I could use that to list all the actors who&amp;rsquo;ve worked with two particular directors, but that too is a very big if. A SPARQL endpoint seems to be the most popular way to expose machine-readable data these days, even if the underlying data is relational. The Linked Movie Database offers the data in a SPARQL endpoint, so SPARQL is the query language for retrieving this information, and the sending of the query and the display of the results were easy with a CGI script.&lt;/p&gt;
&lt;p&gt;So the answer to the question &amp;ldquo;what does this set of tools offer that other sets of tools don&amp;rsquo;t&amp;rdquo; is this: lots of great data to query, not to mention easy ways to convert other data formats for use by these tools. If you want to combine this data from multiple sources from across the web and from behind your own firewall, the ease with which the underlying data model lets you aggregate data is another big advantage. So, tools plus data plus ease of aggregation are what the semantic web has to offer, and that&amp;rsquo;s quite a lot.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://twitter.com/vpenela&#34; title=&#34;http://twitter.com/vpenela&#34;&gt;Víctor Penela&lt;/a&gt; on &lt;a href=&#34;#comment-2623&#34;&gt;September 16, 2010 12:21 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A similar application from a friend of mine: &lt;a href=&#34;http://10k.aneventapart.com/Uploads/310/&#34;&gt;http://10k.aneventapart.com/Uploads/310/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Totally agree on the need for more semantically aware apps, where the goal (the application) is the key, and not the means (the semantic technologies).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Trying SPARQL 1.1 new query features with ARQ</title>
      <link>https://www.bobdc.com/blog/trying-sparql-11-new-query-fea/</link>
      <pubDate>Wed, 18 Aug 2010 09:27:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/trying-sparql-11-new-query-fea/</guid>
      
      
      <description><div>Just about all there.</div><div>&lt;p&gt;When I learned that &lt;a href=&#34;http://tech.groups.yahoo.com/group/jena-dev/message/44671&#34;&gt;release 2.8.5 of ARQ&lt;/a&gt; implements all of SPARQL 1.1 Query (&amp;ldquo;except for corner cases of property paths&amp;rdquo;, and Andy Seaborne recently told me that they&amp;rsquo;ve finished up that part) I decided to try out some of the SPARQL 1.1 features, and it was all pretty easy. I used &lt;a href=&#34;https://www.bobdc.com/blog/using-the-arq-sparql-processor&#34;&gt;ARQ from the command line&lt;/a&gt; and went through &lt;a href=&#34;http://www.slideshare.net/LeeFeigenbaum/sparql2-status&#34;&gt;Lee Feigenbaum&amp;rsquo;s slides on the status of SPARQL 1.1&lt;/a&gt; as a checklist of things to try. For sample queries and data I tried to use the examples in the &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/&#34;&gt;SPARQL 1.1&lt;/a&gt; spec wherever possible, but sometimes expanded on them a bit.&lt;/p&gt;
&lt;h2 id=&#34;id103344&#34;&gt;Projected expressions&lt;/h2&gt;
&lt;p&gt;When I tried &lt;a href=&#34;http://www.slideshare.net/LeeFeigenbaum/sparql2-status/6&#34;&gt;projected expressions&lt;/a&gt; using the &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/#CreatingValuesWithExpressions&#34;&gt;spec&amp;rsquo;s example&lt;/a&gt; I got an error because of the sample query&amp;rsquo;s use of the fn:concat function, but when I added the following prefix declaration to the query it worked with no problem:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX fn: &amp;lt;http://www.w3.org/2005/xpath-functions#&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
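&lt;p&gt;With that prefix added, the query looked roughly like this (my reconstruction of the draft&amp;rsquo;s example, so treat the details as approximate):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
PREFIX fn: &amp;lt;http://www.w3.org/2005/xpath-functions#&amp;gt;

SELECT (fn:concat(?G, &amp;quot; &amp;quot;, ?S) AS ?name)
WHERE { ?P foaf:givenName ?G ; foaf:surname ?S }
&lt;/code&gt;&lt;/pre&gt;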
&lt;h2 id=&#34;id103378&#34;&gt;Aggregates&lt;/h2&gt;
&lt;p&gt;To test &lt;a href=&#34;http://www.slideshare.net/LeeFeigenbaum/sparql2-status/7&#34;&gt;aggregates&lt;/a&gt;, the spec&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/#aggregateExample&#34;&gt;example&lt;/a&gt; works, but I added the following two lines to the end of the example&amp;rsquo;s data file so that org2&amp;rsquo;s author had a book with a price greater than 10:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;:auth3 :writesBook :book5 . 
:book5 :price 17 .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This helped to demonstrate the GROUP BY part of the query better.&lt;/p&gt;
&lt;h2 id=&#34;id103415&#34;&gt;Negation&lt;/h2&gt;
&lt;p&gt;The MINUS keyword and the ability to use the NOT operator with EXISTS both provide a cleaner alternative to the FILTER(!bound(?varName)) trick used in SPARQL 1.0 to make missing values part of the retrieval criteria. Lee has a slide on &lt;a href=&#34;http://www.slideshare.net/LeeFeigenbaum/sparql2-status/9&#34;&gt;MINUS&lt;/a&gt;, but not on NOT EXISTS. I tried NOT EXISTS as well because it&amp;rsquo;s new and is grouped together with MINUS in the spec. The spec even has a subsection on &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/#neg-notexists-minus&#34;&gt;the relationship and difference between NOT EXISTS and MINUS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Using ARQ, the spec&amp;rsquo;s examples for both &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/#neg-minus&#34;&gt;MINUS&lt;/a&gt; and &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/#neg-notexists&#34;&gt;NOT EXISTS&lt;/a&gt; worked just fine.&lt;/p&gt;
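&lt;p&gt;To give the flavor of the new syntax, a query for people with no recorded name (a paraphrase of the spec&amp;rsquo;s example rather than a verbatim copy) can now read:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

SELECT ?person
WHERE {
  ?person rdf:type foaf:Person .
  FILTER NOT EXISTS { ?person foaf:name ?name }
}
&lt;/code&gt;&lt;/pre&gt;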
&lt;h2 id=&#34;id103469&#34;&gt;Property paths&lt;/h2&gt;
&lt;p&gt;The spec only has a &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/#propertypaths&#34;&gt;placeholder for this&lt;/a&gt; for now, but when I made up my own example after looking at Lee&amp;rsquo;s slide it worked fine. Here&amp;rsquo;s my sample data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix : &amp;lt;http://rdfdata.org/whatever#&amp;gt; .
:jane :knows :frank . 
:frank :knows :sarah . 
:sarah :knows :steve . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The query asks for everyone that jane knows and whoever they know, transitively.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX : &amp;lt;http://rdfdata.org/whatever#&amp;gt; 
SELECT ?person 
WHERE {
  :jane :knows+ ?person .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I may not have perfectly described the semantics of what that plus sign does here (an asterisk is another option), but you get the idea when you see the result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;----------
| person |
==========
| :frank |
| :sarah |
| :steve |
----------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will let SPARQL-based applications do even more without needing an OWL inference engine, especially when used with the rdfs:subClassOf property.&lt;/p&gt;
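&lt;p&gt;For example, with class data added to a dataset like the one above (an untested sketch with a made-up :Animal class), one path expression retrieves all instances of :Animal or any of its subclasses, however deep the class tree goes, with no inferencing engine involved:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX : &amp;lt;http://rdfdata.org/whatever#&amp;gt;
PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT ?instance
WHERE { ?instance rdf:type/rdfs:subClassOf* :Animal }
&lt;/code&gt;&lt;/pre&gt;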
&lt;h2 id=&#34;id103529&#34;&gt;Federated queries and subqueries&lt;/h2&gt;
&lt;p&gt;SPARQL 1.1 uses the same syntax for these that Jena (the framework behind ARQ) always had as an extension, so I&amp;rsquo;ve demonstrated these before in my &lt;a href=&#34;https://www.bobdc.com/blog/federated-sparql-queries&#34;&gt;Federated SPARQL queries&lt;/a&gt; blog entry of last January.&lt;/p&gt;
&lt;h2 id=&#34;id103552&#34;&gt;Time to play more with SPARQL 1.1&lt;/h2&gt;
&lt;p&gt;SPARQL 1.1 is no longer just a specification and something to debate about, but something we can actually play with, so go and do so and let the SPARQL Working Group know what you think. (See the paragraph beginning &amp;ldquo;Comments on this document&amp;hellip;&amp;rdquo; in the &lt;a href=&#34;http://www.w3.org/TR/2010/WD-sparql11-query-20100601/&#34;&gt;Working Draft&lt;/a&gt;.) Personally, I&amp;rsquo;d prefer to see the spec include &lt;a href=&#34;http://www.w3.org/2009/sparql/wiki/Feature:Assignment&#34;&gt;variable assignment&lt;/a&gt;, which is &lt;a href=&#34;http://jena.sourceforge.net/ARQ/assignment.html&#34;&gt;already part of Jena and ARQ&lt;/a&gt; (and Open Anzo) as an extension. I know that SPARQL 1.1&amp;rsquo;s new projected expression and subquery features can be combined for a similar effect, but that&amp;rsquo;s going to be pretty verbose. What do you think?&lt;/p&gt;
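&lt;p&gt;To show what I mean about verbosity, here is a sketch (with made-up :price and :tax properties) of the LET extension in ARQ next to an equivalent built from a subquery with a projected expression:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX : &amp;lt;http://rdfdata.org/whatever#&amp;gt; 

# The LET extension in ARQ (not part of the SPARQL 1.1 draft):
SELECT ?item ?total 
WHERE {
  ?item :price ?p ; :tax ?t .
  LET ( ?total := ?p + ?t )
}

# A similar effect with a subquery and a projected expression:
SELECT ?item ?total 
WHERE {
  { SELECT ?item ((?p + ?t) AS ?total)
    WHERE { ?item :price ?p ; :tax ?t }
  }
}
&lt;/code&gt;&lt;/pre&gt;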
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Converting CSV to RDF</title>
      <link>https://www.bobdc.com/blog/converting-csv-to-rdf/</link>
      <pubDate>Wed, 11 Aug 2010 08:22:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/converting-csv-to-rdf/</guid>
      
      
      <description><div>The simplest way yet.</div><div>&lt;p&gt;There are probably dozens of ways to convert comma-separated values to parsable RDF, but I recently came up with one that was so simple that I wanted to share it.&lt;/p&gt;
&lt;p&gt;Here is a sample CSV list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;red&amp;quot; , &amp;quot;blue&amp;quot;, &amp;quot;gray&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I put the following before it and a period after it,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix : &amp;lt;http://rdfdata.org/csv#&amp;gt; . :csvList :item 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I get this: parsable RDF using the Turtle syntax.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix : &amp;lt;http://rdfdata.org/csv#&amp;gt; . :csvList :item &amp;quot;red&amp;quot; , &amp;quot;blue&amp;quot;, &amp;quot;gray&amp;quot; . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&amp;rsquo;s it. It works as a single line like that, but it&amp;rsquo;s easier for human eyes to read if you look at it as an abbreviated version of the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix : &amp;lt;http://rdfdata.org/csv#&amp;gt; .  
:csvList :item &amp;quot;red&amp;quot; .
:csvList :item &amp;quot;blue&amp;quot; .
:csvList :item &amp;quot;gray&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or, &amp;ldquo;the csvList resource has &amp;lsquo;red&amp;rsquo;, &amp;lsquo;blue&amp;rsquo;, and &amp;lsquo;gray&amp;rsquo; as item property values&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;I just made up the URI, subject, and predicate. Your next step would probably be to use SPARQL to convert them to something more appropriate to your application.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve used the semicolon in Turtle and SPARQL many times to avoid repeating a triple&amp;rsquo;s subject for multiple triples. I&amp;rsquo;ve used the comma, which delimits a list of objects that go with the same subject and predicate, less often, and it&amp;rsquo;s the key to the trick here: that a CSV list is already a &lt;a href=&#34;http://www.w3.org/TeamSubmission/turtle/#groups&#34;&gt;part of Turtle syntax&lt;/a&gt;.&lt;/p&gt;
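&lt;p&gt;Side by side, with a made-up :source property for the semicolon example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A semicolon repeats the subject for a new predicate and object:
:csvList :item &amp;quot;red&amp;quot; ;
         :source &amp;quot;demo&amp;quot; .

# A comma repeats both the subject and the predicate:
:csvList :item &amp;quot;red&amp;quot; , &amp;quot;blue&amp;quot; , &amp;quot;gray&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;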
&lt;p&gt;Converting CSV data to RDF in just about any programming language would be a very short script, and it&amp;rsquo;s easy enough with products such as TopBraid Composer, so I&amp;rsquo;m not interested in accumulating a list of other ways to do it here, unless you can beat mine for simplicity. I just thought it was neat that something as simple as prepending a short string and appending a period would turn a CSV list into legal, parsable RDF.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/jacksnell707/3485929405/&#34;&gt;&lt;img id=&#34;id103304&#34; src=&#34;https://c2.staticflickr.com/4/3664/3485929405_046e5f1d20.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;1971 Chevrolet Nova (Custom) &#39;937 CSV&#39; 1&#34; width=&#34;280px&#34;/&gt;&lt;/a&gt; (photo: &lt;a href=&#34;http://www.flickr.com/photos/jacksnell707/&#34;&gt;http://www.flickr.com/photos/jacksnell707/&lt;/a&gt; / &lt;a href=&#34;http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en&#34;&gt;CC BY-NC-SA 2.0)&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2599&#34;&gt;August 11, 2010 10:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cute! Sadly, CSV escapes internal quotes by doubling them, and Turtle requires them to be escaped as \&amp;quot;, so this trick will only work if you have no quotes in your data.&lt;/p&gt;
&lt;p&gt;Of course, it also totally fails to capture any of the semantics of table rows/columns/cells, so it&amp;rsquo;s not like you were going to use it for real!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2600&#34;&gt;August 11, 2010 11:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Glenn,&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re assuming that spreadsheets are the only source of CSV. I&amp;rsquo;ve already used this trick for real, when I was passing a few values from a Javascript script to a SPARQL query that was acting on RDF data combined from several sources.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2601&#34;&gt;August 11, 2010 1:28 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Good point. Replace the word &amp;ldquo;real&amp;rdquo; in my comment with &amp;ldquo;whole spreadsheets&amp;rdquo;!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.linkedopenservices.org/&#34; title=&#34;http://www.linkedopenservices.org/&#34;&gt;Barry Norton&lt;/a&gt; on &lt;a href=&#34;#comment-2737&#34;&gt;December 9, 2010 4:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, that&amp;rsquo;s neither a list in CSV nor RDF/Turtle.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a row in CSV (being picky, but I&amp;rsquo;ll explain why in a second), but more importantly you&amp;rsquo;ve created a set in RDF, not a list.&lt;/p&gt;
&lt;p&gt;The Turtle list syntax would be:&lt;/p&gt;
&lt;p&gt;:csvRow rdf:value (&amp;ldquo;red&amp;rdquo; &amp;ldquo;blue&amp;rdquo; &amp;ldquo;grey&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;Why does that matter? Because, as much as I agree with your &amp;ldquo;next step would probably be to use SPARQL to convert them to something more appropriate to your application&amp;rdquo; (it&amp;rsquo;s what we do in the JSON2RDF approach of Linked Open Services), you&amp;rsquo;ve lost the structure and can&amp;rsquo;t differentiate between columns in a graph pattern.&lt;/p&gt;
&lt;p&gt;Jumping back to the comment about CSV rows, this is clearer if (instead of having homogeneous data across columns in your source), you had something like:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;red&amp;rdquo;, &amp;ldquo;FF0000&amp;rdquo;&lt;br /&gt;
&amp;ldquo;green&amp;rdquo;, &amp;ldquo;00FF00&amp;rdquo;&lt;br /&gt;
&amp;ldquo;blue&amp;rdquo;, &amp;ldquo;0000FF&amp;rdquo;&lt;br /&gt;
&amp;ldquo;yellow&amp;rdquo;, &amp;ldquo;FFFF00&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You could project this into a list of lists:&lt;/p&gt;
&lt;p&gt;((&amp;ldquo;red&amp;rdquo; &amp;ldquo;FF0000&amp;rdquo;)&lt;br /&gt;
(&amp;ldquo;green&amp;rdquo; &amp;ldquo;00FF00&amp;rdquo;)&lt;br /&gt;
(&amp;ldquo;blue&amp;rdquo; &amp;ldquo;0000FF&amp;rdquo;)&lt;br /&gt;
(&amp;ldquo;yellow&amp;rdquo; &amp;ldquo;FFFF00&amp;rdquo;))&lt;/p&gt;
&lt;p&gt;(A valid Turtle doc being:&lt;/p&gt;
&lt;p&gt;Then you could actually make a construct like:&lt;/p&gt;
&lt;p&gt;CONSTRUCT&lt;br /&gt;
{?item rdfs:label ?colour; rdf:value ?code}&lt;br /&gt;
WHERE&lt;br /&gt;
{[rdf:first ?item] .&lt;br /&gt;
?item rdf:first ?colour; rdf:rest [rdf:first ?code]}&lt;/p&gt;
&lt;p&gt;Leading to:&lt;br /&gt;
[rdfs:label &amp;ldquo;red&amp;rdquo;; rdf:value &amp;ldquo;FF0000&amp;rdquo;] .&lt;br /&gt;
[rdfs:label &amp;ldquo;green&amp;rdquo;; rdf:value &amp;ldquo;00FF00&amp;rdquo;] .&lt;br /&gt;
[rdfs:label &amp;ldquo;blue&amp;rdquo;; rdf:value &amp;ldquo;0000FF&amp;rdquo;] .&lt;br /&gt;
[rdfs:label &amp;ldquo;yellow&amp;rdquo;; rdf:value &amp;ldquo;FFFF00&amp;rdquo;]&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
Ideally, of course, rather than these being blank nodes you would reuse or mint a URI scheme for them, but this requires two new features of SPARQL 1.1 to include in the query.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2738&#34;&gt;December 9, 2010 8:05 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Barry, that&amp;rsquo;s interesting.&lt;/p&gt;
&lt;p&gt;What I did worked for my needs&amp;ndash;it wasn&amp;rsquo;t just a demo, but something in an actual application I was developing for a client&amp;ndash;but I appreciate the clarification of terminology.&lt;/p&gt;
&lt;p&gt;By Barry Norton on &lt;a href=&#34;#comment-2739&#34;&gt;December 9, 2010 8:29 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;No problems. Actually I only realised this was so long ago after I posted. I think you were in the thread about tools to achieve this, including Google Refine?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;&lt;img alt=&#34;Author Profile Page&#34; src=&#34;http://www.snee.com/mt-static/images/comment/mt_logo.png&#34; width=&#34;16&#34; height=&#34;16&#34; /&gt;&lt;/a&gt; on &lt;a href=&#34;#comment-2740&#34;&gt;December 10, 2010 8:46 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;No, I wasn&amp;rsquo;t really looking for extra tools. I just had to hand off a bit of data from some Javascript to TopBraid Composer and was looking for the simplest way to represent it as parsable triples, and I thought it was neat how simple it turned out to be. Neat enough to blog it&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Jazz camp</title>
      <link>https://www.bobdc.com/blog/jazz-camp/</link>
      <pubDate>Tue, 27 Jul 2010 09:04:29 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/jazz-camp/</guid>
      
      
      <description><div>Theory and practice.</div><div>&lt;p&gt;I hadn&amp;rsquo;t planned on writing here about my experience at the &lt;a href=&#34;http://www.summerjazzworkshops.com/about.asp&#34;&gt;Jamey Aebersold Jazz Camp&lt;/a&gt; held at the University of Louisville in Kentucky in the second week of July, but it&amp;rsquo;s so easy to summarize the key lessons I learned about soloing that I thought I&amp;rsquo;d jot them down after all:&lt;/p&gt;
&lt;img id=&#34;id103317&#34; src=&#34;https://www.bobdc.com/img/main/PatHarbison2010Combo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Pat Harbison&#39;s 2010 Jamey Aebersold combo&#34;/&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The best way to get to where the music in your head comes out of your fingers as you think of it is to record yourself singing a solo along with a given chord progression (a.k.a. scat singing), transcribe what you sang onto music paper, learn it on your instrument, and then repeat.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Get more comfortable with the &lt;a href=&#34;http://en.wikipedia.org/wiki/Bebop_scale&#34;&gt;bebop scales&lt;/a&gt;. These are common scales with an extra note added so that if you play a long string of eighth notes and include that extra note as you go up and down the scale, you&amp;rsquo;re more likely to hit a chord tone on each downbeat. As the name implies, the technique was developed by the main musicians of the early bebop era.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Memorize the melodies of classic bebop tunes, especially the parts over the &lt;a href=&#34;http://en.wikipedia.org/wiki/Ii-V-I_turnaround&#34;&gt;ii-V-I&lt;/a&gt; chord sequences (or even just the ii-V parts) so that you can use those licks over other ii-V-I chord sequences as they come up. They come up a lot.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course none of these are magic wands, but instead guidelines to productive ways to spend your practice time, and I have a lot of practicing to do. Bass players spend a lot of time practicing things to play behind the solos of other players, and my attempts at solos don&amp;rsquo;t measure up to what the horn, piano, and guitar players that I play with are doing. I think I have a better idea how to catch up now.&lt;/p&gt;
&lt;p&gt;The camp is like an intensive week of music school, where you start each day with a theory class, and then have master classes with others who play your instrument, rehearsals with the combo you&amp;rsquo;ve been assigned to, and more classes. Placement in a theory class and combo depends on a written test and audition that you do upon arrival. The faculty has great and important players on each instrument, and each evening ended with a concert of two hours or more by various groups of faculty members.&lt;/p&gt;
&lt;p&gt;Each combo has a faculty member assigned to oversee them. I was happy that mine had trumpet player &lt;a href=&#34;http://www.patharbison.com/&#34;&gt;Pat Harbison&lt;/a&gt;, because as a horn player he put together some cool background harmonies for the group&amp;rsquo;s horn section to play behind the vocals and non-horn solos. I&amp;rsquo;ve put &lt;a href=&#34;http://www.flickr.com/photos/bobdc/sets/72157624446726931/&#34;&gt;pictures&lt;/a&gt; of Pat&amp;rsquo;s combo on flickr and a recording of our recital on my &lt;a href=&#34;http://www.myspace.com/bobdc&#34;&gt;MySpace page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The attendees are an interesting mix—this year roughly half of the 300 or so people were under 21. The picture of our combo is pretty representative (unfortunately, our singer is not in that picture, but you can see her in the flickr pictures). You don&amp;rsquo;t see many people in their twenties or thirties, because it&amp;rsquo;s mostly teenagers and those of us closer to middle-age. Staying in a dorm was a bit of a pain, but cost a lot less than a hotel. My roommate was an alto player from Puerto Rico who is tired of playing salsa, and he&amp;rsquo;s played with some big names such as &lt;a href=&#34;http://en.wikipedia.org/wiki/El_Gran_Combo_de_Puerto_Rico&#34;&gt;El Gran Combo&lt;/a&gt;. (He only played with them as a sub; apparently, to get a full-time position, you have to wait for someone who plays your instrument in the band to die.) I had heard of El Gran Combo before, but never listened to them much, and have been doing so on &lt;a href=&#34;http://listen.grooveshark.com/&#34;&gt;GrooveShark&lt;/a&gt;, my new favorite music site. I highly recommend them.&lt;/p&gt;
&lt;p&gt;On the last day, recital day, the 30 or so combos each play one song in one of two recital halls. It was fun to go back and forth and see different combinations of young and old attendees that I&amp;rsquo;d met during the week.&lt;/p&gt;
&lt;p&gt;During the eight-hour drive back, I noticed that many major bourbon distilleries are grouped together just south of the 70 miles of I-64 between Louisville and Lexington. At first I thought this was an interesting coincidence, but then I remembered that bourbon is named for the Kentucky county where distillers first started to age their corn liquor in charred oak casks, and I was there (more or less—the boundaries have changed over the years). In the nineteenth century, drinkers down the river in New Orleans liked it so much more than the unaged corn whiskey from other places that they started requesting the Bourbon whiskey. I learned a lot more during my tour of the &lt;a href=&#34;http://www.wildturkey.com/&#34;&gt;Wild Turkey&lt;/a&gt; distillery, but skipped the free samples at the end of the tour because of the seven hours of driving I still had ahead of me. It was still nice to get my mind off of scales and chords.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://culinarygraphics.blogspot.com/&#34; title=&#34;http://culinarygraphics.blogspot.com/&#34;&gt;Julie&lt;/a&gt; on &lt;a href=&#34;#comment-2592&#34;&gt;July 27, 2010 9:40 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I can&amp;rsquo;t believe you couldn&amp;rsquo;t do the tasting at the distilleries! Probably wouldn&amp;rsquo;t be worth the extra hotel night so I&amp;rsquo;m glad your birthday present was waiting at home for you. Not one mention of what you ate, not very DuCharme of you.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://amundsen.com/blog/&#34; title=&#34;http://amundsen.com/blog/&#34;&gt;Mike Amundsen&lt;/a&gt; on &lt;a href=&#34;#comment-2593&#34;&gt;July 27, 2010 10:27 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;i remember working w/ Aebersold audio cassette tapes *years* ago; i even think i had a few of his LPs to practice against!&lt;/p&gt;
&lt;p&gt;brings back fond memories.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2594&#34;&gt;July 27, 2010 10:39 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;At one point during the week I suggested that a turntablist with one of these LPs would make an excellent addition to a recital combo, although I&amp;rsquo;m sure it would have given Mr. Aebersold a heart attack.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-2595&#34;&gt;July 27, 2010 12:24 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Lots of French in this: Louisville, Bourbon, Orleans, and of course du Charme.&lt;/p&gt;
&lt;p&gt;By Harold Carr on &lt;a href=&#34;#comment-2596&#34;&gt;July 27, 2010 11:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;What instrument do you play?&lt;/p&gt;
&lt;p&gt;I play bass. Next week I&amp;rsquo;ll be at:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.stanfordjazz.org/education/jazzresidency.html&#34;&gt;http://www.stanfordjazz.org/education/jazzresidency.html&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Replace Facebook with FOAF &#43; twitter &#43; ?</title>
      <link>https://www.bobdc.com/blog/replace-facebook-with-foaf-twi/</link>
      <pubDate>Thu, 17 Jun 2010 17:54:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/replace-facebook-with-foaf-twi/</guid>
      
      
      <description><div>Making the first connection.</div><div>&lt;img id=&#34;id103299&#34; src=&#34;https://www.bobdc.com/img/main/twitter2foaf.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;twitter to FOAF image&#34;/&gt;
&lt;p&gt;If we replaced Facebook with a decentralized collection of cooperating services that provide a similar collection of features, obvious candidates for some of these services are FOAF files, twitter, and flickr, but what would coordinate those services? Some have APIs and can store information that lets you make connections between the different services, so I wrote something to make one of those connections.&lt;/p&gt;
&lt;p&gt;Twitter and related services such as identi.ca do an excellent job of replacing Facebook&amp;rsquo;s status updates. FOAF files were supposed to be the RDF geek&amp;rsquo;s ideal way to track friend networks, and while FOAF is probably the most popular vocabulary outside of Dublin Core, actual FOAF files have been used for little more than demos. If we could integrate them into this Facebook-like collection of services that I&amp;rsquo;ve been thinking about, they could become more practically useful, so I came up with a way to find someone&amp;rsquo;s FOAF file based on their twitter ID.&lt;/p&gt;
&lt;h2 id=&#34;id103345&#34;&gt;Looking up a FOAF file using a twitter ID&lt;/h2&gt;
&lt;p&gt;Twitter lets you specify a home page address as part of your profile, and twitter&amp;rsquo;s API can easily find out someone&amp;rsquo;s home page URL. A little RDFa in your home page can point to your FOAF file, and then a short script can take someone&amp;rsquo;s twitter ID, find out their home page URL, and then find out the FOAF file URL from the RDFa in that home page. I wrote a service that takes a twitter ID and returns the FOAF file URL, which you can try yourself with the URL &lt;a href=&#34;http://www.rdfdata.org/cgi/twitterName2FOAFFilename.cgi?twitterID=bobdc&#34;&gt;http://www.rdfdata.org/cgi/twitterName2FOAFFilename.cgi?twitterID=bobdc&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Step one is adding the following RDFa to the body of your home page, substituting your own home page and FOAF file URLs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;div xmlns:foaf=&amp;quot;http://xmlns.com/foaf/0.1/&amp;quot; typeof=&amp;quot;foaf:Person&amp;quot;&amp;gt; 
  &amp;lt;span rel=&amp;quot;foaf:homepage&amp;quot; href=&amp;quot;http://www.snee.com/bob&amp;quot;&amp;gt;&amp;lt;/span&amp;gt; 
  &amp;lt;span rel=&amp;quot;foaf:page&amp;quot; href=&amp;quot;http://www.snee.com/bob/foaf.rdf&amp;quot;&amp;gt;&amp;lt;/span&amp;gt; 
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It basically says &amp;ldquo;the Person with the following home page (which should be the same one named as your home page in your twitter profile) has the following FOAF file&amp;rdquo;. There&amp;rsquo;s nothing twitter-specific here, so the small number of triples that this embeds in your home page could serve many other uses, and of course you can add other information about the home page owner being described.&lt;/p&gt;
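&lt;p&gt;Fed through an RDFa parser, that markup should yield triples along these lines, shown here in Turtle (the exact blank node label will vary by parser):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; .

_:b0 a foaf:Person ;
     foaf:homepage &amp;lt;http://www.snee.com/bob&amp;gt; ;
     foaf:page &amp;lt;http://www.snee.com/bob/foaf.rdf&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;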
&lt;p&gt;Once you&amp;rsquo;ve done this, you should be able to substitute your twitter account name for bobdc in the rdfdata.org URL above and see it return your FOAF file. (If you decide to revise the home page value in your twitter account before testing this, I&amp;rsquo;ve found that the data used by their API is not as recent as you would hope, so testing this with a revised home page value may require some patience.)&lt;/p&gt;
&lt;p&gt;You can try the relevant twitter API call &lt;a href=&#34;http://apiwiki.twitter.com/Twitter-REST-API-Method:-users%C2%A0show&#34;&gt;users show&lt;/a&gt; with &lt;a href=&#34;http://www.theonion.com/&#34;&gt;The Onion&lt;/a&gt; twitter account using the URL &lt;a href=&#34;http://api.twitter.com/1/users/show/theOnion.xml&#34;&gt;http://api.twitter.com/1/users/show/theOnion.xml&lt;/a&gt;. In the returned XML, the &lt;code&gt;url&lt;/code&gt; element holds the user&amp;rsquo;s home page URL. To parse and query the triples in the RDFa, I used the technique described in my &lt;a href=&#34;https://www.bobdc.com/blog/restful-sparql-queries-of-rdfa&#34;&gt;last blog posting&lt;/a&gt;, except that this time I used the excellent &lt;a href=&#34;http://www.dotnetrdf.org/demos/leviathan/&#34;&gt;http://www.dotnetrdf.org/demos/leviathan/&lt;/a&gt; service to both parse the RDFa and query the triples, because sparql.org was down for much of this week.&lt;/p&gt;
&lt;p&gt;If you do add this RDFa to your home page and the service works for you, let others know by tweeting &amp;ldquo;I&amp;rsquo;ve connected my twitter account to my FOAF file #twitter2foaf&amp;rdquo;.&lt;/p&gt;
&lt;h2 id=&#34;id103466&#34;&gt;Adding more Facebook-like features&lt;/h2&gt;
&lt;p&gt;Flickr lets you specify a home page as part of your profile, but I didn&amp;rsquo;t see a way to find this value using &lt;a href=&#34;http://www.flickr.com/services/api/&#34;&gt;flickr&amp;rsquo;s API&lt;/a&gt;, which is about as simple and straightforward as twitter&amp;rsquo;s. Theoretically, the same idea should work with Facebook itself, because they have an API, but their API looks like a real pain compared to the others, and they don&amp;rsquo;t have a dedicated profile field to let a user name a home page outside of Facebook—funny thing!&lt;/p&gt;
&lt;p&gt;If we can use tweets to send money to each other, it shouldn&amp;rsquo;t be that difficult to establish a twitter convention to &amp;ldquo;foaf:friend&amp;rdquo; someone—that is, to tweet &amp;ldquo;I&amp;rsquo;ll add you to my FOAF file at http://my/path/foaf.rdf if you&amp;rsquo;ll add me to yours at http://your/path/foaf.rdf&amp;rdquo;. Automating the addition would be a bit more work, but it&amp;rsquo;s not insurmountable.&lt;/p&gt;
&lt;p&gt;A Google search for &lt;a href=&#34;http://www.google.com/search?q=%22the+next+facebook%22&#34;&gt;&amp;ldquo;the next facebook&amp;rdquo;&lt;/a&gt; gets hundreds of thousands of hits. I&amp;rsquo;d love to see people worry less about replacing Facebook services and more about developing technology to connect existing services into something that can substitute for Facebook. It&amp;rsquo;s great to see how people like Henry Story, with his &lt;a href=&#34;http://esw.w3.org/Foaf%2Bssl&#34;&gt;FOAF+SSL&lt;/a&gt; work, are doing just that, and building on semantic web standards to make it possible.&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://larlet.com&#34; title=&#34;http://larlet.com&#34;&gt;David Larlet&lt;/a&gt; on &lt;a href=&#34;#comment-2558&#34;&gt;June 18, 2010 6:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I agree with you.&lt;/p&gt;
&lt;p&gt;Note that you should accept FOAF files defined in link/meta too (like mine: ), I was a bit surprised that it doesn&amp;rsquo;t work with my twitter ID :)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2562&#34;&gt;June 18, 2010 10:03 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;David,&lt;/p&gt;
&lt;p&gt;Sorry about that. There are different ways of expressing a FOAF value in a home page, and this looks for explicit triples that say that a subject who has this homepage also has a particular FOAF file.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://larlet.com&#34; title=&#34;http://larlet.com&#34;&gt;David Larlet&lt;/a&gt; on &lt;a href=&#34;#comment-2563&#34;&gt;June 18, 2010 11:01 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;No problem about that, it was more a suggestion than a feature request.&lt;/p&gt;
&lt;p&gt;David&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.krisvandenbergh.eu/&#34; title=&#34;http://www.krisvandenbergh.eu/&#34;&gt;Kris Van den Bergh&lt;/a&gt; on &lt;a href=&#34;#comment-2567&#34;&gt;June 22, 2010 1:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;Great post! Very interesting stuff. Managing your friends in a decentralized way would be very cool. You already hinted at subsequent steps.&lt;/p&gt;
&lt;p&gt;I was thinking how to &amp;ldquo;FOAF friend&amp;rdquo; someone via Twitter. Let&amp;rsquo;s say we write a twitter app. Do you agree that: 1) both parties should have installed the app. 2) Their FOAF files should be writable on their servers. I don&amp;rsquo;t see another way of doing this. The app itself could use a technology like SPARQL Push.&lt;/p&gt;
&lt;p&gt;Am very eager to hear your thoughts!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2568&#34;&gt;June 22, 2010 1:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Kris,&lt;/p&gt;
&lt;p&gt;1) I think that&amp;rsquo;s asking too much. It should be driven by a protocol that can have multiple implementations.&lt;/p&gt;
&lt;p&gt;2) I&amp;rsquo;ve been thinking about this and here&amp;rsquo;s my idea: allowing write access to our individual FOAF files on our individual servers is a tall order, although as I mentioned FOAF+SSL might help. Something easier would be FOAF storage services that understood the request and probably still needed FOAF+SSL to know when there was permission to write to your data.&lt;/p&gt;
&lt;p&gt;What keeps a given service from becoming the new Facebook, with too much centralized control over everyone&amp;rsquo;s data, is that you should be able to download your FOAF file from them and upload it to and use it on another hosting service with minimal trouble. The file on your server would just redirect to the URL used by the service as its identifier for your information.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s easy to download everything you&amp;rsquo;ve entered into del.icio.us; imagine if that was in a standardized format that you could upload to a new service and use the same way. I picture FOAF file hosting services working like that.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://eurweb.blogspot.com/&#34; title=&#34;http://eurweb.blogspot.com/&#34;&gt;Yuriy&lt;/a&gt; on &lt;a href=&#34;#comment-2602&#34;&gt;August 14, 2010 3:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think that a combination of vCard + XFN (in RDF format) is better than FOAF.&lt;br /&gt;
OpenSocial does not support FOAF.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>RESTful SPARQL queries of RDFa</title>
      <link>https://www.bobdc.com/blog/restful-sparql-queries-of-rdfa/</link>
      <pubDate>Thu, 03 Jun 2010 10:02:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/restful-sparql-queries-of-rdfa/</guid>
      
      
      <description><div>No local parsing or querying software needed.</div><div>&lt;p&gt;Facebook&amp;rsquo;s OpenGraph, Google&amp;rsquo;s Rich Snippets, BestBuy&amp;rsquo;s use of the GoodRelations vocabulary and other recent events are boosting RDFa&amp;rsquo;s popularity for storing machine-readable data in web pages. There are several tools and programming libraries available (not to mention built-in features of development platforms such as TopQuadrant&amp;rsquo;s TopBraid Suite for application development) that let you extract the RDF triples from this RDFa markup and use it, but I recently discovered how easily I can extract this data and perform SPARQL queries on it by just using publicly available, RESTful web services. The web page where the RDFa is embedded doesn&amp;rsquo;t even have to be well-formed HTML.&lt;/p&gt;
&lt;h2 id=&#34;id103333&#34;&gt;Getting the RDF triples out of the RDFa&lt;/h2&gt;
&lt;blockquote id=&#34;id103338&#34; class=&#34;pullquote&#34;&gt;I can say &#34;extract the RDF triples from the RDFa on that web page and then run this SPARQL query against it&#34; *all with a single URL.* &lt;/blockquote&gt;
&lt;p&gt;The W3C&amp;rsquo;s RDFa Distiller and Parser at &lt;a href=&#34;http://www.w3.org/2007/08/pyRdfa/&#34;&gt;http://www.w3.org/2007/08/pyRdfa/&lt;/a&gt; has a form that lets you enter the URL of a web page and set various parameters before clicking the &amp;ldquo;Go!&amp;rdquo; button to see the triples stored in that web page. Once you do this, you&amp;rsquo;ll see the RDF on your browser (a View Source may be necessary) and you&amp;rsquo;ll also see, in your browser&amp;rsquo;s navigation toolbar, the REST URL you would use to have the same program extract the triples without you filling out the form first. (As the page tells you, &amp;ldquo;If you intend to use this service regularly on large scale, consider downloading the package and use it locally.&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;For example, if you go to this form and enter the URL of TopQuadrant&amp;rsquo;s products web page (&lt;a href=&#34;http://www.topquadrant.com/products/TB_Suite.html&#34;&gt;http://www.topquadrant.com/products/TB_Suite.html&lt;/a&gt;), leaving all the other parameters at their default settings, clicking the &amp;ldquo;Go!&amp;rdquo; button will get you RDF/XML of the triples and, in the navigation toolbar, the URL used to retrieve them. I trimmed a few parameters off the URL and entered this shortened version directly into the browser, and it worked: &lt;code&gt;http://www.w3.org/2007/08/pyRdfa/extract?uri=http%3A%2F%2Fwww.topquadrant.com%2Fproducts%2FTB_Suite.html&amp;amp;format=pretty-xml&lt;/code&gt;. I&amp;rsquo;ll come back to this URL below.&lt;/p&gt;
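&lt;p&gt;To sketch that in Python (the endpoint is the distiller URL above; the function name is my own, and the &lt;code&gt;uri&lt;/code&gt; and &lt;code&gt;format&lt;/code&gt; parameters are just the ones visible in the URL the form generates), the URL-escaping and parameter assembly amount to a few lines:&lt;/p&gt;

```python
from urllib.parse import urlencode

# The W3C distiller endpoint from the post; "uri" and "format" are the
# parameters visible in the URL that its form generates.
EXTRACTOR = "http://www.w3.org/2007/08/pyRdfa/extract"

def extract_url(page_uri, fmt="pretty-xml"):
    """Build the RESTful URL that asks pyRdfa to distill a page's RDFa."""
    # urlencode percent-escapes the page URI and joins the parameters.
    return EXTRACTOR + "?" + urlencode({"uri": page_uri, "format": fmt})

print(extract_url("http://www.topquadrant.com/products/TB_Suite.html"))
```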
&lt;h2 id=&#34;id103411&#34;&gt;Querying the RDF&lt;/h2&gt;
&lt;p&gt;The sparql.org &lt;a href=&#34;http://www.sparql.org/sparql.html&#34;&gt;SPARQLer&lt;/a&gt; web form lets you enter a SPARQL query, specify a set of RDF to query and the return format, and then retrieve the result. For example, when I specify my FOAF file at &lt;a href=&#34;http://www.snee.com/bob/foaf.rdf&#34;&gt;http://www.snee.com/bob/foaf.rdf&lt;/a&gt; as the data to query and the following as the query, the SPARQLer lists my name and airport code, because I&amp;rsquo;m the only person in my FOAF file with both pieces of information:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
PREFIX air: &amp;lt;http://www.megginson.com/exp/ns/airports#&amp;gt;
SELECT ?personName ?airportCode WHERE {
  ?person foaf:name ?personName ; 
          foaf:nearestAirport ?airport . 
  ?airport air:iata ?airportCode . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you pick a non-default output format at the bottom of that form, then instead of the results being displayed in your browser, they may get saved to your disk. When doing this as a RESTful call (for example, when using &lt;a href=&#34;http://www.gnu.org/software/wget/&#34;&gt;wget&lt;/a&gt; or &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;curl&lt;/a&gt;) note the &lt;code&gt;&amp;amp;output=&lt;/code&gt; parameter in the URL and experiment with other settings besides the default of XML.&lt;/p&gt;
&lt;p&gt;My foaf.rdf file is a static text file sitting on disk, but here&amp;rsquo;s the cool part: I can enter any URI as the resource to query, as long as it identifies parsable RDF—for example, the URL above that gets RDF/XML out of the TopQuadrant products page.&lt;/p&gt;
&lt;h2 id=&#34;id103490&#34;&gt;Putting it together&lt;/h2&gt;
&lt;p&gt;The TopQuadrant products page uses mostly the &lt;a href=&#34;http://www.heppnetz.de/ontologies/goodrelations/v1&#34;&gt;GoodRelations&lt;/a&gt; vocabulary and the Yahoo! Searchmonkey &lt;a href=&#34;http://developer.yahoo.com/searchmonkey/smguide/profile_vocab.html&#34;&gt;Product&lt;/a&gt; vocabularies. (The RDFa on other pages of the website uses other mixes of vocabularies as appropriate; let&amp;rsquo;s not take for granted how easy RDF makes it to do this.) If I want to use SPARQL to get a list of product names and descriptions from that page, I can take the URL above that extracts triples from the RDFa in the TopBraid products page, enter it as the &amp;ldquo;Target graph URI&amp;rdquo; value on the SPARQLer form, and put the following query into that form&amp;rsquo;s &amp;ldquo;General SPARQL query&amp;rdquo; box:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX gr: &amp;lt;http://purl.org/goodrelations/v1#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX sm: &amp;lt;http://search.yahoo.com/searchmonkey/product/&amp;gt;


SELECT ?name ?description WHERE {
  ?product a sm:Product ;
           rdfs:label ?name ; 
           gr:description ?description . 
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with the RDFa Distiller and Parser, in addition to seeing the results of my query on the SPARQLer form, I&amp;rsquo;ll see the URL in the navigation bar that I could have used to execute the same query against the same data with a single URL instead of using the form. This is the grander cool part: I can say &amp;ldquo;extract the RDF triples from the RDFa on that web page and then run this SPARQL query against it&amp;rdquo; &lt;em&gt;all with a single URL&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Of course it&amp;rsquo;s a long, messy-looking URL because of the &lt;a href=&#34;http://www.xs4all.nl/~jlpoutre/BoT/Javascript/Utils/endecode.html&#34;&gt;URL-escaping&lt;/a&gt; of things like the spaces and punctuation in the SPARQL query. Any modern programming or scripting language provides a function that does this for you, and I&amp;rsquo;ve already written a perl script that does something pretty valuable with all this. More on that in a week or two.&lt;/p&gt;
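&lt;p&gt;Here is a minimal Python sketch of that combined call. The parameter names (&lt;code&gt;query&lt;/code&gt;, &lt;code&gt;default-graph-uri&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;) follow the SPARQL protocol and what the sparql.org form puts in the navigation bar, but treat them as assumptions and check the URL that the form itself generates:&lt;/p&gt;

```python
from urllib.parse import urlencode

EXTRACTOR = "http://www.w3.org/2007/08/pyRdfa/extract"
SPARQLER = "http://www.sparql.org/sparql"  # assumed endpoint behind the form

def rdfa_query_url(page_uri, sparql_query, output="xml"):
    """One URL that extracts a page's RDFa triples and queries them."""
    # Inner URL: pyRdfa pulls the triples out of the page.
    graph_uri = EXTRACTOR + "?" + urlencode({"uri": page_uri, "format": "xml"})
    # Outer URL: SPARQLer runs the query against that graph; the inner
    # URL gets percent-escaped again when used as a parameter value.
    params = {"query": sparql_query, "default-graph-uri": graph_uri,
              "output": output}
    return SPARQLER + "?" + urlencode(params)
```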
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.subbu.org&#34; title=&#34;http://www.subbu.org&#34;&gt;Subbu Allamaraju&lt;/a&gt; on &lt;a href=&#34;#comment-2528&#34;&gt;June 3, 2010 10:42 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I was intrigued by the title of this post, only to find that the &amp;ldquo;RESTful&amp;rdquo;-ness here is encoding some query into a URI that clients can GET. Maybe it should be titled &amp;ldquo;How to Encode SPARQL into URIs&amp;rdquo;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2530&#34;&gt;June 3, 2010 12:05 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Subbu,&lt;/p&gt;
&lt;p&gt;As a matter of fact, I didn&amp;rsquo;t say how to encode SPARQL into URIs, and mentioned that most programming languages have a function that will do that for you. I described some services to call with those URIs once you have them, and how the use of two of these services could be combined in one call.&lt;/p&gt;
&lt;p&gt;Maybe I have an oversimplified idea of what qualifies as RESTful, but if a process can instruct processes on other servers to provide specific machine-readable information using HTTP GETs, I thought that qualified. It certainly can play a role in a useful distributed application.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By Damian on &lt;a href=&#34;#comment-2531&#34;&gt;June 3, 2010 12:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hauntingly familiar:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.semanticoverflow.com/questions/587/is-there-a-web-service-that-allow-me-to-run-sparql-against-a-xhtmlrdfa-website/588#588&#34;&gt;http://www.semanticoverflow.com/questions/587/is-there-a-web-service-that-allow-me-to-run-sparql-against-a-xhtmlrdfa-website/588#588&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There is something very pleasing about this sort of composition.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://element.rubyforge.org/&#34; title=&#34;http://element.rubyforge.org/&#34;&gt;carmen&lt;/a&gt; on &lt;a href=&#34;#comment-2550&#34;&gt;June 13, 2010 8:18 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;SPARQListas do some things over and over&lt;/p&gt;
&lt;p&gt;select some set of resources&lt;br /&gt;
as you did in both queries:&lt;br /&gt;
(_, personName, _)&lt;br /&gt;
(_, type, someClass)&lt;/p&gt;
&lt;p&gt;i&amp;rsquo;d call your approach of encoding an arbitrary query-language into a querystring argument RPC-ish rather than REST-ful&lt;/p&gt;
&lt;p&gt;GET already has one URI per request. in the first example, there&amp;rsquo;s exactly one URI in the triple pattern, so a single querystring key can apply &amp;ldquo;function that builds a (_ URI _) triplepattern&amp;rdquo;. the second example can specify the second URI in the querystring&lt;/p&gt;
&lt;p&gt;once we have the set of resources, pulling out the names, locations etc can be done with existing tools like CSS selectors or XPath. or much more concise RDF path-expression microsyntaxes that aren&amp;rsquo;t ugly smashed into a URL&lt;/p&gt;
&lt;p&gt;obviously there are larger ad-hoc cases where you really want the power of full SPARQL, but the jump to the complexity of requiring a SPARQL engine is not necessary for some large swath of typical web needs. just like the world realized they didn&amp;rsquo;t need SQL when a basic key/val hashtable store (with interesting sharding and distribution possibilities) would do&lt;/p&gt;
&lt;p&gt;i like where you&amp;rsquo;re going, i just think it can be taken a lot further, and since i personally was annoyed by the notion of having to flush REST down the drain in favor of SPARQL i decided to scratch the itch&lt;/p&gt;
&lt;p&gt;even basic things in HTTP are unspecified, for example how do you in-band into the URI Accept: arguments like the content-type. i&amp;rsquo;ve seen ?output=, ?format=, appending the extension of the format to the URI before the querystring, and countless other variations.&lt;/p&gt;
&lt;p&gt;it would be nice if there were some standards there. maybe full-fledged URI keys like myapi:format which could rdf:sameAs some standard definition of what to do with that querystring arg&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2551&#34;&gt;June 13, 2010 10:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Carmen,&lt;/p&gt;
&lt;p&gt;In general, that all makes sense to me. I certainly didn&amp;rsquo;t mean to flush REST down the drain. I wanted to encode a request for a resource into a URL that turned out to have a lot of extra stuff, so (as I said before) perhaps my idea of REST is too broad.&lt;/p&gt;
&lt;p&gt;XPath (and for that matter, CSS) won&amp;rsquo;t work, though, unless the data conforms to a very specific structure that the person writing the query can take for granted. Then, of course, the query can be a lot simpler. I&amp;rsquo;ve worked with XML long enough to know that getting a wide variety of people to follow a specific DTD/schema in a wide variety of cases is a lot easier said than done, which is why I like the flexibility that RDF offers. This flexibility does shift the processing load elsewhere&amp;ndash;in the case of my example, to the query engine&amp;ndash;but the query engine software to do the work is out there, and I see it making a contribution to some real data processing problems.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Writing applications for 2G phones</title>
      <link>https://www.bobdc.com/blog/writing-applications-for-2g-ph/</link>
      <pubDate>Wed, 26 May 2010 12:54:56 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/writing-applications-for-2g-ph/</guid>
      
      
      <description><div>Not fancy apps, but they&#39;ll work with billions of phones.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-2gserver/&#34;&gt;&lt;img id=&#34;id103318&#34; src=&#34;http://www.ibm.com/developerworks/i/dwwordmark.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;developerWorks&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Two new habits of mine gave me a great idea for a simple application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When I want to jot something down and have my phone but no pen or paper, I send an SMS text to my regular email address. (I ain&amp;rsquo;t got one of them fancy 3G phones yet. We&amp;rsquo;re still waiting for 3G coverage in our neck of the woods.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When I need to send a quick message to one of my daughters at school to read when she gets a chance, I&amp;rsquo;ll send an SMS text to her phone from my email account, because I can type a lot faster on a full-sized keyboard than I can on my phone&amp;rsquo;s.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I realized that if SMS text messages can be sent or received as regular emails, and a little scripting can &lt;a href=&#34;http://www.xml.com/pub/a/2005/11/23/hacking-ebay-turning-email-alerts-into-atom.html&#34;&gt;automate the handling&lt;/a&gt; of email, then a server-side script that responds to queries delivered to it as SMS text messages would not be difficult to write.&lt;/p&gt;
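&lt;p&gt;The core of such a responder can be sketched in a few lines of Python. (This is not the demo&amp;rsquo;s actual code; the lookup table and function name here are made up for illustration, and actually sending the reply is left to sendmail or smtplib.)&lt;/p&gt;

```python
import email
import re

# Tiny stand-in for the real lookup data; the demo described below
# answers for any US area code.
AREA_CODES = {
    "434": "central Virginia, including Charlottesville",
    "212": "Manhattan, New York",
}

def answer_sms_query(raw_message):
    """Turn an incoming SMS-as-email into a (recipient, reply body) pair.

    procmail would pipe the raw message to a script like this; sending
    the reply back out is left to sendmail or smtplib.
    """
    msg = email.message_from_string(raw_message)
    sender = msg["From"]
    match = re.search(r"\b(\d{3})\b", msg.get_payload())
    if match is None:
        return sender, "Please text a three-digit US area code."
    code = match.group(1)
    return sender, AREA_CODES.get(code, "No entry for area code " + code)
```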
&lt;p&gt;So I wrote a demo, and described it in the IBM developerWorks article &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-2gserver/&#34;&gt;Simple server-side 2G phone apps&lt;/a&gt;. Here&amp;rsquo;s the summary from the beginning of the article:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mobile phones are transforming economies and societies all over the world, but often with phones that might be considered out-of-date by gadget geeks in more developed nations. The good news is that applications that work with these phones can be very simple to write, and they give your application a huge potential user base. In this article, learn how to write programs that respond to specialized requests for information from 2G phones.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When you send a text message of a US telephone area code to the email address I set up with this little application, it texts back a short description of the geographic coverage of that area code.&lt;/p&gt;
&lt;p&gt;When one of my daughter&amp;rsquo;s friends missed a call on her phone and wondered where in the country it came from, my daughter suggested that she text the area code to the email address that I had had her test so many times. It worked, so it was nice to see the app take this small step beyond demo status.&lt;/p&gt;
&lt;p&gt;I love the irony of how seemingly modern new applications can often be built with old-fashioned UNIX tools like procmail. Check out the article to learn more.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>What&#39;s wrong with undeclared classes and properties?</title>
      <link>https://www.bobdc.com/blog/whats-wrong-with-undeclared-cl/</link>
      <pubDate>Fri, 30 Apr 2010 09:57:03 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/whats-wrong-with-undeclared-cl/</guid>
      
      
      <description><div>It&#39;s not like the RDF spec requires them.</div><div>&lt;p&gt;OK, it&amp;rsquo;s a rhetorical question. I know the answer: we can attach metadata to class and property declarations, so when we know that a given instance is a member of a particular class and has certain properties, if those are declared, we know more about the instance and can do more with it, not least of all aggregate it more easily with other data that uses the same or related classes and properties.&lt;/p&gt;
&lt;p&gt;I learned from tweets by Paula Gearon and Tom Heath that section 2.3.2 of the &amp;ldquo;Weaving the Pedantic Web&amp;rdquo; paper (&lt;a href=&#34;http://events.linkeddata.org/ldow2010/papers/ldow2010_paper04.pdf&#34;&gt;pdf&lt;/a&gt;) presented at the Linked Data on the Web conference in Raleigh bemoans the existence of undeclared classes and properties. I agree that this is not a good thing, but we should be careful about attacking it.&lt;/p&gt;
&lt;p&gt;The Pedantic Web paper does point out that &amp;ldquo;such practice is not prohibited&amp;rdquo;, which many people seem to forget. This reminds me of the decision to qualify merely well-formed XML as legal, parsable markup, which was one of the big breaks that XML made from SGML, or Tim Berners-Lee&amp;rsquo;s decision to accept the possibility of broken links in his hypertext system, unlike those of his predecessors. Serious XML-based applications still use DTDs or schemas and well-maintained web sites use some kind of link management, but the simpler, grass roots efforts don&amp;rsquo;t necessarily, and that turned out to be a great thing. It let these technologies grow to a point where millions of people can see their benefits.&lt;/p&gt;
&lt;p&gt;If I have a triple that says&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.snee.com/d/r/s3/l9d&amp;gt; &amp;lt;http://www.snee.com/8r/xa/32e&amp;gt;  &amp;quot;true&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and my subject and predicate aren&amp;rsquo;t declared anywhere, it doesn&amp;rsquo;t tell you much. If I have one that says this with an undeclared subject and predicate,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.snee.com/d/r/invoice#l9d&amp;gt; &amp;lt;http://www.snee.com/8r/xa/paid&amp;gt;  &amp;quot;true&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote id=&#34;id103368&#34; class=&#34;pullquote&#34;&gt;I worry that I fall into the standardista class because I think that using the word &#34;semantic&#34; in your marketing literature isn&#39;t enough to qualify your work as part of the semantic web. &lt;/blockquote&gt;
&lt;p&gt;you can get a general idea of what&amp;rsquo;s going on even with no declarations, as you often can from element and attribute names in XML documents that have no corresponding schemas. Unlike the XML example, though, we can see a domain name associated with &amp;ldquo;invoice#129d&amp;rdquo; and &amp;ldquo;paid&amp;rdquo; here, which gives some context and therefore a bit of semantics about them.&lt;/p&gt;
&lt;p&gt;One great thing about RDF is that you can add on metadata after the fact, as Jim Hendler&amp;rsquo;s group at RPI is doing with a lot of the US government data. Third parties certainly can&amp;rsquo;t fix broken web links, and while James Clark&amp;rsquo;s wonderful &lt;a href=&#34;http://www.thaiopensource.com/relaxng/trang.html&#34;&gt;trang&lt;/a&gt; can generate schemas from documents, that&amp;rsquo;s more useful as a &lt;a href=&#34;http://www.snee.com/xml/xml2008/&#34;&gt;content analysis tool&lt;/a&gt; than as something that you&amp;rsquo;d use to create production schemas. Adding metadata such as declarations to triples after the fact is a perfectly normal thing to do, and it helps connect those triples to each other to form a, you know, web.&lt;/p&gt;
&lt;p&gt;I certainly don&amp;rsquo;t want to imply that the Pedantic Web effort is doing anything wrong; their efforts to educate people about the value of doing these things with more rigor are very valuable. In the name-calling that most discussions of new technology seem to devolve into these days (pedant! fanboy! standardista!), I worry that I fall into the standardista class because I think that using the word &amp;ldquo;semantic&amp;rdquo; in your marketing literature isn&amp;rsquo;t enough to qualify your work as part of the semantic web. I want to see support for relevant W3C standards involved, a position that apparently can get me lumped into the class of unreasonably demanding geeks who don&amp;rsquo;t appreciate the big picture, so I wanted to point out that the (spec-compliant) optional nature of class and property declarations can be a huge contributor to the growth of the semantic web.&lt;/p&gt;
&lt;p&gt;XML and Tim Berners-Lee&amp;rsquo;s hypertext system scaled up to the point that they did because of both carefully engineered efforts and the fast growth of unrigorous ones. Careful engineering of a system using semantic web technology can get a lot of value from class and property declarations, but we should remember that the other great thing about RDF, besides the ease of adding metadata to existing data, is that triples are simple and easy to aggregate and therefore share. Let&amp;rsquo;s not discourage people from doing so if they don&amp;rsquo;t happen to be doing it the way that we would.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2490&#34;&gt;April 30, 2010 11:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve always said dereferencing is a privilege not a right; there will certainly be RDF/OWL vocabs that aren&amp;rsquo;t public, even while bits of data using those vocabs might leak out. This is fine and inevitable. The reason to describe your properties and classes, and make them dereferenceable, is just that it makes folk more likely (and more able) to use them. And by documenting the &amp;lsquo;real&amp;rsquo; vocab it makes error detection easier, since a typo in the name results in different behaviour. There are other ways around that one of course (eg. stats from aggregators).&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s nothing fancier than - &amp;lsquo;If you want lots of people to use your stuff, document it carefully&amp;rsquo;. I don&amp;rsquo;t see any huge difference here between RDF, XML or general software documentation issues.&lt;/p&gt;
&lt;p&gt;The classic undocumented properties in RDF are rdf:_12345 etc &amp;hellip; maybe someone should update that schema, building on the fantastic Linked Open Numbers work? :) &lt;a href=&#34;http://km.aifb.kit.edu/projects/numbers/&#34;&gt;http://km.aifb.kit.edu/projects/numbers/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://mud.cz/&#34; title=&#34;http://mud.cz/&#34;&gt;Jiri Prochazka&lt;/a&gt; on &lt;a href=&#34;#comment-2491&#34;&gt;April 30, 2010 2:34 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You are right, but emphasizing this doesn&amp;rsquo;t bring any advantages as far as usefulness is concerned. Let&amp;rsquo;s not forget RDF is meant to be consumed by machines, not humans. Machines cannot see inside URIs, nor literals&amp;hellip; So I wouldn&amp;rsquo;t call this helping the growth of the Semantic Web but rather helping the growth of Linked Data. I expect knowledge using RDFS/OWL to be called Semantic Web, but this data I would be reluctant to call knowledge since it isn&amp;rsquo;t really machine understandable at all (anyway a lot of markup-oriented people are confused by this too).&lt;/p&gt;
&lt;p&gt;Still it&amp;rsquo;s better than the mess we are in now&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2493&#34;&gt;April 30, 2010 6:24 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;@Jiri, &amp;hellip; even if people aren&amp;rsquo;t reading the RDF directly, they&amp;rsquo;re still often writing software that matches its patterns, or composing queries, or running analytics. And in practice this is often done in an example-driven manner. When developers encounter a new dataset, they&amp;rsquo;re far more likely to seek out example instance data, than to go meta and read the schema. The schema is there for reference and checking, but commonly skipped over until things go wrong. Examples are much more important to real usage&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>The meaning of &#34;semantics&#34;</title>
      <link>https://www.bobdc.com/blog/the-meaning-of-semantics/</link>
      <pubDate>Tue, 09 Mar 2010 18:48:11 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-meaning-of-semantics/</guid>
      
      
      <description><div>No pun intended.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/julianbleecker/269633724/&#34;&gt;&lt;img id=&#34;id103303&#34; src=&#34;http://farm1.static.flickr.com/79/269633724_459cd88a11.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Configured Scenario Semantics&#34; width=&#34;280&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Dave McComb&amp;rsquo;s book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1558609172/bobducharmeA/%20&#34;&gt;Semantics in Business Systems&lt;/a&gt; recommended John Saeed&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1405156392/bobducharmeA/%20&#34;&gt;Semantics&lt;/a&gt; as an &amp;ldquo;excellent introductory book on semantics in everyday life&amp;rdquo;, so I found a cheap used copy and have been working my way through it. I&amp;rsquo;m sure that it&amp;rsquo;s been used for both graduate and undergraduate courses, and it&amp;rsquo;s not too difficult to follow so far. I especially like this part, which Saeed said he adapted from the work of Charles Morris:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;syntax: the formal relation of signs to each other;&lt;/p&gt;
&lt;p&gt;semantics: the relations of signs to the objects to which the signs are applicable;&lt;/p&gt;
&lt;p&gt;pragmatics: the relation of signs to interpreters.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He goes on to say that &amp;ldquo;the whole science of language, consisting of the three parts mentioned, is called semiotic&amp;rdquo;, but I was more interested in the way he put semantics in the larger context.&lt;/p&gt;
&lt;p&gt;Printed and &lt;a href=&#34;http://dictionary.reference.com/browse/semantics&#34;&gt;dictionary.com&lt;/a&gt; definitions of &amp;ldquo;semantics&amp;rdquo; typically come in pairs, with the first usually saying &amp;ldquo;the study of meaning&amp;rdquo; and the second more in line with Saeed&amp;rsquo;s definition. The latter is sometimes identified as being specific to the fields of linguistics or semiotics.&lt;/p&gt;
&lt;p&gt;I think that the linguistics/semiotics definition serves the semantic web better, because describing semantics as the relations of signs to the things they signify (and moving some of the &amp;ldquo;meaning&amp;rdquo; parts that take place in people&amp;rsquo;s heads to the &amp;ldquo;pragmatics&amp;rdquo; category) helps us to focus on what the semantic web is best at: providing an infrastructure to identify which signs (IDs in the form of URIs) refer to which objects (resources) so that people can use this infrastructure to create applications that work across the web.&lt;/p&gt;
&lt;p&gt;Interpretation of the &amp;ldquo;meaning&amp;rdquo; of the signified resources is not necessarily a goal of these applications. While &lt;a href=&#34;https://www.bobdc.com/blog/adding-semantics-to-make-data&#34;&gt;OWL&lt;/a&gt; can encode properties of concepts to let us do more reasoning with those concepts, attacking the feasibility of getting computers to Understand Meaning is a straw man argument that I&amp;rsquo;m tired of hearing from people who insist that the semantic web is an impractical idea. Standards and best practices that let applications track the relationship of identifiers to resources on a World Wide Web scale—who can argue with that?&lt;/p&gt;
&lt;p&gt;(photo: &lt;a href=&#34;http://www.flickr.com/photos/julianbleecker/&#34;&gt;http://www.flickr.com/photos/julianbleecker/&lt;/a&gt; / &lt;a href=&#34;http://creativecommons.org/licenses/by-nc-nd/2.0/&#34;&gt;CC BY-NC-ND 2.0)&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-2457&#34;&gt;March 9, 2010 8:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Two experts*, to explicate meaning,&lt;br /&gt;
Wrote a book called &lt;em&gt;The Meaning of Meaning&lt;/em&gt;.&lt;br /&gt;
     The world still perplexed,&lt;br /&gt;
     Three experts wrote next&lt;br /&gt;
&lt;em&gt;The Meaning of &amp;ldquo;Meaning of Meaning&amp;rdquo;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;*Ogden and Richards&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://prateek-jain.com&#34; title=&#34;http://prateek-jain.com&#34;&gt;Prateek&lt;/a&gt; on &lt;a href=&#34;#comment-2458&#34;&gt;March 9, 2010 9:57 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Another excellent book in my humble opinion&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.semantic-web-book.org/page/Foundations_of_Semantic_Web_Technologies&#34;&gt;http://www.semantic-web-book.org/page/Foundations_of_Semantic_Web_Technologies&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2459&#34;&gt;March 10, 2010 7:59 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Nice post, and very well stated.&lt;/p&gt;
&lt;p&gt;I think, Microsoft&amp;rsquo;s use of deep zoom images as symbols (rather than exposing http identifiers) for the human interaction aspect of Linked Data Browsing / Exploration UIs may finally drive home the mercurial essence of what Linked Data is fundamentally about.&lt;/p&gt;
&lt;p&gt;If you haven&amp;rsquo;t done so already see:&lt;/p&gt;
&lt;p&gt;1. &lt;a href=&#34;http://www.youtube.com/watch?v=G29DBIEcIuQ&#34;&gt;http://www.youtube.com/watch?v=G29DBIEcIuQ&lt;/a&gt; &amp;ndash; Microsoft Pivot in front of Virtuoso&amp;rsquo;s DBMS hosted Faceted Linked Data Navigation Engine&lt;/p&gt;
&lt;p&gt;Kingsley&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://markwatson.com&#34; title=&#34;http://markwatson.com&#34;&gt;Mark Watson&lt;/a&gt; on &lt;a href=&#34;#comment-2461&#34;&gt;March 10, 2010 12:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, a good overview, thanks. Another good read (Ben Goertzel recommended this to me when we worked together): &amp;ldquo;Semantics, Primes and Universals&amp;rdquo; by Anna Wierzbicka.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Is SPIN the Schematron of RDF?</title>
      <link>https://www.bobdc.com/blog/is-spin-the-schematron-of-rdf/</link>
      <pubDate>Mon, 01 Mar 2010 18:56:45 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/is-spin-the-schematron-of-rdf/</guid>
      
      
      <description><div>Represent business rules using an implemented standard, then flag violations in a machine-readable way.</div><div>&lt;blockquote id=&#34;id103300&#34; class=&#34;pullquote&#34;&gt;Many complain about the potentially low quality of public semantic web data, but Fürber and Hepp are doing something about it.&lt;/blockquote&gt;
&lt;p&gt;Christian Fürber and Martin Hepp (the latter being the source of the increasingly popular &lt;a href=&#34;http://www.heppnetz.de/projects/goodrelations/&#34;&gt;GoodRelations&lt;/a&gt; ontology) have published a paper titled &amp;ldquo;Using SPARQL and SPIN for Data Quality Management on the Semantic Web&amp;rdquo; (&lt;a href=&#34;http://www.heppnetz.de/files/fuerber-hepp-sparql-spin-dqm.pdf&#34;&gt;pdf&lt;/a&gt;) for the 2010 &lt;a href=&#34;http://bis.kie.ae.poznan.pl/13th_bis/&#34;&gt;Business Information Systems&lt;/a&gt; conference in Berlin. TopQuadrant&amp;rsquo;s Holger Knublauch designed SPIN, or the &lt;a href=&#34;http://www.spinrdf.org/&#34;&gt;SPARQL Inferencing Notation&lt;/a&gt;, as a SPARQL-based way to express constraints and inferencing rules on sets of triples, and Fürber and Hepp have taken a careful, structured look at how to apply it to business data.&lt;/p&gt;
&lt;p&gt;I knew that &amp;ldquo;data quality&amp;rdquo; was a specific discipline within IT, but I hadn&amp;rsquo;t looked at it very closely. Their paper gives a nice overview of this area before moving on to describing their work. It also describes the value that a systematic approach to data quality can bring to semantic web applications, but I don&amp;rsquo;t think anyone needs any convincing there; it&amp;rsquo;s often the first issue people bring up when they hear about the very idea of Linked Data on the web.&lt;/p&gt;
&lt;p&gt;Or, to put it more bluntly, many complain about the potentially low quality of public semantic web data, but Fürber and Hepp are doing something about it. SPIN may have the potential to do for RDF data what &lt;a href=&#34;http://xml.ascc.net/resource/schematron/schematron.html&#34;&gt;Schematron&lt;/a&gt; has done for XML for years now: providing a technique, based entirely on an existing, well-implemented W3C standard, for describing business rules about data and then validating data against those rules. (I see that William Vambenepe &lt;a href=&#34;http://stage.vambenepe.com/archives/496&#34;&gt;had some thoughts&lt;/a&gt; on the comparison early last year.)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m looking forward to Fürber and Hepp&amp;rsquo;s future work described in their paper and to seeing how others apply it in their applications.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Using the ARQ SPARQL processor from the command line</title>
      <link>https://www.bobdc.com/blog/using-the-arq-sparql-processor/</link>
      <pubDate>Thu, 21 Jan 2010 10:38:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-the-arq-sparql-processor/</guid>
      
      
      <description><div>With the Jena extensions.</div><div>&lt;p&gt;I recently described how to execute &lt;a href=&#34;https://www.bobdc.com/blog/federated-sparql-queries&#34;&gt;Federated SPARQL queries&lt;/a&gt; that use Jena extensions that we&amp;rsquo;ll hopefully see added to the SPARQL 1.1 standard. I showed a sample query and suggested that you try it at the &lt;a href=&#34;http://www.sparql.org/query.html&#34;&gt;sparql.org RDF Query Demo page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For local, command-line use of SPARQL, I&amp;rsquo;ve used the Jena &lt;a href=&#34;http://jena.sourceforge.net/ARQ/&#34;&gt;ARQ&lt;/a&gt; query engine for years, but my sample federated query didn&amp;rsquo;t work with it, and now I know why: the sparql.bat file that comes with the distribution invokes the processor in a strictly standards-compliant mode without the extensions enabled. I thought I&amp;rsquo;d have to write and compile some Java code to use the extensions, but my co-worker Jeremy Carroll pointed out that the sparql.bat file in ARQ&amp;rsquo;s bat subdirectory calls the arq.sparql library, like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -cp %CP% arq.sparql %*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and that calling the arq.arq library instead enables the extensions. Then, I noticed the arq.bat file in the same directory as sparql.bat, and this is exactly what it does. There are more batch files in there, and a web search on their names led me to an &lt;a href=&#34;http://jena.sourceforge.net/ARQ/cmds.html&#34;&gt;ARQ - Command Line Applications&lt;/a&gt; documentation page, which will be handy.&lt;/p&gt;
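&lt;p&gt;In other words, an extension-enabled equivalent of sparql.bat differs by a single word; it invokes the arq.arq class instead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -cp %CP% arq.arq %*
&lt;/code&gt;&lt;/pre&gt;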
&lt;p&gt;Using arq.bat instead of sparql.bat, the sample federated query works as written (tested with ARQ 2.8.2), and so do LET assignments and &lt;a href=&#34;http://jena.sourceforge.net/ARQ/library-function.html&#34;&gt;extension functions&lt;/a&gt;, making it possible to use ARQ in real semantic web application development with no need to do Java coding around the Jena API.&lt;/p&gt;
&lt;p&gt;(Thanks again, Jeremy!)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Live stock ticker data in RDF</title>
      <link>https://www.bobdc.com/blog/live-stock-ticker-data-in-rdf/</link>
      <pubDate>Tue, 12 Jan 2010 11:19:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/live-stock-ticker-data-in-rdf/</guid>
      
      
      <description><div>Well, on a 20-minute delay.</div><div>&lt;p&gt;I&amp;rsquo;ve played with finance.yahoo.com&amp;rsquo;s feed of CSV stock ticker data &lt;a href=&#34;https://www.bobdc.com/blog/using-the-twitter-api-to-alert&#34;&gt;before&lt;/a&gt; and recently had an idea that was so simple that I&amp;rsquo;m surprised that no one&amp;rsquo;s done it before: why not write a script that passes along a request for this data but converts the result to RDF before returning it? So I did.&lt;/p&gt;
&lt;blockquote id=&#34;id103312&#34; class=&#34;pullquote&#34;&gt;I suppose it might count as a semantic web service.&lt;/blockquote&gt;
&lt;p&gt;A URL like &lt;a href=&#34;http://www.rdfdata.org/cgi/stockquotes.cgi?symbols=BUD,IBM,SNE&#34;&gt;http://www.rdfdata.org/cgi/stockquotes.cgi?symbols=BUD,IBM,SNE&lt;/a&gt; asks for recent ticker information about the stock symbols listed in the comma-separated value list. The stockquotes.cgi script adds the parameters to the appropriate stub to create a URL like &lt;a href=&#34;http://download.finance.yahoo.com/d/quotes.csv?f=sl1d1t1ohgv&amp;amp;e=.csv&amp;amp;s=BUD,IBM,SNE&#34;&gt;http://download.finance.yahoo.com/d/quotes.csv?f=sl1d1t1ohgv&amp;amp;e=.csv&amp;amp;s=BUD,IBM,SNE&lt;/a&gt;, uses this URL to retrieve the CSV results, converts them to RDF/XML, and sends that back to the original requester with a MIME type of application/rdf+xml. The whole script, with white space and comments, wasn&amp;rsquo;t even 100 lines. You can click the first link in this paragraph to see an example of it in action.&lt;/p&gt;
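&lt;p&gt;To give an idea of the shape of the result, here is roughly what the returned data looks like for one symbol, shown in Turtle terms (the namespace, property names, and figures below are made up for illustration; they&amp;rsquo;re not necessarily the ones my little ontology uses):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix sq: &amp;lt;http://www.rdfdata.org/ns/stockquote#&amp;gt; .

&amp;lt;http://www.rdfdata.org/quotes/IBM&amp;gt;
    sq:symbol    &amp;quot;IBM&amp;quot; ;
    sq:lastTrade &amp;quot;130.25&amp;quot; ;
    sq:tradeDate &amp;quot;1/12/2010&amp;quot; ;
    sq:open      &amp;quot;129.80&amp;quot; ;
    sq:volume    &amp;quot;5123400&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;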
&lt;p&gt;I haven&amp;rsquo;t done anything with the rdfdata.org domain name in a while, so I thought that would be a nice place for this. I&amp;rsquo;ve already used this little web service in a work-related demo that combines and cross-references RDF data from multiple sources, because after all, that&amp;rsquo;s one of the things that RDF is so good at.&lt;/p&gt;
&lt;p&gt;Is this a &amp;ldquo;semantic web service&amp;rdquo;? All it does is convert the data returned by a Yahoo feed into a different syntax and pass it along. I did throw together a little ontology to name the properties, but it doesn&amp;rsquo;t add a lot of semantics. On the other hand, my script&amp;rsquo;s output syntax is based on a semantic web standard, and it makes the data easier to use in semantic web applications, so I suppose it might count as a semantic web service.&lt;/p&gt;
&lt;p&gt;I hope this is useful to others, and I hope that more people look for opportunities to convert live feeds of useful data in simple formats into live feeds of RDF.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://clockwerx.blogspot.com/&#34; title=&#34;http://clockwerx.blogspot.com/&#34;&gt;Daniel O&amp;rsquo;Connor&lt;/a&gt; on &lt;a href=&#34;#comment-2406&#34;&gt;January 12, 2010 9:51 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.freebase.com/view/user/doconnor/default_domain/views/nyse_companies&#34;&gt;http://www.freebase.com/view/user/doconnor/default_domain/views/nyse_companies&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Down the bottom are some export CSV links.&lt;/p&gt;
&lt;p&gt;Alternatively, you could view the &amp;ldquo;MQL&amp;rdquo; (like sparql if it were made of javascript/json), and an MQLread webservice to search for specific ids / matches&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.jazzengineers.com/&#34; title=&#34;http://www.jazzengineers.com/&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2407&#34;&gt;January 13, 2010 12:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Daniel! I followed through a little there and found &lt;a href=&#34;http://rdf.freebase.com/rdf/en.the_hershey_company&#34;&gt;http://rdf.freebase.com/rdf/en.the_hershey_company&lt;/a&gt; , which is the first good example I&amp;rsquo;ve found of RDF from Freebase. I&amp;rsquo;m guessing that there&amp;rsquo;s a lot more&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://melvincarvalho.com&#34; title=&#34;http://melvincarvalho.com&#34;&gt;Melvin Carvalho&lt;/a&gt; on &lt;a href=&#34;#comment-2428&#34;&gt;January 15, 2010 8:52 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice service, I challenged myself to write a wrapper on this in 15 minutes, and here&amp;rsquo;s my attempt:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://marketdata.me/&#34;&gt;http://marketdata.me/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.hellowallet.com&#34; title=&#34;http://www.hellowallet.com&#34;&gt;Erwin&lt;/a&gt; on &lt;a href=&#34;#comment-2537&#34;&gt;June 7, 2010 11:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, any idea why your link makes me download a file named stockquotes.cgi? I wanted to see another example besides Melvin&amp;rsquo;s so I can write something for my own purposes. Thanks.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2539&#34;&gt;June 7, 2010 1:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Erwin,&lt;/p&gt;
&lt;p&gt;For both mine and Melvin&amp;rsquo;s, Firefox just displays it, while Chrome and IE want to store it as you described. They store the RDF/XML file that Firefox displays.&lt;/p&gt;
&lt;p&gt;I played with the HTTP header returned with the data, but couldn&amp;rsquo;t affect the behavior. The important thing to me is that it works with wget and curl, so that I know that it works as a RESTful web service. It isn&amp;rsquo;t really aimed at browser use. (If it was, I would have had it return an HTML file!)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.stock-trading-newsletter.com&#34; title=&#34;http://www.stock-trading-newsletter.com&#34;&gt;stock trading newsletter&lt;/a&gt; on &lt;a href=&#34;#comment-2587&#34;&gt;July 14, 2010 9:36 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This looks good. Can you insert more criteria?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2588&#34;&gt;July 18, 2010 9:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That was the best I could do with what I had available.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Federated SPARQL queries</title>
      <link>https://www.bobdc.com/blog/federated-sparql-queries/</link>
      <pubDate>Mon, 04 Jan 2010 13:07:44 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/federated-sparql-queries/</guid>
      
      
      <description><div>Using a Jena extension.</div><div>&lt;p&gt;Much of the promise of RDF and Linked Data is the ease of pulling data from multiple sources and combining it. I recently discovered the SERVICE extension that Jena adds to SPARQL, letting you send subqueries off to multiple SPARQL endpoints and then combine the results. Because a given SPARQL endpoint may be an interface to a triplestore or a relational data store or something else, the ability to query several endpoints with one query is very nice.&lt;/p&gt;
&lt;blockquote id=&#34;id103299&#34; class=&#34;pullquote&#34;&gt;The ability to query several endpoints with one query is very nice.&lt;/blockquote&gt;
&lt;p&gt;The Jena project&amp;rsquo;s &lt;a href=&#34;http://jena.sourceforge.net/ARQ/service.html&#34;&gt;ARQ - Basic Federated SPARQL Query&lt;/a&gt; describes the use of this keyword. Before I start quoting from that page, I wanted to jump right in with an example that worked for me to pull birthday and spouse information about Arnold Schwarzenegger from &lt;a href=&#34;http://dbpedia.org&#34;&gt;DBpedia&lt;/a&gt; and a list of his movies and their release dates from &lt;a href=&#34;http://www.linkedmdb.org/&#34;&gt;Linked Movie Database&lt;/a&gt; in one query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX imdb: &amp;lt;http://data.linkedmdb.org/resource/movie/&amp;gt;
PREFIX dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt;
PREFIX dbpo: &amp;lt;http://dbpedia.org/ontology/&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;


SELECT ?birthDate ?spouseName ?movieTitle ?movieDate {
  { SERVICE &amp;lt;http://dbpedia.org/sparql&amp;gt;
    { SELECT ?birthDate ?spouseName WHERE {
        ?actor rdfs:label &amp;quot;Arnold Schwarzenegger&amp;quot;@en ;
               dbpo:birthDate ?birthDate ;
               dbpo:spouse ?spouseURI .
        ?spouseURI rdfs:label ?spouseName .
        FILTER ( lang(?spouseName) = &amp;quot;en&amp;quot; )
      }
    }
  }
  { SERVICE &amp;lt;http://data.linkedmdb.org/sparql&amp;gt;
    { SELECT ?actor ?movieTitle ?movieDate WHERE {
      ?actor imdb:actor_name &amp;quot;Arnold Schwarzenegger&amp;quot;.
      ?movie imdb:actor ?actor ;
             dcterms:title ?movieTitle ;
             dcterms:date ?movieDate .
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can run this query yourself at the &lt;a href=&#34;http://www.sparql.org/query.html&#34;&gt;sparql.org RDF Query Demo page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before you start modeling your own queries on this, it&amp;rsquo;s worth reading the Jena documentation page mentioned above, especially the &amp;ldquo;Performance Considerations&amp;rdquo; part:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This feature is a basic building block to allow remote access in the middle of a query, not a general solution to the issues in distributed query evaluation. The algebra operation is executed without regard to how selective the pattern is. So the order of the query will affect the speed of execution. Because it involves HTTP operations, asking the query in the right order matters a lot. Don&amp;rsquo;t ask for the whole of a bookstore just to find book whose title comes from a local RDF file - ask the bookshop a query with the title already bound from earlier in the query.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As an example, both subqueries above specifically ask for information about Schwarzenegger instead of trying to scan the complete databases looking for matches.&lt;/p&gt;
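&lt;p&gt;To sketch what the Jena documentation&amp;rsquo;s bookstore advice means in query terms (the bookshop endpoint and vocabulary here are invented), binding the selective value before the SERVICE clause keeps the remote request small:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX bk: &amp;lt;http://bookshop.example/ns#&amp;gt;

SELECT ?title ?price WHERE {
  # bind ?title from local data first...
  ?localBook bk:title ?title .
  # ...so the remote pattern only matches that one title
  SERVICE &amp;lt;http://bookshop.example/sparql&amp;gt; {
    ?book bk:title ?title ;
          bk:price ?price .
  }
}
&lt;/code&gt;&lt;/pre&gt;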
&lt;p&gt;Two parts of this trick are non-standard SPARQL, but may become part of SPARQL 1.1: &lt;a href=&#34;http://www.slideshare.net/LeeFeigenbaum/sparql2-status/8&#34;&gt;subqueries&lt;/a&gt; and the &lt;a href=&#34;http://www.slideshare.net/LeeFeigenbaum/sparql2-status/15&#34;&gt;SERVICE keyword&lt;/a&gt;. As the latter Lee Feigenbaum slide points out, the SPARQL Working Group is using ARQ&amp;rsquo;s SERVICE keyword as a starting point in thinking about how a query can target multiple endpoints.&lt;/p&gt;
&lt;p&gt;My query above of the two different SPARQL endpoints also works from within TopQuadrant&amp;rsquo;s TopBraid Suite of products, so I&amp;rsquo;m sure I&amp;rsquo;ll be using this on work-related projects more and more.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-2397&#34;&gt;January 4, 2010 9:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I knew we&amp;rsquo;d get you using Jena sooner or later. It&amp;rsquo;s got the best sparql IMHO.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://karl.glatz.biz&#34; title=&#34;http://karl.glatz.biz&#34;&gt;Karl Glatz&lt;/a&gt; on &lt;a href=&#34;#comment-2535&#34;&gt;June 7, 2010 9:14 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice blog post!&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not able to test your query on the sparql.org Webpage, got some &amp;ldquo;Error 500: No dataset description for query&amp;rdquo;? Any suggestions?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2536&#34;&gt;June 7, 2010 10:15 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Maybe one of the endpoints was down when you tried it. I just pasted the query above at the demo page, and it ran fine, i.e. it didn&amp;rsquo;t get an error. There were headers with no results under them, because DBpedia has changed the URL for birthDate to &lt;a href=&#34;http://dbpedia.org/property/birthDate&#34;&gt;http://dbpedia.org/property/birthDate&lt;/a&gt; and no longer has a spouse value for Schwarzenegger, so the ?birthDate and ?spouseURI variables didn&amp;rsquo;t get bound.&lt;/p&gt;
&lt;p&gt;The cleanup of DBpedia&amp;rsquo;s ontologies is obviously a good thing overall, but can break some queries. I have no idea why someone would remove his spouse value. Maria Shriver does have a spouse value of Schwarzenegger.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2010">2010</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>RDFS: The primary document</title>
      <link>https://www.bobdc.com/blog/rdfs-the-primary-document/</link>
      <pubDate>Sun, 29 Nov 2009 12:10:41 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdfs-the-primary-document/</guid>
      
      
      <description><div>Shorter and more interesting than I remember.</div><div>&lt;p&gt;About two years ago I &lt;a href=&#34;https://www.bobdc.com/blog/rdfs-without-rdfowl&#34;&gt;wondered&lt;/a&gt; if RDF Schema had become merely a layer of OWL or if anyone used RDFS by itself without OWL. My theory was that because tools such as TopBraid Composer, Protégé, and SWOOP that let you design RDFS vocabularies also let you assign OWL properties to your classes, people used those because they were there, and we ended up with few pure RDFS vocabularies.&lt;/p&gt;
&lt;blockquote id=&#34;id103315&#34; class=&#34;pullquote&#34;&gt;I heartily recommend that you read the first 11 or 18 pages of the RDFS spec and skim the rest.&lt;/blockquote&gt;
&lt;p&gt;Lately, though, it seems that a lot of people who had been using the terms vocabulary/taxonomy/ontology interchangeably have started to understand better when OWL is too much. As they review the issues surrounding the choice between OWL 1 Lite, DL, and Full, between OWL 2 EL, QL, and RL, and the implications of open- vs. closed-world assumptions, more attitudes can be summarized as &amp;ldquo;sounds interesting, but pretty complicated; maybe later.&amp;rdquo; This makes good sense for people whose main interest is defining a standardized vocabulary.&lt;/p&gt;
&lt;p&gt;SKOS looks pretty good to more and more of them, but here I want to focus on RDFS. As I thought more about it recently, I realized that I had never read the &lt;a href=&#34;http://www.w3.org/TR/2004/REC-rdf-schema-20040210/&#34;&gt;RDF Schema Recommendation&lt;/a&gt;, so about five years late I sat down to do so. It&amp;rsquo;s nice to remember, when you&amp;rsquo;re wondering about the true meaning of some term or the relationship between some concepts, that a spec is available where you can just read the official explanation of what&amp;rsquo;s what. (Of course, &lt;a href=&#34;http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/&#34;&gt;some specs&lt;/a&gt; are less enlightening than others when you&amp;rsquo;re confused about what they describe.)&lt;/p&gt;
&lt;p&gt;I found the RDFS Recommendation to be an interesting mix of simple things that are commonly used and complex things that are rarely used. When I printed it out, it was 27 pages, but the summaries and references start on page 18, and the appropriately titled &lt;a href=&#34;http://www.w3.org/TR/2004/REC-rdf-schema-20040210/#ch_othervocab&#34;&gt;Other Vocabulary&lt;/a&gt; section on pages 12 through 17 describes the rarely used features. Let&amp;rsquo;s look at some interesting parts that lead up to that. From the Abstract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This specification describes how to use RDF to describe RDF vocabularies.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Maybe that&amp;rsquo;s obvious to some, but it&amp;rsquo;s reassuring when confusion over vocabularies, taxonomies, and ontologies comes up. From the introduction:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Resource Description Framework (RDF) is a general-purpose language for representing information in the Web.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As opposed to being a data model. (It&amp;rsquo;s certainly not a syntax!)&lt;/p&gt;
&lt;p&gt;Why do we need this schema language?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;RDF properties may be thought of as attributes of resources and in this sense correspond to traditional attribute-value pairs. RDF properties also represent relationships between resources.&lt;/p&gt;
&lt;p&gt;RDF however, provides no mechanisms for describing these properties, nor does it provide any mechanisms for describing the relationships between these properties and other resources. That is the role of the RDF vocabulary description language, RDF Schema. RDF Schema defines classes and properties that may be used to describe classes, properties and other resources.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The following is interesting for two reasons: first, because it describes a member of a class as an &amp;ldquo;instance,&amp;rdquo; reminding me that &amp;ldquo;individual&amp;rdquo; is definitely an OWL term that has no particular role in RDFS. (A little later the document tells us that &amp;ldquo;the members of a class are known as &lt;em&gt;instances&lt;/em&gt; [their emphasis] of the class&amp;rdquo;.) It&amp;rsquo;s also interesting as a nice summary of an issue that often confuses people with an object-oriented background.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The RDF vocabulary description language class and property system is similar to the type systems of object-oriented programming languages such as Java. RDF differs from many such systems in that instead of defining a class in terms of the properties its instances may have, the RDF vocabulary description language describes properties in terms of the classes of resource to which they apply. This is the role of the domain and range mechanisms described in this specification. For example, we could define the &lt;code&gt;eg:author&lt;/code&gt; property to have a domain of &lt;code&gt;eg:Document&lt;/code&gt; and a range of &lt;code&gt;eg:Person&lt;/code&gt;, whereas a classical object oriented system might typically define a class &lt;code&gt;eg:Book&lt;/code&gt; with an attribute called &lt;code&gt;eg:author&lt;/code&gt; of type &lt;code&gt;eg:Person&lt;/code&gt;. Using the RDF approach, it is easy for others to subsequently define additional properties with a domain of eg:&lt;code&gt;Document&lt;/code&gt; or a range of &lt;code&gt;eg:Person&lt;/code&gt;. This can be done without the need to re-define the original description of these classes. One benefit of the RDF property-centric approach is that it allows anyone to extend the description of existing resources, one of the architectural principles of the Web.&lt;/p&gt;
&lt;/blockquote&gt;
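&lt;p&gt;The spec&amp;rsquo;s example boils down to just three triples. In Turtle (with eg: standing in for whatever example namespace you like):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix eg:   &amp;lt;http://example.org/ns#&amp;gt; .

eg:author rdf:type    rdf:Property ;
          rdfs:domain eg:Document ;
          rdfs:range  eg:Person .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Anyone else can then define new properties that point at eg:Document or eg:Person without touching the original class descriptions.&lt;/p&gt;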
&lt;p&gt;The role and relationship of the &lt;code&gt;rdfs:domain&lt;/code&gt; and &lt;code&gt;rdfs:range&lt;/code&gt; properties have confused me and &lt;a href=&#34;http://twitter.com/JeniT/status/5272938272&#34;&gt;many others&lt;/a&gt;. The spec&amp;rsquo;s description of their use is rather technical (nothing wrong with that; it&amp;rsquo;s a spec) but there&amp;rsquo;s this nice passage after that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;an RDF vocabulary might describe limitations on the types of values that are appropriate for some property, or on the classes to which it makes sense to ascribe such properties.&lt;/p&gt;
&lt;p&gt;The RDF Vocabulary Description language provides a mechanism for describing this information, but does not say whether or how an application should use it&amp;hellip;&lt;/p&gt;
&lt;p&gt;For example, data checking tools might use this to help discover errors in some data set, an interactive editor might suggest appropriate values, and a reasoning application might use it to infer additional information from instance data.&lt;/p&gt;
&lt;p&gt;RDF vocabularies can describe relationships between vocabulary items from multiple independently developed vocabularies. Since URI-References are used to identify classes and properties in the Web, it is possible to create new properties that have a &lt;code&gt;domain&lt;/code&gt; or &lt;code&gt;range&lt;/code&gt; whose value is a class defined in another namespace.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think that makes some basic issues clearer.&lt;/p&gt;
&lt;p&gt;I have mixed feelings about the &amp;ldquo;Other vocabulary&amp;rdquo; section on features that, from what I&amp;rsquo;ve seen, never got much traction: container classes and properties, RDF collections, and reification. On the one hand, usage of these can appear so complex that I think it scared a lot of people away from RDF in the early days, obscuring the simplicity of the triple as the fundamental concept of RDF. On the other hand, as I read about these options now, they looked like they could be fun to play with, in a geeky sort of way. (I also realize that the whole concept of reification—the ability to refer to triples as resources themselves so that properties can be assigned to them—is an important bit of RDF foundational architecture for other good RDF-related ideas to build on.)&lt;/p&gt;
&lt;p&gt;So, whether you&amp;rsquo;re new to the whole idea of a standardized definition of a vocabulary or you&amp;rsquo;ve been using OWL and RDFS together for years, I heartily recommend that you read the first 11 or 18 pages of the RDFS spec and skim the rest, which includes some handy reference material.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2377&#34;&gt;November 30, 2009 5:49 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Finally, someone read it! Thanks ;) I remember bits of those paragraphs coming together from contributors to the original RDFS WG, 98/9, and other things from the RDF Core makeover. There&amp;rsquo;s lots I&amp;rsquo;d do differently now but that&amp;rsquo;s life! There are some other bits that got mostly dropped from the doc at some point, eg. about the Warwick Framework,&lt;br /&gt;
&amp;ldquo;&amp;ldquo;&amp;ldquo;RDF and the RDF Schema language were also based on metadata research in the Digital Library community. In particular, RDF adopts a modular approach to metadata that can be considered an implementation of the Warwick Framework [WF]. RDF represents an evolution of the Warwick Framework model in that the Warwick Framework allowed each metadata vocabulary to be represented in a different syntax. In RDF, all vocabularies are expressed within a single well defined model. This allows for a finer grained mixing of machine-processable vocabularies, and addresses the need [EXTWEB] to create metadata in which statements can draw upon multiple vocabularies that are managed in a decentralized fashion by independent communities of expertise. &amp;quot;&amp;rdquo;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.w3.org/TR/2000/CR-rdf-schema-20000327/&#34;&gt;http://www.w3.org/TR/2000/CR-rdf-schema-20000327/&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Converting Word documents to DITA</title>
      <link>https://www.bobdc.com/blog/converting-word-documents-to-d/</link>
      <pubDate>Fri, 20 Nov 2009 09:37:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/converting-word-documents-to-d/</guid>
      
      
      <description><div>Via OpenOffice and DocBook.</div><div>&lt;p&gt;I recently had to convert a few Microsoft Word documents to DITA XML and thought it would be worth sharing my notes on the steps I took. To summarize, I opened each Word document with OpenOffice 3.1, saved it as a DocBook XML document, and then converted that to DITA with the XSLT stylesheet from a DITA plugin that I found. Images were a little more trouble, but at least I was able to eventually automate that part as well, dispelling my worries that I&amp;rsquo;d have to add all the image references to the DITA files by hand.&lt;/p&gt;
&lt;h2 id=&#34;id103333&#34;&gt;Word to DocBook&lt;/h2&gt;
&lt;img id=&#34;id103300&#34; src=&#34;https://www.bobdc.com/img/main/word2dita.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Word and DITA logos&#34;/&gt;
&lt;p&gt;When you open a Word file with OpenOffice and do a Save As DocBook, it assumes that the document uses default Word styles, because that&amp;rsquo;s how OpenOffice knows what&amp;rsquo;s what in the document&amp;rsquo;s structure. The conversion does an impressive job of adding wrappers in the appropriate places considering that it&amp;rsquo;s using an XSLT 1.0 stylesheet. This kind of stylesheet would be &lt;a href=&#34;http://www.xml.com/pub/a/2003/11/05/tr.html&#34;&gt;much easier to write&lt;/a&gt; with XSLT 2, but that reduces the choice of XSLT processors that you can use. It doesn&amp;rsquo;t matter much from the user&amp;rsquo;s perspective, because it&amp;rsquo;s all under the covers anyway. The key thing is the convenience of creating the DocBook version from OpenOffice with a simple Save As.&lt;/p&gt;
&lt;p&gt;On the down side, some nested bulleted lists in the original content did not show up in the DocBook version. I found this after converting the eventual DITA version of one of these documents to a PDF file with the DITA Open Toolkit and skimming through the original Word file and the new PDF to do a block-by-block comparison. (I strongly recommend this QA step if you&amp;rsquo;re doing this conversion with important content.) Many bulleted lists got converted to numbered lists as well, although I&amp;rsquo;m not sure if this was the fault of the Word to DocBook conversion or of a later stage described below. Another small issue is that when the original had more than one space character in a row, all but one got converted to hard spaces to maintain the spacing in XML. I just deleted all the hard spaces from the DITA version with a global replace, but you may want to keep them, depending on how the documents use them.&lt;/p&gt;
&lt;p&gt;Typical Word users add space between paragraphs by inserting an extra carriage return instead of adjusting the styles included with the document, so your output from this conversion step might have a lot of empty &lt;code&gt;para&lt;/code&gt; elements. You can delete these with a simple XSLT stylesheet or even a global replace in a text editor.&lt;/p&gt;
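&lt;p&gt;For the XSLT route, an identity transformation with one extra empty template is enough. (This assumes the unwanted &lt;code&gt;para&lt;/code&gt; elements contain nothing but optional whitespace.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version=&amp;quot;1.0&amp;quot;
  xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;&amp;gt;

  &amp;lt;!-- copy everything through unchanged --&amp;gt;
  &amp;lt;xsl:template match=&amp;quot;@*|node()&amp;quot;&amp;gt;
    &amp;lt;xsl:copy&amp;gt;&amp;lt;xsl:apply-templates select=&amp;quot;@*|node()&amp;quot;/&amp;gt;&amp;lt;/xsl:copy&amp;gt;
  &amp;lt;/xsl:template&amp;gt;

  &amp;lt;!-- drop para elements with no child elements and no text --&amp;gt;
  &amp;lt;xsl:template match=&amp;quot;para[not(*) and not(normalize-space())]&amp;quot;/&amp;gt;

&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;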
&lt;h2 id=&#34;id103394&#34;&gt;Adding the images&lt;/h2&gt;
&lt;p&gt;One annoying detail was that the DocBook files created by OpenOffice lack references to the images. When you save a Word file as an OpenOffice native odt (that is, zip) file, you can see that the content.xml file in there has simple, straightforward references to image files that are also in the zip file. The references look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;draw:frame draw:style-name=&amp;quot;fr1&amp;quot; draw:name=&amp;quot;graphics63&amp;quot; 
  text:anchor-type=&amp;quot;as-char&amp;quot; svg:width=&amp;quot;6.8972in&amp;quot; svg:height=&amp;quot;2.6264in&amp;quot; 
  draw:z-index=&amp;quot;49&amp;quot;&amp;gt;&amp;lt;draw:image 
  xlink:href=&amp;quot;Pictures/10000000000003430000013EC16739CA.png&amp;quot;
  xlink:type=&amp;quot;simple&amp;quot; xlink:show=&amp;quot;embed&amp;quot; 
  xlink:actuate=&amp;quot;onLoad&amp;quot;/&amp;gt;&amp;lt;/draw:frame&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I had created the original images in the Word file by pasting them from somewhere else, so the conversion of each to a standalone png file was a nice bonus.) OpenOffice&amp;rsquo;s Save as DocBook feature doesn&amp;rsquo;t save these image references; the DocBook 4.1.2 version of the above that it creates looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;inlinegraphic fileref=&amp;quot;embedded:graphics63&amp;quot; 
    width=&amp;quot;6.8972inch&amp;quot; depth=&amp;quot;2.6264inch&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that DocBook 5 &lt;a href=&#34;http://www.docbook.org/tdg/en/html/inlinegraphic.html&#34;&gt;deprecates&lt;/a&gt; the &lt;code&gt;inlinegraphic&lt;/code&gt; element.) After having no luck tinkering with the sofftodocbookheadings.xsl stylesheet that OpenOffice uses to create the DocBook file, I replaced its contents with an identity transformation to see what it was using as input. It turned out that it wasn&amp;rsquo;t using the original content.xml file mentioned above but some intermediary file that had replaced the &lt;code&gt;xlink:href&lt;/code&gt; value above with a child element that stored the actual content of the image, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;draw:image draw:style-name=&amp;quot;fr1&amp;quot; draw:name=&amp;quot;graphics63&amp;quot;
            text:anchor-type=&amp;quot;as-char&amp;quot; svg:width=&amp;quot;6.8972inch&amp;quot;
            svg:height=&amp;quot;2.6264inch&amp;quot; draw:z-index=&amp;quot;49&amp;quot;&amp;gt;
  &amp;lt;office:binary-data&amp;gt;iVBORw0KGgoAAAANSUhEUgAAA0MAAAE+CAIAAADAgVy 
   &amp;lt;!-- lots more data here--&amp;gt;&amp;lt;/office:binary-data&amp;gt;
&amp;lt;/draw:image&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
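&lt;p&gt;As a side note, the &lt;code&gt;office:binary-data&lt;/code&gt; content is just the image file itself, base64-encoded, so another way to recover a standalone image (an illustration, not the approach I describe next) is to decode that text directly:&lt;/p&gt;

```python
import base64

# The office:binary-data payload is the PNG file itself, base64-encoded;
# decoding it yields the same bytes as the file under Pictures/ in the
# odt zip. (Sample data truncated here for illustration.)
encoded = "iVBORw0KGgo="
image_bytes = base64.b64decode(encoded)
print(image_bytes[:4])  # b'\x89PNG' -- the PNG file signature
```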
&lt;p&gt;At least the &lt;code&gt;draw:name&lt;/code&gt; value of the &lt;code&gt;draw:image&lt;/code&gt; element&amp;rsquo;s parent &lt;code&gt;draw:frame&lt;/code&gt; element gets preserved in the DocBook output as the value of the &lt;code&gt;fileref&lt;/code&gt; attribute, so instead of digging into OpenOffice&amp;rsquo;s architecture to see what was preparing the input for sofftodocbookheadings.xsl and trying to fix that, I wrote a &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/getImageNameData.xsl&#34;&gt;getImageNameData.xsl&lt;/a&gt; stylesheet to pull the {&lt;code&gt;draw:name&lt;/code&gt;, &lt;code&gt;xlink:href&lt;/code&gt;} pairings from the original content.xml file. Then, I wrote an &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/addImageRefs.xsl&#34;&gt;addImageRefs.xsl&lt;/a&gt; stylesheet to look up the image filenames in the getImageNameData.xsl output and insert them into a new copy of the DocBook file.&lt;/p&gt;
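&lt;p&gt;If you would rather prototype that pairing outside of XSLT, here is a rough Python equivalent of what the two stylesheets do together. The element and attribute names come from the examples above; the regular expressions assume the simple one-image-per-frame markup shown there, so treat this as a sketch rather than a general ODF parser:&lt;/p&gt;

```python
import re

# A rough, regex-based equivalent of the two-stylesheet approach:
# collect {draw:name: xlink:href} pairs from content.xml, then rewrite
# fileref="embedded:NAME" values in the DocBook output.
def image_map(content_xml):
    pairs = {}
    pattern = r'draw:name="([^"]+)"[^>]*>.*?xlink:href="([^"]+)"'
    for m in re.finditer(pattern, content_xml, re.S):
        pairs[m.group(1)] = m.group(2)
    return pairs

def fix_filerefs(docbook_xml, pairs):
    def repl(m):
        name = m.group(1)
        if name in pairs:
            return 'fileref="%s"' % pairs[name]
        return m.group(0)  # leave unknown references untouched
    return re.sub(r'fileref="embedded:([^"]+)"', repl, docbook_xml)

content = ('<draw:frame draw:name="graphics63">'
           '<draw:image xlink:href="Pictures/10000000.png"/></draw:frame>')
docbook = '<inlinegraphic fileref="embedded:graphics63"/>'
print(fix_filerefs(docbook, image_map(content)))
# <inlinegraphic fileref="Pictures/10000000.png"/>
```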
&lt;h2 id=&#34;id103535&#34;&gt;DocBook to DITA&lt;/h2&gt;
&lt;p&gt;Eric Hennum describes a docbook2dita plugin for the DITA Open Toolkit in &lt;a href=&#34;http://markmail.org/message/gd4r4elqmmmqcb2w&#34;&gt;this posting&lt;/a&gt; on a DocBook list. My first attempt to use it from within the DITA Open Toolkit resulted in the errors discussed in a DITA group thread that ends with &lt;a href=&#34;http://tech.groups.yahoo.com/group/dita-users/message/14620&#34;&gt;this posting&lt;/a&gt; from Mark Peters, who came up with a very simple solution: instead of running the conversion as a plugin, just call the XSLT stylesheet included with the plugin directly and tell it where your input is and where the output should go. The basic form of the command line that he shows worked for me.&lt;/p&gt;
&lt;h2 id=&#34;id103570&#34;&gt;Testing it&lt;/h2&gt;
&lt;p&gt;The first test to pass was whether the result was valid to a DITA DTD, and that went fine. The second test was the big one: whether the HTML and PDF created from the document by the DITA Open Toolkit looked right. In general it did, except for the issues described above, which showed that a block-by-block comparison of each PDF with the original Word file is worth the trouble. If I had to do a large number of these conversions I&amp;rsquo;d dig deeper into the nested bulleted list and bulleted/numbered list issues in the hopes of reducing the need for this final manual step.&lt;/p&gt;
&lt;p&gt;So far, though, the automation steps that I found or put together are definitely saving me tons of potential manual work. I only had to do this to a few documents, so I didn&amp;rsquo;t mind executing each step one at a time, but if you want to use OpenOffice to convert a large number of documents, I wrote something in XML.com called &lt;a href=&#34;http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html&#34;&gt;Moving to OpenOffice: Batch Converting Legacy Documents&lt;/a&gt; a few years ago that should help.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By David Kelly on &lt;a href=&#34;#comment-2371&#34;&gt;November 20, 2009 12:47 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;While we haven&amp;rsquo;t documented it, Scriptorium did a web presentation not long ago on a similar conversion process using the same tools you use. I have put together an Ant script that controls the processing chain from beginning to end, and we also added a bursting script that takes the large DITA file output and creates individual topic files with a ditamap to hold them all together.&lt;/p&gt;
&lt;p&gt;Some processing we do in Ant includes fixing Unicode by wrapping &amp;amp;#x and ; around the 4-digit code from the \unnnn instances in Word. Also, I wrote an XSL script that fixes autonumbering in the OO XML document before it gets converted to DocBook. It uses an identity transform with an exception that looks for this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;text:list[contains(@text:style-name, &#39;Outline&#39;)]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;In the transform, it keeps the descendant text:h tags and discards the text:list tags. Autonumbered sections in Word cause the Docbook-to-DITA script not to pick up the headings, so no topics are output.&lt;/p&gt;
&lt;p&gt;I have used this process for 500-page Word documents, and it appears to be reliable, for the most part. Large tables slow it down considerably. Occasionally we run into Word styles that cause problems in the OO-to-DocBook conversion, so you are right, the results must be checked carefully. But as a conversion method, it sure beats cut and paste.&lt;/p&gt;
&lt;p&gt;Glad to see that great minds think alike!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2372&#34;&gt;November 20, 2009 12:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks David! I wrote out my notes to help others who may try something like this in the future, and your comments will definitely be a further help to them.&lt;/p&gt;
&lt;p&gt;By Jeroen Baten on &lt;a href=&#34;#comment-2446&#34;&gt;February 12, 2010 3:52 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;David, would you be kind enough to post the ant script itself? It would help me greatly in starting the toolchain!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
    </item>
    
    <item>
      <title>Simple semi-structured data entry</title>
      <link>https://www.bobdc.com/blog/simple-semi-structure-data-ent/</link>
      <pubDate>Wed, 11 Nov 2009 20:48:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/simple-semi-structure-data-ent/</guid>
      
      
<description><div>With RDF.</div><div>&lt;p&gt;When most people want to take notes on a collection of things, and they know that the notes will have some structure but they&amp;rsquo;re not sure about the nature of that structure just yet, they use a spreadsheet. For each thing that they take notes on, they add a new row; for each attribute of the things under review, they add a column. From an investment banker comparing potential investments to a scout leader planning a camping trip, the grid makes it easy for you to compare similar attributes of different things without forcing you to specify all of your attributes before starting your data entry like a more serious database application would.&lt;/p&gt;
&lt;p&gt;In theory, RDF is ideal for this, because you can assign any attribute name/value pair to any resource that you can identify with no requirement to plan it all in advance, but in practice, it&amp;rsquo;s rarely as easy as pouring names and numbers into a spreadsheet. I&amp;rsquo;ve often thought that it would be fun to build a freeform database program that lets people do data entry and make up new fields as they go along, all with RDF underneath. I even wrote some Python code for this a few years ago, but never followed through. Since joining TopQuadrant, I&amp;rsquo;ve wondered about assembling something like this with the company&amp;rsquo;s application development tools, but then I realized that the &lt;a href=&#34;http://www.topquadrant.com/products/TB_Composer.html#free&#34;&gt;Free Edition&lt;/a&gt; of TopBraid Composer pretty much already does this.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a use case that&amp;rsquo;s happened to most people in the modern workforce: you&amp;rsquo;re told that you&amp;rsquo;ll be joining a particular project, and to get you started someone emails a zip file of relevant files for you to review. For my notes on these files, I might create a text file or a spreadsheet, but I&amp;rsquo;d probably assemble an XML file where I made up element names as I went along. These elements would track the filename, document title, author, age, comments, and probably some project-specific fields. When the big picture started coming into focus, I&amp;rsquo;d write a little XSLT to convert this XML to presentable HTML to show to others if necessary.&lt;/p&gt;
&lt;p&gt;A key reason that this would be easy for me is that the Emacs &lt;a href=&#34;http://www.thaiopensource.com/nxml-mode/&#34;&gt;nxml mode&lt;/a&gt; automates much of the work of entering tags and keeping everything well-formed. How would doing it in RDF be better? I could do the same steps as above using &lt;a href=&#34;http://www.xml.com/pub/a/2002/10/30/rdf-friendly.html&#34;&gt;RDF-Friendly XML&lt;/a&gt; and nxml&amp;rsquo;s excellent handling of RDF/XML, but I&amp;rsquo;d rather use a form-based interface instead of Emacs. This is where the free edition of TopBraid Composer comes in.&lt;/p&gt;
&lt;p&gt;The first step is creating an RDF data file with all the easily available file metadata: the name, size, and last modification date for each file. I wrote a simple perl script called &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/dir2rdf-pl.txt&#34;&gt;dir2rdf.pl&lt;/a&gt; to do this; it&amp;rsquo;s simple because it declares a File class and all the properties for that class in the namespace declared for the file. (I also created a slightly more complex perl script called &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/dir2nfordf-pl.txt&#34;&gt;dir2nfordf.pl&lt;/a&gt; which does the same thing but uses existing classes and properties from the &lt;a href=&#34;http://www.semanticdesktop.org/ontologies/2007/03/22/nfo/&#34;&gt;NEPOMUK File Ontology&lt;/a&gt;. It&amp;rsquo;s more complex because this ontology has properties based on properties from other vocabularies such as Dublin Core, so editing data with this ontology means pulling in a few layers of other ones.)&lt;/p&gt;
&lt;p&gt;When you pipe the result of the Windows &lt;code&gt;dir&lt;/code&gt; command into the simpler perl script, it outputs the property and class definitions for the files and an entry like this for each file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;File rdf:ID=&#39;file11&#39; sd:lastModified=&#39;2009-10-30T17:05:00&#39;
        sd:fileName=&#39;teams.csv&#39; sd:fileSize=&#39;164&#39; rdfs:comment=&#39;&#39;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
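&lt;p&gt;For illustration, a Python equivalent of one of these entries might be generated like this. The &lt;code&gt;rdf:ID&lt;/code&gt; and &lt;code&gt;sd:&lt;/code&gt; property names are copied from the output above; this is a sketch, not the actual perl script:&lt;/p&gt;

```python
import os
import time

# An illustrative equivalent of one dir2rdf.pl entry: a File element per
# file, using the rdf:ID and sd: property names shown above.
def file_entry(path, ident):
    st = os.stat(path)
    modified = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(st.st_mtime))
    return ("  <File rdf:ID='%s' sd:lastModified='%s'\n"
            "        sd:fileName='%s' sd:fileSize='%d' rdfs:comment=''/>"
            % (ident, modified, os.path.basename(path), st.st_size))

# Demo: create a small file and describe it.
with open("teams.csv", "w") as f:
    f.write("name,city\n")
print(file_entry("teams.csv", "file11"))
os.remove("teams.csv")
```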
&lt;p&gt;Loaded into the free edition of TopBraid Composer, the editing of that &amp;ldquo;record&amp;rdquo; looks like this (I&amp;rsquo;ve rearranged the combination of screen sections a bit from the default TopBraid &amp;ldquo;perspective&amp;rdquo;, to use the Eclipse parlance):&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/img/main/rdfdataentry1.jpg&#34;&gt;&lt;img id=&#34;id103434&#34; src=&#34;https://www.bobdc.com/img/main/rdfdataentry1.jpg&#34; alt=&#34;TopBraid Composer screen shot&#34; width=&#34;500&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I can edit the values on this form, although there&amp;rsquo;s no reason to edit the file name, size, or last modified values. What I&amp;rsquo;m really going to do is add notes to the rdfs:comment property, as I&amp;rsquo;ve already done above, and perhaps add more comment properties for this resource. The really nice part is that I can define new properties in the Properties view on the right—for example, some project-specific subproperties of rdfs:comment—drag them onto the form for any of my File resources, and then add values to them, giving me the functional equivalent of adding new columns to a spreadsheet.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s actually better than that, because if I wanted to add three contactWithQuestions names to one of these File resources on a spreadsheet grid, I&amp;rsquo;d have to either add three columns or string together three values in one spreadsheet cell as if they were one. With RDF, though, I can define a contactWithQuestions property and then add three separate values for this property to the same resource. Moving beyond the use of simple string data for the values here, I could create object properties (properties where the value is another resource—in this case, to define relationships between File objects such as mentionedIn or basedOn) by defining them in the Properties view on the right with a range of File. When I want to assign one of these properties to a particular File object, I would drag it from the property list on the right onto the Resource form for that File and then pick out the appropriate file it refers to from a drop-down list. For example, after creating a mentionedIn property, if teams.csv was mentioned in index.html and I wanted to record this in my notes on teams.csv, I&amp;rsquo;d drag the mentionedIn property onto the Resource Form for teams.csv and select index.html as the value for that property.&lt;/p&gt;
&lt;p&gt;Because this is a GUI editing interface, I can also add and delete new File resources (the equivalent of inserting and deleting rows on a spreadsheet) by clicking icons on the Instances view at the bottom. (Another nice bonus with TopBraid Composer is the SPARQL tab next to that, where you can enter and run SPARQL queries about the data.)&lt;/p&gt;
&lt;p&gt;So, I&amp;rsquo;ve got my form-driven interface that I can use with any RDF data. I&amp;rsquo;ve kept my address book in RDF for a long time; maybe I should try maintaining it like this instead of with Emacs.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Up and running with Mercurial</title>
      <link>https://www.bobdc.com/blog/up-and-running-with-mercurial/</link>
      <pubDate>Mon, 26 Oct 2009 09:50:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/up-and-running-with-mercurial/</guid>
      
      
      <description><div>Quick and easy.</div><div>&lt;p&gt;&lt;a href=&#34;http://mercurial.selenic.com/wiki/&#34;&gt;&lt;img id=&#34;id103283&#34; src=&#34;http://www.selenic.com/hg-logo/logo-droplets-200.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;100px&#34; alt=&#34;mercurial logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve used the cvs and svn version control systems for both work-related and personal projects. For personal work, I used svn in particular more as a backup program, with the added benefit of the version control. Keeping my repository on a thumb drive made it easy to perform the backups when traveling, but perhaps because of sloppiness in removing the thumb drive without clicking the right icons first, my repository got corrupted too often, so I gave up.&lt;/p&gt;
&lt;p&gt;I decided to try again with &lt;a href=&#34;http://mercurial.selenic.com/wiki/&#34;&gt;Mercurial&lt;/a&gt; and was shocked at how quickly I was able to learn it and get it to do everything I wanted—about an hour. &lt;a href=&#34;http://importantshock.wordpress.com/2008/08/07/git-vs-mercurial/&#34;&gt;This blog posting&lt;/a&gt; convinced me to try it before &lt;a href=&#34;http://git-scm.com/&#34;&gt;git&lt;/a&gt;, which sounds fascinating but a bit more complicated. By keeping my repository on the local drive and using the clone feature to keep backup copies of the repository elsewhere, I can redo a backup if a thumb drive version gets messed up.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://mercurial.selenic.com/wiki/QuickStart&#34;&gt;Mercurial Quick Start&lt;/a&gt; lives up to its name, and I kept some notes as I went along to provide my own Mercurial quick reference:&lt;/p&gt;
&lt;hr /&gt;
&lt;pre&gt;&lt;code&gt;hg init                               Turn current directory into a project.
hg add                                Add files in current directory to repository.
hg ci -m &amp;quot;comment about this commit&amp;quot;  Commit recent changes to repository.
hg clone . e:\otherCopy               Create a clone of the current directory&#39;s repository somewhere else.
hg push e:\otherCopy                  Send recent changes in this directory&#39;s repository to a clone repository (that is, back up the changes here to there).
hg update                             (entered from within the e:\otherCopy directory) Make clone directory&#39;s contents reflect recent changes to clone repository.
hg log test1.txt                      List comments (see -m above) for each of test1.txt file&#39;s changes.
hg revert -r 1 test1.txt              Revert file test1.txt to revision 1. (You can then &amp;quot;revert&amp;quot; it to later versions.)
hg cat -r 2 test1.txt                 Look at version 2 of test1.txt.
hg locate foo                         List files in repository with &amp;quot;foo&amp;quot; in their names.
&lt;/code&gt;&lt;/pre&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;del&gt;One other note: an .hgignore file tells hg which files to ignore, and putting separate .htignore files in subdirectories of your main project directory works fine.&lt;/del&gt;&lt;/p&gt;
&lt;p&gt;I once had &lt;a href=&#34;https://www.bobdc.com/blog/dam-subversion-rdf-owl&#34;&gt;grand ideas&lt;/a&gt; about hooking up a version control system that can assign arbitrary metadata with an RDF triplestore to form the basis of some sort of CMS demo. Mercurial &lt;a href=&#34;http://markmail.org/thread/h66comox2nf4koay#query:mercurial%20metadata+page:1+mid:h66comox2nf4koay+state:results&#34;&gt;isn&amp;rsquo;t much help here&lt;/a&gt;, but when I prioritize the tasks &amp;ldquo;back up my stuff&amp;rdquo; and &amp;ldquo;build a demo CMS around a version control system&amp;rdquo; the former is clearly much more important. Maybe someday&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://norman.walsh.name/&#34; title=&#34;http://norman.walsh.name/&#34;&gt;Norman Walsh&lt;/a&gt; on &lt;a href=&#34;#comment-2355&#34;&gt;October 26, 2009 7:00 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Submitted without comment: &lt;a href=&#34;http://www.whygitisbetterthanx.com/&#34;&gt;http://www.whygitisbetterthanx.com/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2356&#34;&gt;October 26, 2009 7:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In my entry above I wrote that I &amp;ldquo;was shocked at how quickly I was able to learn it and get it to do everything I wanted—about an hour.&amp;rdquo; I exaggerated a bit; it was closer to an hour and a half. On September 24th at about 11:30 AM GMT I told Norm &amp;ldquo;I&amp;rsquo;m going to try Mercurial now&amp;rdquo;. At about 1 PM I told him &amp;ldquo;I tried it and really liked it&amp;rdquo;. And now he tells me about whygitisbetterthanx?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dirkjan.ochtman.nl/&#34; title=&#34;http://dirkjan.ochtman.nl/&#34;&gt;Dirkjan Ochtman&lt;/a&gt; on &lt;a href=&#34;#comment-2359&#34;&gt;October 27, 2009 4:40 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Don&amp;rsquo;t believe the git hype&amp;hellip;&lt;/p&gt;
&lt;p&gt;BTW, you name &amp;lsquo;.htignore&amp;rsquo; here, where I believe you mean &amp;lsquo;.hgignore&amp;rsquo;, and actually putting those in subdirectories of a repo doesn&amp;rsquo;t work (not sure why you have the impression that it does).&lt;/p&gt;
&lt;p&gt;Actually Mercurial&amp;rsquo;s filelog structure has a neat metadata mechanism built-in where you could store versioned metadata per file, and the changelog (where the changeset history graph is stored) can also hold arbitrary bits of metadata, so you might be able to leverage those capabilities to do some interesting things.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2360&#34;&gt;October 27, 2009 9:53 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Dirkjan. I fixed the .hgignore filename and then struck out that sentence. I guess my opinion was based on a quick test where I wasn&amp;rsquo;t paying enough attention.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2366&#34;&gt;November 11, 2009 8:11 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is a test&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Blogging on TopQuadrant&#39;s Blog</title>
      <link>https://www.bobdc.com/blog/blogging-on-topquadrants-blog/</link>
      <pubDate>Wed, 14 Oct 2009 19:59:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/blogging-on-topquadrants-blog/</guid>
      
      
      <description><div>In addition to here.</div><div>&lt;p&gt;I just added my first entry to TopQuadrant&amp;rsquo;s blog, &lt;a href=&#34;http://topquadrantblog.blogspot.com/&#34;&gt;Voyages of the Semantic Enterprise&lt;/a&gt;. It&amp;rsquo;s called &lt;a href=&#34;http://topquadrantblog.blogspot.com/2009/10/spin-tutorial-available.html&#34;&gt;SPIN Tutorial Available&lt;/a&gt;, and describes the tutorial I recently finished writing on using the SPARQL Inferencing Notation with TopBraid Composer.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll be adding more to that blog in the future and certainly continuing with this one here, keeping the entries that focus on TopQuadrant technology over there. I&amp;rsquo;ll put the others—including more general interest entries on the semantic web, SPARQL, and RDF—right here.&lt;/p&gt;
&lt;p&gt;Thanks for reading either!&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://piershollott.blogspot.com&#34; title=&#34;http://piershollott.blogspot.com&#34;&gt;Piers Hollott&lt;/a&gt; on &lt;a href=&#34;#comment-2351&#34;&gt;October 15, 2009 4:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;But will your entries on TopQuadrant&amp;rsquo;s blog be marked up with Dublin Core RDFa? (ah, blogspot)&lt;/p&gt;
&lt;p&gt;Looking forward to more on either venue. It&amp;rsquo;s great that TopQuadrant is encouraging their consultants to blog. Kudos!&lt;/p&gt;
&lt;p&gt;Cheers,&lt;br /&gt;
Piers&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/blogging-about-blogging">blogging about blogging</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>A rules language for RDF</title>
      <link>https://www.bobdc.com/blog/a-rules-language-for-rdf/</link>
      <pubDate>Thu, 01 Oct 2009 12:45:05 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-rules-language-for-rdf/</guid>
      
      
      <description><div>Right under our noses.</div><div>&lt;p&gt;Last May, in &lt;a href=&#34;https://www.bobdc.com/blog/adding-semantics-to-make-data&#34;&gt;Adding semantics to make data more valuable: the secret revealed&lt;/a&gt;, I showed how storing a little bit of semantics about the word &amp;ldquo;spouse&amp;rdquo;—the fact that it&amp;rsquo;s a symmetric property (that is, that if A is the spouse of B, then B is the spouse of A)—let me look up someone&amp;rsquo;s home phone number in my address book even if my entry for him there lacks his home phone number. I like this story because unlike biotech and some of the other popular domains for Semantic Web technology, everyone has an address book and understands the basic properties of an entry: first name, last name, email address, and so forth. (Because so many people have lived through the annoyances of moving their contact information from one email client or phone to another, address books also provide nice use cases for data integration issues.)&lt;/p&gt;
&lt;p&gt;Back then, I wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With software that understands an OWL expression stating that &lt;code&gt;spouse&lt;/code&gt; is a symmetric property and a rule I define to say that spouses have the same home phone number, I can retrieve Leroy&amp;rsquo;s home phone number&amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OWL is great for defining the symmetry, but I glossed over the part about defining the fact that spouses have the same phone number. How do you define such a rule? n3 has a &lt;a href=&#34;http://www.w3.org/2000/10/swap/doc/Rules&#34;&gt;rules language&lt;/a&gt;, but I haven&amp;rsquo;t seen it used much as the n3 subset known as &lt;a href=&#34;http://www.w3.org/TeamSubmission/turtle/&#34;&gt;Turtle&lt;/a&gt; (which leaves out such things) becomes more popular. Instead of defining a Semantic Web rules language, the W3C has decided to have the &lt;a href=&#34;http://www.w3.org/2005/rules/wiki/RIF_Working_Group&#34;&gt;Rules Interchange Format Working Group&lt;/a&gt; standardize an interchange format between the &lt;a href=&#34;http://www.w3.org/2005/rules/wg/wiki/List_of_Rule_Systems&#34;&gt;many rules languages&lt;/a&gt; out there. (The &lt;a href=&#34;http://ontolog.cim3.net/file/resource/presentation/ChrisWelty_20080612/W3C-Rules-Interchange-Format--ChrisWelty_20080612.ppt&#34;&gt;W3C Rules Interchange Format Basic Logic Dialect&lt;/a&gt; PowerPoint presentation by WG co-chair Chris Welty provides good historical background.)&lt;/p&gt;
&lt;blockquote id=&#34;id103392&#34; class=&#34;pullquote&#34;&gt;I can write a query that generates the triples I want to infer and call this query a &#34;rule&#34;, but what do I do with it? &lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve used a proprietary RDF rules language before, and was wondering if a standard one would come along. Some colleagues at TopQuadrant have shown me that we all have a straightforward, standardized RDF rules language right under our noses: SPARQL. I&amp;rsquo;ve been appreciating SPARQL&amp;rsquo;s CONSTRUCT form &lt;a href=&#34;https://www.bobdc.com/blog/appreciating-sparql-construct&#34;&gt;more lately&lt;/a&gt;, and CONSTRUCT is the key here: like a SELECT statement, a CONSTRUCT statement defines conditions about which pieces of which triples to retrieve, but unlike SELECT, a CONSTRUCT statement assembles these into new triples. If we view a CONSTRUCT statement as the definition of a rule and the resulting new triples as the result of the execution of the rule, then we have a rules language and plenty of implementations of it available.&lt;/p&gt;
&lt;p&gt;For example, the following SPARQL &amp;ldquo;rule&amp;rdquo; says that if &lt;code&gt;?person1&lt;/code&gt; has the spouse &lt;code&gt;?person2&lt;/code&gt; and the home telephone number &lt;code&gt;?phoneNum&lt;/code&gt;, then &lt;code&gt;?person2&lt;/code&gt; also has the home telephone number &lt;code&gt;?phoneNum&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX  : &amp;lt;http://www.snee.com/ns/demo#&amp;gt;
PREFIX v: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt;


CONSTRUCT { ?person2 v:homeTel ?phoneNum . }
WHERE {
  ?person1 :spouse   ?person2 ;
           v:homeTel ?phoneNum .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When run with the following data (for the purposes of this demo, assume that the {:leroy :spouse :loretta} triple was generated by an OWL reasoner that saw {:loretta :spouse :leroy} and knew that :spouse was symmetrical),&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix  : &amp;lt;http://www.snee.com/ns/demo#&amp;gt; .
@prefix v: &amp;lt;http://www.w3.org/2006/vcard/ns#&amp;gt; .
:loretta :spouse   :leroy ;
         v:homeTel &amp;quot;434-923-9321&amp;quot; .
:leroy   v:workTel &amp;quot;434-932-5329&amp;quot; ;
         :spouse   :loretta .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;it generates the triple {:leroy v:homeTel &amp;ldquo;434-923-9321&amp;rdquo;}.&lt;/p&gt;
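&lt;p&gt;The rule idea does not depend on any particular SPARQL engine. As an illustration (a sketch with triples as plain Python tuples rather than real RDF terms), the same inference is just a pattern match that emits new triples:&lt;/p&gt;

```python
# Triples as plain tuples; the "rule" plays the role of the CONSTRUCT
# query: match {?p1 spouse ?p2 . ?p1 homeTel ?n}, emit {?p2 homeTel ?n}.
data = {
    ("loretta", "spouse", "leroy"),
    ("loretta", "homeTel", "434-923-9321"),
    ("leroy", "workTel", "434-932-5329"),
    ("leroy", "spouse", "loretta"),
}

def spouse_hometel_rule(triples):
    inferred = set()
    for s, p, o in triples:
        if p == "spouse":
            for s2, p2, o2 in triples:
                if s2 == s and p2 == "homeTel":
                    inferred.add((o, "homeTel", o2))
    return inferred - triples  # only the genuinely new triples

print(spouse_hometel_rule(data))  # {('leroy', 'homeTel', '434-923-9321')}
```

&lt;p&gt;Running such rules repeatedly until no new triples appear is the forward chaining that rule engines automate.&lt;/p&gt;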
&lt;p&gt;OK, so I can write a query that generates the triples I want to infer and call this query a &amp;ldquo;rule&amp;rdquo;, but what do I do with it? What makes it a rule about a particular set of data?&lt;/p&gt;
&lt;p&gt;Holger Knublauch, a co-worker of mine who designed and developed the OWL plugin for &lt;a href=&#34;http://protege.stanford.edu/&#34;&gt;Protégé&lt;/a&gt; before coming to TopQuadrant, recently wrote an RDF vocabulary called SPIN (&amp;ldquo;SPARQL Inferencing Notation&amp;rdquo;), which—among other things—can express associations between these rules and classes. So, for example, if the blank node rdf:_1 pointed to the query above, the following triple would associate this query rule to the v:Address class:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  v:Address spin:rule rdf:_1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To make the storage of the SPARQL rule in a triplestore even cleaner, Holger has implemented a way to &lt;a href=&#34;http://spinrdf.org/sp.html&#34;&gt;store SPARQL queries as triples&lt;/a&gt;, and he&amp;rsquo;s written the code to roundtrip between this and the standard text version. (See the &lt;a href=&#34;http://sparqlpedia.org/spinrdfconverter.html&#34;&gt;SPARQL Text to SPIN RDF Syntax Converter&lt;/a&gt; for an online converter, and see &lt;a href=&#34;http://www.spinrdf.org/&#34;&gt;spinrdf.org&lt;/a&gt; for more about what else the SPIN vocabulary can do, especially his blog entries as he developed it. I&amp;rsquo;m now finishing up a tutorial for the use of SPIN features in TopQuadrant products, and except for one optional step of the tutorial, it all works with the free version of TopBraid Composer.)&lt;/p&gt;
&lt;p&gt;When you take it a little further, symmetrical properties and many other parts of OWL can also be implemented with SPARQL queries, and there&amp;rsquo;s a lot going on among those who are doing this to find a sweet spot between RDFS and OWL Full that meets typical business needs without using a lot of processing power or dollars.&lt;/p&gt;
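&lt;p&gt;To make the &amp;ldquo;query as rule&amp;rdquo; idea concrete outside of any particular toolkit, here is a minimal plain-Python sketch. This is not SPIN or SPARQL itself; the plain strings standing in for the :spouse and v:homeTel properties and the forward_chain helper are illustrative assumptions. A rule is just a function that matches a pattern over a set of triples and yields new ones, and forward chaining applies the rules until nothing new appears.&lt;/p&gt;

```python
# A "rule" here is a query-like function over a set of (s, p, o) tuples.
# The strings "spouse" and "homeTel" are stand-ins for the :spouse and
# v:homeTel properties in the Turtle example above.

def spouse_tel_rule(triples):
    """If A's spouse is B and B has a home phone, infer it for A too."""
    inferred = set()
    for s, p, o in triples:
        if p == "spouse":
            for s2, p2, o2 in triples:
                if s2 == o and p2 == "homeTel":
                    inferred.add((s, "homeTel", o2))
    return inferred

def forward_chain(triples, rules):
    """Apply every rule repeatedly until no new triples appear (a fixpoint)."""
    triples = set(triples)
    while True:
        new = set()
        for rule in rules:
            new |= rule(triples) - triples
        if not new:
            return triples
        triples |= new

data = {
    ("loretta", "spouse", "leroy"),
    ("loretta", "homeTel", "434-923-9321"),
    ("leroy", "spouse", "loretta"),
    ("leroy", "workTel", "434-932-5329"),
}
result = forward_chain(data, [spouse_tel_rule])
print(("leroy", "homeTel", "434-923-9321") in result)  # prints True
```

&lt;p&gt;Running it prints True: the rule inferred leroy&amp;rsquo;s home phone number from loretta&amp;rsquo;s, just as the CONSTRUCT-based rule does.&lt;/p&gt;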
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2341&#34;&gt;October 1, 2009 3:57 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Yes, SPARQL CONSTRUCT is a rule language in its own right for controlled &amp;ldquo;forward chaining&amp;rdquo;. The optical illusion that many have missed is this: Other Rules Languages have Head and Body on a Horizontal Plane, while SPARQL CONSTRUCT&amp;rsquo;s plane is vertical :-)&lt;/p&gt;
&lt;p&gt;SPIN is a neat formalization of the basic concept via a controlled vocabulary; certainly something we will use, as yet another mechanism for showcasing this aspect of SPARQL, esp. in the Virtuoso Sponger Middleware, which is already a constrained forward-chaining mechanism within the general non-RDF to RDF processing pipeline.&lt;/p&gt;
&lt;p&gt;Kingsley&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2342&#34;&gt;October 1, 2009 5:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For a coincidental post about this same concept, query language as rules language, in another context, see &lt;a href=&#34;http://www.furia.com/page.cgi?type=log&amp;amp;id=330&#34;&gt;http://www.furia.com/page.cgi?type=log&amp;amp;id=330&lt;/a&gt; .&lt;/p&gt;
&lt;p&gt;Kingsley, what does your horizontal/vertical comment mean? What other Rules Languages, and what do &amp;ldquo;horizontal&amp;rdquo; and &amp;ldquo;vertical&amp;rdquo; mean here?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2343&#34;&gt;October 1, 2009 5:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oh, and what makes this idea inherently a forward-chaining solution? Seems to me that structurally you can evaluate your query for all data ahead of time, or for individual nodes when asked. There are various reasons to want one or the other in particular uses, but I don&amp;rsquo;t immediately see how the expression of the rule in a query-language is better or worse suited for one direction.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.wikier.org/&#34; title=&#34;http://www.wikier.org/&#34;&gt;Sergio Fernández&lt;/a&gt; on &lt;a href=&#34;#comment-2345&#34;&gt;October 2, 2009 2:51 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Check:&lt;/p&gt;
&lt;p&gt;Axel Polleres. From SPARQL to rules (and back). In Proceedings of the 16th World Wide Web Conference (WWW2007), pages 787-796, Banff, Canada, May 2007. ACM Press. Extended technical report version available at &lt;a href=&#34;http://www.polleres.net/TRs/GIA-TR-2006-11-28.pdf&#34;&gt;http://www.polleres.net/TRs/GIA-TR-2006-11-28.pdf&lt;/a&gt;, slides available at &lt;a href=&#34;http://www.polleres.net/publications/poll-2007www-slides.pdf&#34;&gt;http://www.polleres.net/publications/poll-2007www-slides.pdf&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By Diana Roberts on &lt;a href=&#34;#comment-2347&#34;&gt;October 12, 2009 11:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;But isn&amp;rsquo;t it a problem that SPARQL rules, once immortalized, define a reality that can easily become outdated? The example you used here is a case in point.&lt;/p&gt;
&lt;p&gt;Several of my younger married friends in fact do not share the same telephone number because they don&amp;rsquo;t have a land line at home and instead use their own mobile numbers. What mechanisms are there or could there be to make sure that the semantic web keeps up with the changing times?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2348&#34;&gt;October 12, 2009 12:45 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re concerned about potential issues around the mapping of reality to a rule set&amp;ndash;a perfectly reasonable concern&amp;ndash;then you want to avoid rule set specifications that rely on binary formats (e.g. compiled code) or on proprietary rule languages that vendors made up themselves. SPARQL, being a W3C standard, scores well on both counts.&lt;/p&gt;
&lt;p&gt;Because the SPIN approach treats the SPARQL queries as more data to manage, adding, removing, and modifying the rules is as straightforward as doing the same with the data they query. There&amp;rsquo;s no need to immortalize anything.&lt;/p&gt;
&lt;p&gt;In fact, the semantic web model is often more adaptable than others because of the greater ease of schema modification than you&amp;rsquo;ll find with relational databases or XML.&lt;/p&gt;
&lt;p&gt;So, a SPARQL-based system can do a fine job of keeping up with the changing times.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Converting wpl playlists to m3u playlists</title>
      <link>https://www.bobdc.com/blog/converting-wpl-playlists-to-m3/</link>
      <pubDate>Fri, 18 Sep 2009 22:25:15 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/converting-wpl-playlists-to-m3/</guid>
      
      
      <description><div>Simple XML in, simple text out, but no good search results for wpl2m3u? Write a little XSLT.</div><div>&lt;blockquote id=&#34;id103280&#34; class=&#34;pullquote&#34;&gt;After taking a closer look at the WPL format I realized that an XSLT stylesheet to convert it to M3U would be very simple.&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve switched around between music-playing programs over the last few years. I suppose I should call them &amp;ldquo;media players&amp;rdquo;, but I only use them to play music, which is part of the reason I ended up using &lt;a href=&#34;http://getsongbird.com/&#34;&gt;Songbird&lt;/a&gt;, an open source Windows/Linux/Mac music front end that doesn&amp;rsquo;t pretend to be anything else. It looks a bit like iTunes, without all the ads in your face; how great is that?&lt;/p&gt;
&lt;p&gt;Before that I used &lt;a href=&#34;http://www.mediamonkey.com/&#34;&gt;MediaMonkey&lt;/a&gt;, and before that, the Windows Media Player. Guess which of these uses the most standardized, XML-based format for playlists? Surprise: the Microsoft one.&lt;/p&gt;
&lt;p&gt;Windows Media Player can create WPL files, which seem to conform to the W3C &lt;a href=&#34;http://www.w3.org/TR/REC-smil/&#34;&gt;SMIL&lt;/a&gt; standard, and it can export M3U files, which MediaMonkey uses. To convert WPL files to m3u for Songbird, reading them individually into Windows Media Player and exporting them one at a time was annoying. I did some web searches for wpl2m3u and only found one script that I couldn&amp;rsquo;t quite follow, and after taking a closer look at the WPL format I realized that an XSLT stylesheet to convert it to M3U would be very simple. So here it is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version=&amp;quot;1.0&amp;quot; xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;&amp;gt;


  &amp;lt;xsl:strip-space elements=&amp;quot;*&amp;quot;/&amp;gt;
  &amp;lt;xsl:output method=&amp;quot;text&amp;quot;/&amp;gt;


  &amp;lt;xsl:template name=&amp;quot;textAfterLastSlash&amp;quot;&amp;gt;&amp;lt;!-- but actually backslash --&amp;gt;
    &amp;lt;xsl:param name=&amp;quot;string&amp;quot;&amp;gt;dummy string&amp;lt;/xsl:param&amp;gt;
    &amp;lt;xsl:choose&amp;gt;
      &amp;lt;xsl:when test=&amp;quot;not(contains($string,&#39;\&#39;))&amp;quot;&amp;gt;
        &amp;lt;xsl:value-of select=&amp;quot;$string&amp;quot;/&amp;gt;
      &amp;lt;/xsl:when&amp;gt;
      &amp;lt;xsl:otherwise&amp;gt;
        &amp;lt;xsl:call-template name=&amp;quot;textAfterLastSlash&amp;quot;&amp;gt;
          &amp;lt;xsl:with-param name=&amp;quot;string&amp;quot; select=&amp;quot;substring-after($string,&#39;\&#39;)&amp;quot;/&amp;gt;
        &amp;lt;/xsl:call-template&amp;gt;
      &amp;lt;/xsl:otherwise&amp;gt;
    &amp;lt;/xsl:choose&amp;gt;
  &amp;lt;/xsl:template&amp;gt;


  &amp;lt;xsl:template match=&amp;quot;smil&amp;quot;&amp;gt;
    &amp;lt;xsl:text&amp;gt;#EXTM3U&amp;amp;#10;&amp;lt;/xsl:text&amp;gt;
    &amp;lt;xsl:apply-templates/&amp;gt;
  &amp;lt;/xsl:template&amp;gt;


  &amp;lt;xsl:template match=&amp;quot;media&amp;quot;&amp;gt;
    &amp;lt;xsl:text&amp;gt;#EXTINF:0,&amp;lt;/xsl:text&amp;gt;
    &amp;lt;xsl:call-template name=&amp;quot;textAfterLastSlash&amp;quot;&amp;gt;
      &amp;lt;xsl:with-param name=&amp;quot;string&amp;quot; select=&amp;quot;@src&amp;quot;/&amp;gt;
    &amp;lt;/xsl:call-template&amp;gt;
    &amp;lt;xsl:text&amp;gt;&amp;amp;#10;&amp;lt;/xsl:text&amp;gt;
    &amp;lt;xsl:value-of select=&amp;quot;@src&amp;quot;/&amp;gt;
    &amp;lt;xsl:text&amp;gt;&amp;amp;#10;&amp;amp;#10;&amp;lt;/xsl:text&amp;gt;
  &amp;lt;/xsl:template&amp;gt;


  &amp;lt;xsl:template match=&amp;quot;title&amp;quot;/&amp;gt;


&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&amp;rsquo;s not very long, but if you want fancy XSLT, I have a recursive named template, which I wrote for something else but modified here to look for the text after the last backslash. The &amp;amp;#10; is a trick I&amp;rsquo;ve used more lately to get XSLT to output a newline, because if I put an actual newline inside an xsl:text element like I always did before, telling Emacs to re-indent the whole thing tends to screw that up.&lt;/p&gt;
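&lt;p&gt;If you would rather do the same conversion without an XSLT processor, here is a rough Python equivalent using only the standard library. It is a sketch, not a drop-in replacement: the two-track playlist built in the script is made up for illustration, and real WPL files may carry processing instructions or metadata that this ignores.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

def wpl_to_m3u(wpl_text):
    """Convert a WPL (SMIL-style) playlist document to Extended M3U text."""
    root = ET.fromstring(wpl_text)
    lines = ["#EXTM3U"]
    for media in root.iter("media"):
        src = media.get("src", "")
        title = src.rsplit("\\", 1)[-1]  # text after the last backslash
        lines.append("#EXTINF:0," + title)
        lines.append(src)
    return "\n".join(lines) + "\n"

# Build a hypothetical two-track playlist in code to keep the sketch short:
smil = ET.Element("smil")
seq = ET.SubElement(ET.SubElement(smil, "body"), "seq")
ET.SubElement(seq, "media", src=r"C:\Music\lata\song1.mp3")
ET.SubElement(seq, "media", src=r"C:\Music\lata\song2.mp3")
print(wpl_to_m3u(ET.tostring(smil, encoding="unicode")))
```

&lt;p&gt;The output has the same #EXTM3U/#EXTINF shape as the stylesheet produces: one title line and one path line per track.&lt;/p&gt;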
&lt;p&gt;With a long plane ride tomorrow night to go to Oxford for the &lt;a href=&#34;http://www.xmlsummerschool.com/&#34;&gt;XML Summer School&lt;/a&gt;, I want to load up the MP3 player with something conducive to sleeping, so I just converted my playlist of &lt;a href=&#34;http://en.wikipedia.org/wiki/Lata_Mangeshkar&#34;&gt;Lata Mangeshkar&lt;/a&gt; ballads so that I can put that on. (If you like classic Bollywood soundtracks, check out &lt;a href=&#34;http://thirdfloormusic.blogspot.com/&#34;&gt;Music from the Third Floor&lt;/a&gt;; if you&amp;rsquo;re new to it and interested, start with the &lt;a href=&#34;http://thirdfloormusic.blogspot.com/search/label/Compilations&#34;&gt;compilations&lt;/a&gt; there.)&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By V1AN1 on &lt;a href=&#34;#comment-2869&#34;&gt;May 13, 2011 7:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This was very helpful!&lt;/p&gt;
&lt;p&gt;Here are some instructions to use this sucker:&lt;/p&gt;
&lt;p&gt;Download the Saxon HE XSLT converter:&lt;br /&gt;
&lt;a href=&#34;http://sourceforge.net/projects/saxon/files/Saxon-HE/9.3/saxonhe9-3-0-5j.zip/download&#34;&gt;http://sourceforge.net/projects/saxon/files/Saxon-HE/9.3/saxonhe9-3-0-5j.zip/download&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Now create a folder called wpltom3u (or of your choosing) and go into that folder.&lt;br /&gt;
Create two additional folders, one titled wpl and another titled m3u.&lt;br /&gt;
Now, put the .jar of Saxon XSLT into the default (wpltom3u) folder. MAKE SURE YOU HAVE JAVA INSTALLED!&lt;br /&gt;
Now create a text file in the default (wpltom3u) directory and rename it to style.xsl, then edit it with notepad and paste in the XSLT code posted above.&lt;br /&gt;
Create another file and rename it to convert.bat.&lt;br /&gt;
Now edit convert.bat and put in the following code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for %%a in (wpl/*.wpl) do java -jar saxon9he.jar &amp;quot;wpl/%%a&amp;quot; &amp;quot;style.xsl&amp;quot; &amp;gt; &amp;quot;m3u/%%~na.m3u&amp;quot;
pause
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save it and exit. Now put all your .wpl playlist files into the wpl folder, and hit Convert! PRESTO! You now have all your wpl playlists converted to m3u. :D&lt;/p&gt;
&lt;p&gt;It may take a while to convert and rename everything, but you won&amp;rsquo;t have to do anything but double-click and wait. :)&lt;/p&gt;
&lt;p&gt;Hope this helps people.&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
Ah, and a side note:&lt;br /&gt;
some playlists fail to convert because of their names. If any do, don&amp;rsquo;t reconvert all your playlists; this code isn&amp;rsquo;t that smart. Just single out the ones that didn&amp;rsquo;t convert, rename them to have no symbols in them, put them back into the wpl folder by themselves, and convert again!&lt;/p&gt;
&lt;p&gt;Cheers!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Appreciating SPARQL CONSTRUCT more</title>
      <link>https://www.bobdc.com/blog/appreciating-sparql-construct/</link>
      <pubDate>Wed, 09 Sep 2009 19:33:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/appreciating-sparql-construct/</guid>
      
      
      <description><div>Another way to get more out of your data.</div><div>&lt;p&gt;As with SQL, SPARQL&amp;rsquo;s most popular verb is SELECT. It lets you request the data you want from a collection, whether you&amp;rsquo;re asking for a single phone number or you want a list of first and last names and phone numbers of all employees hired after January 1st, sorted by last name.&lt;/p&gt;
&lt;blockquote id=&#34;id103299&#34; class=&#34;pullquote&#34;&gt;CONSTRUCT provides a nice example of how SPARQL is more than a query language; along with extracting data using queries, you can create useful new data as well.&lt;/blockquote&gt;
&lt;p&gt;In SPARQL, SELECT is actually known as a &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#QueryForms&#34;&gt;query form&lt;/a&gt;, and another is CONSTRUCT. According to the &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/&#34;&gt;SPARQL Query Language for RDF&lt;/a&gt; W3C Recommendation, CONSTRUCT returns a graph—a set of triples. I had thought of CONSTRUCT as a way of pulling a set of triples out of a triplestore, especially a remote triplestore, but while reviewing some TopQuadrant training material I realized how handy CONSTRUCT can be to create useful new triples.&lt;/p&gt;
&lt;p&gt;For example, let&amp;rsquo;s say you have the following triples written in Turtle syntax to identify the gender and parent/child relationships of a few people:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix : &amp;lt;http://www.snee.com/ns/demo#&amp;gt; .


:jane :hasParent :gene .
:gene :hasParent :pat ;
      :gender    :female .
:joan :hasParent :pat ;
      :gender    :female . 
:pat  :gender    :male .
:mike :hasParent :joan .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following CONSTRUCT statement creates new triples based on the ones above to specify who is whose grandfather:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX : &amp;lt;http://www.snee.com/ns/demo#&amp;gt; 


CONSTRUCT { ?p :hasGrandfather ?g . }


WHERE {?p      :hasParent ?parent .
       ?parent :hasParent ?g .
       ?g      :gender    :male .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When I ran this query with the data above, &lt;a href=&#34;http://jena.sourceforge.net/ARQ/&#34;&gt;ARQ&lt;/a&gt; returned the newly constructed triples in Turtle format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix :        &amp;lt;http://www.snee.com/ns/demo#&amp;gt; .


:jane
      :hasGrandfather  :pat .


:mike
      :hasGrandfather  :pat .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the same little data file, we can generate triples about who is whose aunt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX : &amp;lt;http://www.snee.com/ns/demo#&amp;gt; 


CONSTRUCT { ?p :hasAunt ?aunt . }


WHERE {?p      :hasParent ?parent .
       ?parent :hasParent ?g .
       ?aunt   :hasParent ?g ;
               :gender    :female .


FILTER (?parent != ?aunt)  
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this query, ARQ constructs these triples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix :        &amp;lt;http://www.snee.com/ns/demo#&amp;gt; .


:jane
      :hasAunt      :joan .


:mike
      :hasAunt      :gene .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This isn&amp;rsquo;t really creating new information, but the ability to make implicit information explicit can certainly add value to a system, especially when the rules necessary to assemble the pieces are more complicated than the ones shown above for identifying grandfathers and aunts.&lt;/p&gt;
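&lt;p&gt;For anyone who wants to see that join spelled out step by step, here is the grandfather inference as a plain-Python sketch: ordinary tuples and set comprehensions standing in for the triples and the CONSTRUCT query, with the names taken from the Turtle example above.&lt;/p&gt;

```python
# Join the hasParent pairs with themselves, filter on gender, and emit
# new hasGrandfather triples, mirroring the CONSTRUCT query's WHERE clause.

triples = {
    ("jane", "hasParent", "gene"),
    ("gene", "hasParent", "pat"),
    ("gene", "gender", "female"),
    ("joan", "hasParent", "pat"),
    ("joan", "gender", "female"),
    ("pat", "gender", "male"),
    ("mike", "hasParent", "joan"),
}

parent = {(s, o) for s, p, o in triples if p == "hasParent"}
male = {s for s, p, o in triples if p == "gender" and o == "male"}

constructed = {
    (child, "hasGrandfather", g)
    for child, par in parent
    for par2, g in parent
    if par == par2 and g in male
}
print(sorted(constructed))
```

&lt;p&gt;It prints the two inferred triples, for jane and for mike, matching the ARQ output above.&lt;/p&gt;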
&lt;p&gt;How you use your newly constructed triples depends on how your SPARQL engine gives them to you. As we saw above, ARQ writes them out in Turtle syntax. TopQuadrant&amp;rsquo;s TopBraid Composer displays them in the window used for SPARQL query output, and after you select one or more of them, the &amp;ldquo;Assert selected constructed triples&amp;rdquo; menu choice adds them to the graph of triples that you&amp;rsquo;re currently working with. (This works in the &lt;a href=&#34;http://www.topquadrant.com/products/TB_Composer.html#free&#34;&gt;free edition&lt;/a&gt; as well.)&lt;/p&gt;
&lt;p&gt;CONSTRUCT provides a nice example of how SPARQL is more than a query language; along with extracting data using queries, you can create useful new data as well.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By Keith Fahlgren on &lt;a href=&#34;#comment-2330&#34;&gt;September 10, 2009 1:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I also started turning to CONSTRUCT recently as a performance optimization. Rather than having to ask the server to build the &amp;ldquo;normal&amp;rdquo; huge serialization that the libraries expect, I just plucked out a tiny subset that I needed (and didn&amp;rsquo;t cross too many internal graph storage boundaries) and asked for a CONSTRUCT of that. The speedup wasn&amp;rsquo;t as huge as I&amp;rsquo;d hoped, but it was still a fruitful exercise.&lt;/p&gt;
&lt;p&gt;By Simon Reinhardt on &lt;a href=&#34;#comment-2331&#34;&gt;September 10, 2009 4:12 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can see some more examples for the usefulness of CONSTRUCT for things like rules and views at &lt;a href=&#34;http://spinrdf.org/spin.html&#34;&gt;http://spinrdf.org/spin.html&lt;/a&gt; and at &lt;a href=&#34;http://www.uni-koblenz-landau.de/koblenz/fb4/institute/IFI/AGStaab/Research/systeme/NetworkedGraphs&#34;&gt;http://www.uni-koblenz-landau.de/koblenz/fb4/institute/IFI/AGStaab/Research/systeme/NetworkedGraphs&lt;/a&gt; .&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2332&#34;&gt;September 10, 2009 9:28 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Believe me, I&amp;rsquo;ve been studying spinrdf.org for a few weeks now&amp;ndash;it&amp;rsquo;s part of my job!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://topquadrant.com&#34; title=&#34;http://topquadrant.com&#34;&gt;Daniel Mekonnen&lt;/a&gt; on &lt;a href=&#34;#comment-2333&#34;&gt;September 13, 2009 5:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Congratulations, Bob! I think you are now seeing &amp;ldquo;the stars in the obelisk&amp;rdquo;, to use a 2001 analogy (in the Arthur C. Clarke sense, not the actual year :).&lt;/p&gt;
&lt;p&gt;In my own experience with SPARQL I actually use CONSTRUCT, INSERT and DELETE more than SELECT, which is very much a part of the process of semanticizing unlinked data sets from raw sources like Excel files, XML, XSD, CSV, RDBMS sources and text dumps from PDF files. TopBraid Composer can import most anything that you can point a URL at and bring it into a semantic representation.&lt;/p&gt;
&lt;p&gt;But that&amp;rsquo;s just the starting point. The semantic representations that you get from the many import features are not necessarily going to be in the vocabulary that you are required to work with on a given project. This is where CONSTRUCT and friends come to the rescue, to transform one form of triple patterns into another. SPARQLMotion brings the process to another level, allowing you to pipeline a series of transformations together, and even merge multiple sources together, into the representation that you need.&lt;/p&gt;
&lt;p&gt;Very powerful stuff. I view the process as &amp;ldquo;data shaping&amp;rdquo; and the equivalent of &amp;ldquo;s/pattern A/pattern B/&amp;rdquo; from the regex world. I think people who enjoy writing regular expressions will find SPARQL very enjoyable, kind of like going from 1D to 2D pattern matching.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Growth of the linked data cloud</title>
      <link>https://www.bobdc.com/blog/growth-of-the-linked-data-clou/</link>
      <pubDate>Thu, 03 Sep 2009 09:19:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/growth-of-the-linked-data-clou/</guid>
      
      
      <description><div>Or at least, the growth of Richard Cyganiak&#39;s famous diagram.</div><div>&lt;p&gt;While preparing slides for the Semantic Web Overview talk I&amp;rsquo;ll be giving at the beginning of the &lt;a href=&#34;http://xmlsummerschool.com/curriculum2009/semantic-technologies/&#34;&gt;Semantic Technologies course&lt;/a&gt; of the &lt;a href=&#34;http://xmlsummerschool.com/&#34;&gt;Oxford XML Summer School&lt;/a&gt;, I was adding a few slides on Linked Data. (Leigh Dodds is presenting a more detailed class on Linked Data later in the day.) Of course I had to include a slide of Richard Cyganiak&amp;rsquo;s interactive diagram of the Linked Data cloud, and as with many of my slides, I was tempted to re-use a slide from a presentation I&amp;rsquo;d given before. I found the following image in a talk I gave in February of last year:&lt;/p&gt;
&lt;img id=&#34;id103301&#34; src=&#34;https://www.bobdc.com/img/main/ldcFeb08.jpg&#34; alt=&#34;[linked data cloud, February 2008]&#34; width=&#34;440px&#34;/&gt;
&lt;p&gt;I decided to be conscientious and update the image, so I went to &lt;a href=&#34;http://richard.cyganiak.de/2007/10/lod/&#34;&gt;Richard&amp;rsquo;s page&lt;/a&gt; to get an updated version, and found this:&lt;/p&gt;
&lt;img id=&#34;id103325&#34; src=&#34;https://www.bobdc.com/img/main/ldcJul09.jpg&#34; alt=&#34;[linked data cloud, July 2009]&#34; width=&#34;440px&#34;/&gt;
&lt;p&gt;It looks like the world of linked data is growing at quite a rate! And, if you look closely, you&amp;rsquo;ll see that his latest image says &amp;ldquo;As of July 2009&amp;rdquo;, so I imagine that there are even more nodes to add to this image by now.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Getting started with the TopQuadrant product line</title>
      <link>https://www.bobdc.com/blog/getting-started-with-the-topqu/</link>
      <pubDate>Thu, 27 Aug 2009 20:38:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started-with-the-topqu/</guid>
      
      
      <description><div>A lot of great technology to learn about.</div><div>&lt;p&gt;Last week was my first week working at &lt;a href=&#34;http://www.topquadrant.com/&#34;&gt;TopQuadrant&lt;/a&gt;, and I spent three days in a class given by one of my new co-workers, Scott Henninger. I only had a skeletal idea of what the components of &lt;a href=&#34;http://www.topquadrant.com/products/TB_Suite.html&#34;&gt;TopBraid Suite&lt;/a&gt; did before, and now that I have a better idea, I&amp;rsquo;m very impressed. (I may be wrong on one or two details below, but I&amp;rsquo;m still the new guy.)&lt;/p&gt;
&lt;p&gt;I had the impression that the core original product, TopBraid Composer, was mostly for designing and editing RDFS/OWL schemas and ontologies. It&amp;rsquo;s very good at doing those, but it also makes a very good interface for dealing directly with the data described by these models. Being built on Eclipse, the various panes (or, in Eclipse parlance, &amp;ldquo;views&amp;rdquo;) of the main window let you see an ontology or a file of data from several angles at once and refine the model by pointing, clicking, dragging, and editing dialog boxes.&lt;/p&gt;
&lt;p&gt;TopBraid Composer also includes a SPARQL engine and uses standard SPARQL as the starting point for several new technologies that let you build applications around triplestores. A great new one is SPIN, for &amp;ldquo;SPARQL Inferencing Notation&amp;rdquo;. As described on &lt;a href=&#34;http://www.spinrdf.org/&#34;&gt;spinrdf.org&lt;/a&gt;,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;SPIN is a collection of RDF vocabularies enabling the use of SPARQL to define constraints and inference rules on Semantic Web models. SPIN also provides meta-modeling capabilities that allow users to define their own SPARQL functions and query templates. Finally, SPIN includes a ready to use library of common functions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As TopQuadrant VP of Product Development Holger Knublauch wrote in a comment in a recent &lt;a href=&#34;http://stage.vambenepe.com/archives/496&#34;&gt;William Vambenepe blog entry&lt;/a&gt;,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another aspect of RDF that SPIN rides on is the vision of a distributed self-describing data structure. In the Semantic Web, both classes and instances live in the same space and can be queried using the same mechanisms. SPIN takes this idea to extremes: you can not only define classes and properties, but even define executable semantics of those and use this mechanism to build your own modeling languages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Holger&amp;rsquo;s own blog entry &lt;a href=&#34;http://composing-the-semantic-web.blogspot.com/2009/01/object-oriented-semantic-web-with-spin.html&#34;&gt;The Object-Oriented Semantic Web with SPIN&lt;/a&gt; is a good introduction to what SPIN (and TopQuadrant&amp;rsquo;s implementation of those executable semantics, TopSPIN) are all about. With support for SPIN built into the &lt;a href=&#34;http://www.topquadrant.com/products/TB_Composer.html#free&#34;&gt;free edition&lt;/a&gt; of TopBraid Composer, a lot of people can now try this out, and I look forward to helping the company beef up the documentation for it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.topquadrant.com/products/SPARQLMotion.html&#34;&gt;&lt;img id=&#34;id103394&#34; src=&#34;http://www.topquadrant.com/images/sparql_examples/SPARQLMotion-Example.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;10px&#34; vspace=&#34;10px&#34; alt=&#34;[SPARQLMotion screen shot]&#34; width=&#34;320px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.topquadrant.com/products/SPARQLMotion.html&#34;&gt;SPARQLMotion&lt;/a&gt; is another impressive RDF application development productivity tool. SPARQLMotion lets you build applications by dragging icons into a screen where you connect them into pipelines that can branch in different directions depending on various conditions. You configure each icon by filling out a dialog box to point to data sources, data destinations, and processing modules. Input modules represented by different icons can pull data from news feeds, spreadsheets, email, XML, all the obvious RDF sources, and more. Processing can apply rules via Pellet, Jena rules, TopSPIN, Calais, XSLT&amp;hellip; that&amp;rsquo;s about a quarter of the list. You can then output the results of your processing to most of the input formats and additional ones such as calendars, maps, and HTTP POST requests. (PDF support is on the way, so that you could have the XSLT processing module convert XML versions of data pulled by some SPARQL queries into XSL-FO of a nicely rendered page and then output a PDF file from there.) Holger has done a nice five-minute video called &amp;ldquo;Creating a SPARQLMotion Script&amp;rdquo; on &lt;a href=&#34;http://www.topquadrant.com/resources/videos.html&#34;&gt;TopQuadrant&amp;rsquo;s video page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After TopBraid Composer, the other two components of TopBraid Suite are Ensemble and Live. TopBraid Ensemble lets you create applications by selecting user interface components and essentially writing event handlers for them. Components for displaying data include trees, grids, and forms, so a dashboard app would be pretty straightforward to build. Because it&amp;rsquo;s built on Adobe Flex, you can create any Flex component you want, such as a movie player, and then use the Ensemble API to grab triples and use them in processing. (I never realized that a running copy of Eclipse has an &lt;a href=&#34;http://www.eclipse.org/jetty/&#34;&gt;HTTP server&lt;/a&gt; that you can use as the basis for applications.) Because the UI that you design can trigger manipulation of the data in a triplestore using SPIN and SPARQLMotion, you can build complete applications around triplestores for use by people who may not even know what RDF is but who need to work with that data using a form-driven interface.&lt;/p&gt;
&lt;p&gt;Once you build an application with Ensemble, TopBraid Live lets you deploy it on a server for others to use. I saw Scott help a customer deploy an app, and the process pretty much looked like zipping up some files and then unzipping them to the right place on a server that the app&amp;rsquo;s users would have access to.&lt;/p&gt;
&lt;p&gt;With SPARQLMotion as a development tool and TopBraid Live as a deployment tool, it&amp;rsquo;s easy to picture an information publisher having staff members who do nothing but full-time SPARQLMotion development, creating apps that mix and match data from all the different data sources available to that publisher in order to build information products and applications around those data sources. (The data might be available as native RDF, but would more likely be in a host of other formats available to the SPARQLMotion scripts using its automatic converters.) Using TopBraid Live, the publisher would use these apps to deliver content in any format necessary to their customers. The publisher would have an agile platform for creating new information products whose components may have started off in separate silos and would have taken a lot more work to integrate without TopBraid Ensemble. Of course, there&amp;rsquo;s more to it than the easier integration provided by the RDF data model; the possibilities that RDFS, OWL, and now SPIN provide for adding metadata to the content should be very attractive to publishers as well.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>DevX article on using RDFa with DocBook and DITA</title>
      <link>https://www.bobdc.com/blog/devx-article-on-using-rdfa-wit/</link>
      <pubDate>Fri, 21 Aug 2009 07:18:04 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/devx-article-on-using-rdfa-wit/</guid>
      
      
      <description><div>Relatively easy.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.devx.com/semantic/Article/42543&#34;&gt;&lt;img id=&#34;id97814&#34; src=&#34;http://assets.devx.com/articleicons/20357.gif&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;RDFa specs and discussions usually tell us that you&amp;rsquo;re not limited to using it with HTML, and then they only talk about using it with HTML. I wanted to see how difficult it would be to incorporate the examples from the W3C&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/TR/2008/NOTE-xhtml-rdfa-primer-20081014/&#34;&gt;RDFa Primer&lt;/a&gt; into DocBook and DITA documents, and it wasn&amp;rsquo;t difficult at all. DevX has just published an article that I wrote on how I went about it: &lt;a href=&#34;https://web.archive.org/web/20150317231317/http://www.devx.com/semantic/Article/42543&#34;&gt;Using RDFa with DITA and DocBook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I used to think that DTDs such as DocBook and DITA didn&amp;rsquo;t need RDFa because they&amp;rsquo;re so customizable, but when you define new metadata elements or attributes for either, you have to write new code in XSLT (or whatever your language is for processing the XML) to go get that new metadata. Once you&amp;rsquo;ve added RDFa support modules to these DTDs, though, existing RDFa extractors (that conform to the spec) will pull out any new kinds of metadata that you store in these RDFa attributes, which also reduces your need for future customization of those DTDs.&lt;/p&gt;
&lt;p&gt;I hope that people in the RDFa, DocBook, and DITA communities find the article useful.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://dret.net/netdret/&#34; title=&#34;http://dret.net/netdret/&#34;&gt;dret&lt;/a&gt; on &lt;a href=&#34;#comment-2321&#34;&gt;August 21, 2009 1:30 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;definitely interested, but &lt;a href=&#34;http://www.devx.com/semantic/Article/42543&#34;&gt;http://www.devx.com/semantic/Article/42543&lt;/a&gt; currently serves a standard frame with blank contents, so maybe that&amp;rsquo;s something DevX should look into?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Joining TopQuadrant</title>
      <link>https://www.bobdc.com/blog/joining-topquadrant/</link>
      <pubDate>Fri, 14 Aug 2009 12:40:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/joining-topquadrant/</guid>
      
      
      <description><div>Doing semantic web work full-time at an industry leader in the field.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.topquadrant.com/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/tqlogo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[TopQuadrant logo]&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m very pleased to announce that on Monday I&amp;rsquo;ll be starting a full-time position with &lt;a href=&#34;http://www.topquadrant.com/&#34;&gt;TopQuadrant&lt;/a&gt;, a well-known name in the semantic web world. TopQuadrant makes &lt;a href=&#34;http://www.topquadrant.com/products/TB_Suite.html&#34;&gt;TopBraid Suite&lt;/a&gt;, the W3C standards-based desktop tool for modeling data and for developing and deploying applications that take advantage of semantic web and linked data technology.&lt;/p&gt;
&lt;p&gt;Of all the activities that their staff is engaged in—development, modeling, training, documentation, speaking, marketing, sales—I could be taking part in all of them, so while we haven&amp;rsquo;t thought of a job title yet, it will probably be one of those classically nebulous ones that tech company employees favor. (I have told them that if Dean Allemang is &lt;a href=&#34;http://www.topquadrant.com/company/mgmt.html#dean&#34;&gt;Chief Scientist&lt;/a&gt;, I&amp;rsquo;d love to have something with the word &amp;ldquo;Scientist&amp;rdquo; in it.)&lt;/p&gt;
&lt;p&gt;Three things in particular attracted me to TopQuadrant: they have a great track record of applying semantic web technology to customer business goals, they make a fully functional version of one of their core products available for free, and they&amp;rsquo;re firmly committed to the support of open standards—there&amp;rsquo;s no patented secret sauce holding up their business model. Of course, these three points overlap a great deal; for example, as the term &amp;ldquo;semantic&amp;rdquo; gains in marketing buzzword status, more companies claim that their products use semantic technology, but they never mention RDF, OWL, or SPARQL, while TopQuadrant is actively engaged in helping customers to build applications that use these standards and that hide the syntax from the customers&amp;rsquo; end users behind modern GUI interfaces when necessary.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m looking forward very much to learning, using, teaching, and building on these tools. (And if you&amp;rsquo;re interested in applying XML technologies to publishing systems, &lt;a href=&#34;http://www.innodata-isogen.com/&#34;&gt;Innodata Isogen&lt;/a&gt; just might be interested in hiring you.)&lt;/p&gt;
&lt;h2 id=&#34;8-comments&#34;&gt;8 Comments&lt;/h2&gt;
&lt;p&gt;By Michael Friedman on &lt;a href=&#34;#comment-2307&#34;&gt;August 14, 2009 1:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Congratulations on the new position! It sounds like a great fit for you. Please do keep us informed!&lt;/p&gt;
&lt;p&gt;By Betty Harvey on &lt;a href=&#34;#comment-2308&#34;&gt;August 14, 2009 1:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Congratulations and good luck! I know you&amp;rsquo;ll enjoy this new position and bring a lot of good work to the semantic web.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-2309&#34;&gt;August 14, 2009 2:26 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Suggestions:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Other Scientist&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Grunt Scientist&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Indian Scientist&amp;rdquo; (as in &amp;ldquo;chiefs and Indians&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Grad&amp;rdquo; (from Larry Niven&amp;rsquo;s Smoke Ring books)&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Chief Natural Philosopher&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This all reminds me of how in the Anglican Church the Archbishop of York is the Primate of England, whereas the Archbishop of Canterbury is the Primate of All England. Of course, they are both primates, so that&amp;rsquo;s only reasonable.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-2310&#34;&gt;August 14, 2009 5:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is good for you and TQ&amp;hellip;really pleased to hear this, now you&amp;rsquo;re going to have to dig into Jena a bit more, maybe some Jython bindings on the way?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.triplescape.com&#34; title=&#34;http://blog.triplescape.com&#34;&gt;Brian Manley&lt;/a&gt; on &lt;a href=&#34;#comment-2311&#34;&gt;August 14, 2009 9:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Congrats Bob! I interviewed with them a few weeks ago, and found them to be a really interesting and smart group. Hope it works out well for you!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://sw-app.org/about.html&#34; title=&#34;http://sw-app.org/about.html&#34;&gt;Michael Hausenblas&lt;/a&gt; on &lt;a href=&#34;#comment-2312&#34;&gt;August 15, 2009 6:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Congrats! Sounds very cool; hope you still find time to continue your great posts, here.&lt;/p&gt;
&lt;p&gt;Cheers,&lt;br /&gt;
Michael&lt;/p&gt;
&lt;p&gt;By Pat Hayes on &lt;a href=&#34;#comment-2313&#34;&gt;August 16, 2009 12:10 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Congrats. On titles, I would go for &amp;ldquo;natural scientist&amp;rdquo;, which has a venerable history but also carries a subtle implication about all the other scientists :-)&lt;/p&gt;
&lt;p&gt;By Peter Ring on &lt;a href=&#34;#comment-2314&#34;&gt;August 16, 2009 6:55 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Congratulations! and keep up posting!&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d love to have a business card that read &amp;lsquo;Evil Mad Scientist&amp;rsquo; or just &amp;lsquo;Evil Genius&amp;rsquo;. For now, I have to settle with &amp;lsquo;Information Architect&amp;rsquo;. Maybe I should start wearing a white coat &amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Advanced XSLT (and XQuery, and XSL-FO)</title>
      <link>https://www.bobdc.com/blog/advanced-xslt-and-xquery-and-x/</link>
      <pubDate>Thu, 30 Jul 2009 11:28:49 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/advanced-xslt-and-xquery-and-x/</guid>
      
      
      <description><div>A good place to learn.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0470192747/bobducharmeA/&#34;&gt;&lt;img id=&#34;id202739&#34; src=&#34;http://ecx.images-amazon.com/images/I/31bJBBvsybL._SL500_AA180_.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;20px&#34; alt=&#34;[&#39;XSLT 2.0 and XPath 2.0&#39; cover]&#34; width=&#34;140px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;While there are many places for beginners to learn the basics of XSLT (I have &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1930110111/bobducharmeA/&#34;&gt;my favorite&lt;/a&gt;, but I&amp;rsquo;m biased), learning more advanced techniques often means spending a good deal of money to bring in a specific expert for specialized training.&lt;/p&gt;
&lt;p&gt;For the eight or so years that I&amp;rsquo;ve chaired the XSLT/XSL-FO/XQuery track at the &lt;a href=&#34;http://xmlsummerschool.com/&#34;&gt;XML Summer School&lt;/a&gt; in Oxford, England, I&amp;rsquo;ve begun it by giving a beginner-level XSLT introduction class, but with people like Michael Kay, Jeni Tennison, and Priscilla Walmsley on hand (and that&amp;rsquo;s just the XSLT track—other tracks include XSLT experts such as Debbie Lapeyre and Norm Walsh), I&amp;rsquo;ve always wanted to take better advantage of the opportunity to address more advanced, cutting-edge issues. So this year, when the Summer School is held from the 20th to the 25th of September, I won&amp;rsquo;t give an introductory class as part of this track; Debbie will cover the introductory level material in the &lt;a href=&#34;http://xmlsummerschool.com/curriculum2009/hands-on-intro/&#34;&gt;Hands-on Introduction to XML&lt;/a&gt; course, leaving the &lt;a href=&#34;http://xmlsummerschool.com/curriculum2009/xslt-xsl-fo-and-xquery/&#34;&gt;XSLT, XSL-FO, and XQuery&lt;/a&gt; track more room to cover material that helps current XSLT practitioners build better applications more quickly. (I will be doing the introductory class in the &lt;a href=&#34;http://xmlsummerschool.com/curriculum2009/semantic-technologies/&#34;&gt;Semantic Technologies&lt;/a&gt; course, which I&amp;rsquo;m greatly looking forward to.)&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1590593243/bobducharmeA/&#34;&gt;&lt;img id=&#34;id202842&#34; src=&#34;http://ecx.images-amazon.com/images/I/41NCOSM2okL.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;20px&#34; alt=&#34;[&#39;Beginning XSLT 2.0&#39; cover]&#34; width=&#34;120px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For example, one new class aimed squarely at current XSLT practitioners is Jeni&amp;rsquo;s &amp;ldquo;Test-driven XSLT development&amp;rdquo;. Different programming and scripting languages present different advantages and challenges for the use of unit tests in application development; Jeni has worked out a framework to let this approach benefit XSLT development, and will show us how it has helped her work.&lt;/p&gt;
&lt;p&gt;Another new class will be Michael Kay&amp;rsquo;s discussion of application architecture. Standards-based XML development often offers several ways to do the same task within a system (for example, XSLT vs. XQuery or native XML databases vs. relational ones) and Michael&amp;rsquo;s perspective on how to go about making the right choices will be very interesting, given his extensive development experience with both the inner workings of his Saxon XSLT processor and with system development for clients.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0596006349/bobducharmeA/&#34;&gt;&lt;img id=&#34;id202890&#34; src=&#34;http://ecx.images-amazon.com/images/I/51E94VHpO9L.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;20px&#34; alt=&#34;[&#39;XQuery&#39; cover]&#34; width=&#34;120px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A larger-scale class that I really look forward to is the XSLT Efficiency Workshop, a session that will last an entire afternoon. This will be led by Michael, Jeni, Priscilla, and me, and will cover both development efficiency and execution efficiency. Instead of simply taking turns presenting slides, we have an interactive session planned, in which we work with subsets of the workshop&amp;rsquo;s attendees to identify the most pressing issues in their own development, and then the all-star panel and I will discuss approaches to these issues.&lt;/p&gt;
&lt;p&gt;Classes from previous years that will be offered again are Jeni&amp;rsquo;s &amp;ldquo;Getting the Most Out of XSLT 2.0&amp;rdquo; and Priscilla&amp;rsquo;s introductions to XQuery and XSL-FO. I&amp;rsquo;ve seen these presentations several times and learn something new each time.&lt;/p&gt;
&lt;p&gt;As Jeni wrote in her &lt;a href=&#34;http://www.jenitennison.com/blog/node/107&#34;&gt;weblog&lt;/a&gt; recently,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I know a lot of beginners go to the XML Summer School for the introduction course, but to me the real value is for people who are actually using XML on a day to day basis and want to keep on top of the latest tools and technologies that will actually help them do their jobs. I learn something new every year.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1930110111/bobducharmeA/&#34;&gt;&lt;img id=&#34;id202971&#34; src=&#34;http://www.snee.com/bob/img/XQcoverSmall.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;20px&#34; alt=&#34;[&#39;XSLT Quickly&#39; cover]&#34; width=&#34;120px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We hope that the Efficiency Workshop in particular brings the advanced expertise together with stories of day-to-day use to help the practitioners learn new techniques and the experts learn more about how people are using these technologies in a range of real-world applications.&lt;/p&gt;
&lt;p&gt;And of course there&amp;rsquo;s Oxford itself, and the beer, and the hanging out with friends, which can be even better than the classes, but it&amp;rsquo;s difficult to distinguish between the classes and the hanging out in the college bar when these old and new friends know so much about XML technology and enjoy discussing their work.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.software-development-blog.com&#34; title=&#34;http://www.software-development-blog.com&#34;&gt;jope&lt;/a&gt; on &lt;a href=&#34;#comment-2724&#34;&gt;November 30, 2010 3:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hey,&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve learned XSLT and FOP by tutorials googled on the web and about 2 books. A video guide would be also a nice learning training, because of the replay function.&lt;/p&gt;
&lt;p&gt;Greets&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>Court decision metadata and DBpedia</title>
      <link>https://www.bobdc.com/blog/court-decision-metadata-and-db/</link>
      <pubDate>Mon, 27 Jul 2009 09:05:41 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/court-decision-metadata-and-db/</guid>
      
      
      <description><div>An unplanned sequel.</div><div>&lt;img id=&#34;id202742&#34; src=&#34;http://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Seal_of_the_United_States_Supreme_Court.svg/100px-Seal_of_the_United_States_Supreme_Court.svg.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;seal of the US Supreme Court&#34;/&gt;
&lt;p&gt;When I wrote my last two blog entries (not counting the announcement about my new developerWorks article), &lt;a href=&#34;https://www.bobdc.com/blog/modeling-your-data-with-dbpedi&#34;&gt;Modeling your data with DBpedia vocabularies&lt;/a&gt; and &lt;a href=&#34;http://www.snee.com/bobdc.blog/2009/06/big-legal-publishers-and-seman.html&#34;&gt;Big legal publishers and semantic web technology&lt;/a&gt;, I had no idea that I would soon stumble across a nice collection of US Supreme Court case metadata in DBpedia. After writing about modeling with DBpedia vocabularies, it occurred to me that if Wikipedia has pages with infoboxes for individual professional wrestlers and Battlestar Galactica episodes, they probably have them for important Supreme Court cases as well. I checked for Roe v. Wade (popular in legal publishing because along with being a famous case, its title is short and easy to spell) and there it was at &lt;a href=&#34;http://en.wikipedia.org/wiki/Roe_v._Wade&#34;&gt;http://en.wikipedia.org/wiki/Roe_v._Wade&lt;/a&gt;. Even better for the semweb geek, its DBpedia page at &lt;a href=&#34;http://dbpedia.org/page/Roe_v._Wade&#34;&gt;http://dbpedia.org/page/Roe_v._Wade&lt;/a&gt; showed properties for most of the key bits of information you want for a court decision: the date, the reporter volume and page, names of concurring judges, names of dissenting judges, laws applied, and more.&lt;/p&gt;
&lt;p&gt;Wikipedia and DBpedia even include my favorite case, &lt;a href=&#34;http://en.wikipedia.org/wiki/Campbell_v._Acuff-Rose_Music,_Inc.&#34;&gt;Campbell v. Acuff-Rose Music, Inc.&lt;/a&gt;, in which Appendix B of the &lt;a href=&#34;http://www.law.cornell.edu/supct/html/92-1292.ZO.html&#34;&gt;Supreme Court decision&lt;/a&gt; includes the following lyrics from the 2 Live Crew song that Roy Orbison&amp;rsquo;s publisher sued &amp;ldquo;Luther Campbell aka Luke Skywalker&amp;rdquo; (as he&amp;rsquo;s known in the case&amp;rsquo;s dbprop:fullname) over: &amp;ldquo;Big hairy woman all that hair it ain&amp;rsquo;t legit/&amp;lsquo;Cause you look like &amp;lsquo;Cousin It&amp;rsquo;&amp;rdquo;. (I like my landmark Supreme Court IP law decisions to include &lt;a href=&#34;http://is.gd/1JiHe&#34;&gt;Addams Family&lt;/a&gt; references.)&lt;/p&gt;
&lt;p&gt;Wikipedia currently has pages for 198 Supreme Court decisions, according to their &lt;a href=&#34;http://en.wikipedia.org/wiki/Category:United_States_Supreme_Court_cases&#34;&gt;Category:United States Supreme Court cases&lt;/a&gt; page. After going to the &lt;a href=&#34;http://dbpedia.org/page/Category:United_States_Supreme_Court_cases&#34;&gt;DBpedia equivalent of that page&lt;/a&gt;, I realized that I could retrieve a list of them all with a simple SPARQL query on &lt;a href=&#34;http://dbpedia.org/sparql&#34;&gt;DBpedia&amp;rsquo;s query form&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?s WHERE {
  ?s 
  &amp;lt;http://www.w3.org/2004/02/skos/core#subject&amp;gt;
  &amp;lt;http://dbpedia.org/resource/Category:United_States_Supreme_Court_cases&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even better, I noticed at the bottom of the Wikipedia page for Campbell v Acuff-Rose that it belonged to the Wikipedia category &lt;a href=&#34;http://en.wikipedia.org/wiki/Category:United_States_copyright_case_law&#34;&gt;US copyright case law&lt;/a&gt;, a pretty important bit of categorization metadata. Sure, you can look at that page to see the list, but you can also retrieve the list with a slight modification to the SPARQL query above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?s WHERE {
  ?s 
  &amp;lt;http://www.w3.org/2004/02/skos/core#subject&amp;gt;
  &amp;lt;http://dbpedia.org/resource/Category:United_States_copyright_case_law&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The most interesting part of the metadata included with the cases is the connections between them. For example, the DBpedia page for &lt;a href=&#34;http://dbpedia.org/page/Brown_v._Board_of_Education&#34;&gt;Brown v. Board of Education&lt;/a&gt; shows that it &amp;ldquo;is dbpprop:overruled of&amp;rdquo; Plessy v. Ferguson. The DBpedia page for &lt;a href=&#34;http://dbpedia.org/page/Plessy_v._Ferguson&#34;&gt;Plessy v. Ferguson&lt;/a&gt; shows that it&amp;rsquo;s dbpprop:overruled by Brown v. Board of Education.&lt;/p&gt;
&lt;p&gt;There are not enough of these links to threaten a commercial cite-checking service such as LexisNexis&amp;rsquo;s &lt;a href=&#34;http://www.oreillynet.com/xml/blog/2003/05/a_nineteenthcentury_linking_ap.html&#34;&gt;Shepard&amp;rsquo;s&lt;/a&gt; product—a lawyer checking whether a potentially citable case has been overruled is a classic example of when &lt;a href=&#34;http://en.wikipedia.org/wiki/Precision_and_recall&#34;&gt;search recall&lt;/a&gt; trumps precision, because missing just one search result can be disastrous for the lawyer. Still, the current amount of SPARQL-addressable fielded metadata about US caselaw on Wikipedia (and hence on DBpedia) is a big step beyond the amount of &lt;a href=&#34;https://www.bobdc.com/blog/law-metadata-on-the-web&#34;&gt;law metadata on the web&lt;/a&gt; that was available when I wrote about this in early 2006. It will be great to see this collection grow and to see more applications take advantage of it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/legal-publishing">legal publishing</category>
      
    </item>
    
    <item>
      <title>New developerWorks article: &#34;Build Wikipedia query forms with semantic technology&#34;</title>
      <link>https://www.bobdc.com/blog/new-developerworks-article-bui/</link>
      <pubDate>Wed, 22 Jul 2009 10:07:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/new-developerworks-article-bui/</guid>
      
      
      <description><div>Build form-driven apps that let any user query DBpedia.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-wikiquery/&#34;&gt;&lt;img id=&#34;id202763&#34; src=&#34;http://www.ibm.com/developerworks/i/dwwordmark.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[developerWorks logo]&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I often find discussions about whether SPARQL is difficult to be a bit silly. Not that SPARQL is incredibly easy—although I do find it easy enough as query or scripting languages go—but because any talk of its suitability for your Mom is just a red herring. In a January blog posting titled &lt;a href=&#34;https://www.bobdc.com/blog/hey-cnn-sparql-isnt-so-difficu&#34;&gt;Hey CNN, SPARQL isn&amp;rsquo;t so difficult&lt;/a&gt;, I wrote that as with SQL and other query languages, no one expects end users to type out SPARQL queries, but that someone who already knows a scripting language or two can pick up SPARQL and use it to build new kinds of applications.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve written an article titled &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-wikiquery/&#34;&gt;Build Wikipedia query forms with semantic technology&lt;/a&gt; to demonstrate how, and it&amp;rsquo;s now live on IBM&amp;rsquo;s developerWorks website. The article walks the reader through the components of two simple form-driven applications: one that queries the Internet Movie Database for the names of actors who have appeared in movies by the two directors whose names you enter on the input form (for example, only Kathleen Turner has appeared in both a Francis Ford Coppola film and a Sofia Coppola film), and another that retrieves nicely-formatted information about recording artist albums from DBpedia. I tried to make it clear in the article that the cool part of all this is not this relatively new query language, but the existence of these collections of data that can be accessed by a standard query language and the ease with which one can build a query around search terms entered on a form, then send the query off to the server, and then format and display the results in a web page—just as people have done with SQL queries for years now.&lt;/p&gt;
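&lt;p&gt;The plumbing is ordinary web programming. As a minimal Python sketch of that request-building step (the &amp;ldquo;format&amp;rdquo; parameter asking the endpoint for JSON results is an assumption here, not something from the article), assembling the GET URL that such a form handler would send off to the server might look like this:&lt;/p&gt;

```python
from urllib.parse import urlencode

def dbpedia_query_url(sparql_query):
    """Build the GET URL that a form-driven app would request after
    assembling a SPARQL query from the user's search terms."""
    endpoint = "http://dbpedia.org/sparql"
    # The "format" parameter requesting JSON results is an assumption
    # about this endpoint; adjust it for the server you are targeting.
    params = urlencode({"query": sparql_query,
                        "format": "application/sparql-results+json"})
    return endpoint + "?" + params
```

&lt;p&gt;From there, formatting and displaying the returned results in a web page works just as it would with any other HTTP API.&lt;/p&gt;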
&lt;p&gt;I hope I got those points across, and that the article helps more people understand the contributions that SPARQL and linked data can make to useful applications that couldn&amp;rsquo;t exist otherwise.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Modeling your data with DBpedia vocabularies</title>
      <link>https://www.bobdc.com/blog/modeling-your-data-with-dbpedi/</link>
      <pubDate>Sat, 18 Jul 2009 09:58:37 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/modeling-your-data-with-dbpedi/</guid>
      
      
<description><div>Broad, useful vocabularies with plenty of sample data.</div><div>&lt;p&gt;I&amp;rsquo;ve known for a while about ways to dig into the vocabularies used in DBpedia&amp;rsquo;s massive collection of triples, and I&amp;rsquo;ve used terms from these vocabularies to query for information such as &lt;a href=&#34;https://www.bobdc.com/blog/learning-more-about-sparql&#34;&gt;Bart Simpson blackboard messages&lt;/a&gt; and &lt;a href=&#34;http://www.snee.com/bobdc.blog/2008/09/querying-wikidbpedia-for-presi.html&#34;&gt;US presidents&amp;rsquo; ages at inauguration&lt;/a&gt;. I saw these terms as &amp;ldquo;field&amp;rdquo; names to use when querying this body of data.&lt;/p&gt;
&lt;p&gt;Reading the W3C &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014/&#34;&gt;RDFa spec&lt;/a&gt; recently, though, I was struck by &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014/#sec_5.3.&#34;&gt;one example&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;div about=&amp;quot;http://dbpedia.org/resource/Albert_Einstein&amp;quot;&amp;gt;
  &amp;lt;span property=&amp;quot;foaf:name&amp;quot;&amp;gt;Albert Einstein&amp;lt;/span&amp;gt;
  &amp;lt;span property=&amp;quot;dbp:dateOfBirth&amp;quot; datatype=&amp;quot;xsd:date&amp;quot;&amp;gt;1879-03-14&amp;lt;/span&amp;gt;
  &amp;lt;div rel=&amp;quot;dbp:birthPlace&amp;quot; resource=&amp;quot;http://dbpedia.org/resource/Germany&amp;quot;&amp;gt;
    &amp;lt;span property=&amp;quot;dbp:conventionalLongName&amp;quot;&amp;gt;Federal Republic of Germany
   &amp;lt;/span&amp;gt;
  &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This particular example demonstrates how to chain statements together with shared resource references, but what caught my eye was the use of the &lt;a href=&#34;http://dbpedia.org/resource/&#34;&gt;http://dbpedia.org/resource/&lt;/a&gt; namespace to reference Albert Einstein and Germany and the &lt;a href=&#34;http://dbpedia.org/property/&#34;&gt;http://dbpedia.org/property/&lt;/a&gt; namespace (here represented as &amp;ldquo;dbp:&amp;rdquo;) for the factual property &amp;ldquo;birthPlace&amp;rdquo;. In other words, here were two DBpedia vocabularies being used not to query DBpedia, but to model data completely outside of the context of DBpedia, because they offered straightforward, dereferenceable URIs for these things.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://dbpedia.org/page/Comic_Book_Guy&#34;&gt;&lt;img id=&#34;id202783&#34; src=&#34;https://www.bobdc.com/img/main/comicbookguy.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Comic Book Guy with LC URI&#34; width=&#34;200px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not saying that these are the first vocabularies to check when you need URIs for people, places, concepts, or properties, but they could be the best second or third places to go to if your domain offers no clear choice for a vocabulary that meets your needs. For example, I&amp;rsquo;d prefer the &lt;a href=&#34;http://www.linkedmdb.org/&#34;&gt;Linked Movie Database&lt;/a&gt; URI of &lt;a href=&#34;http://data.linkedmdb.org/page/film/2674&#34;&gt;http://data.linkedmdb.org/page/film/2674&lt;/a&gt; for Truffaut&amp;rsquo;s film &amp;ldquo;Shoot the Piano Player&amp;rdquo; over DBpedia&amp;rsquo;s &lt;a href=&#34;http://dbpedia.org/resource/Shoot_the_Piano_Player&#34;&gt;http://dbpedia.org/resource/Shoot_the_Piano_Player&lt;/a&gt;, despite the latter&amp;rsquo;s greater readability, because for one thing, the linkedmdb.org page for &lt;a href=&#34;http://data.linkedmdb.org/page/film/2674&#34;&gt;Shoot the Piano Player&lt;/a&gt; includes data about this resource being owl:sameAs the resource &lt;a href=&#34;http://dbpedia.org/resource/Shoot_the_Piano_Player&#34;&gt;http://dbpedia.org/resource/Shoot_the_Piano_Player&lt;/a&gt;, making it easy for queries about this movie to tie the Linked Movie Database and DBpedia metadata together. The more important reason, though, is that as far as I can tell, the Linked Movie Database project team has worked out a specific property vocabulary as part of their project, while the DBpedia one has grown more organically, leading to many more strange edge cases among the well-chosen terms.&lt;/p&gt;
&lt;p&gt;While the &lt;a href=&#34;http://id.loc.gov/authorities/search/&#34;&gt;Library of Congress Subject Headings&lt;/a&gt; provide a solid, professional taxonomy and a set of URIs for a wide variety of subjects and concepts, they don&amp;rsquo;t have them for places or people. (They might have one for &lt;a href=&#34;http://id.loc.gov/authorities/sh85078205&#34;&gt;London (England)&amp;ndash;History&lt;/a&gt;, but they don&amp;rsquo;t have one for &amp;ldquo;London (England)&amp;rdquo;.) So, while they have a URI for the concept of &lt;a href=&#34;http://id.loc.gov/authorities/sh98003588&#34;&gt;sightings of Elvis Presley since his death&lt;/a&gt;, they have no URI for Elvis himself. Nor do they have one for Einstein, and I don&amp;rsquo;t know what well-known vocabulary does, so the RDFa spec&amp;rsquo;s authors went with the DBpedia URI for the famous physicist. (Interestingly, the Library of Congress Subject Headings do cover fictitious characters such as &lt;a href=&#34;http://id.loc.gov/authorities/sh90001074&#34;&gt;Holden Caulfield&lt;/a&gt; and even the Simpsons&amp;rsquo; &lt;a href=&#34;http://id.loc.gov/authorities/sh2005008628&#34;&gt;Comic Book Guy&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;To describe facts about Einstein, the &lt;a href=&#34;http://xmlns.com/foaf/spec/&#34;&gt;FOAF&lt;/a&gt; vocabulary includes many good properties for describing a person, but none to identify the day a person was born, so the RDFa spec&amp;rsquo;s authors used the DBpedia &lt;a href=&#34;http://dbpedia.org/property/dateOfBirth&#34;&gt;http://dbpedia.org/property/dateOfBirth&lt;/a&gt; property. It&amp;rsquo;s easy enough to check whether DBpedia has a URI for a person, place, or thing by going to the appropriate Wikipedia page (watch out for redirects) and replacing the &lt;a href=&#34;http://en.wikipedia.org/wiki/&#34;&gt;http://en.wikipedia.org/wiki/&lt;/a&gt; part of its URI with &lt;a href=&#34;http://dbpedia.org/page/&#34;&gt;http://dbpedia.org/page/&lt;/a&gt;. I have a bookmarklet called &lt;a href=&#34;javascript:location.href=(location.href.replace(/https?:%5C/%5C/en.wikipedia.org%5C/wiki/,&#39;http:%5C/%5C/dbpedia.org%5C/page&#39;))&#34;&gt;wp -&amp;gt; dbpedia&lt;/a&gt; that makes this replacement and takes me from a Wikipedia page to the corresponding DBpedia page with one click. If you drag that link to your bookmarks toolbar, it should work for you.&lt;/p&gt;
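&lt;p&gt;The substitution that the bookmarklet makes is easy to script in other settings as well. Here&amp;rsquo;s a minimal Python sketch of the same replacement; note that it doesn&amp;rsquo;t follow Wikipedia redirects or confirm that the corresponding DBpedia page actually exists.&lt;/p&gt;

```python
import re

# Swap the English Wikipedia article prefix for the DBpedia page
# prefix, as the bookmarklet above does. A sketch only: it does not
# follow Wikipedia redirects or check that the DBpedia page exists.
def wikipedia_to_dbpedia(url):
    return re.sub(r'^https?://en\.wikipedia\.org/wiki/',
                  'http://dbpedia.org/page/', url)

print(wikipedia_to_dbpedia(
    'http://en.wikipedia.org/wiki/Shoot_the_Piano_Player'))
# http://dbpedia.org/page/Shoot_the_Piano_Player
```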
&lt;p&gt;To look for a property name you might need, you can check a DBpedia page for a resource that may have had that property assigned to it. You can also download an N-Triples or CSV file in your choice of 14 languages from &lt;a href=&#34;http://wiki.dbpedia.org/Downloads33&#34;&gt;DBpedia&amp;rsquo;s Download Page&lt;/a&gt;. The compressed version of infoboxproperties_en.nt, the N-Triples version of the English-language properties, was 606K, which decompression expanded to over 13 megs. With two triples per property, as shown in their brief &lt;a href=&#34;http://downloads.dbpedia.org/preview.php?file=3.3_sl_en_sl_infoboxproperties_en.nt.bz2&#34;&gt;sample&lt;/a&gt; of the file, it&amp;rsquo;s pretty verbose, so I wrote a Perl script to trim it down to just one property name per line, without the full URLs, bringing the size of the list down to 49,122 lines and about 879K.&lt;/p&gt;
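&lt;p&gt;The trimming step doesn&amp;rsquo;t need Perl specifically. A Python sketch of the same reduction might look like this; it assumes that each N-Triples line&amp;rsquo;s predicate is the second angle-bracketed URI and lives under http://dbpedia.org/property/.&lt;/p&gt;

```python
import re

# Reduce DBpedia infobox N-Triples lines to one property local name
# per line, in the spirit of the trimming script described above.
# A sketch: assumes each line's predicate is the second
# angle-bracketed URI and sits under http://dbpedia.org/property/.
def property_names(ntriples_lines):
    pattern = re.compile(
        r'^\s*<[^>]+>\s+<http://dbpedia\.org/property/([^>]+)>')
    names = set()
    for line in ntriples_lines:
        m = pattern.match(line)
        if m:
            names.add(m.group(1))
    return sorted(names)

sample = [
    '<http://dbpedia.org/resource/Albert_Einstein> '
    '<http://dbpedia.org/property/dateOfBirth> "1879-03-14" .',
]
print(property_names(sample))  # ['dateOfBirth']
```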
&lt;p&gt;The list is fun to skim through. There are a lot of goofy properties in there; &lt;a href=&#34;http://dbpedia.org/property/worldSnookerChampionshipRoundsProperty99&#34;&gt;worldSnookerChampionshipRoundsProperty99&lt;/a&gt; has 98 more to go with it. So how do you know which ones are worth using? I like metadata that&amp;rsquo;s really about existing data, and it&amp;rsquo;s easy to use &lt;a href=&#34;http://dbpedia.org/sparql&#34;&gt;DBpedia&amp;rsquo;s SPARQL query form&lt;/a&gt; to ask about resources that have a particular property assigned. Entering the following query there showed me that over 50 people have had worldSnookerChampionshipRoundsProperty99 values assigned to them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?s ?o WHERE {
  ?s 
  &amp;lt;http://dbpedia.org/property/worldSnookerChampionshipRoundsProperty99&amp;gt; 
  ?o
}
&lt;/code&gt;&lt;/pre&gt;
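&lt;p&gt;You can also send a query like this to the endpoint from a program instead of through the web form. Here&amp;rsquo;s a Python sketch that builds the request URL; it assumes that the endpoint accepts GET requests with a &amp;ldquo;format&amp;rdquo; parameter, as Virtuoso-based endpoints typically do.&lt;/p&gt;

```python
import urllib.parse

# Build a GET URL for a SPARQL query against DBpedia's public
# endpoint. A sketch: the 'format' parameter is a Virtuoso
# convention, and long queries are better sent via POST.
def dbpedia_query_url(query):
    params = urllib.parse.urlencode({
        'query': query,
        'format': 'application/sparql-results+json',
    })
    return 'http://dbpedia.org/sparql?' + params

url = dbpedia_query_url(
    'SELECT DISTINCT ?s ?o WHERE { ?s '
    '<http://dbpedia.org/property/worldSnookerChampionshipRoundsProperty99> '
    '?o }')
print(url)
```

&lt;p&gt;Fetching that URL (for example, with urllib.request.urlopen) returns the bindings as JSON.&lt;/p&gt;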
&lt;p&gt;Seeing examples of how a property was used also gives you great background on whether it&amp;rsquo;s appropriate to your needs.&lt;/p&gt;
&lt;p&gt;The first place I&amp;rsquo;d check, though, for appropriate DBpedia property names would be the DBpedia Ontology available from the same download page. It&amp;rsquo;s not huge, defining metadata for about 1200 properties at this point, but it really brings the property vocabulary into ontology territory by defining domains, ranges, subclasses, and other relationships between terms that help you to get more out of them. Outside of that ontology, &lt;a href=&#34;http://blog.georgikobilarov.com/2008/10/dbpedia-rethinking-wikipedia-infobox-extraction/&#34;&gt;plenty of other hard work&lt;/a&gt; continues to make the DBpedia predicate vocabulary more valuable to all of us, so it&amp;rsquo;s worth keeping an eye on the work going on around this vocabulary.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://aeshin.org/&#34; title=&#34;http://aeshin.org/&#34;&gt;Ryan Shaw&lt;/a&gt; on &lt;a href=&#34;#comment-2291&#34;&gt;July 18, 2009 1:23 PM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While the Library of Congress Subject Headings provide a solid, professional taxonomy and a set of URIs for a wide variety of subjects and concepts, they don&amp;rsquo;t have them for places or people.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While this is true, the Library of Congress does have authority files for those things, and I understand they plan on adding them to id.loc.gov as Linked Data soon.&lt;/p&gt;
&lt;p&gt;Einstein: &lt;a href=&#34;http://errol.oclc.org/laf/n79-22889.html&#34;&gt;http://errol.oclc.org/laf/n79-22889.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Great post!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://sw-app.org/about.html&#34; title=&#34;http://sw-app.org/about.html&#34;&gt;Michael Hausenblas&lt;/a&gt; on &lt;a href=&#34;#comment-2292&#34;&gt;July 19, 2009 2:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Great article (as usual ;) and might be worth it for us to cover this aspect in [1].&lt;br /&gt;
Thanks!&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
[1] &lt;a href=&#34;http://ld2sd.deri.org/lod-ng-tutorial/#checklist&#34;&gt;http://ld2sd.deri.org/lod-ng-tutorial/#checklist&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Big legal publishers and semantic web technology</title>
      <link>https://www.bobdc.com/blog/big-legal-publishers-and-seman/</link>
      <pubDate>Mon, 15 Jun 2009 15:10:03 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/big-legal-publishers-and-seman/</guid>
      
      
      <description><div>Which one will see the good fit first?</div><div>&lt;p&gt;A recent @TopQuadrant &lt;a href=&#34;http://www.twitter.com/TopQuadrant/status/2106840525&#34;&gt;tweet&lt;/a&gt; about legal knowledge and RDF/XML led me to Dr. Adam Wyner&amp;rsquo;s piece &lt;a href=&#34;http://www.law.com/jsp/legaltechnology/pubArticleLT.jsp?id=1202431256007&#34;&gt;Legal Ontologies Spin a Semantic Web&lt;/a&gt; on law.com. After reading it, I wanted to leave a comment, but this required registering on law.com and telling them lots of details about the law firm I work for. I don&amp;rsquo;t work for a law firm, so I&amp;rsquo;m just putting my comments here and expanding on them a bit.&lt;/p&gt;
&lt;blockquote id=&#34;id202768&#34; class=&#34;pullquote&#34;&gt;It&#39;s a logical next step for the big legal publishers to build ontologies that define new kinds of relationships among the data that they store.&lt;/blockquote&gt;
&lt;p&gt;Before discussing the value that ontologies can bring to the practice of law, Dr. Wyner writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reading a case such as &lt;a href=&#34;http://www.courtinfo.ca.gov/opinions/documents/B211070.PDF&#34;&gt;&lt;em&gt;Manhattan Loft v. Mercury Liquors&lt;/em&gt;&lt;/a&gt;, there are elementary questions that can be answered by any legal professional, but not by a computer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Where was the case decided?&lt;/li&gt;
&lt;li&gt;Who were the participants and what roles did they play?&lt;/li&gt;
&lt;li&gt;Was it a case of first instance or on appeal?&lt;/li&gt;
&lt;li&gt;What was the basis of the appeal?&lt;/li&gt;
&lt;li&gt;What were the legal issues at stake?&lt;/li&gt;
&lt;li&gt;What were the facts?&lt;/li&gt;
&lt;li&gt;What factors were relevant in making the decision?&lt;/li&gt;
&lt;li&gt;What was the decision?&lt;/li&gt;
&lt;li&gt;What legislation or case law was cited?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Legal information service providers such as &lt;a href=&#34;http://www.lexisnexis.com&#34;&gt;LexisNexis&lt;/a&gt; index some of the information&amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Actually, they identify and index most of the information in this list, as do Westlaw and the Wolters-Kluwer legal publishers, because they store the majority of their content in XML. (As early adopters of this technology, these companies sometimes store it using XML&amp;rsquo;s predecessor, SGML.) A case&amp;rsquo;s venue, its participants and their roles, the facts of the case, and the judge&amp;rsquo;s decision are typical pieces of information that a legal publisher identifies with XML markup and stores in a system that can use this information for specialized queries.&lt;/p&gt;
&lt;p&gt;Ontologies can add a lot to this, and the schemas for this XML will be a great head start to any semantic web-oriented system for getting more out of this data. This won&amp;rsquo;t happen outside of the publishers&amp;rsquo; firewalls soon, though, because the schemas for their legal content play such an important role in the extra value that they add and charge for that no legal publisher would share them. (They don&amp;rsquo;t worry about open source efforts to reproduce their work nearly as much as they worry about competitive advantages over each other.)&lt;/p&gt;
&lt;p&gt;Two other resources that these publishers can build on are their existing taxonomies and their databases of citation relationships. Taxonomies such as &lt;a href=&#34;http://en.wikipedia.org/wiki/West_American_Digest_System&#34;&gt;West&amp;rsquo;s Key Number system&lt;/a&gt; are divided by practice areas (for example, asbestos construction issues vs. child custody) and not document roles or purposes, and therefore make a nice complement to the XML schemas. Legal publishers have sold databases of citation relationships (for example, which case overruled another one) &lt;a href=&#34;http://www.oreillynet.com/xml/blog/2003/05/a_nineteenthcentury_linking_ap.html&#34;&gt;since the nineteenth century&lt;/a&gt;, and this data is all in clean, well-organized databases.&lt;/p&gt;
&lt;p&gt;Kingsley Idehen likes to discuss how relational databases added a level of abstraction over previous models, XML provided an additional layer of flexibility by enabling people to store and use structured data whose structure wasn&amp;rsquo;t necessarily tables, and the RDF data model and associated technology add another layer of abstraction and therefore more possibilities. Behind their firewalls, it&amp;rsquo;s a logical next step for the big legal publishers to build ontologies that define new kinds of relationships among the XML content, the relational citation information, and the taxonomy data that they currently store so that they can get more value out of this data.&lt;/p&gt;
&lt;p&gt;While there are cool things to do with this technology using content such as &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-technology-and-hu&#34;&gt;ancient literature&lt;/a&gt;, it&amp;rsquo;s much easier to see a business model in a domain such as legal publishing where customers have a bigger budget to spend on information that can help them do their jobs. Making a case for the return on semantic web technology investment for legal publishing will be an interesting challenge, but not too difficult, because these technologies can build incrementally on so many existing information resources such as relational databases and the XML content infrastructure that Dr. Wyner forgot to mention. It will be interesting to see which of the big legal publishers moves ahead with this first, although they may choose not to publicize it.&lt;/p&gt;
&lt;p&gt;For work outside of the big legal publishers, in a 2006 posting titled &lt;a href=&#34;https://www.bobdc.com/blog/law-metadata-on-the-web&#34;&gt;Law metadata on the web&lt;/a&gt; I wrote about how legal-rdf.org looked like a good start, but apparently there&amp;rsquo;s been little enough activity there that they let their domain name ownership lapse, and now it&amp;rsquo;s just parked by a speculator. (That posting also mentions the &lt;a href=&#34;http://www.oasis-open.org/committees/legalxml-courtfiling/&#34;&gt;OASIS LegalXML&lt;/a&gt; work, which hasn&amp;rsquo;t gotten to defining schemas for court decisions and kind of petered out in defining schemas for legislation, the other main document type for legal publishing.)&lt;/p&gt;
&lt;p&gt;Can anyone tell me of other public standards for legal metadata in development that could provide input to semantic web projects?&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.topquadrant.com&#34; title=&#34;http://www.topquadrant.com&#34;&gt;Irene Polikoff&lt;/a&gt; on &lt;a href=&#34;#comment-2289&#34;&gt;June 21, 2009 12:14 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;I was smiling as I read your post. The future is actually even closer than you may think.&lt;/p&gt;
&lt;p&gt;Can&amp;rsquo;t name any names as it would not be appropriate, but a case study on this page &lt;a href=&#34;http://www.topquadrant.com/solutions/ent_vocab_mgmt.html&#34;&gt;http://www.topquadrant.com/solutions/ent_vocab_mgmt.html&lt;/a&gt; is based on our work with one of the large legal information publishers mentioned in your blog. Representatives from the other large publisher you name spent quite a bit of time last week at our booth at the Semantic Technologies conference.&lt;/p&gt;
&lt;p&gt;Not rich ontologies so far, just taxonomies, but it is happening as we speak (or write, for that matter). Publishing is going RDF.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/legal-publishing">legal publishing</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>SearchMonkey and RDFa</title>
      <link>https://www.bobdc.com/blog/searchmonkey-and-rdfa/</link>
      <pubDate>Tue, 02 Jun 2009 20:15:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/searchmonkey-and-rdfa/</guid>
      
      
      <description><div>What am I missing?</div><div>&lt;img id=&#34;id202737&#34; src=&#34;http://i.i.com.com/cnwk.1d/i/bto/20080514/searchmonkey_logo_5.14.2008.PNG&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[searchmonkey logo]&#34; width=&#34;120px&#34;/&gt;
&lt;p&gt;Yahoo! &lt;a href=&#34;http://developer.yahoo.com/searchmonkey/&#34;&gt;SearchMonkey&lt;/a&gt; is one of those interesting, RDF-related technologies that I&amp;rsquo;d been meaning to check out for a while, and when I saw how much of the reaction to &lt;a href=&#34;https://www.bobdc.com/blog/google-and-rdfa-what-and-why&#34;&gt;Google&amp;rsquo;s Rich Snippets&lt;/a&gt; was people like &lt;a href=&#34;http://hackingsearch.com/2009/05/semantic-markup-how-to-feed-google-rich-snippets-yahoo-searchmonkey-with-rdfa-and-microformats/&#34;&gt;Ryan Smith&lt;/a&gt; or Peter Mika in the May &lt;a href=&#34;http://semanticgang.talis.com/2009/05/22/may-2009-the-semantic-web-gang-discuss-wolfram-alpha-and-googles-rdfa/&#34;&gt;Semantic Web Gang podcast&lt;/a&gt; saying that Google was just doing what SearchMonkey had already done, I knew that it was time to look more closely at SearchMonkey.&lt;/p&gt;
&lt;p&gt;I wanted to see support for RDFa embedded in HTML, and to be honest, I only see it in SearchMonkey if I squint while I&amp;rsquo;m looking and tilt my head slightly sideways. Perhaps I&amp;rsquo;m missing something, and I hope someone points it out to me.&lt;/p&gt;
&lt;p&gt;According to the &lt;a href=&#34;http://developer.yahoo.com/searchmonkey/siteowner.html&#34;&gt;Site Owner Overview&lt;/a&gt;, there are two ways to take advantage of SearchMonkey: Standard Enhanced Results or Custom SearchMonkey Applications.&lt;/p&gt;
&lt;h2 id=&#34;id202825&#34;&gt;Standard Enhanced Results&lt;/h2&gt;
&lt;p&gt;The Site Owner Overview page says this is &amp;ldquo;Currently available for certain content types such as Video, Games, and Documents&amp;rdquo;. Sounds good to me; I&amp;rsquo;m very interested in adding metadata to documents. According to the &lt;a href=&#34;http://developer.search.yahoo.com/help/objects/documents&#34;&gt;Documents page&lt;/a&gt;, though, &amp;ldquo;the Yahoo! Search document reader currently supports Flash documents only&amp;rdquo;. If you want to use RDFa to identify specialized metadata for Yahoo to use when they return your document in a search result list, your document must be stored in a Flash document, and then you embed your metadata in the attributes of an &lt;code&gt;object&lt;/code&gt; element that points at that document.&lt;/p&gt;
&lt;p&gt;I think it&amp;rsquo;s great that this lets us use RDFa to assign metadata to &lt;a href=&#34;http://www.slideshare.net/&#34;&gt;slideshare&lt;/a&gt; and &lt;a href=&#34;http://www.scribd.com/&#34;&gt;Scribd&lt;/a&gt; documents, but if this has such a strong dependency on a binary format controlled by a single software company, I&amp;rsquo;m not that interested.&lt;/p&gt;
&lt;h2 id=&#34;id202876&#34;&gt;Custom SearchMonkey Applications&lt;/h2&gt;
&lt;p&gt;OK, so I don&amp;rsquo;t want to see a shared web publishing infrastructure have such dependencies on this proprietary binary format. The SearchMonkey &lt;a href=&#34;http://developer.search.yahoo.com/start&#34;&gt;Getting Started&lt;/a&gt; page tells us: &amp;ldquo;Don&amp;rsquo;t have Flash objects? Or want to build an app to display custom enhanced results? Head on over to the SearchMonkey Developer Tool to build an app where you can display a custom image, extract structured data from your site, [or] link to pages within your site&amp;rdquo;. This sounded a bit better.&lt;/p&gt;
&lt;p&gt;According to the &lt;a href=&#34;http://developer.search.yahoo.com/wizard/index&#34;&gt;SearchMonkey Application Dashboard&lt;/a&gt; page, &amp;ldquo;Presentation Applications are small PHP apps that display enhanced search results using data services. You can use an existing data service or create a custom service below&amp;rdquo;. When I went through the steps of building a Custom Data Service based on an existing one, it asked me for a URL pattern to specify pages where it should look for data and URLs that fit that pattern to use for testing. Then, it showed the XSLT that it would use to extract data, displayed in an edit box where I could customize it.&lt;/p&gt;
&lt;p&gt;You use this stylesheet to &amp;ldquo;specify XSLT code for extracting information from the page and representing that information as &lt;a href=&#34;http://developer.yahoo.com/searchmonkey/smguide/datarss.html&#34;&gt;DataRSS&lt;/a&gt;&amp;rdquo;. Despite the admonition to &amp;ldquo;avoid using namespaces in your XPATH expressions, as SearchMonkey strips these out&amp;rdquo;, this looked like something I could work with once I get to know the DataRSS format. (There&amp;rsquo;s a schema on that page to use for testing your stylesheet output.)&lt;/p&gt;
&lt;p&gt;So if I point Yahoo at some documents and write a stylesheet that goes through those documents and returns DataRSS, SearchMonkey can use this. I could put RDFa in those documents and have my stylesheet get DataRSS data out of that&amp;hellip; but I could also make up my own BobFooBar format to embed in the HTML and have my stylesheet get DataRSS out of that as well, so I don&amp;rsquo;t really see how this counts as RDFa support.&lt;/p&gt;
&lt;p&gt;The Semantic Web community is still trying to piece together the nature of Google&amp;rsquo;s support of RDFa in HTML documents, and there are things to complain about, but we know that their crawlers will look for some sort of RDFa in HTML documents. This looks like a real step forward for support of standards-based metadata on the web by a major search engine. Perhaps my review of the SearchMonkey options is missing something, but so far I haven&amp;rsquo;t seen anything to show me that what they offer is something for people interested in open web standards to get excited about.&lt;/p&gt;
&lt;p&gt;Again, if I&amp;rsquo;m wrong about any of this, I&amp;rsquo;d be happy to be corrected.&lt;/p&gt;
&lt;h2 id=&#34;9-comments&#34;&gt;9 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-2280&#34;&gt;June 3, 2009 10:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;SearchMonkey is similar to the tripblox concept where other sites provide the RDFa&amp;hellip;search monkey sees it, and therefore can list items in a more meaningful way.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t think there are any RDFa tie ins but microsoft bing has this flavor too. You can type &amp;ldquo;hotels in&amp;rdquo; and you&amp;rsquo;re shopping for hotels on a map, but in a vendor/supplier neutral way.&lt;/p&gt;
&lt;p&gt;So sites like Expedia, Orbitz, Travelocity write software to list travel search results. We know the content is travel related (hotel/air/car) and have custom views for that&amp;hellip;so the search is vertical, a specific domain. Now the horizontal search tools are finding ways to semantically recognize content and list it in horizontal specialized ways. viewzi is another example.&lt;/p&gt;
&lt;p&gt;So the very wide, general implication I see is that search tools are getting better, and allowing users to search supplier agnostic, price compare, and then they arrive at the vertical site ready to make a purchase. RDFa makes it possible for the small fries to be seen by &amp;ldquo;big vertical search&amp;rdquo; and have their results listed in a very meaningful way, for example, a hotel could be listed just as elegantly on search monkey as it&amp;rsquo;s listed on expedia&amp;hellip;and since search monkey gives you expedia/travelocity/orbitz results + the small fry suppliers with RDFa on their site, where do you start searching?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2281&#34;&gt;June 3, 2009 10:22 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Taylor,&lt;/p&gt;
&lt;p&gt;What you&amp;rsquo;re saying in general makes sense to me, but&amp;hellip;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;sites provide the RDFa&amp;hellip;search monkey sees it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I couldn&amp;rsquo;t find evidence that SearchMonkey sees any RDFa besides that which is embedded as attributes in object elements that point to Flash files. Other RDFa use by SearchMonkey depends on XSLT translation of that RDFa to DataRSS, which is what SearchMonkey is really using&amp;hellip; right?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://developer.yahoo.com/searchmonkey&#34; title=&#34;http://developer.yahoo.com/searchmonkey&#34;&gt;Evan Goer&lt;/a&gt; on &lt;a href=&#34;#comment-2282&#34;&gt;June 3, 2009 11:07 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hello Bob,&lt;/p&gt;
&lt;p&gt;Rest assured, SearchMonkey does see the RDFa you add to a page. When the Yahoo! crawler hits your page, we extract any valid RDFa we find. For each URL, we store that data as a chunk of DataRSS XML. DataRSS is our way of normalizing between all the different types of structured data we might have for a page: RDFa, eRDF, various microformats, feeds, Delicious data, anything else.&lt;/p&gt;
&lt;p&gt;If the DataRSS on a URL matches a pattern that we&amp;rsquo;re expecting, then we automatically display that URL as an enhanced result &amp;ndash; that&amp;rsquo;s our Flash video/documents/games functionality. Google Rich Snippets is the same thing, but for different use cases (like reviews, etc.) Rest assured, both teams are working to add more. :)&lt;/p&gt;
&lt;p&gt;For arbitrary RDFa where we don&amp;rsquo;t have an automatic presentation, you can use SearchMonkey to create a custom presentation. The SearchMonkey developer tool allows you to build a little PHP app that digs into the DataRSS XML using XPATH and tells Yahoo! Search how to display that data.&lt;/p&gt;
&lt;p&gt;Note that you do not have to write any XSLT to use RDFa. You&amp;rsquo;re right that if you create a BobFooBar format in your HTML, then we don&amp;rsquo;t understand that format at all. Which means if you want to get at it using SearchMonkey, yes, you would have to build what we call an &amp;ldquo;XSLT Custom Data Service.&amp;rdquo; But if you use RDFa, a format we do understand &amp;ndash; then we are essentially running that XSLT for you, at index time.&lt;/p&gt;
&lt;p&gt;Finally, you can also call our BOSS Search APIs and get all our RDFa + other structured data back as DataRSS XML or RDF/XML (your choice). Basically, Yahoo! crawls the web harvesting structured data, and you can use BOSS to reflect that data back at you.&lt;/p&gt;
&lt;p&gt;Best,&lt;/p&gt;
&lt;p&gt;Evan Goer&lt;br /&gt;
Yahoo! SearchMonkey Team&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2283&#34;&gt;June 3, 2009 11:22 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Evan, this sounds more promising.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For arbitrary RDFa where we don&amp;rsquo;t have an automatic presentation&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I assume that the RDFa where you do have an automatic presentation is a set of names from specific namespaces, e.g. dc:creator. Is this set documented somewhere? I get the impression from what you write that I can embed RDFa using these names as predicates into an HTML document, and that this metadata may show up as part of a search result.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;you can also call our BOSS Search APIs and get all our RDFa + other&lt;br /&gt;
structured data back as DataRSS XML&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If dc:creator is part of the set documented above, would this let me query the documents for which you have DataRSS metadata stored for dc:creator=&amp;lsquo;Tim Berners-Lee&amp;rsquo; and have the documents returned if they&amp;rsquo;re there? Including HTML documents as described above?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://developer.yahoo.com/searchmonkey&#34; title=&#34;http://developer.yahoo.com/searchmonkey&#34;&gt;Evan Goer&lt;/a&gt; on &lt;a href=&#34;#comment-2284&#34;&gt;June 3, 2009 11:49 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s right, the &lt;a href=&#34;http://developer.search.yahoo.com/start&#34;&gt;automatic SearchMonkey presentations&lt;/a&gt; are triggered off of certain namespaces. For example, you can trigger a Video result using media:video and media:thumbnail. You can also change the title, abstract, etc. by including a dc:title or dc:description.&lt;/p&gt;
&lt;p&gt;Viewing the metadata in search results: well, beyond fancy presentations, what we&amp;rsquo;ve got right now are &lt;a href=&#34;http://developer.yahoo.net/blog/archives/2008/12/monkey_finds_microformats_and_rdf.html&#34;&gt;some very crude filters&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With BOSS, you could create something slightly more powerful. You could say, &amp;ldquo;give me the top 100 results that have RDFa &lt;em&gt;and&lt;/em&gt; have the term &amp;lsquo;Tim Berners-Lee&amp;rsquo;&amp;rdquo;. Then your BOSS app could sift through these results and return the URLs that have a dc:creator=&amp;lsquo;Tim Berners-Lee&amp;rsquo;. But we don&amp;rsquo;t yet support arbitrary SPARQL queries into the Yahoo! Search index. That&amp;rsquo;s more like the &amp;ldquo;&lt;a href=&#34;http://searchengineland.com/yahoo-were-moving-from-web-of-pages-to-web-of-objects-19524&#34;&gt;Web Of Objects&lt;/a&gt;&amp;rdquo; that our execs were talking about last month.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2285&#34;&gt;June 3, 2009 12:59 PM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the automatic SearchMonkey presentations are triggered off of certain&lt;br /&gt;
namespaces&amp;hellip;.media:video&amp;hellip; media:thumbnail&amp;hellip; dc:title&amp;hellip; dc:description.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Is there a comprehensive list of these namespaces and properties somewhere?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But we don&amp;rsquo;t yet support arbitrary SPARQL queries into the&lt;br /&gt;
Yahoo! Search index.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That would be cool, but I think it would be much simpler to simply allow queries that return documents that have the RDFa equivalent of (&amp;gt;, p:foo, &amp;ldquo;bar&amp;rdquo;) in them. You tell us what p:foo predicates we can use, we specify &amp;ldquo;bar&amp;rdquo;, and you return each document that has p:foo=&amp;ldquo;bar&amp;rdquo; in it.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://developer.yahoo.com/searchmonkey&#34; title=&#34;http://developer.yahoo.com/searchmonkey&#34;&gt;Evan Goer&lt;/a&gt; on &lt;a href=&#34;#comment-2286&#34;&gt;June 3, 2009 1:26 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For the automatic SearchMonkey presentations, all the namespaces and properties are scattered across the different documentation pages under &lt;a href=&#34;http://developer.yahoo.search.com/start.&#34;&gt;http://developer.yahoo.search.com/start.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As for supporting a simpler query syntax: I&amp;rsquo;ll bring it up to our architect!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2287&#34;&gt;June 3, 2009 1:50 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m guessing that you meant &lt;a href=&#34;http://developer.search.yahoo.com/start&#34;&gt;http://developer.search.yahoo.com/start&lt;/a&gt; and not &lt;a href=&#34;http://developer.yahoo.search.com/start.&#34;&gt;http://developer.yahoo.search.com/start.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Compiling those namespaces and properties into a single document would be a big boost to usage of SearchMonkey by the semantic web community considering how little work it would be.&lt;/p&gt;
&lt;p&gt;Thanks again!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.yourholylandstore.com&#34; title=&#34;http://www.yourholylandstore.com&#34;&gt;Yarmulka&lt;/a&gt; on &lt;a href=&#34;#comment-2361&#34;&gt;October 28, 2009 8:03 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Just added searchmonkey product objects to our pages. Yahoo tells that products are found but we don&amp;rsquo;t see it in search results. Very strange.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>&#34;Semantic Web for the Working Ontologist&#34;</title>
      <link>https://www.bobdc.com/blog/semantic-web-for-the-working-o/</link>
      <pubDate>Wed, 27 May 2009 09:36:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-for-the-working-o/</guid>
      
      
<description><div>And for anyone interested in working with ontologies.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0123735564/bobducharmeA/&#34;&gt;&lt;img id=&#34;id202742&#34; src=&#34;http://ecx.images-amazon.com/images/I/51VH71S80XL.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[&amp;ldquo;Semantic Web for the Working Ontologist&amp;rdquo; cover]&#34; width=&#34;160px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently finished Dean Allemang and Jim Hendler&amp;rsquo;s book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0123735564/bobducharmeA/&#34;&gt;Semantic Web for the Working Ontologist&lt;/a&gt;, and I strongly recommend it to anyone interested in OWL, RDF, or the Semantic Web. I&amp;rsquo;m surprised that their publishers even agreed to the title; there may be some people who look at the book&amp;rsquo;s title and say &amp;ldquo;Hey, I&amp;rsquo;m a working ontologist, so I need that book!&amp;rdquo;, but I think that it would benefit a much wider audience: not just people who consider themselves working ontologists, but anyone who needs to work with standards-based ontologies or with people who do.&lt;/p&gt;
&lt;p&gt;The book describes many modeling issues and then shows how to work through them using concrete examples that are explained well enough to generalize them to other domains. Anyone who reads this book and then works with ontologies will come back to it saying to themselves &amp;ldquo;I know I saw something in here about how to handle this particular information relationship&amp;hellip;&amp;rdquo; Examples are not presented as working code per se, but there are many examples showing a set of triples, a few RDFS and/or OWL statements, and the resulting new triples implied by the combination. Many of these examples made me want to type them into a text editor, run them through &lt;a href=&#34;http://clarkparsia.com/pellet&#34;&gt;Pellet&lt;/a&gt;, and then start modifying the examples to see what happened, because to me, those implied triples are the &lt;a href=&#34;https://www.bobdc.com/blog/adding-metadata-value-with-pel&#34;&gt;coolest part&lt;/a&gt; of OWL: the new facts that you get out of an existing set of facts by adding metadata.&lt;/p&gt;
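&lt;p&gt;The pattern in those examples can be sketched in a few lines of plain Python. This is my own illustration, not code from the book, and the vocabulary names are made up: given some instance triples and a couple of rdfs:subClassOf statements, applying the type-propagation rule to a fixed point yields the implied triples.&lt;/p&gt;

```python
# A minimal sketch of RDFS-style type propagation; the names
# (ex:Spot, ex:Dog, ...) are invented for illustration.

facts = {("ex:Spot", "rdf:type", "ex:Dog")}
schema = {
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
}

def entail(facts, schema):
    """Apply the rdf:type / rdfs:subClassOf rule until a fixed point."""
    triples = set(facts) | set(schema)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in triples:
            if p == "rdf:type":
                for c, p2, sup in triples:
                    if p2 == "rdfs:subClassOf" and c == o:
                        new.add((s, "rdf:type", sup))
        if not new.issubset(triples):
            triples = triples | new
            changed = True
    return triples

implied = entail(facts, schema) - facts - schema
# implied holds ("ex:Spot", "rdf:type", "ex:Mammal")
# and ("ex:Spot", "rdf:type", "ex:Animal")
```

&lt;p&gt;Pellet and other reasoners do far more than this, of course, but the shape of the computation is the same: existing facts plus schema metadata yield new facts.&lt;/p&gt;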
&lt;p&gt;I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/rdfs-without-rdfowl&#34;&gt;wondered before&lt;/a&gt; about what good &lt;a href=&#34;http://www.w3.org/TR/rdf-schema/&#34;&gt;RDFS&lt;/a&gt; was without OWL. I started to get a better appreciation for the possibilities when I &lt;a href=&#34;http://www.snee.com/bobdc.blog/2009/02/getting-started-with-sesame.html&#34;&gt;played a bit with Sesame&lt;/a&gt;, and Dean and Jim&amp;rsquo;s book gave me a much better idea of what you can do with RDFS when you don&amp;rsquo;t have OWL support, so there&amp;rsquo;s a reason for Sesame developers to get the book.&lt;/p&gt;
&lt;p&gt;In addition to showing people who are dabbling with Semantic Web technologies how to get deeper into the technology, the book does an especially good job of showing experienced software developers which aspects of Semantic Web development are different from what they&amp;rsquo;re used to and why these differences open up new possibilities instead of limiting them. For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The ability in OWL to infer class relationships is a severe departure from Object Oriented modeling. In OO modeling the class structure forms the backbone of the model&amp;rsquo;s organization. All instances are created as members of some class, and their behavior is specified by the class structure. Changes to the class structure have far-reaching impact on the behavior of the system. In OWL, it is possible for the class structure to change as more information is learned about classes or individuals.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And this is a Good Thing! Got that, OO folks? If not, there&amp;rsquo;s plenty more in the book to demonstrate this to you. For example, an early chapter in the book asks &amp;ldquo;How can we accommodate variation of sources if we can&amp;rsquo;t structure the entities they are describing into a class model? The Semantic Web provides an elegant solution to this problem&amp;hellip; any model can be built up from contributions from multiple sources&amp;rdquo;. Or this: &amp;ldquo;it is never accurate in the Semantic Web to say that a property is &amp;lsquo;defined for a class.&amp;rsquo; A property is defined independently of any class, and the RDFS relations specify which inferences can be correctly made about it in particular contexts.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Some great advice for all software developers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;you might think that modeling for reuse is best done by anticipating &lt;em&gt;everything&lt;/em&gt; that someone might want to use your model for, and thus the more you include the better. This is a mistake because the more you put in, the more you restrict someone else&amp;rsquo;s ability to extend your model instead of just use it as is. Reuse is best done, as in other systems, by designing to maximize future combination with other things, not to restrict it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Closing the book with chapters such as &amp;ldquo;Using OWL in the Wild&amp;rdquo;, &amp;ldquo;Good and Bad Modeling Practices&amp;rdquo;, and a &amp;ldquo;Frequently Asked Questions&amp;rdquo; appendix helps even more to connect theory to practice, and the final chapter&amp;rsquo;s &amp;ldquo;Beyond OWL 1.0&amp;rdquo; section shows what deficiencies the experts currently see in OWL and what kind of new features a future release might offer us. All in all, whether you are strongly interested in OWL and the Semantic Web or even just a little curious, this book will give you a solid grounding in both the theory and practice of what the technology can bring to new applications that you might be working with.&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.workingontologist.org/&#34; title=&#34;http://www.workingontologist.org/&#34;&gt;Dean Allemang&lt;/a&gt; on &lt;a href=&#34;#comment-2272&#34;&gt;May 27, 2009 10:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks for the review, Bob!&lt;/p&gt;
&lt;p&gt;A resource you might not be aware of includes source code for most of the examples (soon to come: an ontology browser that will let you examine them and play with inferencing).&lt;/p&gt;
&lt;p&gt;Check it out at &lt;a href=&#34;http://www.workingontologist.org/&#34;&gt;WorkingOntologist.org&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you find errata (are you using the second printing or first printing?), please record them there, as well.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.proxml.be/&#34; title=&#34;http://www.proxml.be/&#34;&gt;Paul Hermans&lt;/a&gt; on &lt;a href=&#34;#comment-2273&#34;&gt;May 28, 2009 3:01 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I agree completely.&lt;br /&gt;
It is only a shame that the book was published with so many errors in the code and figures.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-2274&#34;&gt;May 28, 2009 10:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Got that, OO folks?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Bob, I would say that even those of us who say we&amp;rsquo;ve got it don&amp;rsquo;t; it&amp;rsquo;s profoundly different. The way I keep things straight is to classify OWL as POP, or Property Oriented Programming. Classes don&amp;rsquo;t have properties&amp;hellip;properties have classes by way of range/domain. Even then I still find myself confused and making invalid assumptions based on my OOP background&amp;hellip;in other words, for some it may be simple, but I caution anybody against making quick comparisons between OWL classification and OOP subclassing. It&amp;rsquo;s required me to think hard and ask questions and get feedback when I get lost.&lt;/p&gt;
&lt;p&gt;By Erik Hennum on &lt;a href=&#34;#comment-2275&#34;&gt;May 28, 2009 4:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;To shake up the OO mindset (for what it&amp;rsquo;s worth), the papers at &lt;a href=&#34;http://www.w3.org/TR/sw-oosd-primer/&#34;&gt;http://www.w3.org/TR/sw-oosd-primer/&lt;/a&gt; and &lt;a href=&#34;http://www.hpl.hp.com/techreports/2005/HPL-2005-189.pdf&#34;&gt;http://www.hpl.hp.com/techreports/2005/HPL-2005-189.pdf&lt;/a&gt; have been helpful to me. It sounds like this book goes much deeper; thanks for the alert.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2276&#34;&gt;May 28, 2009 4:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Erik, I had no idea that that W3C paper was even there. It looks very useful.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://sites.google.com/site/rickcreamer&#34; title=&#34;http://sites.google.com/site/rickcreamer&#34;&gt;Rick&lt;/a&gt; on &lt;a href=&#34;#comment-2278&#34;&gt;May 31, 2009 10:55 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, thank you for posting this review! I am trying to budget the time to buy and study this book. I still have questions in the SW vs. OO area! I will certainly read the W3C SW vs. OO paper. I also have questions as to how to properly model statement metadata in such a way as not to make my triple store incompatible with other tools such as inference engines. Finally, I would like to know how to properly model higher-order predicates. I guess you could call these SW &amp;ldquo;recipes&amp;rdquo;, or &amp;ldquo;best practices&amp;rdquo;, or SW patterns + anti-patterns. I hope this book covers some of these topics.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
      <category domain="https://www.bobdc.com//categories/book-reviews">book reviews</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Writing about the Semantic Web</title>
      <link>https://www.bobdc.com/blog/writing-about-the-semantic-web/</link>
      <pubDate>Fri, 22 May 2009 09:51:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/writing-about-the-semantic-web/</guid>
      
      
      <description><div>And Linked Data, and RDF, and RDFa, and SPARQL, and OWL, and...</div><div>&lt;p&gt;After writing a few paid articles and doing a lot of blogging about various issues, features, and trends surrounding the Semantic Web, Linked Data, RDF, RDFa, SPARQL, OWL, and related tools and implementations, I thought it would be nice if I could tie them together into something resembling a cohesive whole. So, I wrote a short essay titled &lt;a href=&#34;http://www.snee.com/rdf/semweboverview.html&#34;&gt;RDF, The Semantic Web, and Linked Data&lt;/a&gt; with over 70 footnote links to these various pieces. It will be a handy reference for me in the future, and I hope it may be for others as well.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/words/&#34; title=&#34;http://danbri.org/words/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2266&#34;&gt;May 23, 2009 6:08 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice overview! Would you consider mentioning SKOS, FOAF and Dublin Core in any revisions? RDF isn&amp;rsquo;t so interesting without vocabularies and public data using them&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.contextin.com&#34; title=&#34;http://www.contextin.com&#34;&gt;Ben Stein&lt;/a&gt; on &lt;a href=&#34;#comment-2267&#34;&gt;May 23, 2009 8:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Good essay Bob, was really interesting to read.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re interested in semantic web technologies, I&amp;rsquo;d like to refer you to &lt;a href=&#34;http://www.urlclassifier.com&#34;&gt;http://www.urlclassifier.com&lt;/a&gt;, a web service that uses NLP and statistical methods to extract the main topics discussed on web pages,&lt;br /&gt;
using ContextIn Semantic Web algorithms.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2268&#34;&gt;May 23, 2009 6:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Dan,&lt;/p&gt;
&lt;p&gt;Good idea, I will work those in.&lt;/p&gt;
&lt;p&gt;Ben,&lt;/p&gt;
&lt;p&gt;Note the part of the essay that says &amp;ldquo;I find it useful to think of the Semantic Web as being the Linked Data web with the addition of standards-based semantics encoded to help you get more out of that data. As the idea of &amp;lsquo;semantics&amp;rsquo; becomes a buzzword for selling web-based technology, the &amp;lsquo;standards-based&amp;rsquo; part of this becomes more important&amp;rdquo;. Can you tell us more about urlclassifier.com&amp;rsquo;s relationship to W3C semantic web technology standards such as RDF, SPARQL, and OWL?&lt;/p&gt;
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://twitter.com/sarahebourne&#34; title=&#34;http://twitter.com/sarahebourne&#34;&gt;Sarah Bourne&lt;/a&gt; on &lt;a href=&#34;#comment-2270&#34;&gt;May 26, 2009 1:33 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One of the hardest parts of convincing people of the value of the Semantic Web is explaining it in plain English. This essay is a solid contribution to that goal. Thank you for sharing it!&lt;/p&gt;
&lt;p&gt;By Dean Allemang on &lt;a href=&#34;#comment-2271&#34;&gt;May 26, 2009 4:39 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you ever have a tough day, and need a break, check out &lt;a href=&#34;http://www.dailypuppy.com/&#34;&gt;The Daily Puppy&lt;/a&gt; :)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Google and RDFa: what and why</title>
      <link>https://www.bobdc.com/blog/google-and-rdfa-what-and-why/</link>
      <pubDate>Fri, 15 May 2009 19:57:29 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/google-and-rdfa-what-and-why/</guid>
      
      
      <description><div>Surprise—to make more money!</div><div>&lt;p&gt;After the &lt;a href=&#34;http://search.twitter.com/search?q=%23google+%23rdfa&#34;&gt;initial burst&lt;/a&gt; of discussion about Google &lt;a href=&#34;http://google.com/support/webmasters/bin/topic.py?topic=21997&#34;&gt;putting their toe into the standardized metadata water&lt;/a&gt;, I started wondering about the corner of the pool they had chosen. They&amp;rsquo;re not ready to start parsing any old RDFa; they&amp;rsquo;ll be looking for RDFa that uses the &lt;a href=&#34;http://rdf.data-vocabulary.org/rdf.xml&#34;&gt;vocabulary&lt;/a&gt; they somewhat hastily defined for the purpose. Why does the vocabulary define the properties that it defines?&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?answer=146646&#34;&gt;People&lt;/a&gt; properties sound basic enough, although as all the semweb geeks have already tweeted, Google should have leveraged the extensive existing work done on the &lt;a href=&#34;http://www.foaf-project.org/&#34;&gt;FOAF&lt;/a&gt; vocabulary for that. The other three categories of properties they define are &lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?answer=146645&#34;&gt;Reviews&lt;/a&gt;, &lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?answer=146750&#34;&gt;Products&lt;/a&gt;, and &lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?answer=146861&#34;&gt;Businesses and organizations&lt;/a&gt;. Of all the knowledge domains to represent, why these?&lt;/p&gt;
&lt;blockquote id=&#34;id202818&#34; class=&#34;pullquote&#34;&gt;In the words of Drupal project lead Dries Buytaert, &#34;Structured data is the new search engine optimization&#34;.&lt;/blockquote&gt;
&lt;p&gt;Comparing a given Google project to the big picture of all their projects can be overwhelming, but there&amp;rsquo;s no need to when you remember what their core business is: putting ads next to search results and charging for the ads when they get clicked. The more relevant the ads are to the content next to them, the more likely they are to get clicked, and the more money Google makes.&lt;/p&gt;
&lt;p&gt;In a blog post titled &lt;a href=&#34;https://www.bobdc.com/blog/the-future-of-rdfa&#34;&gt;The future of RDFa&lt;/a&gt; in February of last year, I wrote that &amp;ldquo;Pricing is&amp;hellip; a huge area where people would be happy to give away data in the form of extra embedded metadata in their web pages, because it can drive new paying customers to the source of that data&amp;rdquo;. Google wants that data to help people sell more stuff and make more money themselves. The kind of metadata that would be embedded in reviews and information about products and companies—especially the category, brand, and price properties, and the detailed metadata that can be included in reviews—can make it much easier for Google to find users who are using their search engine to research things they&amp;rsquo;re interested in buying.&lt;/p&gt;
&lt;p&gt;It will be interesting to see how the big hustling SEO world adapts to this. In the words of Drupal project lead Dries Buytaert, &lt;a href=&#34;http://buytaert.net/structured-data-is-the-new-search-engine-optimization&#34;&gt;Structured data is the new search engine optimization&lt;/a&gt;. When he writes &amp;ldquo;Every webmaster wanting to improve click-through rates, reduce bounce rates, and improve conversion rates, can no longer ignore RDFa or Microformats&amp;rdquo;, it reminds me that when the SEO world eventually gravitates more in the RDFa direction or the microformats direction, these very quantitative, results-driven people will have some real data to explain why. I&amp;rsquo;ll have to start searching their voluminous discussions out there to see what people are saying.&lt;/p&gt;
&lt;p&gt;Some other miscellaneous notes on Google and RDFa:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For now, Google isn&amp;rsquo;t going to look for this markup in all the data they crawl. As far as I can tell, they want you to &lt;a href=&#34;http://www.google.com/support/webmasters/bin/request.py?contact_type=rich_snippets_feedback&#34;&gt;nominate your own site&lt;/a&gt; to be crawled and parsed for the extra metadata.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s nice that Google encourages people to add a proper namespace declaration of xmlns:v=&amp;quot;&lt;a href=&#34;http://rdf.data-vocabulary.org/&#34;&gt;http://rdf.data-vocabulary.org/&lt;/a&gt;&amp;quot; to a web page before adding properties such as v:reviewer and v:description. They even make this their number one &amp;ldquo;&lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?answer=146898&#34;&gt;important property&lt;/a&gt;&amp;rdquo;. But when they parse a document that may contain this metadata, will they check for xmlns:v=&amp;quot;http://rdf.data-vocabulary.org/&amp;quot; and then only look for v:reviewer and the other properties if they find it? Or, if they see xmlns:foo=&amp;quot;http://rdf.data-vocabulary.org/&amp;quot;, will they look for foo:reviewer and the other properties from their namespace even though the document doesn&amp;rsquo;t use the prefix from Google&amp;rsquo;s demo?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;They &lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?hl=en&amp;amp;answer=146898&#34;&gt;point to&lt;/a&gt; the &amp;ldquo;official&amp;rdquo; W3C &lt;a href=&#34;http://www.w3.org/TR/xhtml-rdfa-primer/&#34;&gt;RDFa Primer&lt;/a&gt;. (It was a pleasant surprise to be reminded that the Primer&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/TR/xhtml-rdfa-primer/#id85528&#34;&gt;acknowledgments&lt;/a&gt; mention me for &amp;ldquo;reviewing the work and providing useful commentary&amp;rdquo;.) Even if Google&amp;rsquo;s implementation of this will only deal with a limited vocabulary, from what I can see they&amp;rsquo;re not subsetting the standard itself, like Adobe did with their XMP &amp;ldquo;profile&amp;rdquo; of RDF.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Google does see the semantic web world beyond what&amp;rsquo;s defined in their ontology. According to the &lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?answer=146645&#34;&gt;Reviews&lt;/a&gt; page, &amp;ldquo;You can use the additional expressiveness of RDFa to provide more information about the subject of your review. Google does not currently use the &lt;code&gt;about&lt;/code&gt; property in search results, but it may be used in the future&amp;rdquo;. Building on this, they reassure the reader about an issue that often confuses those who are new to the use of URIs as identifiers instead of just being URLs: &amp;ldquo;If the object you&amp;rsquo;re referring to does not have an obvious URL to include, you could use the URL of pages on Wikipedia or similar web sources&amp;rdquo;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It was nice to see how quickly a community effort led by Kingsley Idehen put together &lt;a href=&#34;http://purl.org/NET/googlevocab#&#34;&gt;an ontology&lt;/a&gt; (explore it &lt;a href=&#34;http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fpurl.org%2FNET%2Fgooglevocab%23&amp;amp;sid=60266&#34;&gt;here&lt;/a&gt;) defining relationships between Google&amp;rsquo;s properties and more well-established ones, complete with owl:equivalentProperty properties defined to help clean up the potential mess of the vaguely defined delimiters between the &lt;a href=&#34;http://rdf.data-vocabulary.org&#34;&gt;http://rdf.data-vocabulary.org&lt;/a&gt; URI and each property name. (See &lt;a href=&#34;http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fpurl.org%2FNET%2Fgooglevocab%23nickname&amp;amp;sid=60266&#34;&gt;here&lt;/a&gt;, near the bottom for an example.) This could become a canonical example of the value of ontologies.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
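&lt;p&gt;The prefix question in the second bullet above comes down to how CURIEs get expanded. Here is a sketch, using my own hypothetical helper rather than anything Google has published, of how a prefix-independent parser would treat v:reviewer and foo:reviewer as the same property, because both resolve against the same namespace URI:&lt;/p&gt;

```python
# Sketch of prefix-independent CURIE expansion; expand_curie is a
# hypothetical helper, not part of any real RDFa library.

DATA_VOCAB = "http://rdf.data-vocabulary.org/"

def expand_curie(curie, namespaces):
    """Resolve a prefix:localname pair to a full property URI."""
    prefix, localname = curie.split(":", 1)
    return namespaces[prefix] + localname

# Two documents declare the same namespace under different prefixes:
doc_a = {"v": DATA_VOCAB}
doc_b = {"foo": DATA_VOCAB}

same = expand_curie("v:reviewer", doc_a) == expand_curie("foo:reviewer", doc_b)
# same is True: both resolve to http://rdf.data-vocabulary.org/reviewer
```

&lt;p&gt;A parser that works this way matches on the namespace URI, not on the literal prefix string, which is how RDFa is meant to work; whether Google&amp;rsquo;s crawler does is exactly the open question.&lt;/p&gt;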
&lt;p&gt;It will be a lot of fun to build apps that use RDFa found by Google&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://clockwerx.blogspot.com/&#34; title=&#34;http://clockwerx.blogspot.com/&#34;&gt;Daniel O&amp;rsquo;Connor&lt;/a&gt; on &lt;a href=&#34;#comment-2260&#34;&gt;May 15, 2009 10:48 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I only wish that I could make blogger output xhtml strict - but I can&amp;rsquo;t, because of how they throw in some iframes and what have you.&lt;/p&gt;
&lt;p&gt;This means I can&amp;rsquo;t swap my doctype over to xhtml+rdfa and weave in their new information properly.&lt;/p&gt;
&lt;p&gt;Annoying.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://webBackplane.com/mark-birbeck&#34; title=&#34;http://webBackplane.com/mark-birbeck&#34;&gt;Mark Birbeck&lt;/a&gt; on &lt;a href=&#34;#comment-2261&#34;&gt;May 16, 2009 2:45 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Daniel,&lt;/p&gt;
&lt;p&gt;The doctype is optional.&lt;/p&gt;
&lt;p&gt;Mark&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://sw-app.org/&#34; title=&#34;http://sw-app.org/&#34;&gt;Michael Hausenblas&lt;/a&gt; on &lt;a href=&#34;#comment-2262&#34;&gt;May 16, 2009 3:20 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Good post, I by and large agree (esp. re semantic SEO) - see also my 2c at [1].&lt;/p&gt;
&lt;p&gt;Cheers,&lt;br /&gt;
Michael&lt;/p&gt;
&lt;p&gt;[1] &lt;a href=&#34;http://lists.w3.org/Archives/Public/public-lod/2009May/0095.html&#34;&gt;http://lists.w3.org/Archives/Public/public-lod/2009May/0095.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.nature.com/&#34; title=&#34;http://www.nature.com/&#34;&gt;Tony Hammond&lt;/a&gt; on &lt;a href=&#34;#comment-2263&#34;&gt;May 16, 2009 8:33 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice post, Bob.&lt;/p&gt;
&lt;p&gt;Re your 2nd bullet, this is really encouraging news. A shame that Google Scholar persists in not making a namespace available for its vocabulary. For an example, see &lt;a href=&#34;http://blogs.nature.com/wp/nascent/2008/05/naturecom_adds_metadata.html&#34;&gt;this post&lt;/a&gt; on Nascent about Nature&amp;rsquo;s inclusion of META tags, and compare the DC and PRISM vocabularies, which have declared schemas, with the Google Scholar tags, which have no declared schema. In fact, I couldn&amp;rsquo;t find any web page for this vocabulary other than &amp;ldquo;contact us&amp;rdquo; type links.&lt;/p&gt;
&lt;p&gt;This new approach to including namespaces is refreshing.&lt;/p&gt;
&lt;p&gt;Tony&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://go-to-hellman.blogspot.com/&#34; title=&#34;http://go-to-hellman.blogspot.com/&#34;&gt;Eric Hellman&lt;/a&gt; on &lt;a href=&#34;#comment-2264&#34;&gt;May 20, 2009 9:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m also disturbed by all the careless mistakes that google has left in their help documentation at &lt;a href=&#34;http://google.com/support/webmasters/bin/answer.py?hl=en&amp;amp;answer=146898&#34;&gt;http://google.com/support/webmasters/bin/answer.py?hl=en&amp;amp;answer=146898&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I have also commented at &lt;a href=&#34;http://www.google.com/support/forum/p/Webmasters/thread?tid=165a6bebc77f2217&amp;amp;hl=en&#34;&gt;http://www.google.com/support/forum/p/Webmasters/thread?tid=165a6bebc77f2217&amp;amp;hl=en&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Who knows what they&amp;rsquo;ve actually implemented.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2265&#34;&gt;May 20, 2009 10:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Eric: gluejar?&lt;/p&gt;
&lt;p&gt;Maybe they&amp;rsquo;re going with a &amp;ldquo;release early, release often&amp;rdquo; strategy and crowdsourcing the QA of the design to those who show an interest, like us&amp;hellip;&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Semantic web technology and humanities research</title>
      <link>https://www.bobdc.com/blog/semantic-web-technology-and-hu/</link>
      <pubDate>Wed, 29 Apr 2009 18:13:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-technology-and-hu/</guid>
      
      
      <description><div>A Canadian historian uses semantic web technology to do interesting research and to lay the groundwork for others to do so.</div><div>&lt;p&gt;I&amp;rsquo;ve attended and given a few &lt;a href=&#34;http://www2.lib.virginia.edu/scholarslab/&#34;&gt;Scholar&amp;rsquo;s Lab&lt;/a&gt; talks at the nearby University of Virginia, and I&amp;rsquo;m kicking myself for missing a recent talk by Mount Allison University&amp;rsquo;s &lt;a href=&#34;http://www.mta.ca/faculty/humanities/classics/Robertson/&#34;&gt;Bruce Robertson&lt;/a&gt;, whose field at Mount Allison is ancient Greek and Roman history. (A podcast of his Scholars Lab talk is available &lt;a href=&#34;http://deimos.apple.com/WebObjects/Core.woa/FeedEnclosure/virginia-public.2014484138.02014484145.2053632731/enclosure.mp3&#34;&gt;here&lt;/a&gt;.) He&amp;rsquo;s the main guy behind the &lt;a href=&#34;http://heml.mta.ca&#34;&gt;Historical Event Markup Linking Project&lt;/a&gt; (HEML) and apparently even the people who brought him to UVa to give his recent talk were surprised at how far he&amp;rsquo;d refocused his XML orientation toward semantic web technologies.&lt;/p&gt;
&lt;p&gt;A few quotes from his presentation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The semantic web stack&amp;hellip; allows a schema to be always growable in a federated way. You can add to my schema and I can&amp;rsquo;t do anything about it, and that&amp;rsquo;s a wonderful, wonderful thing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote id=&#34;id203661&#34; class=&#34;pullquote&#34;&gt;&#34;You can add to my schema and I can&#39;t do anything about it, and that&#39;s a wonderful, wonderful thing&#34;.&lt;/blockquote&gt;
&lt;p&gt;I agree. While extensibility of a given XML DTD or schema must be &lt;a href=&#34;http://snee.com/xml/xml2005/industryschemas.html&#34;&gt;designed into it&lt;/a&gt; from the start, RDFS and OWL schemas allow a lot more flexibility and therefore more possibilities to build on the work of others. On a related note, here&amp;rsquo;s my favorite quote, which was a bit of a lightbulb moment for me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If in the XML world the schema next door is just a stylesheet away, in the RDF world, the schema next door can be reasoned into, so you can include reasoning rules so that the same server is providing data in very many different flavors. I think this is an underexplored and exciting aspect of RDF, that if we have multiple schemas, as we do in the humanities, and we&amp;rsquo;re not going to agree on one, we can just do all of them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I give an XSLT class I like to provide some introductory historical background before I show the first stylesheet. I always say that the main growth driver for XSLT&amp;rsquo;s popularity was that people got tired of waiting for the shareable DTDs that they heard about when XML was first released—they just decided to send and accept whatever XML had the information they needed and then write stylesheets to rename and rearrange that XML to fit into their systems. I never thought of RDF-oriented schemas the same way, but now I realize that they&amp;rsquo;re all that and more, because it&amp;rsquo;s much easier to combine multiple RDFS/OWL schemas for a single application than it is to combine multiple XML schemas/DTDs. (As a side note, I&amp;rsquo;m currently reading Dean Allemang and Jim Hendler&amp;rsquo;s book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0123735564/bobducharmeA/&#34;&gt;Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL&lt;/a&gt; and I&amp;rsquo;m learning a great deal. I&amp;rsquo;m familiar with most of the components of RDFS and OWL that they explain, but their advice on how to put those pieces together has taught me a lot and given me many ideas.)&lt;/p&gt;
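&lt;p&gt;That combinability is easy to see in miniature. In this sketch (the vocabulary names are invented for illustration), a single bridging rdfs:subClassOf triple, which anyone can publish without asking the schema&amp;rsquo;s owner, makes data typed against one schema answerable through a query phrased in the other:&lt;/p&gt;

```python
# Invented example: bridging two independently developed schemas
# with one rdfs:subClassOf triple.

graph = {
    # Data published against the first schema:
    ("ex:Marathon", "rdf:type", "heml:Battle"),
    # The bridge triple, added by a third party:
    ("heml:Battle", "rdfs:subClassOf", "crm:Event"),
}

def instances_of(graph, cls):
    """Everything typed as cls, directly or via one subClassOf hop."""
    subclasses = {s for s, p, o in graph
                  if p == "rdfs:subClassOf" and o == cls}
    subclasses.add(cls)
    return {s for s, p, o in graph
            if p == "rdf:type" and o in subclasses}

events = instances_of(graph, "crm:Event")
# events == {"ex:Marathon"}
```

&lt;p&gt;No stylesheet, no renaming of elements: the second schema&amp;rsquo;s users see the first schema&amp;rsquo;s data as soon as the bridge triple is in the graph.&lt;/p&gt;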
&lt;p&gt;The Semantic Web community is sometimes accused, even from within, of being an echo chamber of tools vendors and open source developers telling each other about their latest features. A corollary issue is that these people must hear more from users about their needs, and Bruce&amp;rsquo;s talk is just the kind of thing they need to hear. His talk that I link to above covers issues such as what went well for him as he built his application, what didn&amp;rsquo;t, the mining of Wikipedia/DBPedia for historical research, issues he found with the representation of time and languages of content&amp;hellip; it&amp;rsquo;s great stuff. Too bad it&amp;rsquo;s too late for him to get on the bill of the &lt;a href=&#34;http://www.semantic-conference.com/&#34;&gt;Semantic Technology&lt;/a&gt; conference; in a &lt;a href=&#34;http://semanticgang.talis.com/2009/04/16/april-2009-the-semantic-web-gang-discuss-vocabularies-and-ontologies/&#34;&gt;recent Semantic Web Gang discussion&lt;/a&gt;, Reuters Clearforest&amp;rsquo;s Tom Tague discussed his hopes that more non-industry people would help make this conference less echoey than it had been in the past. To be honest, he actually said he was hoping to see more &amp;ldquo;business users&amp;rdquo;; perhaps, to get more non-semweb geek perspectives, we should think about how much non-computer science academic people can contribute to the discussion as well. Bruce Robertson is a great example.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>An epub comic book</title>
      <link>https://www.bobdc.com/blog/an-epub-comic-book/</link>
      <pubDate>Tue, 21 Apr 2009 09:59:29 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/an-epub-comic-book/</guid>
      
      
      <description><div>From the golden days of goofy comics.</div><div>&lt;p&gt;&lt;a href=&#34;http://members.fortunecity.com/srca1943/SpotAnn2-2-1.html&#34;&gt;&lt;img id=&#34;id203603&#34; src=&#34;https://www.bobdc.com/img/main/bluebeetle.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;image from &#39;Blue Beetle \#2&#39;&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My brother-in-law works for a company that handles licensing and republishing for a lot of comic strip publishers. Lately we&amp;rsquo;ve been discussing the issues involved with republishing such image-based content as electronic books. (After these discussions, my wife asks &amp;ldquo;How&amp;rsquo;s your sister?&amp;rdquo; and I say &amp;ldquo;Uhh, OK I guess.&amp;rdquo;) I wanted to give it a shot, and found that a Google search on &lt;a href=&#34;http://www.google.com/search?q=public%20domain%20comics&#34;&gt;public domain comics&lt;/a&gt; got plenty of hits.&lt;/p&gt;
&lt;p&gt;I managed to find a pretty cool one called the Blue Beetle, and &lt;a href=&#34;http://members.fortunecity.com/srca1943/SpotAnn2-2-1.html&#34;&gt;issue 2&lt;/a&gt; from 1955 has panels that are all the same size, which helped me to sidestep one of the tougher comics-as-ebooks issues, so I created an &lt;a href=&#34;http://www.snee.com/ebooks/bluebeetle2.epub&#34;&gt;epub Blue Beetle&lt;/a&gt; comic. In the epub version the images are all rotated horizontally, so it really is aimed at smaller devices that you can easily turn 90 degrees such as the Sony Reader and the iPhone. Tests by friends show that they look pretty good on each, although the images are based on scans of 50-year-old cheaply printed paper which probably had a high acid content, so it&amp;rsquo;s not as crisp as it might be. Text on anything but a white background is difficult to read on the iPhone.&lt;/p&gt;
&lt;p&gt;Still, overall, it&amp;rsquo;s pretty cool. Check it out yourself and let me know what you think.&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.gilbane.com/xml&#34; title=&#34;http://www.gilbane.com/xml&#34;&gt;Bill Trippe&lt;/a&gt; on &lt;a href=&#34;#comment-2258&#34;&gt;April 21, 2009 11:09 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You are the man!&lt;/p&gt;
&lt;p&gt;By Erio on &lt;a href=&#34;#comment-2364&#34;&gt;November 11, 2009 4:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;ok, i just downloaded it&amp;hellip; As soon as i get my PRS-600 i&amp;rsquo;ll try it and let you know my opinion&amp;hellip; thank you for the work tho =)&lt;/p&gt;
&lt;p&gt;By Erio on &lt;a href=&#34;#comment-2365&#34;&gt;November 11, 2009 4:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;ok, i just downloaded it&amp;hellip; As soon as i get my PRS-600 i&amp;rsquo;ll try it and let you know my opinion&amp;hellip; thank you for the work tho =)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.marchansenstuff.com&#34; title=&#34;http://www.marchansenstuff.com&#34;&gt;Marc Hansen&lt;/a&gt; on &lt;a href=&#34;#comment-2447&#34;&gt;February 12, 2010 1:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s an epub comic book test &lt;a href=&#34;http://ralphsnart.blogspot.com/2010/02/epub-comic-book-template.html&#34;&gt;here&lt;/a&gt; if anyone wants to try it.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://comictoepub.sourceforge.net/&#34; title=&#34;http://comictoepub.sourceforge.net/&#34;&gt;Will&lt;/a&gt; on &lt;a href=&#34;#comment-2467&#34;&gt;March 16, 2010 11:44 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I wrote a program to convert comic to EPUB. It&amp;rsquo;s free and open source for anyone who wants it. I was getting tired of manually converting them. :)&lt;/p&gt;
&lt;p&gt;The program automatically converts CBR and CBZ files. It also cleans up the scanned images so that they look nicer on an eBook reader. It&amp;rsquo;s all pretty much automatic. Windows only though (sorry), but open source if anyone wants to port it.&lt;/p&gt;
&lt;p&gt;To download it go here:&lt;br /&gt;
&lt;a href=&#34;http://comictoepub.sourceforge.net/&#34;&gt;http://comictoepub.sourceforge.net/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2468&#34;&gt;March 17, 2010 9:16 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Will: looks cool! Can you post links to some sample converted comics?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Expand those shortened URLs before archiving twitter messages</title>
      <link>https://www.bobdc.com/blog/expand-those-shortened-urls-be/</link>
      <pubDate>Tue, 14 Apr 2009 09:51:30 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/expand-those-shortened-urls-be/</guid>
      
      
      <description><div>What if a shortening service goes down?</div><div>&lt;p&gt;People love to talk about the implications of twitter.com going down, but what if a URL-shortening service goes down? When I had trouble getting to &lt;a href=&#34;http://is.gd/&#34;&gt;is.gd&lt;/a&gt; recently, I realized that when they&amp;rsquo;re down tweets referencing is.gd URLs are worthless—and that it wouldn&amp;rsquo;t be too difficult to do something about it before this happens. (I have wondered, though: why doesn&amp;rsquo;t twitter grab some short domain name and offer their own shortening service?) After all, if you&amp;rsquo;re saving any tweets, why save them with a dependency on some potentially fly-by-night point of failure?&lt;/p&gt;
&lt;p&gt;My wrapShortenedURLs.py python script, available at &lt;a href=&#34;http://www.snee.com/xml/twclient/wrapShortenedURLs.py.txt&#34;&gt;http://www.snee.com/xml/twclient/wrapShortenedURLs.py.txt&lt;/a&gt;, looks for URLs from five shortening services (defined in a list at the top of the script, in case you want to add others) and wraps those URLs in an HTML &lt;code&gt;a&lt;/code&gt; element with an &lt;code&gt;href&lt;/code&gt; attribute storing the URL that the shortened URL redirects to. For example, it will turn &amp;lsquo;See &lt;a href=&#34;http://is.gd/p3zb&#34;&gt;http://is.gd/p3zb&lt;/a&gt; for Joseph Beuys fronting a bad German New Wave band&amp;rsquo; into &amp;lsquo;See &lt;a href=&#34;http://www.youtube.com/watch?v=DQ1_ALxGbGk&#34;&gt;&lt;a href=&#34;http://is.gd/p3zb&#34;&gt;http://is.gd/p3zb&lt;/a&gt;&lt;/a&gt; for Joseph Beuys fronting a bad German New Wave band&amp;rsquo;. (When writing the script, tweets with multiple shortened URLs were the difficult part, requiring an upgrade to my skill with Python regular expression functions.)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve tested this with some XML pulled down using the &lt;a href=&#34;http://www.devx.com/webdev/Article/40359/1954&#34;&gt;twitter API&lt;/a&gt; and with a CSV file from &lt;a href=&#34;http://tweetake.com&#34;&gt;tweetake.com&lt;/a&gt;, a service that lets you back up information you&amp;rsquo;ve stored on twitter, and it seems to work fine. I&amp;rsquo;ll be using it with all my archived tweets from tweetake.com from now on, and if I ever write my own twitter archiving routine using the API, this will certainly be a part of it.&lt;/p&gt;
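&lt;p&gt;The core of the approach can be sketched in a few lines of Python. This is just an illustration, not the actual script: the &lt;code&gt;resolve&lt;/code&gt; callable stands in for the HTTP request that follows the redirect, and the list of shortening services shown here may not match the script&amp;rsquo;s.&lt;/p&gt;

```python
import re

# A few shortening services to look for; the real script keeps its own list.
SHORTENER_PATTERN = re.compile(
    r"http://(?:is\.gd|bit\.ly|tinyurl\.com|ow\.ly|tr\.im)/\w+")

def expand_short_urls(text, resolve):
    """Map each shortened URL found in text to its expanded form.

    resolve is a callable that follows the redirect; in real use it
    would issue an HTTP request and read the Location header.
    """
    return dict((url, resolve(url)) for url in SHORTENER_PATTERN.findall(text))
```

Tweets with several shortened URLs fall out naturally, since findall returns every match in the text.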
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://norman.walsh.name/&#34; title=&#34;http://norman.walsh.name/&#34;&gt;Norman Walsh&lt;/a&gt; on &lt;a href=&#34;#comment-2250&#34;&gt;April 14, 2009 10:02 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sure seems like a good idea to me!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.webcomposite.com&#34; title=&#34;http://www.webcomposite.com&#34;&gt;Jim Fuller&lt;/a&gt; on &lt;a href=&#34;#comment-2251&#34;&gt;April 14, 2009 10:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;url shortening is probably turning out to be a bad idea and even though I have use/d it, completely agree&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.martin-probst.com&#34; title=&#34;http://www.martin-probst.com&#34;&gt;Martin Probst&lt;/a&gt; on &lt;a href=&#34;#comment-2252&#34;&gt;April 14, 2009 11:05 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Now you&amp;rsquo;ve made me look at that truly horrible video&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://planb.nicecupoftea.org&#34; title=&#34;http://planb.nicecupoftea.org&#34;&gt;Libby&lt;/a&gt; on &lt;a href=&#34;#comment-2253&#34;&gt;April 14, 2009 7:03 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;fwiw, I wrote a small (and probably not very good) ruby script to expand tinyurls: &lt;a href=&#34;http://planb.nicecupoftea.org/2009/02/02/expand-tinyurls-using-ruby/&#34;&gt;http://planb.nicecupoftea.org/2009/02/02/expand-tinyurls-using-ruby/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.leobard.net&#34; title=&#34;http://www.leobard.net&#34;&gt;leo sauermann&lt;/a&gt; on &lt;a href=&#34;#comment-2256&#34;&gt;April 19, 2009 5:45 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;for what reason would you archive twitter messages? the content is intended to be outdated after a day.&lt;/p&gt;
&lt;p&gt;archive blogs!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2257&#34;&gt;April 19, 2009 5:53 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Leo,&lt;/p&gt;
&lt;p&gt;Of course I archive blogs. Some consider twitter to be a &amp;ldquo;microblog&amp;rdquo;, and since I sometimes use it to mention interesting websites I&amp;rsquo;ve found, I like to archive those as well.&lt;/p&gt;
&lt;p&gt;Many twitter messages are very ephemeral, and many aren&amp;rsquo;t. I usually don&amp;rsquo;t follow people who tweet things like &amp;ldquo;just finished breakfast&amp;rdquo;, because I prefer the ones that say a little more. The people posting those may well consider them worth archiving.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Getting started with AllegroGraph</title>
      <link>https://www.bobdc.com/blog/getting-started-with-allegrogr/</link>
      <pubDate>Wed, 08 Apr 2009 16:50:33 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started-with-allegrogr/</guid>
      
      
      <description><div>Via Python and via HTTP.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.franz.com/agraph/allegrograph/&#34;&gt;&lt;img id=&#34;id203601&#34; src=&#34;http://www.franz.com/images/lg_franz.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The home page of Franz Inc.&amp;rsquo;s &lt;a href=&#34;http://www.franz.com/agraph/allegrograph/&#34;&gt;AllegroGraph RDFStore&lt;/a&gt; calls it &amp;ldquo;a modern, high-performance, persistent RDF graph database&amp;rdquo; that &amp;ldquo;scale[s] to billions of triples while maintaining superior performance&amp;rdquo;. Franz offers a free version that lets you store up to 50 million triples, so I installed and played with release 3.2 of the Windows version. When I tried it, the documentation and examples were not well coordinated with the configuration of the latest release, but Franz&amp;rsquo;s email support was very responsive and helpful, even to a non-paying customer like me. I&amp;rsquo;ve also seen some evidence that they&amp;rsquo;re bringing this documentation up to date.&lt;/p&gt;
&lt;p&gt;For each &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/rdf/triplestores/&#34;&gt;triplestore I&amp;rsquo;ve played with&lt;/a&gt;, I tried to avoid coding and compiling. I didn&amp;rsquo;t see any web interface or command line tool for loading RDF triples into AllegroGraph and then querying the data using SPARQL, so I started with its Python interface and then tried the HTTP interface. I first learned Python several years ago because of all of the RDF-related libraries out there, so I&amp;rsquo;m happy to write some scripts with it. It would be interesting to try AllegroGraph&amp;rsquo;s LISP interface, but my &lt;a href=&#34;http://www.snee.com/bob/worksch.html#i1&#34;&gt;last experience coding in LISP&lt;/a&gt; was some time ago, so there&amp;rsquo;d be some catch-up time.&lt;/p&gt;
&lt;h2 id=&#34;id203672&#34;&gt;The AllegroGraph server&lt;/h2&gt;
&lt;p&gt;AllegroGraph&amp;rsquo;s setup routine configured it to automatically run as a service under Windows. After some early frustration with the Python client, I discovered that this copy of the server was not being started up according to assumptions made by the sample code in AllegroGraph&amp;rsquo;s &lt;a href=&#34;http://www.franz.com/agraph/support/documentation/current/python-tutorial.html&#34;&gt;Python API for AllegroGraph&lt;/a&gt; tutorial. For one thing, a line in the tutorial&amp;rsquo;s first Python script tells the server to open up the &amp;ldquo;ag&amp;rdquo; catalog—according to the tutorial, a repository is another term for an RDF triplestore, and a catalog is a container for a set of repositories—but the server didn&amp;rsquo;t know about this catalog. I shut down the AllegroGraph service (in Windows, from Control Panel/Administrative Tools/Services, right-click &amp;ldquo;AllegroGraph Server&amp;rdquo; and pick &amp;ldquo;Stop&amp;rdquo;) and then started it up from the &lt;code&gt;Program Files\AllegroGraphFJE32&lt;/code&gt; directory with this command, which specifies a directory included with the AllegroGraph distribution as the catalog location:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;AllegroGraphServer --new-http-port 8080 --new-http-catalog doc/agraph-javadoc/com/franz/ag
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This also tells the server to use port 8080, which is where the Python tutorial&amp;rsquo;s sample scripts send their requests.&lt;/p&gt;
&lt;h2 id=&#34;id203727&#34;&gt;A little Python client&lt;/h2&gt;
&lt;p&gt;The AllegroGraphFJE32/doc/server-installation.html file included with the distribution recommends that Windows users use ActiveState&amp;rsquo;s version of Python, which may explain some of my other early problems with the Python interface. I also found mistakes in the Python tutorial&amp;rsquo;s sample code; instead of listing these problems, I&amp;rsquo;ve posted my script, which includes corrected versions of the first few examples, at &lt;a href=&#34;http://www.snee.com/rdf/agdemo.py.txt&#34;&gt;http://www.snee.com/rdf/agdemo.py.txt&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The script creates a repository in the ag catalog, loads the same RDF files that I loaded into other triplestores I&amp;rsquo;ve tried, and sends the server the &amp;ldquo;SELECT DISTINCT ?p WHERE {?s ?p ?o}&amp;rdquo; query I usually use to start any SPARQL session. I commented this Python script where I could, so I won&amp;rsquo;t describe it here. For now, AllegroGraph&amp;rsquo;s documentation of their Python interface is skimpy, but better documentation is on the way. You can learn more about AllegroGraph&amp;rsquo;s Python interface from &lt;a href=&#34;http://kill.devc.at/node/233&#34;&gt;this blog posting&lt;/a&gt; by someone in Austria named &amp;ldquo;Rho&amp;rdquo;. Keep in mind that Rho&amp;rsquo;s examples use release 3.1.1, and apparently improvements to the Python client were an important part of AllegroGraph&amp;rsquo;s upgrade to release 3.2.&lt;/p&gt;
&lt;h2 id=&#34;id203776&#34;&gt;Trying the HTTP interface&lt;/h2&gt;
&lt;p&gt;AllegroGraph&amp;rsquo;s currently available documentation of their HTTP interface provides no examples of complete URLs to send to the server, so it took me some time to work out the correct format, but once I did, it was pretty straightforward to use. (As with the Python interface documentation, I heard that better HTTP interface documentation is on the way.) One other caveat: when I tried this with a recent distribution version of release 3.2, some of these commands didn&amp;rsquo;t work until after I&amp;rsquo;d picked &amp;ldquo;Download AllegroGraph 3.2 Free Java Edition Updates&amp;rdquo; from the AllegroGraph program group on the Windows Start menu.&lt;/p&gt;
&lt;p&gt;AllegroGraph&amp;rsquo;s &lt;a href=&#34;http://agraph.franz.com/support/documentation/current/http-protocol.html&#34;&gt;HTTP interface documentation&lt;/a&gt; says that if you start the server with the -new-http-port option, as I did, then you should use the separate documentation for their &lt;a href=&#34;http://agraph.franz.com/support/documentation/current/new-http-server.html&#34;&gt;new HTTP server&lt;/a&gt;. I used &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;cURL&lt;/a&gt; to send URIs to the server&amp;rsquo;s HTTP interface.&lt;/p&gt;
&lt;p&gt;To list existing repositories, the following query retrieved a &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-XMLres-20080115/&#34;&gt;SPARQL query results XML format&lt;/a&gt; listing with fields for the uri, id, title, readable, and writable status of each repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://localhost:8080/catalogs/ag/repositories
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is an important command, because many others require you to supply a repository id.&lt;/p&gt;
&lt;p&gt;This next command successfully created a new repository with an id of test1 (all curl commands were actually issued as one line; I added line breaks here for readability):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X PUT -H &amp;quot;content-type: application/x-www-form-urlencoded; accept: */*&amp;quot; 
  http://localhost:8080/catalogs/ag/repositories/test1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time I tried it I saw no response, but the second time I was told &amp;ldquo;there is already a store named &amp;rsquo;test1&amp;rsquo;&amp;rdquo;, which was good news.&lt;/p&gt;
&lt;p&gt;The following command added triples from the indicated disk file to the test1 repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -T \bob\dev\xml\rdf\fakeaddrbookpt1.rdf -H &amp;quot;content-type: application/rdf+xml&amp;quot;
  http://localhost:8080/catalogs/ag/repositories/test1/statements
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;(April 9th correction: when I posted this entry yesterday, the preceding command and the remainder of this paragraph had the POST and PUT references backwards, so I just fixed them.)&lt;/em&gt; I found that without that &amp;ldquo;-X POST&amp;rdquo; in the command line, either the server or curl assumed that I was PUTting data. An HTTP PUT replaces any existing data in the repository, so if you want to add several files to the same repository, make sure to explicitly POST them there.&lt;/p&gt;
&lt;p&gt;The next command sent an &lt;a href=&#34;http://www.xs4all.nl/~jlpoutre/BoT/Javascript/Utils/endecode.html&#34;&gt;escaped&lt;/a&gt; SPARQL query to the server, which sent back a SPARQL query result format list of the predicates used in the data that I had loaded:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept: application/sparql-results+xml&amp;quot; 
  http://localhost:8080/catalogs/ag/repositories/test1?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D
&lt;/code&gt;&lt;/pre&gt;
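&lt;p&gt;Rather than escaping queries by hand with a web form, you can produce the same percent-encoding in a couple of lines of Python (shown here with Python 3&amp;rsquo;s standard library; any URL-encoding utility gives the same result):&lt;/p&gt;

```python
from urllib.parse import quote

# Percent-encode the usual starter query for use in the ?query= parameter.
sparql = "SELECT DISTINCT ?p WHERE {?s ?p ?o}"
encoded = quote(sparql, safe="")
print(encoded)  # SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D
```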
&lt;h2 id=&#34;id203919&#34;&gt;Querying sets of named graphs&lt;/h2&gt;
&lt;p&gt;Using the HTTP interface, I also managed to reproduce my experiment with named graphs described at &lt;a href=&#34;https://www.bobdc.com/blog/querying-a-set-of-named-rdf-gr&#34;&gt;Querying a set of named RDF graphs without naming the graphs&lt;/a&gt;. (See that posting for background on what I was trying to accomplish, the sample data files, and the queries I used. And, if you&amp;rsquo;re interested in named graphs, don&amp;rsquo;t miss the discussion between Paula Gearon, Lee Feigenbaum, and Andy Seaborne in the &lt;a href=&#34;http://www.snee.com/bobdc.blog/2009/03/querying-a-set-of-named-rdf-gr.html#comments&#34;&gt;comments&lt;/a&gt; section of that post.) Following the steps described there, I first loaded the mybluegraph.rdf file into the graph named &lt;a href=&#34;http://www.snee.com/ng/mybluegraph.rdf&#34;&gt;http://www.snee.com/ng/mybluegraph.rdf&lt;/a&gt; (or, in AllegroGraph terms, into the context named &lt;a href=&#34;http://www.snee.com/ng/mybluegraph.rdf&#34;&gt;http://www.snee.com/ng/mybluegraph.rdf&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -T \bob\dev\xml\rdf\sparql\namedgraphs\mybluegraph.rdf 
  -H &amp;quot;Content-Type: application/rdf+xml&amp;quot; 
  http://localhost:8080/catalogs/ag/repositories/test1/statements?context=%3Chttp%3A%2F%2Fwww.snee.com%2Fng%2Fmybluegraph.rdf%3E
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I loaded myredgraph.rdf into the &lt;a href=&#34;http://www.snee.com/ng/myredgraph.rdf&#34;&gt;http://www.snee.com/ng/myredgraph.rdf&lt;/a&gt; graph with a similar command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -T \bob\dev\xml\rdf\sparql\namedgraphs\myredgraph.rdf 
  -H &amp;quot;Content-Type: application/rdf+xml&amp;quot; 
  http://localhost:8080/catalogs/ag/repositories/test1/statements?context=%3Chttp%3A%2F%2Fwww.snee.com%2Fng%2Fmyredgraph.rdf%3E
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I loaded mygreengraph.rdf without specifying a graph in which to load it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -X POST -T \bob\dev\xml\rdf\sparql\namedgraphs\mygreengraph.rdf 
  -H &amp;quot;Content-Type: application/rdf+xml&amp;quot; 
  http://localhost:8080/catalogs/ag/repositories/test1/statements
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A query for all dc:title values retrieved them from all three files,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept: application/sparql-results+xml&amp;quot; http://localhost:8080/catalogs/ag/repositories/test1?query=PREFIX%20dc%3A%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%20select%20%3Ftitle%20WHERE%20%7B%3Fs%20dc%3Atitle%20%3Ftitle%7D%0A
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;but a query for dc:title values from graphs that were subgraphs of &lt;a href=&#34;http://www.snee.com/ng/mygraph.rdf&#34;&gt;http://www.snee.com/ng/mygraph.rdf&lt;/a&gt; only retrieved the redgraph and bluegraph ones, just as I&amp;rsquo;d hoped:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept: application/sparql-results+xml&amp;quot; 
  http://localhost:8080/catalogs/ag/repositories/test1?query=PREFIX%20dc%3A%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%20PREFIX%20rdfg%3A%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F03%2Ftrix%2Frdfg-1%2F%3E%0Aselect%20%3Ftitle%20WHERE%20%7B%20%3Fg%20rdfg%3AsubGraphOf%20%3Chttp%3A%2F%2Fwww.snee.com%2Fng%2Fmygraph.rdf%3E%20GRAPH%20%3Fg%20%7B%3Fs%20dc%3Atitle%20%3Ftitle%7D%20%7D%0A
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As one of the commercial triplestores, AllegroGraph looks very scalable, and as I mentioned, their support is very good. Franz has been holding some &lt;a href=&#34;http://www.franz.com/agraph/&#34;&gt;webinars&lt;/a&gt; about large-scale applications of their server lately, and an upcoming one on &lt;a href=&#34;http://www.franz.com/agraph/services/conferences_seminars/semantic_technologies_v17.lhtml&#34;&gt;Solving Scale and Reasoning in Large RDF Datasets&lt;/a&gt; looks interesting; Franz distributes the &lt;a href=&#34;http://www.franz.com/agraph/racer/&#34;&gt;Racer&lt;/a&gt; Description Logics reasoner in much of the world, so I assume that it will play a role in this reasoning application.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2247&#34;&gt;April 9, 2009 4:42 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks for the writeup! I never got any further than the impression I needed to write java code to talk to the db.&lt;/p&gt;
&lt;p&gt;Have you figured out any of the social network analysis stuff? &lt;a href=&#34;http://danbri.org/words/2008/06/02/327&#34;&gt;http://danbri.org/words/2008/06/02/327&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.franz.com/agraph/support/documentation/3.0/reference-guide.html#header3-65&#34;&gt;http://www.franz.com/agraph/support/documentation/3.0/reference-guide.html#header3-65&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;eg. can i fill it full of mail headers and foaf and do clustering to find out which groups and lists are interconnected?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.sxnee.com/bobdc.blog&#34; title=&#34;http://www.sxnee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2248&#34;&gt;April 9, 2009 8:26 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Dan! I was going for breadth more than depth with these, trying to follow through on the same set of baseline tasks with each triplestore. Particularly with the commercial ones like AllegroGraph and OpenLink, there are weeks&amp;rsquo; worth of features to play with.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://kill.devc.at/&#34; title=&#34;http://kill.devc.at/&#34;&gt;Robert (rho) Barta&lt;/a&gt; on &lt;a href=&#34;#comment-2249&#34;&gt;April 10, 2009 2:51 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob.&lt;/p&gt;
&lt;p&gt;If you want to keep your coding at a minimum, then maybe watch Perl RDF::AllegroGraph::Easy evolve on CPAN. I&amp;rsquo;ll progress it as spare time allows.&lt;/p&gt;
&lt;p&gt;Worth noting is also that AllegroGraph seems to be more than &amp;ldquo;just an RDF store&amp;rdquo; as it can host tuples (and not just triples). But I admit that I have not yet fathomed out this thing yet.&lt;/p&gt;
&lt;p&gt;I can recommend these webinars, especially Jans Aasman talking. But they only last about an hour and cannot get very deep. Experimenting with the code remains a must. Good for me, as a consultant ;-)&lt;/p&gt;
&lt;p&gt;BTW, it&amp;rsquo;s rho, not Rho. And, yes, I&amp;rsquo;m sailing under no flags to avoid angry ladies emailing me about my schroedinger&amp;rsquo;sch cat experiments&amp;hellip;&lt;/p&gt;
&lt;p&gt;By Bill on &lt;a href=&#34;#comment-2288&#34;&gt;June 6, 2009 4:44 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also started a Google Group for AllegroGraph users, called, cleverly enough, &amp;ldquo;AllegroGraph-users&amp;rdquo;. You can sign up at&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://groups.google.com/group/allegrograph-users&#34;&gt;http://groups.google.com/group/allegrograph-users&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Setting up your microcomputer facility</title>
      <link>https://www.bobdc.com/blog/setting-up-your-microcomputer/</link>
      <pubDate>Thu, 26 Mar 2009 08:56:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/setting-up-your-microcomputer/</guid>
      
      
      <description><div>A 1985 filmstrip. Not a slideshow, but a filmstrip. And dig that funky music.</div><div>&lt;p&gt;(Apparently the video isn&amp;rsquo;t on vimeo anymore)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>My own little Twitter client</title>
      <link>https://www.bobdc.com/blog/my-own-little-twitter-client/</link>
      <pubDate>Wed, 25 Mar 2009 18:49:40 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-own-little-twitter-client/</guid>
      
      
      <description><div>No AJAX, Flash, or AIR; just HTML, but arranged the way I want it.</div><div>&lt;p&gt;I&amp;rsquo;ve tried various Twitter clients, but usually just went back to the &lt;a href=&#34;http://twitter.com/home&#34;&gt;twitter.com&lt;/a&gt; web-based interface that people hate so much. My main complaint with it—and I saw no other clients that did any better—was that it showed tweets in reverse chronological order. Conversations and the multi-tweet mini-essays that some people write are difficult to read that way, so I decided to write my own little client.&lt;/p&gt;
&lt;blockquote id=&#34;id203615&#34; class=&#34;pullquote&#34;&gt;It&#39;s easy to think of new features to add, but in its current state it scratches the itch that I had, so I&#39;ll leave it alone.&lt;/blockquote&gt;
&lt;p&gt;It&amp;rsquo;s a simple python script that uses the &lt;a href=&#34;http://code.google.com/p/python-twitter/&#34;&gt;python-twitter&lt;/a&gt; interface. (The zipped distribution version there is a bit out of date; my script has a comment at the top about what to do.) My twitter client python script checks for a disk file that identifies the last tweet that I read, pulls all tweets since then, and then creates a web page showing those in chronological order. (Sample page &lt;a href=&#34;http://www.snee.com/xml/twclient/tweets.html&#34;&gt;here&lt;/a&gt;.) The little ← arrow after each entry lets you reply to that tweet if you&amp;rsquo;re logged in on the web client, and an additional, slightly different ↵ arrow lets you link back to a message being replied to. Mouseover text makes the meaning of the cryptic little arrows clearer.&lt;/p&gt;
&lt;p&gt;I was going to also have the script check for new direct messages, but twitter sends me email the rare times that I actually get one of those, so I won&amp;rsquo;t miss any. I also considered adding entries showing the results of a twitter vanity search, but python-twitter doesn&amp;rsquo;t support the search interface and I have an RSS feed to alert me to that anyway. It&amp;rsquo;s easy to think of new features to add—when I tweeted that I was working on this, more suggestions started coming—but in its current state it scratches the itch that I had, so I&amp;rsquo;ll leave it alone.&lt;/p&gt;
&lt;p&gt;Coding around twitter&amp;rsquo;s API is easy if you don&amp;rsquo;t want to implement a fancy UI. Some of the comments I received about features to add suggested easier following of threaded conversations, and the API gives you what you need to do that, once you decide on a UI. That&amp;rsquo;s how I added the link for the second arrow mentioned above.&lt;/p&gt;
&lt;p&gt;For now, when I want to read recent tweets from my friends, I run a batch file that runs the python script and displays the resulting HTML file. I&amp;rsquo;ll probably make something to trigger it with a CGI so that I can check for updates by just clicking a button.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve put the python script at &lt;a href=&#34;http://www.snee.com/xml/twclient/getNewTweets.py.txt&#34;&gt;http://www.snee.com/xml/twclient/getNewTweets.py.txt&lt;/a&gt; and the &lt;a href=&#34;http://www.snee.com/xml/twclient/tweets.css&#34;&gt;tweets.css&lt;/a&gt; stylesheet that the output references in the same directory. If you&amp;rsquo;ve never played with Twitter&amp;rsquo;s API, see &lt;a href=&#34;http://www.devx.com/webdev/Article/40359&#34;&gt;part 1&lt;/a&gt; and &lt;a href=&#34;http://www.devx.com/webdev/Article/40511/0&#34;&gt;part 2&lt;/a&gt; of my DevX article on it. python-twitter makes it pretty easy, but twitter&amp;rsquo;s &lt;a href=&#34;http://apiwiki.twitter.com/REST+API+Documentation&#34;&gt;RESTful native interface&lt;/a&gt; makes it easy to write a client in any language—even XSLT, if you use cURL to retrieve the data from the server, because while XSLT engines can do HTTP GETs, I know of none that can do the authenticated GETs required for most calls to the twitter API.&lt;/p&gt;
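&lt;p&gt;For the curious: the authentication that the twitter API required at the time was plain HTTP basic authentication, so the Authorization header is just a base64-encoded user:password pair. Here is a minimal Python sketch of the header construction only; the header format is standard, but treat the details of any particular twitter endpoint as historical.&lt;/p&gt;

```python
import base64

def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header value that the old
    twitter REST API expected on authenticated requests."""
    credentials = username + ":" + password
    token = base64.b64encode(credentials.encode("ascii")).decode("ascii")
    return "Basic " + token
```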
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.thodla.com&#34; title=&#34;http://www.thodla.com&#34;&gt;Dorai Thodla&lt;/a&gt; on &lt;a href=&#34;#comment-2240&#34;&gt;March 25, 2009 10:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice. I always wanted one which I can customize. For example, when some one follows you, I would like to click on the person&amp;rsquo;s account and get a tag cloud to see whether I have any interest in following them.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t care much for AIR even though I use Twhirl a lot since it is easy to retweet.&lt;/p&gt;
&lt;p&gt;Thanks for sharing.&lt;/p&gt;
&lt;p&gt;regards,&lt;br /&gt;
Dorai&lt;/p&gt;
&lt;p&gt;By Marty on &lt;a href=&#34;#comment-2254&#34;&gt;April 16, 2009 5:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;What do I do with the python script to run it on windows?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2255&#34;&gt;April 16, 2009 7:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;First download it, rename it as getNewTweets.py, and change the lines that set the username and password to use your own.&lt;/p&gt;
&lt;p&gt;I run it with a batch file that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python getNewTweets.py &amp;gt; temp.html
temp.html
&lt;/code&gt;&lt;/pre&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Getting started with Open Anzo</title>
      <link>https://www.bobdc.com/blog/getting-started-with-open-anzo/</link>
      <pubDate>Thu, 19 Mar 2009 19:55:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started-with-open-anzo/</guid>
      
      
      <description><div>Don&#39;t miss the exciting command line video demo!</div><div>&lt;p&gt;&lt;a href=&#34;http://www.openanzo.org/&#34;&gt;Open Anzo&lt;/a&gt; is the third &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/rdf/triplestores/&#34;&gt;disk-based triplestore&lt;/a&gt; that I managed to set up, load with a few files of RDF data, and query with SPARQL. Its home page describes it as &amp;ldquo;an open source enterprise-featured RDF store and service oriented middleware platform that provides support for multiple users, distributed clients, offline work, real-time notification, named-graph modularization, versioning, access controls, and transactions with preconditions&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Before I describe my experience setting it up, loading sample data, and querying that data, take a look at Lee Feigenbaum&amp;rsquo;s short &lt;a href=&#34;http://www.youtube.com/watch?v=pBeDYCA8oDk&#34;&gt;video&lt;/a&gt; demonstrating the use of Open Anzo&amp;rsquo;s command line interface:&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/pBeDYCA8oDk&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;He&amp;rsquo;s using Linux in the video, but I managed to perform similar queries using Open Anzo under Windows XP. I got the impression from &lt;a href=&#34;http://www.openanzo.org/projects/openanzo/wiki&#34;&gt;one documentation web page&lt;/a&gt; that the product requires DB2, Oracle, PostgreSQL, HSQLDB, or Apache DB on the back end in order to run it, but you don&amp;rsquo;t need any external database manager to try it out. It is nice to know that these database managers are options as your storage needs scale up; a readme file mentions that it also supports MySQL, and documentation for configuring Open Anzo to hook up to each of these database managers is easy to find on the openanzo.org web site.&lt;/p&gt;
&lt;p&gt;After I &lt;a href=&#34;http://www.openanzo.org/downloads.html&#34;&gt;downloaded&lt;/a&gt; release 3.1 of the Open Anzo full distribution and unzipped it, I set the ANZO_HOME environment variable to the name of the directory where I had unzipped it and then ran the startAnzo.bat script that started the server. (Once the server is started, sending a browser to http://localhost:8080/status shows whether you&amp;rsquo;ve got it up and running properly.) The server gives you an &amp;ldquo;osgi&amp;gt;&amp;rdquo; prompt in the command window where you started it up. Entering &amp;ldquo;help&amp;rdquo; at this server prompt shows you various things you can do there, but I didn&amp;rsquo;t play with that much.&lt;/p&gt;
&lt;p&gt;Once the server is running, you can interact with it using a command line client, as Lee demonstrated in his video. From a Windows operating system prompt, you do this by supplying parameters to the anzo.bat script. In addition to the ANZO_HOME variable, the window where you issue these commands also needs the ANZO_CLI_HOME variable set; I pointed it to the same directory.&lt;/p&gt;
&lt;p&gt;Entering &amp;ldquo;anzo help&amp;rdquo; lists the various anzo commands, and entering a command name after &amp;ldquo;help&amp;rdquo; like this tells you about that command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;anzo help query
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before you issue your first successful command, you also need to make sure that the server recognizes you as a legitimate user. I used peter as a username and 123 as a password, because I found these in the configuration\anzo.ldif file. Open Anzo offers options to point the client to a username and password pair stored in a configuration file, which is why Lee didn&amp;rsquo;t need to include them on the command line in his video, but I just added them to each anzo command with the -w and -u switches. The following two commands each loaded a file of RDF data into the named graph identified by the -g option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;anzo import -w 123 -u peter -g http://whatever.com/g1 \bob\dev\xml\rdf\fakeAddrBookPt1.rdf
anzo import -w 123 -u peter -g http://whatever.com/g2 \bob\dev\xml\rdf\fakeAddrBookPt2.rdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following query then asked for a list of all the predicates used by triples in the &lt;a href=&#34;http://whatever.com/g1&#34;&gt;http://whatever.com/g1&lt;/a&gt; graph:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;anzo query -u peter -w 123  &amp;quot;SELECT DISTINCT ?p FROM &amp;lt;http://whatever.com/g1&amp;gt; WHERE {?s ?p ?o}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next query doesn&amp;rsquo;t mention a specific graph, but it does include the -a switch, which tells Open Anzo to query against a merge of all the named graphs in the repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;anzo query -u sysadmin -w 123 -a &amp;quot;SELECT DISTINCT ?p WHERE {?s ?p ?o}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both queries worked just fine. As I mentioned in an &lt;a href=&#34;https://www.bobdc.com/blog/querying-a-set-of-named-rdf-gr#id203812&#34;&gt;update to last week&amp;rsquo;s post&lt;/a&gt;, I also managed to query a set of graphs at once in Open Anzo based on metadata associated with the graph.&lt;/p&gt;
&lt;p&gt;As I understand it, Open Anzo once included a RESTful SPARQL endpoint to provide an HTTP interface, and although some more recent builds didn&amp;rsquo;t include this, it&amp;rsquo;s being put back in. I couldn&amp;rsquo;t get it to work in a few tests with &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;curl&lt;/a&gt;, but I&amp;rsquo;m going to keep trying with future builds.&lt;/p&gt;
&lt;p&gt;As with Virtuoso, Open Anzo has an impressive list of features beyond the simple ability to load and query triples that I&amp;rsquo;ve demonstrated here. I love the command line interface, and Lee&amp;rsquo;s video quickly demonstrates a lot of cool things you can do with it. I look forward to playing more with Open Anzo.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By Bernhard Schandl on &lt;a href=&#34;#comment-2306&#34;&gt;August 13, 2009 4:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;are you sure Open Anzo works with a file-based triple store? As far as the documentation reads, without an underlying RDBMS an in-memory store is used, which means your triples will be gone once the server is shut down.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Some use cases to implement using SPARQL graphs</title>
      <link>https://www.bobdc.com/blog/some-use-cases-to-implement-us/</link>
      <pubDate>Sun, 15 Mar 2009 17:34:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/some-use-cases-to-implement-us/</guid>
      
      
      <description><div>Or not; I&#39;m open to suggestions.</div><div>&lt;p&gt;As I wrote in my &lt;a href=&#34;https://www.bobdc.com/blog/querying-a-set-of-named-rdf-gr&#34;&gt;last entry&lt;/a&gt;, I&amp;rsquo;ve recently figured out how to assign metadata to RDF graphs and to perform SPARQL queries on sets of those graphs. I&amp;rsquo;m working a bit backwards here, because I&amp;rsquo;m now moving on to the use cases that got me thinking about this in the first place. It&amp;rsquo;s easier to think about them now that I know that I can implement them using standard syntax and multiple open source implementations of that standard. I wanted to outline my ideas about how to implement these use cases to see if they sound particularly good or bad to others. They&amp;rsquo;re general enough that they&amp;rsquo;ll apply to other situations.&lt;/p&gt;
&lt;h2 id=&#34;id203629&#34;&gt;Simple aggregation of distributed data&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s say I have a collection of RDF data that mirrors several sets of data on the Internet. I want to query the aggregate set without retrieving every set from its original source with every query. It&amp;rsquo;s not very time-sensitive data, so updating the central collection once every 24 hours is fine. &amp;ldquo;Updating&amp;rdquo; is the key operation here; if someone deletes a triple from one of the satellite collections, I want to be confident that it won&amp;rsquo;t be in my aggregate collection the next day, so here&amp;rsquo;s what I would do.&lt;/p&gt;
&lt;p&gt;I name each graph in my internal collection after the source of its triples. To update the data from source &lt;a href=&#34;http://www.greatdata.org/latest.rdf&#34;&gt;http://www.greatdata.org/latest.rdf&lt;/a&gt;, a cron job does the following at 3:14 AM each morning:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Delete the triples in the &lt;a href=&#34;http://www.greatdata.org/latest.rdf&#34;&gt;http://www.greatdata.org/latest.rdf&lt;/a&gt; graph in my collection.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Load the latest data from &lt;a href=&#34;http://www.greatdata.org/latest.rdf&#34;&gt;http://www.greatdata.org/latest.rdf&lt;/a&gt; into the graph with that name in my collection.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add some triples like the following to a graph dedicated to tracking such downloads:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.greatdata.org/latest.rdf&amp;gt;
&amp;lt;http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#fileLastAccessed&amp;gt;
&amp;quot;2009-03-15T03:14:52-0500&amp;quot;.
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Wiping out a set of data and completely replacing it will only scale up to a certain point, and a &lt;a href=&#34;http://jena.hpl.hp.com/~afs/SPARQL-Update.html&#34;&gt;SPARQL UPDATE&lt;/a&gt; ability will be a better way to implement certain variations on this, but if the total aggregate size is just a few dozen megabytes, the general approach above makes sense to me. Does it look horribly wrong to anyone else?&lt;/p&gt;
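&lt;p&gt;Sketched with SPARQL Update syntax (the draft linked above may differ in its details from what shipping implementations accept), the nightly replacement of one source&amp;rsquo;s graph boils down to two statements:&lt;/p&gt;

```sparql
# Drop yesterday's copy of this source's data...
CLEAR GRAPH <http://www.greatdata.org/latest.rdf> ;
# ...then pull the current file into the graph named after it.
LOAD <http://www.greatdata.org/latest.rdf>
  INTO GRAPH <http://www.greatdata.org/latest.rdf>
```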
&lt;h2 id=&#34;id203706&#34;&gt;Identifying a triple&amp;rsquo;s provenance&lt;/h2&gt;
&lt;p&gt;This time, instead of replacing each graph with a more updated version, I want to aggregate all the downloaded data as it accumulates. I assign each downloaded batch its own graph URL and assign metadata to this new graph such as the source, date, and time of the retrieval. I could also assign it rdfg:subGraphOf values, depending on which sets of graphs I was defining for querying, updating, and access control purposes.&lt;/p&gt;
&lt;p&gt;To move on to a usage scenario, let&amp;rsquo;s say that Kendall Clark queries a service on snee.com and finds this triple:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;http://clarkparsia.com/weblog/2008/10/31/we-won/&amp;gt; dc:creator &amp;quot;Bijan Parsia&amp;quot;.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;He contacts me and says &amp;ldquo;Bijan didn&amp;rsquo;t write that! I did! Where are you getting this data?&amp;rdquo; I check and see that this triple is part of the named graph &lt;a href=&#34;http://www.snee.com/ns/graphids&#34;&gt;http://www.snee.com/ns/graphids&lt;/a&gt;#i23F2A9, so I query the metadata associated with this named graph and find this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;        &amp;lt;http://www.snee.com/ns/graphids#i23F2A9&amp;gt; 
        dc:date &amp;quot;2008-11-01T17:37:00&amp;quot;;
        dc:source &amp;lt;http://planetrdf.com/index.rdf&amp;gt;.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tell Kendall that I got that triple from the Planet RDF RSS feed at 5:37 PM GMT.&lt;/p&gt;
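&lt;p&gt;The metadata lookup itself is a one-pattern SPARQL query along these lines (a sketch, assuming the tracking triples are visible in the queried dataset):&lt;/p&gt;

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# When and from where did we load the graph holding the disputed triple?
SELECT ?date ?source
WHERE {
  <http://www.snee.com/ns/graphids#i23F2A9> dc:date ?date ;
                                            dc:source ?source .
}
```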
&lt;p&gt;Again, does the general outline of what I describe here make sense, or would there be a better way to approach it?&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By Simon Reinhardt on &lt;a href=&#34;#comment-2237&#34;&gt;March 15, 2009 9:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;Jeni wrote an &lt;a href=&#34;http://www.jenitennison.com/blog/node/101&#34;&gt;interesting piece&lt;/a&gt; on this the other day. Lots of relevant comments, too.&lt;br /&gt;
Not speaking of my comment, of course! ;-) However the reason I&amp;rsquo;m referring to this is that I put &lt;a href=&#34;http://www.jenitennison.com/blog/node/101#comment-4872&#34;&gt;my ideas on using the HTTP vocabulary in there&lt;/a&gt; which I think is relevant to your second use case. It&amp;rsquo;s restricted to cases where you dereference HTTP URIs but in those cases it gives you very detailed control.&lt;br /&gt;
Other relevant vocabularies: &lt;a href=&#34;http://tw.rpi.edu/2008/sw/archive.owl#&#34;&gt;http://tw.rpi.edu/2008/sw/archive.owl#&lt;/a&gt; &lt;a href=&#34;http://web.resource.org/rss/1.0/modules/syndication/&#34;&gt;http://web.resource.org/rss/1.0/modules/syndication/&lt;/a&gt; &lt;a href=&#34;http://wiki.foaf-project.org/ScutterVocab&#34;&gt;http://wiki.foaf-project.org/ScutterVocab&lt;/a&gt;&lt;br /&gt;
Hmm, lots of links. Usually that&amp;rsquo;s the point where my comment ends up in the spam box. ;-)&lt;/p&gt;
&lt;p&gt;Simon&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2238&#34;&gt;March 15, 2009 10:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Simon! I&amp;rsquo;ve certainly been following Jeni&amp;rsquo;s work there. The HTTP vocabulary looks very useful, although a namespace prefix of http is bound to be confusing&amp;ndash;an end-tag with /http:httpVersion between the &amp;lt;&amp;gt; could be pretty confusing to people who aren&amp;rsquo;t hardcore markup geeks.&lt;/p&gt;
&lt;p&gt;The other vocabularies also look useful, but I have to wonder if some spokes of the Library of Congress MARC-based metadata wheels (e.g. METS, EAD) got reinvented in there. If so, they&amp;rsquo;ll make great demos for OWL equivalency predicates&amp;hellip;&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-2239&#34;&gt;March 25, 2009 3:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Your description here on the provenance/named graphs is good&amp;hellip;but I don&amp;rsquo;t think I like named graphs as the solution to provenance, because any given triple could flow through many graphs&amp;hellip;ie, tracking down the genesis of the triple is more or less archeology&amp;hellip;or even worse, trying to track the ownership of a penny.&lt;/p&gt;
&lt;p&gt;What if each triple were really a quadruple, the 4th item being a URI to the authority that gave birth to the statement. That URI might also point to a thing that is an instanceof &amp;ldquo;provenance node&amp;rdquo; with more stuff like the time, and person who asserted the fact, or if it&amp;rsquo;s inferred.&lt;/p&gt;
&lt;p&gt;By Crystal on &lt;a href=&#34;#comment-2246&#34;&gt;April 5, 2009 7:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;great post.. thanks&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Querying a set of named RDF graphs without naming the graphs</title>
      <link>https://www.bobdc.com/blog/querying-a-set-of-named-rdf-gr/</link>
      <pubDate>Tue, 10 Mar 2009 20:01:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-a-set-of-named-rdf-gr/</guid>
      
      
      <description><div>A big step toward using named graphs to track provenance.</div><div>&lt;p&gt;I&amp;rsquo;d like to thank everyone who added comments to my last post, &lt;a href=&#34;https://www.bobdc.com/blog/some-questions-about-rdf-named&#34;&gt;Some questions about RDF named graphs&lt;/a&gt;. Lee Feigenbaum wrote an entire &lt;a href=&#34;http://www.thefigtrees.net/lee/blog/2009/03/named_graphs_in_open_anzo.html&#34;&gt;blog post&lt;/a&gt; addressing the issues I raised, and it looks like his Open Anzo triplestore (which I&amp;rsquo;ll write up in its own post soon) has some nice support for versioning, access control, and replication.&lt;/p&gt;
&lt;blockquote id=&#34;id203618&#34; class=&#34;pullquote&#34;&gt;It all worked fine in Sesame and Virtuoso.&lt;/blockquote&gt;
&lt;p&gt;Jeni Tennison&amp;rsquo;s &lt;a href=&#34;https://www.bobdc.com/blog/some-questions-about-rdf-named#comment-2231&#34;&gt;comment&lt;/a&gt; was a bit embarrassing, because it showed that the answer to my key question was right &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#namedAndDefaultGraph&#34;&gt;in the SPARQL specification&lt;/a&gt;. I had read the entire spec, but didn&amp;rsquo;t understand the point of named graphs at the time, so that part didn&amp;rsquo;t sink in the way it should have.&lt;/p&gt;
&lt;p&gt;To review my third question, which built on the first two:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If we&amp;rsquo;re going to use named graphs to track provenance, then it would make sense to assign each batch of data added to my triplestore to its own graph. Let&amp;rsquo;s say that after a while I have thousands of graphs, and I want to write a SPARQL query whose scope is 432 of those graphs. Do I need 432 &amp;ldquo;FROM NAMED&amp;rdquo; clauses in my query? (Let&amp;rsquo;s assume that I plan to query those same 432 multiple times.)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I want to put each batch in its own graph so that I can store metadata for each batch. I also want to write a query that retrieves triples from a set of graphs, and when new graphs are added to the set, I don&amp;rsquo;t want to have to rewrite the query. Based on the example that Jeni pointed me to, I now know how to do this, and I assembled a working demo. It&amp;rsquo;s a pretty low-level demo; in my next posting I&amp;rsquo;ll describe one or two more real-world scenarios of applying these ideas, because I&amp;rsquo;d like some opinions on whether the architecture I have in mind makes sense.&lt;/p&gt;
&lt;p&gt;The SPARQL spec explains why I don&amp;rsquo;t need multiple FROM NAMED clauses to issue a single query against multiple graphs: &amp;ldquo;the GRAPH keyword is used to match patterns against named graphs. GRAPH can provide an IRI to select one graph or use a variable which will range over the IRI of all the named graphs in the query&amp;rsquo;s RDF dataset&amp;rdquo;. So, if I use a variable that ranges over 432 named graphs, I just need a pattern to identify those 432 graphs—ideally, a pattern that still works the following week if I need it to range over 433 graphs, then 434 graphs, and so forth.&lt;/p&gt;
&lt;p&gt;For my demo, I created three named graphs and a query that retrieves data from two by using a GRAPH pattern instead of explicitly naming them. Each graph assigns a Dublin Core title value to a book whose identifier is based on its ISBN, and the first two graphs identify themselves as subgraphs of &lt;a href=&#34;http://www.snee.com/ng/mygraph.rdf&#34;&gt;http://www.snee.com/ng/mygraph.rdf&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My first graph:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;rdf:RDF xmlns:dc=&amp;quot;http://purl.org/dc/elements/1.1/&amp;quot;
    xmlns:rdfg=&amp;quot;http://www.w3.org/2004/03/trix/rdfg-1/&amp;quot;
    xmlns:rdf=&amp;quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;quot;&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;http://www.snee.com/ng/mybluegraph.rdf&amp;quot;&amp;gt;
    &amp;lt;rdfg:subGraphOf rdf:resource=&amp;quot;http://www.snee.com/ng/mygraph.rdf&amp;quot;/&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;urn:isbn:1-93-022011-1&amp;quot;&amp;gt;
    &amp;lt;dc:title&amp;gt;XSLT Quickly&amp;lt;/dc:title&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


&amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Thanks also to Jeni for pointing me to the &lt;a href=&#34;http://www.w3.org/2004/03/trix/rdfg-1/&#34;&gt;http://www.w3.org/2004/03/trix/rdfg-1/&lt;/a&gt; vocabulary for describing graph relationships.) Second graph:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;rdf:RDF xmlns:dc=&amp;quot;http://purl.org/dc/elements/1.1/&amp;quot;
    xmlns:rdfg=&amp;quot;http://www.w3.org/2004/03/trix/rdfg-1/&amp;quot;
    xmlns:rdf=&amp;quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;quot;&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;http://www.snee.com/ng/myredgraph.rdf&amp;quot;&amp;gt;
    &amp;lt;rdfg:subGraphOf rdf:resource=&amp;quot;http://www.snee.com/ng/mygraph.rdf&amp;quot;/&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;urn:isbn:0-13-082676-6&amp;quot;&amp;gt;
    &amp;lt;dc:title&amp;gt;XML: The Annotated Specification&amp;lt;/dc:title&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


&amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The third graph, whose data shouldn&amp;rsquo;t show up in the query results, because it&amp;rsquo;s not a subgraph of &lt;a href=&#34;http://www.snee.com/ng/mygraph.rdf&#34;&gt;http://www.snee.com/ng/mygraph.rdf&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;rdf:RDF xmlns:dc=&amp;quot;http://purl.org/dc/elements/1.1/&amp;quot;
    xmlns:rdfg=&amp;quot;http://www.w3.org/2004/03/trix/rdfg-1/&amp;quot;
    xmlns:rdf=&amp;quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;quot;&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;urn:isbn:0-13-475740-8&amp;quot;&amp;gt;
    &amp;lt;dc:title&amp;gt;SGML CD&amp;lt;/dc:title&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


&amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following test query retrieved the titles from all three graphs, because it has no qualifications about which graphs to retrieve from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc:&amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;


select ?title WHERE {?s dc:title ?title}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This next query, however, only wants dc:title values from graphs that are subgraphs of &lt;a href=&#34;http://www.snee.com/ng/mygraph.rdf&#34;&gt;http://www.snee.com/ng/mygraph.rdf&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc:&amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX rdfg:&amp;lt;http://www.w3.org/2004/03/trix/rdfg-1/&amp;gt;


select ?title 
WHERE { ?g rdfg:subGraphOf &amp;lt;http://www.snee.com/ng/mygraph.rdf&amp;gt;
        GRAPH ?g {?s dc:title ?title}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It all worked fine in Sesame and Virtuoso. (One note: when you load a graph into Sesame using its workbench interface, you can specify a URL for the graph&amp;rsquo;s name, so I chose names of the form &lt;a href=&#34;http://www.snee.com/ng/myredgraph.rdf&#34;&gt;http://www.snee.com/ng/myredgraph.rdf&lt;/a&gt; shown in the sample data files above. When loading the graphs into Virtuoso, its default behavior is to assign graph name URLs based on the URL of the WebDav folder used to load it—see my earlier posting on &lt;a href=&#34;https://www.bobdc.com/blog/getting-started-using-virtuoso&#34;&gt;Getting Started with Virtuoso&lt;/a&gt; for more on this—so for the data I loaded into Virtuoso, I used URLs that followed the form &lt;a href=&#34;http://local.virt/DAV/home/joeuser/rdf_sink/myredgraph.rdf&#34;&gt;http://local.virt/DAV/home/joeuser/rdf_sink/myredgraph.rdf&lt;/a&gt; for the rdfg:subGraphOf triples.) &lt;em&gt;Update: it works in OpenAnzo as well. When I first tried that, I didn&amp;rsquo;t know about the -A command line option when querying; see also Lee&amp;rsquo;s comment below. More on OpenAnzo in an upcoming post.&lt;/em&gt;&lt;/p&gt;
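&lt;p&gt;One handy variation on the second query above: returning the graph variable along with the title shows which subgraph each result came from, with no other changes to the setup:&lt;/p&gt;

```sparql
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

# Same pattern as before, but report the source graph for each title.
SELECT ?g ?title
WHERE { ?g rdfg:subGraphOf <http://www.snee.com/ng/mygraph.rdf> .
        GRAPH ?g {?s dc:title ?title}
}
```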
&lt;p&gt;Now I know that the SPARQL standard and multiple open source implementations support the querying of a set of named graphs without requiring me to list them all, so the use of a large amount of graphs doesn&amp;rsquo;t sound so unwieldy. This is an important application building block, and next I&amp;rsquo;ll describe some things that sound sensible to build.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://thefigtrees.net/lee/blog/&#34; title=&#34;http://thefigtrees.net/lee/blog/&#34;&gt;Lee Feigenbaum&lt;/a&gt; on &lt;a href=&#34;#comment-2235&#34;&gt;March 11, 2009 1:22 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, you said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The SPARQL spec explains why I don&amp;rsquo;t need multiple FROM NAMED clauses to issue a single query against multiple graphs: &amp;ldquo;the GRAPH keyword is used to match patterns against named graphs. GRAPH can provide an IRI to select one graph or use a variable which will range over the IRI of all the named graphs in the query&amp;rsquo;s RDF dataset&amp;rdquo;. So, if I use a variable that ranges over 432 named graphs, I just need a pattern to identify those 432 graphs—ideally, a pattern that still works the following week if I need it to range over 433 graphs, then 434 graphs, and so forth.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m not sure this is quite correct. From the SPARQL specification&amp;rsquo;s point of view, you (or your SPARQL engine) do indeed need to specify the graphs that comprise the RDF dataset against which you are querying.&lt;/p&gt;
&lt;p&gt;What makes this tractable is that some stores will, by default, make the default graph the RDF-merge (union, basically) of all of the graphs in the store and also add all graphs in the store as named graphs in the dataset.&lt;/p&gt;
&lt;p&gt;Other stores (e.g. Open Anzo) provide a &amp;ldquo;magic&amp;rdquo; URI to stand for &amp;ldquo;all graphs&amp;rdquo;, or introduce the concept of named datasets, computed datasets, etc. to address this challenge.&lt;/p&gt;
&lt;p&gt;Lee&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www-sop.inria.fr/edelweiss/people/Fabien.Gandon/&#34; title=&#34;http://www-sop.inria.fr/edelweiss/people/Fabien.Gandon/&#34;&gt;Fabien Gandon&lt;/a&gt; on &lt;a href=&#34;#comment-2236&#34;&gt;March 11, 2009 4:27 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I thought you might also be interested in the work done in CORESE for graph and paths handling :&lt;br /&gt;
&lt;a href=&#34;http://www-sop.inria.fr/edelweiss/software/corese/v2_4_1/manual/next.php&#34;&gt;http://www-sop.inria.fr/edelweiss/software/corese/v2_4_1/manual/next.php&lt;/a&gt;&lt;br /&gt;
in particular the nested graph and recursive querying mechanism.&lt;/p&gt;
&lt;p&gt;on the RDF side these extensions build on a previous member submission:&lt;br /&gt;
&lt;a href=&#34;http://www.w3.org/Submission/rdfsource/&#34;&gt;http://www.w3.org/Submission/rdfsource/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cheers,&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://gearon.blogspot.com/&#34; title=&#34;http://gearon.blogspot.com/&#34;&gt;Paula Gearon&lt;/a&gt; on &lt;a href=&#34;#comment-2241&#34;&gt;March 27, 2009 5:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In response to Lee&amp;rsquo;s comment:&lt;br /&gt;
&amp;ldquo;I&amp;rsquo;m not sure this is quite correct. From the SPARQL specification&amp;rsquo;s point of view, you (or your SPARQL engine) do indeed need to specify the graphs that comprise the RDF dataset against which you are querying.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;No, that&amp;rsquo;s not actually true. I agree that this *seems* to be what the spec is implying, but it&amp;rsquo;s not true.&lt;/p&gt;
&lt;p&gt;If you use a variable in a GRAPH statement, then the variable will range over the dataset names - all graph names in scope. FROM NAMED does create a scope, but if it is not used in the query then it&amp;rsquo;s all the graph names.&lt;/p&gt;
&lt;p&gt;I made this mistake as well, and was trying to limit my SPARQL implementation to not allow unbound variables in the GRAPH position, but Andy Seaborne set me straight. (Andy is the editor of the SPARQL spec document, and the implementor of SPARQL for Jena).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thefigtrees.net/lee/blog/&#34; title=&#34;http://thefigtrees.net/lee/blog/&#34;&gt;Lee Feigenbaum&lt;/a&gt; on &lt;a href=&#34;#comment-2242&#34;&gt;March 30, 2009 9:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Paula, I beg to differ.&lt;/p&gt;
&lt;p&gt;The SPARQL specification has no concept of &amp;ldquo;all the graph names&amp;rdquo;. In the absence of any explicitly defined dataset, the query is run against a dataset that is chosen by your implementation (your SPARQL engine).&lt;/p&gt;
&lt;p&gt;For *some* implementations, this means that the query is run against all the graphs that the engine knows about. For other implementations, this means that the query is run against an empty dataset. For still others, an engine may be hardwired to query specific graphs in the absence of an explicitly given dataset.&lt;/p&gt;
&lt;p&gt;This is a very common misunderstanding about the SPARQL specification. I&amp;rsquo;m guessing that Andy was either speaking specifically about what Joseki/ARQ do or that there was a miscommunication there.&lt;/p&gt;
&lt;p&gt;best,&lt;br /&gt;
Lee&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://gearon.blogspot.com/&#34; title=&#34;http://gearon.blogspot.com/&#34;&gt;Paula Gearon&lt;/a&gt; on &lt;a href=&#34;#comment-2243&#34;&gt;March 30, 2009 3:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Lee,&lt;/p&gt;
&lt;p&gt;My perspective on this comes from an email conversation with Andy, partly on a mailing list, and partly in private. The context was when I was implementing SPARQL on Mulgara (Andy is on the Mulgara mailing list).&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m sure Andy won&amp;rsquo;t mind if I repeat part of one of our private emails here. I had been referring to an unbound variable named &amp;ldquo;x&amp;rdquo; which referred to the graph:&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;?x ranges over the dataset names - all graph names in scope.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;FROM NAMED may have created a scope with certain names, but FROM NAMED is not necessary in a query anyway.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;Just query with&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;SELECT ?g { GRAPH ?g { ?s ?p ?o } }&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;All names of graphs.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In a later (public) email, he replies to a comment from me:&lt;/p&gt;
&lt;p&gt;   &lt;em&gt;&amp;gt; It&amp;rsquo;s interesting you would say this. I wondered about this in the&lt;br /&gt;
   &amp;gt; past, and wasn&amp;rsquo;t satisfied that it was allowed. Also, when I wrote my&lt;br /&gt;
   &amp;gt; email this morning, then I looked it up again (so I didn&amp;rsquo;t look like&lt;br /&gt;
   &amp;gt; an idiot - as I am wont to do) and again, I wasn&amp;rsquo;t satisfied that it&lt;br /&gt;
   &amp;gt; was allowed.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;   &lt;em&gt;Paula,&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;Examples can&amp;rsquo;t be exhaustive! We tried to put in as much as we could but not every single case can be an example.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;There are tests in the test suite: graph/graph-02 to -09 or thereabouts&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;  &lt;em&gt;Section 12 gives the definition of the evaluation of a graph pattern.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;He also gave a final followup:&lt;/p&gt;
&lt;p&gt;   &lt;em&gt;&amp;gt; I take it that 12.5 is the area of most relevance here? Specifically,&lt;br /&gt;
   &amp;gt; the definition of:&lt;br /&gt;
   &amp;gt; eval(D(G), Graph(var,P)) = &amp;hellip;&lt;br /&gt;
   &amp;gt; ???&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;   &lt;em&gt;Yes - that&amp;rsquo;s it: the example in section 8.3.4 is:&lt;/em&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc:   &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

SELECT ?name ?mbox ?date
WHERE
  { ?g dc:publisher ?name ;
       dc:date ?date .
    GRAPH ?g
      { ?person foaf:name ?name ;
                foaf:mbox ?mbox .
      }
  }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;  &lt;em&gt;which is the algebra expression:&lt;/em&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(base
  (prefix ((dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;)
           (foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;))
    (project (?name ?mbox ?date)
      (join
        (BGP
          (triple ?g dc:publisher ?name)
          (triple ?g dc:date ?date)
        )
        (graph ?g
          (BGP
            (triple ?person foaf:name ?name)
            (triple ?person foaf:mbox ?mbox)
          ))
      ))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;   &lt;em&gt;And, this being an applicative-order evaluation, (graph ?g &amp;hellip;) is evaluated as in 12.5 &amp;quot;Evaluation of a Graph Pattern&amp;quot; and then participates in the join. That ?g is unconstrained at the point of evaluating the (graph &amp;hellip;).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;   &lt;em&gt;D, the dataset, can come from the protocol, the query or the execution environment (in that order of priority).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s entirely possible that I&amp;rsquo;m misinterpreting what Andy is saying here, but my reading of it is that an unbound variable used in a GRAPH expression will evaluate to all known graphs. Also, I&amp;rsquo;m not saying that Andy is the definitive source for this information, but he is certainly clearer on it than I am. :-)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://seaborne.blogspot.com/&#34; title=&#34;http://seaborne.blogspot.com/&#34;&gt;Andy Seaborne&lt;/a&gt; on &lt;a href=&#34;#comment-2244&#34;&gt;March 31, 2009 7:18 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In SPARQL, a query executes over a dataset. GRAPH accesses the names in the dataset. If it&amp;rsquo;s &amp;ldquo;GRAPH ?g&amp;rdquo;, then ?g ranges over all names in the dataset. This is in &amp;ldquo;&lt;a href=&#34;http://www.w3.org/TR/rdf-sparql-query/#defn_evalGraph&#34;&gt;Definition: Evaluation of a Graph Pattern&lt;/a&gt;&amp;rdquo; (third case). ?g may be constrained in other parts of the query.&lt;/p&gt;
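Andy's definition can be modeled in a few lines: a dataset is a default graph plus a map from names to named graphs, and GRAPH with an unbound variable iterates over exactly the names in that map. This is a purely illustrative Python sketch (the dict layout and sample data are made up, not any real engine's structures):

```python
# A dataset: one default graph plus zero or more named graphs.
# Graphs are just sets of (subject, predicate, object) triples.
dataset = {
    "default": {("g1", "dc:publisher", "Bob")},
    "named": {
        "g1": {("alice", "foaf:name", "Alice")},
        "g2": {("bob", "foaf:name", "Bob")},
    },
}

def eval_graph_var(dataset, pattern):
    """Evaluate GRAPH ?g { pattern }: ?g ranges over the dataset's
    named-graph names only, never over graphs outside the dataset."""
    results = []
    for name, graph in dataset["named"].items():
        for triple in graph:
            # None acts as an unbound variable in the pattern.
            if all(p is None or p == t for p, t in zip(pattern, triple)):
                results.append((name, triple))
    return results

# SELECT ?g { GRAPH ?g { ?s ?p ?o } } -- all names of graphs with data:
matches = eval_graph_var(dataset, (None, None, None))
print(sorted({name for name, _ in matches}))  # ['g1', 'g2']
```

The point the sketch makes is Andy's third case: the candidate bindings for ?g come from the dataset and nowhere else.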
&lt;p&gt;Where the dataset comes from is either from a description or the dataset is decided by the query service.&lt;/p&gt;
&lt;p&gt;There are two ways to describe the dataset - in the query with FROM/FROM NAMED, or in the protocol with default-graph-uri/named-graph-uri. If the dataset is described, then the dataset must be what is described and not more (or worse, different).&lt;/p&gt;
&lt;p&gt;If the dataset is not described, the service provides whatever it chooses, and it can be set up in a variety of ways - the case of having the default graph be the manifest and other metadata for the named graphs is quite an interesting setup for the provenance situation.&lt;/p&gt;
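The priority Andy gives in the quoted email (protocol description first, then the query's FROM/FROM NAMED, then whatever the execution environment supplies) amounts to a simple selection rule. A hypothetical sketch, with invented parameter names:

```python
def choose_dataset(protocol_ds=None, query_ds=None, service_default=None):
    """Pick the dataset a query runs against, in priority order:
    protocol parameters first, then FROM/FROM NAMED in the query,
    then whatever the execution environment provides."""
    for candidate in (protocol_ds, query_ds, service_default):
        if candidate is not None:
            return candidate
    raise ValueError("no dataset available at query time")

# The protocol description wins over the query's FROM clauses:
print(choose_dataset(protocol_ds="proto", query_ds="from-clause"))  # proto
# With no protocol or query description, the service decides:
print(choose_dataset(service_default="service"))                    # service
```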
&lt;p&gt;Bob&amp;rsquo;s &lt;a href=&#34;#id203803&#34;&gt;query&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc:   &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX rdfg: &amp;lt;http://www.w3.org/2004/03/trix/rdfg-1/&amp;gt;

select ?title
WHERE { ?g rdfg:subGraphOf
        GRAPH ?g {?s dc:title ?title}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;will work on any SPARQL system that can be set up with the dataset having all the named graphs in it and the default graph being the manifest (and other details) of named graphs.&lt;/p&gt;
&lt;p&gt;A query service can reject any query it chooses not to answer. Maybe it only processes queries with a description, maybe it only processes queries without a description. The latter case is important because it is the case of providing access to a published dataset over the web. Here, just providing query over that one dataset is what the service does. It might even execute a query if the description matches but there is no obligation for it to do that. After all, the dataset may not be describable if it&amp;rsquo;s some large relational database fronted by a SPARQL query service.&lt;/p&gt;
&lt;p&gt;A graph can appear more than once in the dataset under different names or once as the default graph and also with a name. The default graph or any named graph may be some calculated form of other graphs such as the RDF merge. What matters is that at the time the query is executed, there is a dataset, and that the dataset has a default graph and zero or more named graphs.&lt;/p&gt;
&lt;p&gt;Aside: the tests assume the default graph, if not mentioned, is the empty graph. That&amp;rsquo;s just convenience for the tests, and compatible with the fact that if there is a description of the dataset and the default graph is not otherwise mentioned (FROM, default-graph-uri) then it is empty. The tests provide another way to describe the dataset for the purposes of the tests.&lt;/p&gt;
&lt;p&gt;I was quoted as saying:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&amp;quot;?x ranges over the dataset names - all graph names in scope.&amp;quot;&lt;/em&gt; and the &amp;ldquo;all&amp;rdquo; here is all names in the dataset because a query is executed over a dataset. See &amp;ldquo;&lt;a href=&#34;http://www.w3.org/TR/rdf-sparql-query/#defn_evalGraph&#34;&gt;Definition: Evaluation of a Graph Pattern&lt;/a&gt;&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;When Paula says: &lt;em&gt;&amp;ldquo;my reading of it is that an unbound variable used in a GRAPH expression will evaluate to all known graphs.&amp;rdquo;&lt;/em&gt; this is true where &amp;ldquo;all known&amp;rdquo; means all known in the dataset. Not all graphs on the web.&lt;/p&gt;
&lt;p&gt;LeeF said: &lt;em&gt;&amp;ldquo;From the SPARQL specification&amp;rsquo;s point of view, you (or your SPARQL engine) do indeed need to specify the graphs that comprise the RDF dataset against which you are querying.&amp;rdquo;&lt;/em&gt; This is also true, but the tricky part is &amp;ldquo;or your SPARQL engine&amp;rdquo;. It seems to me that the issue is about what the dataset is. While one could have a dataset which is &amp;ldquo;all graphs on the web, by name&amp;rdquo; or some such, no implementation could realise that, even though we would still have a dataset specified. The requirement is to have, when the query executes, a dataset with a set of names that can be iterated over for the evaluation of GRAPH with an unconstrained variable - and that requirement could not be met.&lt;/p&gt;
&lt;p&gt;Some systems have a notion of the graphs that they have in their storage. The dataset description is interpreted as meaning &amp;ldquo;pick graphs out of the collection of stored graphs&amp;rdquo;. Nothing wrong with that model, but it&amp;rsquo;s not required by the specs.&lt;/p&gt;
&lt;p&gt;I do think it would be wrong to be overly prescriptive, especially about the default graph. Some systems force that to be the RDF merge of the named graphs, and that would preclude Bob&amp;rsquo;s use case of the default graph holding the manifest information.&lt;/p&gt;
&lt;p&gt;Finally, we have a new working group running. All that matters is the text in the documents - not the intent of the text, what implementations do, nor what the authors thought they meant when writing the text. The &lt;a href=&#34;http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/&#34;&gt;working group comments list&lt;/a&gt; is the place to send suggestions for improving the text. Hint, hint.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2245&#34;&gt;March 31, 2009 7:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Andy! My use of the default graph for manifest information was just off the top of my head when coming up with the example. It sounds like a specific named graph for these triples would be a good idea.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Some questions about RDF named graphs</title>
      <link>https://www.bobdc.com/blog/some-questions-about-rdf-named/</link>
      <pubDate>Sun, 01 Mar 2009 11:17:17 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/some-questions-about-rdf-named/</guid>
      
      
      <description><div>Trying to connect the data structure to real-world use.</div><div>&lt;p&gt;Most triplestores support named graphs, and from a high level I can see how they&amp;rsquo;d be useful, but as I think about using named graphs to address specific application needs, some questions come to mind, so I thought I&amp;rsquo;d throw them out there.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If graph membership is implemented by using the fourth part of a quad to name the graph that the triple belongs to, then a triple can only belong directly to one graph, right?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I say &amp;ldquo;belong directly&amp;rdquo; because I&amp;rsquo;m thinking that a graph can belong to another graph. If so, how would this be indicated? Is there some specific predicate to indicate that graph x belongs to graph y?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If we&amp;rsquo;re going to use named graphs to track provenance, then it would make sense to assign each batch of data added to my triplestore to its own graph. Let&amp;rsquo;s say that after a while I have thousands of graphs, and I want to write a SPARQL query whose scope is 432 of those graphs. Do I need 432 &amp;ldquo;FROM NAMED&amp;rdquo; clauses in my query? (Let&amp;rsquo;s assume that I plan to query those same 432 multiple times.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I can think of more questions, but I want to wait and see what I can learn about the issues above, and then I can ask better follow-up questions.&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.linkedin.com/in/erics&#34; title=&#34;http://www.linkedin.com/in/erics&#34;&gt;Eric Schoonover&lt;/a&gt; on &lt;a href=&#34;#comment-2229&#34;&gt;March 1, 2009 1:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In the repository I am helping to build we have the concept of a graph alias that helps with the overload of named or default graph references in your SPARQL query. It is especially useful if you are going to be executing multiple queries against the same set of graphs. You can assign a single URI that acts as an alias to the 432 graphs you really intend to query and then you can have a single FROM or FROM NAMED clause that points to the graph alias and the SPARQL endpoint will automatically expand the query based on the contents of the graph alias.&lt;/p&gt;
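Eric's alias mechanism is specific to his repository, but the expansion step itself is easy to picture. A hypothetical sketch (the alias URI scheme here is invented):

```python
# A graph alias: one URI standing in for many graph URIs. Before the
# query runs, the endpoint expands the alias into the real graph list.
aliases = {
    "urn:alias:my432": ["urn:g:%d" % i for i in range(432)],
}

def expand_dataset(from_named, aliases):
    """Replace any alias URI in a FROM NAMED list with its member graphs;
    non-alias URIs pass through unchanged."""
    expanded = []
    for uri in from_named:
        expanded.extend(aliases.get(uri, [uri]))
    return expanded

# One alias plus one ordinary graph expands to 433 graph URIs:
ds = expand_dataset(["urn:alias:my432", "urn:g:extra"], aliases)
print(len(ds))  # 433
```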
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2230&#34;&gt;March 1, 2009 1:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think this idea of named-graphs being a &amp;ldquo;physical&amp;rdquo; (i.e., exclusive, containing) partitioning of the triple-space not only doesn&amp;rsquo;t make sense, but its failure to make sense is in hilariously exact hierarchical contradiction to the very graph-structured premise of RDF. The relationship between a triple and anything else demands all the same structure and flexibility as anything other kind of relationship. The fourth column in a quad-store should not be graph-name, it should be triple-id. Once a triple has an ID, you can then express anything you want *about* that triple, whether it&amp;rsquo;s confidence or provenance or batch or saltiness or whatever.&lt;/p&gt;
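glenn's fourth-column-as-triple-id idea can be sketched abstractly; the identifiers and `ex:` predicates below are invented placeholders, not a real store's schema:

```python
# Give each triple an ID, then make statements about the ID itself:
# provenance, confidence, batch, saltiness, whatever.
triples = {
    "t1": ("alice", "foaf:knows", "bob"),
}
# Metadata about triples, itself stored as ordinary triples
# whose subject is a triple ID:
meta = {
    ("t1", "ex:source", "batch-42"),
    ("t1", "ex:confidence", "0.9"),
}

def about(triple_id, meta):
    """Everything asserted about a given triple, as (predicate, object) pairs."""
    return sorted((p, o) for (s, p, o) in meta if s == triple_id)

print(about("t1", meta))
# [('ex:confidence', '0.9'), ('ex:source', 'batch-42')]
```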
&lt;p&gt;By &lt;a href=&#34;http://www.jenitennison.com/blog&#34; title=&#34;http://www.jenitennison.com/blog&#34;&gt;Jeni Tennison&lt;/a&gt; on &lt;a href=&#34;#comment-2231&#34;&gt;March 1, 2009 2:48 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m no expert, but I agree with Glenn Mcdonald, that the fourth column should really be triple-id (as in a unique URI for each triple). Then again, I think named graphs are flexible enough to be used in this way anyway: they just &lt;em&gt;can&lt;/em&gt; encapsulate more than one triple if that&amp;rsquo;s useful.&lt;/p&gt;
&lt;p&gt;As far as the questions go: my understanding is that a given triple (as in a unique subject/property/object combination) can belong to multiple graphs. Each graph it belongs to provides a separate &amp;lsquo;row&amp;rsquo; in the quad store.&lt;/p&gt;
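Jeni's "separate row" point can be made concrete with a toy quad store; the names below are invented:

```python
# A quad store as a set of rows; the fourth column names the graph.
# The same triple appearing in two graphs is simply two rows.
quads = {
    ("s1", "p1", "o1", "graphA"),
    ("s1", "p1", "o1", "graphB"),  # same triple, second graph
    ("s2", "p2", "o2", "graphA"),
}

def graphs_containing(quads, triple):
    """All graphs a given (s, p, o) triple belongs to."""
    return sorted(g for (s, p, o, g) in quads if (s, p, o) == triple)

print(graphs_containing(quads, ("s1", "p1", "o1")))  # ['graphA', 'graphB']
```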
&lt;p&gt;&lt;a href=&#34;http://www.w3.org/2004/03/trix/&#34;&gt;Named Graphs / Semantic Web Activity&lt;/a&gt; points to a vocabulary for describing the relationships between graphs (subgraphs, equivalent graphs and so on) at &lt;a href=&#34;http://www.w3.org/2004/03/trix/rdfg-1/&#34;&gt;http://www.w3.org/2004/03/trix/rdfg-1/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I agree with Eric about making your 432 graphs subgraphs of a larger graph which you then query. I guess how you do that depends on the triplestore you&amp;rsquo;re using. The SPARQL specification has an example of &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#namedAndDefaultGraph&#34;&gt;named and default graphs&lt;/a&gt; which might be useful as a starting point.&lt;/p&gt;
&lt;p&gt;By Chris Booth on &lt;a href=&#34;#comment-2232&#34;&gt;March 1, 2009 3:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m no expert, especially about your first two questions, but for your third question it seems to me that you could use a variable for the named graph and then FILTER the results. That might not reduce your 432 individual requests to one, but it might help quite considerably.&lt;/p&gt;
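Chris's FILTER suggestion, reduced to its essence: bind the graph name to a variable with GRAPH ?g, then keep only the bindings whose graph is in the wanted set - one query instead of 432 FROM NAMED clauses. A toy sketch of the post-match filtering step (not a real SPARQL engine):

```python
# The graphs we care about (432 of them in Bob's scenario):
wanted = {"urn:g:7", "urn:g:12"}

# (?g, ?title) rows as they might come back from GRAPH ?g matching:
bindings = [
    ("urn:g:7", "Title A"),
    ("urn:g:99", "Title B"),
    ("urn:g:12", "Title C"),
]

# The FILTER step: drop rows whose graph isn't in the wanted set.
filtered = [(g, t) for (g, t) in bindings if g in wanted]
print(filtered)  # [('urn:g:7', 'Title A'), ('urn:g:12', 'Title C')]
```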
&lt;p&gt;By Damian on &lt;a href=&#34;#comment-2233&#34;&gt;March 1, 2009 3:24 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oh boy, good questions. Let&amp;rsquo;s try these ropey definitions first:&lt;/p&gt;
&lt;p&gt;Graph: a set of triples.&lt;br /&gt;
Named graph: a name, graph pair.&lt;br /&gt;
Dataset: a default graph, and zero or more named graphs.&lt;/p&gt;
&lt;p&gt;1) No, a triple can be in more than one graph. However some stores let you ignore the graphs in certain situations, which requires caution to maintain the set-ness of the resulting pseudo-graph. I believe some stores use this as the default graph in SPARQL, which is neither precluded nor suggested by the spec.&lt;/p&gt;
&lt;p&gt;2) I don&amp;rsquo;t understand how a graph can belong to another graph. It might be mentioned (e.g. one graph contains a statement that :Bob eg:made some other graph). You may have functional dependencies between graphs (one graph made from another via CONSTRUCT), but that&amp;rsquo;s up to your application to track. Named graphs are just graphs with names, nothing more.&lt;/p&gt;
&lt;p&gt;3) An exciting part of SPARQL :-) In SPARQL you query a dataset, but what determines the dataset? It might be the protocol parameters, it might be the query (your FROM and FROM NAMED), and it might simply be the endpoint that determines it. So don&amp;rsquo;t expect the endpoint to even pay attention to your FROM NAMED clauses.&lt;/p&gt;
&lt;p&gt;The best I can suggest is talk to your store vendor, although you may find FILTERing graphs in or out will do the trick.&lt;/p&gt;
&lt;p&gt;Hope this comment helps a little.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thefigtrees.net/lee/blog/&#34; title=&#34;http://thefigtrees.net/lee/blog/&#34;&gt;Lee Feigenbaum&lt;/a&gt; on &lt;a href=&#34;#comment-2234&#34;&gt;March 2, 2009 12:21 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;For the most part, I agree with everything Damian says. That said, since Open Anzo is based on a named graph model, I wanted to give some specific answers based on our experience.&lt;/p&gt;
&lt;p&gt;Since my comments were a bit lengthy, I stuck them on my blog:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.thefigtrees.net/lee/blog/2009/03/named_graphs_in_open_anzo.html&#34;&gt;http://www.thefigtrees.net/lee/blog/2009/03/named_graphs_in_open_anzo.html&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Restoring context to shortened URLs in Twitter</title>
      <link>https://www.bobdc.com/blog/restoring-context-to-shortened/</link>
      <pubDate>Thu, 26 Feb 2009 13:44:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/restoring-context-to-shortened/</guid>
      
      
<description><div>Giving me a better idea what tweets are pointing at.</div><div>&lt;p&gt;When you have to fit Twitter messages into 140 characters, URL shortening services such as &lt;a href=&#34;http://tinyurl.com/&#34;&gt;TinyURL&lt;/a&gt; and &lt;a href=&#34;http://is.gd/&#34;&gt;is.gd&lt;/a&gt; are handy, but I hate seeing tweets like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;This is hilarious: http://is.gd/kSyL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Typical URLs do include information that provides context, starting with the domain name. If someone points to a &amp;ldquo;great article on [whatever]&amp;rdquo;, the fact that it&amp;rsquo;s on nytimes.com versus someguy.wordpress.com gives me a clue about how much I want to read it, so if the description with the URL doesn&amp;rsquo;t give any meaningful context, I&amp;rsquo;m not going to follow the link.&lt;/p&gt;
&lt;p&gt;Firefox plug-in to the rescue: I recently learned from &lt;a href=&#34;http://twitter.com/kasthomas&#34;&gt;@kasthomas&lt;/a&gt; about the &lt;a href=&#34;https://addons.mozilla.org/en-US/firefox/addon/8636&#34;&gt;LongURL expander&lt;/a&gt;, which displays the real destination of a URL when you mouse over the shortened version.&lt;/p&gt;
&lt;p&gt;Thanks, &lt;a href=&#34;http://iamseanmurphy.com/&#34;&gt;Sean Murphy&lt;/a&gt;, for writing it!&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://IamSeanMurphy.com&#34; title=&#34;http://IamSeanMurphy.com&#34;&gt;Sean Murphy&lt;/a&gt; on &lt;a href=&#34;#comment-2226&#34;&gt;February 26, 2009 4:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hey, no problem! I&amp;rsquo;m glad people find it as useful as I do. Thanks for spreading the word.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2227&#34;&gt;February 27, 2009 1:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Good point, though often enough a popular domain name alone isn&amp;rsquo;t quite enough to indicate the dangers lurking behind a shortened link&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://bit.ly/4kb77v&#34;&gt;http://bit.ly/4kb77v&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Another plugin from bitly: &lt;a href=&#34;https://addons.mozilla.org/en-US/firefox/addon/10297&#34;&gt;https://addons.mozilla.org/en-US/firefox/addon/10297&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Family filter-related stuff (PICS; POWDER) fit into the landscape here somewhere too&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Sorry Facebook, not these blog postings</title>
      <link>https://www.bobdc.com/blog/sorry-facebook-not-these-blog/</link>
      <pubDate>Tue, 17 Feb 2009 17:21:30 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sorry-facebook-not-these-blog/</guid>
      
      
      <description><div>This is the last one for which you get a &#34;perpetual, fully-paid right to sublicense, modify, edit, create derivate works and distribute&#34;.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.facebook.com/terms.php&#34;&gt;&lt;img id=&#34;id197045&#34; src=&#34;http://creative.ak.facebook.com/ads3/creative/pressroom/jpg/b_1234208947_facebook_logo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Facebook logo from their &#39;press room&#39; page&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s been plenty of fuss over the changes to Facebook&amp;rsquo;s terms of service recently, even in yesterday&amp;rsquo;s &lt;a href=&#34;http://www.nytimes.com/2009/02/17/technology/internet/17facebook.htm&#34;&gt;New York Times&lt;/a&gt;. Trying to remember which of my friends recently tweeted &amp;ldquo;All your data are belong to us&amp;rdquo; on the topic, I &lt;a href=&#34;http://search.twitter.com/search?q=%22all+your+data+are+belong%22&#34;&gt;searched Twitter&lt;/a&gt; this morning and found that dozens of people have done so in the last 24 hours.&lt;/p&gt;
&lt;p&gt;Instead of going over the claims and counterclaims about Facebook&amp;rsquo;s intent, let&amp;rsquo;s go right to the primary document: the &lt;a href=&#34;http://www.facebook.com/terms.php&#34;&gt;Facebook Terms of Service&lt;/a&gt;. (After all, if you&amp;rsquo;re battling them in court, it won&amp;rsquo;t carry much weight to say &amp;ldquo;but your honor, their corporate communications guy told the New York Times that what they really meant was&amp;hellip;&amp;rdquo;) The &lt;a href=&#34;http://consumerist.com/5150175/facebooks-new-terms-of-service-we-can-do-anything-we-want-with-your-content-forever&#34;&gt;big issue this week&lt;/a&gt; is a sentence that was removed after the following paragraph, but I found unchanged text in the paragraph itself that scared me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You hereby grant Facebook an irrevocable, perpetual, non-exclusive, transferable, fully paid, worldwide license (with the right to sublicense) to (a) use, copy, publish, stream, store, retain, publicly perform or display, transmit, scan, reformat, modify, edit, frame, translate, excerpt, adapt, create derivative works and distribute (through multiple tiers), any User Content you (i) Post on or in connection with the Facebook Service or the promotion thereof subject only to your privacy settings&amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found their &amp;ldquo;import a blog&amp;rdquo; feature handy, so that what I put on &lt;a href=&#34;http://www.snee.com/bobdc.blog/&#34;&gt;bobdc.blog&lt;/a&gt; automatically gets published as a Facebook Note as well, thereby reaching more people. But do I want to grant Facebook a perpetual right to create derivative works from any content I post? A transferable, fully paid right, so that they can sell my content to others? I don&amp;rsquo;t think so. &amp;ldquo;Subject to [my] privacy settings&amp;rdquo; isn&amp;rsquo;t very reassuring; I&amp;rsquo;m not writing about any particularly private issues, so I don&amp;rsquo;t want distribution limited to my Facebook friends. As with the posting of my Twitter messages into my Facebook status, I found this automated importing of weblog postings to be a nice convenience, but it looks like the potential cost is too high. I&amp;rsquo;m disabling the blog import after this shows up as a Facebook note.&lt;/p&gt;
&lt;p&gt;Why do I even bother with Facebook? Sometimes it&amp;rsquo;s a handy way to get in touch with someone whose email address changed because their DSL provider got bought out by another one, and the new one&amp;rsquo;s rebranding effort extended to changing the domain name in all the customers&amp;rsquo; email addresses. I&amp;rsquo;ve never actually &amp;ldquo;friended&amp;rdquo; anyone in Facebook, but I do accept if someone I know friends me.&lt;/p&gt;
&lt;p&gt;Henry Story of Sun is working on some technology to allow the &lt;a href=&#34;http://blogs.sun.com/bblfish/entry/building_secure_and_distributed_social&#34;&gt;Building [of] Secure, Open and Distributed Social Network Applications&lt;/a&gt;. I hope that work like this gets us to a point where social networking connections and features, like the web itself, are distributed among data and services that different people choose on their own terms instead of being owned by a single, &lt;a href=&#34;http://www.time.com/time/business/article/0,8599,1644040,00.html&#34;&gt;privately owned&lt;/a&gt; corporation that reserves the right to do whatever they want with our content.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;February 18 update: It looks like Facebook has not only restored the sentence that everyone worried about losing from the Terms of Service but removed the language above as well.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;It&amp;rsquo;s interesting to compare &lt;a href=&#34;http://twitter.com/terms&#34;&gt;Twitter&amp;rsquo;s Terms of Service&lt;/a&gt;, which should provide a model for all such services.&lt;/em&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Getting started using Virtuoso as a triplestore</title>
      <link>https://www.bobdc.com/blog/getting-started-using-virtuoso/</link>
      <pubDate>Mon, 16 Feb 2009 10:02:33 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started-using-virtuoso/</guid>
      
      
      <description><div>The open source edition.</div><div>&lt;p&gt;Just about all the RDF triplestores I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/playing-with-some-rdf-stores&#34;&gt;been trying&lt;/a&gt; were designed from the ground up to store RDF triples. &lt;a href=&#34;http://www.openlinksw.com/&#34;&gt;OpenLink Software&amp;rsquo;s&lt;/a&gt; Virtuoso is a database server that can also store (and, as part of its original specialty, serve as an efficient interface to databases of) relational data and XML, so some of my setup and usage steps required learning a few other aspects of it first. For example, the actual loading of RDF is done using Virtuoso&amp;rsquo;s &lt;a href=&#34;http://www.webdav.org/&#34;&gt;WebDAV&lt;/a&gt; support, so I had to learn a bit about that. At first this seemed like another obstacle along the way to my goal of loading RDF and then issuing SPARQL queries against it, but I reminded myself that in a fast, free database server that supports a variety of data models, WebDAV support is most certainly a feature, not a bug.&lt;/p&gt;
&lt;blockquote id=&#34;id197076&#34; class=&#34;pullquote&#34;&gt;The possibility of a single server that can store both XML content and RDF triples of metadata about that content could be very interesting for publishers.&lt;/blockquote&gt;
&lt;p&gt;After downloading and unzipping the &lt;a href=&#34;http://virtuoso.openlinksw.com/wiki/main/Main/&#34;&gt;open source edition&lt;/a&gt; of Virtuoso for Windows, I found the virtuoso-t.exe server program in the virtuoso-opensource\bin directory. Running this with &lt;code&gt;--help&lt;/code&gt; as a parameter showed me the various options for starting it up, including the commands to create a Virtuoso Windows service and to then start up that service.&lt;/p&gt;
&lt;p&gt;Once I had this service running, sending a browser to http://localhost:8890/ displayed the product&amp;rsquo;s Welcome page, and the first choice on the menu on the left side of this screen took me to the Virtuoso Conductor. The Conductor requires you to log in before getting anything done, and the &lt;a href=&#34;http://docs.openlinksw.com/virtuoso/newadminui.html#defpasschange&#34;&gt;Default Passwords&lt;/a&gt; section of the &lt;a href=&#34;http://docs.openlinksw.com/virtuoso/quicktours.html&#34;&gt;Quick Start &amp;amp; Tours&lt;/a&gt; documentation included &amp;ldquo;dba&amp;rdquo; in its list of default IDs, so I logged in as the dba.&lt;/p&gt;
&lt;p&gt;An HTTP request to load data must specify the ID of the user loading the data, so as the dba I created a new user by picking the System Admin tab, User Accounts, and then &amp;ldquo;Create New Account&amp;rdquo; to create a joeuser account. I had some initial trouble configuring this user&amp;rsquo;s account to let it do all the things it needed to do, but with some &lt;a href=&#34;http://sourceforge.net/mailarchive/forum.php?thread_name=4990C45F.2000003%40snee.com&amp;amp;forum_name=virtuoso-users&#34;&gt;help on the virtuoso-users mailing list&lt;/a&gt; I learned that I had to check &amp;ldquo;User Enabled&amp;rdquo; and &amp;ldquo;Allow SQL/ODBC Logins&amp;rdquo;, add the roles SPARQL_SELECT and SPARQL_UPDATE for the user, check &amp;ldquo;Allow DAV logins&amp;rdquo;, set a DAV home path of /DAV/home/joeuser/ for this user, and check the DAV folder name&amp;rsquo;s Create box before clicking the Save button to actually create this user account and WebDAV folder.&lt;/p&gt;
&lt;p&gt;For a given user, you can create any WebDAV folder you want, upload RDF data to it, and then load that data to the triplestore (a quad store, actually, to track each triple&amp;rsquo;s graph) from that folder, but Virtuoso includes a special folder with each WebDAV-enabled account called &lt;a href=&#34;http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtuosoRDFSinkFolder&#34;&gt;rdf_sink&lt;/a&gt; to automate this process so that once you load an RDF file there its triples get sent right to the quad store.&lt;/p&gt;
&lt;p&gt;Once I had created the joeuser account with a password of jupw, the following &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;cURL&lt;/a&gt; command loaded the fakeAddrBookPt1.rdf file into the graph named http://localhost:8890/DAV/home/joeuser/rdf_sink (all curl command lines shown here include extra carriage returns for readability):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -i -T fakeAddrBookPt1.rdf 
  http://localhost:8890/DAV/home/joeuser/rdf_sink/fakeAddrBookPt1.rdf
  -u joeuser:jupw
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Later, substituting &amp;ldquo;fakeAddrBookPt2.rdf&amp;rdquo; for &amp;ldquo;fakeAddrBookPt1.rdf&amp;rdquo; in that command loaded this other file into the same graph.) After loading fakeAddrBookPt1.rdf, I went to the SPARQL query form at http://localhost:8890/sparql (the RDF tab of the Virtuoso conductor displays a similar one), entered http://localhost:8890/DAV/home/joeuser/rdf_sink/ as the Default Graph URI to query, and entered my favorite first SPARQL query of &amp;ldquo;SELECT DISTINCT ?p WHERE {?s ?p ?o}&amp;rdquo; in the Query text field. Clicking the Run Query button then retrieved a list of predicates from the data that I had loaded into that graph, just as I&amp;rsquo;d asked for.&lt;/p&gt;
&lt;p&gt;Because issuing a SPARQL query with curl reassures me that I really understand a server&amp;rsquo;s HTTP interface, I also entered the following command to perform the same query and got the same result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -F &amp;quot;query=SELECT DISTINCT ?p FROM 
  &amp;lt;http://localhost:8890/DAV/home/joeuser/rdf_sink/&amp;gt; 
  WHERE {?s ?p ?o}&amp;quot; http://localhost:8890/sparql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Throughout my process of setting this up and trying Virtuoso, I must admit that I did a lot of hunting in the documentation, although as I mentioned I got very good help on the mailing list. There is a &lt;a href=&#34;http://demo.openlinksw.com/doc/pdf/virtdocs.pdf&#34;&gt;documentation PDF file&lt;/a&gt; that looks pretty complete—at 15 megs and 2202 pages, it better be!&lt;/p&gt;
&lt;p&gt;For my next step with Virtuoso, the &lt;a href=&#34;http://docs.openlinksw.com/virtuoso/rdfsparqlrule.html&#34;&gt;RDF Inference in Virtuoso&lt;/a&gt; page describes some RDFS and OWL support, but shows that the RDFS and OWL properties must be loaded using special functions instead of just including them as more triples with the data. I&amp;rsquo;ll probably try it, but I&amp;rsquo;m also very curious about Virtuoso&amp;rsquo;s XQuery support—the possibility of a single server that can store both XML content and RDF triples of metadata about that content could be very interesting &lt;a href=&#34;https://www.bobdc.com/blog/publishing-and-semantic-web-te&#34;&gt;for publishers&lt;/a&gt;.&lt;/p&gt;
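&lt;p&gt;Judging from that page, the setup comes down to associating a rule name with the graph that holds the schema, then naming that rule in a query pragma. Here is a rough sketch that I haven&amp;rsquo;t tried yet; the rule name and graph URI are made up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- in Virtuoso's iSQL: bind a rule name to the graph holding the ontology
rdfs_rule_set ('myrules', 'http://localhost:8890/myontology');

-- then name that rule in a pragma when querying
SPARQL DEFINE input:inference 'myrules'
  SELECT DISTINCT ?p WHERE {?s ?p ?o};
&lt;/code&gt;&lt;/pre&gt;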
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2223&#34;&gt;February 16, 2009 4:37 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Here are some links to posts I made in the past re. Inference rules.&lt;/p&gt;
&lt;p&gt;1. &lt;a href=&#34;http://www.mail-archive.com/public-lod@w3.org/msg00870.html&#34;&gt;http://www.mail-archive.com/public-lod@w3.org/msg00870.html&lt;/a&gt; - UMBEL &amp;amp; DBpedia&lt;br /&gt;
2. &lt;a href=&#34;http://www.mail-archive.com/dbpedia-discussion@lists.sourceforge.net/msg00263.html&#34;&gt;http://www.mail-archive.com/dbpedia-discussion@lists.sourceforge.net/msg00263.html&lt;/a&gt; - YAGO &amp;amp; DBpedia&lt;/p&gt;
&lt;p&gt;Steps:&lt;/p&gt;
&lt;p&gt;1. You load the class hierarchies in question (typically an OWL ontology)&lt;br /&gt;
2. You associate a named rule with the named graph hosting the ontology in step 1&lt;br /&gt;
3. You execute SPARQL with the inference rule pragma which allows you to select which rules to use for reasoning.&lt;/p&gt;
&lt;p&gt;By Mario Kofler on &lt;a href=&#34;#comment-2582&#34;&gt;July 1, 2010 11:18 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hello,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I created a new user by picking the System Admin tab, User&lt;br /&gt;
Accounts, and then &amp;ldquo;Create New Account&amp;rdquo; to create a joeuser&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;can you please tell me where the button or the link &amp;ldquo;Create New Account&amp;rdquo; is located?&lt;/p&gt;
&lt;p&gt;i went with the user &amp;ldquo;dba&amp;rdquo; to system-admin-&amp;gt;user-accounts but i can just watch the users that are already in the system, but do not find a way to create a new account.&lt;/p&gt;
&lt;p&gt;i am using Virtuoso 6.1.1&lt;/p&gt;
&lt;p&gt;thank you for your help,&lt;/p&gt;
&lt;p&gt;greetings,&lt;/p&gt;
&lt;p&gt;Mario Kofler&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2583&#34;&gt;July 1, 2010 11:52 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It may have changed since February of last year. I would ask on a Virtuoso mailing list.&lt;/p&gt;
&lt;p&gt;By Jamshaid Ashraf on &lt;a href=&#34;#comment-2624&#34;&gt;September 17, 2010 7:29 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Mario,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I created a new user by picking the System Admin tab, User&lt;br /&gt;
Accounts, and then &amp;ldquo;Create New Account&amp;rdquo; to create a joeuser&lt;/p&gt;
&lt;p&gt;can you please tell me where the button or the link &amp;ldquo;Create New Account&amp;rdquo; is located?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can find &amp;ldquo;Create New Account&amp;rdquo; as a link in the heading of the last column of the user table. Though it looks like a sorting link, it is in fact the link to create a new user.&lt;/p&gt;
&lt;p&gt;reg&lt;br /&gt;
Jamshaid&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>MOTO connects Android to an e-ink display</title>
      <link>https://www.bobdc.com/blog/moto-connects-android-to-an-ei/</link>
      <pubDate>Sat, 14 Feb 2009 10:05:44 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/moto-connects-android-to-an-ei/</guid>
      
      
      <description><div>If I were Jeff Bezos, I&#39;d be nervous.</div><div>&lt;img id=&#34;id197042&#34; src=&#34;http://labs.moto.com/wp-content/uploads/2009/01/0209_gadget_labeled.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34; width=&#34;200px&#34;/&gt;
&lt;p&gt;In &lt;a href=&#34;https://www.bobdc.com/blog/the-cheap-commodity-ebook-read&#34;&gt;The cheap commodity eBook reader of the future&lt;/a&gt;, I wrote about how I look forward to a mass-market ebook reader created from an e-ink display and an inexpensive commodity processor. The folks at &lt;a href=&#34;http://www.moto.com/&#34;&gt;MOTO&lt;/a&gt; have taken a very cool step in this direction by hooking up a processor running Google&amp;rsquo;s Linux-based Android (the mobile phone operating system that underlies &lt;a href=&#34;http://gizmodo.com/5039741/t+mobile-android-htc-dream-launch-details-oct-13-199-w-2+year-contract-only&#34;&gt;T-Mobile&amp;rsquo;s G1 phone&lt;/a&gt;) to an e-ink display. MOTO&amp;rsquo;s &lt;a href=&#34;http://labs.moto.com/android-meets-e-ink/&#34;&gt;announcement&lt;/a&gt; about it includes a short video demo.&lt;/p&gt;
&lt;p&gt;As I suggested in a comment on their announcement, if MOTO got an ebook-reading program that understood the &lt;a href=&#34;http://www.idpf.org/&#34;&gt;EPUB&lt;/a&gt; format running on that Android processor—which I&amp;rsquo;m sure is much simpler than the work they&amp;rsquo;ve already done—it would make $400 for a Kindle look even dumber than it already looks.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By penn on &lt;a href=&#34;#comment-2277&#34;&gt;May 29, 2009 9:23 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been trying to decide which e-book reader to get for my mom. I figured I would have to get her both an ebook reader and a laptop, so she could download things (now she downloads ebooks from &lt;a href=&#34;http://www.ebook-search-queen.com/&#34;&gt;http://www.ebook-search-queen.com/&lt;/a&gt; ). Knowing that the Kindle doesn&amp;rsquo;t require her to go to her computer, or even have a wireless network setup, makes my decision easy. For her needs, it is the ideal item. Thank you for this!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Getting started with Sesame</title>
      <link>https://www.bobdc.com/blog/getting-started-with-sesame/</link>
      <pubDate>Thu, 12 Feb 2009 09:15:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started-with-sesame/</guid>
      
      
      <description><div>Surprisingly easy.</div><div>&lt;p&gt;My &lt;a href=&#34;https://www.bobdc.com/blog/playing-with-some-rdf-stores&#34;&gt;efforts&lt;/a&gt; to set up and try RDF triplestores have been a bit frustrating. I won&amp;rsquo;t go into reasons here, because several of the efforts are on hold for now, but my attempts to set up and use &lt;a href=&#34;http://www.openrdf.org/&#34;&gt;Sesame&lt;/a&gt; went so quickly and easily that I wanted to write it up right away.&lt;/p&gt;
&lt;p&gt;My main goal with any of the triplestores is to load some RDF that will be stored persistently and then run some SPARQL queries against it. I can do some Java coding if I must, but I wanted to see how far I could get with each triplestore without doing any coding (and especially, no compiling). For bonus points, I wanted to see how much inferencing and OWL usage was possible, but in general I wanted to avoid special features in my initial research because I wanted to establish a baseline. (In an ideal world, OWL support would be part of the baseline!)&lt;/p&gt;
&lt;h2 id=&#34;id197086&#34;&gt;Installing and running Sesame&lt;/h2&gt;
&lt;p&gt;According to the &lt;a href=&#34;http://www.openrdf.org/doc/sesame2/users/ch06.html&#34;&gt;installation instructions&lt;/a&gt;, the Sesame server software requires Java 5 or later and &amp;ldquo;a Java Servlet Container that supports Java Servlet API 2.4 and Java Server Pages (JSP) 2.0, or newer&amp;rdquo;. They &amp;ldquo;recommend using a recent, stable version of Apache Tomcat&amp;rdquo;.&lt;/p&gt;
&lt;blockquote id=&#34;id197107&#34; class=&#34;pullquote&#34;&gt;It was very easy—the kind of &#34;it just works&#34; experience that&#39;s a particular pleasure to find in open source software.&lt;/blockquote&gt;
&lt;p&gt;I began by downloading and unzipping &lt;a href=&#34;http://sourceforge.net/project/showfiles.php?group_id=46509&amp;amp;package_id=168413&#34;&gt;the zip file of the 2.2.4 Sesame SDK&lt;/a&gt; and &lt;a href=&#34;http://tomcat.apache.org/download-60.cgi&#34;&gt;Apache Tomcat 6.0.18&lt;/a&gt;. The instructions in apache-tomcat-6.0.18\RUNNING.txt to get the Tomcat server up and running were simple and straightforward. To &lt;a href=&#34;http://www.openrdf.org/doc/sesame2/users/ch06.html&#34;&gt;install a Sesame server&lt;/a&gt; on top of Tomcat, I copied the two war files from openrdf-sesame-2.2.4\war to apache-tomcat-6.0.18\webapps. After I shut down and restarted Tomcat, sending my browser to http://localhost:8080/openrdf-workbench and http://localhost:8080/openrdf-sesame showed welcome screens about how these apps were running with no problem.&lt;/p&gt;
&lt;h2 id=&#34;id197166&#34;&gt;Using Sesame&lt;/h2&gt;
&lt;p&gt;The Workbench is the form-driven interface to Sesame. It let me create repositories, load remote or local data files into them, and then query them all by picking menu choices and filling out forms. It was very easy—the kind of &amp;ldquo;it just works&amp;rdquo; experience that&amp;rsquo;s a particular pleasure to find in open source software.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://www.openrdf.org/doc/sesame2/system/ch08.html&#34;&gt;REST HTTP protocol&lt;/a&gt; for doing all these things, which I tested using &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;cURL&lt;/a&gt;, was well-documented and easy to figure out. After I&amp;rsquo;d created a &amp;ldquo;test1&amp;rdquo; repository using the Workbench, the following cURL command line listed repositories and showed that test1 was one of them (all curl command lines shown here include extra carriage returns for readability):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept: application/sparql-results+xml, */*;q=0.5&amp;quot; 
  http://localhost:8080/openrdf-sesame/repositories
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following loaded some RDF into the test1 repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -T rdftest2.rdf -H &amp;quot;Content-Type: application/rdf+xml;charset=UTF-8&amp;quot;
  http://localhost:8080/openrdf-sesame/repositories/test1/statements
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the file&amp;rsquo;s triples were loaded, the following request sent a &lt;a href=&#34;http://www.xs4all.nl/~jlpoutre/BoT/Javascript/Utils/endecode.html&#34;&gt;URL-encoded&lt;/a&gt; version of the SPARQL query &amp;ldquo;SELECT DISTINCT ?p WHERE {?s ?p ?o}&amp;rdquo; to the test1 repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -H &amp;quot;Accept:  application/sparql-results+xml, */*;q=0.5&amp;quot; 
  http://localhost:8080/openrdf-sesame/repositories/test1?query=
  SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server sent back a &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-XMLres-20080115/&#34;&gt;SPARQL query result format&lt;/a&gt; version of the response.&lt;/p&gt;
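&lt;p&gt;An abbreviated version of such a response looks something like this, with one result element per row and one binding per variable (the predicate value shown here is just illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot;?&amp;gt;
&amp;lt;sparql xmlns=&amp;quot;http://www.w3.org/2005/sparql-results#&amp;quot;&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;variable name=&amp;quot;p&amp;quot;/&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;results&amp;gt;
    &amp;lt;result&amp;gt;
      &amp;lt;binding name=&amp;quot;p&amp;quot;&amp;gt;
        &amp;lt;uri&amp;gt;http://xmlns.com/foaf/0.1/firstName&amp;lt;/uri&amp;gt;
      &amp;lt;/binding&amp;gt;
    &amp;lt;/result&amp;gt;
  &amp;lt;/results&amp;gt;
&amp;lt;/sparql&amp;gt;
&lt;/code&gt;&lt;/pre&gt;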
&lt;h2 id=&#34;id197266&#34;&gt;Inferencing&lt;/h2&gt;
&lt;p&gt;The steps up to this point were all so easy that I decided to push my luck and try some inferencing. I wanted to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Store someone&amp;rsquo;s home phone number and mobile phone number&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Declare that both of these properties were subproperties of phone&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Issue a SPARQL query saying &amp;ldquo;give me any phone numbers for this person&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When you create a Sesame repository, there are &lt;a href=&#34;http://www.openrdf.org/doc/sesame2/users/ch07.html#section-console-repository-creation&#34;&gt;nine choices&lt;/a&gt; for the type of store ranging from &amp;ldquo;In Memory Store&amp;rdquo; to &amp;ldquo;PostgreSQL RDF Store&amp;rdquo; and &amp;ldquo;Remote RDF Store&amp;rdquo;. For my inferencing tests, I picked &amp;ldquo;Native Java Store RDF Schema&amp;rdquo;, called the repository rdftest1, and uploaded the following file into it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;rdf:RDF xmlns:o=&amp;quot;urn:schemas-microsoft-com:office:outlook#&amp;quot;
  xmlns:f=&amp;quot;http://xmlns.com/foaf/0.1/&amp;quot;
  xmlns:rdfs=&amp;quot;http://www.w3.org/2000/01/rdf-schema#&amp;quot;
  xmlns:rdf=&amp;quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;quot;&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;http://localhost:2020/addrbook/RichardMutt&amp;quot;&amp;gt;
    &amp;lt;f:firstName&amp;gt;Richard&amp;lt;/f:firstName&amp;gt;
    &amp;lt;f:surname&amp;gt;Mutt&amp;lt;/f:surname&amp;gt;
    &amp;lt;o:homePhone&amp;gt;463-477-1322&amp;lt;/o:homePhone&amp;gt;
    &amp;lt;o:mobilePhone&amp;gt;463-215-8470&amp;lt;/o:mobilePhone&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


  &amp;lt;rdf:Property rdf:about=&amp;quot;urn:schemas-microsoft-com:office:outlook#homePhone&amp;quot;&amp;gt;
    &amp;lt;rdfs:subPropertyOf rdf:resource=&amp;quot;http://xmlns.com/foaf/0.1/phone&amp;quot; /&amp;gt;
  &amp;lt;/rdf:Property&amp;gt;


  &amp;lt;rdf:Property
    rdf:about=&amp;quot;urn:schemas-microsoft-com:office:outlook#mobilePhone&amp;quot;&amp;gt;
    &amp;lt;rdfs:subPropertyOf rdf:resource=&amp;quot;http://xmlns.com/foaf/0.1/phone&amp;quot; /&amp;gt;
  &amp;lt;/rdf:Property&amp;gt;


&amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let&amp;rsquo;s say I want to call Richard but can&amp;rsquo;t remember which phone numbers I have for him. The following query asks for the type and number of any phone numbers I have for him, and the little table that follows the query shows what Sesame returned:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX f:&amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
PREFIX o:&amp;lt;urn:schemas-microsoft-com:office:outlook#&amp;gt;


SELECT DISTINCT ?phoneType ?phoneNum WHERE { 
  ?s f:phone ?phoneNum; 
  ?phoneType ?phoneNum; 
  f:firstName &amp;quot;Richard&amp;quot;;
  f:surname &amp;quot;Mutt&amp;quot;.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;phoneType&lt;/th&gt;&lt;th&gt;phoneNum&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=o%3AhomePhone&#34;&gt;o:homePhone&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=%22463-477-1322%22&#34;&gt;&amp;ldquo;463-477-1322&amp;rdquo;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=f%3Aphone&#34;&gt;f:phone&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=%22463-477-1322%22&#34;&gt;&amp;ldquo;463-477-1322&amp;rdquo;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=o%3AmobilePhone&#34;&gt;o:mobilePhone&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=%22463-215-8470%22&#34;&gt;&amp;ldquo;463-215-8470&amp;rdquo;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=f%3Aphone&#34;&gt;f:phone&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href=&#34;explore?resource=%22463-215-8470%22&#34;&gt;&amp;ldquo;463-215-8470&amp;rdquo;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;(The returned data looks much better with the CSS included with openrdf-workbench.war. A &amp;ldquo;View Source&amp;rdquo; of the returned data reveals that it&amp;rsquo;s provided in the SPARQL query result format with a processing instruction at the top pointing to an XSLT stylesheet that works with the CSS to display the results nicely in the browser.) The links in the copy of the returned data above lead nowhere, but when you&amp;rsquo;re using it with a running copy of Workbench they let you explore around the data in the repository.&lt;/p&gt;
&lt;h2 id=&#34;id197750&#34;&gt;Next&lt;/h2&gt;
&lt;p&gt;It was particularly cool how many of these steps worked the first or second time I tried them, with no configuration required. I can think of more things I&amp;rsquo;d like to try with Sesame, but I&amp;rsquo;m going to keep trying out more triplestores and reporting on the ones that I have much luck with.&lt;/p&gt;
&lt;h2 id=&#34;14-comments&#34;&gt;14 Comments&lt;/h2&gt;
&lt;p&gt;By Bruce D&amp;rsquo;Arcus on &lt;a href=&#34;#comment-2219&#34;&gt;February 12, 2009 11:17 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;On the &amp;ldquo;really easy to setup a SPARQL endpoint&amp;rdquo; front, have you taken a look at &lt;a href=&#34;http://arc.semsol.org/&#34;&gt;ARC&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2220&#34;&gt;February 12, 2009 11:39 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Bruce, I will add that to my list, but judging from &lt;a href=&#34;http://arc.semsol.org/docs/v2/getting_started&#34;&gt;http://arc.semsol.org/docs/v2/getting_started&lt;/a&gt;, it looks like I would need some familiarity with PHP first, which I don&amp;rsquo;t have, so I may not get to it for a while.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2221&#34;&gt;February 12, 2009 3:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Many thanks for undertaking this survey, the results are really interesting.&lt;/p&gt;
&lt;p&gt;I tried a different task with Sesame the other day, but didn&amp;rsquo;t (yet) win. OK I was tired and maybe missed it in the documentation &amp;hellip; but all I wanted to do was load a SKOS rdf/xml document from disk, and explore it via API or (preferably) SPARQL. Couldn&amp;rsquo;t figure out how.&lt;/p&gt;
&lt;p&gt;I wonder whether FOAF+DOAP descriptions of the various SW toolkits could be made available from their providers. Perhaps even with extensions to point out where in their various documents, wikis etc., answers to these common questions can be found.&lt;/p&gt;
&lt;p&gt;Also, +1 on reviewing ARC. Assuming you have access to a server with basic PHP facilities, and the username/password for a MySQL db, it should be pretty smooth. ARC is a very impressive package in my experience&amp;hellip;&lt;/p&gt;
&lt;p&gt;By Bruce D&amp;rsquo;Arcus on &lt;a href=&#34;#comment-2222&#34;&gt;February 12, 2009 8:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yeah, I have a problem with PHP as well. But less so than Java ;-)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.troven.com.au&#34; title=&#34;http://www.troven.com.au&#34;&gt;Troven&lt;/a&gt; on &lt;a href=&#34;#comment-2440&#34;&gt;February 2, 2010 6:20 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sesame is a great tool - and so is ARC. ARC has more power when extracting and presenting information in keeping with its PHP heritage. Sesame is a very powerful data modelling and storage engine - SAIL being the crown jewels.&lt;/p&gt;
&lt;p&gt;By Mihir Shivkumar Wagle on &lt;a href=&#34;#comment-2453&#34;&gt;March 2, 2010 10:53 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It worked for the native store. But, it didn&amp;rsquo;t return any results when I tried it with MySQL or PostGres :(&lt;/p&gt;
&lt;p&gt;By Stratos on &lt;a href=&#34;#comment-2613&#34;&gt;August 29, 2010 9:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I tried your installation steps and they were very helpful indeed. I reach this problem though:&lt;/p&gt;
&lt;p&gt;I have a working apache tomcat.&lt;/p&gt;
&lt;p&gt;I downloaded the .jar i got the openrdf-sesame.war and the openrdf-workbench.war from the /war directory and i copied them to my apache tomcat directory /webapps/&lt;/p&gt;
&lt;p&gt;After a while i saw 2 new folders created openrdf-workbench/ and openrdf-sesame/&lt;/p&gt;
&lt;p&gt;I hit my server&amp;rsquo;s address &lt;a href=&#34;http://localhost:8080/openrdf-sesame/&#34;&gt;http://localhost:8080/openrdf-sesame/&lt;/a&gt; and I get:&lt;/p&gt;
&lt;p&gt;HTTP Status 404 -&lt;/p&gt;
&lt;p&gt;type Status report&lt;/p&gt;
&lt;p&gt;message&lt;/p&gt;
&lt;p&gt;description The requested resource () is not available.&lt;br /&gt;
Apache Tomcat/7.0.0&lt;/p&gt;
&lt;p&gt;Same with workbench.&lt;/p&gt;
&lt;p&gt;Any help would be really appreciated!&lt;/p&gt;
&lt;p&gt;By Stratos on &lt;a href=&#34;#comment-2614&#34;&gt;August 29, 2010 9:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I forgot to write it (i think). I restarted my tomcat before i checked the address&lt;/p&gt;
&lt;p&gt;Thanks again.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-2615&#34;&gt;August 29, 2010 11:06 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Stratos,&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s been a while since I played with it. openrdf.org seems to be down right now, but Google shows that it has a link to a mailing list which would probably be the best place to ask this.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://home.elka.pw.edu.pl/~mroj&#34; title=&#34;http://home.elka.pw.edu.pl/~mroj&#34;&gt;michał&lt;/a&gt; on &lt;a href=&#34;#comment-2616&#34;&gt;September 2, 2010 10:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Stratos,&lt;br /&gt;
I had the same problem (Tomcat 7.0.2 + Sesame 2.3.2). I tried a lot of things but nothing helped. So I decided to install a version as similar as possible to Bob&amp;rsquo;s settings. On 2nd Sept 2010 that was Tomcat 6.0.29 + Sesame 2.2.4&amp;hellip; and everything went fine - exactly as in this blog.&lt;/p&gt;
&lt;p&gt;michał&lt;/p&gt;
&lt;p&gt;By Jim Smart on &lt;a href=&#34;#comment-2620&#34;&gt;September 4, 2010 5:32 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I tried installing Sesame 2.3.2 into Tomcat 7.0.2-beta earlier today, and it didn&amp;rsquo;t work - I downgraded Tomcat 7 to 6.0.29 and then it worked for me :)&lt;/p&gt;
&lt;p&gt;Sesame 2.3.2 just doesn&amp;rsquo;t seem to work out-of-the-box on the current betas of Tomcat 7&lt;/p&gt;
&lt;p&gt;By Stratos on &lt;a href=&#34;#comment-2621&#34;&gt;September 11, 2010 1:34 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I used Tomcat 6 with the latest version of sesame2 and it worked like a charm. Thank you for the help everyone! :)&lt;/p&gt;
&lt;p&gt;By Jamshaid Ashraf on &lt;a href=&#34;#comment-2643&#34;&gt;October 4, 2010 7:00 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yes, there is some problem if you deploy Sesame 2.3.2 on Tomcat 7.&lt;/p&gt;
&lt;p&gt;The best way to save time is to deploy Sesame 2.3.2 with Tomcat 6; then everything works as described in this blog.&lt;/p&gt;
&lt;p&gt;jamshaid&lt;/p&gt;
&lt;p&gt;By Pafka on &lt;a href=&#34;#comment-2777&#34;&gt;January 26, 2011 9:58 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi guys,&lt;br /&gt;
Did you try with Tomcat 7 the following URL:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://localhost:8080/openrdf-workbench/&#34;&gt;http://localhost:8080/openrdf-workbench/&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>What can publishing and semantic web technology offer to each other? </title>
      <link>https://www.bobdc.com/blog/publishing-and-semantic-web-te/</link>
      <pubDate>Fri, 06 Feb 2009 10:35:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/publishing-and-semantic-web-te/</guid>
      
      
<description><div>That&#39;s &#34;semantic web technology&#34;, not &#34;the Semantic Web&#34;.</div><div>&lt;p&gt;Many have wondered about what the semantic web and publishing can offer each other. (By &amp;ldquo;publishing&amp;rdquo; here, I mean &amp;ldquo;making content available in one medium or another, ideally to make money&amp;rdquo;.) After following a lot of writing and discussions in these two worlds—and they are surprisingly separate worlds—I have a few ideas and wanted to write them up where people could comment on them.&lt;/p&gt;
&lt;h2 id=&#34;id197060&#34;&gt;What can the publishing world offer to the semantic web?&lt;/h2&gt;
&lt;p&gt;The less obvious, but to me, the clearest win is what the publishing world can offer to the semantic web: the lessons learned from long practical experience with developing and applying taxonomies, such as identifying useful concepts, naming them, identifying the useful relationships between them, and mapping units of content to those concepts. Many of the if-you-build-it-they-will-come ontologies out there seem to be thrown together in the hope that someone will use them, with no examination of use cases beyond the needs of the individual developers who created them—and sometimes, not even a close look at those needs. Semantic web technology gives us the standards and tools to assign descriptive terms to resources so that people (and software agents) who need those resources can identify them more easily; taxonomy professionals know about best practices for picking good terms to assign that will help the larger project meet specific goals. (For an example of this thinking, see &lt;a href=&#34;http://web.fumsi.com/go/article/manage/3126&#34;&gt;part 1&lt;/a&gt; and &lt;a href=&#34;http://web.fumsi.com/go/article/manage/3198&#34;&gt;part 2&lt;/a&gt; of the article &amp;ldquo;Creating User-Centred Taxonomies&amp;rdquo; from the FUMSI group, which is just one of the resources I&amp;rsquo;ve learned about since I began following the &lt;a href=&#34;http://finance.groups.yahoo.com/group/TaxoCoP/&#34;&gt;Taxonomy Community of Practice&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id=&#34;id197122&#34;&gt;What can the semantic web offer to the world of publishing?&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve heard discussions in which publishers picture machine-readable encoded semantics of content driving customers to that content, but this sounds a little pie-in-the-sky for now. (I&amp;rsquo;d be happy if someone could point me to indications that working examples of this using semantic technology are imminent.) Publishers who want more people to find their content on the web would be better off putting greater effort into basic search engine optimization, and will find solid practical advice in Jamie Lowe&amp;rsquo;s &lt;a href=&#34;http://www.youtube.com/watch?v=gAkEilpmdSE&#34;&gt;SEO for Publishers&lt;/a&gt; presentation.&lt;/p&gt;
&lt;p&gt;Semantic web technologies, as opposed to the grander idea of the Semantic Web itself, offer tools that can help publishers assemble and distribute their content more efficiently, and I think that this low-hanging fruit is a better place to start, if only to get a better idea of the technology&amp;rsquo;s strengths and weaknesses.&lt;/p&gt;
&lt;blockquote id=&#34;id197157&#34; class=&#34;pullquote&#34;&gt;What can an aggregator/publisher do to take advantage of content metadata when the metadata fields for one source&#39;s articles don&#39;t line up with the fields in another source&#39;s articles? &lt;/blockquote&gt;
&lt;p&gt;More and more publishing these days is about aggregation. When so much content is available from so many places for free, we&amp;rsquo;re more likely to pay money for (or put up with ads next to) content selected by people whose judgment we trust. There are many models for aggregation, ranging from print publications such as &lt;a href=&#34;http://www.utne.com/daily.aspx&#34;&gt;Utne Reader&lt;/a&gt; to grand old online services such as &lt;a href=&#34;http://w3.nexis.com/new/&#34;&gt;Nexis&lt;/a&gt; and &lt;a href=&#34;http://factiva.com/&#34;&gt;Factiva&lt;/a&gt; to more Web 2.0-oriented approaches such as &lt;a href=&#34;http://digg.com/&#34;&gt;Digg&lt;/a&gt; and &lt;a href=&#34;http://www.reddit.com/&#34;&gt;Reddit&lt;/a&gt;. Now more than ever, publishers know that metadata makes it easier for both publishing staff and readers to track and connect relevant content, but a problem for aggregators is that while they&amp;rsquo;re happy to get metadata with the content that they collect, different content sources will send different sets of metadata.&lt;/p&gt;
&lt;p&gt;There may be certain fields of metadata that most content chunks have in common, such as Dublin Core fields, but what can an aggregator/publisher do to take advantage of content metadata when the metadata fields for one source&amp;rsquo;s articles don&amp;rsquo;t line up with the fields in another source&amp;rsquo;s articles? Or when the same thing happens with images?&lt;/p&gt;
&lt;p&gt;According to traditional practice, the aggregator should put this data into a database that may be built into a CMS or set up as a standalone relational system such as Oracle, MySQL, or SQL Server. In either case, a crucial step in the setup part is deciding what fields you want to track.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say you define 10 fields of metadata to track. If an article arrives with 12 fields of metadata, but only 8 match fields that you&amp;rsquo;ve defined, you store those 8, throw out the other 4, and have 2 blanks left over. If, over time, you find yourself throwing out a particular field that more and more content providers have been including with their articles and images, you can modify your database schema or revise the customized fields in your CMS to start collecting that field from that point on, but this is rarely a quick and simple procedure, and all the values delivered to you for that field in content you&amp;rsquo;ve received up to then are still lost.&lt;/p&gt;
&lt;p&gt;The kind of technology developed to support semantic web projects offers an alternative. The RDF triples at the base of semantic technology let you store the fact that a particular resource (for example, a JPEG file) has a field with a particular name (for example, &amp;ldquo;resolution&amp;rdquo;) and a particular value for that field (for example, &amp;ldquo;72dpi&amp;rdquo;.) Actual resource and field names must be URLs to avoid confusion (I discussed this a bit &lt;a href=&#34;https://www.bobdc.com/blog/publishers-and-semantic-web-te&#34;&gt;last week&lt;/a&gt;); if you can do this, you can store any metadata about anything. The {resource, field name, field value} combination (more technically known as a subject/predicate/object) is called a triple, and the database managers that store them are called triplestores. Unlike relational database managers and production XML systems, the technology for working with these triples doesn&amp;rsquo;t need to know about field names in advance. The flexibility that this offers lets developers fit applications around their data instead of shoehorning their data into the current application&amp;rsquo;s requirements, which can put a lot of constraints on future possibilities for both the applications and the data.&lt;/p&gt;
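&lt;p&gt;To make the idea concrete, here&amp;rsquo;s a minimal Python sketch of a schema-less triple store: every fact is a {resource, field name, field value} tuple, and a query is just a pattern with blanks. (The example.com URLs and metadata values are made up for illustration, and a real triplestore adds indexing, SPARQL support, and persistence on top of this core idea.)&lt;/p&gt;

```python
# A minimal, illustrative triplestore; the example.com URLs are hypothetical.
# Each fact is a (subject, predicate, object) triple. No schema is declared
# up front, so an article that arrives with 12 metadata fields and one that
# arrives with 8 can live side by side with nothing thrown out.

triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the non-None parts of the pattern."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# Metadata from two sources with different field sets:
add("http://example.com/img/a.jpg", "http://example.com/ns/resolution", "72dpi")
add("http://example.com/img/a.jpg", "http://purl.org/dc/elements/1.1/title", "Train at dawn")
add("http://example.com/img/b.jpg", "http://example.com/ns/photographer", "J. Smith")

# All metadata recorded about a.jpg, regardless of which fields it came with:
print(query(subject="http://example.com/img/a.jpg"))
```

&lt;p&gt;The point of the sketch is that the store never had to be told about a &amp;ldquo;photographer&amp;rdquo; field in advance; the field name arrived with the data.&lt;/p&gt;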
&lt;p&gt;This flexibility does offer the possibility that two publishers might use different field names for the same concept, as Dale Waldt described in the posting I responded to last week, but the OWL part of the semantic web technology stack can help to account for that. For example, what if two publishers use different URLs to indicate the title of an article? If one uses a term from the Adobe &lt;a href=&#34;http://www.adobe.com/products/xmp/&#34;&gt;XMP&lt;/a&gt; namespace to assign an article a &lt;a href=&#34;http://ns.adobe.com/xap/1.0/Title&#34;&gt;http://ns.adobe.com/xap/1.0/Title&lt;/a&gt; value of &amp;ldquo;The Trans-Siberian Railroad&amp;rdquo;, and the other publisher assigns another article an &lt;a href=&#34;http://purl.org/dc/elements/1.1/title&#34;&gt;http://purl.org/dc/elements/1.1/title&lt;/a&gt; value of &amp;ldquo;Across Canada by Train&amp;rdquo;, a bit of OWL (as demonstrated in my response to Dale) can show that these terms mean the same thing so that a single query for titles retrieves both articles.&lt;/p&gt;
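&lt;p&gt;As a sketch of what that equivalence buys you (the article URLs are made up, and the equivalence is applied by hand here rather than by an OWL reasoner):&lt;/p&gt;

```python
# Two publishers use different predicates for "title of an article".
# The pub1.example/pub2.example article URLs are invented for illustration.

XMP_TITLE = "http://ns.adobe.com/xap/1.0/Title"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

triples = [
    ("http://pub1.example/articles/17", XMP_TITLE, "The Trans-Siberian Railroad"),
    ("http://pub2.example/articles/309", DC_TITLE, "Across Canada by Train"),
]

# An OWL reasoner would derive this from an owl:equivalentProperty
# assertion; here we record the equivalence as a simple lookup table.
equivalent = {DC_TITLE: {DC_TITLE, XMP_TITLE}}

def titles(predicate):
    """Query for a predicate, expanded to its declared equivalents."""
    wanted = equivalent.get(predicate, {predicate})
    return [(s, o) for s, p, o in triples if p in wanted]

# A single query for Dublin Core titles finds both articles:
print(titles(DC_TITLE))
```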
&lt;p&gt;If you as an aggregator feel that it would be easier for your suppliers to use a more normalized set of vocabulary terms, get them together and talk about it. This is what standards groups such as &lt;a href=&#34;http://www.oasis-open.org/&#34;&gt;OASIS&lt;/a&gt; and &lt;a href=&#34;http://www.idealliance.org/&#34;&gt;IDEAlliance&lt;/a&gt; are for. (IDEAlliance&amp;rsquo;s &lt;a href=&#34;http://www.prismstandard.org/&#34;&gt;PRISM&lt;/a&gt; standard, whose motto is &amp;ldquo;Developing a standard XML metadata vocabulary for the publishing industry&amp;rdquo;, is just such a group, and they include an RDF profile as part of their standard.)&lt;/p&gt;
&lt;h2 id=&#34;id197612&#34;&gt;Getting More Semantic&lt;/h2&gt;
&lt;p&gt;If I&amp;rsquo;m recommending semantic web tools to help you keep track of things such as the resolution of your digital images, you might ask &amp;ldquo;what&amp;rsquo;s so semantic about that?&amp;rdquo; It&amp;rsquo;s not particularly semantic, but it uses semantic web technology to track metadata that helps your staff and customers more easily find the content that they need, so it does help toward the greater goal. If you want to push this technology a little further to incorporate metadata about the semantics of the content—without spending money on software—look into &lt;a href=&#34;http://www.opencalais.com/&#34;&gt;OpenCalais&lt;/a&gt;, which analyzes content and returns a copy with RDF representations of key terms it found and information about what classes those key terms fall into (for example, that &amp;ldquo;Slumdog Millionaire&amp;rdquo; is a Movie or that &amp;ldquo;Golden Globe&amp;rdquo; is an EntertainmentAwardEvent). I played with the first release of OpenCalais to create the &lt;a href=&#34;http://www.snee.com/blogbigpicture/&#34;&gt;BlogBigPicture&lt;/a&gt; website, which uses this metadata to ease navigation of news about Hollywood gossip, investing, the British Premier League, world business, and U.S. politics. You can take the metadata that OpenCalais returns and store it in the triplestore of metadata about your content as easily as you can store information about the resolution of your digital images.&lt;/p&gt;
&lt;p&gt;Don&amp;rsquo;t let the grander ideas about semantics distract you too much just yet, though. Prototypes aimed at lower-hanging fruit will give you a better focus on which of the grand ideas can help your business. There&amp;rsquo;s plenty of free software available to create these prototypes, and even Oracle provides support for triplestores nowadays. So, if you&amp;rsquo;re interested in what semantic web technology can do for your publishing business, start thinking about some inexpensive short-term projects that will give you a better idea of the long-term possibilities.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ivan-herman.net/&#34; title=&#34;http://www.ivan-herman.net/&#34;&gt;Ivan Herman&lt;/a&gt; on &lt;a href=&#34;#comment-2216&#34;&gt;February 7, 2009 3:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;On the issue of what the Library world can give to the Semantic Web: another thing is a set of stable URI-s at least in their domain of discourse. A question I have asked before: what is the URI that I can use on the SW to make statements on Bach&amp;rsquo;s Hohe Messe or Thomas Mann&amp;rsquo;s novel &amp;lsquo;Joseph and his brothers&amp;rsquo;? Sure, some of these entities are on wikipedia, hence one can use their DBpedia URI-s. But for many items in the literary or musical world, just to take these two examples, this would not work. References set up to those &amp;lsquo;works&amp;rsquo; by major libraries (with suitable sameAs statements if there are different libraries giving URIs to the same work) would be great, and they are in a unique position to do that&amp;hellip;&lt;/p&gt;
&lt;p&gt;Cheers&lt;/p&gt;
&lt;p&gt;Ivan&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2217&#34;&gt;February 7, 2009 9:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Ivan,&lt;/p&gt;
&lt;p&gt;Identifying specific editions with a URI is easy: &lt;a href=&#34;http://www.rfc-archive.org/getrfc.php?rfc=3187&#34;&gt;http://www.rfc-archive.org/getrfc.php?rfc=3187&lt;/a&gt;, e.g. urn:isbn:1400040019.&lt;/p&gt;
&lt;p&gt;For a single URI to represent a work, it&amp;rsquo;s a lot tougher&amp;ndash;are you looking for a single URI to represent &amp;ldquo;Joseph and his Brothers&amp;rdquo;, &amp;ldquo;Joseph und seine Brüder&amp;rdquo;, &amp;ldquo;Joseph and His Brothers: The Stories of Jacob, Young Joseph, Joseph in Egypt, Joseph the Provider&amp;rdquo;, and &amp;ldquo;Joseph the Provider (Joseph and his Brothers, Young Joseph, Joseph in Egypt)&amp;rdquo;, (the latter two being titles on Amazon)?&lt;/p&gt;
&lt;p&gt;I could ask questions like &amp;ldquo;If an author revises and/or retitles a work in his or her own lifetime, should the new version get a new URI?&amp;rdquo; but some committee should be able to work out standards for something like that. The bigger, tougher question is the proper jurisdiction of URI assignment for a particular work:&lt;br /&gt;
if a publisher assigns an ISBN number to an edition of a work that they&amp;rsquo;re publishing, who would assign a specific URI that covers multiple editions from multiple publishers to a work like this Mann novel?&lt;/p&gt;
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-2218&#34;&gt;February 8, 2009 9:37 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Don&amp;rsquo;t let the grander ideas about semantics distract you too much just yet, though. Prototypes aimed at lower-hanging fruit&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Good advice Bob. I&amp;rsquo;ve set aside my own goal of being the URL provider Ivan mentions (but for the travel space) in favor of creating an internal semantic tool for tracking software, machines, and processes. With tangible, visible benefits to show, I&amp;rsquo;ve been able to engage more support and promote the idea.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ibrg.zoo.ox.ac.uk&#34; title=&#34;http://ibrg.zoo.ox.ac.uk&#34;&gt;Dr David Shotton&lt;/a&gt; on &lt;a href=&#34;#comment-2259&#34;&gt;April 30, 2009 8:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Dear Bob,&lt;/p&gt;
&lt;p&gt;Thanks for your useful comments. I would like to bring to your attention three articles that I have published in April 2009 on the subject of semantic publishing, in the hope of contributing to this debate, detailed at &lt;a href=&#34;http://imageweb.zoo.ox.ac.uk/pub/2008/publications/Shotton_Articles_on_Semantic_Publishing.pdf&#34;&gt;http://imageweb.zoo.ox.ac.uk/pub/2008/publications/Shotton_Articles_on_Semantic_Publishing.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;An excellent critique of what we describe in the first of these articles is given by Rod Page at &lt;a href=&#34;http://iphylo.blogspot.com/2009/04/semantic-publishing-towards-real.html&#34;&gt;http://iphylo.blogspot.com/2009/04/semantic-publishing-towards-real.html&lt;/a&gt;, in which, in essence, he correctly says we did not go far enough in terms of making machine-readable data and metadata available, thereby failing to contribute to an ecosystem of linked data (&lt;a href=&#34;http://linkeddata.org/&#34;&gt;http://linkeddata.org/&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The third article, on our Citation Typing Ontology, also comments indirectly on the problem discussed in your blog by Ivan and yourself about URIs for particular works. I believe the issues surrounding this problem of URIs are best clarified by adopting the FRBR classification (&lt;a href=&#34;http://www.ifla.org/publications/functional-requirements-for-bibliographic-records&#34;&gt;http://www.ifla.org/publications/functional-requirements-for-bibliographic-records&lt;/a&gt;; &lt;a href=&#34;http://www.frbr.org/&#34;&gt;http://www.frbr.org/&lt;/a&gt;; &lt;a href=&#34;http://en.wikipedia.org/wiki/FRBR&#34;&gt;http://en.wikipedia.org/wiki/FRBR&lt;/a&gt;), developed by librarians to distinguish works, expressions and manifestations. URIs are most conveniently used to refer to expressions of works - the same items to which DOIs refer (&lt;a href=&#34;http://www.doi.org/&#34;&gt;http://www.doi.org/&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I hope you find these papers interesting and helpful.&lt;/p&gt;
&lt;p&gt;Kind regards,&lt;/p&gt;
&lt;p&gt;David&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Publishers and semantic web technology</title>
      <link>https://www.bobdc.com/blog/publishers-and-semantic-web-te/</link>
      <pubDate>Thu, 29 Jan 2009 13:47:14 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/publishers-and-semantic-web-te/</guid>
      
      
      <description><div>A response to Dale Waldt&#39;s Gilbane XML posting on semantics and the web.</div><div>&lt;p&gt;My old friend Dale Waldt (I remember, immediately after the announcement of the existence of XML at SGML 1996, going up to my then-coworker Dale and asking &amp;ldquo;So what do we think?&amp;rdquo;) recently posted an entry on the Gilbane XML blog titled &lt;a href=&#34;http://gilbane.com/xml/2009/01/why-adding-semantics-to-web-da.html&#34;&gt;Why Adding Semantics to Web Data is Difficult&lt;/a&gt;. A few days ago I posted a comment saying that the things that he saw as missing from semantic technologies are actually already there and working well, but my reply hasn&amp;rsquo;t shown up yet, so after a bit of revision, I&amp;rsquo;m putting it here. For my blog entry categories, I&amp;rsquo;ve put this under &amp;ldquo;Publishing&amp;rdquo; because most of what I&amp;rsquo;ve written below is already familiar to people in the semantic web world, but not as widely known in the publishing world.&lt;/p&gt;
&lt;p&gt;Dale wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Consider though, that the schema in use can tell us the names of semantically defined elements, but not necessarily their meaning. I can tell you something about a piece of data by using the &amp;lt;income&amp;gt; tag, but how, in a schema can I tell you it is a net &amp;lt;income&amp;gt; calculated using the guidelines of US Internal Revenue Service, and therefore suitable for eFiling my tax return? For that matter, one system might use the element type name &amp;lt;net_income&amp;gt; while another might use &amp;lt;inc&amp;gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is why the semantic web is built around URLs, not just element names. If someone refers to a &amp;ldquo;title&amp;rdquo; and you don&amp;rsquo;t know whether that person is an HR administrator who means &amp;ldquo;job title&amp;rdquo; or a realtor referring to the deed to a piece of property, you don&amp;rsquo;t know what they mean. However, if I refer to a &lt;a href=&#34;http://purl.org/dc/elements/1.1/title&#34;&gt;http://purl.org/dc/elements/1.1/title&lt;/a&gt;, you know that I mean the title of a work or resource, because the URL makes it clear that I&amp;rsquo;m referring to the Dublin Core sense of the term.&lt;/p&gt;
&lt;blockquote id=&#34;id197068&#34; class=&#34;pullquote&#34;&gt;The things that Dale saw as missing from semantic technologies are actually already there and working well.&lt;/blockquote&gt;
&lt;p&gt;As I understand it, XBRL&amp;rsquo;s goal was not to standardize the vocabularies of element type names as much as to standardize ways of identifying them. For example, in GE&amp;rsquo;s XBRL financial statement, they chose to identify net income with the URL &lt;a href=&#34;http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome&#34;&gt;http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome&lt;/a&gt; and have this &lt;a href=&#34;http://www.secinfo.com/%24/SEC/Filing.asp?D=17Je.vx.5&#34;&gt;declared in a filed document&lt;/a&gt;. Instead of encouraging everyone to create their own new vocabularies, though, the XBRL effort did create a &lt;a href=&#34;http://xbrl.us/pages/us-gaap.aspx&#34;&gt;set of US GAAP taxonomies&lt;/a&gt;, and these are forming a core set of documented, commonly understood terminology for U.S. accounting.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How will we know that elements labeled with &amp;lt;net_income&amp;gt; and &amp;lt;inc&amp;gt; are the same and should be handled as such?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let&amp;rsquo;s assume that company X uses the term &amp;ldquo;net_income&amp;rdquo; and company Y uses the term &amp;ldquo;inc&amp;rdquo;. When they publicly define what they mean by these terms using OWL ontologies or XBRL taxonomies, they avoid the confusion you describe by defining them with URLs, just as the OCLC did for Dublin Core terms, so let&amp;rsquo;s say the terms&amp;rsquo; full names are &lt;a href=&#34;http://www.x.com/ns/xbrl/net_income&#34;&gt;http://www.x.com/ns/xbrl/net_income&lt;/a&gt; and &lt;a href=&#34;http://www.y.com/some/path/inc&#34;&gt;http://www.y.com/some/path/inc&lt;/a&gt;. (Of course, if an XML document includes the namespace declarations xmlns:x=&amp;quot;http://www.x.com/ns/xbrl/&amp;quot; and xmlns:y=&amp;quot;http://www.y.com/some/path/&amp;quot;, the element names can use the abbreviations x:net_income and y:inc.)&lt;/p&gt;
&lt;p&gt;The following bit of OWL asserts that they&amp;rsquo;re both the same as GE&amp;rsquo;s term for net income, and a SPARQL query that uses the GE URL to say &amp;ldquo;get me net income figures&amp;rdquo; will get the others as well:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;owl:ObjectProperty 
  rdf:about=&amp;quot;http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome&amp;quot;&amp;gt;


  &amp;lt;owl:equivalentProperty&amp;gt;
    &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.x.com/ns/xbrl/net_income&amp;quot;/&amp;gt;
  &amp;lt;/owl:equivalentProperty&amp;gt;


  &amp;lt;owl:equivalentProperty&amp;gt;
    &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.y.com/some/path/inc&amp;quot;/&amp;gt;
  &amp;lt;/owl:equivalentProperty&amp;gt;


&amp;lt;/owl:ObjectProperty&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This nicely demonstrates the potential of OWL as metadata that adds value to existing bodies of data.&lt;/p&gt;
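&lt;p&gt;Here&amp;rsquo;s a Python sketch of the inference behind that claim. The company filing URLs and the dollar figures are invented, and a real OWL-aware engine derives the extra triples itself from the equivalentProperty assertions rather than via this hand-rolled loop:&lt;/p&gt;

```python
# Sketch of what an OWL-aware query engine infers from the
# equivalentProperty assertions above. The filing URLs and the
# income figures below are made up for illustration.

GE_NET_INCOME = ("http://www.xbrl.org/us/fr/common/pte/2005-02-28"
                 "#usfr-pte:NetIncome")
X_NET_INCOME = "http://www.x.com/ns/xbrl/net_income"
Y_INC = "http://www.y.com/some/path/inc"

equivalences = [(GE_NET_INCOME, X_NET_INCOME), (GE_NET_INCOME, Y_INC)]

data = [
    ("http://www.x.com/filings/2008", X_NET_INCOME, "17500000"),
    ("http://www.y.com/filings/2008", Y_INC, "2300000"),
]

# owl:equivalentProperty is symmetric: anything stated with one
# property also holds for the other, so derive the extra triples.
inferred = list(data)
for p1, p2 in equivalences:
    for s, p, o in data:
        if p == p1:
            inferred.append((s, p2, o))
        elif p == p2:
            inferred.append((s, p1, o))

# A query that only mentions the GE term finds both companies' figures.
results = [(s, o) for s, p, o in inferred if p == GE_NET_INCOME]
print(results)
```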
&lt;p&gt;OWL has been a standard for four years, and there are several implementations available that let you do this. (Speaking of semantics, in addition to defining such equivalences, OWL can also &lt;a href=&#34;https://www.bobdc.com/blog/adding-semantics-to-make-data&#34;&gt;encode semantics&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The great thing about OWL&amp;rsquo;s relationship to XBRL is that much of XBRL is about defining taxonomies and semantics, and OWL is about building on such definitions to get more value out of data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Obviously a industry standard like XBRL (eXtensible Business Reporting Language) can help standardize vocabularies for element type names, but this cannot be the whole solution or XBRL use would be more widespread.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;XBRL helps to standardize naming within the world of business reporting, but the need for vocabulary definition standards and tools goes well beyond that world. (The full set of XBRL specs is also a complex solution to a complex problem, which keeps adoption from spreading quickly.) The goal of RDFS was to help people define such vocabularies, but &lt;a href=&#34;https://www.bobdc.com/blog/rdfs-without-rdfowl&#34;&gt;OWL provides a superset of RDFS&lt;/a&gt; and offers slicker tools, so people sometimes build OWL ontologies when they only need an RDFS vocabulary.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think the Semantic Web will require more than schemas and XML-aware search tools to reach its full potential in intelligent data and applications that process them. What is probably needed is a concerted effort to build semantic data and tools that can process these included browsing, data storage, search, and classification tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For data storage and search, commercial and open source &lt;a href=&#34;http://esw.w3.org/topic/LargeTripleStores&#34;&gt;triplestore tools&lt;/a&gt; are available. (I &lt;a href=&#34;https://www.bobdc.com/blog/playing-with-some-rdf-stores&#34;&gt;recently mentioned&lt;/a&gt; that I&amp;rsquo;ve been blogging less because I&amp;rsquo;ve been looking into them.) For browsing, new &lt;a href=&#34;http://www.google.com/search?q=%22semantic%20web%22%20firefox&#34;&gt;semantic web Firefox plugins&lt;/a&gt; crop up all the time. I&amp;rsquo;ll discuss classification next week, but as a hint, it turns around the question of what semantic web technology can bring to the publishing world—it&amp;rsquo;s more about what they can learn from the publishing world.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://blogs.law.harvard.edu/pkeane&#34; title=&#34;http://blogs.law.harvard.edu/pkeane&#34;&gt;Peter Keane&lt;/a&gt; on &lt;a href=&#34;#comment-2210&#34;&gt;January 29, 2009 4:12 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Semantic Web ideals, while quite exciting, have always struck me as too much of an all-or-none proposition: either my data is part of this universal graph of knowledge or it isn&amp;rsquo;t, based on whether I have encoded my data in triples (RDF, RDFa, etc). But it is not always the consumer that needs help &amp;ldquo;understanding&amp;rdquo; my data&amp;rsquo;s place in that graph &amp;ndash; I as the producer do as well. And semantic assertions (i.e., this tag equals dc:title) take time and understanding, which many/most do not have. What they DO have is domain knowledge. E.g., &amp;ldquo;Here&amp;rsquo;s what the figures in this column of this spreadsheet I am publishing on the web as an HTML table mean.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d love to see tools that allow publishers to make their data &amp;ldquo;smarter&amp;rdquo; over time &amp;ndash; not as an all-or-none proposition. Yahoo&amp;rsquo;s Search Monkey and perhaps GRDDL (?) are perhaps steps in the right direction. As another example, tagging seems to be a fairly easy-to-grok and easy-to-implement feature. How about more focus on something simple like tagging and tools that allow the publisher to then create equivalencies between a tag on their site and some domain-specific ontology if such exists (and probably best we don&amp;rsquo;t use the word &amp;ldquo;ontology&amp;rdquo; ;-)). My take (influenced by working w/ folks in higher ed) is that folks are willing to do a bit of work to &amp;ldquo;rationalize&amp;rdquo; their data, esp. if they gain some benefit. But not a lot of work, and especially not if they need to understand a whole new world of knowledge representation.&lt;/p&gt;
&lt;p&gt;Our approach has been to create a system (theoretically) as easy to use as FilemakerPro, Microsoft Access, Excel, etc. Users can create arbitrary sets of &amp;ldquo;attributes&amp;rdquo; for their collections of digital things (audio, images, video, documents, web pages) and then assign values as they wish. They may start with just a title and date, but when possible they may add much more detailed metadata. And commercial sets of, say, images+metadata are easy to incorporate as well.&lt;/p&gt;
&lt;p&gt;Everything is stored in a backend with Atom/AtomPub interfaces in and out. The key-value pairs are simply held in atom:category elements &amp;ndash; one atom entry for every item in the system. Many of these collections do, in fact, map to existing metadata schemes, VRA Core4 for images, for example. But indeed, everything has a scheme, if only local to that one user&amp;rsquo;s collection. Much is gained here in terms of interoperability (Google spreadsheets is becoming a favorite data creation tool, since it is so easy to &amp;ldquo;import&amp;rdquo; into our system), preservation, data portability, etc. And if/when a set of data needs to enter the cloud of linked data, asserting the equivalencies and serializing to RDF is quite easy.&lt;/p&gt;
&lt;p&gt;I guess my point is that there is some low-hanging fruit on the way to the Semantic Web that does not require publishers to join up here and now. Simply thinking in terms of regularizing metadata schemes, data portability, simple xml-based formats (Atom +1) get us a very significant way along a useful path. Not, certainly, the whole vision of the Semantic Web but quite useful nonetheless.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://gilbane.com&#34; title=&#34;http://gilbane.com&#34;&gt;Frank Gilbane&lt;/a&gt; on &lt;a href=&#34;#comment-2211&#34;&gt;January 29, 2009 6:44 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob:&lt;/p&gt;
&lt;p&gt;Sorry your comment didn&amp;rsquo;t show up. I just found it in the comment spam folder, published it, and sent Dale an email.&lt;/p&gt;
&lt;p&gt;Frank&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Playing with some RDF stores</title>
      <link>https://www.bobdc.com/blog/playing-with-some-rdf-stores/</link>
      <pubDate>Mon, 26 Jan 2009 22:33:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/playing-with-some-rdf-stores/</guid>
      
      
      <description><div>Instead of blogging.</div><div>&lt;p&gt;I recently realized that most of my experience with RDF has been with tools that load triples into memory and then work with them there, so I&amp;rsquo;ve decided to get to know the disk-based triplestores out there better: Jena, Joseki, Sesame, AllegroGraph, OpenLink, Mulgara&amp;hellip; let me know if I&amp;rsquo;m missing anything here.&lt;/p&gt;
&lt;p&gt;This is consuming just about all of my free time at the computer (of which I have little lately because of some very long hours for the employer), so I&amp;rsquo;ve had a lot less time to write for the weblog. When I&amp;rsquo;ve gotten further with this research, though, I&amp;rsquo;ll have a lot to write about.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.linkedin.com/in/erics&#34; title=&#34;http://www.linkedin.com/in/erics&#34;&gt;Eric Schoonover&lt;/a&gt; on &lt;a href=&#34;#comment-2205&#34;&gt;January 26, 2009 11:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d really encourage you to take a look at Intellidimensions Semantic Server product. It runs on top of SQL server (any edition including Express) and there is a 60 day trial version. They also have a free academic license.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.intellidimension.com/&#34;&gt;http://www.intellidimension.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Note: make sure if you are using SQL Server Express that you pull down the advanced version that includes full text indexing.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ldodds.com/blog&#34; title=&#34;http://www.ldodds.com/blog&#34;&gt;Leigh Dodds&lt;/a&gt; on &lt;a href=&#34;#comment-2206&#34;&gt;January 27, 2009 7:23 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;If you want to add the Talis Platform to your list of services to explore, then just drop me a mail and I&amp;rsquo;ll get you set up with a developer account.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2207&#34;&gt;January 27, 2009 8:14 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Leigh!&lt;/p&gt;
&lt;p&gt;Eric, I may try that, since I do have a copy of SQL Server running.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thefigtrees.net/lee/blog/&#34; title=&#34;http://thefigtrees.net/lee/blog/&#34;&gt;Lee Feigenbaum&lt;/a&gt; on &lt;a href=&#34;#comment-2208&#34;&gt;January 29, 2009 12:27 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re also welcome to try out Open Anzo - &lt;a href=&#34;http://openanzo.org&#34;&gt;http://openanzo.org&lt;/a&gt; - but really, most of these stores aren&amp;rsquo;t (or shouldn&amp;rsquo;t be) directly comparable. Most of them have their sweet spot(s), whether it be raw speed, clustering/scalability, federation, enterprise features, collaboration, full-on inference, lightweight (e.g. RDFS) inferencing, etc.&lt;/p&gt;
&lt;p&gt;Horses for courses, and all that.&lt;/p&gt;
&lt;p&gt;Anyway, drop me a line if you&amp;rsquo;re interested to hear more about my personal take on various stores&amp;rsquo; sweet spots, or, even better, I&amp;rsquo;d love to hear what you think after you&amp;rsquo;ve played around some. :-)&lt;/p&gt;
&lt;p&gt;Lee&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-2209&#34;&gt;January 29, 2009 11:58 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m totally biased, I like Jena&amp;hellip;most of all because I don&amp;rsquo;t have enough time to learn more than one semweb tool ;-), but also because they have some SPARQL heavy hitters and it supports the latest greatest sparql goodness (updates). So to tempt you along I&amp;rsquo;ve edited this example for you&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://tinyurl.com/bzuqrk&#34;&gt;http://tinyurl.com/bzuqrk&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been playing with different ways to code to the jena api, this example is jenabean&amp;rsquo;s &amp;ldquo;Thing&amp;rdquo; which uses simple interfaces to simplify asserting new triples. It makes it easy to polymorph into various vocabs, the library comes with just a few, but it&amp;rsquo;s very easy to create more.&lt;/p&gt;
&lt;p&gt;By Martin Brousseau on &lt;a href=&#34;#comment-2213&#34;&gt;February 4, 2009 5:37 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;br /&gt;
Don&amp;rsquo;t forget to add BigOWLIM and Virtuoso to your shopping list.&lt;br /&gt;
They&amp;rsquo;re known to be among the most scalable triplestores. BigOWLIM uses the Sesame API.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2215&#34;&gt;February 5, 2009 12:34 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Martin. Virtuoso was on my list from the beginning. By adding an OWL layer to Sesame, BigOWLIM looks very cool, so I will definitely be playing with it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/triplestores">triplestores</category>
      
    </item>
    
    <item>
      <title>Our long national nightmare is over</title>
      <link>https://www.bobdc.com/blog/our-long-national-nightmare-is/</link>
      <pubDate>Mon, 19 Jan 2009 23:08:25 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/our-long-national-nightmare-is/</guid>
      
      
      <description><div>According to a highly specialized hardware device.</div><div>&lt;img id=&#34;id197063&#34; src=&#34;https://www.bobdc.com/img/main/backwardsbush.jpg&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto &#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;&#39;Backwards Bush&#39; device&#34; width=&#34;480px&#34;/&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By Prateek on &lt;a href=&#34;#comment-2203&#34;&gt;January 20, 2009 1:59 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;ROFL.. =)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.TimothyHorrigan.com&#34; title=&#34;http://www.TimothyHorrigan.com&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-2204&#34;&gt;January 20, 2009 10:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Actually Bush II&amp;rsquo;s term doesn&amp;rsquo;t run out till exactly noon DC time.&lt;/p&gt;
&lt;p&gt;In 1989, as they usually do, the incoming Vice President (Dan Quayle) took his oath of office about 5 minutes before noon, and then they had about 10 minutes of music and pomp and circumstance before the incoming President (Bush I) actually took his oath. During that interim, President Reagan fell asleep&amp;mdash; not for good, it was just a little mini-nap. I was wondering what would happen if Reagan died before Bush took the oath&amp;hellip; might Quayle become President? At that moment, the anchorperson (probably Tom Brokaw, but maybe Peter Jennings or his CBS counterpart) took the mike to assure us that Bush automatically became President at the stroke of noon, regardless of whether or not he had been sworn in yet.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Displaying a message box from the Windows command line</title>
      <link>https://www.bobdc.com/blog/displaying-a-message-box-from/</link>
      <pubDate>Sat, 17 Jan 2009 13:19:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/displaying-a-message-box-from/</guid>
      
      
      <description><div>With no special software or compiling; just a little scripting.</div><div>&lt;img id=&#34;id197045&#34; src=&#34;https://www.bobdc.com/img/main/msgbox.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;message box created by script&#34;/&gt;
&lt;p&gt;When I run a time-consuming batch file that executes perl scripts or XSLT stylesheets on hundreds of files, I usually end the batch file with an &lt;code&gt;echo&lt;/code&gt; command with only a Control-G as its output, so that a beep lets me know that the job is done. Processing some client files while watching &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#at2&#34;&gt;Mark Birbeck speak at XML 2008&lt;/a&gt;, I knew it would be rude to have my computer emit such an obnoxious beep, so I found a nice alternative: a command line way to display a message box about my task being finished using only native Windows features.&lt;/p&gt;
&lt;p&gt;First, I needed a short Windows JavaScript script like this, which I called msgbox.js:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if (WScript.Arguments.length &amp;lt; 1) {
    msg = &amp;quot;No message supplied&amp;quot;
}
else {
  msg = &amp;quot;&amp;quot;;
  for (i = 0; i &amp;lt; WScript.Arguments.length; i++) {
      msg = msg + WScript.Arguments.Item(i) + &amp;quot; &amp;quot;;
  }
}
WScript.Echo(msg);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If it&amp;rsquo;s invoked with any arguments, it displays them as the text of a message box.&lt;/p&gt;
&lt;p&gt;Then, I wrote this one-line batch file, which I called msgbox.bat, to call the JavaScript script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wscript \util\msgbox.js %*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;WScript is the more Windows-oriented sibling of CScript, the Windows JavaScript engine that I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/windows-command-line-text-proc&#34;&gt;written about before&lt;/a&gt;. They&amp;rsquo;re both included with Windows.&lt;/p&gt;
&lt;p&gt;Now, if I end a batch file with this line,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;msgbox Yo! The fixfiles.bat batch file is all done.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;the message box shown above displays.&lt;/p&gt;
&lt;p&gt;Of course there are dozens of other ways to display a message box, but it&amp;rsquo;s always nice to find a way to do something useful with minimal code and no downloaded or newly purchased software.&lt;/p&gt;
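The argument handling in msgbox.js translates almost line for line to other scripting languages. Here is a rough Python sketch of the same join-or-default logic (the name format_message is my own; the actual dialog still comes from WScript.Echo on Windows, which this sketch only stands in for with print):

```python
# Sketch of msgbox.js's argument handling in Python. The helper name
# format_message is hypothetical; on Windows, WScript.Echo pops up the
# real dialog, so this sketch just prints the assembled text instead.
import sys

def format_message(args):
    """Mirror msgbox.js: default text if no arguments, else join them."""
    if not args:
        return "No message supplied"
    # msgbox.js appends a space after every argument, so this does too.
    return "".join(arg + " " for arg in args)

if __name__ == "__main__":
    print(format_message(sys.argv[1:]))
```

Called as `python msgbox.py Yo! The job is done.`, it assembles the arguments into one message string, just as the batch file's `%*` hands them all to msgbox.js.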
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By gonzalo rodriguez on &lt;a href=&#34;#comment-2374&#34;&gt;November 26, 2009 6:51 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Greetings:&lt;br /&gt;
Thank you very much for that information, did not know WScript.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.web-panda.ru&#34; title=&#34;http://www.web-panda.ru&#34;&gt;Alexander&lt;/a&gt; on &lt;a href=&#34;#comment-2436&#34;&gt;January 25, 2010 5:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi, thanks for the post. I have looked for a fast and simple way to display a message box under Windows. And I have found it ! =)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Hey CNN, SPARQL isn&#39;t so difficult.</title>
      <link>https://www.bobdc.com/blog/hey-cnn-sparql-isnt-so-difficu/</link>
      <pubDate>Thu, 08 Jan 2009 09:19:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/hey-cnn-sparql-isnt-so-difficu/</guid>
      
      
      <description><div>And like any programming language, it doesn&#39;t have to be convoluted.</div><div>&lt;img id=&#34;id197052&#34; src=&#34;http://upload.wikimedia.org/wikipedia/en/thumb/5/59/Missy-sup-dupa-fly.jpg/200px-Missy-sup-dupa-fly.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Supa Dupa Fly cover&#34; width=&#34;150px&#34;/&gt;
&lt;p&gt;In the December 17th cnn.com/technology article &lt;a href=&#34;http://www.cnn.com/2008/TECH/12/17/db.semanticweb/&#34;&gt;Making sense of the &amp;lsquo;semantic Web&amp;rsquo;&lt;/a&gt;, Steve Mollman wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Consider, for instance, SPARQL, a query language. To find, say, music artists associated with the producer Timbaland, you&amp;rsquo;d have to type a long piece of convoluted code that most of us wouldn&amp;rsquo;t bother to do.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Steve: try pasting the following into DBPedia&amp;rsquo;s &lt;a href=&#34;http://dbpedia.org/sparql&#34;&gt;SPARQL Query Form&lt;/a&gt; and then clicking the &amp;ldquo;Run Query&amp;rdquo; button:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX d: &amp;lt;http://dbpedia.org/property/&amp;gt;


SELECT DISTINCT ?artist WHERE {
  ?album d:artist ?artist.
  ?album d:producer &amp;lt;http://dbpedia.org/resource/Timbaland&amp;gt;.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or just click &lt;a href=&#34;http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&amp;amp;should-sponge=&amp;amp;query=PREFIX+d%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2F%3E%0D%0A%0D%0ASELECT+DISTINCT+%3Fartist+WHERE+%7B%0D%0A++%3Falbum+d%3Aartist+%3Fartist.%0D%0A++%3Falbum+d%3Aproducer++%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTimbaland%3E.%0D%0A%7D%0D%0A&amp;amp;format=text%2Fhtml&amp;amp;debug=on&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That wasn&amp;rsquo;t so bad, was it? I could make it shorter, but if you&amp;rsquo;re not familiar with basic SPARQL queries, you might consider the shorter version more convoluted. Of course, it doesn&amp;rsquo;t have the elegant clarity of this bit of JavaScript included as part of your article&amp;rsquo;s web page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if(cnnWinExtraRegExp.test(cnnWinExtra)){var cnnOmniExtra = 
  cnnWinExtraRegExp.split(cnnWinExtra);cnnWinLoc = cnnWinLoc + cnnOmniExtra[0];}
else {cnnWinLoc = cnnWinLoc + cnnWinExtra;}}
if (typeof(cnnPageName) != &amp;quot;undefined&amp;quot;) {s.pageName = 
  cnnPageName;s.eVar1 = cnnPageName;} else {s.pageName = cnnWinLoc;s.eVar1 = cnnWinLoc;}
if (typeof(cnnSectionName) != &amp;quot;undefined&amp;quot;) {s.channel=cnnSectionName;s.eVar2=cnnSectionName;} 
else {s.channel=&amp;quot;Nonlabeled&amp;quot;;s.eVar2=&amp;quot;Nonlabeled&amp;quot;;}
if (typeof(cnnSubSectionName) != &amp;quot;undefined&amp;quot;) 
{s.server=cnnSubSectionName;s.eVar3=cnnSubSectionName;} else {s.server=&amp;quot;&amp;quot;;s.eVar3=&amp;quot;&amp;quot;;}
if (typeof(cnnSectionFront) != &amp;quot;undefined&amp;quot;) {s.prop1=cnnSectionFront;} 
if (typeof(cnnContentType) != &amp;quot;undefined&amp;quot;) {s.prop4=cnnContentType;s.prop6=s.pageName;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with SQL and other query languages, no one expects end users to type out SPARQL queries like the one above, but someone who already knows a scripting language or two can pick up SPARQL and use it to build new kinds of applications. Like the JavaScript included in your article&amp;rsquo;s web page, SPARQL will play an increasingly valuable role in bringing information to people.&lt;/p&gt;
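For anyone wanting to issue that query from a script instead of the web form, here is a hedged Python sketch that packages it the way the long query-form URL above does; the query and format parameter names come from that URL, and the rest is ordinary URL encoding:

```python
# Build a GET URL for DBpedia's SPARQL endpoint carrying the Timbaland
# query. A sketch only: the parameter names follow the query-form URL
# quoted in the post, not any official client library.
from urllib.parse import urlencode

query = """PREFIX d: <http://dbpedia.org/property/>
SELECT DISTINCT ?artist WHERE {
  ?album d:artist ?artist.
  ?album d:producer <http://dbpedia.org/resource/Timbaland>.
}"""

# urlencode percent-escapes the query text so it can travel in a URL.
params = urlencode({"query": query, "format": "text/html"})
url = "http://dbpedia.org/sparql?" + params
print(url)
```

Fetching that URL (with urllib or any HTTP client) returns the same artist list the query form shows.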
&lt;p&gt;And since today is Elvis&amp;rsquo;s birthday, I hope that the next time someone does an updated remix of an Elvis tune to follow &amp;ldquo;A Little Less Conversation&amp;rdquo; and &amp;ldquo;Rubberneckin&amp;rsquo;,&amp;rdquo; it&amp;rsquo;s Timbaland.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By carmen on &lt;a href=&#34;#comment-2201&#34;&gt;January 8, 2009 10:15 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s something to be said about not inventing another syntax.&lt;/p&gt;
&lt;p&gt;Metaweb has shown how SPARQL-like queries can be formatted in JSON. This makes it much easier for end users of browser-based tools.&lt;/p&gt;
&lt;p&gt;agree with the original point of the post. why do these cnet/nyt/cnn type places like to make sweeping remarks?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2202&#34;&gt;January 8, 2009 11:36 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Do you have a URL for that?&lt;/p&gt;
&lt;p&gt;I must admit, I&amp;rsquo;ve been hearing &amp;ldquo;SPARQL-like&amp;rdquo; from so many different directions that I really prefer to go with the actual standard. The syntax isn&amp;rsquo;t so bad, and there are a lot of implementations out there.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2009">2009</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Turtles all the way down</title>
      <link>https://www.bobdc.com/blog/turtles-all-the-way-down/</link>
      <pubDate>Tue, 30 Dec 2008 14:13:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/turtles-all-the-way-down/</guid>
      
      
      <description><div>A nice early version, without the turtles.</div><div>&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Turtles_all_the_way_down&#34;&gt;&lt;img id=&#34;id197030&#34; src=&#34;http://upload.wikimedia.org/wikipedia/en/thumb/5/5a/Yertle_the_Turtle_and_Other_Stories_cover.png/200px-Yertle_the_Turtle_and_Other_Stories_cover.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;&#39;Yertle the Turtle&#39; cover&#34; width=&#34;160px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been reading &lt;a href=&#34;http://www.gutenberg.org/dirs/etext00/eduha10h.htm&#34;&gt;The Education of Henry Adams&lt;/a&gt; because I heard that this descendant of two US presidents had some interesting perspectives on the effects of technological progress on peoples&amp;rsquo; lives—in his case, in the latter half of the 19th century, when things changed more than they have in the second half of the 20th. Near the end, he quotes the French mathematician Henri Poincaré:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Doubtless if our means of investigation should become more and more penetrating, we should discover the simple under the complex; then the complex under the simple; then anew the simple under the complex; and so on without ever being able to foresee the last term.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It reminds me of the &lt;a href=&#34;http://en.wikipedia.org/wiki/Turtles_all_the_way_down&#34;&gt;turtles all the way down&lt;/a&gt; story, whose earliest mentions come several decades after Poincaré and Adams. Wikipedia has a nice overview of the various location/lecturer/audience-member attributions included in popular versions of this story of the earth&amp;rsquo;s cosmology.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
    </item>
    
    <item>
      <title>A belated Christmas wish: a SPARQL endpoint for Digg RDF</title>
      <link>https://www.bobdc.com/blog/a-christmas-wish-a-sparql-endp/</link>
      <pubDate>Fri, 26 Dec 2008 14:47:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-christmas-wish-a-sparql-endp/</guid>
      
      
      <description><div>Or consider it a lazy semweb wish.</div><div>&lt;p&gt;I&amp;rsquo;ve been looking for a SPARQL endpoint that provides new data fairly regularly—not just new triples to query, but data that is new to the world, such as from a stock ticker feed. If the &lt;a href=&#34;http://www.betanews.com/article/Digg_makes_official_its_adoption_of_a_semantic_Web_standard/1209743762&#34;&gt;RDFa on digg.com pages&lt;/a&gt; was accumulated in a database that could be queried as a SPARQL endpoint, that would certainly qualify, and it would be fun to play with.&lt;/p&gt;
&lt;p&gt;&lt;img id=&#34;id197053&#34; src=&#34;http://ebiquity.umbc.edu/blogger/wp-content/uploads//2006/05/sparql.png&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;sparql logo&#34; width=&#34;140pt&#34;/&gt; &lt;img id=&#34;id197070&#34; src=&#34;http://www.nelsonguirado.com/media/blogs/choiceplease/digg.gif&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;digg logo&#34; width=&#34;140pt&#34;/&gt;&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2197&#34;&gt;December 29, 2008 11:25 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Please try:&lt;br /&gt;
&lt;a href=&#34;http://demo.openlinksw.com/sparql&#34;&gt;http://demo.openlinksw.com/sparql&lt;/a&gt; or /isparql&lt;/p&gt;
&lt;p&gt;This instance of Virtuoso includes our in-built Sponger Middleware.&lt;/p&gt;
&lt;p&gt;The Sponger Middleware converts a plethora of non-RDF resources into RDF-based Linked Data &amp;ldquo;on the fly&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;The Sponger is also integrated into the Virtuoso SPARQL processor so you can put any resource URL in the &amp;ldquo;FROM&amp;rdquo; clause of a SPARQL query. The effect is that you can SPARQL against any Web resource URI via a Virtuoso sparql endpoint.&lt;/p&gt;
&lt;p&gt;1. If a local graph IRI matching the resource URL doesn&amp;rsquo;t exist, the Sponger will crawl the resource&lt;br /&gt;
2. The localized resource is then RDFized (we have RDFizers aka. Cartridges for about 30 different data source types which includes Digg)&lt;br /&gt;
3. The Graph IRI for the sponged resource is always the same as the original resource URL.&lt;/p&gt;
&lt;p&gt;Basically, the Sponger is like a Driver Manager, but instead of dealing with relational data (ala. ODBC, JDBC etc..) it offers dynamic binding to RDF Drivers / Providers / Cartridges which take on the duty of transforming negotiated resource representations into RDF based Linked Data.&lt;/p&gt;
&lt;p&gt;Sample links:&lt;/p&gt;
&lt;p&gt;1. &lt;a href=&#34;http://tinyurl.com/6wu8nt&#34;&gt;http://tinyurl.com/6wu8nt&lt;/a&gt; - this is an ODE page which is the output of SPARQL passed through an HTML template for browsing&lt;br /&gt;
2. &lt;a href=&#34;http://tinyurl.com/7qagne&#34;&gt;http://tinyurl.com/7qagne&lt;/a&gt; &amp;ndash; raw SPARQL endpoint variant&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
Additional information:&lt;/p&gt;
&lt;p&gt;1. &lt;a href=&#34;http://virtuoso.openlinksw.com/presentations/Virtuoso_Sponger_1/Virtuoso_Sponger_1.html&#34;&gt;http://virtuoso.openlinksw.com/presentations/Virtuoso_Sponger_1/Virtuoso_Sponger_1.html&lt;/a&gt;&lt;br /&gt;
2. &lt;a href=&#34;http://virtuoso.openlinksw.com/Whitepapers/pdf/sponger_whitepaper_10102007.pdf&#34;&gt;http://virtuoso.openlinksw.com/Whitepapers/pdf/sponger_whitepaper_10102007.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://bigasterisk.com/&#34; title=&#34;http://bigasterisk.com/&#34;&gt;drewp&lt;/a&gt; on &lt;a href=&#34;#comment-2200&#34;&gt;January 2, 2009 11:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I put up an endpoint for one of my projects. It&amp;rsquo;s not exciting data, but it is new each day.&lt;/p&gt;
&lt;p&gt;announcement&lt;br /&gt;
&lt;a href=&#34;http://drewp.quickwitretort.com/2009/01/02/0&#34;&gt;http://drewp.quickwitretort.com/2009/01/02/0&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;endpoint&lt;br /&gt;
&lt;a href=&#34;http://whatsplayingnext.com/sparql&#34;&gt;http://whatsplayingnext.com/sparql&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Adding metadata value with Pellet</title>
      <link>https://www.bobdc.com/blog/adding-metadata-value-with-pel/</link>
      <pubDate>Mon, 22 Dec 2008 09:37:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/adding-metadata-value-with-pel/</guid>
      
      
      <description><div>A nice new feature of Pellet 2.0.</div><div>&lt;p&gt;The open-source program &lt;a href=&#34;http://clarkparsia.com/pellet&#34;&gt;Pellet&lt;/a&gt; is described as an OWL reasoner, but I&amp;rsquo;ve used it mostly as a SPARQL engine that happens to understand OWL. So, for example, if I have RDF that says &amp;ldquo;Loretta&amp;rsquo;s spouse is Leroy and spouse is a symmetric property,&amp;rdquo; but the data makes no mention of Leroy&amp;rsquo;s spouse, and I ask Pellet &amp;ldquo;who is Leroy&amp;rsquo;s spouse,&amp;rdquo; it can give me the answer.&lt;/p&gt;
&lt;p&gt;Most SPARQL engines can&amp;rsquo;t do this kind of OWL inferencing, and I thought it would be cool if Pellet could read a batch of RDF with some facts and some OWL properties, infer what it can, and then write out a copy of the RDF with all the implicit facts made explicit. This way, the less intelligent SPARQL engines could take advantage of the inferred data. It&amp;rsquo;s one of those holy grails in publishing technology: a process that reads data and adds value to it (in this case, by adding new facts that weren&amp;rsquo;t there before) and then writes out the data in a standard format so that other programs can use it. Pellet 2.0&amp;rsquo;s new &lt;code&gt;extract&lt;/code&gt; subcommand now makes this possible.&lt;/p&gt;
&lt;p&gt;First, let&amp;rsquo;s review how Pellet would run a SPARQL query against some sample data and infer a new fact to answer a question that a non-reasoning SPARQL engine could not answer. The following RDF/XML sample has a few facts about Leroy and Loretta and specifies that the spouse property is symmetric (that is, that if X is the spouse of Y, then Y is the spouse of X):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- spousedemo.rdf --&amp;gt;
&amp;lt;rdf:RDF xmlns:rdf=&amp;quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;quot;
         xmlns:owl=&amp;quot;http://www.w3.org/2002/07/owl#&amp;quot;
         xmlns=&amp;quot;http://www.snee.com/ns/abook#&amp;quot;&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;L1&amp;quot;&amp;gt;
    &amp;lt;first&amp;gt;Leroy&amp;lt;/first&amp;gt;
    &amp;lt;last&amp;gt;Lockhorn&amp;lt;/last&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;L2&amp;quot;&amp;gt;
    &amp;lt;first&amp;gt;Loretta&amp;lt;/first&amp;gt;
    &amp;lt;last&amp;gt;Lockhorn&amp;lt;/last&amp;gt;
    &amp;lt;spouse rdf:resource=&amp;quot;L1&amp;quot;/&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


  &amp;lt;owl:ObjectProperty rdf:about=&amp;quot;http://www.snee.com/ns/abook#spouse&amp;quot;&amp;gt;
    &amp;lt;rdf:type rdf:resource=&amp;quot;http://www.w3.org/2002/07/owl#SymmetricProperty&amp;quot;/&amp;gt;
  &amp;lt;/owl:ObjectProperty&amp;gt;


&amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Leroy has no &lt;code&gt;spouse&lt;/code&gt; property, and if I tell a SPARQL engine such as &lt;a href=&#34;http://jena.sourceforge.net/ARQ/&#34;&gt;ARQ&lt;/a&gt; to run the following query against the RDF above to ask who Leroy&amp;rsquo;s spouse is, it won&amp;rsquo;t have anything to tell us. Old or new versions of Pellet, though, will read this query and tell us that Leroy&amp;rsquo;s spouse is Loretta Lockhorn because that information is available to it after it uses the extra OWL metadata to infer what it can.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX a: &amp;lt;http://www.snee.com/ns/abook#&amp;gt;
SELECT ?spouseFirst ?spouseLast WHERE {


       ?s a:first  &amp;quot;Leroy&amp;quot;;
          a:last   &amp;quot;Lockhorn&amp;quot;;
          a:spouse ?spouse.


       ?spouse a:first ?spouseFirst;
               a:last  ?spouseLast.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pellet 2.0&amp;rsquo;s &lt;code&gt;extract&lt;/code&gt; subcommand reads RDF, does any inferencing it can from included OWL metadata, and then writes out RDF that includes the inferenced data. The following command line shows how I used it. (Additional command line parameters let you control just how much inferenced data Pellet adds when doing this.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pellet extract --input-format RDF/XML spousedemo.rdf  &amp;gt; temp.rdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This copies all the triples from spousedemo.rdf to temp.rdf and includes new data, such as the &lt;code&gt;j.0:spouse&lt;/code&gt; and &lt;code&gt;rdf:type&lt;/code&gt; lines in the following (the &amp;ldquo;j.0&amp;rdquo; prefix is assigned to the URL that was the default namespace in spousedemo.rdf):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;rdf:Description rdf:about=&amp;quot;http://www.snee.com/ns/ID#L1&amp;quot;&amp;gt;
    &amp;lt;j.0:last&amp;gt;Lockhorn&amp;lt;/j.0:last&amp;gt;
    &amp;lt;j.0:first&amp;gt;Leroy&amp;lt;/j.0:first&amp;gt;
    &amp;lt;j.0:spouse rdf:resource=&amp;quot;http://www.snee.com/ns/ID#L2&amp;quot;/&amp;gt;
    &amp;lt;rdf:type rdf:resource=&amp;quot;http://www.w3.org/2002/07/owl#Thing&amp;quot;/&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I ask ARQ to run the query shown earlier on temp.rdf, it can tell me the name of Leroy&amp;rsquo;s spouse, because Pellet&amp;rsquo;s &lt;code&gt;extract&lt;/code&gt; subcommand has made temp.rdf a richer data file than spousedemo.rdf.&lt;/p&gt;
&lt;p&gt;Declaring the spouse property to be symmetric is just a small bit of metadata added to the data shown in the file. OWL can add all kinds of metadata, and Pellet now makes it even easier to take advantage of that metadata.&lt;/p&gt;
&lt;p&gt;For me, this small bit of metadata also proves something important about the value of semantic technology: while it would be silly to try to encode all the semantics of the word &amp;ldquo;spouse&amp;rdquo; in a machine-readable form, encoding just this small bit of the word&amp;rsquo;s semantics—that it&amp;rsquo;s a symmetric property—can add value to data and let you answer questions that you couldn&amp;rsquo;t answer before.&lt;/p&gt;
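The particular inference at work here is simple enough to sketch in a few lines. The following toy Python version (my own illustration of the idea, not how Pellet actually works) adds the missing inverse triple for any property declared symmetric, which is exactly what lets a query find Leroy's spouse:

```python
# Toy symmetric-property inference: for every (s, p, o) triple whose
# predicate is declared symmetric, make sure (o, p, s) is present too.
# Illustration only; Pellet's reasoner covers far more of OWL than this.
def symmetric_closure(triples, symmetric_props):
    inferred = set(triples)
    for s, p, o in triples:
        if p in symmetric_props:
            inferred.add((o, p, s))
    return inferred

data = {
    ("L1", "first", "Leroy"), ("L1", "last", "Lockhorn"),
    ("L2", "first", "Loretta"), ("L2", "last", "Lockhorn"),
    ("L2", "spouse", "L1"),
}
enriched = symmetric_closure(data, {"spouse"})
# ("L1", "spouse", "L2") is now explicit, so even a non-reasoning
# query engine can answer "who is Leroy's spouse?"
```

Writing `enriched` back out as RDF is the same value-adding move as Pellet's extract subcommand: the implicit facts become explicit data that any SPARQL engine can use.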
&lt;p&gt;&lt;a href=&#34;http://www.cafepress.com/lockhorn&#34;&gt;&lt;img id=&#34;id197211&#34; class=&#34;centerImage&#34; src=&#34;https://www.bobdc.com/img/main/lockhorns20051030.gif&#34; border=&#34;0&#34; style=&#34;display: block;margin-left: auto;margin-right: auto&#34; alt=&#34;Lockhorns semantic cartoon&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/&#34; title=&#34;http://clarkparsia.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-2198&#34;&gt;December 30, 2008 10:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice post, Bob! This sort of use of OWL and RDF is just the kind of insanely boring but incredibly useful thing that too often gets overlooked. ;&amp;gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Picking XML schemas and tools?</title>
      <link>https://www.bobdc.com/blog/picking-xml-schemas-and-tools/</link>
      <pubDate>Tue, 16 Dec 2008 03:46:40 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/picking-xml-schemas-and-tools/</guid>
      
      
      <description><div>Then first think about your content and users.</div><div>&lt;p&gt;At last week&amp;rsquo;s &lt;a href=&#34;http://www.idealliance.org/xml2008/&#34;&gt;XML in Practice 2008&lt;/a&gt; conference, I joined &lt;a href=&#34;http://dubinko.info/blog/2008/12/08/xml-2008-non-liveblog-content-authoring-schemas/&#34;&gt;Micah Dubinko&lt;/a&gt;, Evan Lenz, and Frank Miller for the panel on working with authoring tools and schemas. (Lisa Bos of Really Strategies did a fine job hosting the panel; she should consider doing one of those &lt;a href=&#34;https://www.bobdc.com/blog/compress-those-podcasts&#34;&gt;interview podcast&lt;/a&gt; shows.) The panel&amp;rsquo;s full title mentioned both DITA and DocBook, and while Mark Shellenberger &lt;a href=&#34;http://twitter.com/mshellenberger/statuses/1045813198&#34;&gt;predicted&lt;/a&gt; a &amp;ldquo;cage match,&amp;rdquo; several people later seemed disappointed that there weren&amp;rsquo;t more DITA/DocBook partisan sparks flying. I prefer not to take identity politics to the point of identifying myself with only one technical content schema, and I think that Micah, Evan, and Frank felt the same way. (I&amp;rsquo;d love to be the moderator of a Norm Walsh/Eliot Kimber discussion on DocBook/DITA issues, though.)&lt;/p&gt;
&lt;blockquote id=&#34;id197069&#34; class=&#34;pullquote&#34;&gt;A schema is metadata whose job is to add value to data.&lt;/blockquote&gt;
&lt;p&gt;When it was my turn to introduce myself and my background, I wanted to draw a connection from the panel topic to the services of my employer, &lt;a href=&#34;http://www.innodata-isogen.com/&#34;&gt;Innodata Isogen&lt;/a&gt;, so I mentioned that we had a lot of experience helping publishers find a good fit between their content, schemas (sometimes DocBook, and sometimes DITA!), tools, and users. This was off the top of my head, but I thought about it more as Evan introduced himself and jotted in my notebook: &amp;ldquo;content/schemas/tools/users&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;People often want to know what the best schema is, or the best editing tool. I got to thinking about how the best way to determine those two parts of the content/schemas/tools/users lineup is to take a good hard look at the other two.&lt;/p&gt;
&lt;p&gt;Content analysis is underrated. People discuss the virtues of one schema or another as if the schema by itself will do something for them, but a schema is metadata whose job is to add value to data. If you&amp;rsquo;re wondering how well each of three schemas fits your content, then type and paste some of your content into documents that conform to those schemas and see for yourself. Once, while helping the &lt;a href=&#34;http://www.prismstandard.org/&#34;&gt;PRISM&lt;/a&gt; standard group think through a content DTD to go with their metadata spec, I typed up &lt;a href=&#34;http://www.ew.com&#34;&gt;Entertainment Weekly&lt;/a&gt; interviews with Will Smith and Tommy Lee Jones the week that &amp;ldquo;Men in Black II&amp;rdquo; came out (Entertainment Weekly is a Time Inc. publication; so is Mad Magazine, as I found out during a PRISM meeting in their building) in DocBook and one or two other DTDs that I can&amp;rsquo;t remember right now. Doing this makes it much clearer whether the schema has the data and metadata elements and attributes you need and if its required structures fit your structure.&lt;/p&gt;
&lt;p&gt;To frame any thoughts about users of the authoring tools and schemas, consider the two extremes: on the one hand, especially if you&amp;rsquo;re in aerospace or some other heavy industry, you might have a staff of users who use powerful, higher-priced editing tools because that&amp;rsquo;s the job specialty you need from them. If it&amp;rsquo;s not their job specialty, and you need it to be, you arrange training. At the other extreme, you might be a legal publisher whose authors include your country&amp;rsquo;s leading expert on bankruptcy, and you&amp;rsquo;re happy enough to publish this author&amp;rsquo;s treatise on bankruptcy law that if he or she turns it in on floppy disks with WordPerfect 4.2 files, you&amp;rsquo;ll do whatever it takes to convert that content into the XML that you use to drive your publishing system. If you gave this author an $800 XML authoring tool and a week of training, you&amp;rsquo;d probably annoy this valued author more than anything else.&lt;/p&gt;
&lt;p&gt;Most content creators in XML publishing scenarios fall between these two extremes. There are a lot of them who are comfortable with Word but who can be convinced to use something similarly WYSIWYGgy that imposes the structure you need, but you might not have $800 plus training costs to spend on them. Don&amp;rsquo;t lose heart; there are alternatives.&lt;/p&gt;
&lt;p&gt;Once you have a better idea of what your content needs and your users and budget can handle, it&amp;rsquo;s easier to think about the best schema and tools for your system. You can remove the tools question from consideration if you contract with a business partner to create the XML for you; you specify the schema you want (or work with them to determine the best one) and the quality levels you want, and then they do it for you. One of Innodata Isogen&amp;rsquo;s newer services that we&amp;rsquo;ve had increasing success with is in content origination and authoring. For people wondering about how to work with another firm to have them take on these tasks, innodata-isogen.com&amp;rsquo;s &amp;ldquo;Knowledge Center&amp;rdquo; has a new section titled &lt;a href=&#34;http://www.innodata-isogen.com/knowledge_center/content_origination&#34;&gt;Outsourcing Content Origination and Authoring: Closing the Publishing Loop&lt;/a&gt;, which includes a white paper covering the issues, upcoming webinars to listen to people with long experience with this, and more.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.manorfieldconsulting.com&#34; title=&#34;http://www.manorfieldconsulting.com&#34;&gt;Mark Shellenberger&lt;/a&gt; on &lt;a href=&#34;#comment-2196&#34;&gt;December 16, 2008 11:43 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You must admit the panel had the potential for sparks.&lt;/p&gt;
&lt;p&gt;It is a testament to the professionalism and skill of the panelists that it didn&amp;rsquo;t devolve.&lt;/p&gt;
&lt;p&gt;I particularly agree with your statement &amp;ldquo;content analysis is underrated&amp;rdquo;. Too often people say &amp;ldquo;I want to use X schema/DTD&amp;rdquo; without having looked at their data to see if that makes any sense. And it is very difficult to convince them otherwise, even after doing some of that analysis.&lt;/p&gt;
&lt;p&gt;Thanks for fleshing out some of the things you said during the session.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Looking forward to XML 2008</title>
      <link>https://www.bobdc.com/blog/looking-forward-to-xml-2008/</link>
      <pubDate>Fri, 05 Dec 2008 12:44:25 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/looking-forward-to-xml-2008/</guid>
      
      
      <description><div>And seeing some friends and learning about new developments.</div><div>&lt;p&gt;The first time I went to the annual conference that will be called &lt;a href=&#34;http://www.idealliance.org/xml2008/&#34;&gt;XML-in-Practice 2008&lt;/a&gt; this year (but which I think of as &amp;ldquo;XML 2008&amp;rdquo;), it was called SGML &amp;lsquo;95. It grew from there and morphed into an XML conference, and when the dot com boom supported several XML conferences a year, this was the best and biggest. It&amp;rsquo;s slimmed down over the years, and I hate to admit that I might not go if it was going to be a &lt;a href=&#34;https://www.bobdc.com/blog/metadata-and-metadata&#34;&gt;conference full of strangers&lt;/a&gt;, but I know I&amp;rsquo;ll see some old friends, and the chance to bounce ideas off other XML geeks in person is still very appealing, especially when it&amp;rsquo;s a two-hour drive away from home.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule.asp&#34;&gt;presentation grid&lt;/a&gt; has interesting looking things from both friends and strangers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I&amp;rsquo;ve been more interested in taxonomies lately, and a presentation like Guthrie Collins&amp;rsquo;s &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#media1&#34;&gt;Using XML at The Associated Press for Taxonomies and Revenue Generation&lt;/a&gt; will cover content relevant to everyone—newspaper articles—from a leader in the field, and hearing how they combine taxonomies with repurposing to generate new revenue should be very interesting.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I haven&amp;rsquo;t had a chance to check out the &lt;a href=&#34;http://code.google.com/p/ubiquity-xforms/&#34;&gt;ubiquity-xforms&lt;/a&gt; project hosted at Google Code, so I look forward to seeing Mark Birbeck demo and describe it in his &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#at2&#34;&gt;Declarative Ajax programming with Ubiquity Xforms&lt;/a&gt; presentation. An AJAX-friendly open source implementation could be just what Xforms needs to give it greater traction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The description for Lisa Bos and Chandi Perera&amp;rsquo;s &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#media4&#34;&gt;Driving XML workflows through Creative Suite&lt;/a&gt; makes it look like they&amp;rsquo;ve found one of those holy grails that people often seek in publishing: round-tripping between Word, XML, and a CMS with serious XML support.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At 3 on Monday, I&amp;rsquo;ll probably stray away from the publishing track to hear about &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#gov4&#34;&gt;Authoring and Publishing Legislative Documents in XML&lt;/a&gt; at the US House of Representatives and the Senate. (In the same time slot: &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#at4&#34;&gt;Accelerating DITA with OmniMark&lt;/a&gt;. I didn&amp;rsquo;t know that OmniMark was still around.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I know the basics of Schematron pretty well, but it will be fun to hear &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#ft10&#34;&gt;Wendell Piez&lt;/a&gt; explain it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other topics that look interesting: Tony Coates&amp;rsquo; &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#gov6&#34;&gt;UBL panel&lt;/a&gt;, Priscilla Walmsley on &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#ft6&#34;&gt;new features of XSLT 2.0&lt;/a&gt;, &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#at6&#34;&gt;Microsoft&amp;rsquo;s new schema editor&lt;/a&gt;, the use of the open source XQuery database eXist &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#ft7&#34;&gt;in US intelligence agencies&lt;/a&gt;, and the &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#gov8&#34;&gt;semantic web panel&lt;/a&gt; with Mark Birbeck, Ron Reck, and Ken Sall. The grid schedule has a confusing description of two (combined?) panels on &amp;ldquo;Working with authoring schemas&amp;rdquo; that are probably really one big panel; with Norm Walsh listed as a panelist, I&amp;rsquo;ll have to check that out.&lt;/p&gt;
&lt;p&gt;Ken Holman is the chair of the track where I&amp;rsquo;ll speak on &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#ft9&#34;&gt;Automating Content Analysis with Trang and Simple XSLT Scripts&lt;/a&gt; Tuesday afternoon, so I&amp;rsquo;ll have to be careful what I say about XSLT. (It will be interesting to see how he can provide an &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#ft1&#34;&gt;Intro to XML, XSLT, and XSLFO&lt;/a&gt; in 60 minutes&amp;hellip;)&lt;/p&gt;
&lt;p&gt;I can&amp;rsquo;t say that I&amp;rsquo;m that pumped up to see the former CEO of Muzak give the &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#keynote&#34;&gt;main keynote&lt;/a&gt;, and who&amp;rsquo;s going to get up early enough on Tuesday morning to see a 7:30 AM &amp;ldquo;Premier Sponsor Presentation #1&amp;rdquo; that hasn&amp;rsquo;t even been booked yet? I won&amp;rsquo;t, but I look forward to learning a lot Monday and Tuesday and maybe even Wednesday.&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By Eamonn on &lt;a href=&#34;#comment-2188&#34;&gt;December 5, 2008 1:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My first SGML conference was in 1995 (SGML Europe in the beautiful Austrian town of Gmunden) where I met Charles Goldfarb - what a great way to start. Gmunden is one of those places that you put on the &amp;lsquo;must revisit sometime to see if it&amp;rsquo;s still as wonderful as I remember it&amp;rsquo; list. Sadly, the European XML conference has morphed into something more general. But &lt;a href=&#34;http://www.xmlprague.cz/&#34;&gt;XML Prague&lt;/a&gt; is looking promising if you fancy a trip in March 2009.&lt;/p&gt;
&lt;p&gt;Have fun next week!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://norman.walsh.name/&#34; title=&#34;http://norman.walsh.name/&#34;&gt;Norman Walsh&lt;/a&gt; on &lt;a href=&#34;#comment-2189&#34;&gt;December 5, 2008 1:30 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Alas, Bob, I&amp;rsquo;ve had to give my regrets for XML 2008 (for personal reasons, no worries). I gave them back in early November, and reminded them about the incorrect schedule again a week or so ago. Apparently it takes longer than that to update a web page. Who knew?&lt;/p&gt;
&lt;p&gt;The panel was supposed to be a &amp;ldquo;DocBook vs. DITA&amp;rdquo; sort of a thing and I was looking forward to it. I&amp;rsquo;ve been, perhaps, way too polite about the subject for perhaps way too long :-)&lt;/p&gt;
&lt;p&gt;See you at Balisage?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-2190&#34;&gt;December 5, 2008 2:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The biggest, I grant; better than EML/Balisage, I deny. But that&amp;rsquo;s why we have horse races.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://xmlportfolio.com&#34; title=&#34;http://xmlportfolio.com&#34;&gt;Evan Lenz&lt;/a&gt; on &lt;a href=&#34;#comment-2191&#34;&gt;December 5, 2008 5:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, that&amp;rsquo;s great to hear you&amp;rsquo;re coming. This will be my first XML conference in five years! XML 2003 in Philadelphia was the last one. Were you there? I at least remember chatting with you at Disney World in 2001. :-)&lt;/p&gt;
&lt;p&gt;Norm, I was sorry to hear you couldn&amp;rsquo;t come. I was asked to join the panel based on my experience with WordML, but it will definitely feel like there&amp;rsquo;s a void without you on the panel. I&amp;rsquo;d love to make it to Balisage next year. I feel kind of ashamed of admitting I&amp;rsquo;ve never made it to Extreme before either.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.linkedin.com/in/sarahbourne&#34; title=&#34;http://www.linkedin.com/in/sarahbourne&#34;&gt;Sarah Bourne&lt;/a&gt; on &lt;a href=&#34;#comment-2194&#34;&gt;December 11, 2008 4:24 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Just read your post today - too late for anything other than regrets for not attending myself. Hope you get a chance to post highlights (hint! hint!)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2195&#34;&gt;December 11, 2008 4:44 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m too swamped with work to work up a posting for a few days, but managed to achieve a lazy, Web 2.0-oriented equivalent: when I suggested (&lt;a href=&#34;http://twitter.com/bobdc/status/1042218934&#34;&gt;http://twitter.com/bobdc/status/1042218934&lt;/a&gt;) on Twitter that people use the hashtag #xml2008, enough people picked up on it to provide a nice narrative: &lt;a href=&#34;http://search.twitter.com/search?q=%23xml2008&#34;&gt;http://search.twitter.com/search?q=%23xml2008&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>SPARQL and live relational data</title>
      <link>https://www.bobdc.com/blog/sparql-and-live-relational-dat/</link>
      <pubDate>Mon, 01 Dec 2008 17:19:54 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-and-live-relational-dat/</guid>
      
      
      <description><div>A little demo.</div><div>&lt;img id=&#34;id197020&#34; src=&#34;https://www.bobdc.com/img/main/chiracsarkozy.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Chirac and Sarkozy&#34;/&gt;
&lt;p&gt;In the &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.html&#34;&gt;first project&lt;/a&gt; I did with SPARQL, D2RQ, and MySQL I used D2RQ to pull all the relational data into a disk file and then queried that after adding some OWL-based metadata. D2RQ does let you execute SPARQL queries against a live relational database, instead of dumping data to a file and querying that, so I wanted to see the effects for myself. This would work better as a live demo, but you could think of it as a script for one.&lt;/p&gt;
&lt;p&gt;First, because MySQL is a multi-user database, imagine that several users are simultaneously using the same copy of the &amp;ldquo;world&amp;rdquo; database that I described in an &lt;a href=&#34;https://www.bobdc.com/blog/sparql-and-relational-database&#34;&gt;earlier entry&lt;/a&gt;. This will make my fake demo look more dramatic. (For additional drama, imagine bullets whizzing by my head as I type the various queries and commands.) I&amp;rsquo;ll start with a SPARQL query asking about the head of state for France:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?headOfState WHERE { 
?s vocab:country_Name &amp;quot;France&amp;quot;;
   vocab:country_HeadOfState ?headOfState.
}
&lt;/code&gt;&lt;/pre&gt;
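&lt;p&gt;(One note if you want to try this yourself: to be a complete, runnable query, it needs a PREFIX declaration for the &lt;code&gt;vocab:&lt;/code&gt; namespace that D2RQ generates from the table and column names. The URI below is just a placeholder; use whatever your D2RQ mapping actually declares:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX vocab: &amp;lt;http://localhost:2020/vocab/resource/&amp;gt;

SELECT ?headOfState WHERE { 
?s vocab:country_Name &amp;quot;France&amp;quot;;
   vocab:country_HeadOfState ?headOfState.
}
&lt;/code&gt;&lt;/pre&gt;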
&lt;p&gt;With the version of the world database currently available from MySQL, that query returns &amp;ldquo;Jacques Chirac&amp;rdquo;. In fact, the database lists him as the head of state for several countries; this query&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?name WHERE { 
?s vocab:country_Name ?name;
   vocab:country_HeadOfState &amp;quot;Jacques Chirac&amp;quot;.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;returns this list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;Guadeloupe&amp;quot;
&amp;quot;Martinique&amp;quot;
&amp;quot;Mayotte&amp;quot;
&amp;quot;France&amp;quot;
&amp;quot;French Guiana&amp;quot;
&amp;quot;French Polynesia&amp;quot;
&amp;quot;Réunion&amp;quot;
&amp;quot;Saint Pierre and Miquelon&amp;quot;
&amp;quot;New Caledonia&amp;quot;
&amp;quot;Wallis and Futuna&amp;quot;
&amp;quot;French Southern territories&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now imagine that someone else using the same database updates it with the following statement at the MySQL command line:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mysql&amp;gt; UPDATE country
    -&amp;gt; SET HeadOfState=&amp;quot;Nicolas Sarkozy&amp;quot;
    -&amp;gt; WHERE HeadOfState=&amp;quot;Jacques Chirac&amp;quot;;
Query OK, 11 rows affected (0.08 sec)
Rows matched: 11  Changed: 11  Warnings: 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I look forward to making a similar update for the United States entry in January.) When I rerun my original SPARQL query about the vocab:country_HeadOfState value for the subject that has a country name of &amp;ldquo;France&amp;rdquo;, I get the updated answer: &amp;ldquo;Nicolas Sarkozy&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;When an interface such as D2RQ provides access to a relational database, SPARQL makes an excellent tool for looking at the data. Of course, if you can access that database using an SQL command line, you have even more options, but how many publicly accessible relational databases let you issue SQL commands against them? More and more offer SPARQL access, so SPARQL will be an increasingly valuable tool for getting at increasing amounts of data. (Not that SPARQL&amp;rsquo;s future is limited to read-only access—an &lt;a href=&#34;http://jena.hpl.hp.com/~afs/SPARQL-Update.html&#34;&gt;UPDATE&lt;/a&gt; language for SPARQL is in the works.)&lt;/p&gt;
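&lt;p&gt;To give a rough idea (this is only a sketch; the syntax of the final update language may well differ), the Chirac-to-Sarkozy change might look something like this expressed as a SPARQL update:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DELETE { ?s vocab:country_HeadOfState &amp;quot;Jacques Chirac&amp;quot; }
INSERT { ?s vocab:country_HeadOfState &amp;quot;Nicolas Sarkozy&amp;quot; }
WHERE  { ?s vocab:country_HeadOfState &amp;quot;Jacques Chirac&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;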
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By Prateek on &lt;a href=&#34;#comment-2185&#34;&gt;December 3, 2008 4:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Have a question about the statement &amp;ldquo;but how many publicly accessible relational databases let you issue SQL commands against them? More and more offer SPARQL access, so SPARQL will be an increasingly valuable tool for getting at increasing amounts of data&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;When I search for information on a Website (not search engines), let&amp;rsquo;s say Geonames, and I look for &amp;ldquo;New York&amp;rdquo;, isn&amp;rsquo;t the search a query against a database? Plenty of websites, I think, provide querying against publicly accessible relational databases. The complexity of learning and writing SQL is hidden from the end user.&lt;/p&gt;
&lt;p&gt;In the case of Geonames, it&amp;rsquo;s a MySQL-based store. It makes it easy for a naive user to search for information in Geonames because there is no necessity to learn the query language or SQL.&lt;/p&gt;
&lt;p&gt;My questions:&lt;/p&gt;
&lt;p&gt;(1) Isn&amp;rsquo;t the pain of learning and writing SPARQL one of the biggest hindrances to its, as you have it, &amp;ldquo;becoming an increasingly valuable tool for getting at increasing amounts of data&amp;rdquo;?&lt;/p&gt;
&lt;p&gt;(2) Or, because of this, will it continue to remain a tool in the hands of the SW community?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2186&#34;&gt;December 3, 2008 4:50 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As you quoted, I did say &amp;ldquo;let you issue SQL commands against them,&amp;rdquo; not &amp;ldquo;query the databases,&amp;rdquo; so I wouldn&amp;rsquo;t count form-driven queries against MySQL backends as relational queries of public data. You don&amp;rsquo;t have the flexibility to make up your own queries. As a matter of fact, I&amp;rsquo;m sure we&amp;rsquo;ll see more forms triggering SPARQL queries on the back end over time, so comparing SPARQL queries to form-driven queries of relational databases is not an apples-to-apples comparison.&lt;/p&gt;
&lt;p&gt;We could call SPARQL a tool in the hands of the SW community, but we could also call SQL a tool in the hands of the relational community. The biggest difference to me is that if you know SQL well, your options for writing an app that combines data from multiple public SQL databases are very limited. You query your personal data or your employer&amp;rsquo;s. The increasing amount of SPARQL-accessible data is what&amp;rsquo;s opening up the possibilities.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>A video from a still camera</title>
      <link>https://www.bobdc.com/blog/a-video-from-a-still-camera/</link>
      <pubDate>Mon, 24 Nov 2008 21:03:50 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-video-from-a-still-camera/</guid>
      
      
      <description><div>With various strange noises and images.</div><div>&lt;p&gt;I&amp;rsquo;m on my second Canon Powershot right now, having gotten my first one in early 2003. These can take brief movies, but the Powershot is not really a movie camera, so I only took three- or four-second movies of things that I could loop. In November of 2004, once I had a collection of these clips and access to a Mac with iMovie installed, I took a few loopable shots of my daughter Alice playing the drums so I could tie the whole thing together with a regular beat and edited it into a &lt;a href=&#34;http://www.youtube.com/watch?v=1HM3mOmOv4c&#34;&gt;trippy little movie&lt;/a&gt;.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/1HM3mOmOv4c&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;The first shot has some kids spinning light things in the dark and my daughter Madeline jokingly complaining about Alice&amp;rsquo;s spinning thing hitting her, which gave the movie its title. In the four years since then, Alice has gotten a Ludwig kit with some entry-level Zildjian cymbals (the one in the movie sounds awful) and she&amp;rsquo;s also grown a few inches and improved on the drums quite a bit.&lt;/p&gt;
&lt;p&gt;The strange capsule bouncing way up in the air at 1:11 has Dan Brickley in it, if I remember correctly. This was at a carnival on the Dam Square in Amsterdam during the XML Europe 2004 conference. (If it&amp;rsquo;s not him, it has at least one of the members of the &lt;a href=&#34;http://www.snee.com/panoramic/20040420.html&#34;&gt;RDF All Stars&lt;/a&gt; panoramic picture that I took earlier that day while demonstrating the camera&amp;rsquo;s panoramic capability to Uche Ogbuji, who had the exact same camera but didn&amp;rsquo;t know about this feature. Dave Beckett, with his back to Leigh Dodds, lost out in the stitching of the images, which gives him the appearance of a disappearing ghost.) In addition to that carnival, a local Virginia county fair and a Jersey shore boardwalk arcade provided the more American-looking carnival images. Oxford residents may recognize the playground on Abingdon Road.&lt;/p&gt;
&lt;p&gt;I recently got a &lt;a href=&#34;http://www.theflip.com&#34;&gt;Flip Video&lt;/a&gt; camera (whose logo owes much to the one from the &lt;a href=&#34;http://www.youtube.com/results?search_query=%22flip+wilson+show%22&amp;amp;search_type=&amp;amp;aq=-1&amp;amp;oq=%22flip+wilson+show%22&#34;&gt;Flip Wilson show&lt;/a&gt; in the early seventies). I&amp;rsquo;m having a lot of fun with it, although what I&amp;rsquo;ve done so far are outright home movies, which I won&amp;rsquo;t be putting here. I may bring it to &lt;a href=&#34;http://www.idealliance.org/xml2008/&#34;&gt;XML 2008&lt;/a&gt;, though&amp;hellip;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href=&#34;http://www.youtube.com/results?search_query=%22flip+wilson+show%22&amp;amp;search_type=&amp;amp;aq=-1&amp;amp;oq=%22flip+wilson+show%22&#34;&gt;&lt;img id=&#34;id203750&#34; src=&#34;http://www.irememberjfk.com/mt/graphics/flip.jpg&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;logo of flip wilson show&#34;/&gt;&lt;/a&gt;&lt;br /&gt;
&lt;img id=&#34;id203692&#34; src=&#34;http://sharing.theflip.com/images/flip_logo_email.gif&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Flip Video logo&#34;/&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://fracthis.blogspot.com/2008/12/7-years-in-c.html&#34; title=&#34;http://fracthis.blogspot.com/2008/12/7-years-in-c.html&#34;&gt;zarina&lt;/a&gt; on &lt;a href=&#34;#comment-2192&#34;&gt;December 8, 2008 8:57 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is in regard to your blog entry about semantic data entry.&lt;br /&gt;
You were looking for ways in which people record varied data on a daily basis.&lt;br /&gt;
Please do check out Tinderbox at &lt;a href=&#34;http://www.eastgate.com/Tinderbox/&#34;&gt;http://www.eastgate.com/Tinderbox/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2193&#34;&gt;December 8, 2008 11:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Wow, I didn&amp;rsquo;t realize that Eastgate was still around.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Linking information to &#34;missing&#34; information in SPARQL</title>
      <link>https://www.bobdc.com/blog/linking-information-to-missing/</link>
      <pubDate>Tue, 18 Nov 2008 21:07:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/linking-information-to-missing/</guid>
      
      
      <description><div>Or, as the SQL people call it, doing outer joins.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/sidereal/33547506/&#34;&gt;&lt;img id=&#34;id203579&#34; src=&#34;http://farm1.static.flickr.com/22/33547506_1474e0e528.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;220px&#34; alt=&#34;flickr picture titled &#39;outer join&#39;&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Last month, I described the SPARQL approach to some basic relational database queries in &lt;a href=&#34;https://www.bobdc.com/blog/sparql-and-relational-database&#34;&gt;SPARQL and relational databases: getting started&lt;/a&gt;. Today I want to talk about SPARQL&amp;rsquo;s equivalent of the SQL outer join, a bit of syntax that lets you add the phrase &amp;ldquo;and these corresponding fields if they&amp;rsquo;re there&amp;rdquo; to a query. I&amp;rsquo;ll use the same &amp;ldquo;world&amp;rdquo; database that I used in that posting&amp;rsquo;s examples.&lt;/p&gt;
&lt;p&gt;First, to review a simple join, the following SQL query asks for the names of the countries as well as the names of the cities that are the capitals of those countries:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;      SELECT country.name, city.name 
      FROM country, city
      WHERE country.capital = city.id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query returns 232 rows, beginning with these five:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; Afghanistan                            Kabul
 Netherlands                            Amsterdam
 Netherlands Antilles                   Willemstad
 Albania                                Tirana
 Algeria                                Alger
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we ask the world database to just list country names, like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT name FROM country;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;we get 239 rows, because the result includes &amp;ldquo;countries&amp;rdquo; that have no capital, such as Antarctica and Bouvet Island. So how would you tell an SQL system &amp;ldquo;Show me all the countries, and if they have them, their capitals&amp;rdquo;? Using a technique called an &lt;a href=&#34;http://en.wikipedia.org/wiki/Outer_join#Outer_joins&#34;&gt;outer join&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;      SELECT country.name, city.name
      FROM country LEFT OUTER JOIN city ON
      country.capital = city.id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A left outer join asks for all the rows of the table on the &amp;ldquo;left&amp;rdquo; (that is, the first one mentioned in the query) and any information from the other table that matches. (A right outer join asks for all the rows of the second table and any matching information from the first one, and a full outer join asks for all the rows of both tables.) The result of our sample left outer join query includes all the rows from our first query above plus seven extra rows for the &amp;ldquo;countries&amp;rdquo; that have no capital. These have NULL listed where the capital name would go, as this excerpt shows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; Virgin Islands, U.S.                          Charlotte Amalie
 Zimbabwe                                      Harare
 Palestine                                     Gaza
 Antarctica                                    NULL
 Bouvet Island                                 NULL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How can we do this in SPARQL? I found it to be easier than the SQL version, which took me several tries to run with no syntax mistakes.&lt;/p&gt;
&lt;p&gt;First, let&amp;rsquo;s review the SPARQL version of the first SQL query above, which asks for the countries in the database that have capitals, listed with those capitals:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;      SELECT ?countryName ?cityName
      WHERE {
        ?s1 vocab:country_Name ?countryName;
            vocab:country_Capital ?capital.
        ?s2 vocab:city_Name ?cityName;
            vocab:city_ID ?capital.
      }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Like the SQL version, it pulls out 232 rows. The SPARQL version of &amp;ldquo;list all the countries and, if they&amp;rsquo;re there, the capitals&amp;rdquo; looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;      SELECT ?countryName ?cityName
      WHERE {
        ?s1 vocab:country_Name ?countryName.


      OPTIONAL {
        ?s1 vocab:country_Capital ?capital.
        ?s2 vocab:city_Name ?cityName;
            vocab:city_ID ?capital.
        }
      }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The corresponding excerpt looks like this, with hyphens showing where it found no values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;Virgin Islands, U.S.&amp;quot;  &amp;quot;Charlotte Amalie&amp;quot;
&amp;quot;Zimbabwe&amp;quot;              &amp;quot;Harare&amp;quot;
&amp;quot;Palestine&amp;quot;             &amp;quot;Gaza&amp;quot;
&amp;quot;Antarctica&amp;quot;            -
&amp;quot;Bouvet Island&amp;quot;         -
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I think that the word OPTIONAL followed by a pair of curly braces is a much more intuitive way to say &amp;ldquo;and the following, if they&amp;rsquo;re there&amp;rdquo; than &amp;ldquo;LEFT OUTER JOIN ON&amp;rdquo;, but I&amp;rsquo;m increasingly biased.&lt;/p&gt;
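&lt;p&gt;One last trick while we&amp;rsquo;re at it: to list only the &amp;ldquo;countries&amp;rdquo; that have no capital (the rows that showed NULL in the SQL version), you can keep the same OPTIONAL block and then filter on a variable that it failed to bind:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;      SELECT ?countryName
      WHERE {
        ?s1 vocab:country_Name ?countryName.
        OPTIONAL {
          ?s1 vocab:country_Capital ?capital.
          ?s2 vocab:city_Name ?cityName;
              vocab:city_ID ?capital.
        }
        FILTER (!bound(?cityName))
      }
&lt;/code&gt;&lt;/pre&gt;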
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2178&#34;&gt;November 18, 2008 10:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I agree that &amp;ldquo;OPTIONAL&amp;rdquo; is better than &amp;ldquo;LEFT OUTER JOIN ON&amp;rdquo;, but this is a pretty low standard, and I don&amp;rsquo;t think you can reasonably claim that your SPARQL query as a whole is any more &amp;ldquo;intuitive&amp;rdquo; than the SQL version. It&amp;rsquo;s strictly more complicated both syntactically and semantically, and I bet a reasonably adept SPARQL programmer (imagining that more than 10 of these exist) would take more trial and error to get it right than an equivalent SQL programmer would need to get the SQL version working.&lt;/p&gt;
&lt;p&gt;Also, one of the victories SPARQL frequently claims over SQL is the elimination of joins. This is a misleading claim to begin with, since the reason you don&amp;rsquo;t need most ordinary joins in SPARQL is that they&amp;rsquo;ve been moved into the data model. But even so, the fact that you had to change your SPARQL query to get the effect of a different SQL join shows that SPARQL hasn&amp;rsquo;t actually eliminated joins.&lt;/p&gt;
&lt;p&gt;Whereas (I say in the spirit of language-comparison), in Thread this query would be:&lt;/p&gt;
&lt;p&gt;Country|Capital&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s all. &amp;ldquo;Country&amp;rdquo; means &amp;ldquo;get all nodes of type Country&amp;rdquo;, &amp;ldquo;|Capital&amp;rdquo; means &amp;ldquo;and for each of those nodes, calculate and return the results of following its Capital arc&amp;rdquo;. It makes no difference whether a country has 1 capital, no capital, 25 capitals, etc. (Your SPARQL version will produce N *different* pairs for countries with N capitals, right? Ugly.)&lt;/p&gt;
&lt;p&gt;And if you want to see it get even uglier, do the SPARQL query for finding those 7 countries without capitals.&lt;/p&gt;
&lt;p&gt;Thread&amp;rsquo;s version:&lt;/p&gt;
&lt;p&gt;Country:!(.Capital)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2179&#34;&gt;November 19, 2008 8:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;With the word &amp;ldquo;Thread&amp;rdquo; referring to a connected series of postings on virtually any topic so often, it&amp;rsquo;s very difficult to find out more about this query language with a web search, so it&amp;rsquo;s an unfortunate choice of name. Can you point me to implementations and data to query with those implementations so that I can try it?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2180&#34;&gt;November 19, 2008 9:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The bigger problem with searching for info on Thread, in this case, is that it&amp;rsquo;s part of a project still under development! So you can&amp;rsquo;t try it out yet, and the spec hasn&amp;rsquo;t been published. There&amp;rsquo;s a blog post about it at&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.furia.com/page.cgi?type=log#id311&#34;&gt;http://www.furia.com/page.cgi?type=log#id311&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;based on a talk I did at the Web 3.0 conference last month, and I gave some other examples in comments on your movie-query SPARQL post. It should be available for first public experimentation in Q1 some time. Possibly this makes it too annoying for me to be talking about it in advance at all, but I&amp;rsquo;m impatient, and I think most query languages get discussed before they&amp;rsquo;re built, so hopefully that moves it to within the bounds of acceptability.&lt;/p&gt;
&lt;p&gt;But if you&amp;rsquo;d rather not have me distracting your SQL/SPARQL contrast with a phantom alternative, just say so and I&amp;rsquo;ll leave you alone!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2181&#34;&gt;November 19, 2008 10:08 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;To be honest, while language elegance is certainly a great thing, I&amp;rsquo;m not going to defend SPARQL on elegance points. Its main appeal to me is the increasing amount of data and implementations available, so while I&amp;rsquo;ve obviously been discussing its syntax, that&amp;rsquo;s really just a means to an end: helping people take advantage of all that data.&lt;/p&gt;
&lt;p&gt;I will keep an eye on &lt;a href=&#34;http://www.furia.com&#34;&gt;http://www.furia.com&lt;/a&gt;. (In fact, to gather data to query against with your Thread implementation, you may very well end up using SPARQL!)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.timothyhorrigan.com&#34; title=&#34;http://www.timothyhorrigan.com&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-2182&#34;&gt;November 19, 2008 6:37 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This issue gets complicated if someone fills the capital field with something when the country or region doesn&amp;rsquo;t really have a capital. There are no good universal standards for how to indicate a null and/or missing and/or unknown value.&lt;/p&gt;
&lt;p&gt;Also some countries have multiple capitals. You see a couple of examples just in the sample data above. Palestine&amp;rsquo;s capital is Gaza if you use the location of the Palestine government&amp;rsquo;s headquarters as the capital. But the Palestinians think that Jerusalem is their capital, and the offices in Gaza are just a temporary expedient. A less exotic example is the Netherlands: Amsterdam is its capital for most purposes, but some parts of the government are headquartered in The Hague.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Compress those podcasts!</title>
      <link>https://www.bobdc.com/blog/compress-those-podcasts/</link>
      <pubDate>Wed, 12 Nov 2008 08:53:46 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/compress-those-podcasts/</guid>
      
      
      <description><div>A simple step that podcasters can take to make their work easier to listen to.</div><div>&lt;p&gt;There are a lot of fascinating podcasts of interviews out there. They&amp;rsquo;re usually done by phone, and because of the vagaries of the hardware and software that link up each caller&amp;rsquo;s voice to the host&amp;rsquo;s, different voices get added to a podcast MP3 at different volumes, especially in conference call panel discussions such as those for &lt;a href=&#34;http://semanticgang.talis.com/&#34;&gt;The Semantic Web Gang&lt;/a&gt; and &lt;a href=&#34;http://readwritetalk.com/&#34;&gt;ReadWriteTalk&lt;/a&gt;. These differences can lead to an annoying combination of blasting your ears (especially when using ear buds while jogging) and fiddling with the volume control. For example, after turning up my car stereo loud enough to hear Daniela Barbosa&amp;rsquo;s voice in the &lt;a href=&#34;http://readwritetalk.com/2008/09/05/daniela-barbosa-dow-jones/&#34;&gt;ReadWriteTalk&lt;/a&gt; interview with her, the closing music nearly blew out my car&amp;rsquo;s speakers. Professionals have tools to reduce this difference, and fine free tools are available for amateurs as well. I love &lt;a href=&#34;http://audacity.sourceforge.net/&#34;&gt;Audacity&lt;/a&gt;, an open-source SourceForge project audio editor with binaries available for Windows, Linux, and Mac OS X.&lt;/p&gt;
&lt;p&gt;Audacity&amp;rsquo;s default view of an audio file displays volume over time. The following shows how it displays part of the &lt;a href=&#34;http://blogs.talis.com/nodalities/podpress_trac/web/1130/0/twt20081007-DavidProvost.mp3&#34;&gt;MP3 file&lt;/a&gt; for &lt;a href=&#34;http://blogs.talis.com/nodalities/2008/10/david-provost-talks-with-talis-about-his-report-of-a-semantic-web-industry-on-the-cusp.php&#34;&gt;Paul Miller&amp;rsquo;s Nodalities interview with David Provost&lt;/a&gt;. I don&amp;rsquo;t mean to pick on the Talis broadcasts, but this particular interview makes for a good visual example of how different two speakers&amp;rsquo; voice volumes may come across:&lt;/p&gt;
&lt;img id=&#34;id203651&#34; src=&#34;https://www.bobdc.com/img/main/compresspodcasts1.jpg&#34; border=&#34;0&#34; align=&#34;middle&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Audacity screen shot of Paul Miller podcast 1&#34;/&gt;
&lt;p&gt;When viewing this in Audacity, if you click on one of the narrow parts of the blue bar (for example, at 10:20) and click the Play button, you&amp;rsquo;ll hear Paul&amp;rsquo;s voice. If you click on one of the wide parts, you&amp;rsquo;ll hear David&amp;rsquo;s voice. Paul may be soft-spoken, and he often interviews people who are excited about the technology they&amp;rsquo;re evangelizing, but as I said earlier, the luck of the connection has more to do with the volume differences than personal style.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.kqzyfj.com/click-1973330-10381297?url=http%3A%2F%2Fguitars.musiciansfriend.com%2Fproduct%2FMXR-M102-Dyna-Comp-Compressor-Pedal%3Fsku%3D151101&amp;amp;cjsku=151101&#34;&gt;&lt;img id=&#34;id203691&#34; src=&#34;http://guitargeek.com/gear/img/mxr_dynacomp.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;MXR Dyna-Comp&#34;/&gt;&lt;/a&gt;&lt;img id=&#34;id203707&#34; src=&#34;http://www.tqlkg.com/image-1973330-10381297&#34; width=&#34;1&#34; height=&#34;1&#34; border=&#34;0&#34;/&gt;&lt;/p&gt;
&lt;p&gt;This is easy enough to remediate with an Audacity feature called compression, which makes the loud sounds quieter and the quiet sounds louder. (As the Wikipedia page for &lt;a href=&#34;http://en.wikipedia.org/wiki/Dynamic_range_compression&#34;&gt;dynamic range compression&lt;/a&gt; points out, this is unrelated to file size compression.) As a bit of specialized rock and roll hardware, compression boxes are popular with guitar players for providing more sustain by boosting the signal as the note dies away without adding to the distortion. (The downside is that boosting quiet sounds can turn a slight buzz or hum into a loud one, and reducing the loud sounds can take the crunch out of your attack with chords.) I still have the &lt;a href=&#34;http://guitargeek.com/gearview/183/&#34;&gt;MXR Dyna-Comp Compressor&lt;/a&gt; that I bought in 1979, which is apparently a collector&amp;rsquo;s item now; an eBay search for &lt;a href=&#34;http://shop.ebay.com/items/_W0QQ_dmptZGuitarQ5fAccessories?_nkw=vintage+mxr+dyna-comp&amp;amp;_sacat=0&amp;amp;_fromfsb=&amp;amp;_trksid=m270.l1313&amp;amp;_odkw=mxr+dyna-comp&amp;amp;_osacat=0&#34;&gt;&amp;ldquo;vintage mxr dyna-comp&amp;rdquo;&lt;/a&gt; gets multiple hits. At the end of the song &amp;ldquo;Cooper Square&amp;rdquo; on &lt;a href=&#34;http://www.snee.com/music/ha/&#34;&gt;this page&lt;/a&gt; you can hear my unsuccessful and then successful attempts to step on the Dyna Comp switch to turn off the hum, and if you turn it way up, you may hear the other guys laughing at me.&lt;/p&gt;
&lt;p&gt;To compress all or part of an audio file with Audacity, select the part to compress and then pick &lt;strong&gt;Compressor&lt;/strong&gt; from the &lt;strong&gt;Effect&lt;/strong&gt; menu to display the &lt;strong&gt;Dynamic Range Compressor&lt;/strong&gt; dialog box. After trying several combinations of settings on this dialog box with the Talis interview, I had the best luck with Threshold on -35, Ratio on 3:1, Attack Time on .1, and &amp;ldquo;Normalize to 0db after compressing&amp;rdquo; unchecked. The result looked like this:&lt;/p&gt;
&lt;img id=&#34;id203798&#34; src=&#34;https://www.bobdc.com/img/main/compresspodcasts2.jpg&#34; border=&#34;0&#34; align=&#34;middle&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Audacity screen shot of Paul Miller podcast 2&#34;/&gt;
&lt;p&gt;It looks spiky, but it&amp;rsquo;s much easier to listen to without touching your player&amp;rsquo;s volume control, and I&amp;rsquo;d recommend that anyone who isn&amp;rsquo;t compressing their podcasts before posting them use this open-source, cross-platform tool to follow the same procedure.&lt;/p&gt;
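Those Threshold and Ratio settings have a simple interpretation. The following is only a rough numeric sketch of downward compression above a threshold, with made-up function names; Audacity's actual compressor also involves the attack time and optional normalization:

```python
def compress_db(level_db, threshold_db=-35.0, ratio=3.0):
    """Simple downward compression: each dB above the threshold is
    reduced to 1/ratio dB above it; quieter sounds pass through."""
    if level_db <= threshold_db:
        return level_db  # below the threshold: unchanged
    return threshold_db + (level_db - threshold_db) / ratio

# A loud voice at -5 dB and a quiet one at -40 dB start 35 dB apart;
# after compression they're only 15 dB apart.
print(compress_db(-5.0))   # -25.0
print(compress_db(-40.0))  # -40.0
```

With a 3:1 ratio, every 3 dB that the input rises above the threshold only adds 1 dB to the output, which is why the loud sections in the second screen shot no longer tower over the quiet ones.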
&lt;p&gt;Audacity has many more features. To make an audio file out of any sound coming out of my computer, I can connect the headphone out jack to the microphone in jack and record with it. You can cut, copy, and paste when you want to trim something, add various other effects, and even do multi-track recording, although I haven&amp;rsquo;t tried that much. And you can&amp;rsquo;t beat it for the price.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.timothyhorrigan.com/tammi_itunes.html&#34; title=&#34;http://www.timothyhorrigan.com/tammi_itunes.html&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-2177&#34;&gt;November 14, 2008 1:49 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Excellent tips, Bob: I have Audacity myself (with a capital A I mean though I have the small-a audacity as well.) But I haven&amp;rsquo;t used it as effectively as you have.&lt;/p&gt;
&lt;p&gt;One thing I can add about the hum issue is that you can use the noise filter to get rid of it. Just find some unadulterated hum, use it to train the noise filter and then apply it to the whole file.&lt;/p&gt;
&lt;p&gt;BTW, if you are feeling really ambitious, you can try using Second Life to host the podcast. Basically you would have one avatar record the chat on a machinanime (i.e., &amp;ldquo;machine anime&amp;rdquo; or a video recording) while the others talk and then you could extract the audio from the movie. You can even have a live audience listen in and/or do a live broadcast via streaming video.&lt;/p&gt;
&lt;p&gt;SL&amp;rsquo;s voice module does a good job of equalizing the volumes of the speakers&amp;rsquo; voices. You can also adjust the volume manually, although you can get a nasty overdriven-amp effect if someone comes in too hot.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>SPARQL at the movies</title>
      <link>https://www.bobdc.com/blog/sparql-at-the-movies/</link>
      <pubDate>Fri, 07 Nov 2008 09:12:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-at-the-movies/</guid>
      
      
      <description><div>Using SPARQL to answer a few questions that IMDB won&#39;t help much with.</div><div>&lt;p&gt;Last week I &lt;a href=&#34;https://www.bobdc.com/blog/download-sparql-results-direct#id203726&#34;&gt;mentioned&lt;/a&gt; that the &lt;a href=&#34;http://www.linkedmdb.org/&#34;&gt;Linked Movie Database&lt;/a&gt; SPARQL endpoint would be fun to play with, and it has been. I used my &lt;a href=&#34;http://www.snee.com/sparql/spreadsheetSPARQL.html&#34;&gt;spreadsheetSPARQL&lt;/a&gt; interface to send the following queries to their &lt;a href=&#34;http://sparql.linkedmdb.org:2020/linkedmdb&#34;&gt;http://sparql.linkedmdb.org:2020/linkedmdb&lt;/a&gt; SPARQL endpoint.&lt;/p&gt;
&lt;h2 id=&#34;id203604&#34;&gt;One degree of Kevin Bacon&lt;/h2&gt;
&lt;p&gt;The following lists all the actors who have appeared in a movie with Kevin Bacon. Or, in more SPARQLy terms, it says &amp;ldquo;show me ?actorName (with no repeats) where ?kb is the ID for Kevin Bacon, and a given movie has ?kb and ?actor in it, and ?actor has the name ?actorName, but don&amp;rsquo;t show me ?actorName if the actor is ?kb&amp;rdquo; (that is, don&amp;rsquo;t list Kevin himself):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?actorName WHERE {


  ?kb    &amp;lt;http://data.linkedmdb.org/resource/movie/actor_name&amp;gt; &amp;quot;Kevin Bacon&amp;quot;.


  ?movie &amp;lt;http://data.linkedmdb.org/resource/movie/actor&amp;gt; ?kb;
         &amp;lt;http://data.linkedmdb.org/resource/movie/actor&amp;gt; ?actor.


  ?actor &amp;lt;http://data.linkedmdb.org/resource/movie/actor_name&amp;gt; ?actorName.


  FILTER (?kb != ?actor).
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I won&amp;rsquo;t show you all of the 240 names that get returned, but here are the first few:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Eve
Vincent D&#39;Onofrio
Daniel Stern
John C. Reilly
J. T. Walsh
Michael Gross
William Windom
Michael Tucker
Stephen Lang
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;id203671&#34;&gt;Versatile actor(s)&lt;/h2&gt;
&lt;p&gt;Which actors have appeared in both a John Waters movie and a Steven Spielberg movie? (Assign the URI for each director to a variable, find the URI for any actors who worked with both directors, and get the actors&amp;rsquo; names.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?actorName WHERE {


  ?dir1      &amp;lt;http://data.linkedmdb.org/resource/movie/director_name&amp;gt; &amp;quot;John Waters&amp;quot;.


  ?dir2      &amp;lt;http://data.linkedmdb.org/resource/movie/director_name&amp;gt; &amp;quot;Steven Spielberg&amp;quot;.


  ?dir1movie &amp;lt;http://data.linkedmdb.org/resource/movie/director&amp;gt; ?dir1;
             &amp;lt;http://data.linkedmdb.org/resource/movie/actor&amp;gt; ?actor.


  ?dir2movie &amp;lt;http://data.linkedmdb.org/resource/movie/actor&amp;gt; ?actor;
             &amp;lt;http://data.linkedmdb.org/resource/movie/director&amp;gt; ?dir2.


  ?actor     &amp;lt;http://data.linkedmdb.org/resource/movie/actor_name&amp;gt; ?actorName.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query returns only one result: Darren E. Burrows, who was in both &lt;a href=&#34;http://data.linkedmdb.org/resource/film/870&#34;&gt;Amistad&lt;/a&gt; and &lt;a href=&#34;http://data.linkedmdb.org/resource/film/27532&#34;&gt;Cry-Baby&lt;/a&gt;. (Fans of the old television show &amp;ldquo;Northern Exposure&amp;rdquo; might remember him as Ed Chigliak.)&lt;/p&gt;
&lt;h2 id=&#34;id203723&#34;&gt;Woody Allen&amp;rsquo;s favorite actors&lt;/h2&gt;
&lt;p&gt;To list everyone who had ever been in a Woody Allen movie, I might use the DISTINCT keyword so that each was only listed once, but I wanted the repetition so that I could see who had been in how many of these movies:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?actorName WHERE {


  ?woody  &amp;lt;http://data.linkedmdb.org/resource/movie/director_name&amp;gt; &amp;quot;Woody Allen&amp;quot;.


  ?movie  &amp;lt;http://data.linkedmdb.org/resource/movie/director&amp;gt; ?woody;
          &amp;lt;http://data.linkedmdb.org/resource/movie/actor&amp;gt; ?actor.


  ?actor &amp;lt;http://data.linkedmdb.org/resource/movie/actor_name&amp;gt; ?actorName.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Adding count(?actorName) after ?actorName caused an error, so either this SPARQL extension isn&amp;rsquo;t supported by the Linked Movie Database&amp;rsquo;s SPARQL implementation or I was doing something wrong. Either way, I got what I wanted by copying the output of the query above to a file called waactors.txt and piping that through the following at a Windows command prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type waactors.txt | sort | uniq -c | sort /r
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This sorts the names, reduces the list to show each name once with a count of how many times it occurred, and then does a reverse sort to put the names that occurred the most at the top of the list. (The Linux equivalent would be &amp;ldquo;cat waactors.txt | sort | uniq -c | sort -r&amp;rdquo;.) The resulting list began with these names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;     26 Woody Allen
     12 Mia Farrow
      7 Diane Keaton
      6 Julie Kavner
      5 Dianne Wiest
      4 Louise Lasser
      3 Tony Roberts
      3 Scarlett Johansson
      3 Judy Davis
      3 Alan Alda
&lt;/code&gt;&lt;/pre&gt;
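&lt;p&gt;For what it&amp;rsquo;s worth, the same count-and-rank step can be done without shell tools at all. Here&amp;rsquo;s a minimal Python sketch, assuming the query output has been saved with one name per line:&lt;/p&gt;

```python
from collections import Counter

def rank_names(lines):
    """Count each name and list them most frequent first --
    the equivalent of sort | uniq -c | sort -r."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()  # list of (name, count) pairs

names = ["Woody Allen", "Mia Farrow", "Woody Allen",
         "Diane Keaton", "Mia Farrow", "Woody Allen"]
for name, n in rank_names(names):
    print(n, name)
# 3 Woody Allen
# 2 Mia Farrow
# 1 Diane Keaton
```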
&lt;p&gt;Now we have a clear picture of who his favorite actor is. When I saw that &lt;a href=&#34;http://www.imdb.com/name/nm0001413/&#34;&gt;the voice of Marge Simpson&lt;/a&gt; beat everyone but Allen himself and his two most famous leading ladies, I wondered what six movies she appeared in. The answer was easy to find out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT  ?movieName WHERE {


  ?woody  &amp;lt;http://data.linkedmdb.org/resource/movie/director_name&amp;gt; &amp;quot;Woody Allen&amp;quot;.


  ?actor &amp;lt;http://data.linkedmdb.org/resource/movie/actor_name&amp;gt; &amp;quot;Julie Kavner&amp;quot;.


  ?movie  &amp;lt;http://data.linkedmdb.org/resource/movie/director&amp;gt; ?woody;
          &amp;lt;http://data.linkedmdb.org/resource/movie/actor&amp;gt; ?actor;
          &amp;lt;http://purl.org/dc/terms/title&amp;gt; ?movieName.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The answer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Don&#39;t Drink the Water
Deconstructing Harry
Hannah and Her Sisters
New York Stories
Radio Days
Shadows and Fog
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Moving further down the list of Allen&amp;rsquo;s favorite actors, I&amp;rsquo;m going to go out on a limb and predict that Scarlett Johansson moves ahead of Tony Roberts before Alan Alda does.&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-2173&#34;&gt;November 7, 2008 9:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Actually, you have to use &amp;ldquo;sort -n -r&amp;rdquo; on Posix systems to force the first field to be interpreted numerically. (I do a lot of this sort of thing.)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2174&#34;&gt;November 7, 2008 9:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi John,&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t need the -n, I assume because the uniq -c output right-justifies the numbers with leading spaces that make the numeric sort come out the way I want it.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2175&#34;&gt;November 7, 2008 3:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Very cool that you actually managed to get computed answers to these questions.&lt;/p&gt;
&lt;p&gt;Not so cool, it seems to me, that the questions are so cumbersome to frame.&lt;/p&gt;
&lt;p&gt;I will now tantalize (appall? annoy?) you with what these queries would look like in Thread, the path-based query-language I&amp;rsquo;m working on at ITA:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;actors who have appeared in a movie with Kevin Bacon&amp;rdquo;:&lt;/p&gt;
&lt;p&gt;Actor:=Kevin Bacon.Movie.Actor&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
&amp;ldquo;actors who have appeared in both a John Waters movie and a Steven Spielberg movie&amp;rdquo;:&lt;/p&gt;
&lt;p&gt;Actor:(.Movie.Director:=John Waters:=Steven Spielberg)&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
&amp;ldquo;actors who have appeared in Woody Allen movies&amp;rdquo;:&lt;/p&gt;
&lt;p&gt;Director:=Woody Allen.Movie.Actor&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
&amp;ldquo;actors who have appeared in Woody Allen movies, with counts&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Actor|Appearances=(.Movie:(.Director:=Woody Allen)._Count):Appearances&amp;gt;0#Appearances&lt;/p&gt;
&lt;p&gt;or, more cleverly:&lt;/p&gt;
&lt;p&gt;Director:=Woody Allen.Movie/Actor|Movies=(.Nodes._Count)#Movies&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
You can probably get the gist of most of these without a very detailed explanation of syntax. Except the last one, which is the most fun:&lt;/p&gt;
&lt;p&gt;Director - get all nodes of type Director&lt;br /&gt;
:=Woody Allen - narrow this list down to the one named Woody Allen&lt;br /&gt;
.Movie - get all this director&amp;rsquo;s movies&lt;br /&gt;
/Actor - group these movies by actor (each movie will appear in multiple groups&amp;hellip;)&lt;br /&gt;
|Movies=(.Nodes._Count) - calculate the number of grouped movies in each actor&amp;rsquo;s group, and call this &amp;ldquo;Movies&amp;rdquo;&lt;br /&gt;
#Movies - and sort the set of actor/movie-groups by these counts&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
No real points until it&amp;rsquo;s publicly available, obviously. But intriguing, maybe?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2176&#34;&gt;November 7, 2008 3:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oh, sorry, the Waters/Spielberg example would be:&lt;/p&gt;
&lt;p&gt;Actor:(.Movie.Director:=John Waters,=Steven Spielberg._Count:=2)&lt;/p&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;p&gt;Actor:(.Movie.Director:=John Waters):(.Movie.Director:=Steven Spielberg)&lt;/p&gt;
&lt;p&gt;or, for that matter&lt;/p&gt;
&lt;p&gt;Director:=John Waters,=Steven Spielberg/(.Movie.Actor):(.nodes._Count:=2)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://vannevarvision.wordpress.com&#34; title=&#34;http://vannevarvision.wordpress.com&#34;&gt;Shahan Khatchadourian&lt;/a&gt; on &lt;a href=&#34;#comment-2183&#34;&gt;November 23, 2008 11:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks for the interesting post. The SPARQL endpoint has been shifted back to the D2RQ server:&lt;br /&gt;
&lt;a href=&#34;http://data.linkedmdb.org/sparql&#34;&gt;http://data.linkedmdb.org/sparql&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.india-forums.tv/bollywood-forum.html&#34; title=&#34;http://www.india-forums.tv/bollywood-forum.html&#34;&gt;Fozia&lt;/a&gt; on &lt;a href=&#34;#comment-2187&#34;&gt;December 4, 2008 1:07 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;well… i visit your website first time and found this site very useful and interesting!&lt;/p&gt;
&lt;p&gt;well… you guys doing nice work and i just want to say that keep rocking and keep it up !!!!&lt;/p&gt;
&lt;p&gt;Regards&lt;br /&gt;
Fozia&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Converting SGML DTDs to XML</title>
      <link>https://www.bobdc.com/blog/converting-sgml-dtds-to-xml/</link>
      <pubDate>Tue, 04 Nov 2008 18:35:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/converting-sgml-dtds-to-xml/</guid>
      
      
      <description><div>Not quite to XML DTDs, but close enough to be useful.</div><div>&lt;p&gt;I recently had to analyze a large batch of SGML DTDs for a client who planned to convert their publishing system to XML. I was mostly looking for redundant declarations in multiple DTDs that could be pulled into shared modules, but I also wanted some lists of elements and attributes that I could compare against statistics compiled about sample data so that I could see which elements and attributes were actually being used, because there&amp;rsquo;s not much point converting SGML declarations for elements that aren&amp;rsquo;t even used into XML element declarations.&lt;/p&gt;
&lt;p&gt;When I want to analyze a collection of information that doesn&amp;rsquo;t neatly fit into one or more tables, I want it in XML so that I can write little XSLT stylesheets to churn through it and count and compare things, and I found a surprisingly easy way to make an XML version of all of this SGML DTD information. I didn&amp;rsquo;t quite turn it into an XML DTD or schema—there was enough refactoring planned for the DTD conversion that we didn&amp;rsquo;t bother—but a few more steps and a minimal amount of manual work would have made that pretty straightforward.&lt;/p&gt;
&lt;p&gt;The key was Earl Hood&amp;rsquo;s &lt;a href=&#34;http://savannah.nongnu.org/projects/perlsgml/&#34;&gt;perlSGML&lt;/a&gt; DTD analysis tools. I wrote about these in a 1998 book I did called &lt;a href=&#34;http://www.snee.com/bob/sgmlfree/&#34;&gt;SGML CD&lt;/a&gt;. (The book was originally going to be called &amp;ldquo;SGML for Free&amp;rdquo;, because it documented all the best free SGML tools, but Prentice Hall decided that including a CD of the software itself would make the book more appealing, and they changed the book&amp;rsquo;s title to make it clearer that a CD came with it.) Earl&amp;rsquo;s tools are a collection of perl scripts that read an SGML DTD and give you various ways to explore it.&lt;/p&gt;
&lt;p&gt;One script, called dtd2html, creates a directory full of HTML reports about various aspects of the DTD such as which elements have which subelements and attributes, which parent elements they can have, and which elements have which attribute types and which of those are required. My original idea was to run these HTML files through &lt;a href=&#34;http://home.ccil.org/~cowan/XML/tagsoup/&#34;&gt;tagsoup&lt;/a&gt; so that I could use XSLT stylesheets to pull out the information that I wanted, but it wasn&amp;rsquo;t as easy for the stylesheets to find what I needed in the dtd2html output as I had hoped. This was easy enough to fix: I added a few lines to the dtd2html perl script to wrap some &lt;code&gt;div&lt;/code&gt; elements around the parts that I was interested in. These &lt;code&gt;div&lt;/code&gt; elements included &lt;code&gt;class&lt;/code&gt; attributes with names that served as hooks to make it easier for the XSLT stylesheet to find them, so that once I ran the modified version of dtd2html and tagsoup again, the stylesheet was pretty simple to write.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not going to post my revised version of dtd2html because I wrote it for client work, and getting the right permissions would be more work than adding the lines that added the &lt;code&gt;div&lt;/code&gt; tags to the perl script. If you need something like this, though, you can make your own customized additions to dtd2html with very little trouble.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll be discussing this and related techniques in my XML 2008 talk on &lt;a href=&#34;http://www.idealliance.org/xml2008/schedule-details.asp#ft9&#34;&gt;Automating Content Analysis with Trang and Simple XSLT Scripts&lt;/a&gt; on December 9th.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2171&#34;&gt;November 4, 2008 6:42 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Implementing captcha on my weblog forced me to convert to using some Movable Type 4 templates where I had been using MT 3 ones, and this screwed some things up, so I apologize if there are any problems adding comments. Kudos to the support people at pair.net, who patiently helped me straighten out the initial captcha problems.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Download SPARQL results directly into a spreadsheet</title>
      <link>https://www.bobdc.com/blog/download-sparql-results-direct/</link>
      <pubDate>Wed, 29 Oct 2008 09:02:14 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/download-sparql-results-direct/</guid>
      
      
      <description><div>And then sort it, graph it, create new calculations... do all that stuff people do with spreadsheets.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.snee.com/sparql/spreadsheetSPARQL.html&#34;&gt;&lt;img id=&#34;id203578&#34; src=&#34;https://www.bobdc.com/img/main/spreadsheetSPARQL.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;graph generated in spreadsheet from SPARQL output&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One could make a case that spreadsheets are the oldest form of writing, if some of the earliest examples of writing are columns of cuneiform symbols pressed into clay with reeds to keep track of how many cattle or pots got shipped up and down the river. Back then, stories were oral, but written records made tracking your business&amp;rsquo;s assets and activities a lot easier. Spreadsheets were certainly the killer app of the original personal computers, as people bought 8-bit PCs to run &lt;a href=&#34;http://www.bricklin.com/&#34;&gt;Dan Bricklin&amp;rsquo;s&lt;/a&gt; VisiCalc (&lt;a href=&#34;https://www.bobdc.com/blog/scraping-and-linked-data#c001540&#34;&gt;not Dan Brickley&amp;rsquo;s&lt;/a&gt;) and then the 16-bit IBM PC to run Lotus 1-2-3. Whether creating a phone list for a kid&amp;rsquo;s soccer team or modeling complex financial derivatives, people tracking data that fits into a table of rows and columns like to put it into spreadsheets.&lt;/p&gt;
&lt;p&gt;A SPARQL query returns a table of results in &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-XMLres-20080115/&#34;&gt;XML&lt;/a&gt; or &lt;a href=&#34;http://www.w3.org/TR/rdf-sparql-json-res/&#34;&gt;JSON&lt;/a&gt;, formats that you can easily convert into other formats. One of my most popular blog postings describes how implementing a &lt;a href=&#34;https://www.bobdc.com/blog/download-as-spreadsheet&#34;&gt;download as spreadsheet&lt;/a&gt; link is as simple as creating an HTML table and then telling the downloading browser &amp;ldquo;here comes a spreadsheet&amp;rdquo; before delivering that HTML. The browser will usually open up the designated spreadsheet application and load the table there, including simple formatting included with the data.&lt;/p&gt;
&lt;p&gt;To do this for SPARQL queries, I created a form at &lt;a href=&#34;http://www.snee.com/sparql/spreadsheetSPARQL.html&#34;&gt;http://www.snee.com/sparql/spreadsheetSPARQL.html&lt;/a&gt; with an appearance based on SNORQL forms such as &lt;a href=&#34;http://dbpedia.org/snorql/&#34;&gt;DBpedia&amp;rsquo;s&lt;/a&gt;. The form has two fields: one for the URL of a SPARQL endpoint and one for the SPARQL query to send to that endpoint. Clicking the &amp;ldquo;Go&amp;rdquo; button tells a python CGI script to do the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Send the query to the endpoint, asking for a JSON version of the results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Send an HTTP Content-type value of &amp;ldquo;application/vnd.ms-excel&amp;rdquo; to the application that requested the SPARQL data (most likely, the browser displaying the query form).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Send an HTML version of the JSON data to the requesting application.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
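&lt;p&gt;Step 3 is the only part that takes any real code. I won&amp;rsquo;t reproduce my script here, but a minimal sketch of converting the standard SPARQL JSON results format into an HTML table looks something like this (the function name is made up for illustration):&lt;/p&gt;

```python
import json

def sparql_json_to_html_table(results_json):
    """Render the W3C SPARQL JSON results format as an HTML table.
    A CGI script would print "Content-type: application/vnd.ms-excel"
    and a blank line before this markup to trigger the spreadsheet."""
    data = json.loads(results_json)
    cols = data["head"]["vars"]
    html = ["<table>",
            "<tr>" + "".join("<th>%s</th>" % c for c in cols) + "</tr>"]
    for row in data["results"]["bindings"]:
        # a variable may be unbound in a given row, hence the defaults
        cells = [row.get(c, {}).get("value", "") for c in cols]
        html.append("<tr>" + "".join("<td>%s</td>" % v for v in cells)
                    + "</tr>")
    html.append("</table>")
    return "\n".join(html)
```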
&lt;p&gt;A computer that doesn&amp;rsquo;t have Excel installed will open whatever application is assigned to open Excel spreadsheets; my Ubuntu laptop displayed the table in OpenOffice Calc. It all worked fine from Firefox on Ubuntu and Windows and from Internet Explorer (but not from Google Chrome, which just downloaded the result to a file).&lt;/p&gt;
&lt;p&gt;There are &lt;a href=&#34;http://esw.w3.org/topic/SparqlEndpoints&#34;&gt;many SPARQL endpoints&lt;/a&gt; that have apparently worked at one time or another, but only four worked when I tested my spreadsheetSPARQL form and script. Still, the fact that four worked fine was great, and the &lt;a href=&#34;http://www.linkedmdb.org/&#34;&gt;Linked Movie Database&lt;/a&gt; in particular will be a lot of fun to play with. The spreadsheetSPARQL form lists the endpoints that worked for me. It also links to a recent posting I did on &lt;a href=&#34;https://www.bobdc.com/blog/how-you-can-explore-a-new-set&#34;&gt;good queries to start with&lt;/a&gt; when exploring an unfamiliar set of RDF data.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s one example of using the form. I sent the following query to DBpedia&amp;rsquo;s SPARQL endpoint of &lt;a href=&#34;http://dbpedia.org/sparql&#34;&gt;http://dbpedia.org/sparql&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?co,  ?revenue, ?netIncome
WHERE {
  ?co dbpedia2:revenue ?revenue;
      dbpedia2:netIncome ?netIncome.
  FILTER (?revenue &amp;gt; 80000000000)
}
ORDER BY ?revenue
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I downloaded the result into a spreadsheet and then created the bar graph shown above. (Gazprom&amp;rsquo;s revenue for last year pretty much dwarfs all other figures, but the current drop in oil prices should even out next year&amp;rsquo;s version of the graph.)&lt;/p&gt;
&lt;p&gt;There is tons more data on DBpedia alone that will be interesting in a spreadsheet, and the number of additional SPARQL endpoints to pull data from is growing. Please let me know if you create other interesting spreadsheet uses of SPARQL data, with graphs or otherwise. I look forward to hearing about them.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By drewp on &lt;a href=&#34;#comment-2170&#34;&gt;November 2, 2008 1:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;So far you&amp;rsquo;ve motivated seeing the results of a query in a table and making a graph from them. I&amp;rsquo;d like to have both of those capabilities in a webapp. E.g. I should be able to embed a live graph in my own page like this:&lt;/p&gt;
&lt;img src=&#34;http://sparqlgrapher.com/svg/example.com/query=SELECT+?date+?price+{...}&#34;&gt;
&lt;p&gt;Visiting my hypothetical sparqlgrapher.com directly would give you a UI to lay out and customize the graph. When you&amp;rsquo;re done, you&amp;rsquo;d take that URL and embed it elsewhere (or just take a copy of the image, if you want a one-off).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>Using the Twitter API to alert myself to swings in the Dow</title>
      <link>https://www.bobdc.com/blog/using-the-twitter-api-to-alert/</link>
      <pubDate>Mon, 27 Oct 2008 10:35:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-the-twitter-api-to-alert/</guid>
      
      
      <description><div>Or, using a free REST-based service to distribute important financial information on a hot new platform.</div><div>&lt;p&gt;&lt;a href=&#34;http://twitter.com/djia50&#34;&gt;&lt;img id=&#34;id203576&#34; src=&#34;http://ichart.finance.yahoo.com/instrument/1.0/%5EDJI/chart;range=1d/image;size=239x110&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;chart of today&#39;s DJIA activity&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;On a normal day two months ago, I would never check the Dow Jones Industrial Average, but with the financial turmoil of recent weeks I&amp;rsquo;ve wondered more often whether we were getting in deeper or whether things had bounced back at all, so I was checking too often. I usually went to &lt;a href=&#34;http://www.cnbc.com/&#34;&gt;CNBC&amp;rsquo;s home page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Since I &lt;a href=&#34;https://www.bobdc.com/blog/tweet-tweet&#34;&gt;became a regular Twitter user&lt;/a&gt;, I&amp;rsquo;ve also checked that more frequently, and I&amp;rsquo;d been wondering about a good excuse to play with the &lt;a href=&#34;http://apiwiki.twitter.com/&#34;&gt;Twitter API&lt;/a&gt;. I decided that a Twitter account that alerted me to big swings in the Dow would inform me about important news there when I checked Twitter, letting me skip the visits to cnbc.com. I used the &lt;a href=&#34;http://code.google.com/p/python-twitter/&#34;&gt;python-twitter&lt;/a&gt; API to write a script that checks the Dow on &lt;a href=&#34;http://finance.yahoo.com/&#34;&gt;Yahoo Finance&lt;/a&gt; and tweets if the figure has moved more than 50 points since the last tweet or since the day&amp;rsquo;s opening, whichever is more recent. I also scheduled a cron job to run this once an hour. (My host provider won&amp;rsquo;t let me schedule it any more often.) These days, 50 points doesn&amp;rsquo;t seem like a huge swing, but three or four tweets in one day tell me about big gains, big losses, or big volatility, and no tweets tell me that things are relatively calm.&lt;/p&gt;
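The core decision the script makes can be sketched in a few lines of Python. This is only an illustration of the threshold logic described above; the function name and figures are my own invention, not taken from the actual djia50 script:

```python
# Hypothetical sketch of the djia50 script's threshold test; the
# name should_tweet and the sample figures are invented, not the real code.
def should_tweet(current, reference, threshold=50):
    """Return True if the Dow has moved more than `threshold` points
    since `reference` (the last tweeted figure or the day's opening,
    whichever is more recent)."""
    return abs(current - reference) > threshold

print(should_tweet(9055.2, 9001.1))  # a 54.1-point move: True
print(should_tweet(9030.0, 9001.1))  # a 28.9-point move: False
```

The same test covers drops as well as gains, since it compares the absolute value of the move against the threshold.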
&lt;p&gt;It&amp;rsquo;s always nice when minimal coding around a REST-based API from a free service results in something genuinely useful. The account is called &lt;a href=&#34;http://twitter.com/djia50&#34;&gt;DJIA50&lt;/a&gt;, and I hope it&amp;rsquo;s useful to others as well.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>New XMP spec</title>
      <link>https://www.bobdc.com/blog/new-xmp-spec/</link>
      <pubDate>Thu, 23 Oct 2008 19:43:03 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/new-xmp-spec/</guid>
      
      
<description><div>And the W3C thought they had a problem dating XML Recommendation releases.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.adobe.com/devnet/xmp/&#34;&gt;&lt;img id=&#34;id203577&#34; src=&#34;http://www.adobe.com/devnet/images/xmp_tagline.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;XMP logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve found several things to like and several not to like about Adobe&amp;rsquo;s XMP specification (&lt;a href=&#34;http://www.xml.com/pub/a/2004/09/22/xmp.html&#34;&gt;[1]&lt;/a&gt;, &lt;a href=&#34;https://www.bobdc.com/blog/using-or-not-using-adobes-xmp&#34;&gt;[2]&lt;/a&gt;, &lt;a href=&#34;http://www.snee.com/bobdc.blog/2008/03/batch_processing_of_image_file.html&#34;&gt;[3]&lt;/a&gt;), a subset of RDF that lets you embed standard and custom metadata into the kinds of file formats that Adobe products typically read and write. I recently learned from a &lt;a href=&#34;http://www.crossref.org/CrossTech/2008/10/xmp_marches_on.html&#34;&gt;Tony Hammond posting&lt;/a&gt; on the crosstech blog that Adobe just released a new version of this seven-year-old spec. There&amp;rsquo;s a new SDK with it as well, and while I only remember the SDK supporting C++ before, this &amp;ldquo;version&amp;rdquo; includes Java support.&lt;/p&gt;
&lt;p&gt;I quote the word &amp;ldquo;version&amp;rdquo; because, as Tony pointed out, this new one has &amp;ldquo;no version number and no date&amp;rdquo;. Hopefully they&amp;rsquo;ll learn something from the &lt;a href=&#34;http://lists.w3.org/Archives/Public/xml-editor/2008OctDec/0016.html&#34;&gt;current problems&lt;/a&gt; being discussed about dating releases of the XML Recommendations and the value of making it easy for people to cite specific releases of a spec.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By Tony Hammond on &lt;a href=&#34;#comment-2144&#34;&gt;October 24, 2008 10:54 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob:&lt;/p&gt;
&lt;p&gt;Just to clarify the SDK issues. As I &lt;a href=&#34;http://www.crossref.org/CrossTech/2007/03/xmp_capabilities_extended.html&#34;&gt;posted&lt;/a&gt; to CrossTech back in March &amp;lsquo;07, there was a new release of the SDK (4.1.1) which included two libraries: XMPCore and XMPFiles. Both are implemented in C++ with a Java implementation provided for XMPCore only. (XMPCore allows XMP packets to be constructed, whereas XMPFiles allows XMP packets to be written to and read from files, i.e. the useful bit.)&lt;/p&gt;
&lt;p&gt;The new SDK release (4.4.2 - there were no public 4.2 or 4.3 offerings afaik) does not change things in this regard. There is still no Java support for reading or writing files. To quote from the &lt;a href=&#34;http://www.adobe.com/devnet/xmp/pdfs/XMP-Toolkit-SDK-Overview.pdf&#34;&gt;Toolkit Overview (PDF)&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&amp;ldquo;A Java implementation of XMPCore is also provided, to be used with J2SE Version 1.4.2 or higher. Project files for Eclipse 3.2 and an Ant build file are included.&amp;rdquo;&lt;br /&gt;
&lt;br /&gt;
&amp;ldquo;XMPFiles is provided as a C++ implementation &amp;hellip;&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The XMP Toolkit SDK is at least versioned. The XMP Spec, however, is not currently versioned - although Gunar Penikis has &lt;a href=&#34;http://blogs.adobe.com/gunar/2008/10/new_xmp_sdks_released.html#comments&#34;&gt;agreed&lt;/a&gt; that the docs should be dated. Another related concern is that there seems to be no public archive of XMP Specs being maintained by Adobe (or other).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>SPARQL and relational databases: getting started</title>
      <link>https://www.bobdc.com/blog/sparql-and-relational-database/</link>
      <pubDate>Mon, 20 Oct 2008 17:24:13 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/sparql-and-relational-database/</guid>
      
      
<description><div>Asking about tables and columns and doing a simple join.</div><div>&lt;p&gt;In an &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.html&#34;&gt;earlier project&lt;/a&gt; in which I queried relational data with SPARQL, I wanted to demonstrate how adding OWL metadata made it possible to answer useful queries that couldn&amp;rsquo;t have been answered with just the original data. I did this using two simple databases of one table each, but recently, while helping someone who was also using the &lt;a href=&#34;http://d2rq.org/&#34;&gt;D2RQ&lt;/a&gt; interface to provide access to some relational data, I decided to get more comfortable with using SPARQL to query data from multi-table relational databases.&lt;/p&gt;
&lt;p&gt;I wanted to find ways to:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/toasty/1540997910/&#34;&gt;&lt;img id=&#34;id203612&#34; src=&#34;http://farm3.static.flickr.com/2259/1540997910_dd04a385ae.jpg?v=0&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;picture of globe&#34; width=&#34;320px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;list a database&amp;rsquo;s tables&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;list the columns in those tables&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;run a query that joined data from at least two of the tables&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It was all easier than I thought it would be.&lt;/p&gt;
&lt;p&gt;For a database to query, I went to the sample database section of &lt;a href=&#34;http://dev.mysql.com/doc/#sampledb&#34;&gt;MySQL&amp;rsquo;s documentation page&lt;/a&gt; and got the &amp;ldquo;world&amp;rdquo; database. This database has a &lt;code&gt;country&lt;/code&gt; table with columns for country names, country codes unique within the table, and other information about the country such as its population and head of state. Another table, named &lt;code&gt;countrylanguage&lt;/code&gt;, has a column for a country code to link to the &lt;code&gt;country&lt;/code&gt; table, a column for a language spoken in that country, another for a figure showing the percentage of the country&amp;rsquo;s residents who speak that language, and a fourth column for a boolean &lt;code&gt;IsOfficial&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;The following SQL query against that database lists the name of each country, a language spoken there, and what percentage of the population speak that language, with the rows sorted from high percentages to low:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT Name, Language, Percentage FROM country, countrylanguage 
WHERE country.Code = countrylanguage.CountryCode
ORDER BY Percentage DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here&amp;rsquo;s how I developed a SPARQL query doing the same thing.&lt;/p&gt;
&lt;p&gt;As I described in the writeup of my earlier project, the first step of using D2RQ with a particular database is pointing its generate-mapping utility at a database installed in a running copy of MySQL (or one of the other relational database managers that D2RQ supports) to generate an SQL-to-SPARQL mapping file for that database. Next, you start up the d2r-server program with that mapping file as a parameter to run a server that provides a SPARQL endpoint for that database.&lt;/p&gt;
&lt;p&gt;You could then send SPARQL queries to that server&amp;rsquo;s SPARQL endpoint &lt;a href=&#34;https://www.bobdc.com/blog/querying-wikidbpedia-for-presi#id203795&#34;&gt;using Curl&lt;/a&gt;. For exploring the data, though, I prefer the SNORQL interface, because I can enter a query on a web form, click the Go button, and browse the result. (The SNORQL interface in the latest official release of D2RQ &lt;a href=&#34;http://sourceforge.net/mailarchive/forum.php?thread_name=48F7C172.3040003%40snee.com&amp;amp;forum_name=d2rq-map-devel&#34;&gt;doesn&amp;rsquo;t get along with Firefox 3.0&lt;/a&gt;—apparently there&amp;rsquo;s already a fixed version in D2RQ&amp;rsquo;s cvs tree—so I used Chrome for this.)&lt;/p&gt;
&lt;p&gt;When exploring a relational database, I first want to know what its tables are and what columns are in those tables. To do this with SPARQL queries, the D2RQ mapping file that I generated earlier provided some good clues. (Keep in mind that there may be other SPARQL interfaces to relational databases in the future, and they may not all map the relational structures to RDF the same way.)&lt;/p&gt;
&lt;p&gt;First: what are the tables in the database? D2RQ treats each table row as a resource of type tablename. For example, it treats the &lt;code&gt;country&lt;/code&gt; table&amp;rsquo;s row for France as a resource identified as http://localhost:2020/resource/country/&lt;strong&gt;FRA&lt;/strong&gt;, which has a &lt;a href=&#34;http://www.w3.org/1999/02/22-rdf-syntax-ns&#34;&gt;http://www.w3.org/1999/02/22-rdf-syntax-ns&lt;/a&gt;#&lt;strong&gt;type&lt;/strong&gt; of http://localhost:2020/resource/vocab/&lt;strong&gt;country&lt;/strong&gt;. So, by asking for a list of all the types with the following query (with DISTINCT added so that each only shows up once),&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?o WHERE {
  ?s  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;gt; ?o
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;SNORQL&amp;rsquo;s browse mode gives me this list (SNORQL defines &amp;ldquo;vocab&amp;rdquo; as the prefix for the http://localhost:2020/resource/vocab/ URI), which corresponds to the tables declared in the world database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vocab:country 
vocab:countrylanguage 
vocab:city 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What are the columns of these tables? For an example of one column&amp;rsquo;s name, the world database&amp;rsquo;s &lt;code&gt;country&lt;/code&gt; table has a row for France with a figure of 59225700 in the &lt;code&gt;Population&lt;/code&gt; column. D2RQ represents the column name as a property named tablename_columnname, so the triple describing France&amp;rsquo;s population tells us that http://localhost:2020/resource/country/&lt;strong&gt;FRA&lt;/strong&gt; has a http://localhost:2020/resource/vocab/&lt;strong&gt;country_Population&lt;/strong&gt; of 59225700. How do we list all these predicate names? As I mentioned in &lt;a href=&#34;https://www.bobdc.com/blog/how-you-can-explore-a-new-set&#34;&gt;How you can explore a new set of linked data&lt;/a&gt;, the following query lists all the predicates in an RDF-based data collection:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?p WHERE {?s ?p ?o}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Displaying &lt;em&gt;all&lt;/em&gt; predicates in the world database, though, includes a few extras along with the ones representing the table columns. The table columns are all in the http://localhost:2020/resource/vocab namespace, so the best way I could think of to query for predicates that were in that namespace was like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?p 
WHERE { ?s ?p ?o.
        FILTER(regex(str(?p),&amp;quot;http://localhost:2020/resource/vocab/&amp;quot;)).
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&amp;rsquo;s pretty kludgy, so I&amp;rsquo;d love to hear of a better alternative. It selects all the predicates in the data set and then only passes along the ones whose URLs, when converted to a string, have &amp;ldquo;http://localhost:2020/resource/vocab/&amp;rdquo; in them. If another namespace used http://localhost:2020/resource/vocab/foo as its URL, this FILTER would select predicates in that namespace as well, because it has that &amp;ldquo;http://localhost:2020/resource/vocab/&amp;rdquo; substring in there, but I couldn&amp;rsquo;t find a better way to do this. (Although SPARQL implements a subset of XPath, the &lt;a href=&#34;http://www.w3.org/TR/1999/REC-xpath-19991116#function-namespace-uri&#34;&gt;namespace-uri&lt;/a&gt; function is not part of this subset.)&lt;/p&gt;
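To make the over-matching concrete outside of SPARQL: a stricter test would anchor the match at the start of the URI and then reject any local name that contains a further slash. Here is a rough Python sketch of that idea (the function name and the sample URIs are invented for illustration, not part of D2RQ):

```python
# Hypothetical stricter namespace test; in_vocab_namespace and the
# sample URIs are invented for illustration, not part of D2RQ.
VOCAB = 'http://localhost:2020/resource/vocab/'

def in_vocab_namespace(uri, ns=VOCAB):
    """Unlike the substring test in the regex FILTER, require that the
    URI begin with the namespace and that the remainder (the local
    name) contain no further slash."""
    if not uri.startswith(ns):
        return False
    local = uri[len(ns):]
    return bool(local) and '/' not in local

# The substring test accepts both of these; only the first really
# belongs to the vocab namespace (the second would belong to a
# hypothetical vocab/foo namespace).
print(in_vocab_namespace(VOCAB + 'country_Population'))  # True
print(in_vocab_namespace(VOCAB + 'foo/somethingElse'))   # False
```

SPARQL's regex FILTER has no direct equivalent of this two-part check, which is why the substring version in the query above was the practical compromise.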
&lt;p&gt;I won&amp;rsquo;t show you all 24 predicate names that this query returns, but for the SPARQL version of the SQL query above that listed country names, languages, and percentages, I picked out the tablename_columnname predicates I needed from the list of 24 and used them to create this query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?name ?language ?percentage 
WHERE { ?s1 vocab:country_Name ?name;
            vocab:country_Code ?ccode.
        ?s2 vocab:countrylanguage_CountryCode ?ccode;
            vocab:countrylanguage_Language ?language;
            vocab:countrylanguage_Percentage ?percentage.
}
ORDER BY DESC(?percentage)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&amp;rsquo;s not a very fancy query. The most interesting part is the use of the &lt;code&gt;ccode&lt;/code&gt; variable to connect information in triples from the world database&amp;rsquo;s &lt;code&gt;country&lt;/code&gt; table with information in triples from the database&amp;rsquo;s &lt;code&gt;countrylanguage&lt;/code&gt; table. When I first started this I had no idea how I was going to do this SPARQL equivalent of an SQL join, but once I sat down and tried, it was intuitive enough. I won&amp;rsquo;t show you all 984 result rows, but here&amp;rsquo;s a selection from the middle to give you the flavor of what&amp;rsquo;s there:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;Réunion&amp;quot;        &amp;quot;Creole French&amp;quot;  91.5
&amp;quot;Germany&amp;quot;        &amp;quot;German&amp;quot;         91.3
&amp;quot;Seychelles&amp;quot;     &amp;quot;Seselwa&amp;quot;        91.3
&amp;quot;Romania&amp;quot;        &amp;quot;Romanian&amp;quot;       90.7
&amp;quot;American Samoa&amp;quot; &amp;quot;Samoan&amp;quot;         90.6
&amp;quot;Syria&amp;quot;          &amp;quot;Arabic&amp;quot;         90.0
&amp;quot;Swaziland&amp;quot;      &amp;quot;Swazi&amp;quot;          89.9
&amp;quot;Bahamas&amp;quot;        &amp;quot;Creole English&amp;quot; 89.7
&amp;quot;Chile&amp;quot;          &amp;quot;Spanish&amp;quot;        89.7
&amp;quot;Sweden&amp;quot;         &amp;quot;Swedish&amp;quot;        89.5
&lt;/code&gt;&lt;/pre&gt;
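The join that the shared &lt;code&gt;ccode&lt;/code&gt; variable performs works like any join on agreeing values: keep every combination of rows whose bindings for the shared variable match. A toy Python version over two invented miniature tables shows the idea (the data here is a made-up fragment, not the real world database):

```python
# Invented miniature stand-ins for a few rows of the country and
# countrylanguage tables; not the real world-database data.
countries = [('SWE', 'Sweden'), ('CHL', 'Chile'), ('ROM', 'Romania')]
languages = [('SWE', 'Swedish', 89.5),
             ('CHL', 'Spanish', 89.7),
             ('ROM', 'Romanian', 90.7)]

# A SPARQL join on a shared variable keeps every combination of
# solutions whose values for that variable agree -- here, the codes
# play the role of the ?ccode binding.
results = sorted(
    ((name, lang, pct)
     for code, name in countries
     for lcode, lang, pct in languages
     if code == lcode),              # the shared binding
    key=lambda row: row[2], reverse=True)

for row in results:
    print(row)
```

The `reverse=True` sort mirrors the query's `ORDER BY DESC(?percentage)` clause.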
&lt;p&gt;Now I feel more confident about forging ahead to explore sets of relational data that I can access with SPARQL queries using the D2RQ interface.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By Irene Polikoff on &lt;a href=&#34;#comment-2135&#34;&gt;October 20, 2008 9:18 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Queries like SELECT DISTINCT ?p WHERE {?s ?p ?o} are OK for RDF stores because of their specialized indexing strategies, but I don&amp;rsquo;t think this would work well with D2RQ against a relational database.&lt;/p&gt;
&lt;p&gt;It would be OK if the RDBMS had a very small amount of data. But on a database of any size, such queries are not practical. I suspect D2RQ will just try to get all the data in the entire database and then try to figure out predicates. The same goes for the query you are suggesting for getting the table names.&lt;/p&gt;
&lt;p&gt;D2RQ translates database schema in the following way - all tables get represented as owl:Class, all columns are exposed as properties connected to the classes using rdfs:domain statements.&lt;/p&gt;
&lt;p&gt;So, a much less expensive way to get all tables is simply by using SELECT ?table WHERE {?table rdf:type owl:Class}&lt;/p&gt;
&lt;p&gt;Then to get columns you can do ?column rdfs:domain ?table. In fact, this single query will get you all tables with the corresponding columns&lt;/p&gt;
&lt;p&gt;SELECT ?table ?column WHERE{?column rdfs:domain ?table}&lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
&lt;p&gt;Irene&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2136&#34;&gt;October 20, 2008 10:55 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Irene! That all makes sense to me, but&lt;br /&gt;
I didn&amp;rsquo;t see any use of the &lt;a href=&#34;http://www.w3.org/2002/07/owl#&#34;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt; namespace (or even the string &amp;ldquo;owl&amp;rdquo;) in the mapping file generated by D2RQ, and&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  SELECT ?table WHERE {?table rdf:type owl:Class}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;got me no results. This also got me no results:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?table ?column WHERE{?column rdfs:domain ?table}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By &lt;a href=&#34;http://dowhatimean.net/&#34; title=&#34;http://dowhatimean.net/&#34;&gt;Richard Cyganiak&lt;/a&gt; on &lt;a href=&#34;#comment-2138&#34;&gt;October 21, 2008 3:29 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, I enjoy your posts on practical RDF production and SPARQL usage a lot. Keep it up!&lt;/p&gt;
&lt;p&gt;Irene is right that the queries for listing the classes and properties are not very efficient. However, her proposed replacements unfortunately don&amp;rsquo;t work – D2RQ only translates the instance data (records) from the DB to RDF, but does not provide an RDF view on the schema level. Information about classes and properties is on the schema level.&lt;/p&gt;
&lt;p&gt;It would be nice if these statements were queryable through SPARQL, but this remains to be done for future versions of D2RQ.&lt;/p&gt;
&lt;p&gt;So, at the moment, the queries you suggested are the best that can be done. They will work well enough for databases up to a few 100k records if the number of tables/columns isn&amp;rsquo;t very large. Note that Snorql uses variations of your queries to list classes and properties.&lt;/p&gt;
&lt;p&gt;By Irene Polikoff on &lt;a href=&#34;#comment-2141&#34;&gt;October 23, 2008 12:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hmm&amp;hellip; interesting. It works exactly as described for me. Of course, I use a version of D2RQ bundled with TopBraid Composer. And I know that Holger has made some changes. This could be one of them.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>White paper on metadata standards</title>
      <link>https://www.bobdc.com/blog/white-paper-on-metadata-standa/</link>
      <pubDate>Thu, 16 Oct 2008 09:01:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/white-paper-on-metadata-standa/</guid>
      
      
      <description><div>Not as confusing a choice as many think.</div><div>&lt;p&gt;I recently wrote a white paper for Innodata Isogen titled &lt;a href=&#34;http://www.innodata-isogen.com/knowledge_center/white_papers/content_metadata_standards_wp&#34;&gt;Content Metadata Standards: Libraries, Publishers, and More&lt;/a&gt; that is available for free if you don&amp;rsquo;t mind registering first. (If you do register, you&amp;rsquo;ll find a nice choice of other white papers available on topics such as DITA, content re-use, and ebooks.)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve heard several people say &amp;ldquo;there are dozens and dozens of metadata standards out there! It&amp;rsquo;s all so confusing!&amp;rdquo; It&amp;rsquo;s really not that bad, and this paper addresses several key issues and tours through the more well-known standards. To summarize three of the main points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Dublin Core is pretty central. Some people complain that it&amp;rsquo;s too vague (for example, it has a &amp;ldquo;date&amp;rdquo; field, but date of what?) but being very generalized is what makes it so broadly useful. Most other metadata standards build on it for more specific uses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It seems like half the metadata standards out there are administered by the US Library of Congress. Many of these standards build on the LoC&amp;rsquo;s original &lt;a href=&#34;http://lcweb.loc.gov/marc/marcdocz.html&#34;&gt;MARC&lt;/a&gt; standard for bibliographic information, and others began at one library or university or another and moved to the LoC&amp;rsquo;s stewardship as they grew. (More good news from the LoC: I just found out from a &lt;a href=&#34;http://broadcast.oreilly.com/2008/10/us-library-of-congress-makes-a.html&#34;&gt;Rick Jelliffe posting&lt;/a&gt; that they&amp;rsquo;re putting together a set of &lt;a href=&#34;http://www.w3.org/Provider/Style/URI&#34;&gt;cool URIs&lt;/a&gt; for US federal legislation.) Many of these are focused on increasingly modern needs such as digital scholarship.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The more industry-specific standards, by their very nature, make it relatively easy to identify whether they&amp;rsquo;re relevant to what you as a publisher need. For example, if you&amp;rsquo;re involved in magazine publishing, &lt;a href=&#34;http://www.prismstandard.org&#34;&gt;PRISM&lt;/a&gt; will be valuable; for book publishing, there&amp;rsquo;s &lt;a href=&#34;http://www.editeur.org/onix.html&#34;&gt;ONIX&lt;/a&gt;. (&amp;ldquo;Involved in&amp;rdquo; here could mean being such a publisher yourself, but it could also mean having such publishers as business partners selling you content or buying it from you.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other issues covered by the paper are the OCLC&amp;rsquo;s five classes of metadata, which provide a nice framework when evaluating your own needs; content standards such as DocBook and DITA with built-in metadata slots; specialized vs. generalized metadata, and controlled, taxonomy-based keyword metadata vs. folksonomies.&lt;/p&gt;
&lt;p&gt;If there are any important issues about metadata that a general publishing audience would want to know about but which aren&amp;rsquo;t covered by the paper, please let me know.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-2124&#34;&gt;October 16, 2008 11:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We are considering adding explicit support for Dublin Core metadata as part of the Publishers schema we are creating in the DocBook Subcommittee for Publishers. Thoughts on that approach?&lt;/p&gt;
&lt;p&gt;I think ONIX might be a bit heavy-weight to add to our schema, though&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-2125&#34;&gt;October 16, 2008 11:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Forgot to add: Our reasoning behind adopting Dublin Core in the DocBook Publishers schema is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;interoperability&lt;/li&gt;
&lt;li&gt;widely recognized standard. DocBook already has support for external standards, such as SVG and MathML, so why not for metadata?&lt;/li&gt;
&lt;li&gt;tool support/integration. It should be easier for tool vendors to add support for a recognized industry standard.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&amp;ndash;Scott&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2126&#34;&gt;October 16, 2008 1:28 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Scott,&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s already a good chunk of Dublin Core in DocBook now, right[1]?&lt;/p&gt;
&lt;p&gt;ONIX support would mean narrowing your definition of publisher to mean &amp;ldquo;(hard copy?) book publisher,&amp;rdquo; and I&amp;rsquo;m sure you want to keep it broader than that.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;[1] &lt;a href=&#34;http://www.docbook.org/specs/cs-docbook-docbook-4.2.html#d0e652&#34;&gt;http://www.docbook.org/specs/cs-docbook-docbook-4.2.html#d0e652&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Learning more about SPARQL</title>
      <link>https://www.bobdc.com/blog/learning-more-about-sparql/</link>
      <pubDate>Fri, 10 Oct 2008 07:38:49 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/learning-more-about-sparql/</guid>
      
      
      <description><div>Improving the Bart blackboard query.</div><div>&lt;p&gt;Since I &lt;a href=&#34;https://www.bobdc.com/blog/querying-dbpedia&#34;&gt;first wrote on&lt;/a&gt; sending DBpedia SPARQL queries about Bart&amp;rsquo;s blackboard messages at the start of Simpsons episodes, I&amp;rsquo;ve learned a lot more about SPARQL (reading &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/&#34;&gt;the spec&lt;/a&gt; helped) and I wanted to walk through some of the things I&amp;rsquo;ve learned by expanding on and refining my original query.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.addletters.com/bart-simpson-generator.htm&#34;&gt;&lt;img id=&#34;id203611&#34; src=&#34;https://www.bobdc.com/img/main/bartsparql.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;360px&#34; alt=&#34;Bart and SPARQL query&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I had finished that entry by wondering how to list Bart&amp;rsquo;s blackboard entries for all episodes instead of for just one season. Vaclav Synacek &lt;a href=&#34;https://www.bobdc.com/blog/querying-dbpedia#c001393&#34;&gt;showed me one way&lt;/a&gt;, and I recently realized that there&amp;rsquo;s a much simpler way—maybe too simple (all queries shown assume the namespace declarations shown on the &lt;a href=&#34;http://dbpedia.org/snorql/&#34;&gt;SNORQL interface form&lt;/a&gt; for sending SPARQL queries to DBpedia):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?blackboard WHERE {
  ?s dbpedia2:blackboard ?blackboard.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(See it executed &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+%3Fblackboard+WHERE+%7B%0D%0A++%3Fs+dbpedia2%3Ablackboard+%3Fblackboard.%0D%0A%7D%0D%0A&#34;&gt;here&lt;/a&gt;.) What makes this too simple is that it asks for the dbpedia2:blackboard value for &lt;em&gt;anything&lt;/em&gt; in DBpedia, whether it&amp;rsquo;s a Simpsons episode or not. I wanted to only ask about Simpsons episodes—not that it comes up for anything else, but I thought this would be a good exercise—so I looked on &lt;a href=&#34;http://dbpedia.org/page/Tennis_the_Menace&#34;&gt;the DBpedia page&lt;/a&gt; for one episode and found a property called dbpedia2:portalProperty. For Simpsons episodes, it has a value of &amp;ldquo;The Simpsons&amp;rdquo;@en, with the final @en indicating that this string is in English, so I entered this query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?episode,?blackboard WHERE {
  ?episode dbpedia2:blackboard ?blackboard;
           dbpedia2:portalProperty &amp;quot;The Simpsons&amp;quot;@en.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(See it executed &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+%3Fepisode%2C%3Fblackboard+WHERE+%7B%0D%0A++%3Fepisode+dbpedia2%3Ablackboard+%3Fblackboard%3B%0D%0A+++++++++++dbpedia2%3AportalProperty+%22The+Simpsons%22%40en.%0D%0A%7D%0D%0A&#34;&gt;here&lt;/a&gt;.) This query and its answer set brought up two more questions for me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Some answers are URLs, and some are actual strings of what Bart wrote. How can I tell DBpedia to only give me the latter?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What&amp;rsquo;s a portalProperty, and what other values might show up there?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I learned from the spec how to filter the answer set so that only literal strings get returned, with no URLs (more technically, with no &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#QSynIRI&#34;&gt;IRIs&lt;/a&gt;): with the &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#func-isLiteral&#34;&gt;isLiteral operator&lt;/a&gt; in a &lt;a href=&#34;http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#tests&#34;&gt;filter&lt;/a&gt;, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?episode,?blackboard WHERE {
  ?episode dbpedia2:blackboard ?blackboard;
           dbpedia2:portalProperty &amp;quot;The Simpsons&amp;quot;@en.
  FILTER isLiteral(?blackboard)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(See it executed &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+%3Fepisode%2C%3Fblackboard+WHERE+%7B%0D%0A++%3Fepisode+dbpedia2%3Ablackboard+%3Fblackboard%3B%0D%0A+++++++++++dbpedia2%3AportalProperty+%22The+Simpsons%22%40en.%0D%0A++FILTER+isLiteral(%3Fblackboard)%0D%0A%7D%0D%0A&#34;&gt;here&lt;/a&gt;.) Now to the portalProperty. As I described in &lt;a href=&#34;https://www.bobdc.com/blog/how-you-can-explore-a-new-set&#34;&gt;How you can explore a new set of linked data&lt;/a&gt;, a query like the following lists all the values that came up for a particular property, although if there are too many, DBpedia may not return them all:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?pprop WHERE {
  ?s dbpedia2:portalProperty ?pprop.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(See it executed &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+DISTINCT+%3Fpprop+WHERE+%7B%0D%0A++%3Fs+dbpedia2%3AportalProperty+%3Fpprop.%0D%0A%7D%0D%0A&#34;&gt;here&lt;/a&gt;.) We want that DISTINCT keyword because otherwise we&amp;rsquo;re asking about &lt;em&gt;all&lt;/em&gt; the triples that have dbpedia2:portalProperty predicates, and we know that for the Simpsons alone that&amp;rsquo;s over a hundred repetitions.&lt;/p&gt;
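Each "See it executed" link in this post is just the query itself percent-encoded onto the snorql endpoint's query parameter. A minimal Python sketch of building such a link (quote_plus is from the standard library; snorql's own links encode line breaks as %0D%0A, so the exact encoding may differ slightly):

```python
from urllib.parse import quote_plus

# the DISTINCT query from above, percent-encoded into a shareable snorql URL
query = """SELECT DISTINCT ?pprop WHERE {
  ?s dbpedia2:portalProperty ?pprop.
}"""
url = "http://dbpedia.org/snorql/?query=" + quote_plus(query)
print(url)
```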
&lt;p&gt;The list of potential portalProperty values is interesting, but not all Simpsons episodes have this property assigned, so asking for dbpedia2:blackboard values for any subjects that have a dbpedia2:portalProperty value of &amp;ldquo;The Simpsons&amp;rdquo;@en won&amp;rsquo;t give us a complete list of blackboard gags. Most episodes seem to have a dbpedia2:reference property pointing to the page on thesimpsons.com for that episode, so I considered querying for dbpedia2:blackboard values for any subjects that have a dbpedia2:reference value with &amp;ldquo;thesimpsons.com&amp;rdquo; in it, but then I realized that this wouldn&amp;rsquo;t be much different from Vaclav&amp;rsquo;s solution.&lt;/p&gt;
&lt;p&gt;The real point is that as I learn more about SPARQL (and DBpedia), I&amp;rsquo;m finding more ways to explore this huge set of interesting data and more ways to control the data that&amp;rsquo;s returned to me. Checking what Bart wrote on the blackboard is fun, but I have some more interesting ideas in the works.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://feelitlive.com&#34; title=&#34;http://feelitlive.com&#34;&gt;Simon Gibbs&lt;/a&gt; on &lt;a href=&#34;#comment-2121&#34;&gt;October 11, 2008 2:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Inspiring post! I had a shot at it and discovered you need to use a couple of techniques to get it close. I wonder if inference rules embedded in the DB might also be required to deal with categories properly.&lt;/p&gt;
&lt;p&gt;SELECT ?chalkboard_gag WHERE {&lt;br /&gt;
{&lt;br /&gt;
{?episode skos:subject _:category . _:category skos:broader }&lt;br /&gt;
UNION&lt;br /&gt;
{?episode skos:subject }&lt;br /&gt;
}&lt;br /&gt;
?episode dbpedia2:blackboard ?chalkboard_gag&lt;br /&gt;
FILTER isLiteral(?chalkboard_gag)&lt;br /&gt;
FILTER (?chalkboard_gag != &amp;quot;None&amp;quot;@en)&lt;br /&gt;
}&lt;/p&gt;
&lt;p&gt;There are a couple of interesting modelling issues in here as well. Many values are not atomic, most are pre-formatted with quotes, and Lisa has a quote in one episode. Perhaps most interesting, there is a child-safe and non-child-safe version of one message - and rightly so, you might not want to encourage your audience to visit that domain.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://semantic.umwblogs.org&#34; title=&#34;http://semantic.umwblogs.org&#34;&gt;Patrick Murray-John&lt;/a&gt; on &lt;a href=&#34;#comment-2123&#34;&gt;October 13, 2008 10:00 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Many thanks! I&amp;rsquo;m realizing more and more how useful it is to share SPARQL queries focused on particular topics, just to get a broader sense of the twists and turns we all encounter.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a tiny one I discovered this week. Using ARC2, I plugged in a query to DBpedia that worked just fine in the query form at &lt;a href=&#34;http://dbpedia.org/sparql/.&#34;&gt;http://dbpedia.org/sparql/.&lt;/a&gt; But it failed in the PHP script. Turned out that, when defining prefixes, I had included a space between the prefix and the URI. That worked in the form, but failed the script. Just a wee gotcha in case anyone else encounters it.&lt;/p&gt;
&lt;p&gt;Patrick&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2529&#34;&gt;June 3, 2010 10:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This topic arose again in this StackOverflow question:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://stackoverflow.com/questions/2956449/linked-data-and-endpoint/&#34;&gt;http://stackoverflow.com/questions/2956449/linked-data-and-endpoint/&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>(semantic web) - semantics = linked data?</title>
      <link>https://www.bobdc.com/blog/semantic-web-semantics-linked/</link>
      <pubDate>Tue, 07 Oct 2008 09:53:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-semantics-linked/</guid>
      
      
      <description><div>So much of the best &#34;semantic web&#34; technology has little to do with semantics.</div><div>&lt;p&gt;When people talk about semantic technology, they&amp;rsquo;re often talking about technology that has nothing to do with semantics. They&amp;rsquo;re talking about the new possibilities that the RDF data model and the SPARQL query language add to distributed database applications, and there&amp;rsquo;s a lot to talk about. As Jim Hendler &lt;a href=&#34;http://swig.xmlhack.com/2007/11/30/2007-11-30.html&#34;&gt;once wrote&lt;/a&gt;,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My document can point at your document on the Web, but my database can&amp;rsquo;t point at something in your database without writing special purpose code. The Semantic Web aims at fixing that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why do we describe technology for easier integration of machine-readable data on the web as &amp;ldquo;semantic&amp;rdquo;? I don&amp;rsquo;t mean to pick on Jim—I had the quote handy because it&amp;rsquo;s in my file of favorite quotes, and few understand the semantic add-ons to Linked Data that will make for a proper Semantic Web better than he does—but I don&amp;rsquo;t see semantics &lt;em&gt;necessarily&lt;/em&gt; playing much role in the technology evolving to let web databases easily point at each other. There are some semantics built into the middle third of all RDF triples, because the requirement that a predicate use a full URL means that I can&amp;rsquo;t just say &amp;ldquo;title&amp;rdquo; there, leaving you to wonder whether I&amp;rsquo;m talking about a job title, the deed to a piece of property, or the title of a work; I have to say something like &lt;a href=&#34;http://purl.org/dc/elements/1.1/title&#34;&gt;http://purl.org/dc/elements/1.1/title&lt;/a&gt; to make it clear that I mean the title of a work. In other words, I must make the semantics of the triple&amp;rsquo;s predicate clear.&lt;/p&gt;
&lt;blockquote id=&#34;id203639&#34; class=&#34;pullquote&#34;&gt;There is plenty of payoff when applications can combine data from different sources to do things with no need for a central schema tying them together, and this is possible without any program logic addressing the semantics of that data.&lt;/blockquote&gt;
&lt;p&gt;Other than that, I don&amp;rsquo;t see what&amp;rsquo;s semantic about exposing data as triples and using SPARQL to get at it as described by &lt;a href=&#34;http://www.w3.org/DesignIssues/LinkedData.html&#34;&gt;Tim Berners-Lee&amp;rsquo;s original essay on Linked Data principles&lt;/a&gt;, except that the general ideas are an outgrowth of the older idea of the Semantic Web. We&amp;rsquo;re seeing now that as more data gets exposed and linked this way, more and more possibilities open up. Once enough data is linked using this technology, then there will be enough to work with to start making general-purpose semantic applications, but until then, the use of OWL and related technologies that really address semantics will be limited to niches. Companies such as &lt;a href=&#34;http://www.topquadrant.com/&#34;&gt;TopQuadrant&lt;/a&gt; and &lt;a href=&#34;http://clarkparsia.com/&#34;&gt;Clark &amp;amp; Parsia&lt;/a&gt; are doing very interesting work in those niches, and they&amp;rsquo;re blazing the trails for when the broader information technology and publishing worlds are ready to take advantage of the semantics of this linked data. (In a recent Semantic Web gang podcast, someone said that new technology traditionally moves from NASA to the military to corporations to independent end users, and that we&amp;rsquo;re seeing the reverse with Semantic Web adoption. I guess he didn&amp;rsquo;t know that NASA is a client of both TopQuadrant and Clark &amp;amp; Parsia.)&lt;/p&gt;
&lt;p&gt;While &lt;a href=&#34;http://www.zepheira.com/&#34;&gt;Zepheira&amp;rsquo;s&lt;/a&gt; web site certainly uses the word &amp;ldquo;semantic&amp;rdquo; a lot, they seem more focused on linked data technologies as they focus on helping their clients &amp;ldquo;integrate, navigate and manage information across personal, group and enterprise boundaries.&amp;rdquo; I think that this is a better place for most developers to focus on, at least for now, because there&amp;rsquo;s a better chance of a medium- and even short-term payoff. That&amp;rsquo;s the data infrastructure that actual semantic technologies can build on, so for now let&amp;rsquo;s focus on the value of the infrastructure: data exposed (either publicly or behind the firewall across internal enterprise boundaries, which I believe is where Zepheira&amp;rsquo;s been helping a lot of clients) in a standard way so that the growing number of tools built around those standards can take advantage of that data. This is just what the organizations in the &lt;a href=&#34;http://richard.cyganiak.de/2007/10/lod/&#34;&gt;Linking Open Data dataset cloud&lt;/a&gt; have been doing. There is plenty of payoff when applications can combine data from different sources to do things with no need for a central schema tying them together, and this is possible without any program logic addressing the semantics of that data.&lt;/p&gt;
&lt;p&gt;Of course the real semantic technologies such as OWL and inferencing engines build on that, so this will bring even cooler applications. Nevertheless, to evangelize the data infrastructure that this will build on and to allay the fears of enterprise IT people who remember pie-in-the-sky AI promises when they hear the word &amp;ldquo;semantic&amp;rdquo;, telling them about Semantic Web technology without the semantic parts (a.k.a. Linked Data) looks like an easier sell to me.&lt;/p&gt;
&lt;p&gt;Comments? Corrections? Is the full URL in predicates enough to say that any use of RDF triples qualifies as semantic technology? (If anyone tells me that I&amp;rsquo;m misunderstanding the term &amp;ldquo;semantics&amp;rdquo;, I&amp;rsquo;ll be tempted to say &amp;ldquo;well, that&amp;rsquo;s just &lt;a href=&#34;https://www.bobdc.com/img/main/lockhorns20051030.gif&#34;&gt;semantics&lt;/a&gt;&amp;rdquo;, so be forewarned.)&lt;/p&gt;
&lt;h2 id=&#34;9-comments&#34;&gt;9 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.siatec.net/&#34; title=&#34;http://www.siatec.net/&#34;&gt;Simone Onofri&lt;/a&gt; on &lt;a href=&#34;#comment-2110&#34;&gt;October 7, 2008 10:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Talking with more people about the (semantic) web, the word &amp;ldquo;semantic&amp;rdquo; is always confused with an AI meaning&amp;hellip; Linked Data works well for this&amp;hellip; and there is an interesting trick in your title: semantic web - semantic = web (the real one, not 2.0, 3.0 and so on)&lt;/p&gt;
&lt;p&gt;Cheers,&lt;/p&gt;
&lt;p&gt;Simone&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/&#34; title=&#34;http://clarkparsia.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-2111&#34;&gt;October 7, 2008 4:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Why do people say &amp;ldquo;semantic&amp;rdquo; when there isn&amp;rsquo;t much semantics in what they&amp;rsquo;re doing? Well, marketing mojo is a perfectly reasonable answer. I don&amp;rsquo;t mean to be cynical, since this seems an unobjectionable strategy.&lt;/p&gt;
&lt;p&gt;The same goes for people&amp;rsquo;s worries that &amp;ldquo;semantic&amp;rdquo; equals AI &amp;ndash; marketing is, by definition, a pragmatic endeavor, so if in some cases it makes sense to soft-sell that connection, one should soft-sell it. If in other cases &amp;ndash; which the anti-OWL crowd never seems to consider &amp;ndash; the connection between semantics and AI is a plus for a customer, then you can emphasize it.&lt;/p&gt;
&lt;p&gt;For a more technical answer, re: Jim&amp;rsquo;s thing about pointing at other people&amp;rsquo;s databases&amp;hellip; This is tricky. Even within organizations, or within parts of organizations, integrating directly with someone else&amp;rsquo;s database is tricky, often introducing a tight coupling that you don&amp;rsquo;t really want.&lt;/p&gt;
&lt;p&gt;Using some &amp;ldquo;semantics&amp;rdquo; in this context really means integrating data models (or service interfaces) rather than integrating data sources directly, such that consumers and producers are sufficiently decoupled to be able to ignore some (though not all) changes in the underlying data.&lt;/p&gt;
&lt;p&gt;The standard way to do this (and the way which is in line with historical trends in IT) is to have some declarative abstract representation of the data source, or database, and integrate with *that* thing, since it will tend to be more change resistant than the underlying thing it is an abstraction of. Hence the use of ontologies for integration, etc. In this usage pattern, a reasoner is an aid to (1) developing the ontology in the first place, and (2) a supplement to the code you write to integrate with the ontology instead of the thing it represents.&lt;/p&gt;
&lt;p&gt;(So you get the reasoner to check that the model is logically consistent, to do subclass and subproperty inference, or most specific type realization, or inference explanation in order to shorten the total amount of code you have to write, etc.)&lt;/p&gt;
&lt;p&gt;RDFS gives you some abstraction constructs over the underlying messy reality, but if you&amp;rsquo;re doing RDFS, you&amp;rsquo;re not exactly semantics-less. OWL gives you more, obviously. ISO Common Logic gives you even more &amp;ndash; at least, in principle &amp;ndash; at the cost of some tradeoffs, etc.&lt;/p&gt;
&lt;p&gt;But it&amp;rsquo;s this problem of direct coupling of data sources that makes me think that the Linked Data thing, at least as I presently understand it, is not a useful approach for the sorts of things we&amp;rsquo;re trying to do. Oh, and there&amp;rsquo;s my skepticism about the claims of network effect &amp;ndash; that once you get enough &amp;ldquo;linked data&amp;rdquo; some cool semantics effects emerge. I think there&amp;rsquo;s no reason whatever to believe that will happen. Or to put my skepticism in a weaker, falsifiable form: no one has explained, with sufficient detail, a plausible scenario whereby having a lot of &amp;ldquo;linked data&amp;rdquo; means you don&amp;rsquo;t need to build models or ontologies or etc.&lt;/p&gt;
&lt;p&gt;Oh, and PS: The rhetorical strategies around the notion of &amp;ldquo;niche&amp;rdquo; &amp;ndash; OWL is the &amp;ldquo;niche&amp;rdquo;, Linked Data is the &amp;ldquo;mainstream&amp;rdquo; &amp;ndash; rely on a shared set of empirical data (or shared set of empirical *hunches and intuitions*) about what&amp;rsquo;s getting used more often, when, where, etc. Apparently we don&amp;rsquo;t share the same data or intuition with you, Bob, such that OWL is the &amp;ldquo;niche&amp;rdquo; and Linked Data is the mainstream.&lt;/p&gt;
&lt;p&gt;It may seem that way in the semweb blogger echo chamber, but it doesn&amp;rsquo;t seem that way anywhere else, at least not to me. FYI. :&amp;gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2112&#34;&gt;October 7, 2008 9:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Kendall,&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s a good point about data integration. Saying that field W in database X is the same as field Y in database Z is not something to do lightly if you&amp;rsquo;re doing updates based on those values, so understanding and documenting the semantics of those fields makes such an association much more robust.&lt;/p&gt;
&lt;p&gt;I certainly don&amp;rsquo;t believe that once you get enough linked data some cool semantics effects will emerge spontaneously; my point is that as more and more interesting public data sets become available and point at each other, there will be more opportunities to create ontologies that add value to that data and write apps that take advantage of that added value, which is where I think the real semantic goodness lies, at least in terms of apps with the potential for wide deployment.&lt;/p&gt;
&lt;p&gt;Perhaps I should have described in more detail why I wrote this posting. I see organizations in industries like publishing (which I will address more specifically in the near future) asking what semantic technologies can do for them, but the concept of &amp;ldquo;semantic technologies&amp;rdquo; is such a vague blob to them that both for them and those helping them to address the question it becomes more difficult to line up potential actions with potential benefits. I think that breaking down the categories of semantic technologies into related units will make this easier, and my &amp;ldquo;(semantic web) - semantics = linked data&amp;rdquo; cut, while obviously broad and generalized, is an attempt at this.&lt;/p&gt;
&lt;p&gt;I certainly don&amp;rsquo;t feel that Linked Data is mainstream. The semantic web marketing that you mentioned has a much bigger head start. I used &amp;ldquo;niches&amp;rdquo; to describe those who can really take advantage of semantics at this point; I&amp;rsquo;m sure it will move beyond niches over time. (I look forward to it!) I see the potential value of exposing data in SPARQL endpoints, without necessarily defining an ontology for the data sets, as having more potential for publishers and many others for now. In other words, I&amp;rsquo;m not talking about what&amp;rsquo;s getting used more often, but what I feel has more short-term potential. Just my opinion.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-2113&#34;&gt;October 8, 2008 6:44 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(I won&amp;rsquo;t repeat what Kendall has already said re data integration strategies)&lt;/p&gt;
&lt;p&gt;While I&amp;rsquo;m not a huge fan of the *word* &amp;ldquo;semantics&amp;rdquo; (many find it confusing or obscure), there are plenty of semantics intimately involved in all RDF-based linked data activities. At the heart of the SW effort is a project to make mechanically clearer what Web documents are telling us. A big part of this is to do with reference - knowing what real world entities are being described. Colloquially, &amp;ldquo;what they are about&amp;rdquo;.&lt;br /&gt;
Linked data efforts care about that at least as much as the rest of the SW world: URIs for things, well known URIs for things, URIs for things that can be readily used to find good and machine-readable descriptions of those things,&amp;hellip;. And at least to the extent they use FOAF constructs and habits, there&amp;rsquo;s some modest but significant use of OWL too: the use of the &amp;lsquo;inverse functional property&amp;rsquo; construct (eg. isPrimaryTopicOf) to help point out identifying properties in a description, even if the property itself is not one known to an aggregator.&lt;/p&gt;
&lt;p&gt;In general I&amp;rsquo;m pretty wary of encouraging SW enthusiasts to fracture into competing sub-tribes. There is too much &amp;ldquo;we don&amp;rsquo;t need that fancy academic OWL&amp;rdquo; rhetoric floating around, which is to my mind as senseless as having Java users berate Javadoc and IDEs.&lt;/p&gt;
&lt;p&gt;Even a tiny little vocabulary the size of FOAF is complex enough that internal contradictions and other mistakes are a real risk (&amp;lsquo;can documents be agents? are onlineaccounts documents? or agents? can they be both? can two different documents have the same foaf:sha1 value? why not&amp;rsquo;, etc.). OWL is a tool that can help us achieve clarity, and detect inclarities, in this area, regardless of whether there are &amp;ldquo;intelligent agents&amp;rdquo; running around at click-time drawing inferences and doing while-u-wait inferences.&lt;/p&gt;
&lt;p&gt;And don&amp;rsquo;t get me started on &amp;lsquo;owl:sameAs&amp;rsquo;, &amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/&#34; title=&#34;http://clarkparsia.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-2114&#34;&gt;October 8, 2008 2:35 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In general I&amp;rsquo;m pretty wary of encouraging SW enthusiasts to fracture into competing sub-tribes. There is too much &amp;ldquo;we don&amp;rsquo;t need that fancy academic OWL&amp;rdquo; rhetoric floating around, which is to my mind as senseless as having Java users berate Javadoc and IDEs.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Dan, I agree:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://clarkparsia.com/weblog/2006/11/13/cooperation-competition-and-growing-markets-or-why-expressivity-wars-are-stupid/&#34;&gt;http://clarkparsia.com/weblog/2006/11/13/cooperation-competition-and-growing-markets-or-why-expressivity-wars-are-stupid/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.topquadrant.com/&#34; title=&#34;http://www.topquadrant.com/&#34;&gt;Dean Allemang&lt;/a&gt; on &lt;a href=&#34;#comment-2115&#34;&gt;October 8, 2008 8:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I used to apologize for the word &amp;ldquo;Semantic&amp;rdquo; in &amp;ldquo;Semantic Web&amp;rdquo;, until a student in one of my classes who happened to be a professional linguist told me to stop apologizing. Why? Because, he told me, there are many meanings of the word &amp;ldquo;Semantics&amp;rdquo; in Linguistics, including speech acts, formal semantics, etc. But, he pointed out, all of them refer to one very simple notion of semantics - that a symbol can refer to something in the world. He went so far as to say that this was the fundamental notion of &amp;ldquo;Semantics&amp;rdquo; in linguistics. Other linguists might challenge that statement for linguistics in general, but it holds up in the Semantic Web. The basic idea of linked data is that you can refer to something in the world with a symbol (where a symbol is a URI).&lt;/p&gt;
&lt;p&gt;This is the basis for the non-niche work that makes up the bulk of TopQuadrant&amp;rsquo;s custom; in fact, as far as we are concerned, the jury is still out on the usefulness of &amp;ldquo;OWL and related technologies&amp;rdquo; in real enterprise applications. Our customers are getting on pretty well with the more basic notion of &amp;ldquo;Semantics&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a reason why Jim and I called our book &amp;ldquo;Working Ontologist&amp;rdquo; - we only refer to OWL inasmuch as it can be used to specify how datasources relate to one another.&lt;/p&gt;
&lt;p&gt;By Irene Polikoff on &lt;a href=&#34;#comment-2116&#34;&gt;October 8, 2008 8:40 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;TopQuadrant is certainly doing a lot of complex ontology-based work for NASA. Having said this, our business is about helping organizations harness (read - integrate, share, analyze) information distributed across systems and parties. Much of the work at NASA is about data integration.&lt;/p&gt;
&lt;p&gt;The majority of our customers use pretty light ontologies/schemas. There is no way of getting away from some kind of a schema or structure – XML has it, spreadsheets have it, databases have it, etc. And this is what our customers are bringing together. TopBraid Suite generates an RDFS/OWL representation of the schemas used to interpret the data so that the data and its structure can be exposed in RDF for SPARQL queries - either by converting the data or by translating SPARQL queries into SQL. We see ourselves as a SPARQL company. Take a look, for example, at: &lt;a href=&#34;http://topquadrant.com/sparqlmotion/&#34;&gt;http://topquadrant.com/sparqlmotion/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Most of the data sources our customers want to integrate tend to be internal, but some are external – from the technology perspective we do not really see any difference. One business difference is whether a customer does it to expose the data outside their organization on the World Wide Web. A few are considering doing this, but the majority want a more flexible way of integrating data and creating and exposing data services to other parties within their own and partnering organizations. Many also want to take advantage of the flexible schemas and databases they get from using RDFS/OWL as opposed to the more rigid world of relational databases. We see the latter benefit as being of considerable interest to companies involved in managing and publishing content and wanting to have flexible taxonomies and metadata.&lt;/p&gt;
&lt;p&gt;Since the areas we see most developers focusing on are in line with what you have described in the post (managing, navigating and integrating information), I guess we agree on where the value is.&lt;/p&gt;
&lt;p&gt;What I am not so sure about is the contrast you are drawing between this and the word “semantic”. If “semantic” is interpreted as a focus on complex description logic ontologies, then we see some of it here and there, but not much. We do see people wanting to express their business rules as part of the data integration and application development. For example, in the vocabulary management application there could be a rule that indicates that a “level” number of a topic needs to be changed in a certain way if it is moved within a hierarchy. TopBraid Suite makes it easy to automate this.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://uche.posterous.com/&#34; title=&#34;http://uche.posterous.com/&#34;&gt;Uche Ogbuji&lt;/a&gt; on &lt;a href=&#34;#comment-2119&#34;&gt;October 9, 2008 11:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m speaking with my Uche hat on, because I think in many ways, Zepheira is an integration of differing perspectives (I&amp;rsquo;m, for example, more of an XML type than most). As for the word &amp;ldquo;semantics&amp;rdquo;, it&amp;rsquo;s interested me even before I got heavily into RDF (just before, to be fair). The spark was Robin Cover&amp;rsquo;s 1998 article, &lt;a href=&#34;http://xml.coverpages.org/xmlAndSemantics.html&#34;&gt;&amp;ldquo;XML and Semantic Transparency&amp;rdquo;&lt;/a&gt;. I still use the term &amp;ldquo;semantic transparency&amp;rdquo; a lot to describe the gap left by the base layer of XML technology.&lt;/p&gt;
&lt;p&gt;In the side bar to my 2000 article &lt;a href=&#34;http://www.ibm.com/developerworks/library/w-rdf/&#34;&gt;&amp;ldquo;An introduction to RDF&amp;rdquo;&lt;/a&gt; I was already speaking of the curse of the &amp;ldquo;S&amp;rdquo; word. In the end, as Kendall pointed out, it all comes down to marketing, and that&amp;rsquo;s fine. Marketing is all about communication, and if &amp;ldquo;semantic&amp;rdquo; allows those visiting our site to understand what we want them to understand, we&amp;rsquo;ll use the term.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t think there is dichotomy between linked data and &amp;ldquo;semantics&amp;rdquo;. As I&amp;rsquo;ve argued a lot in my &amp;ldquo;Thinking XML&amp;rdquo;, I think that simple links, e.g. within schema definitions, to some source of semantic agreement, whether expressed in RDF or otherwise, is sufficient for most needs, and sufficient to make a huge change in the value of bodies of data. See, for example:&lt;/p&gt;
&lt;p&gt;* &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-tipdict.html&#34;&gt;http://www.ibm.com/developerworks/xml/library/x-tipdict.html&lt;/a&gt;&lt;br /&gt;
* &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-think32.html&#34;&gt;http://www.ibm.com/developerworks/xml/library/x-think32.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think this is a very close correspondence to linked data. I know that e.g. Tim BL doesn&amp;rsquo;t like my advocacy of Linked Data without insisting on every other spec built on top of RDF, but I don&amp;rsquo;t see my lack of enthusiasm for SPARQL and OWL as schism-making. If the RDF community thinks itself healthy and viable, it&amp;rsquo;s going to have to accommodate deep differences of technical opinion. We can&amp;rsquo;t all just be a happy-happy-joy-joy coxless eight.&lt;/p&gt;
&lt;p&gt;As for Zepheira, we cleave to the practical. All our architects can agree on the outlines of Linked Data, and I offer pretty sharp tools for using this to bring semantic transparency to XML, and re-animating XML and other large, dead bodies of data is generally a large concern for our clients, so both shoes fit rather nicely, and we&amp;rsquo;re happy to wear them.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.furia.com&#34; title=&#34;http://www.furia.com&#34;&gt;glenn mcdonald&lt;/a&gt; on &lt;a href=&#34;#comment-2224&#34;&gt;February 17, 2009 10:17 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I continue to think that even Linked Data, by which we mean my pieces of data can link to your pieces of data across the web, is secondary. The big change is in database structure: from relational tables to graphs. It&amp;rsquo;s in the way that my pieces of data link to &lt;em&gt;my other&lt;/em&gt; pieces of data. This is the core thing RDF does (but not well enough), and what SPARQL exists to take advantage of (but not well enough). And yes, it&amp;rsquo;s also the foundation for linking separate datasets across the web, but as long as we keep talking about it as primarily (exclusively?) an integration strategy, it will seem peripheral to most of the people who currently have the data&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Querying wiki/dbpedia for presidents&#39; ages at inauguration</title>
      <link>https://www.bobdc.com/blog/querying-wikidbpedia-for-presi/</link>
      <pubDate>Tue, 30 Sep 2008 09:10:25 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-wikidbpedia-for-presi/</guid>
      
      
<description><div>Easier than Jon Udell had thought.</div><div>&lt;p&gt;In an &lt;a href=&#34;http://itc.conversationsnetwork.org/shows/detail3793.html#&#34;&gt;August 19th&lt;/a&gt; interview with Jon Udell, David Huynh of &lt;a href=&#34;http://www.freebase.com/&#34;&gt;Freebase&lt;/a&gt; (and formerly of MIT&amp;rsquo;s &lt;a href=&#34;http://simile.mit.edu/&#34;&gt;Project Simile&lt;/a&gt;) introduced his Freebase demo by describing a hypothetical query to a database asking for presidents&amp;rsquo; ages when they were inaugurated and whether there&amp;rsquo;s a trend toward younger presidents. Jon replies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If it were possible to issue a database query over Wikipedia, then you could ask a question like that; you could say give me—well first of all, it would presume that you could identify US presidents, and it would further presume that you could find a field within those documents that would say the ages of those people, and that&amp;rsquo;s not really part of the structure of Wikipedia. This information can be explicitly made available in Freebase. It hasn&amp;rsquo;t in all cases, and that&amp;rsquo;s part of the social process. So it ultimately relies on people to refine this raw information that came from Wikipedia and elsewhere so that it is more fielded and structured.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote id=&#34;id203634&#34; class=&#34;pullquote&#34;&gt;DBPedia + SPARQL is my new favorite toy.&lt;/blockquote&gt;
&lt;p&gt;They then go on to do such a query with Freebase&amp;hellip; but they could have done it with Wikipedia, with a little help from SPARQL and &lt;a href=&#34;http://en.wikipedia.org/wiki/Dbpedia&#34;&gt;DBpedia&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Wikipedia has plenty of fielded information in &lt;a href=&#34;http://en.wikipedia.org/wiki/Help:Infobox&#34;&gt;infoboxes&lt;/a&gt;. DBpedia lets you access this collection of data &lt;a href=&#34;http://wiki.dbpedia.org/OnlineAccess?v=11r9&#34;&gt;via a SPARQL endpoint&lt;/a&gt;. While Wikipedia (and hence DBpedia) has no field for a president&amp;rsquo;s age at inauguration, it does store their birthdate and the year they began their first term, so calculating their ages when they each became president is pretty easy.&lt;/p&gt;
&lt;p&gt;You could see a list of US Presidents by going to Wikipedia&amp;rsquo;s &lt;a href=&#34;http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States&#34;&gt;List of Presidents of the United States&lt;/a&gt; page, but let&amp;rsquo;s do it programmatically with this SPARQL query so that we can build from there to get their ages at inauguration. We ask for the things in the database that have a subject of &amp;ldquo;Presidents of the United States&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
SELECT ?presName WHERE {
  ?presName skos:subject &amp;lt;http://dbpedia.org/resource/Category:Presidents_of_the_United_States&amp;gt;.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To see the query in action, click &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+%3FpresName+WHERE+%7B%0D%0A++%3FpresName+skos%3Asubject+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3APresidents_of_the_United_States%3E.%0D%0A%7D%0D%0A%0D%0A&#34;&gt;this executable URL version&lt;/a&gt;.&lt;/p&gt;
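&lt;p&gt;That executable URL is nothing magic: it&amp;rsquo;s just the query percent-encoded into snorql&amp;rsquo;s query parameter. Here&amp;rsquo;s a quick Python sketch of building such a URL yourself (the endpoint address comes from the link above; the encoding details differ cosmetically from the link, e.g. %20 versus + for spaces):&lt;/p&gt;

```python
from urllib.parse import quote

SNORQL = "http://dbpedia.org/snorql/?query="

query = """SELECT ?presName WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>.
}"""

def executable_url(sparql):
    # Percent-encode the whole query so it survives as one URL parameter.
    return SNORQL + quote(sparql, safe="")

url = executable_url(query)
print(url)
```

&lt;p&gt;Paste the printed URL into a browser and snorql runs the query.&lt;/p&gt;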
&lt;p&gt;A slightly more complex query lists the name, birth date, and beginning of the first term of each one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;
PREFIX dbpedia2: &amp;lt;http://dbpedia.org/property/&amp;gt;
SELECT ?presName, ?birthday, ?startDate WHERE {
  ?presName skos:subject &amp;lt;http://dbpedia.org/resource/Category:Presidents_of_the_United_States&amp;gt;;
            dbpedia2:birth ?birthday;
            dbpedia2:presidentStart ?startDate.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Click &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+%3FpresName%2C%3Fbirthday%2C+%3FstartDate+WHERE+%7B%0D%0A++%3FpresName+skos%3Asubject+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3APresidents_of_the_United_States%3E%3B%0D%0A++++++++++++dbpedia2%3Abirth+%3Fbirthday%3B%0D%0A++++++++++++dbpedia2%3ApresidentStart+%3FstartDate.%0D%0A%7D%0D%0A%0D%0A&#34;&gt;here&lt;/a&gt; to see it in action. Because the fielded information for the various presidents is not consistent, only 19 of them have dbpedia2:birth and dbpedia2:presidentStart fields, so you&amp;rsquo;ll only see those presidents returned for this query. Wikipedia pages for all US presidents do have this information, but it&amp;rsquo;s not always named the same way—compare the dbpedia pages for &lt;a href=&#34;http://dbpedia.org/page/Zachary_Taylor&#34;&gt;Zachary Taylor&lt;/a&gt; and &lt;a href=&#34;http://dbpedia.org/page/Lyndon_B._Johnson&#34;&gt;Lyndon Johnson&lt;/a&gt;, who doesn&amp;rsquo;t show up on the list returned by that last query, for some examples. As Jon said, filling out that data is part of the social process.&lt;/p&gt;
&lt;p&gt;The real promise of Linked Data is the ability to write a program or script that grabs the data and does something with it, so I wrote a two-line batch file that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;uses &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;curl&lt;/a&gt; to send that URL to DBpedia and store the results in an XML file&lt;/li&gt;
&lt;li&gt;runs a short XSLT script to calculate the presidents&amp;rsquo; ages at inauguration&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It doesn&amp;rsquo;t look much like a two-line batch file here, so before running it, replace the first six carriage returns with spaces to turn the first seven lines into one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -o presidentAges.xml -F &amp;quot;query=PREFIX dbpedia2: 
  &amp;lt;http://dbpedia.org/property/&amp;gt; PREFIX skos: 
  &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; SELECT ?presName,?birthday, 
  ?startDate WHERE { ?presName skos:subject 
  &amp;lt;http://dbpedia.org/resource/Category:Presidents_of_the_United_States&amp;gt;; 
  dbpedia2:birth ?birthday; dbpedia2:presidentStart ?startDate.}&amp;quot; 
  http://dbpedia.org/sparql 
xsltproc presidentAges.xsl presidentAges.xml
&lt;/code&gt;&lt;/pre&gt;
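&lt;p&gt;If you&amp;rsquo;d rather skip the XSLT step, the same post-processing can be sketched with Python&amp;rsquo;s standard library, because the endpoint returns the standard SPARQL Query Results XML Format. The sample document below is a hand-trimmed illustration of that format, not actual endpoint output:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

NS = {"s": "http://www.w3.org/2005/sparql-results#"}

# A trimmed illustration of the SPARQL Query Results XML Format.
sample = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <results>
    <result>
      <binding name="presName">
        <uri>http://dbpedia.org/resource/Bill_Clinton</uri>
      </binding>
      <binding name="birthday"><literal>1946-08-19</literal></binding>
      <binding name="startDate"><literal>1993</literal></binding>
    </result>
  </results>
</sparql>"""

def ages(results_xml):
    """Mirror the stylesheet's logic: strip the resource prefix from the
    name and subtract the birth year (plus one, assuming an unreached
    birthday) from the start year."""
    rows = []
    root = ET.fromstring(results_xml)
    for result in root.iter("{http://www.w3.org/2005/sparql-results#}result"):
        b = {el.get("name"): el[0].text
             for el in result.findall("s:binding", NS)}
        name = b["presName"].rsplit("/", 1)[-1].replace("_", " ")
        age = int(b["startDate"]) - int(b["birthday"][:4]) - 1
        rows.append((name, age))
    return rows

print(ages(sample))  # [('Bill Clinton', 46)]
```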
&lt;p&gt;Here is the XSLT stylesheet, presidentAges.xsl:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;
                xmlns:s=&amp;quot;http://www.w3.org/2005/sparql-results#&amp;quot;
                version=&amp;quot;1.0&amp;quot;&amp;gt;


  &amp;lt;xsl:strip-space elements=&amp;quot;*&amp;quot;/&amp;gt;
  &amp;lt;xsl:output method=&amp;quot;text&amp;quot;/&amp;gt;




  &amp;lt;xsl:template match=&amp;quot;s:result&amp;quot;&amp;gt;


    &amp;lt;xsl:variable name=&amp;quot;birthYear&amp;quot;
                  select=&amp;quot;substring(
                          s:binding[@name=&#39;birthday&#39;]/s:literal,1,4)&amp;quot;/&amp;gt;
    &amp;lt;xsl:variable name=&amp;quot;presidentName&amp;quot;
                  select=&amp;quot;substring(s:binding[@name=&#39;presName&#39;]/s:uri,29)&amp;quot;/&amp;gt;


    &amp;lt;xsl:value-of select=&amp;quot;translate($presidentName,&#39;_&#39;,&#39; &#39;)&amp;quot;/&amp;gt;
    &amp;lt;xsl:text&amp;gt; &amp;lt;/xsl:text&amp;gt;
    &amp;lt;xsl:value-of select=&amp;quot;s:binding[@name=&#39;startDate&#39;]/
                          s:literal - $birthYear - 1&amp;quot;/&amp;gt;
&amp;lt;xsl:text&amp;gt;
&amp;lt;/xsl:text&amp;gt;
  &amp;lt;/xsl:template&amp;gt;
&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I subtracted the birth year and then another 1 from the startDate because with inaugurations being in January (at least in modern times) I assumed that each president hadn&amp;rsquo;t reached his birthday yet. Here is the result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Abraham Lincoln 51
Andrew Johnson 56
Bill Clinton 46
Chester A. Arthur 50
Franklin Pierce 48
George H. W. Bush 64
George Washington 56
Harry S. Truman 60
James K. Polk 49
James Monroe 58
John Adams 61
John Quincy Adams 57
Martin Van Buren 54
Millard Fillmore 49
Richard Nixon 55
Rutherford B. Hayes 54
Thomas Jefferson 57
Ulysses S. Grant 46
Zachary Taylor 64
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I never realized that Grant was the same age as Clinton when he started—a year younger than Obama is now—but having led the army that won the US Civil War, I guess he had reasons to look a bit older at the start of his term.&lt;/p&gt;
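&lt;p&gt;With full dates instead of bare years, no birthday guesswork is needed. A small Python check, using Clinton&amp;rsquo;s dates (typed in here, not pulled from the query):&lt;/p&gt;

```python
from datetime import date

def age_on(birth, day):
    # Year difference, minus one if the birthday hasn't arrived yet that year.
    return day.year - birth.year - ((day.month, day.day) < (birth.month, birth.day))

# Bill Clinton: born August 19, 1946; first inaugurated January 20, 1993.
print(age_on(date(1946, 8, 19), date(1993, 1, 20)))  # 46
```

&lt;p&gt;For a January 20 inauguration, the subtract-one shortcut only goes wrong for a president born on or before January 20.&lt;/p&gt;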
&lt;p&gt;DBPedia + SPARQL is my new favorite toy, and I&amp;rsquo;m getting more and more ideas lately about useful (or at least fun) things to do with the combination.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2098&#34;&gt;September 30, 2008 12:01 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Nice demo :-)&lt;/p&gt;
&lt;p&gt;Your juxtaposition of the Udell comments re. parallax provides much needed additional insight re. utility of DBpedia and SPARQL.&lt;/p&gt;
&lt;p&gt;Kingsley&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://svg.startpagina.nl&#34; title=&#34;http://svg.startpagina.nl&#34;&gt;stelt&lt;/a&gt; on &lt;a href=&#34;#comment-2099&#34;&gt;September 30, 2008 2:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;and whether there&amp;rsquo;s a trend that we&amp;rsquo;re getting younger presidents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think for that part we still need a time against age graph with a poly-fit, preferably in SVG.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2101&#34;&gt;October 1, 2008 10:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Kingsley has sent me a version of this query that uses an OpenLink extension to do the age calculation as part of the SPARQL query, without the need for the XSLT part. Click &lt;a href=&#34;http://tinyurl.com/48l6c6&#34;&gt;http://tinyurl.com/48l6c6&lt;/a&gt; to see the query and execute it. Very cool.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.georgikobilarov.com&#34; title=&#34;http://blog.georgikobilarov.com&#34;&gt;Georgi Kobilarov&lt;/a&gt; on &lt;a href=&#34;#comment-2104&#34;&gt;October 5, 2008 7:22 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;fantastic stuff! The issue you were having with different rdf properties of the same relation will be solved shortly. I&amp;rsquo;ll release a new version of the infobox dataset (based on a new extraction approach) in the next days.&lt;/p&gt;
&lt;p&gt;Cheers, Georgi&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2105&#34;&gt;October 5, 2008 10:36 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Due to underlying engine change and DBpedia instance update, here is the revised query:&lt;/p&gt;
&lt;p&gt;SELECT ?presName, ?birthday, ?startDate, (bif:datediff(&amp;quot;year&amp;quot;, ?birthday, xsd:date(bif:sprintf(&amp;quot;%d-01-20&amp;quot;, ?startDate)))) as ?age_at_innaguration&lt;br /&gt;
WHERE {?presName skos:subject &amp;lt;http://dbpedia.org/resource/Category:Presidents_of_the_United_States&amp;gt;;&lt;br /&gt;
dbpedia2:birth ?birthday;&lt;br /&gt;
dbpedia2:presidentStart ?startDate.&lt;br /&gt;
filter (datatype(?startDate) = xsd:integer)&lt;br /&gt;
}&lt;/p&gt;
&lt;p&gt;Live Link: &lt;a href=&#34;http://tinyurl.com/4edjzl&#34;&gt;http://tinyurl.com/4edjzl&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
    <item>
      <title>tweet tweet</title>
      <link>https://www.bobdc.com/blog/tweet-tweet/</link>
      <pubDate>Fri, 26 Sep 2008 09:02:41 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/tweet-tweet/</guid>
      
      
      <description><div>Joining the twittering classes.</div><div>&lt;p&gt;After joining twitter last April, I entered the following as my second entry a month later:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don&amp;rsquo;t twitter. I barely have time to follow my friends&amp;rsquo; blogs. I just signed up to grab the name bobdc in case I ever do want to use it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&#34;http://twitter.com/bobdc&#34;&gt;&lt;img id=&#34;id203592&#34; src=&#34;http://assets3.twitter.com/images/twitter_logo_s.png?1222127610&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;twitter logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Daniela Barbosa&amp;rsquo;s posting on &lt;a href=&#34;http://danielabarbosa.blogspot.com/2008/08/newspapers-and-their-use-of-twitter.html&#34;&gt;Newspapers and Their Use of Twitter&lt;/a&gt; prompted me to get to know it better. If publishers ranging from &lt;a href=&#34;http://twitter.com/bbcbreaking&#34;&gt;the BBC&lt;/a&gt; to &lt;a href=&#34;http://twitter.com/TheOnion&#34;&gt;The Onion&lt;/a&gt; use it, and it&amp;rsquo;s carving out a new role in how people communicate, and it&amp;rsquo;s free, it&amp;rsquo;s kind of silly not to investigate it further.&lt;/p&gt;
&lt;p&gt;How do you decide who to follow on twitter? For a start, their server can search an online address book and then tell you which of those people have twitter accounts. I have no online address book, but I do have a gmail account, so I wrote a little XSLT stylesheet to convert my address book to a comma separated value file, uploaded that to gmail, and then pointed twitter to it and picked out some names.&lt;/p&gt;
&lt;p&gt;As with other social networks, part of the fun is checking who your friends&amp;rsquo; friends are, and after finding a few friends to follow and seeing who they follow, I started finding more friends and publishers to add. This leads to a certain hanging-out-with-friends aspect to using twitter, and you know that making a little joke as a follow-up to a friend&amp;rsquo;s tweet only makes sense if you have a few mutual followers to see the exchange.&lt;/p&gt;
&lt;p&gt;A few random things that I like about twitter: I&amp;rsquo;m a sucker for anything with a &lt;a href=&#34;http://help.twitter.com/index.php?pg=kb.page&amp;amp;id=75&#34;&gt;command line interface&lt;/a&gt;, even if &lt;code&gt;whois&lt;/code&gt; is the only one of these commands that I use. It&amp;rsquo;s good to know it&amp;rsquo;s there. &lt;a href=&#34;http://twitter.com/hashtags&#34;&gt;hashtags&lt;/a&gt; are a great example of twitter carving out a new role in information distribution; for one example, to quote a Tim Bray tweet about the recent hurricane, &amp;ldquo;Following Ike on twitter (via #ike at &lt;a href=&#34;http://search.twitter.com/&#34;&gt;search.twitter.com&lt;/a&gt; and @ike) is pretty compelling&amp;rdquo;. It&amp;rsquo;s nice to know that even when people are limited to 140 characters of data at a time, they still find a place for semi-structured metadata.&lt;/p&gt;
&lt;p&gt;While the twitter prompt for a tweet is &amp;ldquo;What are you doing?&amp;rdquo;, I&amp;rsquo;m less inclined to tell the two dozen people following me &amp;ldquo;I&amp;rsquo;m going to empty the dishwasher&amp;rdquo; than I am to point out something funny I just saw on the web. One wants to be a little entertaining. The Onion is the most entertaining of all, and after hardly reading it for the last few years, twitter has me reading it again. In addition to &amp;ldquo;here&amp;rsquo;s something funny on the web&amp;rdquo;, another popular topic seems to be &amp;ldquo;here&amp;rsquo;s something that annoys me&amp;rdquo;. If &lt;a href=&#34;http://www.lewisblack.com/&#34;&gt;Lewis Black&lt;/a&gt; is the &lt;a href=&#34;http://www.cbsnews.com/sections/60minutes/rooney/main3419.shtml&#34;&gt;Andy Rooney&lt;/a&gt; of the boomer generation, twitter gives the rest of us a modern platform to complain about the annoyances of modern life.&lt;/p&gt;
&lt;p&gt;To really get to know twitter better, I&amp;rsquo;ve got another account that is only followed by my &lt;a href=&#34;http://twitter.com/bobdc&#34;&gt;bobdc account&lt;/a&gt; to experiment with some of the features, like the use of a mobile phone to send and receive tweets. It just occurred to me that anyone can find out the name of that account by doing some mouseovers on the &amp;ldquo;following&amp;rdquo; section of my twitter page, so if you&amp;rsquo;re really interested in tweets like &amp;ldquo;Test from my phone&amp;rdquo;, you know where to look.&lt;/p&gt;
&lt;p&gt;The twitter &lt;a href=&#34;http://apiwiki.twitter.com/REST+API+Documentation&#34;&gt;API&lt;/a&gt; is simple enough to start playing with right away. The following command line (substitute your own username and password and write it all as one line) retrieves an XML version of your followed friends&amp;rsquo; recent tweets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget --http-user=USERNAME --http-passwd=PASSWORD 
  http://twitter.com/statuses/friends_timeline.xml
&lt;/code&gt;&lt;/pre&gt;
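&lt;p&gt;The same request is easy to put together with Python&amp;rsquo;s standard library. This sketch only constructs the basic-auth request that wget sends; it doesn&amp;rsquo;t actually contact twitter (USERNAME and PASSWORD are placeholders, as above):&lt;/p&gt;

```python
import base64
import urllib.request

def timeline_request(username, password):
    # Build (but don't send) the same basic-auth GET that wget issues.
    req = urllib.request.Request(
        "http://twitter.com/statuses/friends_timeline.xml")
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req

req = timeline_request("USERNAME", "PASSWORD")
print(req.full_url)
print(req.get_header("Authorization")[:6])  # "Basic "
```

&lt;p&gt;Sending it would just be &lt;code&gt;urllib.request.urlopen(req)&lt;/code&gt;.&lt;/p&gt;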
&lt;p&gt;I wrote a simple XSLT 1.0 stylesheet called &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/newtweets.xsl&#34;&gt;newtweets.xsl&lt;/a&gt; (see also the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/newtweets.bat.txt&#34;&gt;newtweets.bat.txt&lt;/a&gt; batch file, renamed here for easier downloading) to turn the downloaded XML into more readable text, so now I have a simple command-line twitter client that took me one-third the time to write that this weblog entry did.&lt;/p&gt;
&lt;p&gt;Another neat trick is embedding your last few tweets into a web page. A &lt;a href=&#34;http://twitter.com/badges/html&#34;&gt;twitter help page&lt;/a&gt; shows you the markup necessary to do this dynamically. The wrapper &lt;code&gt;div&lt;/code&gt; element has an @id value of &amp;ldquo;twitter_div&amp;rdquo; and the list of tweets an @id value of &amp;ldquo;twitter_update_list&amp;rdquo;, so I used those hooks to create a bit of CSS that converts the tweets from a bulleted list to paragraphs separated by a little extra space:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#twitter_div #twitter_update_list, #twitter_div #twitter_update_list li {
  margin-left: 0; padding-left: 0;
}


#twitter_div #twitter_update_list li {
  margin-bottom: 6pt;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You&amp;rsquo;ll see it on the right hand side of my &lt;a href=&#34;http://www.snee.com/bobdc.blog/&#34;&gt;weblog&amp;rsquo;s main page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The experiments continue&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Querying aggregated XBRL reports with SPARQL</title>
      <link>https://www.bobdc.com/blog/querying-aggregated-xbrl-repor/</link>
      <pubDate>Tue, 23 Sep 2008 09:15:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-aggregated-xbrl-repor/</guid>
      
      
<description><div>Easier than I thought it would be.</div><div>&lt;p&gt;My main goal for doing a SPARQL query against XBRL data was to be able to pull out the same bit of information from multiple companies&amp;rsquo; reports at once, and it turned out to be much less work than I thought it would be. Here is the result of my query for interest expense figures across several companies:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;------
| companyName                    | periodStart  | periodEnd    | interestExp |
==============================================================================
| &amp;quot;GENERAL MILLS INC&amp;quot;            | &amp;quot;2005-05-30&amp;quot; | &amp;quot;2006-05-28&amp;quot; | &amp;quot;399600000&amp;quot; |
| &amp;quot;GENERAL MILLS INC&amp;quot;            | &amp;quot;2006-05-29&amp;quot; | &amp;quot;2007-05-27&amp;quot; | &amp;quot;426500000&amp;quot; |
| &amp;quot;GENERAL MILLS INC&amp;quot;            | &amp;quot;2007-05-28&amp;quot; | &amp;quot;2008-05-25&amp;quot; | &amp;quot;421700000&amp;quot; |
| &amp;quot;PAPA JOHNS INTERNATIONAL INC&amp;quot; | &amp;quot;2007-01-01&amp;quot; | &amp;quot;2007-07-01&amp;quot; | &amp;quot;3232000&amp;quot;   |
| &amp;quot;PAPA JOHNS INTERNATIONAL INC&amp;quot; | &amp;quot;2007-04-02&amp;quot; | &amp;quot;2007-07-01&amp;quot; | &amp;quot;1706000&amp;quot;   |
| &amp;quot;PAPA JOHNS INTERNATIONAL INC&amp;quot; | &amp;quot;2007-12-31&amp;quot; | &amp;quot;2008-06-29&amp;quot; | &amp;quot;3694000&amp;quot;   |
| &amp;quot;PAPA JOHNS INTERNATIONAL INC&amp;quot; | &amp;quot;2008-03-31&amp;quot; | &amp;quot;2008-06-29&amp;quot; | &amp;quot;1802000&amp;quot;   |
| &amp;quot;PEPSICO INC&amp;quot;                  | &amp;quot;2006-12-31&amp;quot; | &amp;quot;2007-06-16&amp;quot; | &amp;quot;96000000&amp;quot;  |
| &amp;quot;PEPSICO INC&amp;quot;                  | &amp;quot;2007-03-25&amp;quot; | &amp;quot;2007-06-16&amp;quot; | &amp;quot;54000000&amp;quot;  |
| &amp;quot;PEPSICO INC&amp;quot;                  | &amp;quot;2007-12-30&amp;quot; | &amp;quot;2008-06-14&amp;quot; | &amp;quot;132000000&amp;quot; |
| &amp;quot;PEPSICO INC&amp;quot;                  | &amp;quot;2008-03-23&amp;quot; | &amp;quot;2008-06-14&amp;quot; | &amp;quot;74000000&amp;quot;  |
------
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote id=&#34;id203630&#34; class=&#34;pullquote&#34;&gt;Being able to compare specific financial figures from different companies will be great for people doing financial research.&lt;/blockquote&gt;
&lt;p&gt;A given company&amp;rsquo;s XBRL SEC filing is typically an instance file full of facts plus additional files with taxonomies about the terms used and XLink linkbases about the relationships between the facts. The instance files, on their own, looked like the low hanging fruit to me.&lt;/p&gt;
&lt;p&gt;After kicking around some of my ideas for modeling XBRL in RDF with Dave Raggett (who&amp;rsquo;s doing some very interesting, more ambitious work modeling the whole deal in RDF—in related news, Kingsley Idehen &lt;a href=&#34;https://www.bobdc.com/blog/free-xbrl-software#comments&#34;&gt;said that&lt;/a&gt; OpenLink has an XBRL ontology almost ready, and TopQuadrant&amp;rsquo;s Ralph Hodgson &lt;a href=&#34;http://lists.w3.org/Archives/Public/semantic-web/2008Sep/att-0136/00-part&#34;&gt;pointed out&lt;/a&gt; the BRONTO project at &lt;a href=&#34;http://www.tifbrewery.com/tifBrewery/writing.htm&#34;&gt;TIFbrewery&lt;/a&gt;), I wrote an &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/instance2rdf.xsl&#34;&gt;instance2rdf.xsl&lt;/a&gt; XSLT stylesheet to convert an XBRL instance to RDF/XML. After running it on the instance documents for several companies that I downloaded &lt;a href=&#34;http://www.sec.gov/Archives/edgar/xbrl.html&#34;&gt;from the SEC website&lt;/a&gt; and manually creating a file that I called &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/colist.rdf&#34;&gt;colist.rdf&lt;/a&gt; to map company identifiers in the XBRL instances to company names, I ran the following query with &lt;a href=&#34;http://jena.sourceforge.net/ARQ/&#34;&gt;ARQ 1.4&lt;/a&gt; to ask about all Interest Expense figures in my collection of reports:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX xi:  &amp;lt;http://www.xbrl.org/2003/instance&amp;gt;


SELECT DISTINCT ?companyName ?periodStart ?periodEnd ?interestExp


FROM &amp;lt;RRDonnelley.rdf&amp;gt;
FROM &amp;lt;pepsico.rdf&amp;gt;
FROM &amp;lt;nobleenergy.rdf&amp;gt;
FROM &amp;lt;generalmills.rdf&amp;gt;
FROM &amp;lt;papajohns.rdf&amp;gt;
FROM &amp;lt;dow.rdf&amp;gt;
FROM &amp;lt;ge.rdf&amp;gt;
FROM &amp;lt;cocacola.rdf&amp;gt;
FROM &amp;lt;colist.rdf&amp;gt;


WHERE {
  ?s rdf:type &amp;lt;http://xbrl.us/us-gaap/2008-03-31#InterestExpense&amp;gt;;
     rdf:value     ?interestExp;
     xi:identifier ?identifier;
     xi:startDate  ?periodStart;
     xi:endDate    ?periodEnd.


  ?identifier rdfs:label ?companyName.


}
ORDER BY ?companyName ?periodStart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Not all of the named companies have InterestExpense figures in that namespace; the query just asks for the figure from the companies that do.&lt;/p&gt;
&lt;p&gt;I originally planned to merge all the RDF files into one before running the query, but I decided to let SPARQL do it, which is why there are nine FROM clauses above. In a more realistic scenario, the RDF versions of the companies&amp;rsquo; XBRL data would be loaded into a single triplestore and you would run the query against that.&lt;/p&gt;
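&lt;p&gt;Conceptually, those FROM clauses just take the union of each file&amp;rsquo;s triples before the pattern matching runs. A toy Python sketch of that idea (hand-made triples with shortened property names, not my converter&amp;rsquo;s actual output; the one figure shown is PepsiCo&amp;rsquo;s from the table above):&lt;/p&gt;

```python
GAAP = "http://xbrl.us/us-gaap/2008-03-31#InterestExpense"

# Two tiny "files" of (subject, predicate, object) triples.
pepsico = {
    ("_:f1", "rdf:type", GAAP),
    ("_:f1", "rdf:value", "96000000"),
    ("_:f1", "xi:identifier", "PEP"),
}
colist = {("PEP", "rdfs:label", "PEPSICO INC")}

merged = pepsico | colist  # what the FROM clauses accomplish

def interest_expenses(triples):
    # Match the query's pattern: find InterestExpense facts, then join
    # each fact's identifier to a company name via rdfs:label.
    labels = {s: o for s, p, o in triples if p == "rdfs:label"}
    rows = []
    for fact in {s for s, p, o in triples if p == "rdf:type" and o == GAAP}:
        value = next(o for s, p, o in triples if s == fact and p == "rdf:value")
        ident = next(o for s, p, o in triples if s == fact and p == "xi:identifier")
        rows.append((labels[ident], value))
    return rows

print(interest_expenses(merged))  # [('PEPSICO INC', '96000000')]
```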
&lt;p&gt;As Dave &lt;a href=&#34;http://lists.w3.org/Archives/Public/semantic-web/2008Sep/0113.html&#34;&gt;suggested&lt;/a&gt;, I could add data typing to the RDF created from the XBRL instances. Before I add anything else to the RDF, though, I want to make sure that it enables a new kind of useful SPARQL query against the data that I couldn&amp;rsquo;t do before the addition. I&amp;rsquo;m open to suggestions!&lt;/p&gt;
&lt;p&gt;What does this prove? We know that RDF is great for aggregating data, especially resources that may have different data structures but certain data in common. XBRL gets more interesting when you start aggregating data from multiple companies, and I haven&amp;rsquo;t seen much of that, although my research was limited to &lt;a href=&#34;https://www.bobdc.com/blog/free-xbrl-software&#34;&gt;free software&lt;/a&gt;. Being able to compare specific financial figures from different companies will be great for people doing financial research, and this new combination of standards and free software makes it pretty easy.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2093&#34;&gt;September 23, 2008 11:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;From &lt;a href=&#34;http://demo.openlinksw.com/sparql&#34;&gt;http://demo.openlinksw.com/sparql&lt;/a&gt;, just execute a SPARQL query (ie. select * from the XBRL instance URI ).&lt;/p&gt;
&lt;p&gt;Take any XBRL instance from: &lt;a href=&#34;http://www.sec.gov/Archives/edgar/xbrl.html&#34;&gt;http://www.sec.gov/Archives/edgar/xbrl.html&lt;/a&gt; .&lt;/p&gt;
&lt;p&gt;I would still encourage you to assist me in getting all the XBRL interested parties to work together via the XBRL Financial Report Ontology effort at:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://groups.google.com/group/xbrl-ontology-specification-group&#34;&gt;http://groups.google.com/group/xbrl-ontology-specification-group&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We should be able to collectively produce a Financial Reporting Ontology from XBRL.&lt;/p&gt;
&lt;p&gt;Kingsley&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2094&#34;&gt;September 23, 2008 12:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Kingsley,&lt;/p&gt;
&lt;p&gt;Does &lt;a href=&#34;http://demo.openlinksw.com/sparql&#34;&gt;http://demo.openlinksw.com/sparql&lt;/a&gt; offer a way to issue a SPARQL query against multiple XBRL reports at once? That&amp;rsquo;s really what I was interested in.&lt;/p&gt;
&lt;p&gt;I had another question for you, but decided that &lt;a href=&#34;http://groups.google.com/group/xbrl-ontology-specification-group&#34;&gt;http://groups.google.com/group/xbrl-ontology-specification-group&lt;/a&gt; would be a more effective place to put it.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2095&#34;&gt;September 23, 2008 3:11 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;SPARQL FROM NAMED is how you refer to multiple RDF information resources via their URIs. You can even scope your SPARQL query patterns to specific graphs if you want via GRAPH {query-pattern} .&lt;/p&gt;
&lt;p&gt;Just try it :-)&lt;/p&gt;
&lt;p&gt;Note: use the drop down to tell the service to SPONGE (i.e. get remote Graphs).&lt;/p&gt;
&lt;p&gt;Kingsley&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/xbrl">XBRL</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Free XBRL software</title>
      <link>https://www.bobdc.com/blog/free-xbrl-software/</link>
      <pubDate>Mon, 15 Sep 2008 09:34:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/free-xbrl-software/</guid>
      
      
<description><div>A tour.</div><div>&lt;p&gt;In trying to learn more about XBRL, an important first step is to find software, and I don&amp;rsquo;t want to pay for it. Open Source is even better, in case I want to build some application around it. I&amp;rsquo;ve written up my research experience with each free package I heard about, roughly in order from most promising to least promising. For sample input, most of my testing used &lt;a href=&#34;http://www.sec.gov/Archives/edgar/xbrl.html&#34;&gt;XBRL filings to the SEC&lt;/a&gt; by large multinational corporations specializing in carbonated brown sugar water.&lt;/p&gt;
&lt;h2 id=&#34;id203605&#34;&gt;XBRLAPI&lt;/h2&gt;
&lt;p&gt;This makes the top of the list because it was the closest to what I was looking for—an open source library, with some built-in routines to let you try it right away without doing any coding yourself, that more or less worked with data that I chose to feed it. XBRLAPI has a brief &lt;a href=&#34;http://www.xbrlapi.org/gettingStarted/&#34;&gt;Getting Started&lt;/a&gt; page, and &lt;a href=&#34;http://www.xbrlapi.org/installationDocumentation/commandLine.html&#34;&gt;Using the XBRLAPI from the command line&lt;/a&gt; tells you more about actually getting started with it. I took their command line example demonstrating how to specify all the jar files on the command line and got a few error messages until I added xercesImpl.jar and xalan.jar to the list and increased the initial and maximum heap sizes. It requires a log4j.xml file, and the first one I tried didn&amp;rsquo;t work, so I eventually used the XBRLAPI &lt;a href=&#34;http://xbrlapi.svn.sourceforge.net/viewvc/xbrlapi/trunk/conf/log4j.xml?revision=207&amp;amp;pathrev=207&#34;&gt;distribution log4j.xml&lt;/a&gt;, which just sends log messages to the console.&lt;/p&gt;
&lt;p&gt;Of the built-in routines, the &amp;ldquo;compose operation (merging all of the discovered documents into a single XML composite document)&amp;rdquo; was the most attractive to me, because a company typically stores their XBRL information in a set of instance and taxonomy documents, and the compose operation combines them all into one. This makes it easier for an XSLT stylesheet or some other simple scripting technology to act on the information with no need to do all the cross-file lookups and dereferencing that is normally part of XBRL processing. The Coca Cola files submitted to the SEC added up to about 600K, and trying to process them appeared to hang my machine, so I tried the 134K of the Noble Energy filings. This took 4 hours and 15 minutes, so I tried Coke and Pepsi again with no luck. It&amp;rsquo;s encouraging that it worked with Noble Energy, and this looks like a configuration issue worth trying on my Linux machine instead of the Windows computer where I made these attempts.&lt;/p&gt;
&lt;h2 id=&#34;id203679&#34;&gt;DragonView&lt;/h2&gt;
&lt;p&gt;Rivet Software&amp;rsquo;s &lt;a href=&#34;http://www.rivetsoftware.com/content/index.cfm?fuseaction=showContent&amp;amp;contentID=90&amp;amp;navID=80&#34;&gt;Dragon View&lt;/a&gt; reads XBRL files and displays them with enough interactivity to let you navigate around the cleanly presented reports. Downloading requires registration first, but confirmation of my registration came in about an hour, even though it was a Sunday night.&lt;/p&gt;
&lt;p&gt;When I first installed it and loaded the instance document from Coca Cola&amp;rsquo;s EDGAR filing, I got an error message that Dragon View couldn&amp;rsquo;t find us-gaap-all-2008-03-31.xsd. This file and related ones are available at the SEC&amp;rsquo;s &lt;a href=&#34;http://www.sec.gov/spotlight/xbrl/xbrlusfrv1-core.htm&#34;&gt;US Financial Reporting Version 1 Taxonomies — Core&lt;/a&gt; web page, but don&amp;rsquo;t right-click and try to download the files directly from there, like I did; what appear to be links to the files actually link to a message that you&amp;rsquo;re leaving the SEC site.&lt;/p&gt;
&lt;p&gt;Once I had the right schema files from the standard, I managed to load Coca Cola&amp;rsquo;s cce-20080627.xml instance document into DragonView, where it displayed in a clear and straightforward tabular view showing one of several reports. A drop-down &amp;ldquo;Reports&amp;rdquo; field at the top offered a choice of other reports to view. When I tried to load Pepsico&amp;rsquo;s pep-20080614.xml document, DragonView displayed a &amp;ldquo;Missing Information Warning&amp;rdquo; message box with the explanation &amp;ldquo;One or more XBRL elements contained within the XBRL document is missing from the referenced taxonomy&amp;rdquo; and a clear and simple tabular display of the problematic elements. (Kudos to Rivet for displaying the warning details so nicely; a lot of products, commercial or otherwise, would have dumped a bunch of Courier text log messages into a scrolling field on the message box.) When you select a particular labeled row on the report such as &amp;ldquo;Net changes in assets and liabilities, net of acquisition amounts&amp;rdquo;, optional fields at the bottom of the main display show the authoritative references (source citation) and the definition for that piece of information—in this case, &amp;ldquo;The net change during the reporting period of all current assets and liabilities used in operating activities&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://209.234.225.154/viewer/filings/overview.asp?cik=0000804055&amp;amp;accessionNumber=0001193125-08-100599&#34;&gt;&lt;img id=&#34;id203765&#34; src=&#34;https://www.bobdc.com/img/main/cokexbrlimg.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Financial Explorer view of Coca Cola cash data&#34; width=&#34;400px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;id203786&#34;&gt;Financial Explorer&lt;/h2&gt;
&lt;p&gt;The SEC&amp;rsquo;s &lt;a href=&#34;http://209.234.225.154/viewer/home/&#34;&gt;Financial Explorer&lt;/a&gt; is an online application for browsing XBRL submitted to them. Its interactive diagrams are great, with color-coded circles of different sizes giving quick visual overviews of different amounts of related income or expenses. Being essentially an interactive website, it&amp;rsquo;s not the kind of software I was looking for, but it will be great for many people who want to explore the submitted data without buying software.&lt;/p&gt;
&lt;h2 id=&#34;id203809&#34;&gt;ABRA&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;http://www.xbrlopen.org/&#34;&gt;ABRA&lt;/a&gt; is an open source effort from a German company called ABZ Reporting. It works as a set of XSLT stylesheets with Java-based extensions, and I got its demo to work, but found the overall setup to be a little too hardcoded to its demo. I wrote out more details about what I tried and my suggestions for the program on its &lt;a href=&#34;https://sourceforge.net/forum/forum.php?thread_id=2190642&amp;amp;forum_id=452096&#34;&gt;SourceForge mailing list&lt;/a&gt; near the end of August and haven&amp;rsquo;t seen any reply since then.&lt;/p&gt;
&lt;h2 id=&#34;id203839&#34;&gt;XBRL View&lt;/h2&gt;
&lt;p&gt;This package doesn&amp;rsquo;t seem to have any real home, and the only mentions I could find are &lt;a href=&#34;http://www.download.com/XBRL-View/3000-2066_4-10535259.html&#34;&gt;on the free software sites where you can download it&lt;/a&gt;. It appears to come from China and hasn&amp;rsquo;t been updated since May of 2006. The help page says &amp;ldquo;copyright 2005 - 2006&amp;rdquo; and lists &lt;a href=&#34;https://www.clousoft.com&#34;&gt;www.clousoft.com&lt;/a&gt; as a web page, but this domain name &lt;a href=&#34;http://www.justdropped.com/drops/111406com.html&#34;&gt;expired&lt;/a&gt; in 2006 and there&amp;rsquo;s nothing there now. After starting the program up and using its graphical interface to try to load the Coca Cola and Pepsico instance documents, I got a java.util.NoSuchElementException in XBRL View&amp;rsquo;s console window for both sets of XBRL data.&lt;/p&gt;
&lt;h2 id=&#34;id203872&#34;&gt;Free if you spend some money: SavaNet&amp;rsquo;s XBRL Reader and Fujitsu&amp;rsquo;s XBRL Tools&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;http://www.savanet.com/AboutReader.aspx&#34;&gt;SavaNet® XBRL® Reader™&lt;/a&gt; (look at all those IP superscripts!) is free &amp;ldquo;to the investors and clients of publishers using SavaNet products&amp;rdquo;, and Fujitsu&amp;rsquo;s &lt;a href=&#34;http://www.fujitsu.com/global/services/software/interstage/xbrltools/&#34;&gt;XBRL Tools&lt;/a&gt; is free &amp;ldquo;for XBRL Consortium members / academic users only&amp;rdquo;. (The latter used to be free for anyone who wanted to download it.) Joining XBRL means &lt;a href=&#34;http://www.xbrl.org/HowToJoin/&#34;&gt;joining your local jurisdiction&lt;/a&gt;, which, for xbrl.us, means paying thousands of dollars a year in dues.&lt;/p&gt;
&lt;h2 id=&#34;id203908&#34;&gt;Semansys&lt;/h2&gt;
&lt;p&gt;Semansys has a &lt;a href=&#34;http://www.semansys.com/downloads.html&#34;&gt;download area&lt;/a&gt; that says that &amp;ldquo;Semansys Technologies offers documentation, white papers and evaluation software, ready to use for everyone who&amp;rsquo;s interested&amp;rdquo;, but &amp;ldquo;currently [they] cannot provide you with automated download functionality&amp;rdquo;. Clicking on any of the page&amp;rsquo;s five &amp;ldquo;appropriate profile&amp;rdquo; links pops up a window telling you to email &lt;a href=&#34;mailto:sales@semansys.com&#34;&gt;sales@semansys.com&lt;/a&gt; to find out more. (Clicking the &amp;ldquo;I’m a CPA or consultant and I want to learn more about XBRL and available software in my personal interest&amp;rdquo; profile displays a window telling you &amp;ldquo;Contact us to purchase the application and enjoy a 30 day money-back guarantee&amp;rdquo;—I guess &amp;ldquo;evaluation software ready to use for everyone who&amp;rsquo;s interested&amp;rdquo; means &amp;ldquo;buy it and if you have problems we&amp;rsquo;ll give you your money back&amp;rdquo;.) The page does let you download some XBRL samples from &amp;ldquo;Virtual Company&amp;rdquo;.&lt;/p&gt;
&lt;h2 id=&#34;id203944&#34;&gt;XBreeze&lt;/h2&gt;
&lt;p&gt;When I began this research about two weeks ago, I &lt;a href=&#34;http://www.andhranews.net/intl/2007/January/9/em-UBmatrixnounces.asp&#34;&gt;heard about&lt;/a&gt; an open source program called XBreeze from &lt;a href=&#34;http://www.ubmatrix.com/&#34;&gt;UBMatrix&lt;/a&gt;. The site required registration before you could download it. After trying several days in a row and getting the error message &amp;ldquo;An Error has occurred in the application. The administrator has been alerted about the problem. Please try after some time&amp;rdquo;, I emailed them two weeks ago and haven&amp;rsquo;t heard back. Now the registration page and all pages mentioning XBreeze seem to be gone.&lt;/p&gt;
&lt;h2 id=&#34;id203975&#34;&gt;Next step&lt;/h2&gt;
&lt;p&gt;Converting some SEC XBRL into RDF and working out some reasonably simple SPARQL queries to run against it. Dave Raggett and I have just moved our private email discussion about modeling XBRL in RDF onto the &lt;a href=&#34;http://lists.w3.org/Archives/Public/semantic-web/2008Sep/0113.html&#34;&gt;SWIG&lt;/a&gt; mailing list if anyone wants to join in.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2084&#34;&gt;September 15, 2008 6:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Why can&amp;rsquo;t we coordinate the RDF and XBRL Linked Data effort via:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://groups.google.com/group/xbrl-ontology-specification-group&#34;&gt;http://groups.google.com/group/xbrl-ontology-specification-group&lt;/a&gt; ?&lt;/p&gt;
&lt;p&gt;At the very least, why not ping the members of this community?&lt;/p&gt;
&lt;p&gt;Note, OpenLink has already produced an initial ontology for XBRL which is what we use in our ODE product. The ontology will be released this week.&lt;/p&gt;
&lt;p&gt;Links:&lt;/p&gt;
&lt;p&gt;1. &lt;a href=&#34;http://ode.openlinksw.com&#34;&gt;http://ode.openlinksw.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2085&#34;&gt;September 15, 2008 7:26 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Kingsley,&lt;/p&gt;
&lt;p&gt;In the last 11 months, there have been five messages on that group: two from you last June and three porn spam messages since.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At the very least, why not ping the members of this community?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I did get in touch with Frederik last month, and he told me that they couldn&amp;rsquo;t find anyone in the XBRL community willing to work on it, and that Zitgist will be doing some work when the time is right.&lt;/p&gt;
&lt;p&gt;So far, the ontology continues to be at release 0.0, with nothing in particular to build on.&lt;/p&gt;
&lt;p&gt;Besides, I&amp;rsquo;m not looking for an ontology, I&amp;rsquo;m looking for RDF data to query. &lt;a href=&#34;http://demo.openlinksw.com/ode/&#34;&gt;http://demo.openlinksw.com/ode/&lt;/a&gt; is very interesting, but after following your instructions and poking around a bit (even clicking &amp;ldquo;Raw Triples&amp;rdquo; and then doing &amp;ldquo;View Source&amp;rdquo;) I still don&amp;rsquo;t see any RDF. I&amp;rsquo;ve been meaning to get to know OpenLink better, but I haven&amp;rsquo;t seen how to get RDF out of it yet. So, I&amp;rsquo;ve written a little XSLT to convert instance documents to RDF and I&amp;rsquo;ve been querying those. It&amp;rsquo;s all come together pretty quickly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenLink has already produced an initial ontology&lt;br /&gt;
for XBRL which is what we use in our ODE product.&lt;br /&gt;
The ontology will be released this week.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I look forward to seeing it!&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2087&#34;&gt;September 16, 2008 7:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;To get RDF from an XBRL instance document, simply do the following:&lt;/p&gt;
&lt;p&gt;Using our SPARQL endpoints (e.g. &lt;a href=&#34;http://demo.openlinksw.com/sparql&#34;&gt;http://demo.openlinksw.com/sparql&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;1. basic SPARQL pattern with XBRL instance doc URL in the FROM NAMED clause&lt;/p&gt;
&lt;p&gt;2. over the SPARQL protocol with appropriate results serialization chosen&lt;/p&gt;
&lt;p&gt;When using ODE, please use the &amp;ldquo;Page Description&amp;rdquo; feature from an XBRL instance document, and then look at the footer where there are options for RDF/XML or N3 serialization.&lt;/p&gt;
&lt;p&gt;The proxy URIs that we produce via our Sponger Middleware service will enable you to then replicate the XBRL to RDF experience using other RDF based tools and platforms.&lt;/p&gt;
&lt;p&gt;As for the Google Discussion forum, I am yet to understand how you administer those forums re. SPAM. Also, when it comes to inactivity, we are back to my original frustration: nobody has stepped up to assist us with the enormous task of producing an ontology from XBRL, so we did it ourselves, and even after that, the fragmentation continues :-(&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.corefiling.com&#34; title=&#34;http://www.corefiling.com&#34;&gt;John Turner&lt;/a&gt; on &lt;a href=&#34;#comment-2088&#34;&gt;September 16, 2008 8:13 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;Feel free to download the free, no strings attached, version of SpiderMonkey to help create or extend taxonomies. I&amp;rsquo;ll be interested to hear how you are modelling things like the calculation linkbase as triples.&lt;/p&gt;
&lt;p&gt;Cheers&lt;/p&gt;
&lt;p&gt;John Turner&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2089&#34;&gt;September 16, 2008 9:05 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;John,&lt;/p&gt;
&lt;p&gt;I will check that out. For now, I&amp;rsquo;m just playing with the modeling of simple XBRL facts, but Dave Raggett is working on the modeling of the more complete XBRL picture, including taxonomies.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xbrl">XBRL</category>
      
    </item>
    
    <item>
      <title>Using XSLT to deliver XML on browsers</title>
      <link>https://www.bobdc.com/blog/using-xslt-to-deliver-xml-on-b/</link>
      <pubDate>Tue, 09 Sep 2008 10:18:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-xslt-to-deliver-xml-on-b/</guid>
      
      
      <description><div>An update on Firefox (with some help from the world of model railroading) and Chrome.</div><div>&lt;p&gt;Delivery of XML on web browsers isn&amp;rsquo;t as popular as XML&amp;rsquo;s inventors originally hoped, but it&amp;rsquo;s still useful. It&amp;rsquo;s &lt;a href=&#34;http://www.xml.com/pub/a/2003/02/05/tr.html&#34;&gt;easy&lt;/a&gt; to add a standardized processing instruction to your XML that points at an XSLT stylesheet that converts your XML to HTML, and then when you open the XML file in your browser, you see the result. When you need a rendered version of some XML to review, this can make it happen pretty quickly. (The W3C Recommendation &lt;a href=&#34;http://www.w3.org/TR/xml-stylesheet/&#34;&gt;Associating Style Sheets with XML documents&lt;/a&gt; is possibly their shortest; its author, James Clark, was always good for a high signal-to-noise ratio.)&lt;/p&gt;
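&lt;p&gt;As a minimal sketch of the association described above (the filenames here are hypothetical), the processing instruction goes before the root element of the XML document:&lt;/p&gt;

```xml
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!-- tell the browser to render this file with an XSLT stylesheet --&gt;
&lt;?xml-stylesheet type="text/xsl" href="report.xsl"?&gt;
&lt;report&gt;
  &lt;title&gt;Quarterly figures&lt;/title&gt;
&lt;/report&gt;
```

&lt;p&gt;A browser that honors the instruction applies report.xsl and displays the transformed result instead of the raw XML.&lt;/p&gt;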
&lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/marksimpkins/57681468/&#34;&gt;&lt;img id=&#34;id202512&#34; src=&#34;http://farm1.static.flickr.com/27/57681468_e371eb7352.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;guys with model trains&#34; width=&#34;320px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve done this with Firefox very often, and I&amp;rsquo;ve set up a system for a client who needed to let users preview XML documents without starting up specialized software by doing the same trick with Internet Explorer. (Their choice of browser, not mine.) Recently, though, I found that this didn&amp;rsquo;t work with Firefox 3.0. The stylesheet was ignored, and the error console told me that there was a &amp;ldquo;Security Error: Content at file:///c:/my/path/filename.xml may not load data from file:///c:/some/path/stylesheet.xsl&amp;rdquo;. I understand the general idea of not letting a browser allow one file to load another, but when they&amp;rsquo;re both local, I see a baby going out with the bathwater.&lt;/p&gt;
&lt;p&gt;I had to look pretty hard to find the solution, and of all places, it turned out to be on a website dedicated to an &lt;a href=&#34;http://wiki.rocrail.net/doku.php&#34;&gt;open source model railroad control system&lt;/a&gt;. (It would have been great if a posting in the thread came from &lt;a href=&#34;http://www.gibson.com/en-us/Lifestyle/Features/Smokestack%20Lightnin__%20Neil%20You/&#34;&gt;Clyde Coil&lt;/a&gt;, but none did.) According to page 3 of a thread on &lt;a href=&#34;http://forum.rocrail.net/viewtopic.php?p=2376&#34;&gt;Who can help on XSL-Stylesheets&lt;/a&gt;, you need to reset Firefox&amp;rsquo;s security.fileuri.strict_origin_policy setting to &amp;ldquo;false&amp;rdquo;. This worked for me, and I hope I&amp;rsquo;m not opening some gaping security hole. I&amp;rsquo;d hate to have to choose between good security and Firefox rendering of styled XML.&lt;/p&gt;
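&lt;p&gt;For reference, the same change can be made persistent in a user.js file in the Firefox profile directory (a sketch; note that flipping this preference relaxes same-origin checks for all file: URIs, not just your stylesheets):&lt;/p&gt;

```
// user.js in the Firefox profile directory:
// let a local XML file load a local XSLT stylesheet again
user_pref("security.fileuri.strict_origin_policy", false);
```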
&lt;p&gt;After I downloaded &lt;a href=&#34;http://www.google.com/chrome&#34;&gt;Google Chrome&lt;/a&gt;, this XSLT trick was the first thing I tried, with no luck. When I used Chrome to open an XML file with a processing instruction that pointed to an XSLT stylesheet, as described by James&amp;rsquo; brief W3C Recommendation, it showed nothing—pure blank space, not even the PCDATA from the file. Chrome&amp;rsquo;s clean interface made it difficult to find out how to View Source, but luckily Ctrl+U did it like it does in Firefox. When I tried this, it displayed a color-coded version of the XML file, including comments&amp;hellip; but with all processing instructions removed, so the standard way to associate stylesheets with arbitrary XML files won&amp;rsquo;t work.&lt;/p&gt;
&lt;p&gt;If Chrome is more about being an application development platform than another grab at desktop dominance, then I hope that XSLT is eventually part of that application development picture. Maybe someday.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-2078&#34;&gt;September 9, 2008 11:16 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Google Chrome team has been made aware of this issue. (That&amp;rsquo;s as much as I&amp;rsquo;m authorized to say.)&lt;/p&gt;
&lt;p&gt;By Erik Hetzner on &lt;a href=&#34;#comment-2079&#34;&gt;September 9, 2008 11:17 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been using FF3 to load XSLT for XML on the same origin server (with relative &amp;amp; absolute URIs) without problems. There is more info about this policy here: &lt;a href=&#34;http://kb.mozillazine.org/Security.fileuri.strict_origin_policy&#34;&gt;http://kb.mozillazine.org/Security.fileuri.strict_origin_policy&lt;/a&gt; It might work for you to use an HTTP URI for the stylesheet or to put the stylesheet in the same dir or a subdir that the XML is in.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2080&#34;&gt;September 9, 2008 11:42 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;John: thanks, that gives me hope.&lt;/p&gt;
&lt;p&gt;Erik: thanks, I will try that.&lt;/p&gt;
&lt;p&gt;By Norm on &lt;a href=&#34;#comment-2081&#34;&gt;September 10, 2008 6:47 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We do the same thing with XML and XSL for displaying the output of our build system. Sadly I ran into the same problem as you with Chrome (we had figured out the FF3 &amp;ldquo;work-around&amp;rdquo; a while ago). I really hope Google fixes this. Given the speed of Chrome I&amp;rsquo;m hoping that this transfers to their XSLT processing (we have very large XML files and some complicated XSL). Putting the stylesheet in the same dir didn&amp;rsquo;t work for me. I&amp;rsquo;m going to try serving up the stylesheet from HTTP now.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>Learning about XBRL</title>
      <link>https://www.bobdc.com/blog/learning-about-xbrl/</link>
      <pubDate>Wed, 03 Sep 2008 19:41:50 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/learning-about-xbrl/</guid>
      
      
      <description><div>For the geek, not for the accountant.</div><div>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Facts are simple and facts are straight&lt;br /&gt;
Facts are lazy and facts are late&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Talking Heads, &amp;ldquo;Crosseyed and Painless&amp;rdquo;, from &lt;a href=&#34;http://en.wikipedia.org/wiki/Remain_in_Light&#34;&gt;Remain in Light&lt;/a&gt;, 1980&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Facts can be simple, in which case their values must be expressed as simple content (except in the case of simple facts whose values are expressed as a ratio), and facts can be compound, in which case their value is made up from other simple and/or compound facts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://www.xbrl.org/Specification/XBRL-RECOMMENDATION-2003-12-31+Corrected-Errata-2005-11-07.htm#fact&#34;&gt;XBRL Recommendation 2.1, section 1.4, &amp;ldquo;Terminology&amp;rdquo;&lt;/a&gt;, 2005&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(Sorry, couldn&amp;rsquo;t resist that.) Most introductions to XBRL out there are aimed at financial people. They briefly touch on the what and why of the XML parts, but leave out the how, treating the technology part as a black box. Simple web searches turn up plenty of these introductions, so for those interested in a more technical, markup geek perspective, I wanted to give an overview of the better resources that I found. If you&amp;rsquo;re coming at XBRL from an implementer&amp;rsquo;s angle, it&amp;rsquo;s certainly important to read the overviews aimed at the CFO and accountant crowd to learn what users do with this data and expect from the technology, but if you want to create or modify the technology you have to dig a little more for background.&lt;/p&gt;
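&lt;p&gt;To make the spec&amp;rsquo;s terminology concrete, a simple fact in an XBRL instance document looks roughly like this (a sketch: the my: element, context, and unit names are made up, but contextRef, unitRef, and decimals are the instance attributes the spec defines):&lt;/p&gt;

```xml
&lt;xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:my="http://example.com/taxonomy"&gt;
  &lt;!-- a simple fact: its value is simple content,
       tied to a reporting context and a unit --&gt;
  &lt;my:CashAndCashEquivalents contextRef="FY2008Q2" unitRef="USD"
      decimals="-3"&gt;1234000&lt;/my:CashAndCashEquivalents&gt;
&lt;/xbrli:xbrl&gt;
```

&lt;p&gt;The context and unit elements that FY2008Q2 and USD point to live elsewhere in the instance document, which is why so much XBRL processing involves cross-referencing.&lt;/p&gt;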
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For an overview of XBRL&amp;rsquo;s role in the standards world, Dale Waldt&amp;rsquo;s &lt;a href=&#34;http://www.xml.com/pub/a/2004/03/10/xbrl.html&#34;&gt;XBRL: The Language of Finance and Accounting&lt;/a&gt; article in XML.com (part of a series titled &amp;ldquo;Standards Lowdown&amp;rdquo;) answers key questions such as &amp;ldquo;What Is it?&amp;rdquo; and &amp;ldquo;Where does it come from?&amp;rdquo;, so it&amp;rsquo;s a good place to start.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Although Wikipedia is aimed at a general audience, its &lt;a href=&#34;http://en.wikipedia.org/wiki/XBRL&#34;&gt;XBRL entry&lt;/a&gt; packs a lot of technical detail and context into a fairly small space, so I strongly recommend that as one of the first things to look at.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You wouldn&amp;rsquo;t want to read all 165 pages of the actual &lt;a href=&#34;http://www.xbrl.org/Specification/XBRL-RECOMMENDATION-2003-12-31+Corrected-Errata-2005-11-07.htm&#34;&gt;XBRL Recommendation&lt;/a&gt; straight through, but as with any spec for a standard that you&amp;rsquo;re interested in, it&amp;rsquo;s worth reading the introductory part and skimming the rest to get an idea of what&amp;rsquo;s there so that you know how it&amp;rsquo;s organized when you need to look up something specific. I read about the first 20 pages, then my eyes glazed over when it started getting into detail about the more advanced XLink possibilities, so I skipped to the introductions of the &lt;a href=&#34;http://www.xbrl.org/Specification/XBRL-RECOMMENDATION-2003-12-31+Corrected-Errata-2005-11-07.htm#_4&#34;&gt;XBRL Instances&lt;/a&gt; and &lt;a href=&#34;http://www.xbrl.org/Specification/XBRL-RECOMMENDATION-2003-12-31+Corrected-Errata-2005-11-07.htm#_5&#34;&gt;XBRL Taxonomies&lt;/a&gt; sections.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Lastly, there&amp;rsquo;s a somewhat interactive &lt;a href=&#34;http://www.us.kpmg.com/microsite/xbrl/train/86/86.htm&#34;&gt;tutorial&lt;/a&gt; at KPMG&amp;rsquo;s website. (Don&amp;rsquo;t even follow the link unless you&amp;rsquo;re using Internet Explorer, which is obviously the first strike against the tutorial.) This tutorial has plenty of good information, but it&amp;rsquo;s still a glorified PowerPoint presentation masquerading as a series of Interactive Course Module Rich User Experience Learning Management System Objects, or whatever the hell they&amp;rsquo;re calling it. (I also question the expertise of any XML &amp;ldquo;experts&amp;rdquo; who don&amp;rsquo;t understand the &lt;a href=&#34;http://xml.silmaril.ie/authors/makeup/&#34;&gt;difference between elements and tags&lt;/a&gt;. And, don&amp;rsquo;t be put off by the awful music with the title slides; it&amp;rsquo;s only on the title slides.) Despite these annoyances, the KPMG tutorial lays out the elements and attributes that make up XBRL instances and taxonomies pretty well, and I took several pages of notes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do you know of any good introductions to XBRL aimed more at implementers than at accountants?&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By Bill Donoghoe on &lt;a href=&#34;#comment-2069&#34;&gt;September 4, 2008 5:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Using XBRL Extensibility by Charles Hoffman&lt;br /&gt;
(&lt;a href=&#34;http://www.lulu.com/browse/book_view.php?fCID=592782&amp;amp;fBuyItem=5&#34;&gt;http://www.lulu.com/browse/book_view.php?fCID=592782&amp;amp;fBuyItem=5&lt;/a&gt;) may be useful.&lt;/p&gt;
&lt;p&gt;You could also check out my XBRL bookmarks at&lt;br /&gt;
&lt;a href=&#34;http://delicious.com/bdonoghoe/xbrl&#34;&gt;http://delicious.com/bdonoghoe/xbrl&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xbrl">XBRL</category>
      
    </item>
    
    <item>
      <title>Werewolves of Kid Rock</title>
      <link>https://www.bobdc.com/blog/werewolves-of-kid-rock/</link>
      <pubDate>Mon, 01 Sep 2008 10:58:21 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/werewolves-of-kid-rock/</guid>
      
      
      <description><div>Skynyrd, sure, but don&#39;t forget Warren Zevon.</div><div>&lt;p&gt;Kid Rock&amp;rsquo;s &amp;ldquo;All Summer Long&amp;rdquo; looks like the monster summer hit of 2008. In today&amp;rsquo;s compartmentalized market for pop music, this song is big in a lot of compartments. The lyrics reminisce about a summer when he was young, and the girlfriend he had then, and how they would all sing &amp;ldquo;Sweet Home Alabama&amp;rdquo; a lot. Like &amp;ldquo;Sweet Home Alabama,&amp;rdquo; the song&amp;rsquo;s chords are half a bar of D, half a bar of C, and a bar of G, repeated with no variation throughout the song&amp;rsquo;s verses, choruses, and solo, with no bridge to break it up the pattern. Mr. Rock works in a lot of backing vocals and lead guitar lines from the Lynyrd Skynyrd hit, and because &amp;ldquo;Sweet Home Alabama&amp;rdquo; still gets plenty of airplay on country and classic rock stations, even young kids get the reference. There&amp;rsquo;s a key reference they don&amp;rsquo;t get, though. The first time I heard the opening of &amp;ldquo;All Summer Long&amp;rdquo; on the radio, I said to my daughter &amp;ldquo;Cool! Someone&amp;rsquo;s covered &amp;lsquo;Werewolves of London&amp;rsquo;!&amp;rdquo; I was wrong, but not far off.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re not familiar with the late &lt;a href=&#34;http://en.wikipedia.org/wiki/Warren_Zevon&#34;&gt;Warren Zevon&lt;/a&gt;, I could compare him to Randy Newman or Nilsson, two more LA songwriters with a cynical sense of humor whose greatest commercial successes as songwriters were better-known artists&amp;rsquo; versions of their songs—in Zevon&amp;rsquo;s case, Linda Ronstadt&amp;rsquo;s version of &amp;ldquo;Poor Poor Pitiful Me.&amp;rdquo; (Be thankful that Newman is still with us.) Not only does &amp;ldquo;Werewolves of London&amp;rdquo; do that same D-C-G thing throughout, but the piano part that kicks off &amp;ldquo;All Summer Long&amp;rdquo; is clearly a copy of Zevon&amp;rsquo;s, if not a sample. When I pulled out my vinyl copy of &lt;a href=&#34;http://en.wikipedia.org/wiki/Excitable_Boy&#34;&gt;Excitable Boy&lt;/a&gt; and played it for my daughters, they acted blasé about the resemblance, but teenagers are good at that. Make the comparison yourself, even if you only listen to the first 10 seconds of &lt;a href=&#34;http://www.youtube.com/watch?v=nhSc8qVMjKM&#34;&gt;Werewolves of London&lt;/a&gt; and &lt;a href=&#34;http://www.youtube.com/watch?v=uwIGZLjugKA&#34;&gt;All Summer Long&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;
&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/nhSc8qVMjKM&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;br /&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/uwIGZLjugKA&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;Kid Rock always seemed to straddle multiple cultures with a grin and a wink; I&amp;rsquo;ll take this juxtaposition as one more wink.&lt;/p&gt;
&lt;p&gt;Update: my brother pointed out that Rock did the right thing, &lt;a href=&#34;http://repertoire.bmi.com/title.asp?blnWriter=True&amp;amp;blnPublisher=True&amp;amp;blnArtist=True&amp;amp;keyid=9398351&amp;amp;ShowNbr=0&amp;amp;ShowSeqNbr=0&amp;amp;querytype=WorkID&#34;&gt;spreading the publishing around&lt;/a&gt; to all the authors of both songs—even the great LA session guitarist Waddy Wachtel.&lt;/p&gt;
&lt;h2 id=&#34;8-comments&#34;&gt;8 Comments&lt;/h2&gt;
&lt;p&gt;By James Lynch III on &lt;a href=&#34;#comment-2058&#34;&gt;September 1, 2008 12:44 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think he&amp;rsquo;s acknowledged sampling both tunes&amp;hellip; but I can&amp;rsquo;t help but think this is a totally crass way to get a hit in this f-ed up music marketplace&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-2059&#34;&gt;September 1, 2008 2:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I listened to Zevon all the way through (an old fave, hadn&amp;rsquo;t seen the vid). Watched 30 seconds of Kid Rock, stopped it and listened to Zevon again. Now I&amp;rsquo;m off to youtube to find the Skynyrd thing. Must be getting old&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-2061&#34;&gt;September 1, 2008 3:39 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;don&amp;rsquo;t forger Warren Zevon&amp;rdquo; - typo or intentional?&lt;/p&gt;
&lt;p&gt;By Peter on &lt;a href=&#34;#comment-2062&#34;&gt;September 1, 2008 3:59 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I just heard it for the first time and remarked &amp;ldquo;Wow, sampling werewolves. Pretty clever.&amp;rdquo; And then I was informed I was &amp;lsquo;out of the loop&amp;rsquo; and it was Skynyrd. Same freaking key. Look for lazy mashups coming soon.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2063&#34;&gt;September 1, 2008 4:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Danny: it was a typo, and I&amp;rsquo;ve corrected it. Copying one instrument&amp;rsquo;s licks from one use of a tried-and-true chord progression isn&amp;rsquo;t forgery. (Thanks for pointing it out.)&lt;/p&gt;
&lt;p&gt;Jim: sampling Zevon, or Skynyrd, isn&amp;rsquo;t a way to get a hit. You need more ingredients than that, and &amp;ldquo;All Summer Long&amp;rdquo; has them.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.timothyhorrigan.com&#34; title=&#34;http://www.timothyhorrigan.com&#34;&gt;Tim Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-2064&#34;&gt;September 1, 2008 5:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hmm, I wonder if Warren Zevon was consciously referencing the riff of the original &amp;ldquo;Sweet Home Alabama&amp;rdquo; in &amp;ldquo;Werewolves of London.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;BTW, even though those two songs are classics, I think Kid Rock&amp;rsquo;s mashup is an improvement on both of them. And I speak as someone who is not that big a fan of his :-)&lt;/p&gt;
&lt;p&gt;Kid Rock rips hooks from other 70s and 80s artists as well&amp;hellip; the drum sound is straight from the Ramones&amp;rsquo; cover of &amp;ldquo;Time Has Come Today,&amp;rdquo; the vocal harmonies from Madonna&amp;rsquo;s &amp;ldquo;Like a Prayer&amp;rdquo;, some of the guitar riffs are from &amp;ldquo;Blue Skies&amp;rdquo; by the Allman Brothers and others are from &amp;ldquo;Tumbling Dice&amp;rdquo; and other Rolling Stones classics.&lt;/p&gt;
&lt;p&gt;By Rob Koberg on &lt;a href=&#34;#comment-2065&#34;&gt;September 1, 2008 7:34 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Dead covered &amp;lsquo;Werewolves of London&amp;rsquo; often (I saw them do it in London :) Best cover of a Warren Zevon song is Dwight Yoakam&amp;rsquo;s version of Carmelita.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2067&#34;&gt;September 2, 2008 12:15 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tim: tough points to prove.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Changing my mind about XBRL again</title>
      <link>https://www.bobdc.com/blog/changing-my-mind-about-xbrl-ag/</link>
      <pubDate>Thu, 28 Aug 2008 09:20:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/changing-my-mind-about-xbrl-ag/</guid>
      
      
      <description><div>Call me a flip-flopper.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.xbrl.org&#34;&gt;&lt;img id=&#34;id202480&#34; src=&#34;http://www.xbrl.org/TMLogos/LOGO-XBRL_with_R.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;XBRL logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When I first heard about the &lt;a href=&#34;http://www.xbrl.org/Home/&#34;&gt;eXtensible Business Reporting Language&lt;/a&gt;, it sounded great: an XML standard for business reports and their contents. Who could argue with lots of data with lots of value to many people, available in an open standard? I knew some of the people who worked on it, and I dug in and played a bit, but eventually lost interest. A comment that I left on a Tim Bray ongoing posting titled &lt;a href=&#34;http://www.tbray.org/ongoing/When/200x/2007/12/12/XBRL-News&#34;&gt;XBRL News&lt;/a&gt; last December showed the high point (or perhaps low point) of my cynicism:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;XBRL (which has taken over 7 years to achieve the minor level of adoption that you describe) is second only to W3C Schemas in the number of people it&amp;rsquo;s inspired to say &amp;ldquo;sure it&amp;rsquo;s complex, but don&amp;rsquo;t worry&amp;ndash;there&amp;rsquo;ll be tools to take care of that!&amp;rdquo; A key problem is that it&amp;rsquo;s so customizable and flexible that it&amp;rsquo;s difficult to put together something that can perform similar processing on multiple arbitrary XBRL documents. Compare DITA, which allows two different documents to appear structurally very different, but has open-source software to abstract away the differences to let us treat the documents as having an equivalent structure. (Of course DITA has a simpler, more straightforward domain, so there&amp;rsquo;s no ocean to boil.)&lt;/p&gt;
&lt;p&gt;When XBRL has an open source equivalent of the DITA Open Toolkit, I&amp;rsquo;ll be ready to take another good look at it. Until then, as with so many standards, without free software that lets us do stuff with data that conforms to the standard, I don&amp;rsquo;t see much incentive for conformance to the standard, except of course for regulations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another source of my frustration was that in May of 2004, when I was last playing with it, Fujitsu&amp;rsquo;s set of &lt;a href=&#34;http://software.fujitsu.com/en/interstage-xwand/activity/xbrltools/&#34;&gt;XBRL Tools&lt;/a&gt;, the most popular free XBRL software at the time, was going through the upgrade from XBRL 2.0 to 2.1, and at that point their latest document validator couldn&amp;rsquo;t handle documents created by their latest document creator. Although the Fujitsu tools are now available only to XBRL Consortium members (and joining the consortium means jumping through a hoop or two), more free XBRL software is now available.&lt;/p&gt;
&lt;p&gt;More importantly, Tim pointed out last December (with a link to the SEC&amp;rsquo;s &lt;a href=&#34;http://www.sec.gov/Archives/edgar/xbrl.html&#34;&gt;XBRL Data Submitted in the XBRL Voluntary Program on EDGAR&lt;/a&gt; page) that fewer than 100 big public companies were reporting in XBRL, but this week I count 468 companies listed on that page. That&amp;rsquo;s a substantial increase, and that&amp;rsquo;s plenty of data to play with—enough to make the whole idea of XBRL more than just a theoretical nice-to-have.&lt;/p&gt;
&lt;p&gt;Something else that drew me back to XBRL is that as I studied various kinds of ontology and taxonomy work, I noticed that XBRL people were doing a lot of very careful taxonomy work to support specific business goals. An &lt;a href=&#34;http://www.iaconline.org&#34;&gt;American Council for Technology&lt;/a&gt; white paper titled &amp;ldquo;Transforming Financial Information – Use of XBRL in Federal Financial Management&amp;rdquo; (&lt;a href=&#34;http://www.actgov.org/actiac/documents/pdfs/XBRLWhitePaper.pdf&#34;&gt;PDF&lt;/a&gt;) quoted Charles Hoffman (both the father and author of XBRL, according to the paper) as saying&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;XBRL is in a field referred to collectively as Semantic Technology&amp;hellip; Semantic Technologies is a multi-faceted field with progressive layers of technology and complexity. The World Wide Web Consortium developed a set of semantic standards established at the turn of the century (most significant of which are the Resource Description Framework (RDF) and the Web Ontology Language (OWL)). This field is rich with possibilities and stands as the next logical step in the natural progression of information technology to seek a higher value proposition.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ll bite, but it doesn&amp;rsquo;t seem that many others have, because I don&amp;rsquo;t see much work on the potential connection between XBRL and the W3C-oriented semantic web standards. (I&amp;rsquo;d be happy to have things I missed pointed out to me.) Many web pages out there mention both XBRL and RDF, but mostly as examples in lists, with no explicit discussion of possible relationships. The &lt;a href=&#34;http://xbrlontology.com/&#34;&gt;XBRL Ontology Specification&lt;/a&gt; hasn&amp;rsquo;t gotten any further than the 0.0 status it had in April of last year, with &lt;a href=&#34;http://groups.google.com/group/xbrl-ontology-specification-group/browse_thread/thread/38d8a481d5251155&#34;&gt;mailing list activity&lt;/a&gt; ending a month after it started. There was a &lt;a href=&#34;http://www.semantic-conference.com/session/1007/&#34;&gt;Financial Services XBRL Seminar&lt;/a&gt; at the 2008 Semantic Web Technology Conference, but I haven&amp;rsquo;t seen any evidence of cross-fertilization that came out of it.&lt;/p&gt;
&lt;p&gt;So I&amp;rsquo;m going to pursue the potential connections between XBRL and RDF-related technology myself. As I read about all those information relationships that XBRL can model, on the one hand I&amp;rsquo;m thinking &amp;ldquo;Cool! (How about that XLink, &lt;a href=&#34;http://www.xml.com/pub/a/2002/03/13/xlink.html&#34;&gt;after all&lt;/a&gt;!)&amp;rdquo; and on the other hand I&amp;rsquo;m thinking &amp;ldquo;This would be so difficult to model as triples!&amp;rdquo; I&amp;rsquo;m more interested in a bottom-up proof-of-concept than in a top-down ontology, though. For a start, instead of modeling all of XBRL&amp;rsquo;s many potential data structures as triples, I plan to model a subset that can be queried with reasonably non-contorted SPARQL queries and to put together a demo using some of that EDGAR data. Playing with the existing free software and writing some XSLT to convert EDGAR filings to RDF will be priorities. I&amp;rsquo;ll report on my progress (or lack thereof) as I move along.&lt;/p&gt;
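&lt;p&gt;To make the plan a little more concrete, here&amp;rsquo;s a rough sketch of the kind of thing I have in mind. The namespace, property names, and values below are all made up for illustration; they don&amp;rsquo;t come from any actual XBRL taxonomy. A single reported fact might become a handful of triples (shown here in N3):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix ex: &amp;lt;http://example.com/xbrl2rdf#&amp;gt; .

ex:fact1 ex:concept ex:NetIncome ;
         ex:entity  ex:ACMECorp ;
         ex:period  &amp;quot;2007&amp;quot; ;
         ex:value   &amp;quot;1500000&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A reasonably non-contorted SPARQL query could then pull such facts back out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX ex: &amp;lt;http://example.com/xbrl2rdf#&amp;gt;

SELECT ?entity ?value WHERE {
  ?fact ex:concept ex:NetIncome ;
        ex:entity  ?entity ;
        ex:value   ?value .
}
&lt;/code&gt;&lt;/pre&gt;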
&lt;h2 id=&#34;9-comments&#34;&gt;9 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-2048&#34;&gt;August 28, 2008 10:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Please make time to do the following:&lt;/p&gt;
&lt;p&gt;1. Visit &lt;a href=&#34;http://ode.openlinksw.com&#34;&gt;http://ode.openlinksw.com&lt;/a&gt;&lt;br /&gt;
2. Follow the examples link&lt;br /&gt;
3. See XBRL instance data in Linked Data form&lt;br /&gt;
4. Download and install the OpenLink Data Explorer for Firefox&lt;br /&gt;
5. Visit any XBRL instance doc URL&lt;br /&gt;
6. Use the &amp;ldquo;View | Linked Data Sources&amp;rdquo; feature of ODE to flip from the XBRL view to Linked Data View&lt;/p&gt;
&lt;p&gt;Kingsley&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2049&#34;&gt;August 28, 2008 2:01 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Kingsley, that works as described and is very cool.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By Rick Jelliffe on &lt;a href=&#34;#comment-2051&#34;&gt;August 29, 2008 11:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think the XBRL instances are a really terrific design: linking rather than direct markup.&lt;/p&gt;
&lt;p&gt;But the modeling on top of XSD absolutely stinks as a system, big suckerooney: there are some nice tools to be sure, but the tools never stop someone from needing to know what is going on, and what is going on is XSD+++++. IIRC it uses equivalence classes a lot, which is one of those &amp;ldquo;we don&amp;rsquo;t implement that&amp;rdquo; features for some data-binding/DBMS kinds of tools.&lt;/p&gt;
&lt;p&gt;The difference is that the XBRL modeling is at least more straightforward. It is not the XSD sea-of-details approach.&lt;/p&gt;
&lt;p&gt;So a mixed bag on friendliness, but definitely a &amp;lsquo;hardcore&amp;rsquo; technology, not a casual one or for visibility to Mom-and-Pop.&lt;/p&gt;
&lt;p&gt;Cheers&lt;br /&gt;
Rick&lt;/p&gt;
&lt;p&gt;By Dave Raggett on &lt;a href=&#34;#comment-2052&#34;&gt;August 29, 2008 4:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;XBRL&amp;rsquo;s complexity reflects the richness of the domain it models. Its sophisticated use of XLink certainly makes XBRL hard to process with XSLT. I prefer to think of XBRL as a transfer format, as it turns out to be rather easy to convert to RDF. This way you can generate different kinds of XBRL reports using queries over a scalable RDF triple store, such as Sesame. This also opens the theoretical possibility for XBRL filings to be submitted in one of the RDF syntaxes, e.g. Turtle. The current XML syntax makes use of XML Schema to assist with validation of XBRL filings, and it will be interesting to look at validation using Semantic Web technologies as an alternative.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2053&#34;&gt;August 29, 2008 5:06 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Dave!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;it turns out to be rather easy to convert to RDF.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Just what I was looking for! Who has done this? Is there code available to use or see?&lt;/p&gt;
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://people.w3.org/~dsr/blog/&#34; title=&#34;http://people.w3.org/~dsr/blog/&#34;&gt;Dave Raggett&lt;/a&gt; on &lt;a href=&#34;#comment-2054&#34;&gt;August 30, 2008 12:47 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is something I am working on as a background activity. The code for converting XBRL to RDF turtle syntax is in C and linked against libxml2. I will ask my manager at JustSystems if it would be possible to release this as open source, but that will inevitably take some time as it requires sign-off at the top levels of the company. I will post some details on the relation between XBRL and RDF on my blog, but due to vacation and other higher priority work, this may take a while.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2055&#34;&gt;August 30, 2008 1:05 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cool, thanks!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://rhizomik.net/~roberto&#34; title=&#34;http://rhizomik.net/~roberto&#34;&gt;Roberto García&lt;/a&gt; on &lt;a href=&#34;#comment-2056&#34;&gt;August 30, 2008 5:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;I have started to map XBRL XSDs and instance data from the EDGAR program to OWL and RDF. I use the generic mappings provided by ReDeFer XSD2OWL and XML2RDF tools (&lt;a href=&#34;http://rhizomik.net/redefer&#34;&gt;http://rhizomik.net/redefer&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The mappings are partial and quite preliminary. All is available from &lt;a href=&#34;http://rhizomik.net/ontologies/bizontos&#34;&gt;http://rhizomik.net/ontologies/bizontos&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Best,&lt;/p&gt;
&lt;p&gt;Roberto&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-2057&#34;&gt;August 31, 2008 10:22 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Roberto, this looks interesting.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xbrl">XBRL</category>
      
    </item>
    
    <item>
      <title>Jonathan Zittrain&#39;s &#34;The Future of the Internet: and How to Stop It&#34;</title>
      <link>https://www.bobdc.com/blog/jonathan-zittrains-the-future/</link>
      <pubDate>Thu, 21 Aug 2008 09:09:05 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/jonathan-zittrains-the-future/</guid>
      
      
      <description><div>Highly recommended.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0300124872/bobducharmeA/&#34;&gt;&lt;img id=&#34;id202480&#34; src=&#34;http://ecx.images-amazon.com/images/I/51Eq-gmEYyL.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;The Future of the Internet: and How to Stop It&#34; width=&#34;160px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The title of Jonathan Zittrain&amp;rsquo;s book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0300124872/bobducharmeA/&#34;&gt;The Future of the Internet&lt;/a&gt; makes it sound like one of those upbeat future technology books that you see people in suits reading on planes, but the subtitle &amp;ldquo;and How to Stop It&amp;rdquo; shows that it&amp;rsquo;s not so upbeat. Zittrain, the Professor of Internet Governance and Regulation at Oxford University and the co-founder of Harvard Law School&amp;rsquo;s &lt;a href=&#34;http://cyber.law.harvard.edu/&#34;&gt;Berkman Center for Internet &amp;amp; Society&lt;/a&gt;, describes how so much use of the Internet is headed in directions that contradict the principles that made the Internet great in the first place. The most important of these principles is what he calls generativity—flexibility in the creation of hardware, operating systems, applications, or websites that allow people to make new contributions, often resulting in unexpected contributions that others can build on further. While Linux, Apache web servers, Firefox, wikis, the IBM PC&amp;rsquo;s open architecture, and many other platforms have provided this so far, the increasing use of &amp;ldquo;tethered appliances&amp;rdquo; to perform Internet-related tasks threatens this pattern. Products such as the iPhone, TiVo, and the XBox are so tightly controlled by their makers that any innovations built on these platforms must come from within the companies that control them, much like any innovation in the U.S. telephone system had to come from the monopoly company that controlled it for so many decades. Sure, you can write a new application for the iPhone, but no one can load your app onto their iPhone until it goes to Apple, gets approved, and then gets distributed by them. 
If you want to add a new menu option to Firefox and recompile it, no such approval process is necessary for people to use it, and this kind of freedom is how the Internet grew to where it is today.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t want to rehash the whole book here, but I&amp;rsquo;ll admit that I expected it to be fairly dry and read it mostly out of a sense of responsibility to be up on these issues. It actually is a fairly quick read; I read most of my copy sitting on a beach. Once Zittrain lays out his case, which includes a history of the Internet that was fascinating to someone who&amp;rsquo;s read quite a few histories of the Internet, he reviews several of the things that have gone wrong (for example, spam and malware). This sets the stage for how tightly-controlled Internet walled gardens are becoming more appealing to people, and he describes some of the decentralized, grass-roots practices that have dealt with such issues surprisingly effectively—for example, robots.txt files and Wikipedia&amp;rsquo;s practices for resolving disputes.&lt;/p&gt;
&lt;p&gt;He does present a hopeful case for how the future can build on current work by technical people and legal scholars to prevent the looming corporate-controlled Internet. (One legal scholar he mentions is Pamela Samuelson, a member of the markup geek family if only &lt;a href=&#34;http://people.ischool.berkeley.edu/~glushko/&#34;&gt;by marriage&lt;/a&gt;.) I strongly recommend the book to geeks interested in relevant legal issues and to lawyers interested in Internet technology, because Zittrain lays out the explicit, implicit, and potential connections between these worlds so well.&lt;/p&gt;
&lt;p&gt;He&amp;rsquo;s made an online version of the book available under a Creative Commons license &lt;a href=&#34;http://www.jz.org&#34;&gt;at his web site&lt;/a&gt;. All his talk of people building on each other&amp;rsquo;s work gave me one nice idea: create a version of his footnotes in which court case citations are live links to publicly available versions of those cases. For example, a link from the citation in &lt;a href=&#34;http://yupnet.org/zittrain/notes-chapter-2#note-3&#34;&gt;footnote 3 of chapter 2&lt;/a&gt; of the book to the judge&amp;rsquo;s decision on &lt;a href=&#34;http://cases.justia.com/us-court-of-appeals/F2/238/266/247746&#34;&gt;238 F.2d 266&lt;/a&gt; at justia.com.&lt;/p&gt;
&lt;p&gt;One more thing for the to-do pile.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/book-reviews">book reviews</category>
      
    </item>
    
    <item>
      <title>How you can explore a new set of linked data</title>
      <link>https://www.bobdc.com/blog/how-you-can-explore-a-new-set/</link>
      <pubDate>Fri, 15 Aug 2008 09:50:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/how-you-can-explore-a-new-set/</guid>
      
      
      <description><div>Some great tips from Dean Allemang.</div><div>&lt;p&gt;Although he doesn&amp;rsquo;t describe it in linked data terms, a &lt;a href=&#34;http://dallemang.typepad.com/my_weblog/2008/08/rdf-as-self-describing-data.html&#34;&gt;recent posting&lt;/a&gt; from Dean Allemang has some great suggestions for how to dive into a set of SPARQL-accessible data you know nothing about in order to find out what&amp;rsquo;s there. If there&amp;rsquo;s cool stuff in the data set, this is a lot of fun. (Also check out the recent &lt;a href=&#34;http://blogs.talis.com/nodalities/2008/07/dean-allemang-talks-about-topquadrant-and-semantic-web-for-the-working-ontologist.php&#34;&gt;Talking with Talis&lt;/a&gt; with Dean, where he describes many examples of semantic web technology helping large organizations solve very real problems.)&lt;/p&gt;
&lt;p&gt;If someone gives you access to an SQL database, commands like &lt;code&gt;show databases&lt;/code&gt;, &lt;code&gt;use [database name]&lt;/code&gt;, &lt;code&gt;show tables&lt;/code&gt;, and &lt;code&gt;describe [table name]&lt;/code&gt; let you explore the data, even if you have no idea of its schema at first, but that&amp;rsquo;s a big &amp;ldquo;if&amp;rdquo;—there aren&amp;rsquo;t many large relational databases with useful data available over the public Internet waiting for you to issue SQL queries. There is a growing amount of linked data with SPARQL front ends, and Dean describes a few general-purpose SPARQL queries and a few more that build on the results to explore a set of data that you might know nothing about. He uses &lt;a href=&#34;https://www.bobdc.com/blog/querying-dbpedia&#34;&gt;dbpedia&lt;/a&gt; in his examples, so we know that his demonstration will work with a huge data set.&lt;/p&gt;
&lt;p&gt;Before recommending that everyone else go and try this, I thought I should try it myself on another data set whose structure I knew nothing about, so I went to Richard Cyganiak&amp;rsquo;s &lt;a href=&#34;http://richard.cyganiak.de/2007/10/lod/&#34;&gt;The Linking Open Data dataset cloud&lt;/a&gt; page (at the Linked Data Planet conference, pretty much everyone had a slide of this interactive diagram) to find another data set on which to try this out. Some servers were down, and some had RDF files to download that I could have queried against, but I ended up with the &lt;a href=&#34;http://www4.wiwiss.fu-berlin.de/gutendata/&#34;&gt;D2R Server for the Gutenberg Project&lt;/a&gt;, where I entered SPARQL queries at its &lt;a href=&#34;http://www4.wiwiss.fu-berlin.de/gutendata/snorql/&#34;&gt;SNORQL web-based front end&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As Dean suggested, I listed all the predicates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?p WHERE {?s ?p ?o}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I saw a lot of Dublin Core predicates, including dc:creator, dc:title, and dc:description. I used this query to list all the authors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?o where { ?s &amp;lt;http://purl.org/dc/elements/1.1/creator&amp;gt; ?o }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One of the values there was &amp;ldquo;db:people/Goethe_Johann_Wolfgang_von_1749-1832&amp;rdquo;, so I did the following to list his works in Project Gutenberg:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?title where {
  ?s &amp;lt;http://purl.org/dc/elements/1.1/creator&amp;gt; 
     &amp;lt;http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/Goethe_Johann_Wolfgang_von_1749-1832&amp;gt;;
     &amp;lt;http://purl.org/dc/elements/1.1/title&amp;gt; ?title.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I wondered about Project Gutenberg&amp;rsquo;s description of one title, &amp;ldquo;The Sorrows of Young Werther&amp;rdquo;, so I entered this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ?desc where {
  ?s &amp;lt;http://purl.org/dc/elements/1.1/title&amp;gt; &amp;quot;The Sorrows of Young Werther&amp;quot;;
     &amp;lt;http://purl.org/dc/elements/1.1/description&amp;gt; ?desc.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The answer is: &amp;ldquo;Translation of: Die Leiden des jungen Werther.&amp;rdquo; (The German version is also available—most of the Project Gutenberg Goethe texts are in German.)&lt;/p&gt;
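&lt;p&gt;Another generic query that pairs nicely with Dean&amp;rsquo;s predicate listing (this one is my own addition, not one of his) lists the classes that the data uses, which gives you a quick feel for what kinds of resources are in there:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type WHERE { ?s a ?type }
&lt;/code&gt;&lt;/pre&gt;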
&lt;p&gt;I could go on, and I certainly will try this with more sites that offer a SNORQL front end to a SPARQL interface. Like I said, it&amp;rsquo;s a lot of fun; check out Dean&amp;rsquo;s suggested queries, Richard&amp;rsquo;s suggested data sets, and try it yourself!&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://sourceforge.net/projects/meo/&#34; title=&#34;http://sourceforge.net/projects/meo/&#34;&gt;Colm Sean Murdoch O Cinneide.&lt;/a&gt; on &lt;a href=&#34;#comment-1977&#34;&gt;August 15, 2008 2:18 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I do not know what an information resource is. I have come to think of RDF as algebra &amp;ldquo;over&amp;rdquo; information resources. RDF writers should be barred from coining new URIs. I&amp;rsquo;ll stop pontificating and read the rest of this interesting material now ;-)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1978&#34;&gt;August 15, 2008 3:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;URIs (and sets of them packaged as ontologies) are a lot like source code: everyone agrees that re-use of existing ones is good, but instead of looking for some to re-use, people create their own and tell the world to re-use those. This is easier than tracking down existing well-designed URIs (or code) to re-use. That being said, what you need isn&amp;rsquo;t always out there, so sometimes you have to make up new URIs (or code).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-1979&#34;&gt;August 15, 2008 3:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s another way:&lt;/p&gt;
&lt;p&gt;1. Go to &lt;a href=&#34;http://dbpedia.org:8890/isparql&#34;&gt;http://dbpedia.org:8890/isparql&lt;/a&gt;&lt;br /&gt;
2. Go to &amp;ldquo;Advanced Tab&amp;rdquo; (just so you can paste in the query that follows)&lt;/p&gt;
&lt;p&gt;Query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;

SELECT DISTINCT *
FROM &amp;lt;http://dbpedia.org&amp;gt;
WHERE {
  ?s ?p ?o. ?o bif:contains &amp;quot;'Goethe_Johann_Wolfgang'&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;3. The results grid contains URIs; click on the URI for Wolfgang, and select the &amp;ldquo;Describe&amp;rdquo; option.&lt;/p&gt;
&lt;p&gt;You can also do it the other way round starting with this query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;

SELECT DISTINCT *
FROM &amp;lt;http://dbpedia.org&amp;gt;
WHERE {
  ?s ?p ?o. ?o bif:contains &amp;quot;'Goethe_Johann_Wolfgang'&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To see the visualization of the SPARQL Query click on the triple icon in the &amp;ldquo;Advanced&amp;rdquo; UI.&lt;/p&gt;
&lt;p&gt;As you explore the resulting graph, this visual query tool will construct SPARQL on the fly, and at each turn you can visualize the queries, etc.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>SKOS and SWOOP: how</title>
      <link>https://www.bobdc.com/blog/skos-and-swoop-how/</link>
      <pubDate>Tue, 12 Aug 2008 09:54:05 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/skos-and-swoop-how/</guid>
      
      
      <description><div>A step-by-step example.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/using-the-ontology-editing-too&#34;&gt;Last week&lt;/a&gt; I discussed the possibility of using the &lt;a href=&#34;http://code.google.com/p/swoop/&#34;&gt;SWOOP&lt;/a&gt; ontology editor and the W3C&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/2004/02/skos/&#34;&gt;SKOS&lt;/a&gt; standard to create taxonomies or thesaurii, and I promised to go into a little more detail about how to do so.&lt;/p&gt;
&lt;p&gt;(Again, I encourage those more familiar than I am with SKOS and these tools to correct me.) The file &lt;a href=&#34;http://www.w3.org/2004/02/skos/core/history/2006-04-18.rdf&#34;&gt;2006-04-18.rdf&lt;/a&gt; defines the SKOS Core Vocabulary. It defines some of the more sophisticated relationships that I described last week, such as broaderPartitive and narrowerInstantive, as deprecated properties with the owl:versionInfo message &amp;ldquo;This term has been moved to the &amp;lsquo;SKOS Extensions&amp;rsquo; vocabulary. See &lt;a href=&#34;http://www.w3.org/2004/02/skos/extensions/&#34;&gt;http://www.w3.org/2004/02/skos/extensions/&lt;/a&gt;&amp;rdquo;. I downloaded the &lt;a href=&#34;http://www.w3.org/2004/02/skos/extensions.rdf&#34;&gt;extensions ontology file&lt;/a&gt; and wrote &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/combineSkosOntology.xsl&#34;&gt;a little XSLT stylesheet&lt;/a&gt; to combine the core file and the extensions file and to remove the deprecated properties. Otherwise, when viewing the combined ontology in SWOOP, you would see properties like broaderPartitive listed twice: the deprecated version and the new version.&lt;/p&gt;
&lt;p&gt;To use this ontology with SWOOP to define a thesaurus, start up SWOOP and load the &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/skoscombo.rdf&#34;&gt;combined ontology&lt;/a&gt; created from 2006-04-18.rdf and extensions.rdf. Add a term (or, in OWL terms, add an Individual) such as &amp;ldquo;museum&amp;rdquo; by clicking SWOOP&amp;rsquo;s &amp;ldquo;Add I&amp;rdquo; button, which has its &amp;ldquo;I&amp;rdquo; inside of a little pink diamond. In the New Entity dialog box that appears, the default value for &amp;ldquo;Instance-of&amp;rdquo; is owl:Thing, but you&amp;rsquo;re working with SKOS, so pick the Concept class instead. Enter Museum as an ID, and then click &amp;ldquo;Add and Close&amp;rdquo;. Do the same to add the term &amp;ldquo;TheLouvre&amp;rdquo;. Remember not to include any space in this term&amp;rsquo;s ID name; you can add &amp;ldquo;The Louvre&amp;rdquo; as a Label for the term in the same dialog box if you like.&lt;/p&gt;
&lt;p&gt;After clicking &amp;ldquo;Add and Close&amp;rdquo; for TheLouvre, you&amp;rsquo;ll see the &amp;ldquo;Concise Format&amp;rdquo; tab for TheLouvre, where you can add some metadata about it: the fact that it has the relationship BroaderInstantive to Museum.&lt;/p&gt;
&lt;p&gt;To do this, first click &amp;ldquo;Add&amp;rdquo; next to &amp;ldquo;Object Assertions&amp;rdquo;. In the &amp;ldquo;Select Property&amp;rdquo; dialog box that appears, look at that long list of properties to choose from. This is the main reason to use SWOOP and SKOS together: the combination lets you create rich standardized metadata by simply picking names from lists like this. Click on BroaderInstantive and the &amp;ldquo;Select Prop[erty] &amp;amp; Proceed&amp;rdquo; button, then pick Museum from the list that appears. After you click the &amp;ldquo;Add and Close&amp;rdquo; button, you&amp;rsquo;ll see it reflected on the Concise Format information about TheLouvre:&lt;/p&gt;
&lt;img id=&#34;id202593&#34; src=&#34;https://www.bobdc.com/img/main/skosswoop1.jpg&#34; width=&#34;560px&#34; alt=&#34;SKOS screenshot&#34;/&gt;
&lt;p&gt;As I described last week, the Concise Format screen for Museum will have no mention of the term&amp;rsquo;s relationship to TheLouvre, but an automated way to add that is apparently not far off.&lt;/p&gt;
&lt;p&gt;To get some more ideas about the things that SWOOP can do with a SKOS file, download the &lt;a href=&#34;http://cain.ice.ucdavis.edu/thesauri/ismt.rdf&#34;&gt;Invasive Species Management Thesaurus&lt;/a&gt; from the &lt;a href=&#34;http://cain.ice.ucdavis.edu/&#34;&gt;California Information Node&lt;/a&gt;, load it into SWOOP, and look at the Concise Format tab for a few terms. They have a lot of metadata. The recent DevX article &lt;a href=&#34;http://www.devx.com/semantic/Article/38629&#34;&gt;Applying SKOS Concept Schemes&lt;/a&gt; also showed me that there are plenty of other aspects of SKOS for me to explore.&lt;/p&gt;
&lt;p&gt;Looking this over made me even more sure of something I wrote last week: once Pellet supports SPARQL CONSTRUCT queries, the combination of the SKOS ontology, SWOOP, and Pellet is going to be very useful for people working with taxonomies and thesauri.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
<title>Using the ontology editing tool SWOOP to edit taxonomies and thesauri</title>
      <link>https://www.bobdc.com/blog/using-the-ontology-editing-too/</link>
      <pubDate>Wed, 06 Aug 2008 10:23:50 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-the-ontology-editing-too/</guid>
      
      
<description><div>Hopefully, as a more powerful open source alternative to existing taxonomy packages.</div><div>&lt;p&gt;In the online course in taxonomy development that I took recently, we reviewed several popular taxonomy development tools. I found them to be expensive or to have clunky, dated interfaces, and was disappointed that the format most of these programs supported for storing saved work was either a proprietary binary format or something they just called &amp;ldquo;XML&amp;rdquo;. (I&amp;rsquo;m open to correction on any of these points.) &amp;ldquo;OK,&amp;rdquo; I wondered, &amp;ldquo;What XML?&amp;rdquo; Reviewing some samples of their exported XML, it was pretty easy to understand the structure by looking at the element names and container patterns, but I never saw any mention of a DTD, and I thought it would be ideal if there was a standard format that they could share.&lt;/p&gt;
&lt;p&gt;There is a standard format that they can share: &lt;a href=&#34;http://www.w3.org/2004/02/skos/&#34;&gt;SKOS&lt;/a&gt;, which provides an ontology (available as an OWL file &lt;a href=&#34;http://www.w3.org/2004/02/skos/core/history/2006-04-18.rdf&#34;&gt;here&lt;/a&gt;) that defines the kinds of relationships that taxonomists want to see in taxonomy or thesaurus development. This includes basic ones such as &amp;ldquo;narrower&amp;rdquo; and &amp;ldquo;broader&amp;rdquo; and more sophisticated variations on these such as &amp;ldquo;broaderPartitive&amp;rdquo; and &amp;ldquo;narrowerInstantive&amp;rdquo;. (A little background on these variations, featuring examples from the ANSI Z39 standard for controlled vocabularies that I &lt;a href=&#34;https://www.bobdc.com/blog/what-is-a-taxonomy&#34;&gt;wrote about&lt;/a&gt; recently: in a hierarchy of terms, we can qualify the relationship between a term in a tree and its parent by saying that the child node is narrowerInstantive, as the Louvre is an instance of a museum, or narrowerPartitive, as a brain stem is a part of a brain, or narrowerGeneric, as the class of parrots is a subclass of the class of birds. In addition to defining the taxonomy term relationship properties &amp;ldquo;broader&amp;rdquo; and &amp;ldquo;narrower&amp;rdquo;, SKOS defines instantive, partitive, and generic subproperties of &amp;ldquo;broader&amp;rdquo; and &amp;ldquo;narrower&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;If the SKOS standard lays out the potential relationships and provides a definition of these relationships in a standard syntax (OWL), and an open source GUI tool like &lt;a href=&#34;http://code.google.com/p/swoop/&#34;&gt;SWOOP&lt;/a&gt; can read that and let you define the terms and relationships in a new thesaurus by pointing and clicking, then the most difficult part of providing a new alternative to the well-known taxonomy tools is already done, right? Well, not quite. There are two key things missing, but we&amp;rsquo;ll see them both available for SWOOP use in time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Thesaurus editors usually offer a series of canned reports about the terms and relationships within a given thesaurus—the kinds of reports that taxonomists want to see as they perform their work. A little Python code to read a SKOS-based thesaurus and then sort and summarize its contents would be simple to write.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;As far as I could tell, intelligence about inverse relationships is not built into SWOOP. For example, let&amp;rsquo;s say I read the SKOS ontology into SWOOP and create Individuals (or, in object-oriented terms, instances) of the terms &amp;ldquo;The Louvre&amp;rdquo; and &amp;ldquo;museum&amp;rdquo;. Then, I use the appropriate SWOOP features to indicate that The Louvre has the relationship broaderInstantive to museum, because The Louvre is an instance of the class museum. I&amp;rsquo;d like to then go to SWOOP&amp;rsquo;s panel for museum and see &amp;ldquo;The Louvre&amp;rdquo; listed as having a narrowerInstantive relationship to this term I&amp;rsquo;m reading about, but I won&amp;rsquo;t.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
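&lt;p&gt;As a rough illustration of the first missing piece, here is what such a canned report might look like in Python. The triples are hard-coded for the example; a real script would first pull them out of the saved SKOS file:&lt;/p&gt;

```python
# Sketch of a canned thesaurus report: given (subject, property, object)
# triples from a SKOS-based thesaurus, count the terms and list each
# term's broader-term relationships. The triples here are illustrative.
from collections import defaultdict

triples = [
    ("TheLouvre", "broaderInstantive", "Museum"),
    ("BrainStem", "broaderPartitive", "Brain"),
    ("Parrot", "broaderGeneric", "Bird"),
]

terms = sorted({term for s, _, o in triples for term in (s, o)})
broader = defaultdict(list)
for s, p, o in triples:
    broader[s].append((p, o))

print(f"{len(terms)} terms")  # 6 terms
for term in terms:
    for prop, parent in sorted(broader.get(term, [])):
        print(f"  {term} --{prop}--> {parent}")
```

Sorting and summarizing along these lines would reproduce the kinds of term listings that commercial thesaurus editors generate.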
&lt;p&gt;My first idea was to tell SWOOP to save the ontology file with these relationships and instances, then use &lt;a href=&#34;http://pellet.owldl.com/&#34;&gt;Pellet&lt;/a&gt; to turn the implicit relationships in the file (for example, that museum has a narrowerInstantive relationship to The Louvre) into explicit ones written right out in the same file, and then read that file with all the spelled-out relationships back into SWOOP, but apparently Pellet isn&amp;rsquo;t quite there yet. A SPARQL query delivered via Pellet can pull out explicit and implicit triples, but not in a syntax that can be used for an RDF/OWL file. I saw on the Pellet &lt;a href=&#34;http://lists.owldl.com/pipermail/pellet-users/2008-July/002911.html&#34;&gt;mailing list&lt;/a&gt; that the next version would support SPARQL &lt;a href=&#34;http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/#construct&#34;&gt;CONSTRUCT&lt;/a&gt; queries that let you create a new set of RDF around the returned triples, so that will help.&lt;/p&gt;
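&lt;p&gt;The inference I want from the reasoner is simple to state: each narrower* property is the inverse of the corresponding broader* property, so a reasoner can write out the missing half of each pair explicitly. A toy Python version of that materialization step, with the property names taken from the SKOS extensions discussed above and the sample triple invented for illustration:&lt;/p&gt;

```python
# Sketch of the inference discussed above: each narrower* property is the
# inverse of the corresponding broader* property, so for every asserted
# broader* triple we can materialize the explicit inverse triple.
INVERSES = {
    "broaderInstantive": "narrowerInstantive",
    "broaderPartitive": "narrowerPartitive",
    "broaderGeneric": "narrowerGeneric",
}

def materialize_inverses(triples):
    """Return the input triples plus the explicit inverse of each one."""
    out = set(triples)
    for s, p, o in triples:
        if p in INVERSES:
            out.add((o, INVERSES[p], s))
    return out

asserted = {("TheLouvre", "broaderInstantive", "Museum")}
print(sorted(materialize_inverses(asserted)))
# [('Museum', 'narrowerInstantive', 'TheLouvre'),
#  ('TheLouvre', 'broaderInstantive', 'Museum')]
```

Reading the materialized file back into SWOOP would then show "TheLouvre" on Museum's panel, which is exactly the round trip described above.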
&lt;p&gt;Describing all this here, I can casually refer to the use of SWOOP to read an ontology file and then define individuals and their taxonomic relationships, but I&amp;rsquo;d like to spell out in more detail how I used SWOOP to do this. My family is about to head out for a summer beach vacation, so instead of postponing the completion of a great big posting on all this, I&amp;rsquo;m making this overview part 1, and I will describe the hands-on part in part 2 sometime next week.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.cs.rpi.edi/~hendler&#34; title=&#34;http://www.cs.rpi.edi/~hendler&#34;&gt;Jim Hendler&lt;/a&gt; on &lt;a href=&#34;#comment-1964&#34;&gt;August 6, 2008 1:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, Pellet will do this, but not with the default SWOOP plugin - the online version of Pellet (I think in the code fork at Google) will export an ontology with the additions - problem is SWOOP is no longer under real development (but please feel free to contribute open source and keep it running - there&amp;rsquo;s a large user community who would love to see this happen) - the pellet at owldl.org has a lot of stuff in it the original one didn&amp;rsquo;t (and the incremental Pellet developed by Chris Halaschek-Weiner is an incredible improvement) so there&amp;rsquo;s been a lot of code splitting and such since it left Maryland &amp;ndash; sorry about that&amp;hellip;&lt;br /&gt;
-Jim H.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/weblog/&#34; title=&#34;http://clarkparsia.com/weblog/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-1965&#34;&gt;August 6, 2008 6:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Actually, Bob, Pellet can perfectly well do what you want, only not in yr preferred way. As Evren said on the pellet-users list in response to yr question, you can get what you want by writing some Java, but the command-line interface doesn&amp;rsquo;t support SPARQL CONSTRUCT queries presently. It will in the next release, due soon now.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1968&#34;&gt;August 8, 2008 10:01 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Jim&amp;ndash;I was running the most recent version of Pellet (downloaded from the website I linked to) from the command line. I will play some more.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1969&#34;&gt;August 8, 2008 10:06 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Kendall&amp;ndash;I&amp;rsquo;m too lazy to write the Java code. I&amp;rsquo;ll just wait.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/skos">SKOS</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>DevX article &#34;Relational Database Integration with RDF/OWL&#34;</title>
      <link>https://www.bobdc.com/blog/devx-article-relational-databa/</link>
      <pubDate>Wed, 30 Jul 2008 10:26:33 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/devx-article-relational-databa/</guid>
      
      
      <description><div>Summarizing and demonstrating the use of relational databases and OWL metadata together to get more out of the databases.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.devx.com/semantic/Article/38700&#34;&gt;&lt;img id=&#34;id202483&#34; src=&#34;http://assets.devx.com/articleicons/19303.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;green OWL image&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.devx.com&#34;&gt;DevX.com&lt;/a&gt; has just published my article &lt;a href=&#34;http://www.devx.com/semantic/Article/38700&#34;&gt;Relational Database Integration with RDF/OWL&lt;/a&gt;, which summarizes and updates some things I&amp;rsquo;ve written about here and in my paper for the XML 2006 conference. The article describes how to take a Eudora-based address book database and an Outlook-based one, load them into MySQL, add some OWL metadata, and then issue SPARQL queries to answer reasonably real-world questions that couldn&amp;rsquo;t have been answered without the additional metadata.&lt;/p&gt;
&lt;p&gt;Except for the use of Outlook and Eudora to create the data models, I did the whole thing with free software that runs on both Windows and Linux. Besides MySQL, prominently featured tools included the &lt;a href=&#34;http://d2rq.org/&#34;&gt;D2RQ&lt;/a&gt; RDF interface to relational database managers, the &lt;a href=&#34;http://code.google.com/p/swoop/&#34;&gt;SWOOP&lt;/a&gt; open-source ontology editor, and the &lt;a href=&#34;http://pellet.owldl.com/&#34;&gt;Pellet&lt;/a&gt; OWL reasoner to perform SPARQL queries that take the OWL metadata into account when pulling answer sets.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;databases&amp;rdquo; being integrated are pretty simple, each being a single table, but unlike some of the more obscure domains that have seen some cool ontology work, address book data is something that everyone can relate to. I hope the article can help more people see how OWL-based metadata can help apps get more out of what might otherwise seem like typical, everyday databases.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>What is a taxonomy?</title>
      <link>https://www.bobdc.com/blog/what-is-a-taxonomy/</link>
      <pubDate>Fri, 11 Jul 2008 17:07:05 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/what-is-a-taxonomy/</guid>
      
      
      <description><div>A standard definition.</div><div>&lt;p&gt;There are many terms that people can&amp;rsquo;t agree on. The great thing about standards is that even when everyone doesn&amp;rsquo;t agree about definitions included in those standards, these definitions provide a common baseline for everyone to work from.&lt;/p&gt;
&lt;p&gt;After hearing many definitions of the word &amp;ldquo;taxonomy&amp;rdquo;, I was pleased to discover the ANSI/NISO Z39.19 standard, &lt;a href=&#34;http://www.niso.org/kst/reports/standards?step=2&amp;amp;gid=None&amp;amp;project_key=7cc9b583cb5a62e8c15d3099e0bb46bbae9cf38a&#34;&gt;Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies&lt;/a&gt;, a specification that among other things defines the terminology for several classes of controlled vocabularies, including &amp;ldquo;taxonomy&amp;rdquo;. (It even defines the term &amp;ldquo;term&amp;rdquo;!) It does a great job of putting the term &amp;ldquo;taxonomy&amp;rdquo; in the right context of related terms such as &amp;ldquo;controlled vocabulary&amp;rdquo; and &amp;ldquo;thesaurus&amp;rdquo;, but not, unfortunately, the term &amp;ldquo;ontology&amp;rdquo;. More on this below; first, I&amp;rsquo;ll paste a few handy quotations.&lt;/p&gt;
&lt;blockquote id=&#34;id202592&#34; class=&#34;pullquote&#34;&gt;&#34;A taxonomy is a controlled vocabulary consisting of preferred terms, all of which are connected in a hierarchy or polyhierarchy&#34;.&lt;/blockquote&gt;
&lt;p&gt;From section 2.5, &amp;ldquo;Maintenance&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A controlled vocabulary can be as simple as a short list of terms or as complex as a thesaurus containing tens of thousands of terms with a complex hierarchical structure and many different types of relationships among the terms.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From section 4.1, &amp;ldquo;Definitions&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;controlled vocabulary&lt;/strong&gt; A list of terms that have been enumerated explicitly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;taxonomy&lt;/strong&gt; A collection of controlled vocabulary terms organized into a hierarchical structure. Each term in a taxonomy is in one or more parent/child (broader/narrower) relationships to other terms in the taxonomy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;thesaurus&lt;/strong&gt; A controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators. Relationship indicators should be employed reciprocally.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From section 5.4, &amp;ldquo;Structure&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are four different types of controlled vocabularies, determined by their increasingly complex structure. These are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;List&lt;/li&gt;
&lt;li&gt;Synonym ring&lt;/li&gt;
&lt;li&gt;Taxonomy&lt;/li&gt;
&lt;li&gt;Thesaurus&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;From section 5.4.1 &amp;ldquo;List&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A list (also sometimes called a pick list) is a limited set of terms arranged as a simple alphabetical list or in some other logically evident way. Lists are used to describe aspects of content objects or entities that have a limited number of possibilities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From section 5.4.2 &amp;ldquo;Synonym Ring&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While a synonym ring is considered to be a type of controlled vocabulary, it plays a somewhat different role than the other types covered by this Standard. Synonym rings cannot be used during the indexing process. Rather, they are used only during retrieval. Use of synonym rings ensures that a concept that can be described by multiple synonymous or equivalent terms will be retrieved if any one of the terms is used in a search.&lt;/p&gt;
&lt;/blockquote&gt;
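&lt;p&gt;The retrieval-time behavior that the standard describes here can be sketched in a few lines of Python; the ring and the tiny document index below are invented for illustration:&lt;/p&gt;

```python
# Sketch of retrieval-time synonym-ring expansion as described above:
# any term in a ring retrieves content indexed under any of its synonyms.
# The ring and the tiny "index" are invented for illustration.
RINGS = [{"car", "auto", "automobile"}]

index = {
    "doc1": {"automobile", "safety"},
    "doc2": {"train", "safety"},
}

def expand(term):
    """Return the term plus every synonym in its ring, if it has one."""
    for ring in RINGS:
        if term in ring:
            return ring
    return {term}

def search(term):
    """Return the documents matching the term or any of its synonyms."""
    query = expand(term)
    return sorted(doc for doc, words in index.items() if query & words)

print(search("car"))  # ['doc1']
```

Note that the expansion happens only at query time, which is why the standard says synonym rings are used during retrieval rather than indexing.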
&lt;p&gt;From section 5.4.3, &amp;ldquo;Taxonomy&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A taxonomy is a controlled vocabulary consisting of preferred terms, all of which are connected in a hierarchy or polyhierarchy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From section 5.4.4, &amp;ldquo;Thesaurus&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A thesaurus is a controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From section 8.3, &amp;ldquo;Hierarchical Relationships&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The use of hierarchical relationships is the primary feature that distinguishes a taxonomy or thesaurus from other, simple forms of controlled vocabularies such as lists and synonym rings.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From section 2.1, &amp;ldquo;Applying the Standard&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This Standard does not cover numerical classification schemes (except as they correlate to topics such as Dewey, for example), ontologies or semantic networks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(It actually does include a section on semantic networks.) The &amp;ldquo;standardized relationship indicators&amp;rdquo; mentioned in section 5.4.4 are typically things like &amp;ldquo;Broader Term&amp;rdquo; to show the relationship between, for example, the terms &amp;ldquo;collies&amp;rdquo; and &amp;ldquo;dogs&amp;rdquo;. This broader/narrower relationship is about the only relationship that a taxonomy tree represents; a thesaurus can show other relationships that one term can have to another—for example, it can be a related term, a preferred term, or a non-preferred term.&lt;/p&gt;
&lt;p&gt;Some research I&amp;rsquo;ve been doing lately, including an &lt;a href=&#34;http://www.hedden-information.com/course-simmons-taxonomies-online.htm&#34;&gt;online course&lt;/a&gt; in &amp;ldquo;Taxonomies and Controlled Vocabularies&amp;rdquo;, gives me the impression that most of what taxonomists do is develop thesauri, not taxonomies. I guess calling themselves &amp;ldquo;thesaurists&amp;rdquo; would sound a bit odd, and the term &amp;ldquo;thesaurus&amp;rdquo; conjures up images of the &lt;a href=&#34;http://www.amazon.com/Rogets-International-Thesaurus-Barbara-Kipfer/dp/0060935448&#34;&gt;Roget book&lt;/a&gt; that our teachers told us about as teenagers if we overused any words in the papers we handed in.&lt;/p&gt;
&lt;p&gt;We saw above that a thesaurus uses &amp;ldquo;standardized relationship indicators&amp;rdquo;. I&amp;rsquo;ve described ontologies to people as being like taxonomies, except that you (or more likely, people in your field) get to make up new, specialized relationships beyond those standardized for thesauri. For example, in legal publishing, a higher court ruling could have the relationship property &amp;ldquo;cite&amp;rdquo; to a lower court ruling, with potential values such as &amp;ldquo;overturns&amp;rdquo; or &amp;ldquo;affirms&amp;rdquo;. According to the &lt;a href=&#34;http://www.w3.org/TR/2004/REC-webont-req-20040210/&#34;&gt;OWL Use Cases and Requirements&lt;/a&gt;, which I &lt;a href=&#34;https://www.bobdc.com/blog/some-great-w3c-explanations-of&#34;&gt;wrote about&lt;/a&gt; last August, &amp;ldquo;The word ontology has been used to describe artifacts with different degrees of structure. These range from simple taxonomies (such as the Yahoo hierarchy), to metadata schemes (such as the Dublin Core), to logical theories&amp;rdquo;. This describes a taxonomy as a simpler version of an ontology, so it makes sense to me to add &amp;ldquo;ontology&amp;rdquo; as a fifth level to the four types of controlled vocabulary shown above.&lt;/p&gt;
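&lt;p&gt;A toy Python version of that distinction, with every vocabulary name invented for illustration: a thesaurus restricts you to standardized relationship indicators, while an ontology lets a field add its own, such as the legal citation relationships just mentioned:&lt;/p&gt;

```python
# Sketch of the distinction above: a thesaurus allows only standardized
# relationship indicators, while an ontology can add domain-specific ones,
# e.g. a higher court ruling that "overturns" or "affirms" a lower one.
# All vocabulary names here are invented for illustration.
THESAURUS_RELS = {"broader", "narrower", "related"}
LEGAL_ONTOLOGY_RELS = THESAURUS_RELS | {"overturns", "affirms"}

def relate(subject, rel, obj, allowed):
    """Record a relationship triple, rejecting indicators not in `allowed`."""
    if rel not in allowed:
        raise ValueError(f"{rel!r} is not a recognized relationship indicator")
    return (subject, rel, obj)

# Allowed under the legal ontology, but not under the plain thesaurus:
triple = relate("HigherCourtRuling", "overturns", "LowerCourtRuling",
                LEGAL_ONTOLOGY_RELS)
print(triple)  # ('HigherCourtRuling', 'overturns', 'LowerCourtRuling')
```

The same call with THESAURUS_RELS as the allowed set would raise a ValueError, which is the whole point: the ontology's richer relationship vocabulary is a superset of the thesaurus's.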
&lt;p&gt;If ontologies are a potentially more complex class of taxonomy, then knowledge of taxonomy development can help ontology development, and vice versa. And, I&amp;rsquo;ve got some ideas about the use of ontology development tools to develop taxonomies and thesauri that I&amp;rsquo;ll be writing about here shortly.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By Erik Hennum on &lt;a href=&#34;#comment-1943&#34;&gt;July 11, 2008 8:30 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Perhaps along the lines of the ontology spectrum from&lt;br /&gt;
&lt;a href=&#34;http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm&#34;&gt;Ontologies come of age&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.webcomposite.com&#34; title=&#34;http://www.webcomposite.com&#34;&gt;Jim Fuller&lt;/a&gt; on &lt;a href=&#34;#comment-1945&#34;&gt;July 12, 2008 9:54 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;nice bit of synchronicity &amp;hellip; I was just yesterday looking for a reasonable definition of taxonomy and look what dropped into my reader &amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.thodla.com&#34; title=&#34;http://www.thodla.com&#34;&gt;Dorai Thodla&lt;/a&gt; on &lt;a href=&#34;#comment-1947&#34;&gt;July 14, 2008 10:45 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice description. I will go and read both the standards documents. A nice discussion would be about various classification and tagging (folksonomy) and faceted classification and discuss some applications.&lt;/p&gt;
&lt;p&gt;There are bits of Semantic Web Technologies (and standards efforts) and related area that may incrementally improve how we gather, classify/view and consume information as well.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>XForms &#43; REST &#43; XQuery (&#43; Jenni Tennison)</title>
      <link>https://www.bobdc.com/blog/xforms-rest-xquery-jenni-tenni/</link>
      <pubDate>Mon, 07 Jul 2008 07:17:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xforms-rest-xquery-jenni-tenni/</guid>
      
      
      <description><div>New, standards-based ways to build cool applications.</div><div>&lt;p&gt;As a new application development architecture stack complete with its own cryptic acronym, &lt;a href=&#34;http://datadictionary.blogspot.com/2007/12/introducing-xrx-architecture.html&#34;&gt;XRX&lt;/a&gt; (XForms/REST/XQuery) is a good example of &amp;ldquo;sounds promising, but I don&amp;rsquo;t know when I&amp;rsquo;ll have a chance to dig deeper&amp;rdquo;. So, I was very happy to hear that Jeni Tennison is digging deeper and &lt;a href=&#34;http://news.oreilly.com/2008/07/xrx-xqueries-in-exist.html&#34;&gt;reporting on her findings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;She&amp;rsquo;s using the &lt;a href=&#34;http://exist.sourceforge.net/&#34;&gt;eXist&lt;/a&gt; XQuery engine (which I once &lt;a href=&#34;http://www.xml.com/pub/a/2006/06/21/scaling-up-with-xquery-part-2.html?page=1&#34;&gt;wrote about&lt;/a&gt; in XML.com) and the &lt;a href=&#34;http://www.orbeon.com/&#34;&gt;Orbeon&lt;/a&gt; XForms engine, which apparently bundles eXist. eXist may never catch up with MarkLogic in features and performance, but hey, it&amp;rsquo;s open source, and seems to be progressing nicely from release to release. eXist and MarkLogic provide a great example of the value of standards in general and XQuery in particular, because the combination lets you develop a standards-compliant proof-of-concept application with completely free software and then scale up with a commercial platform once you&amp;rsquo;ve proved your concept.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve &lt;a href=&#34;http://www.xml.com/pub/a/2003/12/30/xforms.html&#34;&gt;dabbled&lt;/a&gt; with XForms implementations on and off over the years, and I&amp;rsquo;ve been a little disappointed, but only a little, because I saw progress. Orbeon looks like even more progress, so I look forward to hearing more from Jeni about her experiments. Considering how active new MarkLogic employee Micah Dubinko has always been in XForms work, perhaps we&amp;rsquo;ll see some interesting XRX work from him as well in the future. (And, if you&amp;rsquo;re interested in XQuery, don&amp;rsquo;t miss his co-worker Norm Walsh&amp;rsquo;s &lt;a href=&#34;http://norman.walsh.name/2008/07/02/xquery&#34;&gt;reports&lt;/a&gt; on getting to know XQuery.)&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://piershollott.blogspot.com&#34; title=&#34;http://piershollott.blogspot.com&#34;&gt;piers&lt;/a&gt; on &lt;a href=&#34;#comment-1935&#34;&gt;July 7, 2008 1:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;the combination lets you develop a standards-compliant proof-of-concept application with completely free software and then scale up with a commercial platform once you&amp;rsquo;ve proved your concept.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;Yes, that&amp;rsquo;s exactly right. Exist and MarkLogic Server complement each other in a way that is definitely more than the sum of the parts, and both are very good reasons to get behind XQuery. The great thing about eXist is that you can use it for a small project, a great way to play around with the technology without committing a lot of time or expense. Not sure how Orbeon fits into that, so I was also happy to read Jeni&amp;rsquo;s article(s).&lt;/p&gt;
&lt;p&gt;pretty sure it&amp;rsquo;s Jeni with one &amp;rsquo;n&amp;rsquo; in the title, but I love the &amp;ldquo;jenni_tenni&amp;rdquo; in the url, which sounds like a swedish 70&amp;rsquo;s pop band.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>The (SGML) geekiest shirt ever</title>
      <link>https://www.bobdc.com/blog/the-sgml-geekiest-shirt-ever/</link>
      <pubDate>Tue, 01 Jul 2008 08:21:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-sgml-geekiest-shirt-ever/</guid>
      
      
      <description><div>&#34;We&#39;re all special characters&#34;.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.thinkgeek.com/tshirts/generic/a878/?ref=c&#34;&gt;&lt;img id=&#34;id202507&#34; src=&#34;http://www.thinkgeek.com/images/products/front/i_heart_iso_8879.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A few years ago at the Oxford XML Summer School, at an outdoor dinner at the &lt;a href=&#34;http://www.botanic-garden.ox.ac.uk/Garden/oxfordbotanicgar.html&#34;&gt;Oxford Botanic Garden&lt;/a&gt;, I saw that &lt;a href=&#34;http://www.xmlgrrl.com/blog/&#34;&gt;Eve Maler&amp;rsquo;s&lt;/a&gt; T-shirt said &amp;ldquo;&amp;lt;geek&amp;gt;&amp;rdquo; on it. I couldn&amp;rsquo;t resist pointing out to her that, with its lone start-tag, her shirt was not well-formed. She took off her jacket to show me the back, which said &amp;ldquo;&amp;lt;/geek&amp;gt;&amp;rdquo;. I stood corrected.&lt;/p&gt;
&lt;p&gt;The ThinkGeek web site has just come out with a T-shirt that&amp;rsquo;s even worse: it&amp;rsquo;s so markup geeky that most XML geeks won&amp;rsquo;t get it. It says &amp;ldquo;I &amp;amp;#9829; ISO 8879&amp;rdquo;, referring to SGML, the ISO standard of which XML is a simplified version. (I first met Eve, and many other well-known markup geeks, at an SGML conference before XML was invented.) The &amp;ldquo;&amp;amp;#9829;&amp;rdquo; part is the numeric character reference for a heart symbol. Get it?&lt;/p&gt;
&lt;p&gt;To make it even more obscure, the &lt;a href=&#34;http://www.thinkgeek.com/tshirts/generic/a878/?ref=c&#34;&gt;ThinkGeek webpage for the shirt&lt;/a&gt; &lt;em&gt;doesn&amp;rsquo;t even mention SGML&lt;/em&gt;. Some may see a clue in its reference to this ISO standard &amp;ldquo;setting the groundwork for XML&amp;rdquo;, but I think that very few people are going to buy this T-shirt.&lt;/p&gt;
&lt;p&gt;And they&amp;rsquo;ll all be at &lt;a href=&#34;http://www.balisage.net/index.htm&#34;&gt;Balisage&lt;/a&gt; in August.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://plasmasturm.org/&#34; title=&#34;http://plasmasturm.org/&#34;&gt;Aristotle Pagaltzis&lt;/a&gt; on &lt;a href=&#34;#comment-1927&#34;&gt;July 1, 2008 12:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Of all the ultra-geek shirts I have seen, this one is my perpetual favourite:&lt;br /&gt;
&lt;a href=&#34;http://www.flickr.com/photos/jedwards/89064330/&#34;&gt;http://www.flickr.com/photos/jedwards/89064330/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.adjb.net/&#34; title=&#34;http://www.adjb.net/&#34;&gt;Alex Brown&lt;/a&gt; on &lt;a href=&#34;#comment-1928&#34;&gt;July 1, 2008 2:29 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;But numeric character referencing wasn&amp;rsquo;t part of 8879 (was it? - I forget).&lt;/p&gt;
&lt;p&gt;Though I suppose an XML person could love SGML. Especially now that it&amp;rsquo;s not encountered much ;-)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alex.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By &lt;a href=&#34;http://dl.ziza.ru/other/052008/12/pics/015_pics.jpg&#34; title=&#34;http://dl.ziza.ru/other/052008/12/pics/015_pics.jpg&#34;&gt;Forget It&lt;/a&gt; on &lt;a href=&#34;#comment-1929&#34;&gt;July 1, 2008 4:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Check this birthday cake:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://dl.ziza.ru/other/052008/12/pics/015_pics.jpg&#34;&gt;http://dl.ziza.ru/other/052008/12/pics/015_pics.jpg&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.bortzmeyer.org/&#34; title=&#34;http://www.bortzmeyer.org/&#34;&gt;Stéphane Bortzmeyer&lt;/a&gt; on &lt;a href=&#34;#comment-1931&#34;&gt;July 4, 2008 10:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Why does the cake use the attribute &amp;ldquo;code&amp;rdquo; instead of the more standard &amp;ldquo;xml:lang&amp;rdquo;?&lt;/p&gt;
&lt;p&gt;(Answer: because the language identifiers used as values of this attribute do not have the proper syntax?)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>A successful Linked Data Planet conference</title>
      <link>https://www.bobdc.com/blog/a-successful-linked-data-plane/</link>
      <pubDate>Tue, 24 Jun 2008 08:24:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-successful-linked-data-plane/</guid>
      
      
<description><div>Plenty of people, plenty of great talks and interaction.</div><div>&lt;p&gt;The first ever Linked Data conference, sponsored by Jupiter Media at New York City&amp;rsquo;s Roosevelt hotel last Tuesday and Wednesday, was great. I didn&amp;rsquo;t give any talks, but as co-chair I put together the program with Ken North, and with his throat bothering him a bit, I did most of the speaker introductions. I also moderated the &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T8&#34;&gt;Linked Data Workshop&lt;/a&gt; panel, a session that provided the audience with a good range of perspectives on some difficult Linked Data application development issues such as data access control and the use of distributed versus aggregated data.&lt;/p&gt;
&lt;blockquote id=&#34;id202510&#34; class=&#34;pullquote&#34;&gt;The Linked Data movement is just the latest step in this long process that began almost fifty years ago of separating data further and further from the programs that create and maintain it so that other programs can use it, opening up new possibilities along the way.&lt;/blockquote&gt;
&lt;p&gt;In addition to seeing some old friends, I got to finally meet many people I had only known by email or reputation, such as Kingsley Idehen (who first had the idea for the conference), Seth Earley, Christine Connors, Taylor Cowan, Andy Seaborne, Barak Pridor, Ashok Malhotra, Dean Allemang, Jim Hendler, Paul Miller (the voice of the &lt;a href=&#34;http://talk.talis.com/&#34;&gt;Talking with Talis&lt;/a&gt; podcasts, whom I&amp;rsquo;ve heard interview most of the people named in this sentence), and especially Tim Berners-Lee, whose keynote Tuesday evening was quite an inspirational pep talk for the Linked Data movement. Unlike some other big name keynoters I&amp;rsquo;ve seen, he didn&amp;rsquo;t just fly in, give his talk, and fly out; he attended and closely followed a great many talks. (In fact, when a group of conference organizers and keynote speakers met in the hotel bar before heading off to a prearranged dinner, we couldn&amp;rsquo;t find the main keynote speaker, and it turned out that he was attending an evening panel on the business possibilities of the semantic web. Luckily, he managed to catch up with us.)&lt;/p&gt;
&lt;p&gt;An interesting point in his &lt;a href=&#34;http://www.internetnews.com/dev-news/article.php/3753646/Sir+Tim+Talks+Up+Linked+Open+Data+Movement.htm&#34;&gt;keynote&lt;/a&gt; was how the original web made computers less important, and documents more important—for example, if I want to see a given spec or read a particular story, I don&amp;rsquo;t care what computer it&amp;rsquo;s on—and how &lt;a href=&#34;http://www.w3.org/DesignIssues/LinkedData&#34;&gt;Linked Data principles&lt;/a&gt; will help make documents less important and data more important. I didn&amp;rsquo;t understand this at first, but then realized that if I want to know a convenient start time for a movie or some good flights to get me from one city to another, I don&amp;rsquo;t care what document these facts are stored on; I just want the data. In fact, I want to easily find or write a program that grabs that data without relying on a proprietary data format or scraped HTML. The Linked Data movement is getting us there.&lt;/p&gt;
&lt;p&gt;Uche Ogbuji&amp;rsquo;s talk on &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T4&#34;&gt;Linked Data: The Real Web 2.0&lt;/a&gt; described something interesting that he didn&amp;rsquo;t mention in his &lt;a href=&#34;https://www.bobdc.com/blog/an-interview-with-uche-ogbuji&#34;&gt;interview here&lt;/a&gt; shortly before the conference: his company &lt;a href=&#34;http://www.zepheira.com/&#34;&gt;Zepheira&lt;/a&gt;&amp;rsquo;s concept of &amp;ldquo;Linked Enterprise Data&amp;rdquo;, or LED. With so much Linked Open Data talk out there about the value of freely sharing data across the web, this uses the same principles for different means: easier sharing of data across the silos behind a firewall through the use of Linked Data principles. As one of Uche&amp;rsquo;s slides put it, &amp;ldquo;Rather than the ERP-type play to replace legacy apps with a centralized super-model, LED focuses on wrapping and exposing data in those apps&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;I like this for two reasons. First, it&amp;rsquo;s a nice application of the database tuning world&amp;rsquo;s appropriation of the classic political advice &amp;ldquo;think globally, act locally&amp;rdquo;. Second, it works toward using Linked Data principles to solve current business needs instead of just treating it as some variation on the semantic web that could lead to cool applications.&lt;/p&gt;
&lt;p&gt;Ken&amp;rsquo;s opening welcome address helped put Linked Data into the perspective of the long-term history of computing (Ken &lt;a href=&#34;http://ourworld.compuserve.com/homepages/Ken_North/db_hall.htm&#34;&gt;knows a lot&lt;/a&gt; about this stuff), and this gave me a new insight: the original &amp;ldquo;data base&amp;rdquo; managers (or, even, in those days, &amp;ldquo;data bank&amp;rdquo; programs) were a step forward in the history of computing because they separated the data from the application that created and used that data so that other programs could use it. The Linked Data movement is just the latest step in this process that began almost fifty years ago of separating data further and further from the programs that create and maintain it so that other programs can use it, opening up new possibilities along the way.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1921&#34;&gt;June 24, 2008 10:30 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sounds like it went extremely well; I&amp;rsquo;m sorry I couldn&amp;rsquo;t participate.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>An interview with Seth Earley about Linked Data</title>
      <link>https://www.bobdc.com/blog/an-interview-with-seth-earley/</link>
      <pubDate>Fri, 13 Jun 2008 15:45:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/an-interview-with-seth-earley/</guid>
      
      
      <description><div>The role that taxonomies can play in Linked Data applications.</div><div>&lt;p&gt;&lt;em&gt;&lt;a href=&#34;http://www.earley.com/&#34;&gt;Earley &amp;amp; Associates&lt;/a&gt; is one of the biggest names in taxonomy development, and founder Seth Earley will be giving a talk on &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T7&#34;&gt;Building a Practical Semantic Framework: The role of taxonomies and controlled vocabularies in data integration&lt;/a&gt; at the Linked Data Planet conference next week. My recent reading makes the world of taxonomy development look a lot more mature than the ontology development that plays such a significant role in the semantic web, especially in terms of identifying concepts and relationships in a way that helps businesses achieve specific goals. I interviewed Seth via email to learn more about his company and their relationship to the burgeoning world of Linked Data techniques and practices. (As a side note about taxonomies and Linked Data, I recently learned from &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen/?id=1384&#34;&gt;Kingsley Idehen&amp;rsquo;s blog&lt;/a&gt; about a very interesting Linked Data application of one of the most important taxonomies in the US: the Library of Congress Subject Headings. If you follow the links in his bulleted list, remember to do a View Source on them.)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;1. Tell me a little about your company.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Earley &amp;amp; Associates delivers consulting and applications development services that help companies leverage internal expertise and knowledge creating capabilities. We specialize in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enterprise taxonomy development&lt;/li&gt;
&lt;li&gt;Content management &amp;amp; Knowledge management&lt;/li&gt;
&lt;li&gt;Technology advisory&lt;/li&gt;
&lt;li&gt;Search strategy &amp;amp; integration&lt;/li&gt;
&lt;li&gt;Change management &amp;amp; governance&lt;/li&gt;
&lt;li&gt;Training &amp;amp; workshops&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are a small company of around 15 full time consultants but we work with all sizes and types of organizations. Some of our recent clients include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Motorola&lt;/li&gt;
&lt;li&gt;The Hartford&lt;/li&gt;
&lt;li&gt;The Ford Foundation&lt;/li&gt;
&lt;li&gt;Hasbro Inc.&lt;/li&gt;
&lt;li&gt;The Coca Cola Company&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are recognized within the industry as thought leaders and many of our consultants speak regularly at conferences and workshops including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enterprise Search Summit&lt;/li&gt;
&lt;li&gt;Enterprise3 Portals, Collaboration &amp;amp; Content&lt;/li&gt;
&lt;li&gt;Taxonomy Bootcamp&lt;/li&gt;
&lt;li&gt;KM World &amp;amp; Intranets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We also maintain a regular CoP call series covering a diverse range of topics from search, taxonomy &amp;amp; metadata to usability testing and web analytics.&lt;/p&gt;
&lt;blockquote id=&#34;id202629&#34; class=&#34;pullquote&#34;&gt;&#34;The most important aspect of the question is deciding what the real application of either taxonomy or ontology will be, and making sure you have the metrics in place to be able to justify the effort it takes to develop either one.&#34;&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;2. What does the idea of Linked Data mean to you?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I think Linked Data is really an extension of concepts and questions that we have been dealing with in the information management field for years. Which is to say, how can we make meaningful connections between the information that we use to do our work? How can we understand it within a context?&lt;/p&gt;
&lt;p&gt;In the case of Linked Data, we are attempting to expand this notion of connections or linking from strictly web pages and documents to structured data and other types of resources that can be represented through RDF, and making those connections explicit.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;3. What can Linked Data practices and technologies bring to the challenges that Earley &amp;amp; Associates clients are facing?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For the most part our clients have come to recognize the incredible challenge of creating a shared semantic framework within their organization. In this case, we understand the term semantic, not in reference to the semantic web, but in relation to a controlled vocabulary that has a particular meaning to an organization and the content it manages.&lt;/p&gt;
&lt;p&gt;In our experience, most organizations are not at the level of IM maturity at which linked data practices are really relevant to their current needs.&lt;/p&gt;
&lt;p&gt;That being said, there is incredible potential for linked data technology to create a richer information environment both on the semantic web and in the organization. The explicit nature of the links made using RDF certainly presents a new level of granularity in defining the relationships of one item of content to another.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;4. How would you distinguish &amp;ldquo;Linked Data&amp;rdquo; projects from &amp;ldquo;Semantic Web&amp;rdquo; projects? Or would you?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I suppose it’s possible to invest in linked data projects that are enterprise focused, in that the information lives outside the semantic web behind a firewall. However, the main driver around the creation of linked data is to build the semantic web and create links between disparate data sources. I think the business case is really still in its early stages.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;5. Semantic Web discussions often bring up the role of ontologies. Is it possible to differentiate between the potential roles of taxonomies and ontologies in Linked Data and/or Semantic Web efforts?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The line between what is possible to represent with taxonomy and what is possible to represent in an ontology is a fuzzy area. Taxonomies, in a traditional sense, are solely hierarchical in nature, representing a general to specific relationship, whereas an ontology is capable of representing a much larger range of relationships.&lt;/p&gt;
&lt;p&gt;However, in our work with clients developing taxonomies, the inclusion of polyhierarchical relationships, as well as reciprocal &amp;ldquo;see also&amp;rdquo; relationships, has become commonplace. Now these types of relationships certainly fall outside of the most traditional taxonomy definitions but also fall short of the complexity that can be modelled with RDF and OWL.&lt;/p&gt;
&lt;p&gt;I think the most important aspect of the question is deciding what the real application of either taxonomy or ontology will be, and making sure you have the metrics in place to be able to justify the effort it takes to develop either one.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;6. Some enterprises have already invested in taxonomies. How can they leverage this in Linked Data projects?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This really comes down to the nature of the taxonomy itself. Proponents of the semantic web recommend the use of standard vocabularies (e.g. FOAF, SIOC, DOAP, etc.) for representing content.&lt;/p&gt;
&lt;p&gt;If the taxonomy that an organization has already invested in represents a very specific and organization-centric domain of information, there may be a lot of work required to align it with standardized vocabularies recommended for the semantic web.&lt;/p&gt;
&lt;p&gt;Again, I think it comes down to planning and alignment of effort with an overall information strategy. Anytime you decide to describe a piece of information so that it can be shared, you enter a highly charged and political world. Building a taxonomy is as much about understanding people as it is content. If that understanding can be shared through a linked data project, then great. However I would suggest that a key priority of most organizations is still understanding what the value and meaning of their own content is to them.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/&#34; title=&#34;http://clarkparsia.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-1915&#34;&gt;June 18, 2008 2:57 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As someone who does with ontologies roughly what Seth&amp;rsquo;s company does with taxonomies, I commend him for a very clear, fair, and honest assessment (by my lights) of the differences between taxonomies and ontologies. I couldn&amp;rsquo;t agree more that neither is better than the other per se, and which you need depends on use cases, resources, and other engineering tradeoffs.&lt;/p&gt;
&lt;p&gt;That said, I don&amp;rsquo;t agree at all with Bob&amp;rsquo;s claim about relative maturity of development tools or available taxonomies versus ontologies; but, then, I wouldn&amp;rsquo;t tend to. ;&amp;gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1916&#34;&gt;June 18, 2008 9:57 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Kendall,&lt;/p&gt;
&lt;p&gt;I never said anything about tools&amp;ndash;in fact, the more I study the taxonomy tools out there, the more I think that the free combination of SWOOP and SKOS has a lot more value than several taxonomy development products that cost hundreds of dollars. Someone just needs to write out a bit of code to spit out some of the standard reports that those tools typically offer. (Hello, lazy semweb&amp;hellip;)&lt;/p&gt;
&lt;p&gt;There probably are more available ontologies than taxonomies out there, but that&amp;rsquo;s part of the problem. While companies like Clark Parsia are basing client ontologies on serious analysis of the client&amp;rsquo;s business goals and needs, the &amp;ldquo;if you build it maybe they&amp;rsquo;ll come&amp;rdquo; thrown-together ontologies have really multiplied like rabbits out there over the last few years. I used the term &amp;ldquo;maturity&amp;rdquo; because carefully codified taxonomies designed to aid the management of information have been around a lot longer than their ontological equivalents.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/&#34; title=&#34;http://clarkparsia.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-1923&#34;&gt;June 24, 2008 3:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, okay, I take yr point to make more sense now that I understand what you were saying. But on the marketing level, it&amp;rsquo;s a bit of suckage that *most* of the &amp;ldquo;ontologies&amp;rdquo; you are talking about are RDF Schemas, and really just, technically, taxonomies, rather than full-on ontologies. And yet, marketing-wise, you&amp;rsquo;re calling them &amp;ldquo;ontologies&amp;rdquo; which implies suckage of the wrong technology! :&amp;gt;&lt;/p&gt;
&lt;p&gt;More seriously, there&amp;rsquo;s a lot of crap out there, OWL, RDF, XML, SQL DDL, etc. I don&amp;rsquo;t think any of those technologies is any more prone to crap than any other, not in the aggregate. OWL is probably the hardest, but then it *seems* hard, which tends to warn off people who don&amp;rsquo;t really know what they&amp;rsquo;re doing.&lt;/p&gt;
&lt;p&gt;That make any sense?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Navigating Hollywood gossip with semantic technology</title>
      <link>https://www.bobdc.com/blog/navigating-hollywood-gossip-wi/</link>
      <pubDate>Wed, 11 Jun 2008 09:19:20 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/navigating-hollywood-gossip-wi/</guid>
      
      
      <description><div>And news in the worlds of investment, U.S. politics, and more.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.snee.com/blogbigpicture&#34;&gt;&lt;img id=&#34;id202480&#34; src=&#34;https://www.bobdc.com/img/main/bbb2.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;BlogBigPicture screenshot&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/having-fun-with-reuters-calais&#34;&gt;recently mentioned&lt;/a&gt; that while I had used &lt;a href=&#34;http://www.opencalais.com/&#34;&gt;Reuters Calais&lt;/a&gt; to look for entities in Giorgio Vasari&amp;rsquo;s &amp;ldquo;Lives of the Painters&amp;rdquo;, I had something more interesting in the works, and here it is: &lt;a href=&#34;http://www.snee.com/blogbigpicture/&#34;&gt;BlogBigPicture&lt;/a&gt;. It lets you navigate a set of related blog entries based on the names, places, companies, movies, and other entities mentioned in those entries.&lt;/p&gt;
&lt;p&gt;The default tab shows Hollywood gossip, but others have blogs and news about investing, English Premier League football, world business news, and U.S. politics. To get started on the Hollywood tab, click &amp;ldquo;Person&amp;rdquo; in the gray box on the right and then mouse over the names that appear. You&amp;rsquo;ll see the titles of the entries mentioning that person highlighted in the main panel, where you can click those titles to read the entries. (Oh, that Amy Winehouse&amp;hellip;) When I was doing the main work on this, Ashlee Simpson had just married what&amp;rsquo;s-his-name, and the many entry titles that appeared when mousing over her name showed what a hot story they were that week in the world of Hollywood gossip. Using the same technique to evaluate hot news in the business world isn&amp;rsquo;t quite as much fun, but ultimately much more valuable.&lt;/p&gt;
&lt;p&gt;The news categories that I chose are just samples. I picked investment blogs and world business news because Calais is &lt;a href=&#34;http://www.opencalais.com/calaisAPI#extractedsemanticdata&#34;&gt;tuned for&lt;/a&gt; that subject matter, Hollywood gossip because it was fun, U.S. politics because there&amp;rsquo;s a lot going on now, and Premier League Football because I wanted a sports category with international appeal that wasn&amp;rsquo;t U.S.-centric.&lt;/p&gt;
&lt;p&gt;BlogBigPicture is still pretty rough, and I have many ideas to improve it, but I decided that now that it works well enough for people to play with it, it was time to let them do so. Enjoy it, and let me know what you think!&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://thefigtrees.net/lee/blog/&#34; title=&#34;http://thefigtrees.net/lee/blog/&#34;&gt;Lee&lt;/a&gt; on &lt;a href=&#34;#comment-1903&#34;&gt;June 11, 2008 10:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;This is fantastic. I&amp;rsquo;m curious to know more about the architecture/implementation underlying it, if you have the chance. Are you regularly updating the feeds and running them through Calais and generating static Web content? Is there any dynamic discovery going on?&lt;/p&gt;
&lt;p&gt;Lee&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1904&#34;&gt;June 11, 2008 11:36 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sure, if by &amp;ldquo;regularly&amp;rdquo; you mean once or twice a day! It would be nice to have it happen more often (and of course, to let users choose their own RSS feeds and groups), but that&amp;rsquo;s for the future.&lt;/p&gt;
&lt;p&gt;The basic architecture is that the feeds get pulled down with feedparser, and after storing basic metadata about each feed and each entry in an RDFlib triplestore, each entry gets sent to Calais. The returned RDF gets stored in the triplestore, and then the interface is built from that.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://sourceforge.net/projects/lpkb/&#34; title=&#34;http://sourceforge.net/projects/lpkb/&#34;&gt;Colm Kennedy&lt;/a&gt; on &lt;a href=&#34;#comment-1905&#34;&gt;June 11, 2008 1:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;greeting(s) sentient entities operating in the semantic web conceptual space ;-)&lt;/p&gt;
&lt;p&gt;as part of the ongoing construction of ./lpkb, we have arrived at a similar requirement albeit thru&amp;rsquo; very different assumptions and development paths.&lt;/p&gt;
&lt;p&gt;given Natural Language Text ./lpkb can parse it i.e. generate a readable facsimile (not everything&amp;hellip;every word&amp;hellip; but &amp;rsquo;enuf to make a readable copy)&lt;/p&gt;
&lt;p&gt;i have no special training in natural language extraction or similar high end approaches but my requirements are as follows:&lt;/p&gt;
&lt;p&gt;in: text&lt;br /&gt;
do: classify the text, assign each word a category&lt;br /&gt;
out: a structured archive of the parsing&lt;/p&gt;
&lt;p&gt;(I would like this final stage to be in RDF-A)&lt;br /&gt;
(at one stage the parser did a round trip through jena for x-links)&lt;br /&gt;
(sparql queries below)&lt;br /&gt;
now lounging about all day &amp;ldquo;reading&amp;rdquo; is not what ./lpkb is for, but I like the simplicity of just assigning one of four types to each word and being able to follow along. those four categories are [a] [e] (actor/event) which are usually verbs and nouns. [o] any other symbol not a noun/verb and [x] for x-link (these are in an &amp;ldquo;un-debugged&amp;rdquo; semantic network which was mined from the open-mind corpus).&lt;/p&gt;
&lt;p&gt;the reason I mention all of this is that it is:&lt;br /&gt;
&amp;ndash; well defined task&lt;br /&gt;
&amp;ndash; exists (un-web-ified) code&lt;br /&gt;
&amp;ndash; concerns natural language&lt;/p&gt;
&lt;p&gt;if i am off base here or off topic *please* say so.&lt;/p&gt;
&lt;p&gt;for your reference:&lt;br /&gt;
&lt;a href=&#34;http://csksoft.com/RDF/tump.rdf&#34;&gt;http://csksoft.com/RDF/tump.rdf&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;http://sparql.org/sparql?query=PREFIX+lpkb%3A+++%3Chttp%3A%2F%2Fhomepage.eircom.net%2F~cornagill%2Flpkb%23%3E%0D%0A%0D%0ASELECT+%3Fbigword+%0D%0AFROM++++++%3Chttp%3A%2F%2Fcsksoft.com%2FRDF%2Ftump.rdf%3E%0D%0AWHERE+%0D%0A++%7B+%3Fbigword+lpkb%3AconceptuallyRelated+lpkb%3Aread+%7D&amp;amp;default-graph-uri=&amp;amp;stylesheet=%2Fxml-to-html.xsl&#34;&gt;http://sparql.org/sparql?query=PREFIX+lpkb%3A+++%3Chttp%3A%2F%2Fhomepage.eircom.net%2F~cornagill%2Flpkb%23%3E%0D%0A%0D%0ASELECT+%3Fbigword+%0D%0AFROM++++++%3Chttp%3A%2F%2Fcsksoft.com%2FRDF%2Ftump.rdf%3E%0D%0AWHERE+%0D%0A++%7B+%3Fbigword+lpkb%3AconceptuallyRelated+lpkb%3Aread+%7D&amp;amp;default-graph-uri=&amp;amp;stylesheet=%2Fxml-to-html.xsl&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.patrickgmj.net/blog&#34; title=&#34;http://www.patrickgmj.net/blog&#34;&gt;Patrick Murray-John&lt;/a&gt; on &lt;a href=&#34;#comment-1906&#34;&gt;June 11, 2008 9:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Positively inspiring! You&amp;rsquo;ve demonstrated how easy it can be to slice and dice a lot of info, and more importantly, guide to more info!&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m working on slicing/dicing/guiding around a WordPress MultiUser installation&amp;hellip;this is a fantastic model for me to follow. Thanks much!&lt;/p&gt;
&lt;p&gt;Patrick&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>An interview with Uche Ogbuji about Linked Data</title>
      <link>https://www.bobdc.com/blog/an-interview-with-uche-ogbuji/</link>
      <pubDate>Mon, 09 Jun 2008 13:05:44 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/an-interview-with-uche-ogbuji/</guid>
      
      
<description><div>Helping his clients with Linked Data technology and principles.</div><div>&lt;p&gt;&lt;em&gt;Anyone who follows the XML or semantic web world knows of &lt;a href=&#34;http://zepheira.com/team/uche/&#34;&gt;Uche Ogbuji&amp;rsquo;s&lt;/a&gt; work. His presentation &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T4&#34;&gt;Linked Data: The Real Web 2.0&lt;/a&gt; will be one of the first talks on the first day of the Linked Data Planet conference next week; as we prepare for it, I asked him a few questions about his work with Linked Data and the benefits it&amp;rsquo;s brought to clients of his company, &lt;a href=&#34;http://www.zepheira.com/&#34;&gt;Zepheira&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Tell us a little about Zepheira.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Zepheira provides solutions for data integration, focusing on Semantic technology. But let me back away a bit from the straight corporate line. Think of how we traditionally deal with data in informatics. We look to fit data into neat partitions, shepherd it along neat lines and fit it into a grand unified theory. All good, hard science. The problem is that data isn&amp;rsquo;t so easily quantized. It&amp;rsquo;s a living, temperamental entity that absorbs bits of personality from everyone who touches it. Dealing effectively with data requires art, and at Zepheira we really look to the art of data rather than to the science of code. We think adopting the right conventions for data that accommodate its unpredictable qualities is the key to so many of the problems that have dogged IT, and we believe that the web is the most successful set of conventions in this regard. In general we look to apply web architecture to enterprise problems. This brings us right in line with the Linked Data concept, which is really just a way to distill the essential keys to web architecture in a way any developer could tick off his fingers. At Zepheira we start with such principles as the body of art, and we bring together folks who&amp;rsquo;ve proven themselves as journeymen and masters in this art, and we think this positions us to offer particularly effective solutions to our customers.&lt;/p&gt;
&lt;blockquote id=&#34;id202541&#34; class=&#34;pullquote&#34;&gt;Web developers can ease into Linked Data ideas, whereas the original message for the semantic web was focused on a major shift to new technologies that seemed too alien and complex to the average Web developer.&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;What does the idea of Linked Data mean to you?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To me Linked Data means building on the basic framework of the Web, originally designed for documents. Using a set of &lt;a href=&#34;http://www.w3.org/DesignIssues/LinkedData.html&#34;&gt;four basic principles&lt;/a&gt; articulated by Tim Berners-Lee, we extend it to provide a similarly rich information space for granular data. We do so by using semantically rich hooks or translations of the essential data in Web pages, and by creating new Web information sources primarily in semantically rich formats. RDF is the format of choice for Linked Data, but more importantly it is the data model for merging information (for query, &amp;ldquo;mash-up&amp;rdquo; and much more)—the physical format can be anything from which RDF-like semantics are readily extracted.&lt;/p&gt;
&lt;p&gt;What makes Linked Data exciting is that it is a vehicle for the future (semantic web) without straying too far from what has worked so well in past and present. Whether they come in through the door of Microformats, Web feeds or JSON APIs, Web developers can ease into Linked Data ideas, whereas the original message for the semantic web was focused on a major shift to new technologies that seemed too alien and complex to the average Web developer.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What differences do you see from the idea of the &amp;ldquo;semantic web&amp;rdquo;?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I think the term &amp;lsquo;semantic web&amp;rsquo; actually covers two separate, but related ideas. On one hand it&amp;rsquo;s a goal—a web where information context is curated as carefully as the presented data. On the other it&amp;rsquo;s a methodology—a specific set of techniques advocated for achieving that goal. Linked Data is just another methodology towards the same goal. Linked Data is simpler because rather than requiring sophisticated and exhaustive declaration of the data such as OWL, it merely requires that you use links effectively, and do what you can to express the basic relationship semantics of those links. It&amp;rsquo;s a much lower barrier to entry, and though the resulting context might not be rigorous enough for a logician, it&amp;rsquo;s a big enough leap that I believe the result merits the term &amp;ldquo;semantic web&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Are you seeing current or near-term benefits from linked data technology with Zepheira client projects?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Definitely. In rapid prototypes for clients we&amp;rsquo;re usually able to give them new analysis and decision-making capabilities. We often find that after we&amp;rsquo;ve produced a few deliverables, clients get much more ambitious because they see new possibilities. I think this is in large part down to web architecture, and thus Linked Data. It&amp;rsquo;s not really black magic; the trick is usually to convert existing data sources using Linked Data techniques, which allows us to very quickly integrate across departments, viewpoints and specific application capabilities. It&amp;rsquo;s the sort of integration that unfortunately IT is too used to associating with heavily-staffed, multi-year projects. We&amp;rsquo;ve found Linked Data to be a prodigious accelerator.&lt;/p&gt;
&lt;p&gt;It may not be black magic, but again it is all about the art. We&amp;rsquo;ve put together a pretty reliable sequence of solutions beginning with START, which is a seminar/workshop combination to analyze the benefits and ideal targets of technology such as Linked Data at a specific client. Once the client has a target project in mind we have 3D, which is a carefully crafted package to accelerate the use of Linked Data in the project. 3D is like taking the general ideas of a homeowner sketching out their dream home, placing them into a particular architectural school, and producing blueprints, detailed materials manifests, and subcontractor plans for framing, wiring, plumbing and more. 3D itself does not include implementation (building the home) because many of our clients would prefer that we prepare their internal development teams for that. When we are called for implementation we often use Remix, a web-based application built on SIMILE and other Linked Data open source products, to which we&amp;rsquo;ve made many commercial enhancements. That gives us a ready platform for the sort of rapid and rich integration I mentioned above. In effect we&amp;rsquo;ve built an entire solutions stack on Linked Data.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;At the Linked Data Planet conference, you&amp;rsquo;ll be talking about the Linking Open Data initiative. Where does this fit into the larger picture of Linked Data technology?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I think it&amp;rsquo;s a pretty fuzzy line, but the way I try to organize it in my head, Linked Data is a broader concept encompassing four main principles. The LOD initiative is a project associated with a more specific range of techniques and a particular kernel of sites (most notably DBPedia), providing a practical basis for expanding the field of information available as Linked Data. At Zepheira we tend towards techniques popularized in LOD, but clearly for most of our clients the data can&amp;rsquo;t be thrown into a cloud of public data, so we&amp;rsquo;ve added our own refinements, leading to what we&amp;rsquo;ve started to call Linking Enterprise Data (LED), which extends the Linked Data principles to organizational data integration and decision-support needs.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Ask a good linked data development question, go to Linked Data Planet for free</title>
      <link>https://www.bobdc.com/blog/ask-a-good-linked-data-develop/</link>
      <pubDate>Sun, 01 Jun 2008 16:07:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/ask-a-good-linked-data-develop/</guid>
      
      
      <description><div>And hear a panel of experts discuss the answer.</div><div>&lt;img id=&#34;id202477&#34; src=&#34;http://www.linkeddataplanet.com/images/hdr_logo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;220px&#34; alt=&#34;LinkedData Planet logo&#34;/&gt;
&lt;p&gt;As part of the &lt;a href=&#34;http://www.linkeddataplanet.com/&#34;&gt;Linked Data Planet&lt;/a&gt; conference, on June 18th I&amp;rsquo;m hosting a panel described &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T8&#34;&gt;on the conference program&lt;/a&gt; like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;T8: Linked Data Workshop&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Based on questions submitted by the audience, a panel of Linked Data experts discuss architecture and system development issues surrounding Linked Data application development.&lt;/p&gt;
&lt;p&gt;MODERATOR: Bob DuCharme, Solutions Architect and Author / Conference Chair, Innodata Isogen&lt;/p&gt;
&lt;p&gt;PANELIST: Dr. Melliyal Annamalai, Principal Product Manager, Oracle&lt;br /&gt;
PANELIST: Michael Bergman, CEO, Zitgist LLC&lt;br /&gt;
PANELIST: Stefanos Damianakis, President &amp;amp; CEO, Netrics&lt;br /&gt;
PANELIST: Uche Ogbuji, Partner, Zepheira&lt;br /&gt;
PANELIST: Nikita Ogievetsky, Vice President, Morgan Stanley&lt;br /&gt;
PANELIST: Walter Perry, Managing Director, Fiduciary Automation&lt;br /&gt;
PANELIST: Dr. Andy Seaborne, Research Scientist, Hewlett-Packard Research Laboratories&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What questions do you have? What system or architecture issues would you like to see our distinguished panel discuss? For example, if you have an idea for an application that links publicly available data using SPARQL end points or related technology, and you&amp;rsquo;ve worked out part of your app but are unsure about the rest, tell us what difficult parts remain. Perhaps you have a prototype that works fine, but you&amp;rsquo;re wondering about the best way to scale it up. Perhaps you have an idea that&amp;rsquo;s part crazy and part brilliant, and you&amp;rsquo;re wondering how best to nudge it toward &amp;ldquo;useful&amp;rdquo;. With this panel&amp;rsquo;s well-known representatives from the tool side, the app user side, and the integrator side, it&amp;rsquo;s guaranteed to have some good advice for you.&lt;/p&gt;
&lt;p&gt;The best questions will get a full free pass to the conference, and I&amp;rsquo;ll announce the winners&amp;rsquo; names here and when I read their questions at the panel. Send me your ideas for questions at &lt;a href=&#34;mailto:bob@snee.com&#34;&gt;bob@snee.com&lt;/a&gt;. (I have a few of my own ideas for questions, but I already get in for free.) There is no limit to the number of questions one person can send, but you get extra points for concision—I don&amp;rsquo;t want to spend a lot of the session time reading your question out loud.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.triplescape.com&#34; title=&#34;http://blog.triplescape.com&#34;&gt;Brian Manley&lt;/a&gt; on &lt;a href=&#34;#comment-1896&#34;&gt;June 1, 2008 11:22 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is a fairly broad question, but: When it comes to exposing enterprise data from proprietary and COTS systems in a linkable way, two obvious choices come to mind: providing something like a SPARQL front-end in front of each application, or aggregating all of that into a queryable &amp;ldquo;semantic data warehouse&amp;rdquo;. What are the benefits and pitfalls of those approaches, and are there approaches that lie somewhere between that have proven to be successful?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Adding semantics to make data more valuable</title>
      <link>https://www.bobdc.com/blog/adding-semantics-to-make-data/</link>
      <pubDate>Thu, 29 May 2008 08:51:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/adding-semantics-to-make-data/</guid>
      
      
      <description><div>The secret revealed.</div><div>&lt;p&gt;Storing information about the meaning of terms—their &amp;ldquo;semantics&amp;rdquo;—can make data more valuable. Critics of semantic web technology consider such talk to be pie-in-the-sky AI talk; how can you encode the real meaning of words? More importantly, how can you do it in a way that programs can read and use to solve real data problems?&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.cafepress.com/lockhorn&#34;&gt;&lt;img id=&#34;id202458&#34; src=&#34;https://www.bobdc.com/img/main/lockhorns20051030.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;undoctored Lockhorns strip&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The answer is very simple: &lt;em&gt;you don&amp;rsquo;t have to encode all of a term&amp;rsquo;s semantics to get value from the standards and software used to do so.&lt;/em&gt; Let&amp;rsquo;s look at an example.&lt;/p&gt;
&lt;p&gt;What are the semantics of the word &amp;ldquo;spouse&amp;rdquo;? What does it mean to a recently engaged nineteen-year-old girl? What does it mean to a fifty-year-old man who&amp;rsquo;s been divorced three times? What does it mean in a court of law in California, Mississippi, Austria, or Thailand?&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s a lot of meaning to store, but we don&amp;rsquo;t need to store much to make a simple, mundane database such as an address book more valuable. Let&amp;rsquo;s say my address book includes the following facts, and I want Leroy&amp;rsquo;s home phone number:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Leroy has a work phone number of 212-334-4323.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leroy has an email address of &lt;a href=&#34;mailto:leroy@ngcorp.com&#34;&gt;leroy@ngcorp.com&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Loretta has an email address of &lt;a href=&#34;mailto:loretta031@yahoo.com&#34;&gt;loretta031@yahoo.com&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Loretta has a home phone number of 718-928-6621.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Loretta&amp;rsquo;s spouse is Leroy.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The only information I have about Leroy is his work number and his email address. I don&amp;rsquo;t have his home number or any information about his spouse.&lt;/p&gt;
&lt;p&gt;The W3C OWL web ontology language lets us declare that a property is symmetric, or as the &lt;a href=&#34;http://www.w3.org/TR/2004/REC-owl-features-20040210/&#34;&gt;OWL overview&lt;/a&gt; puts it, &amp;ldquo;if the pair (x,y) is an instance of the symmetric property P, then the pair (y,x) is also an instance of P.&amp;rdquo; With software that understands an OWL expression stating that spouse is a symmetric property and a rule I define to say that spouses have the same home phone number, I can retrieve Leroy&amp;rsquo;s home phone number from the little &amp;ldquo;database&amp;rdquo; above. (More likely, I would define a &amp;ldquo;roommate&amp;rdquo; property as symmetric and a rule saying that roommates have the same home phone number, and then declare spouse to be a subproperty of roommate, but you get the idea.) By doing this, I&amp;rsquo;d be using the OWL rules to let me pull more information out of the data collection than I put into it, making the data collection more valuable.&lt;/p&gt;
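&lt;p&gt;As a minimal sketch of the inference involved (plain Python standing in for an OWL reasoner; the triples and the &amp;ldquo;spouses share a home phone number&amp;rdquo; rule come from the example above, but the property names and code are invented for illustration):&lt;/p&gt;

```python
# The address-book facts from the post, as (subject, property, value) triples.
facts = {
    ("Leroy", "workPhone", "212-334-4323"),
    ("Leroy", "email", "leroy@ngcorp.com"),
    ("Loretta", "email", "loretta031@yahoo.com"),
    ("Loretta", "homePhone", "718-928-6621"),
    ("Loretta", "spouse", "Leroy"),
}

SYMMETRIC = {"spouse"}  # stands in for an owl:SymmetricProperty declaration

def closure(triples):
    """Repeatedly apply two rules until nothing new is inferred:
    1. symmetry: (x, p, y) implies (y, p, x) when p is symmetric;
    2. the custom rule: spouses share a home phone number.
    """
    out = set(triples)
    while True:
        new = set()
        for s, p, o in out:
            if p in SYMMETRIC:
                new.add((o, p, s))
        for s, p, o in out:
            if p == "spouse":
                for s2, p2, v in out:
                    if s2 == o and p2 == "homePhone":
                        new.add((s, "homePhone", v))
        if new.issubset(out):
            return out
        out |= new

inferred = closure(facts)
print(("Leroy", "homePhone", "718-928-6621") in inferred)  # True
```

&lt;p&gt;The point is the same as with a real reasoner: the answer to the query comes out of triples that were never explicitly entered into the database.&lt;/p&gt;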
&lt;p&gt;Plenty of software claims to make this kind of thing possible, but what interests me in OWL and related standards is the fact that they&amp;rsquo;re standards, so that if I use OWL syntax to say &amp;ldquo;spouse is a symmetric property,&amp;rdquo; a range of commercial and free software can understand and use that little bit of semantics that I&amp;rsquo;ve stored to help me get more work done.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s easiest to demonstrate this with data stored using an RDF syntax, because the RDF data model has the closest fit to the subject/attribute-name/attribute-value statements in my little database above. If you prefer, though, more and more tools can keep the RDF part under the covers; my XML 2006 paper &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.html&#34;&gt;Relational database integration with RDF/OWL&lt;/a&gt; describes a related demo using address book data stored in MySQL tables. It shows a few more use cases of realistic questions to the database that get better answers because of semantics added using OWL.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a lot we can do with this technology&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;9-comments&#34;&gt;9 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://w3future.com/weblog/&#34; title=&#34;http://w3future.com/weblog/&#34;&gt;Sjoerd Visscher&lt;/a&gt; on &lt;a href=&#34;#comment-1882&#34;&gt;May 29, 2008 9:52 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Indeed. You don&amp;rsquo;t have to know the semantics of the term &amp;ldquo;spouse&amp;rdquo;, only of the concept of spouse. There are a lot of subtly different meanings for the term &amp;ldquo;spouse&amp;rdquo;, and a lot of different terms for the concept of spouse, but the concept of spouse is unique in the universe.&lt;/p&gt;
&lt;p&gt;Perhaps that&amp;rsquo;s why AI in the 80&amp;rsquo;s didn&amp;rsquo;t work, with their emphasis on natural language.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://aeshin.org/&#34; title=&#34;http://aeshin.org/&#34;&gt;Ryan Shaw&lt;/a&gt; on &lt;a href=&#34;#comment-1883&#34;&gt;May 29, 2008 9:59 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re telling a one-sided story here. There&amp;rsquo;s potential value from adding semantics, but also potential harm. Consider my friends Mindy and Mandy, recently married in Massachusetts, but who are currently working in separate cities and visiting on weekends. I have a new phone, but it has a weird bug: whenever I try to call Mindy it rings Mandy&amp;rsquo;s home phone. What the hell?&lt;/p&gt;
&lt;p&gt;Turns out my phone has a semantically-enriched &amp;ldquo;smart&amp;rdquo; addressbook that deduces the &amp;ldquo;fact&amp;rdquo; that Mindy and Mandy are roommates (that share a home phone) from the fact that they are spouses. Or, worse, it won&amp;rsquo;t let me link Mindy and Mandy as spouses because of an ontology that requires spouses to be of the opposite sex.&lt;/p&gt;
&lt;p&gt;Of course, I don&amp;rsquo;t know anything about this, because the software engineers have hidden all this semantic sausage behind a slick interface, so all I know is that my phone doesn&amp;rsquo;t work right, and go out to look for a new one. All because the engineers decided to impose their worldview on me via an addressbook.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1884&#34;&gt;May 29, 2008 10:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sjoerd Visscher:&lt;/p&gt;
&lt;p&gt;I profoundly disagree. The only way you have to even articulate &amp;ldquo;the concept of &amp;lsquo;spouse&amp;rsquo;&amp;rdquo; to me, who wishes to know what you mean by the term, is to use words, one or many, words which are themselves polysemous. Adding a layer of concepts to pin down what is meant by words leaves you with something far harder to understand than words themselves are.&lt;/p&gt;
&lt;p&gt;Eric Raymond tells the story of having received a letter about one of his books, the Hacker&amp;rsquo;s Dictionary. He got it because he edited the dictionary: that is, he assembled it out of the evidence provided by the many participants. However, the letter was really intended for &lt;em&gt;his&lt;/em&gt; editor, the publisher&amp;rsquo;s employee who was responsible for preparing the book for publication. He, and the letter&amp;rsquo;s author, were tripped up by the polysemy of &amp;ldquo;editor&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Could that have been foreseen in advance and accounted for? I don&amp;rsquo;t think so. It&amp;rsquo;s something that arises in the special case of dictionaries (and perhaps anthologies).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.TimothyHorrigan.com&#34; title=&#34;http://www.TimothyHorrigan.com&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-1886&#34;&gt;May 30, 2008 2:33 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I apologize if this is too simple minded a perspective&amp;hellip; but something which bugged the hell out of me during many years doing data conversions was the fact that none of the commonly used data-storage tools (e.g., Excel spreadsheets) have any provision for storing the units of measure. (Which I suppose is the most basic form of metadata.)&lt;/p&gt;
&lt;p&gt;I picked on Excel spreadsheets, because those are extraordinarily messy when used wrong, as they are 99.999% of the time (even by me.) Though actually Excel does let you put the name of the unit in the column label (though not on a row below the column label) or in a note field or you can even use named variables (not that anyone in the business world ever does any of those things.) But database products, even fancy ones, are just as bad or worse.&lt;/p&gt;
&lt;p&gt;The school system, BTW, does a rotten job of teaching kids how to use units (and other metadata, I suppose.) I used to grade open ended questions on math tests, and the biggest source of trouble was when kids combined the variables wrong in ways which would have been obvious if you paid attention to the units of measure. Students usually don&amp;rsquo;t get any training in this sort of thing until they begin taking quantitative courses in college, and by then it is too late, especially if the student only takes the minimum Physics for Poets and Business Math to meet their degree requirements.&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
**rant mode off**&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.timothyhorrigan.com/tammi_itunes.html&#34; title=&#34;http://www.timothyhorrigan.com/tammi_itunes.html&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-1890&#34;&gt;May 30, 2008 10:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here is some meta-metadata about this thread (i.e., data about the data about the data.) The post is illustrated with a Lockhorns cartoon. Our multitalented host, Bob DuCharme, used to play guitar with a band called the Lockhorns.&lt;/p&gt;
&lt;p&gt;Some of their tunes, including Bob&amp;rsquo;s song Hiwataha can be found at:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.philipshelley.com/words/?cat=10&#34;&gt;http://www.philipshelley.com/words/?cat=10&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1891&#34;&gt;May 30, 2008 12:40 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Tim. I&amp;rsquo;ve blogged here about the Hunting Accident, with a pointer to the &amp;ldquo;canonical&amp;rdquo; version of Hiawatha, at &lt;a href=&#34;http://www.snee.com/bobdc.blog/2006/05/me_as_80s_new_york_lead_guitar.html&#34;&gt;http://www.snee.com/bobdc.blog/2006/05/me_as_80s_new_york_lead_guitar.html&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I wasn&amp;rsquo;t really a full time member of that band, but as the founding member&amp;rsquo;s roommate, I filled in for various members as necessary&amp;ndash;on guitar in the recording you point to and percussion in the session that Philip describes at &lt;a href=&#34;http://www.philipshelley.com/words/?p=60&#34;&gt;http://www.philipshelley.com/words/?p=60&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By fauigerzigerk on &lt;a href=&#34;#comment-1892&#34;&gt;May 30, 2008 2:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;@Sjoerd: What you do is exactly the reason why this debate keeps going in circles. You ignore what problem is supposed to be solved. If you want to know what Bob means by spouse, you have to ask him, and of course there is nothing better than words for him to explain it to you, because both of you are human beings. But that wasn&amp;rsquo;t the problem Bob attempted to solve.&lt;/p&gt;
&lt;p&gt;What he wanted to do is to run a particular query on address book data: &amp;ldquo;What is the home phone of Leroy?&amp;rdquo;. And what he shows very well, I think, is that collecting a small amount of metadata sometimes allows us to ask questions about our data that we wouldn&amp;rsquo;t be able to ask otherwise. We can ask more questions without much additional effort if we enable the dumb machine to do a few inferencing steps for us. That&amp;rsquo;s all.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s completely beyond me why the simple affair of deterministic inferencing presses some kind of &amp;ldquo;philosophy button&amp;rdquo; in some people and suddenly it&amp;rsquo;s all about defending the power of natural language against some evil formalism in a race to define the world. There&amp;rsquo;s no need for such a defense as there is no such race. It&amp;rsquo;s like defending an artistic painting showing a bridge against an architect&amp;rsquo;s CAD software that helps him design one. I&amp;rsquo;ve never heard painters and architects have such debates. Amazingly, we have them all the time in IT.&lt;/p&gt;
&lt;p&gt;By fauigerzigerk on &lt;a href=&#34;#comment-1893&#34;&gt;May 30, 2008 2:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sorry, my reply was directed at John Cowan not at Sjoerd Visscher.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1894&#34;&gt;May 30, 2008 3:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And just to hammer in my original point, &amp;ldquo;what Bob means by spouse&amp;rdquo; was that it&amp;rsquo;s a symmetric property, and that&amp;rsquo;s all I meant, and that alone is useful.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Integrating relational data into the semantic web</title>
      <link>https://www.bobdc.com/blog/integrating-relational-data-in/</link>
      <pubDate>Fri, 23 May 2008 17:15:30 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/integrating-relational-data-in/</guid>
      
      
<description><div>By two guys who know what they&#39;re talking about.</div><div>&lt;p&gt;I was sorry to miss the Semantic Web Technologies conference, but I had a very interesting time yesterday giving a talk at the &lt;a href=&#34;http://www.lib.virginia.edu/newhorizons&#34;&gt;New Horizons in Teaching and Research&lt;/a&gt; conference at the University of Virginia (so nice to go to a conference 25 minutes from my house) on &lt;a href=&#34;http://www.lib.virginia.edu/newhorizons/thursday.html&#34;&gt;Semantic Web technologies, RDF and OWL (and Linked Data)&lt;/a&gt;. It was an audience of professors and researchers from a wide variety of disciplines who were all very open to hearing about how these standards could let them store metadata that helped them get more out of their data.&lt;/p&gt;
&lt;p&gt;If you follow the second link in the preceding paragraph, you&amp;rsquo;ll see that &amp;ldquo;(and Linked Data)&amp;rdquo; wasn&amp;rsquo;t part of the original title. I added it because I felt that it was important for them to hear about the possibilities of linking publicly available data sources with each other and with local resources in order to find new connections and patterns, whether they were interested in storing the semantics of the various terms or not. They were surprisingly receptive to the basic ideas of RDF—some audiences keep coming back to &amp;ldquo;why use a URI for something that isn&amp;rsquo;t a web page where you can send your browser?&amp;rdquo;—but I still emphasized the increasing availability of non-RDF data to SPARQL queries and what this means to the &lt;a href=&#34;http://richard.cyganiak.de/2007/10/lod/&#34;&gt;growing collection of linked open data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As an example, I showed my &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.html&#34;&gt;demo for using D2RQ to integrate two different relational databases&lt;/a&gt;, but that was more of a toy example. At the &lt;a href=&#34;http://www.linkeddataplanet.com/&#34;&gt;LinkedData Planet&lt;/a&gt; conference in a few weeks, we&amp;rsquo;re going to hear Jim Melton and Ashok Malhotra&amp;rsquo;s talk on &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#W4&#34;&gt;Integrating Relational Data into the Semantic Web&lt;/a&gt;. They&amp;rsquo;re both employed by Oracle, but their reputation for developing and implementing important standards such as SQL and XQuery goes way beyond their Oracle work. Here is their abstract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Invaluable data is stored in relational databases, but a growing fraction is being created in non-traditional forms such as spreadsheets and PDF files. Integrating relational and XML data is well-understood with widely-accepted solutions. Relational data must be similarly integrated with data in other forms. A promising approach is to translate the underlying data into a common format (such as RDF) and to create a &amp;ldquo;semantic cover&amp;rdquo; atop the data in the form of an ontology (perhaps using OWL). Classes and subclasses of the ontology are mapped into queries on the underlying data. The ontology can then be queried using SPARQL and the SPARQL queries translated to queries on the underlying data.&lt;/p&gt;
&lt;/blockquote&gt;
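&lt;p&gt;A rough sketch of that &amp;ldquo;semantic cover&amp;rdquo; idea (not their actual implementation; the class, table, and column names below are invented for illustration): each ontology class is backed by a relational query, so a SPARQL-style request for a class&amp;rsquo;s instances can be answered from the existing tables.&lt;/p&gt;

```python
# Each ontology class maps to a SQL query over the underlying tables;
# answering "give me the instances of class C" means running the mapped
# query. All names here are hypothetical.
class_to_sql = {
    "ex:Employee": "SELECT id, name FROM employees",
    "ex:Manager": "SELECT id, name FROM employees WHERE is_manager = 1",
}

def rewrite_class_query(class_uri):
    """Return the SQL that materializes the instances of an ontology class."""
    return class_to_sql[class_uri]

print(rewrite_class_query("ex:Manager"))
```

&lt;p&gt;A real translator would also handle subclass relationships and property mappings, but the core move is the same: the ontology terms become views over relational data.&lt;/p&gt;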
&lt;p&gt;The two of them together make for a lot of firepower on the topic, and with so much of the world&amp;rsquo;s data stored in relational databases (particularly in their employer&amp;rsquo;s products), their insights on how to open up such data to Linked Data tools such as SPARQL will help more people get access to more data to mix, match, and use for interesting new applications.&lt;/p&gt;
&lt;p&gt;This is only one of many great talks we&amp;rsquo;ll have at the Linked Data Planet conference; come join us!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Having fun with Reuters Calais</title>
      <link>https://www.bobdc.com/blog/having-fun-with-reuters-calais/</link>
      <pubDate>Wed, 21 May 2008 11:09:28 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/having-fun-with-reuters-calais/</guid>
      
      
      <description><div>Feeding it Renaissance art history.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.opencalais.com/&#34;&gt;Calais&lt;/a&gt; is the Reuters Clearforest product that, according to their homepage, &amp;ldquo;automatically annotates your content with rich semantic metadata&amp;rdquo;. Give it text, and it returns the text marked up with RDF that identifies entities and various semantic information about those entities.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m looking forward to Reuters Clearforest CEO Barak Pridor&amp;rsquo;s talk &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T12&#34;&gt;Enabling Semantic Applications Through Calais&lt;/a&gt; at the &lt;a href=&#34;http://www.linkeddataplanet.com/&#34;&gt;Linked Data Planet&lt;/a&gt; conference (his &lt;a href=&#34;http://talk.talis.com/archives/2008/03/barak_pridor_ta.html&#34;&gt;Talking with Talis&lt;/a&gt; podcast interview is definitely worth listening to), and I thought I&amp;rsquo;d get more out of his talk if I played with the software a bit first. It was easy and straightforward, but before I describe my first experiment, I wanted to mention two important points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For work-related projects, I&amp;rsquo;ve been researching machine-aided indexing tools and similar software on and off, and they usually look complex and expensive. &lt;a href=&#34;http://gate.ac.uk/&#34;&gt;GATE&lt;/a&gt; and &lt;a href=&#34;http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html&#34;&gt;UIMA&lt;/a&gt; look tantalizing, but these are not tools; they&amp;rsquo;re frameworks into which you can plug such tools. Their included sample tools are either simple enough to do little more than a Perl script with a few regular expressions would do, or else they&amp;rsquo;re complex enough to appear difficult to set up and get running—the term &amp;ldquo;training corpus&amp;rdquo; comes up a lot. Still, an admirable goal of both frameworks is that tool vendors doing more complex text processing should make their products compatible with these frameworks, letting their customers mix and match tools as necessary instead of forcing them to choose between expensive packages of tools that often do more than they need. (I look forward to discussing these with Jeni Tennison, who&amp;rsquo;s also been &lt;a href=&#34;http://www.jenitennison.com/blog/node/76&#34;&gt;researching this&lt;/a&gt;, the next time we&amp;rsquo;re in the same city. Unfortunately, there won&amp;rsquo;t be an Oxford XML Summer School this year, but we&amp;rsquo;re all very hopeful for 2009.) After a few years of sporadic research into automated entity recognition tools, I was very happy to see a free, RESTful web service come along.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kingsley Idehen &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen/?id=1343&#34;&gt;recently wrote&lt;/a&gt; the following about the Linked Data On the Web workshop in Beijing: &amp;ldquo;As the sessions progressed, it became clear during a number of accompanying Q&amp;amp;A sessions that a new Linked Data exploitation frontier is emerging. The frontier in question takes the form of a Linked Data substrate capable of addressing the taxonomic needs of solutions aimed at automated Named Entity Extraction, Disambiguation, Subject matter Concept alignment, transparently integrated with existing Web Content.&amp;rdquo; As you&amp;rsquo;ll see, I pushed Calais a little further than it was supposed to go for named entity extraction, but it still did an admirable job and helped me to focus more on the kinds of applications where it can shine.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote id=&#34;id202593&#34; class=&#34;pullquote&#34;&gt;After a few years of sporadic research into automated entity recognition tools, I was very happy to see a free, RESTful web service come along.&lt;/blockquote&gt;
&lt;p&gt;The quickest way to start playing with Calais is to use &lt;a href=&#34;http://sws.clearforest.com/calaisviewer&#34;&gt;Calais Viewer&lt;/a&gt;, where you just paste some text and click the &amp;ldquo;submit&amp;rdquo; button to see what kinds of entities Calais can find in that text. To write your own applications that use it, you need to &lt;a href=&#34;http://www.opencalais.com/user/register&#34;&gt;become a Calais developer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To write my own application that passes content to Calais and then uses the results for something, I like &lt;a href=&#34;http://code.google.com/p/python-calais/&#34;&gt;python-calais&lt;/a&gt;. It handles the web service interactions when you want to send text or a URL identifying content to Calais, and it loads the returned RDF triples into a triplestore where you can play with them. (It uses &lt;a href=&#34;http://rdflib.net/&#34;&gt;RDFlib&lt;/a&gt;, a Python RDF library I hadn&amp;rsquo;t been able to get working for a while, but &lt;a href=&#34;http://rdflib.net/issues/2007/01/03/can&#39;t_complete_install_because_%22the_.net_framework_sdk_needs_to_be_installed_before_building_extensions_for_python.%22/issue&#34;&gt;an RDFLib comment page&lt;/a&gt; explained what I had to do to get a new version working on a Windows machine.)&lt;/p&gt;
&lt;p&gt;Because of a talk I&amp;rsquo;ll be giving tomorrow at the University of Virginia on &lt;a href=&#34;http://www.lib.virginia.edu/newhorizons/thursday.html&#34;&gt;Semantic web technologies, RDF, and OWL&lt;/a&gt;, and because the audience will have a large representation of people using technology for humanities research, I chose Giorgio Vasari&amp;rsquo;s &amp;ldquo;Lives of the Painters&amp;rdquo; for Calais to analyze. Vasari was a late-Renaissance Italian painter and architect whose writings about other painters are his main claim to modern fame. His biographies are considered a founding work of art history, and while &lt;a href=&#34;http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&amp;amp;field-keywords=vasari&#34;&gt;several editions&lt;/a&gt; still sell well, Project Gutenberg has a public domain version of &lt;a href=&#34;http://www.gutenberg.org/etext/21212&#34;&gt;Volume 1&lt;/a&gt; available for free.&lt;/p&gt;
&lt;p&gt;Being more than 100,000 characters long, this work was too big to send off to Calais, so I broke it up into pieces, sent those off, and then aggregated the returned metadata. (Being in RDF makes the metadata very easy to aggregate.) The following is an excerpt, with some chunks removed and some carriage returns added for readability, of what Calais returned for one section of the book:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!--Use of the Calais Web Service is governed by the Terms of
Service located at http://www.opencalais.com. By using this
service or the results of the service you agree to these terms of
service.--&amp;gt;
&amp;lt;!--Relations: PersonPolitical, Quotation


Facility: castle of S. Angelo, portico of S. Peter, Doge&#39;s palace
Organization: Holy Church
NaturalFeature: Celian Hill
Continent: Africa
Country: France, Greece, Italy
Person: Maria Maggiore, Paul, Giovanni Evangelista, Valentinian,
Giovanni Battista, Ser Brunnellesco, St Gregory,
Giustiniano, John , St Hilarion, Luit, Hugh, Giovanni Morosini
City: Florence, Milan, Alexandria, Rome, Venice, Pistoia--&amp;gt;
  &amp;lt;rdf:Description c:allowDistribution=&amp;quot;false&amp;quot; c:allowSearch=&amp;quot;false&amp;quot; 
      c:externalID=&amp;quot;bb256bde4fbcf05c1942c1e67e256e07b73e5618&amp;quot; 
      c:id=&amp;quot;http://id.opencalais.com/ebZQM2eY3x3zCclB79WRXw&amp;quot;
      rdf:about=&amp;quot;http://d.opencalais.com/dochash-1/67406f3f-36bb-3adb-9364-8b15cc0b741d&amp;quot;&amp;gt;
    &amp;lt;rdf:type rdf:resource=&amp;quot;http://s.opencalais.com/1/type/sys/DocInfo&amp;quot;/&amp;gt;
    &amp;lt;c:document&amp;gt;&amp;lt;![CDATA[&amp;lt;Document&amp;gt;
    &amp;lt;Title&amp;gt;1208342711120-85FCAB2B-568168&amp;lt;/Title&amp;gt;
    &amp;lt;Date&amp;gt;2008-04-15&amp;lt;/Date&amp;gt;&amp;lt;Body&amp;gt;figures and
    some marble candelabra exquisitely carved with leaves,
and some children in bas-relief of extraordinary beauty? In short, by
these and many other signs, it is clear that sculpture was in
decadence in the time of Constantine, and with it the other superior
arts. If anything was required to complete their ruin it was supplied
by the departure of Constantine from Rome when he transferred the
seat of government to Byzantium, as he took with him to Greece not
only all the best sculptors and other artists of the age, such as
they were, but also a quantity of statues and other beautiful works
of sculpture.


&amp;lt;!-- lots of plain text content deleted --&amp;gt;


I must not forget to mention either, how in the course of time the
&amp;lt;/Body&amp;gt;&amp;lt;/Document&amp;gt;]]&amp;gt;&amp;lt;/c:document&amp;gt;
    &amp;lt;c:externalMetadata/&amp;gt;
    &amp;lt;c:submitter&amp;gt;MyProjectID&amp;lt;/c:submitter&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


&amp;lt;!-- Various header information removed --&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;http://d.opencalais.com/pershash-1/f87fd977-7cd7-348d-972b-8f77716da77d&amp;quot;&amp;gt;
    &amp;lt;rdf:type rdf:resource=&amp;quot;http://s.opencalais.com/1/type/em/e/Person&amp;quot;/&amp;gt;
    &amp;lt;c:name&amp;gt;Ser Brunnellesco&amp;lt;/c:name&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;
  &amp;lt;rdf:Description rdf:about=&amp;quot;http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d/Instance/1&amp;quot;&amp;gt;
    &amp;lt;rdf:type rdf:resource=&amp;quot;http://s.opencalais.com/1/type/sys/InstanceInfo&amp;quot;/&amp;gt;
    &amp;lt;c:docId rdf:resource=&amp;quot;http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d&amp;quot;/&amp;gt;
    &amp;lt;c:subject rdf:resource=&amp;quot;http://d.opencalais.com/pershash-1/f87fd977-7cd7-348d-972b-8f77716da77d&amp;quot;/&amp;gt;
&amp;lt;!--Person: Ser Brunnellesco--&amp;gt;
    &amp;lt;c:detection&amp;gt;[of this church is such that Pippo di
]Ser Brunnellesco[ did not disdain to make use of it as his model]&amp;lt;/c:detection&amp;gt;
    &amp;lt;c:offset&amp;gt;14713&amp;lt;/c:offset&amp;gt;
    &amp;lt;c:length&amp;gt;16&amp;lt;/c:length&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;http://d.opencalais.com/genericHasher-1/4fd6dc07-5b9d-3356-8f83-0e735dfa9910&amp;quot;&amp;gt;
    &amp;lt;rdf:type rdf:resource=&amp;quot;http://s.opencalais.com/1/type/em/e/Country&amp;quot;/&amp;gt;
    &amp;lt;c:name&amp;gt;Greece&amp;lt;/c:name&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;
  &amp;lt;rdf:Description rdf:about=&amp;quot;http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d/Instance/16&amp;quot;&amp;gt;
    &amp;lt;rdf:type rdf:resource=&amp;quot;http://s.opencalais.com/1/type/sys/InstanceInfo&amp;quot;/&amp;gt;
    &amp;lt;c:docId rdf:resource=&amp;quot;http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d&amp;quot;/&amp;gt;
    &amp;lt;c:subject rdf:resource=&amp;quot;http://d.opencalais.com/genericHasher-1/4fd6dc07-5b9d-3356-8f83-0e735dfa9910&amp;quot;/&amp;gt;
&amp;lt;!--Country: Greece--&amp;gt;
    &amp;lt;c:detection&amp;gt;[government to Byzantium, as he took with him to ]Greece[ not
only all the best sculptors and other]&amp;lt;/c:detection&amp;gt;
    &amp;lt;c:offset&amp;gt;543&amp;lt;/c:offset&amp;gt;
    &amp;lt;c:length&amp;gt;6&amp;lt;/c:length&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After a comment with some legalese, the document starts with an XML comment listing the identified entities by classes such as Facility, Organization, and Person. Next, in a &lt;code&gt;c:document&lt;/code&gt; element, is the text passed to Calais to analyze, followed by RDF/XML about the entities that Calais found with length and offset figures to show where in the original text it found these entities. As you can see, it assigns a URL identifier to each one and lists metadata about it; for example, the thing with an identifier of &lt;a href=&#34;http://d.opencalais.com/pershash-1/f87fd977-7cd7-348d-972b-8f77716da77d&#34;&gt;http://d.opencalais.com/pershash-1/f87fd977-7cd7-348d-972b-8f77716da77d&lt;/a&gt; is a Person, has a name of &amp;ldquo;Ser Brunnellesco&amp;rdquo;, and is 16 characters long starting at position 14713 of the CDATA in the &lt;code&gt;c:document&lt;/code&gt; element within the XML returned by Calais.&lt;/p&gt;
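&lt;p&gt;To make the &lt;code&gt;c:offset&lt;/code&gt; and &lt;code&gt;c:length&lt;/code&gt; arithmetic concrete, here is a minimal Python illustration; the sample string comes from the excerpt above, and the numbers are computed here rather than being the actual values Calais reported for the full document:&lt;/p&gt;

```python
# c:offset locates the first character of a detected entity within the
# analyzed text, and c:length is the entity's length in characters.
body = ("he transferred the seat of government to Byzantium, "
        "as he took with him to Greece not only all the best sculptors")
offset = body.index("Greece")   # what Calais would report as c:offset
length = len("Greece")          # what Calais would report as c:length
assert body[offset:offset + length] == "Greece"
```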
&lt;p&gt;I tried writing something that used this metadata to wrap start- and end-tags around the entities at the identified points (for example, &lt;code&gt;&amp;lt;c:country&amp;gt;Greece&amp;lt;/c:country&amp;gt;&lt;/code&gt;) but quickly found out why so much text analysis software adds metadata out-of-line: because identified entities can overlap, so my added tags would usually make the result ill-formed XML. (I guess I should have read the fourth paragraph of Jeni&amp;rsquo;s post referenced above more closely.) I did write something to insert empty elements marking the beginning of an identified entity and its length (for example, &lt;code&gt;&amp;lt;c:Country length=&amp;quot;6&amp;quot;/&amp;gt;Greece&lt;/code&gt;) but I haven&amp;rsquo;t used it for anything yet.&lt;/p&gt;
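&lt;p&gt;That marker-insertion step can be sketched in a few lines of Python (a reconstruction of the idea, not the actual script): insert the markers from the highest offset down, so that each insertion leaves the not-yet-processed offsets untouched:&lt;/p&gt;

```python
def insert_markers(text, entities):
    """entities: (offset, length, type) triples pulled from the Calais RDF."""
    # Work from the highest offset down so earlier insertions do not
    # shift the positions of entities we have not reached yet.
    for offset, length, etype in sorted(entities, reverse=True):
        marker = '<c:%s length="%d"/>' % (etype, length)
        text = text[:offset] + marker + text[offset:]
    return text
```

&lt;p&gt;Because only empty elements are inserted, overlapping entities just yield adjacent markers instead of crossed start- and end-tags, so the output stays well-formed.&lt;/p&gt;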
&lt;p&gt;Once this new metadata is loaded into a triplestore, you can do some interesting things with it. For example, a little SPARQL let me create a report on the types of entities found. The following shows the beginning of the report after sorting by number of occurrences:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    323  Person           &amp;quot;Jesus Christ&amp;quot;
    191  City             &amp;quot;Florence&amp;quot;
    136  Person           &amp;quot;After Giotto&amp;quot;
    116  Person           &amp;quot;Jesus Christ&amp;quot;
    108  Person           &amp;quot;Franco Sacchetti&amp;quot;
     87  Person           &amp;quot;Agnolo Gaddi&amp;quot;
     79  Person           &amp;quot;Lorenzo di Bicci&amp;quot;
     72  Person           &amp;quot;Diocletian&amp;quot;
     69  Person           &amp;quot;Giovanni Cimabue&amp;quot;
     67  Person           &amp;quot;Tuscany Niccola&amp;quot;
     65  Person           &amp;quot;St Paul&amp;quot;
     64  Person           &amp;quot;After Andrea&amp;quot;
     58  Person           &amp;quot;Giovanni Evangelista&amp;quot;
     58  City             &amp;quot;Rome&amp;quot;
     57  Person           &amp;quot;Antonio Vite&amp;quot;
     56  Person           &amp;quot;St Francis&amp;quot;
     55  Person           &amp;quot;Andrea Tafi&amp;quot;
     54  Person           &amp;quot;Francesco Petrarch&amp;quot;
     54  Person           &amp;quot;Francesco di Giorgio&amp;quot;
     53  Person           &amp;quot;Jacopo di Casentino&amp;quot;
     48  Person           &amp;quot;Bernardo Orcagna&amp;quot;
     47  Person           &amp;quot;Guglielmo da Forli&amp;quot;
     47  Person           &amp;quot;Giovanni Boccaccio&amp;quot;
     46  Country          &amp;quot;Italy&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
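&lt;p&gt;A SPARQL query along these lines would pull out each detected instance with its entity type and name. Treat it as a sketch: the class URI comes from the RDF excerpt above, but the &lt;code&gt;c:&lt;/code&gt; prefix URI is a guess because the excerpt omits its namespace declarations, and the occurrence counting and sorting were presumably done outside the query, since SPARQL 1.0 had no aggregate functions.&lt;/p&gt;

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX c:   <http://s.opencalais.com/1/pred/>

SELECT ?type ?name
WHERE {
  ?instance rdf:type <http://s.opencalais.com/1/type/sys/InstanceInfo> ;
            c:subject ?entity .
  ?entity rdf:type ?type ;
          c:name ?name .
}
```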
&lt;p&gt;In the world of Italian Renaissance art, Jesus Christ is a big name. Calais didn&amp;rsquo;t really find the phrase &amp;ldquo;After Giotto&amp;rdquo; 136 times; it&amp;rsquo;s in there twice—once at the beginning of a sentence, with &amp;ldquo;After&amp;rdquo; having that capital &amp;ldquo;A&amp;rdquo;—and I assume that this led Calais to believe that &amp;ldquo;After&amp;rdquo; was Giotto&amp;rsquo;s first name, and that all other references to Giotto referred to the same guy.&lt;/p&gt;
&lt;p&gt;Public domain English translations of 16th-century biographies of 15th-century Italian painters are not the class of content that Calais was optimized for. Along with entities such as Person, Country, and City, a look at the &lt;a href=&#34;http://www.opencalais.com/calaisAPI#extractedsemanticdata&#34;&gt;entities, events, and facts&lt;/a&gt; that it searches for shows that business news—the second most popular (that is, second most easily funded) domain in computational linguistics research after terrorism—is the most important domain for the Calais folk. Potentially identified &amp;ldquo;Events/Facts&amp;rdquo; include Acquisition, AnalystEarningsEstimate, and StockSplit. This makes it a little clearer why Reuters would &lt;a href=&#34;http://www.reuters.com/article/technology-media-telco-SP/idUSNAAD300120070430&#34;&gt;buy Clearforest&lt;/a&gt; and the technology behind Calais.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m working on another project using Calais that comes much closer to the kind of things that it is optimized for, and it&amp;rsquo;s looking really cool. It might even be a worthwhile enough application to park a domain name to host it. I hope to have some demos to show within a few weeks.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.opencalais.com&#34; title=&#34;http://www.opencalais.com&#34;&gt;Tom Tague&lt;/a&gt; on &lt;a href=&#34;#comment-1877&#34;&gt;May 25, 2008 10:12 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tom Tague from Calais here.&lt;/p&gt;
&lt;p&gt;I love this stuff. It never fails to amaze me what uses &amp;amp; experiments people try with Calais. Who knows - maybe feeding a whole collection of similar works from various era&amp;rsquo;s might provide a useful research tool for looking at artistic trends by time or geography?&lt;/p&gt;
&lt;p&gt;One of the interesting experiments a number of Calais users have been playing with is document level co-occurrence. For example, what people occur most frequently mentioned together in news articles, etc.&lt;/p&gt;
&lt;p&gt;Might be a little tough @ the book level - but perhaps a book could be decomposed into chapters by (perhaps representing time periods) to simulate the same thing.&lt;/p&gt;
&lt;p&gt;Thanks for the experiment!&lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Upgrading to Movable Type 4</title>
      <link>https://www.bobdc.com/blog/upgrading-to-movable-type-4/</link>
      <pubDate>Sun, 18 May 2008 13:06:54 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/upgrading-to-movable-type-4/</guid>
      
      
<description><div>Mostly OK.</div><div>&lt;p&gt;I had my host provider upgrade my weblog to Movable Type 4. At first, it looked like connections to the CSS stylesheets were broken, but it turns out that the public_html/mt-static/themes directory that had the CSS files was replaced with a new one for MT4. Luckily, I had zipped this directory up before the upgrade and copied the zip file somewhere else, so when I restored the directory of the CSS that I use, everything went back to normal.&lt;/p&gt;
&lt;p&gt;Well, everything except the ability to search through the entries. Attempting to search gives me the message &amp;ldquo;Publishing results failed: Can&amp;rsquo;t find included template module &amp;lsquo;Header&amp;rsquo;&amp;rdquo;. &lt;a href=&#34;http://forums.sixapart.com/index.php?act=Print&amp;amp;client=printer&amp;amp;f=7&amp;amp;t=63379&#34;&gt;This discussion of a similar problem&lt;/a&gt; tells me to republish the templates, and &lt;a href=&#34;http://www.movabletype.org/documentation/professional/universal-template-set.html&#34;&gt;this Movable Type help page&lt;/a&gt; explains how to do this. I worry that this sweeping command will mess up my existing templates in addition to generating the missing ones, and I&amp;rsquo;ve already spent too much time on this today, so for now I&amp;rsquo;ve commented out the search feature of this blog until I can really make sure that I have everything backed up that needs to be before taking this step.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/blogging_about_blogging/&#34;&gt;Blogging about blogging&lt;/a&gt; is the one category of my entries here that I actively try to avoid contributing to—it&amp;rsquo;s such a tiresome topic on so many weblogs—but this entry also serves as a bit of post-upgrade regression testing. I&amp;rsquo;d hate to find out that the upgrade screwed up my feeds after posting something more substantive.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1864&#34;&gt;May 18, 2008 3:45 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here is a test comment.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/blogging-about-blogging">blogging about blogging</category>
      
    </item>
    
    <item>
      <title>Reading epub files with the Sony PRS-505 ebook reader</title>
      <link>https://www.bobdc.com/blog/reading-epub-files-with-the-so/</link>
      <pubDate>Wed, 14 May 2008 18:21:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/reading-epub-files-with-the-so/</guid>
      
      
      <description><div>For now, only on the Sony business development guys&#39; 505s.</div><div>&lt;img id=&#34;id202481&#34; src=&#34;https://www.bobdc.com/img/main/prs505.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;PRS-505&#34;/&gt;
&lt;p&gt;Look out, &lt;a href=&#34;http://www.engadget.com/&#34;&gt;engadget&lt;/a&gt;, it&amp;rsquo;s my first consumer electronics scoop: at &lt;a href=&#34;http://www.idpf.org/digitalbook08/default.htm&#34;&gt;Digital Book 2008&lt;/a&gt; today, Sony reader business development managers Bob Nell and Daniel Albohn, who were not listed on the &lt;a href=&#34;http://www.idpf.org/digitalbook08/agenda.htm&#34;&gt;online&lt;/a&gt; or printed programs, made a surprise presentation: they showed that Sony has worked out how to display ebooks in the standard &lt;a href=&#34;https://www.bobdc.com/blog/creating-epub-files&#34;&gt;epub&lt;/a&gt; format on the PRS-505. Nell started up the Adobe Digital Editions reader and dragged a DRM-free epub book and a PDF version of Suze Orman&amp;rsquo;s book &amp;ldquo;Woman and Money&amp;rdquo; to the PRS-505 icon, and then used a digital overhead projector to show us the books on his 505, complete with reflowing and resizing of text.&lt;/p&gt;
&lt;p&gt;Why is this cool? Because it lets us move epub and PDF files directly from our PCs to a high-contrast e-ink device. There are a lot of PDFs and other documents that I can easily convert to epub and want to read without printing out a pile of paper, so I&amp;rsquo;m psyched.&lt;/p&gt;
&lt;p&gt;For now, it&amp;rsquo;s not something you can do on just any 505, but because it apparently doesn&amp;rsquo;t involve any specialized hardware, I&amp;rsquo;m sure that existing 505s will be able to do this after an eventual firmware upgrade. Nell said that Sony&amp;rsquo;s corporate communications department forbade them from discussing such plans.&lt;/p&gt;
&lt;p&gt;A few other notes from the one-day conference:&lt;/p&gt;
&lt;p&gt;Big publishers are apparently buying Sony ebook readers in bulk and giving them to their staff to use for manuscript review so that they can reduce the printing and shipping costs that they spend on moving all those piles of paper around. Apparently it&amp;rsquo;s &lt;a href=&#34;http://www.publishersweekly.com/article/CA6546015.html&#34;&gt;pretty successful&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Harlequin&amp;rsquo;s Malle Vallik did not just rehash her colleague Brent Lewis&amp;rsquo;s talk from the O&amp;rsquo;Reilly Tools of Change conference, which I described in &lt;a href=&#34;https://www.bobdc.com/blog/finding-an-ebook-audience&#34;&gt;Finding an eBook audience: Housewives reading bodice-rippers?&lt;/a&gt; as very inspirational for anyone considering the ebook market; she described several interesting initiatives they&amp;rsquo;ve taken in the few months since then. Did you know that there&amp;rsquo;s a literary genre called &amp;ldquo;Paranormal Romance&amp;rdquo;? They&amp;rsquo;re &lt;a href=&#34;http://www.eharlequin.com/store.html?cid=486&#34;&gt;all over it&lt;/a&gt; (note the &amp;ldquo;category&amp;rdquo; of these books), with a &lt;a href=&#34;http://paranormalromanceblog.wordpress.com/&#34;&gt;blog&lt;/a&gt; and, of course, &lt;a href=&#34;http://ebooks.eharlequin.com/05CE0D51-6D7A-469B-BDCF-4E6CF4CD04BB/10/126/en/SearchResultsNP.htm?SearchID=9799692&#34;&gt;ebook offerings&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Charlie Rose interviews Charlie Rose</title>
      <link>https://www.bobdc.com/blog/charlie-rose-interviews-charli/</link>
      <pubDate>Fri, 09 May 2008 10:06:30 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/charlie-rose-interviews-charli/</guid>
      
      
      <description><div>&#34;About the future of technology and the internet and mobile devices and all that&#34;. Very funny.</div><div>&lt;p&gt;YouTube video: &lt;a href=&#34;http://www.youtube.com/watch?v=LFE2CCfAP1o&#34;&gt;&amp;ldquo;Charlie Rose&amp;rdquo; by Samuel Beckett&lt;/a&gt;. Microsoft and Yahoo. Google&amp;hellip;&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/LFE2CCfAP1o&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://csksoft.comk/rails.rdf&#34; title=&#34;http://csksoft.comk/rails.rdf&#34;&gt;Colm Sean Murdoch O Cinneide&lt;/a&gt; on &lt;a href=&#34;#comment-1858&#34;&gt;May 9, 2008 11:42 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d like to address the economic model now steve(tm) ;-)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/technology-future">technology, future</category>
      
    </item>
    
    <item>
      <title>My favorite bookmarklets</title>
      <link>https://www.bobdc.com/blog/my-favorite-bookmarklets/</link>
      <pubDate>Thu, 01 May 2008 08:09:25 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-favorite-bookmarklets/</guid>
      
      
      <description><div>Bookmarklets to search a website, navigate it, and see what links to it.</div><div>&lt;p&gt;The recent lifehacker article &lt;a href=&#34;http://lifehacker.com/software/feature/special-geek-to-live-129141.php&#34;&gt;Ten Must-Have Bookmarklets&lt;/a&gt; reminded me that I&amp;rsquo;ve developed a few handy ones myself. A bookmarklet (&amp;ldquo;bookmark&amp;rdquo; + &amp;ldquo;applet&amp;rdquo;) is a little bit of Javascript embedded in a link. They usually take some information about the page you&amp;rsquo;re looking at and do something useful with it. For example, if you highlight some text on this page and click &lt;a href=&#34;javascript:alert(document.getSelection())&#34;&gt;this demo&lt;/a&gt; it displays the highlighted text in a message box. This particular example is not very useful, but it demonstrates how a bookmarklet can grab information from the displayed web page and do something with it. The following shows what&amp;rsquo;s really in the link:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;a href=&amp;quot;javascript:alert(document.getSelection())&amp;quot;&amp;gt;this demo&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bookmarklets more complex than this one may define and call functions, but they&amp;rsquo;re still all packed into a &lt;code&gt;a&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;href&lt;/code&gt; attribute. The lifehacker article has one that builds on this use of the &lt;code&gt;getSelection()&lt;/code&gt; method by looking up the selected text in an acronym dictionary.&lt;/p&gt;
&lt;p&gt;Running bookmarklets on the page that contains them is rarely interesting, but when you keep these bookmarklets in your bookmarks menu (or, more likely, on your bookmarks toolbar), you can run them against anything, which is when they get valuable. You could drag that &amp;ldquo;this demo&amp;rdquo; link to your bookmarks toolbar and then highlight text on any web page, click that link, and see the highlighted text displayed in a message box, but you&amp;rsquo;d be better off dragging the more useful ones below to your toolbar:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;javascript:function%20homepage(currentURL)%20%7BnewURL%20=%20currentURL.replace(/(%5C.%5BA-Za-z%5D%7B2,4%7D)%5B%5C/%5C?%5C#%5D.*/,&#39;%241&#39;);window.location.href%20=%20newURL;%7D;%20homepage(location.href);&#34;&gt;site&amp;rsquo;s homepage&lt;/a&gt; goes right to a particular site&amp;rsquo;s homepage—for example, to &lt;a href=&#34;http://www.snee.com&#34;&gt;http://www.snee.com&lt;/a&gt; from this page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;javascript:window.location.href=%22http://www.snee.com/bookmarklets/searchform.cgi?url=%22+location.href&#34;&gt;search site&lt;/a&gt; lets you search the website of the displayed page with a minimum of keystrokes. Click it to display a search form with one field (for example, &lt;a href=&#34;http://www.snee.com/bookmarklets/searchform.cgi?url=http://lifehacker.com/software/feature/special-geek-to-live-129141.php&#34;&gt;this form&lt;/a&gt; if you were looking at the lifehacker article), fill out that field with a string to search for, click &amp;ldquo;Go,&amp;rdquo; and Google searches that site for you.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;javascript:function%20parentdir(currentURL)%20%7B%20newURL%20=%20currentURL.replace(/(.*%5C/).*%5C/.*/,&#39;%241&#39;);window.location.href%20=%20newURL;%20%7D;%20parentdir(location.href)&#34;&gt;cd ..&lt;/a&gt; goes to the parent directory of the displayed page&amp;rsquo;s directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;javascript:location.href=&#39;http://www.google.com/search?as_lq=&#39;+document.location.href;&#34;&gt;backlink&lt;/a&gt; tells Google to list pages that link to the displayed page.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other bookmarklets I use include &lt;a href=&#34;http://del.icio.us/help/buttons&#34;&gt;post to del.icio.us&lt;/a&gt; and some of the RDFa bookmarklets that I &lt;a href=&#34;https://www.bobdc.com/blog/digging-rdfa&#34;&gt;mentioned recently&lt;/a&gt;. I also learned from the lifehacker article about &lt;a href=&#34;https://www.squarefree.com/bookmarklets/&#34;&gt;Jesse&amp;rsquo;s Bookmarklets Site&lt;/a&gt;, which has many great ones. I was pleased to see (after an admittedly quick scan) that only my backlink bookmarklet above had an equivalent there—it made me feel like my others were somewhat original. I&amp;rsquo;m especially proud of the &amp;ldquo;search site&amp;rdquo; one, which is like the &amp;ldquo;site&amp;rsquo;s homepage&amp;rdquo; one with a few extra steps thrown in. I use these two every day; for example, a web search sends me to some page, I wonder &amp;ldquo;who are these guys?&amp;rdquo; and I click &lt;a href=&#34;javascript:function%20homepage(currentURL)%20%7BnewURL%20=%20currentURL.replace(/(%5C.%5BA-Za-z%5D%7B2,4%7D)%5B%5C/%5C?%5C#%5D.*/,&#39;%241&#39;);window.location.href%20=%20newURL;%7D;%20homepage(location.href);&#34;&gt;site&amp;rsquo;s homepage&lt;/a&gt; on my toolbar. Or, I get frustrated trying to find something on a site that doesn&amp;rsquo;t offer a Search feature, and I just use my own: &lt;a href=&#34;javascript:window.location.href=%22http://www.snee.com/bookmarklets/searchform.cgi?url=%22+location.href&#34;&gt;search site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;2009-09-22 update&lt;/em&gt; When looking at a Wikipedia page, this one should take you to the corresponding DBpedia page: &lt;a href=&#34;javascript:location.href=location.href.replace(/en.wikipedia.org%5C/wiki/,%22dbpedia.org%5C/page%22)&#34;&gt;wp -&amp;gt; dbpedia&lt;/a&gt;. (If you use it on a page that you reached via redirection, you&amp;rsquo;ll get the DBpedia page about the redirection data. For example, compare the results of trying it on the &lt;a href=&#34;http://en.wikipedia.org/wiki/Barack_Obama&#34;&gt;http://en.wikipedia.org/wiki/Barack_Obama&lt;/a&gt; and the &lt;a href=&#34;http://en.wikipedia.org/wiki/Obama&#34;&gt;http://en.wikipedia.org/wiki/Obama&lt;/a&gt; Wikipedia pages.)&lt;/p&gt;
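&lt;p&gt;The rewrite that this bookmarklet performs is just a substring replacement; here is the same Wikipedia-to-DBpedia mapping in Python for clarity:&lt;/p&gt;

```python
# The same mapping that the bookmarklet's JavaScript replace() call performs:
url = "http://en.wikipedia.org/wiki/Barack_Obama"
dbpedia = url.replace("en.wikipedia.org/wiki", "dbpedia.org/page")
assert dbpedia == "http://dbpedia.org/page/Barack_Obama"
```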
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>In single-source publishing, what do you call the source?</title>
      <link>https://www.bobdc.com/blog/in-singlesource-publishing-wha/</link>
      <pubDate>Fri, 25 Apr 2008 08:48:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/in-singlesource-publishing-wha/</guid>
      
      
      <description><div>The editorial XML?</div><div>&lt;img id=&#34;id202440&#34; src=&#34;https://www.bobdc.com/img/main/editorial.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;single-source publishing diagram&#34;/&gt;
&lt;p&gt;The idea of single source publishing is at least as old as SGML. You store one version of your content with all the information necessary to create the other versions (typically, a print version plus the electronic formats du jour), and then you develop automated routines to create those other versions from the central, &amp;ldquo;single&amp;rdquo; source. The central content gets updated as necessary, and you create new publications by running the appropriate routines to generate the other formats. By making changes in one place and generating the other versions with automated routines, you avoid the mistakes that result from trying to make the same change in multiple places. It&amp;rsquo;s a lot like creating different versions of a software product from a base set of source code, and many of the same tools are often used.&lt;/p&gt;
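&lt;p&gt;As a toy illustration of the pattern (the element names and output routines here are invented for the example, not taken from any real editorial DTD), one small source document can feed routines that each generate a different output:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# One tiny "editorial" source document (element names invented for this sketch).
EDITORIAL = """<article>
  <title>Sample</title>
  <para>One source, many outputs.</para>
</article>"""

def to_html(source):
    """Generate an HTML rendition from the editorial source."""
    root = ET.fromstring(source)
    paras = "".join("<p>%s</p>" % p.text for p in root.iter("para"))
    return "<html><h1>%s</h1>%s</html>" % (root.findtext("title"), paras)

def to_text(source):
    """Generate a plain-text rendition from the same source."""
    root = ET.fromstring(source)
    return "\n\n".join([root.findtext("title")] +
                       [p.text for p in root.iter("para")])
```

&lt;p&gt;Real systems do the same thing at much larger scale, usually with XSLT or similar transformation pipelines driven from the one editorial source.&lt;/p&gt;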
&lt;p&gt;Output formats can include various kinds of XML, such as XHTML, XSL-FO, or homegrown XML formats. To describe the central format I&amp;rsquo;ve heard the term &amp;ldquo;editorial XML&amp;rdquo; used, because it&amp;rsquo;s XML, but it plays a different role from the output XML formats: it&amp;rsquo;s the version that the editors maintain. Its structure is governed by the editorial DTDs or schemas.&lt;/p&gt;
&lt;p&gt;A homegrown XML output format is sometimes called the &amp;ldquo;delivery&amp;rdquo; XML. In his XML 2006 presentation &lt;a href=&#34;http://2006.xmlconference.org/proceedings/89/presentation.html&#34;&gt;Case Study: Managing XML for a Global Content Delivery Platform&lt;/a&gt;, my former LexisNexis colleague Marc Basch described how different business units within the company had their own editorial XML formats and converted these to a particular delivery format that they developed for the central platform that would aggregate them.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve heard this editorial/delivery distinction between sets of XML content elsewhere, but a Google search on the term doesn&amp;rsquo;t find much besides Marc&amp;rsquo;s paper. Has anyone else heard of these terms being used in a production system?&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.dpawson.co.uk&#34; title=&#34;http://www.dpawson.co.uk&#34;&gt;Dave Pawson&lt;/a&gt; on &lt;a href=&#34;#comment-1834&#34;&gt;April 25, 2008 11:32 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;No Bob, the editorial version is the M$ Word version!&lt;/p&gt;
&lt;p&gt;I use master source for the XML master, however generated.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ionrock.org/blog/&#34; title=&#34;http://ionrock.org/blog/&#34;&gt;Eric Larson&lt;/a&gt; on &lt;a href=&#34;#comment-1835&#34;&gt;April 25, 2008 12:34 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve noticed the master version is traditionally the original FrameMaker/Word version as well. It is not ideal, but with tools like WebWorks (which uses an XML intermediary format for processing), creating an XML based workflow while keeping the author tools seems possible.&lt;/p&gt;
&lt;p&gt;I think the immediate future will be focused on an author focused master with an XML based master that acts almost as an index. Again, it would be nice to just have one resource as a master, but I think the current technical documentation landscape doesn&amp;rsquo;t quite support it.&lt;/p&gt;
&lt;p&gt;I really do think that using tools like WebWorks can make single sourcing practical from a technical standpoint. Essentially it allows a gateway for tools to add to a structured data store. I know it has been considered primarily a help tool in the past, but for those folks who deal with XML, it really is much closer to a document processing XProc-ish tool built around XSLT.&lt;/p&gt;
&lt;p&gt;I should also mention that I used to work for WebWorks, so I&amp;rsquo;m a hair biased. From a technical standpoint, it is a very powerful tool that I believe many XML developers should consider taking a look at for document processing projects.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://seanmcgrath.blogspot.com&#34; title=&#34;http://seanmcgrath.blogspot.com&#34;&gt;Sean McGrath&lt;/a&gt; on &lt;a href=&#34;#comment-1836&#34;&gt;April 26, 2008 2:06 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;The editorial XML version is the color-coded ODT XML version :-)&lt;/p&gt;
&lt;p&gt;I use the term &amp;ldquo;normative copy&amp;rdquo; rather than master/editorial myself. I.e. it is the stream of bytes that must be obeyed and, in the event of a dispute about correctness of content or presentation, acceded to.&lt;/p&gt;
&lt;p&gt;Sean&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.reallysi.com&#34; title=&#34;http://blog.reallysi.com&#34;&gt;Ed Stevenson&lt;/a&gt; on &lt;a href=&#34;#comment-1845&#34;&gt;May 5, 2008 11:53 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In the publishing industry, I have definitely heard the term “delivery XML” used commonly to refer to various XML outputs (mostly for web or third party aggregators). However, I have not really heard a common term for the source XML. And not sure if I&amp;rsquo;ve ever heard people in that industry refer to it as &amp;ldquo;editorial XML&amp;rdquo; (and that may be because many in that industry are still getting XML after or as a part of print production processes).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Windows command line text processing with Javascript</title>
      <link>https://www.bobdc.com/blog/windows-command-line-text-proc/</link>
      <pubDate>Tue, 22 Apr 2008 09:11:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/windows-command-line-text-proc/</guid>
      
      
      <description><div>Or, technically, with JScript.</div><div>&lt;p&gt;&lt;em&gt;Update: don&amp;rsquo;t bother with CScript. See &lt;a href=&#34;https://www.bobdc.com/blog/javascript-from-the-command-li&#34;&gt;Javascript from the command line&lt;/a&gt; to learn about doing this with Rhino; you can also do it with node.js.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I recently had to write a script that would make global replacements in a text file on a client&amp;rsquo;s machine. Much as I love python (I look forward to telling you about the fun I&amp;rsquo;m having with &lt;a href=&#34;http://code.google.com/p/python-calais/&#34;&gt;python-calais&lt;/a&gt;), it was one of those tasks that just cried out &amp;ldquo;perl&amp;rdquo;. Unfortunately, I couldn&amp;rsquo;t take perl&amp;rsquo;s existence for granted on the client&amp;rsquo;s computers, and having them install it was too much trouble.&lt;/p&gt;
&lt;blockquote id=&#34;id202455&#34; class=&#34;pullquote&#34;&gt;Between the string manipulation functions and the regular expression, standard input, and standard output support, the combination of cscript and JScript gives all Windows machines a powerful text processing tool right out of the box.&lt;/blockquote&gt;
&lt;p&gt;I could, however, take everything in a typical Windows installation for granted, and it turned out that Microsoft&amp;rsquo;s &lt;a href=&#34;http://en.wikipedia.org/wiki/Jscript&#34;&gt;JScript&lt;/a&gt; implementation of Javascript could do everything I needed. (&lt;a href=&#34;http://en.wikipedia.org/wiki/VBScript&#34;&gt;VBScript&lt;/a&gt; equivalents of what I describe here shouldn&amp;rsquo;t be much different.) Given Javascript&amp;rsquo;s roots as a web scripting language, I knew that, for safety reasons, it offers little in the way of file input and especially output, but I learned that basic I/O is not only fairly easy, it can be done with &lt;a href=&#34;http://en.wikipedia.org/wiki/Standard_streams&#34;&gt;standard input&lt;/a&gt; and output, so that a command in a batch file can pipe content through JScript scripts just as it can with perl or python scripts.&lt;/p&gt;
&lt;p&gt;The Windows utility that lets you run scripts from the command line is called &lt;a href=&#34;http://msdn.microsoft.com/archive/default.asp?url=https://www.bobdc.com/archive/en-us/wsh/htm/wsRunCscript.asp&#34;&gt;cscript.exe&lt;/a&gt;. It assumes that a script file with an extension of &amp;ldquo;js&amp;rdquo; is a JScript program, although it can run VBScript programs as well.&lt;/p&gt;
&lt;p&gt;The following script performs a global replacement using the target and replacement strings passed as command line parameters. It demonstrates a few nice things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;WScript.Echo&lt;/code&gt; writes to standard output. CScript is essentially a less GUI-oriented version of the &lt;a href=&#34;http://msdn.microsoft.com/archive/default.asp?url=https://www.bobdc.com/archive/en-us/wsh/htm/wsRunWscript.asp&#34;&gt;wscript.exe&lt;/a&gt; Windows scripting engine, so many basic library calls include the latter&amp;rsquo;s name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/at5ydy31(VS.85).aspx&#34;&gt;WScript object&lt;/a&gt; has a &lt;code&gt;StdIn&lt;/code&gt; property for reading from standard input.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;WScript.Arguments&lt;/code&gt; stores the command line parameters used when invoking the script.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can use regular expressions with UNIXy syntax. To use a variable such as &lt;code&gt;target&lt;/code&gt; in a regular expression, you need the &lt;code&gt;RegExp&lt;/code&gt; object, as the sample script demonstrates, but the comment preceding that line shows how a hardcoded regular expression would not need this object.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- --&gt;
&lt;pre&gt;&lt;code&gt;// replace.js: globally replace one string with another.
// See directions() for syntax.

function directions() {
    WScript.Echo(&amp;quot;Enter\n&amp;quot;);
    WScript.Echo(&amp;quot;cscript //Nologo replace.js targetstring &amp;quot; +
                 &amp;quot;replstring &amp;lt; infile.txt &amp;gt; outfile.txt&amp;quot;);
    WScript.Echo(&amp;quot;\nEnclose strings that have spaces in quotation marks.&amp;quot;);
}

function processTextStream() {
    var target = WScript.Arguments.Item(0);
    var newString = WScript.Arguments.Item(1);
    // ReadAll() pulls everything from standard input at once.
    var text = WScript.StdIn.ReadAll();
    // If I wasn&#39;t passing a variable as the first argument
    // of replace, I could use normal regex syntax like
    // text.replace(/Robert/g,&amp;quot;Bob&amp;quot;)
    text = text.replace(new RegExp(target, &amp;quot;g&amp;quot;), newString);
    WScript.Echo(text);
}

// --------------------------------------------------

if (WScript.Arguments.length &amp;lt; 2) {
    directions();
} else {
    processTextStream();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As the directions show, you could run this at a Windows command line like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   cscript //Nologo replace.js Robert Bob &amp;lt; oldaddrbook.txt &amp;gt; newaddrbook.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&amp;rsquo;re redirecting the output to a file and don&amp;rsquo;t want the Microsoft cscript banner in that file, don&amp;rsquo;t forget the &lt;a href=&#34;http://www.amazon.com/No-Logo-Space-Choice-Jobs/dp/0312421435&#34;&gt;//Nologo&lt;/a&gt; parameter. &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/0fs17b0s(VS.85).aspx&#34;&gt;Several other parameters&lt;/a&gt; are available.&lt;/p&gt;
&lt;p&gt;Between the string manipulation functions and the regular expression, standard input, and standard output support, this combination of the cscript engine and the J(ava)Script programming language gives all Windows machines a powerful text processing tool right out of the box.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-1830&#34;&gt;April 22, 2008 12:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nifty!&lt;/p&gt;
&lt;p&gt;I guess you could run RDF stuff in a single-file too, using &lt;a href=&#34;http://www.jibbering.com/rdf-parser/&#34;&gt;http://www.jibbering.com/rdf-parser/&lt;/a&gt; :)&lt;/p&gt;
&lt;p&gt;Haven&amp;rsquo;t played with this at all yet; am mostly MacOSX-based lately, where addressbook and pubsub APIs are getting my attention. Is it possible to access the equivalent in Windows from .js?&lt;/p&gt;
&lt;p&gt;By orlando on &lt;a href=&#34;#comment-1831&#34;&gt;April 23, 2008 11:37 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I would suggest you use awk95.exe:&lt;/p&gt;
&lt;p&gt;[&amp;hellip;]awk was chosen since it is a very small download (compared with Perl or WSH/VB) and accomplishes the task of modifying configuration files upon installation. Brian Kernighan&amp;rsquo;s &lt;a href=&#34;http://cm.bell-labs.com/cm/cs/who/bwk/&#34;&gt;http://cm.bell-labs.com/cm/cs/who/bwk/&lt;/a&gt; site has a compiled native Win32 binary, &lt;a href=&#34;http://cm.bell-labs.com/cm/cs/who/bwk/awk95.exe&#34;&gt;http://cm.bell-labs.com/cm/cs/who/bwk/awk95.exe&lt;/a&gt; which you must save with the name awk.exe rather than awk95.exe.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://httpd.apache.org/docs/2.2/platform/win_compiling.html&#34;&gt;http://httpd.apache.org/docs/2.2/platform/win_compiling.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1832&#34;&gt;April 23, 2008 12:57 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I love awk, and actually pulled out my little gray book just recently&amp;ndash;it&amp;rsquo;s one of the few books that I ever owned two copies of so that I could keep one at work and one at home. (When I first discovered SGML, I was writing awk scripts to convert XyWrite files to online help source for Windows, OS/2, mainframes, and Unix flavors.) It&amp;rsquo;s certainly easier to install than perl, but for the situation I described above, I was better off not telling this client to download and install anything.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>ebook sales</title>
      <link>https://www.bobdc.com/blog/ebook-sales/</link>
      <pubDate>Thu, 17 Apr 2008 08:44:59 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/ebook-sales/</guid>
      
      
      <description><div>Keeping track.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.idpf.org/doc_library/industrystats.htm&#34;&gt;&lt;img id=&#34;id202440&#34; src=&#34;http://www.idpf.org/doc_library/statistics/images/Trade%20Stats.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;ebook sales charts&#34; width=&#34;300px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The big question about ebooks is typically this: there&amp;rsquo;s hardware, there&amp;rsquo;s software, and there are titles, but how well are books &lt;em&gt;selling&lt;/em&gt;? A good way to keep track is &lt;a href=&#34;http://www.idpf.org/doc_library/industrystats.htm&#34;&gt;the IDPF&amp;rsquo;s industry stats page&lt;/a&gt;, which was updated a few weeks ago to include January 2008 ebook sales numbers. (Note the caveats below the chart such as &amp;ldquo;Retail numbers may be as much as double the above figures due to industry wholesale discounts.&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;An &lt;a href=&#34;http://toc.oreilly.com/2008/04/lbf-what-ex-smokers-and-ebook-early-adopters-have-in-common.html&#34;&gt;O&amp;rsquo;Reilly Tools of Change blog posting&lt;/a&gt; on Penguin Group Digital Director Genevieve Shore&amp;rsquo;s presentation at the London Book Fair has additional interesting information on ebook sales. Andrew Savikas offers the following highlights from her talk:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Though Penguin USA has been selling ebooks for 10 years, 2007 was the first time they saw &amp;ldquo;interesting revenue&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In the first two months of 2008, Penguin USA has sold more ebooks than in all of 2007&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Readers now expect new frontlist titles to be available as ebooks at the same time they show up in bookstores&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;a href=&#34;http://toc.oreilly.com/&#34;&gt;Tools of Change&lt;/a&gt; page is definitely worth reading if you&amp;rsquo;re interested in publishing technology. They offer feeds for the blog, for the news, and a combined feed, which is the one I subscribe to.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Digging RDFa</title>
      <link>https://www.bobdc.com/blog/digging-rdfa/</link>
      <pubDate>Mon, 14 Apr 2008 11:24:29 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/digging-rdfa/</guid>
      
      
      <description><div>More data to play with, more tools to play with it.</div><div>&lt;img id=&#34;id202439&#34; src=&#34;https://www.bobdc.com/img/main/diggrdfa.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;digg.com with RDFa highlighted&#34;/&gt;
&lt;p&gt;RDFa seems to be picking up more momentum in the last few weeks. The &lt;a href=&#34;https://www.bobdc.com/blog/the-future-of-rdfa#comments&#34;&gt;formerly skeptical&lt;/a&gt; Taylor Cowan is &lt;a href=&#34;http://www.oreillynet.com/xml/blog/2008/04/they_knew_the_train_would_come.html&#34;&gt;liking it more&lt;/a&gt;, and I learned from the &lt;a href=&#34;http://rdfa.info/2008/04/04/digg-starts-using-rdfa/&#34;&gt;RDFa blog&lt;/a&gt; that &lt;a href=&#34;http://digg.com/&#34;&gt;Digg&lt;/a&gt; has lots of RDFa—five triples of information for some stories, so there are some simple but cool applications waiting to be written around those.&lt;/p&gt;
&lt;p&gt;I was a little behind in taking advantage of the &lt;a href=&#34;http://www.w3.org/2006/07/SWD/RDFa/impl/js/&#34;&gt;RDFa bookmarklets&lt;/a&gt;, but Ben Adida pointed me to them when I asked about an up-to-date RDFa extractor. I set the GetN3 one to send the N3 triples right to Emacs, and it&amp;rsquo;s great when looking at a web page like Digg&amp;rsquo;s home page to click a button and see all the triples appear in Emacs.&lt;/p&gt;
&lt;p&gt;The first few times I tried the RDFa Highlight bookmarklet, which puts red rectangles around all the parts of a web page that have RDFa metadata assigned, I didn&amp;rsquo;t think it was very useful; I thought, OK, red rectangles, what can I do with them? My experience with Digg changed my mind. A single button click gives a very quick and intuitive display of how much RDFa a page offers to work with.&lt;/p&gt;
&lt;p&gt;Ironically, the RDFa bookmarklets page has no RDFa in it as I write this. My blog&amp;rsquo;s home page has very little, and the bookmarklets can&amp;rsquo;t seem to find it, but they work fine on my Permalink pages (like &lt;a href=&#34;https://www.bobdc.com/blog/digging-rdfa&#34;&gt;this one&lt;/a&gt;), which have a bit more anyway.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Two-side printing on a one-sided printer</title>
      <link>https://www.bobdc.com/blog/twoside-printing-on-a-onesided/</link>
      <pubDate>Wed, 09 Apr 2008 07:57:32 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/twoside-printing-on-a-onesided/</guid>
      
      
      <description><div>Use less paper, carry around less paper.</div><div>&lt;p&gt;I recently wanted to print a 184-page spec to read, but I didn&amp;rsquo;t want to carry around a pile of paper that big. I have no access to a two-sided printer, but I figured out how to create two-sided pages with a one-side printer:&lt;/p&gt;
&lt;img id=&#34;id202410&#34; src=&#34;https://www.bobdc.com/img/main/printer.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;printer picture&#34;/&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Make a mark at one end of the first blank page waiting to be printed so that you know which side gets printed and which end will be the top.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Print the odd pages of the range of pages you want, in reverse order.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Put the printed pages back in the printer, oriented so that page 2 will be printed on the other side of page 1, page 4 on the other side of page 3, and so forth. The mark you put on the first page in step 1 should give you a clue about how to do this, but if you&amp;rsquo;re not sure, print just the first few pages of your document this way to make sure you have it right. With my Brother MFC-7220 printer, I put the pages back in with the printed side up and the top of the page at the front of the printer. (Update: If the last page you want to print is an odd-numbered page, don&amp;rsquo;t put it back in for the second run. For example, if you&amp;rsquo;re printing a total of 21 pages, don&amp;rsquo;t reload the piece of paper with page 21 on it for this step. You want page 20 to be on the other side of page 19.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Print even pages of the same range in reverse order.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The double reverse puts the final product in the correct order.&lt;/p&gt;
&lt;p&gt;I first worked this out printing a PDF file from &lt;a href=&#34;https://www.bobdc.com/blog/a-nice-windows-alternative-to&#34;&gt;Foxit Reader&lt;/a&gt;, which lets you set all the necessary details from the &lt;strong&gt;Print&lt;/strong&gt; dialog box. Microsoft Word&amp;rsquo;s &lt;strong&gt;Print&lt;/strong&gt; dialog box lets you pick &amp;ldquo;Odd Pages&amp;rdquo; or &amp;ldquo;Even Pages&amp;rdquo; instead of &amp;ldquo;All pages in range&amp;rdquo;, but you have to click the &lt;strong&gt;Options&lt;/strong&gt; button to find the &lt;strong&gt;Reverse print order&lt;/strong&gt; checkbox. In Open Office Writer, click the &lt;strong&gt;Options&lt;/strong&gt; button on the &lt;strong&gt;Print&lt;/strong&gt; dialog box to find the &lt;strong&gt;Left pages&lt;/strong&gt;, &lt;strong&gt;Right pages&lt;/strong&gt;, and &lt;strong&gt;Reversed&lt;/strong&gt; check boxes.&lt;/p&gt;
&lt;p&gt;One more hint: if the pages come out of the printer warm and slightly curled after the first pass, it can reduce the chance of paper jams if you let them cool off before inserting them for the second pass.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Linked Data Planet program in place</title>
      <link>https://www.bobdc.com/blog/linked-data-planet-program-in/</link>
      <pubDate>Sun, 06 Apr 2008 12:31:19 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/linked-data-planet-program-in/</guid>
      
      
      <description><div>Big names and big ideas.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.linkeddataplanet.com/conference/conferencegrid.php&#34;&gt;&lt;img id=&#34;id202406&#34; src=&#34;http://www.linkeddataplanet.com/images/hdr_logo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Linked Data Planet logo&#34; width=&#34;240px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://www.linkeddataplanet.com/conference/conferencegrid.php&#34;&gt;program&lt;/a&gt; for the Linked Data Planet conference that I&amp;rsquo;m co-chairing with Ken North in New York City in June is just about all set. The list of speakers has a great mix of well-known names in the Linked Data world (which overlaps quite a bit with the semantic web world) such as Tim Berners-Lee, Kingsley Idehen, Jim Hendler, and Jim Melton. We have hands-on people, as opposed to marketing people, from large and small software companies talking about tools that you can use to build applications with the increasing amount of linked data out there.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m especially looking forward to the &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T8&#34;&gt;Linked Data Workshop panel&lt;/a&gt; that I&amp;rsquo;m moderating, in which seven experts in the field discuss approaches to the key issues involved in building linked data applications. My only regret is that this is scheduled opposite Seth Earley&amp;rsquo;s presentation on &lt;a href=&#34;http://www.linkeddataplanet.com/conference/sessionsbyday.php#T7&#34;&gt;the role of taxonomies and controlled vocabularies in data integration&lt;/a&gt;, because I&amp;rsquo;ve been reading up on taxonomies lately and wanted to hear what he has to say.&lt;/p&gt;
&lt;p&gt;At least I&amp;rsquo;ll get to meet him, and many other people I&amp;rsquo;ve always wanted to meet, in addition to seeing some old friends such as Uche Ogbuji and Walter Perry, two of my panel&amp;rsquo;s participants. Come join us, and then as the world of linked data technologies grows you&amp;rsquo;ll someday be able to say &amp;ldquo;I remember the first Linked Data conference, back in 2008&amp;hellip;&amp;rdquo;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Vote for my brother&#39;s Radiohead remix</title>
      <link>https://www.bobdc.com/blog/vote-for-my-brothers-radiohead/</link>
      <pubDate>Thu, 03 Apr 2008 21:08:14 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/vote-for-my-brothers-radiohead/</guid>
      
      
      <description><div>But only if you get the joke.</div><div>&lt;p&gt;Radiohead has made the separate tracks for their new single &amp;ldquo;Nude&amp;rdquo; available and &lt;a href=&#34;http://www.radioheadremix.com/&#34;&gt;invited people to submit remixes&lt;/a&gt;. My brother Peter, who &lt;a href=&#34;http://www.musicforpicture.com&#34;&gt;scores TV commercials&lt;/a&gt; for a living, has submitted a &lt;a href=&#34;http://www.mcylinder.com&#34;&gt;remix&lt;/a&gt; that&amp;rsquo;s hilarious if you know two things in advance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Radiohead lead guitarist Jonny Greenwood won the 2006 British Composer Awards prize for a classical piece titled &lt;a href=&#34;http://blog.wired.com/music/2006/10/radioheads_john.html&#34;&gt;Popcorn Superhet Receiver&lt;/a&gt;. This wasn&amp;rsquo;t some &lt;a href=&#34;http://en.wikipedia.org/wiki/Paul_McCartney&#39;s_Liverpool_Oratorio&#34;&gt;Liverpool Oratorio&lt;/a&gt; pop star attempt to move into &amp;ldquo;serious&amp;rdquo; territory by writing for a classical ensemble; he studied viola in his youth, served as the BBC&amp;rsquo;s composer in residence in 2004, and is considered to be somewhat of an expert on French composer Olivier Messiaen.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A 1972 top ten hit in the US and UK described by &lt;a href=&#34;http://en.wikipedia.org/wiki/Popcorn_(instrumental)&#34;&gt;Wikipedia&lt;/a&gt; as &amp;ldquo;the first primarily electronic-based piece of music to reach the American popular music charts&amp;rdquo; was titled &amp;ldquo;Popcorn&amp;rdquo;. You can hear it on &lt;a href=&#34;http://www.youtube.com/watch?v=9N4ckFN96-k&#34;&gt;YouTube&lt;/a&gt; while watching some forgettable visuals. The &amp;ldquo;Hot Butter&amp;rdquo; version, although not the original, has some proto-disco strings and is a classic among gimmicky instrumentals.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you&amp;rsquo;ve heard &amp;ldquo;Nude&amp;rdquo; and remember &amp;ldquo;Popcorn&amp;rdquo;, you should enjoy Peter&amp;rsquo;s &lt;a href=&#34;http://www.radioheadremix.com/remix/?id=349&#34;&gt;More Popcorn, Less Superhet Receiver&lt;/a&gt;. If so, please vote for it. Jonny Greenwood seems to smile even less than Thom Yorke, but this should help.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.radioheadremix.com/remix/?id=349&#34;&gt;&lt;img id=&#34;id202409&#34; src=&#34;http://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Popcorn02.jpg/250px-Popcorn02.jpg&#34; border=&#34;0&#34; align=&#34;center&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;popcorn picture&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://timothyhorrigan.com&#34; title=&#34;http://timothyhorrigan.com&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-1771&#34;&gt;April 5, 2008 12:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I voted for Peter&amp;rsquo;s remix. His remix was a vast improvement over the typical Radiohead fare.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s kind of a chintzy contest&amp;hellip; you have to pay $4.95 to download the tracks and there seems to be no prize for winning other than the honor of winning. But I think I will take a stab at it myself anyway.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://radioheadremix.com/remix/?id=938&#34; title=&#34;http://radioheadremix.com/remix/?id=938&#34;&gt;The Wombat&lt;/a&gt; on &lt;a href=&#34;#comment-1808&#34;&gt;April 10, 2008 1:01 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I like it (and voted). This is also one of my favs:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://radioheadremix.com/remix/?id=938&#34;&gt;http://radioheadremix.com/remix/?id=938&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Wombat&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.radioheadremix.com/remix/?id=874&#34; title=&#34;http://www.radioheadremix.com/remix/?id=874&#34;&gt;Scott&lt;/a&gt; on &lt;a href=&#34;#comment-1818&#34;&gt;April 11, 2008 1:52 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I voted as well.&lt;br /&gt;
This is my current fave too.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>RDF and social networks</title>
      <link>https://www.bobdc.com/blog/rdf-and-social-networks/</link>
      <pubDate>Tue, 01 Apr 2008 09:13:20 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdf-and-social-networks/</guid>
      
      
      <description><div>Better than XML!</div><div>&lt;p&gt;Looking at Michael Pick&amp;rsquo;s video &lt;a href=&#34;http://www.vimeo.com/610179&#34;&gt;DataPortability - Connect, Control, Share, Remix&lt;/a&gt; and on the &lt;a href=&#34;http://www.dataportability.com/&#34;&gt;dataportability.com home page&lt;/a&gt;, I saw that RDF was included in a brief list of standards involved, and something occurred to me about the value of RDF in attempts to share data across applications such as social networking sites—in particular, why it&amp;rsquo;s better than XML for this.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.dataportability.com/&#34;&gt;&lt;img id=&#34;id202418&#34; src=&#34;https://www.bobdc.com/img/main/dpstandards.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;data portability standards logos&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;XML was invented for online publishing, but its popularity grew so quickly because of its value for sharing data between organizations that have different information infrastructures. When I give a class in XSLT, I begin by describing the visions people originally had of DTDs describing common information structures (I almost wrote &amp;ldquo;common vocabularies&amp;rdquo; there, but the difference is precisely the point of this posting) that would let different business partners in the same field share data more easily.&lt;/p&gt;
&lt;p&gt;Many of these visions worked out very well, but many people are still waiting for the DTD or schema of their field&amp;rsquo;s information. (I bring this up in XSLT classes because many people got tired of waiting for relevant DTD standards and decided to just accept others&amp;rsquo; XML in whatever format and convert it as needed upon arrival. This practice has been a big driver of XSLT&amp;rsquo;s success.) DTD development is complex because in addition to identifying common vocabularies, you must spell out the relationships between these pieces of information. In other words, you must do some data modeling. There is a lot more heuristic gut reaction involved in this than in designing relational databases, where the straightforward procedure of &lt;a href=&#34;http://en.wikipedia.org/wiki/Database_normalization&#34;&gt;normalization&lt;/a&gt; can guide many of your decisions about data relationships. DTDs also require you to decide which information should be wrapped in containers, which should be stored in attributes, which need unique IDs&amp;hellip; there&amp;rsquo;s a great payoff if you do this work, but it&amp;rsquo;s a lot of work, and it&amp;rsquo;s especially difficult for committees of people from different organizations to work together on this.&lt;/p&gt;
&lt;p&gt;Defining a vocabulary instead of a DTD is the low-hanging fruit. (I&amp;rsquo;m deliberately using the term &amp;ldquo;vocabulary&amp;rdquo; instead of taxonomy or ontology to keep things simple, but the tools and techniques of those fields have much to contribute.) It reduces the work not by simplifying it, but by narrowing the scope: you forget about the data structures. If you want to just define a list of terms and exchange collections of field name/value pairs whose names correspond to that list of terms—or even to several lists, as long as each is clearly identified—RDF makes it very simple, and it&amp;rsquo;s even more portable than XML. With the right vocabularies, I can deliver my contact information, my Facebook and LinkedIn IDs, and any other data that deserves to be portable without worrying about hierarchies in this set of information, or which order they should be in, or which should be in attributes and which should be in elements.&lt;/p&gt;
&lt;p&gt;There is a bit of irony here, in that what turned many people off from RDF was the ugliness of the XML used to represent data structures such as &lt;a href=&#34;http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#containers&#34;&gt;containers&lt;/a&gt;. RDF can help people get a lot of useful work done if they ignore these data structures (and &lt;a href=&#34;http://www.w3.org/2001/10/stripes/&#34;&gt;striping&lt;/a&gt;). This can leave them with some simple, intuitive RDF/XML, and they also have the option of ignoring RDF/XML and using something like &lt;a href=&#34;http://www.w3.org/DesignIssues/Notation3.html&#34;&gt;n3&lt;/a&gt; instead.&lt;/p&gt;
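&lt;p&gt;As a minimal sketch of what that looks like in n3 (the person and URIs here are made up for illustration; the property names come from the real FOAF vocabulary), portable contact data is just a flat set of name/value pairs attached to one identifier:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; .

# One subject, three name/value pairs: no nesting, ordering,
# or element-vs.-attribute decisions to make.
&amp;lt;http://example.com/people/bob&amp;gt;
    foaf:name     &amp;quot;Bob Example&amp;quot; ;
    foaf:mbox     &amp;lt;mailto:bob@example.com&amp;gt; ;
    foaf:homepage &amp;lt;http://example.com/&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;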
&lt;p&gt;It&amp;rsquo;s nice to see that the data portability folks didn&amp;rsquo;t get scared off.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://norman.walsh.name/&#34; title=&#34;http://norman.walsh.name/&#34;&gt;Norman Walsh&lt;/a&gt; on &lt;a href=&#34;#comment-1764&#34;&gt;April 2, 2008 6:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yep, all the way back &lt;a href=&#34;http://norman.walsh.name/2003/12/11/practicalrdf&#34;&gt;to 2003&lt;/a&gt;, I&amp;rsquo;ve been pitching RDF&amp;rsquo;s two strengths as aggregation and inference.&lt;/p&gt;
&lt;p&gt;For data portability, there&amp;rsquo;s a huge win to cheap aggregation. XML is my hammer of choice, but even I can&amp;rsquo;t claim that mixing random XML vocabularies is cheap.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>What do you do with your ebook prototypes?</title>
      <link>https://www.bobdc.com/blog/what-do-you-do-with-your-ebook/</link>
      <pubDate>Fri, 28 Mar 2008 18:26:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/what-do-you-do-with-your-ebook/</guid>
      
      
      <description><div>Or any other new electronic product that you&#39;re not ready to charge money for?</div><div>&lt;p&gt;I&amp;rsquo;ve given several talks on preparing strategies for the ebook market, and one key point is the value of early prototyping. &lt;a href=&#34;http://www.idpf.org&#34;&gt;epub&lt;/a&gt; format ebooks are good for this because they&amp;rsquo;re &lt;a href=&#34;https://www.bobdc.com/blog/creating-epub-files&#34;&gt;easy to make&lt;/a&gt; and fit well into the increasing use of &lt;a href=&#34;http://en.wikipedia.org/wiki/Agile_software_development&#34;&gt;agile software development&lt;/a&gt; practices in the publishing world. (Other formats aren&amp;rsquo;t very difficult to create either, and as we&amp;rsquo;ll see, may be worth including in your prototyping efforts.) You can put something together, show it to your management and customers, get their feedback, make tweaks to reflect the feedback, and continue this cycle as you refine the look and feel of your ebook.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Showing it to your customers&amp;rdquo;, however, is easier said than done. How do you go about this? The important thing to remember (and to be honest, this all applies to nearly any new electronic delivery medium, not just ebooks) is that while ebooks may not be something you can charge customers money for, they still have value that can help you make money. For example, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Give away an ebook version of a book with the hardcopy version. As with the registration of a product, this can add a customer to a mailing list, and it acclimates them to ebook use, which a lot of publishers are interested in doing as they compare the costs of producing and distributing ebooks with the costs of producing and distributing hardcopy bound books.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give away ebooks with only the first few chapters of a title to motivate customers to buy the complete hardcopy book. You can combine the beginning of several books into a new product sampler, as record companies have done for years.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give away ebooks as an incentive for subscribing to a product for a longer period of time—for example, you could offer subscribers a free ebook with a two-year subscription to a product, but not with a one-year subscription. If you&amp;rsquo;re a magazine publisher, the book could compile the work of a particular author, or pieces from a particular section of the magazine such as recipes, travel writing, contemporary reviews of classic movies of the 1970s&amp;hellip;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give away ebooks as an incentive to fill out surveys. Detailed customer surveys take time to fill out, and you don&amp;rsquo;t just want to know about the buying habits of customers with a lot of time on their hands. The inclusion of a free ebook at the end of a survey could help to accumulate more valuable survey data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give away ebooks with public domain material related to your product line. (It looks like &lt;a href=&#34;http://radar.oreilly.com/archives/2008/03/penguins-missed-ebook-opportun.html&#34;&gt;Penguin is already doing this&lt;/a&gt;.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even if you don&amp;rsquo;t use ebooks as incentives for filling out customer surveys, most of these ideas will help you gather data that helps you develop an ebook strategy. Making ebooks available in multiple formats will eventually give you ideas about which formats are more popular with your potential customers. You can also see which content classes (recipes? reference material? humor? technical content? serial fiction?) are more popular with your potential ebook readers.&lt;/p&gt;
&lt;p&gt;Along with whetting your audience&amp;rsquo;s appetite for new titles in more traditional media, free ebooks can also help you explore their appetite for older titles. Publishers who&amp;rsquo;ve been around for a while often have a large backlist of content that isn&amp;rsquo;t valuable enough to print, bind, store, and ship, but still has enough accumulated long tail value that it can drive revenue if the production costs are low enough. With ebooks, those costs are pretty low.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By andrew on &lt;a href=&#34;#comment-1756&#34;&gt;April 1, 2008 11:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Penguin isn&amp;rsquo;t giving ebooks with public domain add-ins away, they&amp;rsquo;re (improbably) going to (try to) charge $8 for them.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Customizing nxml to find your schemas automatically</title>
      <link>https://www.bobdc.com/blog/customizing-nxml-to-find-your/</link>
      <pubDate>Tue, 25 Mar 2008 10:46:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/customizing-nxml-to-find-your/</guid>
      
      
      <description><div>By namespace or document element.</div><div>&lt;p&gt;The first time I loaded an RDF/XML document into Emacs with &lt;a href=&#34;http://www.thaiopensource.com/nxml-mode/&#34;&gt;nxml mode&lt;/a&gt;, it automatically loaded the appropriate RELAX NG compact schema for me. I was especially impressed because RDF/XML has such a potentially tricky structure. (Perhaps too tricky, but that&amp;rsquo;s another topic.) In its default configuration, nxml automatically loads the appropriate schemas for RDF/XML, XHTML 1, RELAX NG, DocBook, and XSLT. This last one has been my only real XSLT development tool other than actual XSLT processors for years.&lt;/p&gt;
&lt;p&gt;For other document types, I&amp;rsquo;d go to the &lt;strong&gt;XML&lt;/strong&gt; menu that nxml adds to Emacs, pick &lt;strong&gt;Set Schema&lt;/strong&gt;, then &lt;strong&gt;File&amp;hellip;&lt;/strong&gt;, and then browse to the appropriate RELAX NG compact schema file. Because I edit a lot of XML files, this was adding up to a lot of time, and I just learned how to set up nxml to find most of the schemas I need automatically based on a document&amp;rsquo;s namespace or document element.&lt;/p&gt;
&lt;p&gt;All you need to do is to add new namespace/schema pairs to the right configuration file. The same configuration file lets me add more choices to the /&lt;strong&gt;XML&lt;/strong&gt;/&lt;strong&gt;Set Schema&lt;/strong&gt;/&lt;strong&gt;For Document Type&lt;/strong&gt; cascade menu so that if I create an empty new document with an extension of &amp;ldquo;xml&amp;rdquo; and nxml has no other clue about what schema to use, I can pick the document type off of this menu instead of browsing around my hard disk looking for the schema.&lt;/p&gt;
&lt;p&gt;When you assign a schema to a document with /&lt;strong&gt;Set Schema&lt;/strong&gt;/&lt;strong&gt;File&amp;hellip;&lt;/strong&gt;, nxml asks if you want to &amp;ldquo;Save Schema Location to [directory of file being edited]&amp;rdquo;. If so, it adds an entry to that directory&amp;rsquo;s schemas.xml file that points at the document and at the schema so that the next time you load that document nxml will know what schema to use. After using nxml for years, I only just learned that the directory with the elisp files of nxml code has the central schemas.xml file where you can add elements and attributes to do everything I described above.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s all described on Dave Pawson&amp;rsquo;s &lt;a href=&#34;http://www.dpawson.co.uk/relaxng/nxml/schemaloc.html&#34;&gt;nxml-mode Schema location&lt;/a&gt; page, but I&amp;rsquo;ll summarize it here. The following element in this schemas.xml file adds an XHTML2 entry to the &lt;strong&gt;For Document Type&lt;/strong&gt; cascade menu so that picking it loads an XHTML 2 schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;typeId id=&amp;quot;XHTML2&amp;quot; uri=&amp;quot;xhtml2/xhtml2.rnc&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following tells nxml to load the same schema for a document whose document element is in the XHTML 2 namespace:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;namespace ns=&amp;quot;http://www.w3.org/2002/06/xhtml2/&amp;quot; uri=&amp;quot;xhtml2/xhtml2.rnc&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another option is to automatically load a schema based on the document element, as opposed to the namespace. I wouldn&amp;rsquo;t want to do this for an &lt;code&gt;html&lt;/code&gt; element, because it might be an XHTML 1 or an XHTML 2 document. It&amp;rsquo;s handy for DITA documents, though, which don&amp;rsquo;t have specific namespaces. The following loads the appropriate schema for a DITA reference topic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;documentElement localName=&amp;quot;reference&amp;quot; uri=&amp;quot;/usr/local/DITA-OT1.4.1/rnc/reference.rnc&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(To create RELAX NG compact schemas for DITA, use &lt;a href=&#34;http://www.thaiopensource.com/relaxng/trang.html&#34;&gt;trang&lt;/a&gt; and then set the &lt;code&gt;start&lt;/code&gt; pattern to equal reference.element in reference.rnc, task.element in task.rnc, and so forth.)&lt;/p&gt;
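&lt;p&gt;Putting the pieces together, a minimal central schemas.xml is just rules like the ones above wrapped in a &lt;code&gt;locatingRules&lt;/code&gt; document element in nxml&amp;rsquo;s locating rules namespace:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;locatingRules xmlns=&amp;quot;http://thaiopensource.com/ns/locating-rules/1.0&amp;quot;&amp;gt;
  &amp;lt;typeId id=&amp;quot;XHTML2&amp;quot; uri=&amp;quot;xhtml2/xhtml2.rnc&amp;quot;/&amp;gt;
  &amp;lt;namespace ns=&amp;quot;http://www.w3.org/2002/06/xhtml2/&amp;quot; uri=&amp;quot;xhtml2/xhtml2.rnc&amp;quot;/&amp;gt;
  &amp;lt;documentElement localName=&amp;quot;reference&amp;quot; uri=&amp;quot;/usr/local/DITA-OT1.4.1/rnc/reference.rnc&amp;quot;/&amp;gt;
&amp;lt;/locatingRules&amp;gt;
&lt;/code&gt;&lt;/pre&gt;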
&lt;p&gt;Sometimes I wonder if I&amp;rsquo;ll ever do large-scale editing with anything but Emacs. Then, I find yet another way to make Emacs even more convenient to use, and I know that making such a switch would be an even bigger, more difficult jump.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://norman.walsh.name/&#34; title=&#34;http://norman.walsh.name/&#34;&gt;Norman Walsh&lt;/a&gt; on &lt;a href=&#34;#comment-1747&#34;&gt;March 27, 2008 10:30 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I cooked up a Perl script (that I suppose I&amp;rsquo;m now committing to publish) that reads the same locating rules document so that I can say &amp;ldquo;xjparse foo.xml&amp;rdquo; and have it find the right schema automatically.&lt;/p&gt;
&lt;p&gt;(The name xjparse is an historical accident, but it&amp;rsquo;s what I&amp;rsquo;m used to typing).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1748&#34;&gt;March 28, 2008 3:13 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Why perl?&lt;/p&gt;
&lt;p&gt;Just kidding. Looking forward to seeing it&amp;hellip;&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>How much is a frequent flyer mile worth?</title>
      <link>https://www.bobdc.com/blog/how-much-is-a-frequent-flyer-m/</link>
      <pubDate>Tue, 18 Mar 2008 09:05:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/how-much-is-a-frequent-flyer-m/</guid>
      
      
      <description><div>To US Airways, eight tenths of a cent.</div><div>&lt;p&gt;You fly on some airline, you register with their frequent flyer program (although each airline comes up with their own goofy name for the program, such as &amp;ldquo;Dividend Reward Value Sky Miles Plus&amp;rdquo;), you earn miles, and when you earn enough, you cash them in on a free flight. So they&amp;rsquo;re worth something to you, but what? If you cash in 25,000 miles on a flight that would have set you back $500, that doesn&amp;rsquo;t mean that each mile was worth 2 cents, although that is a &lt;a href=&#34;http://answers.yahoo.com/question/index?qid=20070527030928AASOvpO&#34;&gt;popular figure&lt;/a&gt; for estimates of a frequent flyer mile&amp;rsquo;s average value. Getting a seat on that plane is much more difficult using miles than just forking over the cash, so they&amp;rsquo;re really not equivalent.&lt;/p&gt;
&lt;p&gt;My wife and I recently had an opportunity to learn that it&amp;rsquo;s worth less than half that to US Airways. While waiting for an overbooked flight home, she took them up on their offer of a voucher for a round-trip ticket in the continental US if she waited a few hours for another flight. When I tried to use this voucher for a family vacation, I found that it had the same restrictions that using frequent flyer miles does: each flight has a limited number of seats that the airline gives to people who aren&amp;rsquo;t paying cash, and those seats go very quickly—especially when your local airport is not very big, and neither are its planes—so the voucher is as difficult to use as the miles.&lt;/p&gt;
&lt;blockquote id=&#34;id202401&#34; class=&#34;pullquote&#34;&gt;It becomes an eighth grade math problem.&lt;/blockquote&gt;
&lt;p&gt;The US Airways woman on the phone (another annoying part about using US Airways miles is that you have to book over the phone) offered me another option: the voucher could be used for a $200 credit on any flight that wasn&amp;rsquo;t completely full.&lt;/p&gt;
&lt;p&gt;Now it becomes an eighth-grade math problem. If using this voucher for a round-trip flight gets you the same choice of seats and flights as 25,000 frequent flyer miles, but US Airways will also give you $200 credit toward any flight for the same voucher, how much is a frequent flyer mile really worth? Eight tenths of a cent.&lt;/p&gt;
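&lt;p&gt;Worked out explicitly (the $200 and 25,000-mile figures are the ones quoted above; the snippet is just the arithmetic):&lt;/p&gt;

```python
# The same voucher buys either a 25,000-mile round trip
# or a $200 credit, so each mile prices out at:
voucher_value_dollars = 200
miles_per_round_trip = 25_000

cents_per_mile = voucher_value_dollars * 100 / miles_per_round_trip
print(cents_per_mile)  # prints 0.8
```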
&lt;p&gt;I gave up trying to book a family trip to the Grand Canyon with the voucher and my collection of frequent flyer miles, which on paper were enough to do the trick. It was just too difficult. And, shortly after, when I was shopping for a new Mastercard, I skipped all the ones that give you frequent flyer miles with purchases, now that I knew how little they were worth.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1723&#34;&gt;March 18, 2008 11:43 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think the airlines oughta ditch the miles and just go back to handing out S &amp;amp; H Green Stamps (yes, they still exist, in digital form even). At least you could redeem those for some durable object you could actually use. As it is, miles are the equivalent of &amp;ldquo;the first hit is free&amp;rdquo;: people fly airplanes so they can &amp;hellip; fly airplanes.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s more, the existence of miles creates knock-on effects even for people who don&amp;rsquo;t care about them. Employees want to capture the miles gained on business trips for themselves, though on what possible ethical basis, I can&amp;rsquo;t see &amp;ndash; employees aren&amp;rsquo;t normally allowed to accept gifts of this sort. So billion-dollar companies no longer pay for business trips, forcing their employees to loan them money, typically borrowed from a bank, for 4-6 weeks while they go through their (usually paper-based) reimbursement process. I need that money more than my employer does. The whole thing is a cynical exercise in &amp;ldquo;sod the public&amp;rdquo;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/tourism">tourism</category>
      
    </item>
    
    <item>
      <title>Creating epub files</title>
      <link>https://www.bobdc.com/blog/creating-epub-files/</link>
      <pubDate>Thu, 13 Mar 2008 20:44:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/creating-epub-files/</guid>
      
      
      <description><div>With nothing but free tools.</div><div>&lt;p&gt;I&amp;rsquo;ve discussed the &lt;a href=&#34;http://www.idpf.org/&#34;&gt;epub&lt;/a&gt; eBook format here before when describing how I &lt;a href=&#34;https://www.bobdc.com/blog/free-epub-childrens-picture-bo&#34;&gt;created some epub children&amp;rsquo;s books&lt;/a&gt; from Project Gutenberg files for the OLPC XO. In another discussion of the format, I once saw someone complain that Adobe&amp;rsquo;s strong support of it was based on the fact that their tools are the only ones that can create epub files, but this is only true if we add a few qualifications: their tools are the only &lt;em&gt;commercial&lt;/em&gt; ones that can create epub files &lt;em&gt;for now&lt;/em&gt;.&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;It&#39;s easy to create epub eBooks with free software if you don&#39;t need commercial tools to create some XML (mostly XHTML) files and zip them up together. I certainly don&#39;t.&lt;/blockquote&gt;
&lt;p&gt;For all I know, other commercial tools can create them by now, but more importantly, you can easily create epub eBooks with free software if you don&amp;rsquo;t need commercial tools to create some XML files (mostly XHTML) and zip them up together. I certainly don&amp;rsquo;t. The &lt;a href=&#34;http://www.jedisaber.com/eBooks/tutorial.asp&#34;&gt;epub eBooks Tutorial&lt;/a&gt; is a good place to start, and don&amp;rsquo;t miss the &lt;a href=&#34;http://www.hxa7241.org/articles/content/epub-guide_hxa7241_2007.html&#34;&gt;Epub Format Construction Guide&lt;/a&gt;, especially on the tricky zipping issues described below. The latter also points to the &lt;a href=&#34;http://www.info-zip.org/Zip.html&#34;&gt;Info-zip&lt;/a&gt; free Windows zip utility and wisely skips the Tutorial&amp;rsquo;s recommendation to store certain files in an OEBPS directory in the zip file, a practice that is just a convention developed for a related format that predates epub.&lt;/p&gt;
&lt;p&gt;The trickiest part of creating an epub file is the one-line mimetype file, which looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;application/epub+zip
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This mimetype file must be first in the zip file, uncompressed, with no space or any other characters after that final &amp;ldquo;p&amp;rdquo;. The Epub Format Construction Guide shows the following command as an example of creating an epub file called EpubGuide-hxa7241.epub with mimetype first,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;zip -Xr9D EpubGuide-hxa7241.epub mimetype *
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;but I had better luck creating such a file in two stages, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;zip -q0X  EpubGuide-hxa7241.epub mimetype
zip -qXr9D  EpubGuide-hxa7241.epub *
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Either way, remember that this is far and away the most difficult part of creating an epub file, and it&amp;rsquo;s not very difficult, especially considering that once you have a mimetype file that works (which you can pull from another epub file) you can use it in all of your epub files with no changes.&lt;/p&gt;
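&lt;p&gt;If you&amp;rsquo;d rather script the packaging than wrestle with zip&amp;rsquo;s flags, the same two-stage approach can be sketched with Python&amp;rsquo;s standard zipfile module (just an illustration; the container.xml content below is a placeholder, not a real container file):&lt;/p&gt;

```python
import zipfile

# Stage 1: the mimetype entry must be the archive's first entry,
# stored with no compression. Stage 2: everything else, compressed.
with zipfile.ZipFile("demo.epub", "w") as z:
    z.writestr("mimetype", "application/epub+zip",
               compress_type=zipfile.ZIP_STORED)
    z.writestr("META-INF/container.xml", "placeholder contents",
               compress_type=zipfile.ZIP_DEFLATED)

# Confirm the two properties that epub readers check for.
with zipfile.ZipFile("demo.epub") as z:
    first = z.infolist()[0]
    print(first.filename, first.compress_type == zipfile.ZIP_STORED)
    # prints: mimetype True
```

&lt;p&gt;Either route produces the same layout; what matters is only the ordering and the lack of compression of the mimetype entry.&lt;/p&gt;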
&lt;p&gt;To automate a little quality checking of your epub file, an open source utility called &lt;a href=&#34;http://code.google.com/p/epubcheck/&#34;&gt;epubcheck&lt;/a&gt; is now available. It checks the XML files inside the epub file for consistency of internal references, for conformance to the relevant RELAX NG schemas, and for problems with the mimetype file described above. I only recently learned that java jar files can be pulled apart like zip files, so the following two commands will list the epubcheck jar file&amp;rsquo;s contents and pull out one of the listed files (the RELAX NG schema for the Open Packaging Format):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;jar -tf epubcheck-0.9.2.jar
jar -xf epubcheck-0.9.2.jar com/adobe/epubcheck/rng/opf.rng
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;XML files are easier to create if you use schemas to guide their creation, and RELAX NG is the best schema language, so it&amp;rsquo;s worth pulling all of the RNG Files out of the epubcheck jar file to use when creating the files you&amp;rsquo;ll put in your epub file.&lt;/p&gt;
&lt;p&gt;Most of those files are straightforward XHTML of the content you&amp;rsquo;ll put in your eBook. I&amp;rsquo;ve created epub files from XHTML that was sitting on my hard disk and from &lt;a href=&#34;http://www.gutenberg.org/wiki/Main_Page&#34;&gt;Project Gutenberg&lt;/a&gt; files, although a little &lt;a href=&#34;http://home.ccil.org/~cowan/XML/tagsoup/&#34;&gt;tagsoup&lt;/a&gt; cleanup of these files is worth it to automate the handling of some otherwise annoying quirks you might come across—Project Gutenberg (X)HTML isn&amp;rsquo;t always very consistent.&lt;/p&gt;
&lt;p&gt;The other files in an epub eBook are a table of contents file (an .ncx file), a list of the files in use, including image files (an .opf package file), and a pointer to the file with the list of files (META-INF/container.xml). The tutorials above go into more detail about these, but if you pull these XML files out of any epub file and look them over, their workings are pretty self-evident.&lt;/p&gt;
&lt;p&gt;So if you&amp;rsquo;re interested in eBooks, get an epub file or two (plenty of classics are available at &lt;a href=&#34;http://www.feedbooks.com/&#34;&gt;feedbooks&lt;/a&gt;), unzip them, review the pieces as you look through the &lt;a href=&#34;http://www.jedisaber.com/eBooks/tutorial.asp&#34;&gt;epub eBooks Tutorial&lt;/a&gt; and &lt;a href=&#34;http://www.hxa7241.org/articles/content/epub-guide_hxa7241_2007.html&#34;&gt;Epub Format Construction Guide&lt;/a&gt;, and then make a few of your own and see how they look on some of the free eBook readers out there such as &lt;a href=&#34;http://www.adobe.com/products/digitaleditions/&#34;&gt;Adobe Digital Editions&lt;/a&gt; and &lt;a href=&#34;http://www.fbreader.org/&#34;&gt;FBReader&lt;/a&gt;. If you&amp;rsquo;re a publisher wondering about how to approach the eBook market, start making epub prototypes of some of your titles. In a future posting I&amp;rsquo;ll write about ways to make use of these prototypes as you lay the groundwork for actually selling them.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1717&#34;&gt;March 14, 2008 10:42 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;How widely accepted is epub, and how much does it deviate from OEBEPS? I&amp;rsquo;m not asking out of randomness; there are Reasons.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1718&#34;&gt;March 14, 2008 1:03 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi John,&lt;/p&gt;
&lt;p&gt;I never looked too closely at OEBPS, but as I understand it, the main difference is that OEBPS books were not zipped up into a single file. There&amp;rsquo;s more at &lt;a href=&#34;https://www.idpf.org/forums/viewtopic.php?t=22&#34;&gt;https://www.idpf.org/forums/viewtopic.php?t=22&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;epub acceptance: most people who follow the eBook market closely seem pretty confident that the major eBook readers (except maybe for Kindle, where they do things completely their own way) will be supporting epub Real Soon Now. In &lt;a href=&#34;http://en.oreilly.com/toc2008/public/schedule/detail/2140&#34;&gt;the talk before mine at the O&amp;rsquo;Reilly TOC conference&lt;/a&gt; Adobe&amp;rsquo;s Bill McCoy made it clear that Adobe considers PDF and epub to be the main electronic delivery platforms of the future. Any platform that can run Adobe Digital Editions can display epub files, and Adobe people at the conference were also saying something to the effect of &amp;ldquo;we can&amp;rsquo;t talk about new platforms that we&amp;rsquo;re porting Adobe Digital Editions to just yet, but isn&amp;rsquo;t it great that Apple has an iPhone SDK out now?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Even while other formats are still around, epub&amp;rsquo;s status as an open, well-documented format, and especially &lt;a href=&#34;http://dearauthor.com/wordpress/2007/11/19/no-kindle-exclusivity-for-harlequin-readers/&#34;&gt;Hachette&amp;rsquo;s attitude&lt;/a&gt; (&amp;ldquo;Every one of our partners (Sony, Amazon, eBooks.com, etc.) will only be receiving the .epub format from us. We will not be doing any special proprietary conversions for anyone, which includes the Kindle. It will be up to each partner to convert to whatever proprietary format can handle the .epub format&amp;hellip;&amp;rdquo;) mean that it&amp;rsquo;s gaining traction as a lingua franca for B2B exchange of ebooks.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Accessibility problems with microformats</title>
      <link>https://www.bobdc.com/blog/accessibility-problems-with-mi/</link>
      <pubDate>Fri, 07 Mar 2008 08:19:37 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/accessibility-problems-with-mi/</guid>
      
      
      <description><div>By guest blogger Sarah Bourne, Chief Technology Strategist for the Commonwealth of Massachusetts.</div><div>&lt;p&gt;&lt;em&gt;In a recent posting here on &lt;a href=&#34;https://www.bobdc.com/blog/the-future-of-rdfa&#34;&gt;The future of RDFa&lt;/a&gt;, I described some of the advantages of RDFa compared with some of the disadvantages of microformats. When &lt;del&gt;Massachusetts Commonwealth&lt;/del&gt; Mass.gov Chief Technology Strategist Sarah Bourne posted &lt;a href=&#34;http://www.snee.com/bobdc.blog/2008/02/the_future_of_rdfa.html#c001572&#34;&gt;a comment&lt;/a&gt; about problems that microformats present for website accessibility, I asked her to elaborate, and she was kind enough to put this together for me.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;If Massachusetts pursues enriching our content, RDFa seems a more likely candidate.&lt;/blockquote&gt;
&lt;p&gt;Part of my job is investigating new technologies that we might want or need to support on the Mass.Gov portal. A colleague brought microformats to my attention a year or so ago. Although I found it alluring—re-using standard markup to provide richer content—there were troubling accessibility issues.&lt;/p&gt;
&lt;p&gt;We are bound by a variety of state and federal laws to ensure content on Mass.Gov is fully usable when people are using assistive technologies (AT) to compensate for a wide range of disabilities. By far the most common AT is JAWS—screen reader software that converts the graphical Windows interface to sequential text, which is then rendered as spoken language, allowing the blind to use the same software as other Windows users.&lt;/p&gt;
&lt;p&gt;I suspect that the problems with microformats lie in the fact that they are being developed by a voluntary group instead of an established standards body. The community structure certainly leads to quicker decisions, but they are not as well vetted with a broader audience. Conflicts may not appear until their decisions have been put into practice.&lt;/p&gt;
&lt;p&gt;There are two areas where microformats run afoul of accessibility.&lt;/p&gt;
&lt;h2 id=&#34;yDZg3NWzTn6pEdiIA5hf_A&#34;&gt;&lt;code&gt;abbr&lt;/code&gt; design pattern&lt;/h2&gt;
&lt;p&gt;The first is the &amp;ldquo;abbr design pattern&amp;rdquo;. In this case, the &lt;code&gt;abbr&lt;/code&gt; tag is used to provide machine-readable date and location information in a standardized format. There has been disagreement about the semantics of abbreviation. Is &amp;ldquo;March 12, 2007 at 5 PM, Central Standard Time&amp;rdquo; really an abbreviation of &amp;ldquo;20070312T1700-06&amp;rdquo;? Is &amp;ldquo;Austin, TX&amp;rdquo; really an abbreviation of &amp;ldquo;30.300474;-97.747247&amp;rdquo;? It really comes down to whether you are talking about people (for whom the answer in practical terms is &amp;ldquo;no&amp;rdquo;) or machines (that would answer affirmatively).&lt;/p&gt;
&lt;p&gt;The problem is that JAWS reads the title of &lt;code&gt;abbr&lt;/code&gt; tags, not the text that is enclosed. This is a very friendly behavior in normal use of &lt;code&gt;abbr&lt;/code&gt;: it vocalizes &amp;ldquo;Central Standard Time&amp;rdquo; instead of &amp;ldquo;CST&amp;rdquo;. But the title of a microformat &lt;code&gt;abbr&lt;/code&gt; is decidedly unfriendly: very few people would recognize &amp;ldquo;two zero zero seven zero three one two capital-tee one seven zero zero hyphen zero six&amp;rdquo; as a particular date and time.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;abbr&lt;/code&gt; design pattern is also used by some to provide translations. Unless someone wants to declare that there is a One True Language, this is not only problematic for people using AT, it is not semantically defensible.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;abbr&lt;/code&gt; design pattern for date:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;abbr class=&amp;quot;dtstart&amp;quot; title=&amp;quot;20070312T1700-06&amp;quot;&amp;gt;
March 12, 2007 at 5 PM, Central Standard Time
&amp;lt;/abbr&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;abbr&lt;/code&gt; design pattern for location:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;abbr class=&amp;quot;geo&amp;quot; title=&amp;quot;30.300474;-97.747247&amp;quot;&amp;gt;
Austin, TX
&amp;lt;/abbr&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For additional information, see &lt;a href=&#34;http://www.webstandards.org/2007/04/27/haccessibility/&#34;&gt;hAccessibility&lt;/a&gt;, by Bruce Lawson and James Craig.&lt;/p&gt;
&lt;h2 id=&#34;V172oi7HTlqedZ8LPAUzrw&#34;&gt;include pattern&lt;/h2&gt;
&lt;p&gt;The second problem area is the include pattern&amp;rsquo;s use of links with empty link text. These links are invisible to most people, but not to screen readers, which are often configured to read the link title, especially if there is no text enclosed in the &lt;code&gt;a&lt;/code&gt; tag. Again, this means AT users are presented with mysterious content that was intended only for machines, while users of graphical browsers are shielded.&lt;/p&gt;
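&lt;p&gt;The include pattern in question is roughly the following: an &lt;code&gt;a&lt;/code&gt; element with a fragment reference but no link text (the ID value here is just an illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;a class=&amp;quot;include&amp;quot; href=&amp;quot;#hcard-name&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;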
&lt;p&gt;For additional information, see &lt;a href=&#34;http://www.isolani.co.uk/blog/access/ConfiguringLinksInScreenReaders&#34;&gt;Configuring links in Screen readers&lt;/a&gt;, by &lt;del&gt;Joe Clark&lt;/del&gt; Mike Davies.&lt;/p&gt;
&lt;h2 id=&#34;L-X6R5qUQrSjHovjcYVwgw&#34;&gt;Therefore&amp;hellip;&lt;/h2&gt;
&lt;p&gt;There has been resistance from the microformats community to addressing these conflicts. This is dismaying, since one of their basic tenets is to give precedence to use &amp;ldquo;in the wild&amp;rdquo;, and this is how AT products actually behave. There was a big hullabaloo about this in May 2007, but there has been no change since then. This leads me to believe that the microformats folks just do not care about accessibility to the extent that I need to.&lt;/p&gt;
&lt;p&gt;If Massachusetts pursues enriching our content, RDFa seems a more likely candidate. We prefer to adopt things that have been created and promulgated by standards bodies: they are more stable, the deliberative process surfaces and resolves problems beforehand, and they are the only reliable basis for interoperability.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://alanhogan.com&#34; title=&#34;http://alanhogan.com&#34;&gt;Alan Hogan&lt;/a&gt; on &lt;a href=&#34;#comment-1698&#34;&gt;March 8, 2008 8:42 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I had no idea about JAWS&amp;rsquo; &lt;abbr&gt; policy. Thanks for the heads-up.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.isolani.co.uk/blog/&#34; title=&#34;http://www.isolani.co.uk/blog/&#34;&gt;Isofarro&lt;/a&gt; on &lt;a href=&#34;#comment-1701&#34;&gt;March 9, 2008 8:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The blog post, &lt;a href=&#34;http://www.isolani.co.uk/blog/access/ConfiguringLinksInScreenReaders&#34;&gt;Configuring links in screen readers&lt;/a&gt;, wasn&amp;rsquo;t written by Joe Clark; it was written by me on my blog. See the &lt;a href=&#34;http://www.isolani.co.uk/&#34;&gt;homepage of the site&lt;/a&gt;, for example. More importantly, the argument against the anchor-based include pattern is the screenreader testing I did, the results of which I published on Yahoo!&amp;rsquo;s YUI blog as &lt;a href=&#34;http://yuiblog.com/blog/2008/01/23/empty-links/&#34;&gt;Empty links and screenreaders&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.isolani.co.uk/blog/&#34; title=&#34;http://www.isolani.co.uk/blog/&#34;&gt;Isofarro&lt;/a&gt; on &lt;a href=&#34;#comment-1704&#34;&gt;March 9, 2008 2:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;This leads me to believe that the microformats folks just do not care about accessibility to the extent that I need to.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The work I did testing how screen readers deal with empty links was as a result of microformat folks approaching me and asking for assistance.&lt;/p&gt;
&lt;p&gt;My published findings made their way into the include pattern a month ago. The criticisms in this blog post about the include pattern are thus out of date.&lt;/p&gt;
&lt;p&gt;From that alone, it&amp;rsquo;s clear that the microformats community are aware of, and do care about, the accessibility issues.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://allinthehead.com/&#34; title=&#34;http://allinthehead.com/&#34;&gt;Drew McLellan&lt;/a&gt; on &lt;a href=&#34;#comment-1705&#34;&gt;March 9, 2008 2:29 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The issues highlighted here aren&amp;rsquo;t issues with microformats generally. If you have reservations about using the abbr or include patterns, nothing is compelling you to use them. The use of microformats can be happily embraced without the need to use either pattern.&lt;/p&gt;
&lt;p&gt;You say that there have been no changes since May 2007, yet you link to some recent research into the matter by members of the community. Work continues in this area, but as there are no simple answers (as with many issues surrounding accessibility and the truly awful user agents in that space), there are no quick fixes.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ben-ward.co.uk&#34; title=&#34;http://ben-ward.co.uk&#34;&gt;Ben Ward&lt;/a&gt; on &lt;a href=&#34;#comment-1706&#34;&gt;March 9, 2008 3:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A few things:&lt;/p&gt;
&lt;p&gt;Your points on the include-pattern are incorrect, or at best out of date. At no time has the include pattern advocated an empty href attribute; it is always a local document reference. Furthermore, since February 12th 2008 the include-pattern has explicitly &lt;em&gt;forbidden&lt;/em&gt; the use of hyperlinks without inner-text. This followed the excellent research into the behaviour of empty links in assistive technology that Mike Davies led (not Joe Clark).&lt;/p&gt;
&lt;p&gt;Further, whilst there are some truly outrageous misuses of the ABBR-pattern in the wild, you&amp;rsquo;ve drawn no distinction between techniques which are actually advocated by the microformats specifications, and techniques which are either brainstorms, or separate from microformats.org documentation. The use of ABBR in GEO that you cite, for example, is not part of the specification. It&amp;rsquo;s located on the &lt;a href=&#34;http://microformats.org/wiki/geo-brainstorming&#34;&gt;geo-brainstorming&lt;/a&gt; page, not as a recommended part of the spec.&lt;/p&gt;
&lt;p&gt;I hope I provide some indication that in fact the microformats community does care about fixing accessibility issues. The fact is that it&amp;rsquo;s a community of volunteers, and getting research done to support changes like this isn&amp;rsquo;t trivial. One of my priorities this year is to resolve the open accessibility issues, but it takes time.&lt;/p&gt;
&lt;p&gt;To see a piece like this come out with such criticisms is not in itself a problem (we have open issues, after all). However, to see it draw conclusions and state ‘fact’ based on out-of-date or outright incorrect information, and fail to link the issues raised back to the microformats documentation is extremely frustrating.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1707&#34;&gt;March 9, 2008 3:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Ben,&lt;/p&gt;
&lt;p&gt;I must take some blame for anything in Sarah&amp;rsquo;s piece being out of date; she sent it to me on February 25, so the revision to the include pattern documentation was less than two weeks old when she wrote that.&lt;/p&gt;
&lt;p&gt;Also, wouldn&amp;rsquo;t the following use of abbr have the same problem when a JAWS reader reads the title value as the example she describes above?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;abbr class=&amp;quot;dtstart&amp;quot; title=&amp;quot;2007-10-05&amp;quot;&amp;gt;October 5&amp;lt;/abbr&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is from the hcalendar specification at &lt;a href=&#34;http://microformats.org/wiki/hcalendar&#34;&gt;http://microformats.org/wiki/hcalendar&lt;/a&gt;, not a brainstorming or in-the-wild page.&lt;/p&gt;
&lt;p&gt;By Sarah Bourne on &lt;a href=&#34;#comment-1714&#34;&gt;March 11, 2008 12:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I am delighted to hear that the microformats community is working on accessibility issues. I have been trying to keep up on developments in this area, but had found nothing to assure me that any progress was being made. In particular, I&amp;rsquo;ve been watching the microformats.org &lt;a href=&#34;http://microformats.org/wiki/accessibility&#34;&gt;Accessibility&lt;/a&gt; and &lt;a href=&#34;http://microformats.org/wiki/accessibility-issues&#34;&gt;Accessibility Issues&lt;/a&gt; pages for updates. Alas, those pages do not yet show the updated information on the include pattern. (Not complaining, just making an observation!) I love the idea of developing new things with a group of bright and motivated people, like the microformats folks, but it&amp;rsquo;s hard for those of us on the outside to know who&amp;rsquo;s who, and where we should be looking to get authoritative information. And although some of us are pretty smart people, there are people who may not grasp the nuances they should.&lt;/p&gt;
&lt;p&gt;Besides the legal imperative I mentioned, government sites have significant (ahem) &amp;ldquo;public relations&amp;rdquo; vulnerabilities. For instance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A local paper published an article claiming a site was not accessible; the basis was a claim from one person that they felt the ALT text wasn&amp;rsquo;t good enough on two images &amp;hellip; on an 8,000 page site.&lt;/li&gt;
&lt;li&gt;We stopped using any client-side scripting on Mass.Gov, because we were spending too much time explaining how you really do accessibility testing to people telling us we had &amp;ldquo;failed Bobby&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Search the Internet for &amp;ldquo;ODF&amp;rdquo;, &amp;ldquo;accessibility&amp;rdquo; and &amp;ldquo;Massachusetts&amp;rdquo; in 2005&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Like Caesar&amp;rsquo;s wife, any technology we use must be above suspicion. This is the context in which &amp;ldquo;care about accessibility to the extent that I need to&amp;rdquo; should be taken. I need to care a lot.&lt;/p&gt;
&lt;p&gt;@Isofarro: Please accept my apologies for the incorrect attribution. I rely heavily on folks like you who actually test things and share them with the rest of us; I am embarrassed that I screwed up giving you the credit you deserve. I&amp;rsquo;ve asked Bob to correct that for me.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>An Apple eBook reader?</title>
      <link>https://www.bobdc.com/blog/an-apple-ebook-reader/</link>
      <pubDate>Wed, 05 Mar 2008 08:00:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/an-apple-ebook-reader/</guid>
      
      
      <description><div>John Markoff of the New York Times analyzes the clues.</div><div>&lt;p&gt;John Markoff has been one of the most respected tech journalists for a long, long time, so his &lt;a href=&#34;http://bits.blogs.nytimes.com/2008/03/03/reading-steve-jobs/index.html&#34;&gt;Reading Steve Jobs&lt;/a&gt; piece this week on potential clues that Apple is working on an eBook reader is worth a read for anyone interested in the eBook market. I don&amp;rsquo;t have anything to add to what he says, especially when he tells stories like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At Macworld, when I asked Mr. Jobs about the idea of an iPod Touch in a larger &amp;ldquo;Safari Pad&amp;rdquo; format, he snapped at me, &amp;ldquo;I can&amp;rsquo;t talk about unannounced products.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>DITA Topic Specialization</title>
      <link>https://www.bobdc.com/blog/dita-topic-specialization/</link>
      <pubDate>Tue, 04 Mar 2008 09:03:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/dita-topic-specialization/</guid>
      
      
      <description><div>Creating specialized topics for your content: an IBM developerWorks article.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.ibm.com/developerworks/edu/x-dw-x-ditaspecial.html&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/dw-home2.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;developerWorks logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Many great resources are available to explain the mechanics and syntax of specializing the standard DITA DTDs for your content—for example, Michael Priestley&amp;rsquo;s &lt;a href=&#34;http://www.ibm.com/developerworks/xml/library/x-dita2/&#34;&gt;Specializing topic types in DITA&lt;/a&gt; and Eliot Kimber&amp;rsquo;s &lt;a href=&#34;http://www.xiruss.org/tutorials/dita-specialization/&#34;&gt;DITA specialization&lt;/a&gt;. However, I didn&amp;rsquo;t see any that walk readers through the process of reviewing their existing content, evaluating its fit with the various DITA topic types, and then designing and building a DITA specialization around the needs and structure of their content, so I wrote the IBM developerWorks tutorial &lt;a href=&#34;https://web.archive.org/web/20111023155148/http://www.ibm.com/developerworks/xml/tutorials/x-ditaspecial/&#34;&gt;DITA topic specialization: Analyze your content and build a specialized DTD&lt;/a&gt;.&lt;/p&gt;
&lt;!-- that wayback machine URL doesn&#39;t show the whole thing, but the whole thing seems to be gone as of 2025-01-18. --&gt;
&lt;p&gt;That&amp;rsquo;s not exactly my original subtitle, but they just love that &lt;a href=&#34;http://en.wikipedia.org/wiki/Second-person_narrative&#34;&gt;second person voice&lt;/a&gt; at developerWorks, especially in the &lt;a href=&#34;http://en.wikipedia.org/wiki/Imperative_mood&#34;&gt;imperative&lt;/a&gt;. The article now has a lot more of that than I originally put there; maybe I should reread &lt;a href=&#34;http://www.amazon.com/Bright-Lights-Big-City-Mcinerney/dp/0394726413&#34;&gt;Bright Lights Big City&lt;/a&gt; before I write another tutorial for developerWorks.&lt;/p&gt;
&lt;p&gt;Nah, I don&amp;rsquo;t think so.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By Pas B on &lt;a href=&#34;#comment-1708&#34;&gt;March 9, 2008 4:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This tutorial raises the whole problem of IBM &amp;ldquo;tutorials&amp;rdquo;, for me. Since you obviously have some relationship with the people in control of this model of article presentation, I&amp;rsquo;ll briefly mention my concerns.&lt;/p&gt;
&lt;p&gt;The tutorials require an IBM account to access. That&amp;rsquo;s IBM&amp;rsquo;s choice, fine, but in as much as many of the tutorials seem to vary insignificantly from other IBM article content that is not account-restricted, what&amp;rsquo;s the point? For the reader, it&amp;rsquo;s just extra hassle (including an extra set of credentials to remember) to sign on.&lt;/p&gt;
&lt;p&gt;Worse, the authentication process is frustratingly involved. Accessing your article above, I first had to authenticate via a page form. Thereupon, I&amp;rsquo;m asked for the 16 bazillionth time to verify my personal information &amp;ndash; an extra scroll down and click-through, not to mention that it&amp;rsquo;s an unprompted echoing of my personal information over the wire. After this authentication, I&amp;rsquo;m prompted for ID and password TWICE MORE, each time my request redirects from one IBM server to another.&lt;/p&gt;
&lt;p&gt;I seem to recall having dumped a comment on this process, at least once before, into whatever designated IBM feedback mechanism, but to no effect. If you happen to read this and to agree that the above is cumbersome, could you please pass on this request to the relevant IBM parties, to &amp;ldquo;get a clue&amp;rdquo; about these &amp;ldquo;tutorials&amp;rdquo; and their annoying form of presentation (namely the annoying and cumbersome authentication)? Thanks very much.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1709&#34;&gt;March 9, 2008 5:59 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks, I&amp;rsquo;ve already passed it along to my editor at developerWorks.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
    </item>
    
    <item>
      <title>Batch processing of image files</title>
      <link>https://www.bobdc.com/blog/batch-processing-of-image-file/</link>
      <pubDate>Mon, 03 Mar 2008 08:49:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/batch-processing-of-image-file/</guid>
      
      
      <description><div>For free, with ImageMagick.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.imagemagick.org&#34;&gt;&lt;img src=&#34;http://www.imagemagick.org/image/logo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently had 178 JPEG files that weren&amp;rsquo;t behaving: Firefox and IE couldn&amp;rsquo;t display them. &lt;a href=&#34;http://www.gimp.org/&#34;&gt;GIMP&lt;/a&gt; could read them and display them, and a GIMP Save As JPEG (after making no changes) created files that displayed properly in Firefox and IE, but I didn&amp;rsquo;t want to do that 178 times.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.imagemagick.org&#34;&gt;ImageMagick&lt;/a&gt; to the rescue. This free program, available for Windows, Mac, and Linux, comes with &lt;a href=&#34;http://www.imagemagick.org/script/command-line-tools.php&#34;&gt;several command line utilities&lt;/a&gt;, but the &lt;code&gt;convert&lt;/code&gt; one is all I&amp;rsquo;ve ever needed. It has dozens and dozens of &lt;a href=&#34;http://www.imagemagick.org/script/command-line-options.php&#34;&gt;command-line options&lt;/a&gt;, but I didn&amp;rsquo;t even need any to fix these JPEGs. I just converted each to a BMP and then back, like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;convert myfile.jpg myfile.bmp
convert myfile.bmp myfile.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and then I had a JPEG file that displayed fine in the web browsers. To do this 178 times in Windows, I created a batch file that looked like this&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;convert %1.jpg %1.bmp
convert %1.bmp %1.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and then used the output of &lt;code&gt;dir /b *.jpg&lt;/code&gt; to create a batch file that called the two-line batch file for each image file.&lt;/p&gt;
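&lt;p&gt;If you have a Unix-style shell handy (Cygwin on Windows, for example), the same round trip can be sketched as a single loop instead of a generated batch file; this sketch assumes ImageMagick&amp;rsquo;s &lt;code&gt;convert&lt;/code&gt; is on your path:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# round-trip each JPEG through BMP and back, then remove the BMP
for f in *.jpg; do
  base=&amp;quot;${f%.jpg}&amp;quot;          # file name without the .jpg extension
  convert &amp;quot;$f&amp;quot; &amp;quot;$base.bmp&amp;quot;   # JPEG to BMP
  convert &amp;quot;$base.bmp&amp;quot; &amp;quot;$f&amp;quot;   # BMP back to a clean JPEG
  rm &amp;quot;$base.bmp&amp;quot;             # discard the intermediate file
done
&lt;/code&gt;&lt;/pre&gt;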
&lt;p&gt;If you need to process batches of images, it&amp;rsquo;s really worth looking through the command line options it offers, and of course the other ImageMagick utilities as well.&lt;/p&gt;
&lt;p&gt;Since writing all that, I&amp;rsquo;ve found a possibly great new feature of ImageMagick: the ability to &lt;a href=&#34;http://www.imagemagick.org/pipermail/magick-announce/2007-July/000036.html&#34;&gt;add XMP metadata&lt;/a&gt; to image files. (XMP is a profile of RDF from Adobe that has a lot of potential; I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/using-or-not-using-adobes-xmp&#34;&gt;complained here before&lt;/a&gt; about the lack of free command-line tools for adding and extracting XMP metadata from binary files.) Based on my tests and &lt;a href=&#34;http://www.wizards-toolkit.org/discourse-server/viewtopic.php?f=1&amp;amp;t=8712&#34;&gt;an online problem description&lt;/a&gt;, I don&amp;rsquo;t think this feature works properly yet. If anyone knows how to get ImageMagick or any other free command-line tool to add arbitrary metadata to image and PDF files (and of course, to extract it) please let me know.&lt;/p&gt;
&lt;h2 id=&#34;8-comments&#34;&gt;8 Comments&lt;/h2&gt;
&lt;p&gt;By perusio on &lt;a href=&#34;#comment-1687&#34;&gt;March 3, 2008 9:51 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hmm, this seems to fit the bill (PDF+image files)&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.sno.phy.queensu.ca/~phil/exiftool/&#34;&gt;http://www.sno.phy.queensu.ca/~phil/exiftool/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s in Perl.&lt;/p&gt;
&lt;p&gt;HTH&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1688&#34;&gt;March 3, 2008 10:10 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks, Perusio! This looks great! At first &lt;a href=&#34;http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/XMP.html&#34;&gt;http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/XMP.html&lt;/a&gt; made me think that it could only handle predefined XMP fields, but &lt;a href=&#34;http://www.sno.phy.queensu.ca/~phil/exiftool/config.html&#34;&gt;http://www.sno.phy.queensu.ca/~phil/exiftool/config.html&lt;/a&gt; shows how to define your own metadata fields and store them as XMP. (I&amp;rsquo;m a big believer in the ability to store arbitrary metadata.) I will certainly be playing with this and reporting back on it.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1689&#34;&gt;March 3, 2008 10:40 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Dude, that would &lt;em&gt;really really&lt;/em&gt; be a case for getting Cygwin, if only so you can do a shell for-loop. (I, of course, download Cygwin first thing onto any Windows computer I have to deal with.)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1690&#34;&gt;March 3, 2008 11:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi John,&lt;/p&gt;
&lt;p&gt;When I set up a Windows machine, I add Firefox, Emacs, and Cygwin before I even change the desktop wallpaper, but I never looked too closely at Cygwin&amp;rsquo;s scripting. Looks like I should.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.dpawson.co.uk&#34; title=&#34;http://www.dpawson.co.uk&#34;&gt;Dave Pawson&lt;/a&gt; on &lt;a href=&#34;#comment-1691&#34;&gt;March 3, 2008 11:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; for f in *.CR2
      do
       echo Processing $f
       exiftool -exif:OwnerName=&amp;quot;Dave Pawson&amp;quot; -exif:Copyright=&amp;quot;Dave Pawson 2008&amp;quot; $f
       echo Backup  $f&#39;_&#39;original removed
       rm $f&#39;_&#39;original
      done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;exiftool would repay some investigation.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.albinblaschka.info&#34; title=&#34;http://www.albinblaschka.info&#34;&gt;Albin&lt;/a&gt; on &lt;a href=&#34;#comment-1692&#34;&gt;March 3, 2008 12:49 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hello,&lt;/p&gt;
&lt;p&gt;in the ImageMagick package there is the command mogrify, using the same syntax as convert, but if you use wildcards in the filename, it will go through all files in the dir&amp;hellip; so, for your case&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mogrify -format bmp *.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and then&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mogrify -format jpg *.bmp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;would have done the same as you showed&amp;hellip;&lt;/p&gt;
&lt;p&gt;Albin&lt;/p&gt;
&lt;p&gt;By Tom Passin on &lt;a href=&#34;#comment-1694&#34;&gt;March 3, 2008 11:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oh, c&amp;rsquo;mon, you can do it with a Windows cmd file, too (this may wrap in the textarea) -&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for %%v in (*.CR2) do echo Processing %%v &amp;amp; exiftool -exif:OwnerName=&amp;quot;Dave Pawson&amp;quot; -exif:Copyright=&amp;quot;Dave Pawson 2008&amp;quot; %%v &amp;amp; del %%v_original
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Not that I&amp;rsquo;m agin *ix. I&amp;rsquo;ve been running a lot of them in virtual machines lately.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1695&#34;&gt;March 4, 2008 10:25 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Dave and Tom! (All note that Tom&amp;rsquo;s command should be written out as one line.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>An eBook with free updates, or a bound version from a major publisher?</title>
      <link>https://www.bobdc.com/blog/an-ebook-with-free-updates-or/</link>
      <pubDate>Thu, 28 Feb 2008 09:02:05 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/an-ebook-with-free-updates-or/</guid>
      
      
      <description><div>Ken Holman discusses his successful eight-year experiment with eBooks.</div><div>&lt;p&gt;&lt;em&gt;I was pleasantly surprised to learn recently that almost eight years after buying a PDF eBook version of XML pioneer &lt;a href=&#34;http://www.cranesoftwrights.com/bio/gkholman.htm&#34;&gt;Ken Holman&amp;rsquo;s&lt;/a&gt; book &lt;a href=&#34;http://www.CraneSoftwrights.com/links/books-bdc.htm&#34;&gt;Practical Transformation Using XSLT and XPath&lt;/a&gt;, I am now entitled to a free upgrade to the thirteenth edition, which covers XSLT 2.0 and XPath 2.0. While it&amp;rsquo;s not &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1930110111/bobducharmeA/&#34;&gt;the first book I&amp;rsquo;d recommend&lt;/a&gt; to an XSLT beginner (keep in mind that I&amp;rsquo;m biased), it&amp;rsquo;s an excellent reference work.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ken also published this and his &lt;a href=&#34;http://www.amazon.com/Definitive-XSL-FO-Charles-Goldfarb-XML/dp/0131403745&#34;&gt;Definitive XSL-FO&lt;/a&gt; as bound, hardcopy books in the same Prentice Hall Series as my &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0130826766/bobducharmeA/&#34;&gt;XML: The Annotated Specification&lt;/a&gt;. I asked him how well his decision to sell PDFs of his XSLT and XSL-FO books online directly to his customers had worked out, and how it compared with his experience with the Prentice Hall version. The following is his response.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.CraneSoftwrights.com/links/books-bdc.htm&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/Crane-Logo-Only.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Crane Softwrights logo&#34; width=&#34;180px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The decision to go to paper was secondary. The primary decision was to go electronic so that we could offer the perpetual free updates to all future editions. Specifications don&amp;rsquo;t move that quickly, but how people use specifications does evolve and mature. The first 12 editions of the XSLT book only covered XSLT 1.0, but the editions reflected changing practices. Customers of the first edition in March 1999 have just received their 12th free update, to the 13th edition, which now includes all of XSLT 2.0. We just published the 7th edition of the XSL-FO book, which now includes all of XSL-FO 1.1.&lt;/p&gt;
&lt;p&gt;As a user of specifications I figured any paper book I would buy would be out of date before long. I didn&amp;rsquo;t want my book to have that experience with our customers. The perpetual purchase is a feature of the electronic book that cannot be offered with the paper publication. Some peers have criticized our choice to make the updates free as a lost opportunity for revenue. Again, putting myself in my customer&amp;rsquo;s shoes, I didn&amp;rsquo;t want to be charged for updates since there isn&amp;rsquo;t any overhead in sending out an email notification of an updated edition. Our customers have appreciated the &amp;ldquo;live&amp;rdquo; feeling of always having an up-to-date publication for the cost of the original purchase. The more editions they get for free, the more cost effective their original purchase becomes compared to a paper book purchase.&lt;/p&gt;
&lt;p&gt;The decision to go to paper was a favour to the series editor Charles Goldfarb who wanted to include the XSLT book in his &amp;ldquo;Definitive&amp;rdquo; series. Prentice Hall only has the print rights. And since the same document model was used for the XSL-FO content, it was a quick project to bring the XSL-FO book to paper in the same series. All other electronic rights and other uses were retained by Crane, and we are in the planning stages for new product offerings based on the same content, probably being announced late Q2&#39;2008.&lt;/p&gt;
&lt;p&gt;Another big benefit of the electronic PDF sale is the flexibility in licensing. We have worldwide staff licensees. These customers pay a one-time fee for the original purchase of the book and they mount the PDF files on a private intranet server for staff use only. We do not need to be informed of the number of copies that get used internally. These customers also get the perpetual updates to all future editions; they just retrieve and mount the revisions up on their servers. For a big example, all US Government employees of all departments of all offices around the world have perpetual access to the one staff purchase of each of the two XSL titles we have. We don&amp;rsquo;t know how many hundreds or thousands of copies may be in use.&lt;/p&gt;
&lt;p&gt;One customer drawback to the electronic format is forgetting to inform us of email address changes. Announcing this last free edition revealed hundreds of dead email addresses, so there are customers out there who are due their most recent copy of the book but we can&amp;rsquo;t tell them about it. Hopefully they&amp;rsquo;ll come back to us in time and request their copy &amp;hellip; we&amp;rsquo;d be glad to keep them up to date.&lt;/p&gt;
&lt;p&gt;From a revenue aspect, we have received many, many times more revenue from PDF book sales than from paper book royalties. And the PDF book sales are continuing while the paper book royalties have tailed off. There isn&amp;rsquo;t much market now for version 1.0 of these technologies, whereas the PDF books now include all of XSLT 2.0 and XSL-FO 1.1.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Simple flowcharts in Excel</title>
      <link>https://www.bobdc.com/blog/simple-flowcharts-in-excel/</link>
      <pubDate>Mon, 25 Feb 2008 09:02:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/simple-flowcharts-in-excel/</guid>
      
      
      <description><div>And OpenOffice Calc.</div><div>&lt;p&gt;A co-worker recently told me that she needed to create a flowchart but didn&amp;rsquo;t have Visio. She knew that I had it, but I played dumb. I told her how I&amp;rsquo;d recently learned to make simple flowcharts in PowerPoint, and recommended that she try that, but she needed more of a &lt;a href=&#34;http://en.wikipedia.org/wiki/Swimlane&#34;&gt;swimlane&lt;/a&gt; diagram, which would have been difficult in PowerPoint.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/excelflowcharts.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Excel and Calc icons&#34;/&gt;
&lt;p&gt;I then remembered that she and I had both recently received an Excel spreadsheet with some fairly complex workflow diagrams, one of which had a swimlane-like arrangement. Looking at it, I saw that I could select and drag the boxes and arrows, but I didn&amp;rsquo;t see how to add new ones.&lt;/p&gt;
&lt;p&gt;The answer turned out to be so obvious that OpenOffice&amp;rsquo;s Calc spreadsheet program does it the same way: Select &lt;strong&gt;Toolbars&lt;/strong&gt; from the &lt;strong&gt;View&lt;/strong&gt; menu, then &lt;strong&gt;Drawing&lt;/strong&gt; from the cascade menu. This adds a toolbar at the bottom with far more choices than you need. You can use this toolbar to add labeled boxes and arrows connecting these boxes to a blank spreadsheet or to a spreadsheet that already has numbers and text on it. The first time I tried this, the guy sitting next to me on the airplane—who hadn&amp;rsquo;t said a word to me the whole trip—said &amp;ldquo;I&amp;rsquo;ve never seen flow charts in Excel before!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;On the Excel toolbar, the little square with the letter &amp;ldquo;A&amp;rdquo; and some horizontal lines lets you create resizable labeled boxes on the spreadsheet, and the icon of an arrow pointing to the lower-right lets you create arrows on the diagram. Once you&amp;rsquo;ve created either, selecting and then right-clicking one of these displays a menu that lets you customize its appearance.&lt;/p&gt;
&lt;p&gt;On Calc&amp;rsquo;s toolbar, the plain blue rectangle lets you add a resizable box to your spreadsheet, and double-clicking your new box lets you add a label to it. The Calc drawing toolbar has no icon for arrows, but if you use the one for lines you can then right-click a line and add arrow heads and adjust other properties.&lt;/p&gt;
&lt;p&gt;I thought that Visio was a big, slow, bloated program even before Microsoft bought the company. I guess we need it even less than many people think.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Managing digital rights in the publishing world</title>
      <link>https://www.bobdc.com/blog/managing-digital-rights-in-the/</link>
      <pubDate>Thu, 21 Feb 2008 07:59:21 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/managing-digital-rights-in-the/</guid>
      
      
      <description><div>As opposed to enforcing them.</div><div>&lt;p&gt;When people talk about Digital Rights Management, or DRM, the real subject of their discussion is usually Digital Rights Enforcement. The basic use case seems to be how to prevent a teenage boy from duping &amp;ldquo;Rush Hour II&amp;rdquo; for his friends, or some variation thereof—what you add to the DVD storing the movie, what you add to the players to check on the DRE components of the DVD, and so forth.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;management&lt;/em&gt; of digital rights, as opposed to their enforcement, is a real problem in the publishing industry, but discussion is usually drowned out by the shouting matches about digital rights enforcement. Here&amp;rsquo;s a typical use case: an editor wants to take an article with six pictures from her magazine&amp;rsquo;s print edition and put it online. Two of the pictures come from a cookbook being reviewed by the article, two were shot for the article by a freelancer, and two come from a stock photo house. Which images can the editor use in the online version?&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;Does anyone know of a straightforward standard for a content provider to specify re-use rights for a work to a publishing industry business partner? &lt;/blockquote&gt;
&lt;p&gt;Unlike the mythical teenage boy, a publisher has an ongoing relationship to maintain with the content suppliers and doesn&amp;rsquo;t want to jeopardize any of these relationships by avoiding extra payments. The difficulty is simply digging up the terms of re-use so that the editor knows which images are available to put with the article on the website. When looking this up takes too much time, it affects the publication schedule.&lt;/p&gt;
&lt;p&gt;Whether you build or buy a system to track this, there are three basic approaches, but first, a note on software: there are vendors who will tell you &amp;ldquo;our fabulous product takes care of all that! Simply check in the pictures or other content and enter the re-use terms, and then you can look it up any time!&amp;rdquo; I&amp;rsquo;m not interested in this unless the software can read and write the re-use terms in a standard format whose specs are independent of the software. (As we&amp;rsquo;ll see below, this is easier said than done.)&lt;/p&gt;
&lt;p&gt;The devil is in the details of how you enter the re-use terms. Of the three approaches, the &lt;a href=&#34;http://www.prismstandard.org/&#34;&gt;PRISM&lt;/a&gt; standard for magazine metadata offers variations on each, so I&amp;rsquo;ll refer to that when I need examples, although only the second option below demonstrates a technique that is specific to PRISM.&lt;/p&gt;
&lt;h2 id=&#34;fOfsDN8RT3m37Lj8j440Ug&#34;&gt;Option 1: a slot for a natural language description&lt;/h2&gt;
&lt;p&gt;In this scenario, you write out one or more sentences describing what you can do with the work. This could be stored in a relational database, a rights tracking package that you bought from a vendor, or in some XML. You may have the option of storing this inside the work itself, especially if it&amp;rsquo;s in XML. The PRISM standard uses the Dublin Core &lt;a href=&#34;http://dublincore.org/documents/dces/#rights&#34;&gt;rights&lt;/a&gt; field as the name of this element, which is a fine idea.&lt;/p&gt;
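&lt;p&gt;As a sketch of what that might look like in RDF/XML (the image URI and the rights wording here are invented for illustration, not taken from any real agreement or standard):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;rdf:Description rdf:about=&amp;quot;http://example.com/img/Corfu.jpg&amp;quot;&amp;gt; 
    &amp;lt;dc:rights&amp;gt;One-time use in the print edition only; online 
    re-use requires separate permission from the photographer.&amp;lt;/dc:rights&amp;gt; 
  &amp;lt;/rdf:Description&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;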
&lt;p&gt;The problem here is that you leave it up to the person writing out the sentences to either accurately copy the agreement terms or to paraphrase them properly, and that leaves room for error.&lt;/p&gt;
&lt;h2 id=&#34;sJzrbt8vTb-fcYs8YObwzw&#34;&gt;Option 2: a set of fields to store specific re-use parameters&lt;/h2&gt;
&lt;p&gt;The good news: it&amp;rsquo;s easier to automate the processing of this information when, for example, a system puts an image on a website. The bad news: what fields do you use to store the information you want to track? There are a lot of parameters in these agreements. What standards are out there? How well do they fit with your needs? When a stock photo house supplies images to a magazine, there will be one set of information to track; when that magazine supplies an illustrated article to an aggregator, that relationship will be governed by some of the same pieces of information, but also by some different ones.&lt;/p&gt;
&lt;p&gt;PRISM has defined a few fields such as embargoDate and expirationDate for this. These can be stored inside of a dc:rights element, or anywhere else for that matter. If they are stored in a relational database, it would be useful to indicate somewhere that the &amp;ldquo;embargoDate&amp;rdquo; field means what the PRISM standard says it means and not someone else&amp;rsquo;s slightly different concept of the same term. (I know some people hate namespaces, but they&amp;rsquo;re awfully useful sometimes&amp;hellip;) The PRISM standard will not have everything you need, and I know that arriving at the few re-use rights fields that it does have was the result of a great deal of work.&lt;/p&gt;
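&lt;p&gt;A sketch of those two PRISM fields in RDF/XML might look like the following (the image URI and the dates are invented, and I&amp;rsquo;m assuming the prism: prefix is bound to whichever version of the PRISM basic namespace you use):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;rdf:Description rdf:about=&amp;quot;http://example.com/img/Corfu.jpg&amp;quot;&amp;gt; 
    &amp;lt;!-- no publication before this date --&amp;gt; 
    &amp;lt;prism:embargoDate&amp;gt;2008-03-01&amp;lt;/prism:embargoDate&amp;gt; 
    &amp;lt;!-- rights to use the image end on this date --&amp;gt; 
    &amp;lt;prism:expirationDate&amp;gt;2008-09-01&amp;lt;/prism:expirationDate&amp;gt; 
  &amp;lt;/rdf:Description&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A script could then compare these values with today&amp;rsquo;s date before letting the image go to the website.&lt;/p&gt;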
&lt;h2 id=&#34;LpA2vG5QRK6I5qEteIcAhA&#34;&gt;Option 3: point to an image of the official document&lt;/h2&gt;
&lt;p&gt;This is my favorite in terms of bang for buck, because the lack of data entry involved means less work and less room for error, and there&amp;rsquo;s no issue about which collection of information fields to track. If you already have your images in a Digital Asset Management system, then tracking images of the agreements that govern those images won&amp;rsquo;t put much extra strain on your system. Scan the agreement and give it an identifier. The PRISM Introduction document includes the following example to identify the file with the contract governing the use of the image Corfu.jpg:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;rdf:Description rdf:about=&amp;quot;http://wanderlust.com/2000/08/Corfu.jpg&amp;quot;&amp;gt; 
    &amp;lt;dc:rights rdf:resource=&amp;quot;http://PhillyPhantasyPhotos.com/terms/Contract39283.doc&amp;quot;/&amp;gt; 
  &amp;lt;/rdf:Description&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I consider a Word doc file a bit too mutable for an official record of a legal document, but the ID could just as easily point to a TIFF or JPG file of a scanned contract:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;rdf:Description rdf:about=&amp;quot;http://wanderlust.com/2000/08/Corfu.jpg&amp;quot;&amp;gt; 
    &amp;lt;dc:rights rdf:resource=&amp;quot;http://somepath/in/our/intranet/Contract39283.jpg&amp;quot;/&amp;gt; 
  &amp;lt;/rdf:Description&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The information need not be stored in RDF/XML either—you could put it in a relational database, a rights tracking package, XMP embedded in the image, or whatever you like—but you have to admit, RDF/XML isn&amp;rsquo;t always an ugly mess, and what we see above is pretty straightforward.&lt;/p&gt;
&lt;h2 id=&#34;QDUdISYqRcKhAv3ajYbInw&#34;&gt;Standards&lt;/h2&gt;
&lt;p&gt;As you&amp;rsquo;ve seen, PRISM is one to consider. I think that the OASIS &lt;a href=&#34;http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml&#34;&gt;XACML&lt;/a&gt; standard also looks good (pcmag.com has a &lt;a href=&#34;http://www.pcmag.com/encyclopedia_term/0,2542,t=XACML&amp;amp;i=54987,00.asp&#34;&gt;nice brief summary&lt;/a&gt; of what XACML is about; wouldn&amp;rsquo;t it be great if the home page for each OASIS standard had such a summary?), and an &lt;a href=&#34;http://code.google.com/p/enterprise-java-xacml/&#34;&gt;open source XACML engine&lt;/a&gt; has just appeared on Google Code, but XACML&amp;rsquo;s flexibility may have turned it into something that&amp;rsquo;s a bit too abstract for people in the publishing industry. Although it&amp;rsquo;s been around for at least five years, the existence of free working code could now lay the groundwork for someone to build something specific to the publishing industry&amp;rsquo;s needs.&lt;/p&gt;
&lt;p&gt;At the &lt;a href=&#34;http://www.online-information.co.uk/index.html&#34;&gt;Online Information 2007&lt;/a&gt; conference and trade show in London in December, I first heard about &lt;a href=&#34;http://www.the-acap.org/&#34;&gt;ACAP&lt;/a&gt;. This nascent standard seems more concerned with standardizing content access policies for web crawlers and search engines than with publishing industry B2B relationships, but as with many standards that are related to your interests, an ACAP advocate that I met said &amp;ldquo;you could use it for that too!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Does anyone know of a straightforward standard for a content provider to specify re-use rights for a work (an image, text content, or a combination like my magazine article example) to a publishing industry business partner?&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.griffinbrown.co.uk/&#34; title=&#34;http://www.griffinbrown.co.uk/&#34;&gt;Alex Brown&lt;/a&gt; on &lt;a href=&#34;#comment-1643&#34;&gt;February 21, 2008 8:51 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Take a look at &lt;a href=&#34;http://www.editeur.org/onix_licensing.html&#34;&gt;http://www.editeur.org/onix_licensing.html&lt;/a&gt;. This is enjoying some traction.&lt;/p&gt;
&lt;p&gt;I wouldn&amp;rsquo;t call it straightforward though!&lt;/p&gt;
&lt;p&gt;&amp;ndash; Alex.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1644&#34;&gt;February 21, 2008 10:05 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Alex,&lt;/p&gt;
&lt;p&gt;Thanks, it does look promising, and apparently I&amp;rsquo;ve even &lt;a href=&#34;https://www.bobdc.com/blog/navigating-the-library-metadat&#34;&gt;mentioned it&lt;/a&gt; before.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://rhizomik.net/~roberto&#34; title=&#34;http://rhizomik.net/~roberto&#34;&gt;Roberto García&lt;/a&gt; on &lt;a href=&#34;#comment-1653&#34;&gt;February 23, 2008 5:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There is the &lt;a href=&#34;http://rhizomik.net/ontologies/copyrightonto&#34;&gt;Copyright Ontology&lt;/a&gt; that is intended for Digital Rights Management (no enforcement).&lt;br /&gt;
And a PhD thesis devoted to it: &lt;a href=&#34;http://rhizomik.net/~roberto/thesis&#34;&gt;A Semantic Web Approach to Digital Rights Management&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1654&#34;&gt;February 23, 2008 7:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Roberto! Is this ontology being used by any publishers?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Finding an eBook audience</title>
      <link>https://www.bobdc.com/blog/finding-an-ebook-audience/</link>
      <pubDate>Mon, 18 Feb 2008 11:03:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/finding-an-ebook-audience/</guid>
      
      
      <description><div>Housewives reading bodice-rippers?</div><div>&lt;p&gt;&lt;a href=&#34;http://www.eharlequin.com/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/pinksony.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Pink Harlequin Sony Reader&#34; width=&#34;200px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When discussing eBooks, some people think they&amp;rsquo;re making a clever point by announcing that they&amp;rsquo;d never want to curl up by the fire with an eBook reader. This is just useless, because the idea of eBooks replacing all bound paper books was never up for consideration. I&amp;rsquo;m sure that if I want to lie on the beach in 2018 and read Don Quixote, I&amp;rsquo;ll be reading a paperback. However, a new electronic medium can replace certain uses of bound books, and the trick to finding opportunities is identifying those uses.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve worked with electronic delivery of content since I was at &lt;a href=&#34;http://ria.thomson.com/&#34;&gt;RIA&lt;/a&gt; in New York thirteen years ago, converting SGML to CD-ROMs of primary and secondary tax law for accountants and tax lawyers. Since &lt;a href=&#34;http://en.wikipedia.org/wiki/LexisNexis&#34;&gt;Lexis&lt;/a&gt; went online in 1973, finding an audience that will benefit from a new delivery format has usually meant finding people in a specific job title with certain kinds of work to do. Whether you&amp;rsquo;re a lawyer looking up legal decisions or a maintenance engineer crawling around the engine of a jumbo jet, you often need access to a page or two from a large amount of content, so some sort of electronic delivery is often more convenient than getting to (and for the publisher, updating) a big set of bound books for such information.&lt;/p&gt;
&lt;p&gt;So, people with specific technical needs have been popular target markets for new electronic media. But what could be less technical than housewives reading romance novels? At the recent &lt;a href=&#34;http://en.oreilly.com/toc2008/public/content/home&#34;&gt;O&amp;rsquo;Reilly Tools of Change&lt;/a&gt; conference in New York City, the talk by &lt;a href=&#34;http://en.oreilly.com/toc2008/public/schedule/detail/1075&#34;&gt;Brent Lewis&lt;/a&gt; of Harlequin Enterprises really challenged everyone&amp;rsquo;s assumptions about what could be done in the eBook market.&lt;/p&gt;
&lt;p&gt;I had to leave his talk before it ended to call in to a meeting, but I managed to catch Brent on the exhibit floor the next day. Harlequin started off with nine eBook titles per month in October of 2005, and by September of last year they were publishing their entire frontlist in eBook format as well as paper format. That&amp;rsquo;s a lot of books: over 120 per month. They&amp;rsquo;ve done a lot of slice-and-dice eBooks as well, compiling backlist greatest hits (&lt;a href=&#34;http://ebooks.eharlequin.com/2F96971A-E0AE-46FB-8703-5F64316EAB00/10/126/en/ContentDetails.htm?ID=62AF08E1-340D-4C21-B3F6-94E7AA004FA2&#34;&gt;Out-of-Print Gems&lt;/a&gt;), seasonal compilations (&lt;a href=&#34;http://ebooks.eharlequin.com/74617B8D-DA9C-49A4-8B62-F3DC973010B7/10/126/en/ContentDetails.htm?ID=47C52309-BA17-4C05-B433-8473B0B0558B&#34;&gt;Stocking Stuffers&lt;/a&gt;) and themed compilations such as the &lt;a href=&#34;http://ebooks.eharlequin.com/74617B8D-DA9C-49A4-8B62-F3DC973010B7/10/126/en/ContentDetails.htm?ID=FB924224-D207-40C4-B093-4CA2CA9B69F7&#34;&gt;Bundle of Brides&lt;/a&gt; title.&lt;/p&gt;
&lt;p&gt;Lewis said that their books are available in &amp;ldquo;all six formats,&amp;rdquo; which I assume means Adobe Digital Editions, Microsoft Reader, and Mobipocket (according to their &lt;a href=&#34;http://ebooks.eharlequin.com/74617B8D-DA9C-49A4-8B62-F3DC973010B7/10/126/en/Help-FAQ-General.htm&#34;&gt;eBook Boutique help&lt;/a&gt;—and don&amp;rsquo;t miss this page, which can teach everyone plenty about how to acclimate a suspicious audience to eBook usage), &lt;a href=&#34;http://www.kindlereport.com/ebooks/free-harlequin-ebooks-through-to-jan-1-2008.html&#34;&gt;Kindle&lt;/a&gt;, &lt;a href=&#34;http://www.powells.com/cgi-bin/biblio?inkey=93-9781426813368-0&#34;&gt;Palm format&lt;/a&gt;, and the Sony Reader. For Valentine&amp;rsquo;s day, they even worked with Sony on a &lt;a href=&#34;http://www.trashionista.com/2008/02/harlequins-vale.html&#34;&gt;special pink version of the Sony Reader&lt;/a&gt; bundled with 14 romance novels on it.&lt;/p&gt;
&lt;p&gt;Almost two years ago they brought the Japanese trend of serializing novels to mobile phones to the United States with their &lt;a href=&#34;http://www.eharlequin.com/store.html?cid=425&#34;&gt;Harlequin On the Go&lt;/a&gt;™ series, sending 500 words a day to Sprint and Verizon customers who subscribe to this service. They also distribute through &lt;a href=&#34;http://www.dailylit.com/tags/Harlequin&#34;&gt;dailylit.com&lt;/a&gt;, which serializes books via RSS and email.&lt;/p&gt;
&lt;p&gt;In my own talk about &lt;a href=&#34;http://en.oreilly.com/toc2008/public/schedule/detail/2599&#34;&gt;Strategies for entering the eBook market&lt;/a&gt; the day after Brent&amp;rsquo;s talk, I pointed out how Harlequin should be an inspiration for anyone thinking about a move into the eBook market. What they did, who they did it with, and how they did it is worth a close look to anyone who wants to learn about the potential for eBooks.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1638&#34;&gt;February 18, 2008 12:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;All formats, that is, except honest PDF.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1639&#34;&gt;February 18, 2008 12:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;John,&lt;/p&gt;
&lt;p&gt;This brings up an interesting point: their help page refers to &amp;ldquo;Adobe® Reader®/Adobe® Digital Editions format&amp;rdquo;, but in the &lt;a href=&#34;http://en.oreilly.com/toc2008/public/schedule/detail/2140&#34;&gt;Digital Publishing Beyond eBooks&lt;/a&gt; talk by Adobe&amp;rsquo;s Bill McCoy just before mine, he described how ADE can read both PDF and epub files, and how Adobe is basing a lot of strategy around these two formats. The eHarlequin page doesn&amp;rsquo;t say which they&amp;rsquo;re using, but it does describe their use of PDFs with DRM (as opposed to what I assume you mean by &amp;ldquo;honest PDF&amp;rdquo;.) On the other hand, the new president of the &lt;a href=&#34;http://www.idpf.org/&#34;&gt;organization responsible for epub&lt;/a&gt; came there from Harlequin, so they must find it attractive for something.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Last day to submit for LinkedData Planet</title>
      <link>https://www.bobdc.com/blog/last-day-to-submit-for-linkedd/</link>
      <pubDate>Fri, 15 Feb 2008 09:01:01 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/last-day-to-submit-for-linkedd/</guid>
      
      
      <description><div>Come join an illustrious group.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.linkeddataplanet.com/speak.php&#34;&gt;&lt;img src=&#34;http://www.linkeddataplanet.com/images/hdr_logo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; width=&#34;240px&#34; vspace=&#34;30px&#34; alt=&#34;LinkedData Planet conference&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/the-linkeddata-planet-conferen&#34;&gt;wrote earlier&lt;/a&gt; about the &lt;a href=&#34;http://www.linkeddataplanet.com&#34;&gt;LinkedData Planet&lt;/a&gt; Conference and Expo that Ken North and I are co-chairing in New York City on June 17th and 18th. It&amp;rsquo;s all coming together very quickly, and we have a lot of great speakers lined up. In addition to their talks, I&amp;rsquo;m looking forward to the panels we&amp;rsquo;ll hold in which groups of them discuss the building of Linked Data applications.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not too late to &lt;a href=&#34;http://www.linkeddataplanet.com/speak.php&#34;&gt;submit a proposal&lt;/a&gt;, but today is the deadline, so if you&amp;rsquo;ve been jotting down ideas, this is your last chance to send them in.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>If content isn&#39;t king, what is?</title>
      <link>https://www.bobdc.com/blog/if-content-isnt-king-what-is/</link>
      <pubDate>Wed, 13 Feb 2008 15:04:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/if-content-isnt-king-what-is/</guid>
      
      
      <description><div>And how can you turn it into a catchy slogan?</div><div>&lt;p&gt;I&amp;rsquo;m writing this from the third day of the &lt;a href=&#34;http://en.oreilly.com/toc2008/public/content/home&#34;&gt;O&amp;rsquo;Reilly Tools of Change&lt;/a&gt; publishing conference, and I&amp;rsquo;ll have a lot to say in the coming weeks about ideas I&amp;rsquo;ve had here. I wanted to start with a theme from the opening keynote speeches: whether content is king, and if not, what is.&lt;/p&gt;
&lt;p&gt;Monday&amp;rsquo;s first keynote speaker was SirsiDynix VP of Innovation &lt;a href=&#34;http://en.oreilly.com/toc2008/public/schedule/speaker/1317&#34;&gt;Stephen Abram&lt;/a&gt;, who said that &amp;ldquo;content isn&amp;rsquo;t king—context is.&amp;rdquo; (He also said that &amp;ldquo;XML senses what device it&amp;rsquo;s on&amp;rdquo;, so he was clearly more interested in catchiness than in technical accuracy.)&lt;/p&gt;
&lt;p&gt;The mantra &amp;ldquo;content is king&amp;rdquo; has been around for years. As a LexisNexis employee, I heard it mentioned near the beginning of speeches with the regularity of grace at the beginning of a meal at a religious retreat. If I were still employed there, I could look up the first use of the term in the media. An &lt;a href=&#34;http://www.altavista.com/web/results?itag=ody&amp;amp;pg=aq&amp;amp;aqmode=s&amp;amp;aqa=&amp;amp;aqp=content+is+king&amp;amp;aqo=&amp;amp;aqn=&amp;amp;kgs=1&amp;amp;kls=0&amp;amp;d2=0&amp;amp;dt=dtrange&amp;amp;dfr%5Bd%5D=1&amp;amp;dfr%5Bm%5D=1&amp;amp;dfr%5By%5D=1980&amp;amp;dto%5Bd%5D=13&amp;amp;dto%5Bm%5D=2&amp;amp;dto%5By%5D=1994&amp;amp;filetype=&amp;amp;rc=dmn&amp;amp;swd=&amp;amp;lh=&amp;amp;nbq=10&#34;&gt;Alta Vista advanced search&lt;/a&gt; with a date range included shows a &lt;a href=&#34;http://web.mit.edu/AFS/sipb/project/eichin/cruft/text2/britannica-online&#34;&gt;1994 reference&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A later keynote speaker, author &lt;a href=&#34;http://www.rushkoff.com/&#34;&gt;Doug Rushkoff&lt;/a&gt;, reminded me of a high-tech Martin Short. He quoted Abram and said that neither content nor context is king, but that contact is. (If you have any suggestions for what is king, make sure it begins with the letters &amp;ldquo;cont&amp;rdquo;.) He didn&amp;rsquo;t just mean that maintaining two-way contact with your customers is valuable—an idea first made popular in 1999 by the &lt;a href=&#34;http://cluetrain.com/&#34;&gt;Cluetrain Manifesto&lt;/a&gt; authors—but that the real payoff comes from letting your customers maintain contact with each other. As he put it, &amp;ldquo;the Internet is interpersonal, not interactive&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;I could tie this to Web 2.0 talk of the value of user-generated content, but if one reader of a particular book posted some opinion about one of the characters on that book&amp;rsquo;s discussion list nine weeks ago, her three paragraphs won&amp;rsquo;t sell more copies of that book by themselves. What matters more is that sharing her opinion gives her a sense of participation in a community around the book, along with the readers who preceded and followed her in the on-line conversation. If the number one driver of book sales is recommendations, it&amp;rsquo;s very valuable for publishers to build this sense of participation around a book.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.moconews.net/entry/text-of-p-diddys-speech-give-them-king-kong-content/&#34;&gt;&lt;img src=&#34;http://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Sean_Combs.jpg/220px-Sean_Combs.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Mr. Diddy&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;These variations on the theme of &amp;ldquo;content is king&amp;rdquo; reminded me of something I recently stumbled across while cleaning up some bookmarks: a keynote speech titled &lt;a href=&#34;http://www.moconews.net/entry/text-of-p-diddys-speech-give-them-king-kong-content/&#34;&gt;I am an MVNO&lt;/a&gt; given by Sean Combs (P/Puff/Diddy/Daddy) to a wireless phone business gathering in 2005. I never was a huge fan of his music; a year after he gave this speech, a New York Times review of his album &amp;ldquo;Press Play&amp;rdquo; and Jay-Z&amp;rsquo;s &amp;ldquo;Kingdom Come&amp;rdquo; made a clever point: one album showed a businessman acting like a hustler, and the other showed a hustler acting like a businessman. (I own neither album, but Mr. Z&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=B000XXTCH6/bobducharmeA/&#34;&gt;American Gangster&lt;/a&gt; is the only album that I&amp;rsquo;ve bought as a set of MP3s off of Amazon so far.) Combs is obviously a talented businessman, and I was impressed with what he had to say to a roomful of people thinking hard about making money from content.&lt;/p&gt;
&lt;p&gt;His words still make sense three years later. And, he has his own variation on the theme of content as king: &amp;ldquo;People always say that content is king, but there&amp;rsquo;s a lot of content out there and it can&amp;rsquo;t all be king&amp;hellip; You want king-kong content&amp;rdquo;. He wasn&amp;rsquo;t just saying that you want really good content. With references to Marshall McLuhan, MTV, and BET, he describes the use of technology to build communities as marketing channels. It&amp;rsquo;s a fresh perspective, certainly not in terms of its age, but in terms of who it comes from, considering his distance from the Stephen Abrams and Doug Rushkoffs of this conference.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1582&#34;&gt;February 13, 2008 11:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Or at the least, make sure it begins with &amp;lsquo;c&amp;rsquo;, has &amp;rsquo;n&amp;rsquo; in the third position, and &amp;rsquo;t&amp;rsquo; immediately following. That should cover all cases.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-1585&#34;&gt;February 14, 2008 3:21 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Fie!&lt;/em&gt; &lt;a href=&#34;http://en.wikipedia.org/wiki/Canute_the_Great&#34;&gt;He&lt;/a&gt; hasn&amp;rsquo;t been King since the 11th century&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Pavarotti duets</title>
      <link>https://www.bobdc.com/blog/pavarotti-duets/</link>
      <pubDate>Fri, 08 Feb 2008 08:11:06 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/pavarotti-duets/</guid>
      
      
      <description><div>Lou Reed? James Brown? The Spice Girls?</div><div>&lt;p&gt;My small appreciation for opera focuses more on specific composers like Alban Berg and Puccini than on singers, but WFMU&amp;rsquo;s &lt;a href=&#34;http://blog.wfmu.org/freeform/2008/01/lou-reed-vs-pav.html&#34;&gt;Beware of the Blog&lt;/a&gt; led me to an interesting Pavarotti duet, and YouTube showed me that it was one of a truly strange collection. For example, there is the goofy and gimmicky: &lt;a href=&#34;http://youtube.com/watch?v=kL0WFcygdWY&#34;&gt;with Barry White&lt;/a&gt;. The less goofy, but still gimmicky (and I&amp;rsquo;m a bigger fan of Lou&amp;rsquo;s than of anyone else mentioned on this page): &lt;a href=&#34;http://youtube.com/watch?v=kXgbN81zNG8&#34;&gt;with Lou Reed&lt;/a&gt;. Strangely powerful and moving: &lt;a href=&#34;http://youtube.com/watch?v=VCIyzNISw1Q&#34;&gt;with James Brown&lt;/a&gt;.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/VCIyzNISw1Q&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Obviously, &amp;ldquo;This is a Man&amp;rsquo;s World&amp;rdquo; was a better choice than &amp;ldquo;Hot Pants&amp;rdquo; or &amp;ldquo;Cold Sweat&amp;rdquo; would have been, although &amp;ldquo;Please Please Please&amp;rdquo; might have been interesting. The song&amp;rsquo;s big belting emotions mesh well with a full orchestra, making it pretty operatic before the famous tenor joins in.&lt;/p&gt;
&lt;p&gt;YouTube also offers videos of Pavarotti duets with Bono, Sting (what, no Peter Gabriel?), Liza Minnelli, Ricky Martin, Meat Loaf, Bon Jovi, Bryan Adams (or was it Ryan Adams?), and even the Spice Girls, but I really wasn&amp;rsquo;t interested in these. None could beat the Godfather of Soul.&lt;/p&gt;
&lt;p&gt;Only once in my life have I ever told my kids &amp;ldquo;someday you&amp;rsquo;ll thank me for this&amp;rdquo;: when we brought them to a local James Brown concert less than a year before he died. He was obviously not in his prime, but it was the full band with the full show, and he was very impressive for a man of his age. (Google video has an obstructed view &lt;a href=&#34;http://video.google.com/videoplay?docid=-7379547043136309313&#34;&gt;video clip of &amp;ldquo;I Feel Good&amp;rdquo; from that show&lt;/a&gt;.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Unsung Super Bowl hero</title>
      <link>https://www.bobdc.com/blog/unsung-super-bowl-hero/</link>
      <pubDate>Tue, 05 Feb 2008 09:11:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/unsung-super-bowl-hero/</guid>
      
      
      <description><div>A Snee.</div><div>&lt;img src=&#34;http://www.giantsspectacular.com/catalog/images/snee%20auto%20a.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Chris Snee picture&#34; width=&#34;200px&#34;/&gt;
&lt;p&gt;At least that&amp;rsquo;s what his &lt;a href=&#34;http://www.thetimes-tribune.com/site/news.cfm?newsid=19260281&#34;&gt;hometown paper&lt;/a&gt; says. He also got &lt;a href=&#34;http://www.nytimes.com/2008/02/04/sports/football/04araton.html&#34;&gt;quoted by the New York Times&lt;/a&gt;, whose Harvey Araton felt the need to explicate what our man Snee meant.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know a lot about football—as a New York City resident, post-season Giants games on TV were about the extent of my football viewing—but I know that protecting Eli Manning while he makes those passes must be &amp;ldquo;challenging&amp;rdquo;, to use the popular business-speak euphemism. I wonder if this very large, powerful man felt any ill will toward the guy who got the snee.com domain name before he did, forcing him to put his personal website at &lt;a href=&#34;http://www.snee76.com/&#34;&gt;snee76.com&lt;/a&gt;? (&lt;a href=&#34;http://www.snee.com/about.html&#34;&gt;Further background&lt;/a&gt; on the choice of this domain name.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>The future of RDFa</title>
      <link>https://www.bobdc.com/blog/the-future-of-rdfa/</link>
      <pubDate>Mon, 04 Feb 2008 08:47:21 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-future-of-rdfa/</guid>
      
      
      <description><div>Think big.</div><div>&lt;p&gt;Since the beginning of RDFa&amp;rsquo;s history, many of its advocates have stressed its value in adding machine-readable semantics to personal web pages. This example from the &lt;a href=&#34;http://www.w3.org/TR/xhtml-rdfa-primer/&#34;&gt;RDFa Primer&lt;/a&gt; is typical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; &amp;lt;p class=&amp;quot;contactinfo&amp;quot;  about=&amp;quot;http://example.org/staff/jo&amp;quot;&amp;gt;
    &amp;lt;span property=&amp;quot;contact:fn&amp;quot;&amp;gt;Jo Smith&amp;lt;/span&amp;gt;.
    &amp;lt;span property=&amp;quot;contact:title&amp;quot;&amp;gt;Web hacker&amp;lt;/span&amp;gt; at
    &amp;lt;a rel=&amp;quot;contact:org&amp;quot; href=&amp;quot;http://example.org&amp;quot;&amp;gt;Example.org&amp;lt;/a&amp;gt;.
    You can contact me
    &amp;lt;a rel=&amp;quot;contact:email&amp;quot; href=&amp;quot;mailto:jo@example.org&amp;quot;&amp;gt;via email&amp;lt;/a&amp;gt;.
  &amp;lt;/p&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An important principle has been the ability to make a web page&amp;rsquo;s data readable by both eyeballs and automated processes. This is great, but there are two related issues that I feel need a higher profile: first, RDFa has great potential for storing non-eyeball information in web pages. Second, examples like the one above go after microformats on their own turf, where they&amp;rsquo;re dug in pretty well. Being a more generalized, scalable solution, RDFa can do a lot more than microformats, and because many of those other applications have more commercial potential, I see them as the best growth area for the format.&lt;/p&gt;
&lt;p&gt;First, the non-eyeballs part. When I speak about RDFa to people with a publishing background, they like its ability to store metadata such as workflow information. Some had heard of RDF in its RDF/XML incarnation, and it was just too complex for them. RDFa isn&amp;rsquo;t. I submitted &lt;a href=&#34;http://www.w3.org/TR/xhtml-rdfa-scenarios/#use-case-3&#34;&gt;an example&lt;/a&gt; of this kind of workflow metadata usage to the RDFa Use Cases document, where it can provide a placeholder for future work. People often say that it&amp;rsquo;s difficult to measure RDF adoption rates because so much of it is behind firewalls. Electronic publishing workflow metadata is a pretty classic case of this: publishers want to track various bits of information about documents as they work on them but don&amp;rsquo;t want to include that information in the publicly available versions. So again, I think it&amp;rsquo;s a great potential growth area for RDFa.&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;&lt;strong&gt;Being a more generalized, scalable solution, RDFa can do a lot more than microformats, and with many of those other applications having more commercial potential, I see them as the best growth area for the format.&lt;/strong&gt; &lt;/blockquote&gt;
&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/scraping-and-linked-data&#34;&gt;wrote recently&lt;/a&gt; about how microformats, the semantic web, and the linked data movement are making more data available as HTTP-accessible resources. The linked data strategy is often to build a front end to a data source that lets you issue SPARQL queries against it—a &amp;ldquo;SPARQL endpoint&amp;rdquo;—and/or to maintain an updated copy of valuable information to query against, as with &lt;a href=&#34;http://www.snee.com/bobdc.blog/2007/11/querying_dbpedia.html&#34;&gt;DBPedia&lt;/a&gt;. Microformats and the semantic web efforts (or at least the RDFa aspect of this) compete more directly with each other, each offering ways to embed semantics and machine-readable data into web pages, so it&amp;rsquo;s worth examining what each does well and what clues this offers about their future.&lt;/p&gt;
&lt;p&gt;The microformats effort has settled on formats to &lt;a href=&#34;http://microformats.org/wiki/hcard&#34;&gt;represent vCard contact information&lt;/a&gt; and &lt;a href=&#34;http://microformats.org/wiki/XOXO&#34;&gt;outlines&lt;/a&gt; in HTML, and there are &lt;a href=&#34;http://wiki.caminobrowser.org/Development:Planning:Microformats#Microformats_List&#34;&gt;various efforts&lt;/a&gt; to re-use existing bits of HTML markup for other domains, but there&amp;rsquo;s a &lt;a href=&#34;http://microformats.org/wiki/exploratory-discussions#Moribund&#34;&gt;much longer list&lt;/a&gt; of failed (or rather, &amp;ldquo;moribund&amp;rdquo;) microformats efforts. Microformats&amp;rsquo; hCard conventions for contact information look like a success, and the XOXO outline effort addresses a problem that RDF was never very good at anyway: imposing structure on the relationships among collections of data.&lt;/p&gt;
&lt;p&gt;The list of moribund microformats efforts shows that the approach is moving slowly, if at all, to many new domains, and my theory is that it&amp;rsquo;s so slow because each new domain requires working out a new set of things: how to identify each piece of information and where to put it in the available HTML slots. They have a few &lt;a href=&#34;http://microformats.org/wiki/design-patterns&#34;&gt;design patterns&lt;/a&gt; to guide this process, but I know of no generalized microformats way to say that a given resource has a given field name/value pairing in a way that would work for all resources and fields. RDFa&amp;rsquo;s use of &lt;a href=&#34;http://www.w3.org/TR/2007/WD-rdfa-syntax-20071018/&#34;&gt;actual specifications&lt;/a&gt; (as opposed to warm and fuzzy exhortations like &amp;ldquo;pave the cow paths&amp;rdquo; and &amp;ldquo;a way of thinking about data&amp;rdquo;) makes the RDFa representation of any straightforward facts pretty simple, as long as a vocabulary exists to describe the resources and attributes. If one doesn&amp;rsquo;t, you can make one up, building on existing naming schemes such as &lt;a href=&#34;http://en.wikipedia.org/wiki/Stock_Keeping_Unit&#34;&gt;SKU&lt;/a&gt; or &lt;a href=&#34;http://www.isbn.org/standards/home/index.asp&#34;&gt;ISBN&lt;/a&gt; numbers.&lt;/p&gt;
&lt;p&gt;These two naming schemes in particular can cover a vast amount of machine-readable data that&amp;rsquo;s worth embedding into web pages. For example, if the book with ISBN 1930220111 is for sale for $19.77, then it&amp;rsquo;s pretty clear what&amp;rsquo;s going on here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;span about=&amp;quot;http://site:www.isbn.org/1930220111&amp;quot; property=&amp;quot;cbc:PriceAmount&amp;quot;&amp;gt;19.77&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I&amp;rsquo;m assuming for now that an application reading such data would only be interested in its developer&amp;rsquo;s local currency, which leaves plenty of useful applications to write.) If you and I each have a million triples of pricing information, but you used something other than the UBL urn:oasis:names:tc:ubl:CommonBasicComponents:1:0 namespace to indicate your PriceAmount predicates, a simple OWL rule can tell a program reading these prices that you and I meant the same thing by the two different predicates we used.&lt;/p&gt;
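&lt;p&gt;As a sketch of what such a rule might look like (with &lt;code&gt;your:priceAmount&lt;/code&gt; and its namespace made up for the example, and exact prefix expansions aside), a single &lt;code&gt;owl:equivalentProperty&lt;/code&gt; triple in Turtle is enough:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix owl:  &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
@prefix cbc:  &amp;lt;urn:oasis:names:tc:ubl:CommonBasicComponents:1:0#&amp;gt; .
@prefix your: &amp;lt;http://example.com/vocab#&amp;gt; .

# any triple using your:priceAmount now also implies one using cbc:PriceAmount
your:priceAmount owl:equivalentProperty cbc:PriceAmount .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An OWL-aware processor that reads this along with both sets of pricing triples can then treat the two predicates as interchangeable when answering queries.&lt;/p&gt;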
&lt;p&gt;Pricing is a good example. It&amp;rsquo;s a huge area where people would be happy to give away data in the form of extra embedded metadata in their web pages, because it can drive new paying customers to the source of that data (for example, to sell more copies of the book with the ISBN 1930220111). Scheduling is another example of how giving away data such as flight times or movie times can drive paying customers to an organization with something for sale. Microformats have made &lt;a href=&#34;http://microformats.org/wiki/hcalendar-examples-in-wild&#34;&gt;some progress&lt;/a&gt; (the &lt;a href=&#34;http://www.depechemode.de/parties/&#34;&gt;German Depeche Mode party list&lt;/a&gt;?), but I think that RDFa can make a lot more progress here.&lt;/p&gt;
&lt;p&gt;Let microformats do what they do best: shoehorning bits of personal data into leftover HTML attributes that no one was using (such as the &lt;code&gt;abbr&lt;/code&gt; attribute for dates) and adding &lt;code&gt;&amp;lt;div class=&amp;quot;foo&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;span class=&amp;quot;foo&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;&lt;/code&gt; elements in places where they wish HTML offered a &lt;code&gt;foo&lt;/code&gt; element. That&amp;rsquo;s not going to scale to more enterprise-oriented data, because there are no clear answers to questions about the relationships between the various bits of markup. For example, what does &lt;code&gt;&amp;lt;div class=&amp;quot;title&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; mean? The &lt;a href=&#34;http://microformats.org/wiki/audio-info-brainstorming#title&#34;&gt;title of an audio track&lt;/a&gt; or a &lt;a href=&#34;http://microformats.org/wiki/meeting-minutes-brainstorming&#34;&gt;job title&lt;/a&gt;? I suppose it depends whether the &lt;code&gt;div&lt;/code&gt; element in question has a &lt;code&gt;&amp;lt;div class=&amp;quot;haudio&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; ancestor or a &lt;code&gt;&amp;lt;div class=&amp;quot;vcard&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; ancestor. So what role does a &lt;code&gt;div&lt;/code&gt; element play in setting the context of its descendants? Hell if I know; a &lt;a href=&#34;http://microformats.org/wiki/Special:Search?search=div&amp;amp;go=Go&#34;&gt;search for &amp;ldquo;div&amp;rdquo; at microformats.org&lt;/a&gt; just brought up &amp;ldquo;No page title matches&amp;rdquo; and &amp;ldquo;No page text matches&amp;rdquo;. 
The documentation for the &lt;a href=&#34;http://microformats.org/wiki/class-design-pattern&#34;&gt;class design pattern&lt;/a&gt; tells us that &amp;ldquo;if an appropriate semantic element is not available, use &lt;code&gt;span&lt;/code&gt; or &lt;code&gt;div&lt;/code&gt;&amp;rdquo;, with no clue about what might be special about &lt;code&gt;div&lt;/code&gt;. The documentation for the &lt;a href=&#34;http://microformats.org/wiki/elemental-microformat&#34;&gt;elemental&lt;/a&gt; and &lt;a href=&#34;http://microformats.org/wiki/compound-microformat&#34;&gt;compound&lt;/a&gt; design patterns doesn&amp;rsquo;t offer any more help.&lt;/p&gt;
&lt;p&gt;This is not a markup infrastructure that someone can take and run with to develop (or even augment) applications for arbitrary data domains. RDFa is way ahead of microformats in its ability to do this, so its best opportunities for traction are in domains with a lot of structured data that doesn&amp;rsquo;t fit well into hCard format or the two or three other microformat success stories.&lt;/p&gt;
&lt;p&gt;There are plenty of these. Those who would benefit most from giving away embedded machine-readable data are companies and other large organizations who are now generating tables of HTML describing their products and services using PHP, Perl, Ruby on Rails, or other scripting languages, and a few tweaks (&lt;a href=&#34;https://www.bobdc.com/blog/automated-rdfa-output-from-dit&#34;&gt;[1]&lt;/a&gt;, &lt;a href=&#34;http://www.snee.com/bobdc.blog/2007/02/generating_rdfa_from_movable_t_1.html&#34;&gt;[2]&lt;/a&gt;) to those scripts can make a wide range of that data available as machine-readable RDFa in addition to human-readable HTML. Let&amp;rsquo;s find the people who can get those tweaks made and convince them of the value of doing so.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-1566&#34;&gt;February 5, 2008 4:42 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The problem I&amp;rsquo;m having with RDFa is that the semweb community avoids commitment to any common formats whereas the microformats community shares the tools and stylesheets for parsing, which in the end enforce a loose standard. If your format shows up in operator it&amp;rsquo;s ok, if not you need to work on it. What I ended up doing was hAtom on the page, then style to Atom, then RDF.&lt;/p&gt;
&lt;p&gt;I know you&amp;rsquo;re thinking wait, we&amp;rsquo;ve got lots of standards in RDF. When there is a format the XSD types are vague, sometimes even omitted. The cardinalities are often omitted. We need a place to declare &amp;ldquo;here is how we format ____ in RDFa and here is a tool that will tell you if your stuff is good or bad&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;So let&amp;rsquo;s just pretend we&amp;rsquo;ve found somebody who emits lists of products and services onto webpages&amp;hellip;what format do we recommend they use?&lt;/p&gt;
&lt;p&gt;Taylor&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1567&#34;&gt;February 5, 2008 5:21 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think that rdfa.info would be a good candidate for a place to store documentation about such best practices.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not sure what you mean by &amp;ldquo;shows up in operator&amp;rdquo;&amp;ndash;I find that if I coded some RDFa correctly, any of the tools mentioned in &amp;ldquo;Getting Those Triples&amp;rdquo; at &lt;a href=&#34;http://www.xml.com/pub/a/2007/02/14/introducing-rdfa.html?page=2&#34;&gt;http://www.xml.com/pub/a/2007/02/14/introducing-rdfa.html?page=2&lt;/a&gt; pull out the triples I expected, so I consider those to be shared, available, consistent tools. Several formats for coding the triples are possible, but I wouldn&amp;rsquo;t recommend one over the other, as long as the triples extracted with the tools showed that the chosen format was used correctly.&lt;/p&gt;
&lt;p&gt;I haven&amp;rsquo;t played with typing of values in these triples much, so maybe those are less consistent; is that what you meant about XSD types?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.boogdesign.com/b2evo/&#34; title=&#34;http://www.boogdesign.com/b2evo/&#34;&gt;Rob Crowther&lt;/a&gt; on &lt;a href=&#34;#comment-1568&#34;&gt;February 5, 2008 5:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks for an interesting read, just some comments where there seems to be a bit of confusion:&lt;/p&gt;
&lt;p&gt;The phrase &amp;ldquo;shows up in operator&amp;rdquo; refers to the Operator extension for Firefox, which is the basis for the built in Microformats support in Firefox 3.0:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://addons.mozilla.org/en-US/firefox/addon/4106&#34;&gt;https://addons.mozilla.org/en-US/firefox/addon/4106&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;with no clue about what might be special about div&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Div is a block level element, use it if you need a block level element. Span is an inline element, use it if you need an inline element. I&amp;rsquo;m not sure what it is you think should be special about it? The class names can be applied to arbitrary elements as appropriate to your markup, the element is not usually significant.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1569&#34;&gt;February 5, 2008 6:21 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Rob,&lt;/p&gt;
&lt;p&gt;Bit of confusion is right! I was trying to learn more about div because I was wondering if there was a way to tell whether class=&amp;ldquo;title&amp;rdquo; refers to the title of an audio track or to a job title, because I found both among microformats examples. (For example, does &lt;span class=&#39;title&#39;&gt;Bell Boy&lt;/span&gt; refer to someone who helps you with your luggage or the Who song from Quadrophenia?) I guessed that the criterion was whether the closest enclosing div element had a class value of haudio or vcard, but could find no confirmation of this, or anything else describing how the value of one div&amp;rsquo;s class value could affect the interpretation of another one.&lt;/p&gt;
&lt;p&gt;This is why I was trying to find some documentation about whether there&amp;rsquo;s anything special about the div element. I suppose I shouldn&amp;rsquo;t have assumed that there was something special about the relationship between the value in question and the enclosing div element that identified which microformat (haudio or vcard) was in use; perhaps a span element can identify which microformat is in use and indicate how to interpret what &amp;ldquo;title&amp;rdquo; means in a given context. I&amp;rsquo;m guessing, I&amp;rsquo;m making assumptions that may be incorrect, and I wish I had a place to look up the answer, but I couldn&amp;rsquo;t find it. I believe that the need to make such guesses and assumptions prevents microformats from being a good format for many domains where RDFa would work well, because RDFa has a spec and it&amp;rsquo;s built on a simple, documented data model.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://thewebsemantic.com&#34; title=&#34;http://thewebsemantic.com&#34;&gt;Taylor&lt;/a&gt; on &lt;a href=&#34;#comment-1570&#34;&gt;February 5, 2008 10:11 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;I haven&amp;rsquo;t played with typing of values in these triples much, so maybe those are less consistent; is that what you meant about XSD types?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Yes, that&amp;rsquo;s what I meant. Looking back at your triples introduction article you used the Dublin Core element set, so I&amp;rsquo;ll pick on that since we&amp;rsquo;re both familiar with it. The schema for that is here: &lt;a href=&#34;http://purl.org/dc/elements/1.1/&#34;&gt;http://purl.org/dc/elements/1.1/&lt;/a&gt;. I think dc:subject is similar to a tag set, but the schema doesn’t indicate if it’s a comma separated string, or space, or a bag, or a set, or another datatype node. It doesn’t say anything at all, except that the property exists. dc:date is worse. Remember, we&amp;rsquo;re talking machine readable. When a machine encounters a dc:date of 07/08/09 what does it think it means?&lt;/p&gt;
&lt;p&gt;Technically RDFa has tremendous advantages over microformats, and we can be very specific with our vocabularies using RDFS and/or OWL. But the RDFa camp is lacking where it counts most&amp;hellip;community building, and collaborative vocabulary definition.&lt;/p&gt;
&lt;p&gt;From rdfs.info &amp;ldquo;you choose which attributes to use, which to reuse from other sites, and how to evolve, over time, the meaning of these attributes.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That bothers me. Just sounds too loosey goosey. For meaning to have an impact it needs to be shared and agreed upon. Publishers are thinking &amp;ldquo;how do I get reach&amp;rdquo;&amp;hellip;aggregators are thinking &amp;ldquo;how can I find semantically meaningful content&amp;rdquo;. I&amp;rsquo;d like to participate in building these for my domain but I don&amp;rsquo;t know where to engage, while it&amp;rsquo;s pretty clear on the microformat side of things.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1571&#34;&gt;February 5, 2008 11:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Taylor,&lt;/p&gt;
&lt;p&gt;dc:subject is a property, and that&amp;rsquo;s all. In my opinion, building data structures out of triples is what got RDF/XML into trouble, and I get along fine by just not doing that. If I want to say that the resource at &lt;a href=&#34;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&#34;&gt;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&lt;/a&gt; has both RDF and DITA as subjects, I&amp;rsquo;ll just do it as two triples,&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&#34;&gt;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;http://purl.org/dc/elements/1.1/subject&#34;&gt;http://purl.org/dc/elements/1.1/subject&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/rdf/&#34;&gt;http://www.snee.com/bobdc.blog/metadata/rdf/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&#34;&gt;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;http://purl.org/dc/elements/1.1/subject&#34;&gt;http://purl.org/dc/elements/1.1/subject&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/dita/&#34;&gt;http://www.snee.com/bobdc.blog/metadata/dita/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;or, in RDFa,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;span about=&amp;quot;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&amp;quot;
      rel=&amp;quot;dc:subject&amp;quot; href=&amp;quot;http://www.snee.com/bobdc.blog/metadata/rdf/&amp;quot;/&amp;gt;
&amp;lt;span about=&amp;quot;http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html&amp;quot;
      rel=&amp;quot;dc:subject&amp;quot; href=&amp;quot;http://www.snee.com/bobdc.blog/metadata/dita/&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and I won&amp;rsquo;t even worry about trying to treat dc:subject as multi-valued thing. (There are other ways to do it in RDFa, especially from within the document serving as the subject, but RDFa tools will pull the same triples out of those as they will from the two span elements above.)&lt;/p&gt;
&lt;p&gt;For machine-readable dates, ISO 8601 has no real competition; in fact, it&amp;rsquo;s one of the canonical RDFa examples of using the content attribute, to do something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;span property=&amp;quot;dc:date&amp;quot; content=&amp;quot;2007-03-15T15:32:00&amp;quot;&amp;gt;March 15, 2007, at 3:32 PM&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;RDF(a) aside, it says right at &lt;a href=&#34;http://dublincore.org/documents/dces/#date&#34;&gt;http://dublincore.org/documents/dces/#date&lt;/a&gt; that ISO 8601 format is the best practice for dc:date.&lt;/p&gt;
&lt;p&gt;I think that microformats&amp;rsquo; lack of a spec makes it far more loosey-goosey than RDFa. There is clarity in microformats if you limit your domain to contact information or outlining, but there are many more kinds of data out there. See also my reply to Rob above.&lt;/p&gt;
&lt;p&gt;By Sarah Bourne on &lt;a href=&#34;#comment-1572&#34;&gt;February 6, 2008 10:10 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The problems the abbr design patterns (in particular) create for users of assistive technologies make adoption of microformats pretty much impossible for government sites with accessibility requirements. So not only can RDFa do everything microformats can, and then some (as you point out); it does so without adversely impacting usability for people with disabilities.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>The LinkedData Planet conference</title>
      <link>https://www.bobdc.com/blog/the-linkeddata-planet-conferen/</link>
      <pubDate>Fri, 01 Feb 2008 13:23:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-linkeddata-planet-conferen/</guid>
      
      
      <description><div>And Expo!</div><div>&lt;p&gt;&lt;a href=&#34;http://www.linkeddataplanet.com/speak.php&#34;&gt;&lt;img src=&#34;http://www.linkeddataplanet.com/images/hdr_logo.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; width=&#34;240px&#34; vspace=&#34;30px&#34; alt=&#34;LinkedData Planet conference&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m proud to announce that Ken North and I are co-chairing the &lt;a href=&#34;http://www.linkeddataplanet.com&#34;&gt;LinkedData Planet&lt;/a&gt; Conference and Expo being held in New York City on June 17th and 18th. &amp;ldquo;Linked Data&amp;rdquo; refers to the increasing amount of machine-accessible useful data on the public web and the growing collection of practices that let us use different sets of public and private data together to get more out of those sets of data. By combining tools and techniques from the semantic web world, the enterprise database management world, and the XML world, developers are doing fascinating new things every day.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/querying-dbpedia&#34;&gt;Listing Bart Simpson blackboard gags&lt;/a&gt; is a silly example, but it shows how a standard query language can access a large, public database to find out some very interesting things, and DBpedia has a lot of data that will be valuable to more useful applications. And, since I wrote that three months ago, that query language &lt;a href=&#34;http://www.w3.org/News/2008#item6&#34;&gt;became a standard&lt;/a&gt;—things are moving very quickly in the world of Linked Data.&lt;/p&gt;
&lt;p&gt;I first met Ken when the Internet bubble economy supported several XML conferences per year, and since then I&amp;rsquo;ve grown to value his big-picture historical perspective on the evolution of database technology. (For example, check out his &lt;a href=&#34;http://ourworld.compuserve.com/homepages/Ken_North/db_hall.htm&#34;&gt;Excellence in Database Technology&lt;/a&gt; page.) He&amp;rsquo;s not looking at this technology from inside of the XML/publishing/semantic web geek world like I tend to, and he makes an excellent advocate for Linked Data technology in the enterprise database management world. The conference has also benefited a great deal from the inspiration of OpenLink Software&amp;rsquo;s &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen/&#34;&gt;Kingsley Idehen&lt;/a&gt;, who I first knew as the blogger interested in semantic web issues who left the most interesting comments on my own weblog. His experience with OpenLink&amp;rsquo;s technology and clients has given him a vision for the future of Linked Data that is quite inspiring when you hear him discuss it.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve got one speaker lined up already: Tim Berners-Lee, who&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/DesignIssues/LinkedData.html&#34;&gt;been thinking about the possibilities of linked data&lt;/a&gt; for a while. If you&amp;rsquo;re interested in Linked Data, please submit something on the &lt;a href=&#34;http://www.linkeddataplanet.com/speak.php&#34;&gt;call for speakers page&lt;/a&gt; to tell us what you&amp;rsquo;ve done or what you&amp;rsquo;re working on. We&amp;rsquo;d love to hear about it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Validating XML documents with PUBLIC identifiers and catalogs</title>
      <link>https://www.bobdc.com/blog/validating-xml-documents-with/</link>
      <pubDate>Sun, 27 Jan 2008 11:12:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/validating-xml-documents-with/</guid>
      
      
      <description><div>And indenting them, and changing their encoding...</div><div>&lt;p&gt;&lt;a href=&#34;http://xmlsoft.org/xmllint.html&#34;&gt;&lt;img src=&#34;http://xmlsoft.org/Libxml2-Logo-180x168.gif&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;powered by libxml2&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;To check the validity of XML files, I&amp;rsquo;ve used the &lt;a href=&#34;http://xerces.apache.org/xerces-c/stdinparse.html&#34;&gt;stdinparse&lt;/a&gt; utility that comes with Xerces C for years, but no more. While creating some DITA files, I wanted to validate them using the document&amp;rsquo;s PUBLIC identifier and not its SYSTEM identifier. (I didn&amp;rsquo;t use PUBLIC identifiers much in the days between SGML and DITA. They&amp;rsquo;re useful for DITA work because the DITA Open Toolkit automates the assembly of multiple pieces, and sharing pieces in multiple places is easier with PUBLIC declarations, especially if you&amp;rsquo;re assembling a system that will run on a machine other than your own.)&lt;/p&gt;
&lt;p&gt;I did some searches, and it turned out that I&amp;rsquo;d put the perfect utility on my Windows machine&amp;rsquo;s hard disk years ago. It looks like it&amp;rsquo;s included in some Linux distributions as well, or is only an apt-get away: &lt;a href=&#34;http://xmlsoft.org/xmllint.html&#34;&gt;xmllint&lt;/a&gt;, which is part of &lt;a href=&#34;http://xmlsoft.org/&#34;&gt;libxml2&lt;/a&gt;. It&amp;rsquo;s written in C, so it&amp;rsquo;s fast, and binaries are easy to find for Windows and Linux.&lt;/p&gt;
&lt;p&gt;Once you set the SGML_CATALOG_FILES environment variable to point to your catalog, the &lt;code&gt;-catalogs&lt;/code&gt; switch tells it to use the catalog. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set SGML_CATALOG_FILES=c:/usr/local/DITA-OT1.4.1/catalog-dita.xml
xmllint -noout -valid -catalogs myditafile.xml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-noout&lt;/code&gt; switch tells xmllint to not output the document itself, &lt;code&gt;-valid&lt;/code&gt; tells it to validate the document, and &lt;code&gt;-catalogs&lt;/code&gt; tells it to use the catalog defined in SGML_CATALOG_FILES.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;xmllint&lt;/code&gt; has a lot of other nice switches. If you omit the &lt;code&gt;-noout&lt;/code&gt; switch, there are some handy transformations you can easily perform on the document. You can indent it with &lt;code&gt;-format&lt;/code&gt;, and &lt;code&gt;-encode&lt;/code&gt; lets you specify a new encoding for the output, as Dave Holden &lt;a href=&#34;https://www.bobdc.com/blog/converting-an-xml-documents-en#c001278&#34;&gt;pointed out&lt;/a&gt; when I described some simple XSLT stylesheets I once used to convert the encoding of XML documents. The &lt;code&gt;-noblanks&lt;/code&gt; switch drops ignorable white space, &lt;code&gt;-relaxng&lt;/code&gt; validates the document against a RELAX NG schema, &lt;code&gt;-schema&lt;/code&gt; validates it against a W3C schema, and there are dozens more switches.&lt;/p&gt;
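&lt;p&gt;A quick self-contained sketch of the well-formedness check and the &lt;code&gt;-format&lt;/code&gt; switch (the file and its contents are made up for illustration; this assumes xmllint is on your path):&lt;/p&gt;

```shell
# Create a small unindented XML file to experiment with.
cat > tiny.xml <<'EOF'
<doc><title>Hi</title><p>Some text.</p></doc>
EOF
# Parse without echoing the document; silence means it's well-formed.
xmllint --noout tiny.xml && echo "well-formed"
# Without --noout, xmllint echoes the document; --format indents it.
xmllint --format tiny.xml
```

(Newer xmllint documentation spells the switches with two dashes, as above, but the single-dash forms shown in this post are accepted as well.)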
&lt;p&gt;I can&amp;rsquo;t believe this was sitting on my hard disk for so long without my noticing how useful it can be.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://causticdave.com&#34; title=&#34;http://causticdave.com&#34;&gt;Caustic Dave&lt;/a&gt; on &lt;a href=&#34;#comment-1560&#34;&gt;January 27, 2008 11:18 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oh yeah. xmllint is one of my favorite utilities. It has saved me from doom many times.&lt;/p&gt;
&lt;p&gt;I wish there was something like it for Javascript.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Digitization and its discontents</title>
      <link>https://www.bobdc.com/blog/digitization-and-its-disconten/</link>
      <pubDate>Thu, 24 Jan 2008 10:29:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/digitization-and-its-disconten/</guid>
      
      
      <description><div>How sloppy is OK for Google scans?</div><div>&lt;p&gt;&lt;a href=&#34;http://radar.oreilly.com/archives/2008/01/hand_of_google.html&#34;&gt;&lt;img src=&#34;http://radar.oreilly.com//ishot-2.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;bad Google scan&#34; width=&#34;240px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Every now and then someone finds a page image from &lt;a href=&#34;http://books.google.com&#34;&gt;Google Book Search&lt;/a&gt; that shows the scanner operator&amp;rsquo;s hand over the page or something else that should be equally embarrassing. If finding such images was a game, &lt;a href=&#34;http://www.billtrippe.com/cgi-bin/mt/mt-search.cgi?IncludeBlogs=1&amp;amp;search=%22google%20books%20stupid%20page%20of%20the%20day%22%20&#34;&gt;Bill Trippe&lt;/a&gt; would have the high score. Dale Dougherty &lt;a href=&#34;http://radar.oreilly.com/archives/2008/01/hand_of_google.html&#34;&gt;recently pointed out&lt;/a&gt; the hilarious example shown here. (It took me a while to realize that the little flashes of color are the rubber fingertips that editorial workers often use to help them turn pages faster.)&lt;/p&gt;
&lt;p&gt;One commenter wrote that &amp;ldquo;after this, their QA department must be in trouble&amp;rdquo;. To paraphrase my reply, unless you&amp;rsquo;re a piecemeal nonprofit—which Google certainly isn&amp;rsquo;t—part of the project planning for any large-scale digitization effort is deciding what level of quality you&amp;rsquo;re going to achieve and then putting the QA infrastructure in place to attain it. 100% accuracy isn&amp;rsquo;t possible, but 99.95% and 99.995% are. Of course, 99.995% means a more expensive QA infrastructure, and your budget and schedule are two important inputs when deciding on an accuracy figure to aim for. At my &lt;a href=&#34;http://www.innodata-isogen.com&#34;&gt;place of employment&lt;/a&gt; I have co-workers who have this measurement down to a science; contact me if you&amp;rsquo;re interested in hearing more.&lt;/p&gt;
&lt;p&gt;Google&amp;rsquo;s QA department is only in trouble if they&amp;rsquo;re not meeting their accuracy goal. I&amp;rsquo;d love to see what this figure is, but I&amp;rsquo;m not holding my breath waiting for them to reveal it. I&amp;rsquo;m guessing that it&amp;rsquo;s lower than 99.95%.&lt;/p&gt;
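&lt;p&gt;To get a rough feel for what those accuracy targets mean in practice (the character count below is an invented round number, not a figure from anyone&amp;rsquo;s project):&lt;/p&gt;

```shell
# Expected character errors in a million scanned characters at each target.
chars=1000000
echo "99.95%:  $(( chars * 5 / 10000 )) errors"    # 0.05% error rate -> 500
echo "99.995%: $(( chars * 5 / 100000 )) errors"   # 0.005% error rate -> 50
```

A tenfold drop in errors for that last half-step is why the fancier QA infrastructure costs what it does.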
&lt;p&gt;The same day that Dale posted this I learned from Robin Cover&amp;rsquo;s &lt;a href=&#34;http://xml.coverpages.org/newsletter/news2008-01-18.html&#34;&gt;XML Daily Newslink&lt;/a&gt; that the National Library of the Netherlands had published a report titled &lt;a href=&#34;http://www.dlib.org/dlib/january08/klijn/01klijn.html&#34;&gt;The Current State-of-art in Newspaper Digitization: A Market Perspective&lt;/a&gt; in &lt;a href=&#34;http://www.dlib.org&#34;&gt;D-Lib magazine&lt;/a&gt;, an online magazine focused on digital library research. Newspaper digitization presents all the difficulties of book digitization and a few more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The pages are bigger, so you need larger (and more expensive) scanners to make page images.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A single page can have many articles, and many of these—especially those that begin on the front page—often end on a different page. The D-Lib article calls these &amp;ldquo;so-called &amp;lsquo;continuation&amp;rsquo; articles&amp;rdquo;, and newspaper people call the continuations &amp;ldquo;jumps&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;In addition to the extra work of assembling the pieces so that the digital versions of these articles are coherent wholes, this presents some new metadata problems: if a keyword search finds a phrase on page 31, but it&amp;rsquo;s in the jump for an article beginning on page 12, will you retain this information when you assemble the pieces, and if so how will you store it and present it to the user?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Knowing that most people will throw out their newspapers within days, newspaper publishers save money by printing them on cheap paper. The high acid content of newsprint means that it gets brown, crumbly, and more difficult to OCR much faster than paper used in other kinds of publishing.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The D-Lib article covers these and more generalized digitization issues very well. I recommend it to anyone interested in what goes into such a project, whether it&amp;rsquo;s the Google Book Search project or one you&amp;rsquo;re considering yourself.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1559&#34;&gt;January 24, 2008 11:27 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t have inside knowledge about this, but on the evidence of one case, Google Books will yank a scanned book if there are fingers visible even if they don&amp;rsquo;t interfere with the comprehension of the text. However, they do seem to depend mostly on complaints by others (a reasonable attitude IMHO &amp;ndash; Google *is* behaving like a nonprofit in its book-scanning activities).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Free epub children&#39;s picture books</title>
      <link>https://www.bobdc.com/blog/free-epub-childrens-picture-bo/</link>
      <pubDate>Sun, 20 Jan 2008 13:22:51 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/free-epub-childrens-picture-bo/</guid>
      
      
      <description><div>Something for kids to read on the OLPC XO.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.snee.com/epubkidsbooks/&#34;&gt;&lt;img src=&#34;http://www.snee.com/epubkidsbooks/images/morerussian.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34; width=&#34;160px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/lazyweb-grants-a-christmas-wis&#34;&gt;wrote recently&lt;/a&gt; about how several people have gotten the FBReader eBook reading program to run on the OLPC XO laptop. I thought it would be nice to have some books appropriate to the XO&amp;rsquo;s audience available, so I converted sixteen picture books from &lt;a href=&#34;http://www.gutenberg.org/wiki/Main_Page&#34;&gt;Project Gutenberg&lt;/a&gt; to the epub format, which is one of the formats that FBReader can read. You can find them at the &lt;a href=&#34;http://www.snee.com/epubkidsbooks/&#34;&gt;Free epub children&amp;rsquo;s picture books&lt;/a&gt; page that I created.&lt;/p&gt;
&lt;p&gt;Notes on the page discuss why I chose the books that I converted, why I chose the epub format, and some issues with FBReader&amp;rsquo;s current support of the format. I won&amp;rsquo;t repeat all those notes here, but I will repeat one: because I don&amp;rsquo;t own an XO, I&amp;rsquo;d appreciate it if someone who does could write a few paragraphs for me to add to that page on how to download one of the epub files to an XO and then open it up in FBReader.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1555&#34;&gt;January 20, 2008 2:33 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I grabbed FBReader and Little Bo-Peep and tried running them together. The experience was confusing as hell.&lt;/p&gt;
&lt;p&gt;The book came up with a nearly blank screen, with the big arrow icons (&amp;ldquo;Move Forward&amp;rdquo; and &amp;ldquo;Move Back&amp;rdquo;) grayed out, and with a mysterious &amp;ldquo;1 out of 3&amp;rdquo; on the progress bar at the bottom. Only when I hit the Page Down key (mouse scrolling also worked) did I see the cover illustration, and there was zero rhyme or reason that I could discern for the progress bar shifting to 2/3 and eventually to 3/3. The big arrow icons remained grayed out the whole time.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m moaning like this because I&amp;rsquo;m supposing that there&amp;rsquo;s something your conversion could be doing differently to change the situation. If it&amp;rsquo;s just FBReader bugs, then so be it.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1556&#34;&gt;January 20, 2008 3:40 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi John,&lt;/p&gt;
&lt;p&gt;The 1/3 etc. thing is just FBReader&amp;rsquo;s interface. The blank screen until you press page (or cursor) down happened to me with my subnotebook Lifebook running Ubuntu, because the screen is small, but not on a Windows machine with a bigger screen. More importantly, a friend with an XO said that the first image showed there upon first loading. How big was the screen where you tried this?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Scraping and linked data</title>
      <link>https://www.bobdc.com/blog/scraping-and-linked-data/</link>
      <pubDate>Fri, 11 Jan 2008 08:59:40 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/scraping-and-linked-data/</guid>
      
      
      <description><div>Wired Magazine gives scraping the buzzword treatment but remains clueless about the semantic web and linked data.</div><div>&lt;p&gt;The latest issue of Wired has an article with the provocative title of &lt;a href=&#34;http://www.wired.com/techbiz/media/magazine/16-01/ff_scraping&#34;&gt;The Data Wars&lt;/a&gt; about web sites built around data retrieved by &amp;ldquo;bots&amp;rdquo; doing &amp;ldquo;scraping&amp;rdquo;. I quote these because the article twists the terms a bit to make them and their subjects seem more dramatic, more cutting edge, and—you guessed it—more &amp;ldquo;Web 2.0&amp;rdquo;.&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;I see three historical phases for this kind of data retrieval, and Wired still doesn&#39;t know about the third.&lt;/blockquote&gt;
&lt;p&gt;The dramatic tension of the article is the conflict between, on the one hand, craigslist and other large sites with lots of valuable public data, and on the other hand, the sites who pull some of this data, &amp;ldquo;remix&amp;rdquo; it to be more useful to their own audience, and then put ads around their &amp;ldquo;mashups&amp;rdquo;. (Somehow, code monkeys surrounded by earth-toned cubicle fabric think that it makes them resemble DJs surrounded by crates of vinyl if they use musical buzzwords to refer to the act of combining multiple things into a new one. If I wrote a Turbo Pascal program twenty years ago that included a few existing libraries that I collected from different sources, was that a mashup, or was it a remix?)&lt;/p&gt;
&lt;p&gt;According to the article, scraping&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;refers to the act of automatically harvesting information from another site and using the results for sometimes nefarious activities. (Some scrapers, for instance, collect email addresses from public web sites and sell them to spammers)&amp;hellip; Scrapers write software robots using script languages like Perl, PHP, or Java. They direct the bots to go out (either from a Web server or a computer of their own) to the target site and, if necessary, log in. Then the bots copy and bring back the requested payload, be it images, lists of contact information, or a price catalog.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So using &lt;a href=&#34;http://www.gnu.org/software/wget/&#34;&gt;wget&lt;/a&gt; or &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;curl&lt;/a&gt; to pull down a text file and then feeding that file to a Perl script that looks for and extracts strings that match certain patterns is now a command to a robot (excuse me, &amp;ldquo;bot&amp;rdquo;) army to go forth and retrieve payloads. I suppose that if we do this after the sun goes down, we can refer to our scripts as an &amp;ldquo;unholy army of the night&amp;rdquo; for added excitement.&lt;/p&gt;
&lt;p&gt;I see three historical phases for this kind of data retrieval, and Wired still doesn&amp;rsquo;t know about the third. The first, which people have been doing since late in the last century, involves retrieving files and then running scripts to find and pull out useful information as described above. (By the way, John Cowan has just put out a &lt;a href=&#34;http://recycledknowledge.blogspot.com/2008/01/tagsoup-12-released-at-long-last.html&#34;&gt;new release&lt;/a&gt; of TagSoup, a parser that converts HTML retrieved from &amp;ldquo;in the wild&amp;rdquo; to well-formed XML. My own base tool set for scraping is wget, TagSoup, and XSLT; lately I&amp;rsquo;ve been using these to get Project Gutenberg metadata about public domain children&amp;rsquo;s books such as &lt;a href=&#34;http://www.gutenberg.org/etext/23598&#34;&gt;Little Bo Peep&lt;/a&gt;.)&lt;/p&gt;
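&lt;p&gt;The pattern-matching half of that first phase needs nothing exotic. A hypothetical page standing in for a wget download, plus standard grep and sed (rather than any particular scraping toolkit), gives the flavor:&lt;/p&gt;

```shell
# Stand-in for a downloaded file: a local HTML page with two links.
cat > page.html <<'EOF'
<html><body>
<a href="http://example.com/a">A</a>
<a href="http://example.com/b">B</a>
</body></html>
EOF
# Extract the href values -- "scraping" at its most basic.
grep -o 'href="[^"]*"' page.html | sed 's/^href="//;s/"$//'
# prints:
#   http://example.com/a
#   http://example.com/b
```

Real pages are messier, which is exactly why a cleanup pass like TagSoup followed by XSLT beats regular expressions once the markup gets ugly.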
&lt;p&gt;The second phase, implemented by sites that are willing to share some data but want to control that sharing, are APIs like those provided by &lt;a href=&#34;http://www.amazon.com/gp/browse.html?node=3435361&#34;&gt;Amazon&lt;/a&gt; and &lt;a href=&#34;http://code.google.com/&#34;&gt;Google&lt;/a&gt;, which the article covers. Since I first drafted this posting, the difference between scraping and API use became a big story in the data geek world after Facebook &lt;a href=&#34;http://scobleizer.com/2008/01/03/ive-been-kicked-off-of-facebook/&#34;&gt;disabled&lt;/a&gt; Robert Scoble&amp;rsquo;s account because he was beta testing &lt;a href=&#34;http://www.plaxo.com/&#34;&gt;Plaxo&lt;/a&gt;. This online address service &lt;a href=&#34;http://www.news.com/8301-13577_3-9839474-36.html&#34;&gt;scrapes Facebook instead of using its API&lt;/a&gt; because Facebook&amp;rsquo;s API doesn&amp;rsquo;t provide address book information, and Scoble has enough Facebook friends that trying to scrape all that data violated Facebook&amp;rsquo;s terms of service, or something. (If you think that this is a really big deal, Dan Brickley brings some &lt;a href=&#34;http://danbri.org/words/2008/01/07/249&#34;&gt;much-needed perspective&lt;/a&gt; to it.)&lt;/p&gt;
&lt;p&gt;The third phase of web-based data retrieval is the pulling down of data that was intentionally put into web pages for retrieval by automated processes. Unlike the data retrieved in the first phase of web data retrieval, this data goes into the web pages in a format that conforms to simple rules so that it&amp;rsquo;s immediately usable, with no requirement for pattern matching and rearranging. Unlike the APIs of the second phase, the new data is retrieved with a simple HTTP request (perhaps wget or curl) with no need to provide a login developer token or to make calls to specific processes that will then hand you the data if you make the calls correctly.&lt;/p&gt;
&lt;p&gt;There are multiple efforts working in this area. The &lt;a href=&#34;http://linkeddata.org/&#34;&gt;Linked Data&lt;/a&gt;, &lt;a href=&#34;http://www.w3.org/2001/sw/&#34;&gt;Semantic Web&lt;/a&gt;, and &lt;a href=&#34;http://microformats.org/&#34;&gt;microformats&lt;/a&gt; movements all overlap to some extent, but I don&amp;rsquo;t know of any single term that encompasses them all, unless an especially passionate advocate of one insists that the others are subsets of their work. The key difference between this work and the scraping described in the Wired article is that this third phase is about people putting up data that they &lt;em&gt;want&lt;/em&gt; others to retrieve and use. I don&amp;rsquo;t want you pulling my data and running it next to Google AdSense ads unless it helps me in some way. If the data consists of schedules for events that I charge money for, such as plane flights or movie showings, then I&amp;rsquo;m happy to let you drive more business to me. If I&amp;rsquo;m craigslist or Facebook, I just see you building a business model around my data with no benefit to me, and I don&amp;rsquo;t like it.&lt;/p&gt;
&lt;p&gt;Of course, it&amp;rsquo;s not purely about sharing data to make more money; the academic world also has plenty of research efforts with good reasons to share their data. I&amp;rsquo;ll be writing soon here about the value of seeking out organizations with strong motivations to share their data.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-1536&#34;&gt;January 11, 2008 11:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;In the Linked Data community we see Linked Data as ground zero (foundation layer) within the Semantic Web stack. It&amp;rsquo;s the part of the Semantic Web that deals with injecting structured data into the Web with Meshing/Mixing/Joining in mind from the onset (courtesy of HTTP based Data Object Identifiers or URIs, RDF Data Model, and HTTP Content Negotiation).&lt;/p&gt;
&lt;p&gt;Hopefully, we might be able to use the term: Meshing as a constructive distinguishing mechanism between Web 2.0 and Web.vNext (3.0, Semantic Web, Semantic Data web, Linked Data Web, Giant Global Graph or whatever label eventually sticks).&lt;/p&gt;
&lt;p&gt;Happy New Year!&lt;/p&gt;
&lt;p&gt;Kingsley&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-1540&#34;&gt;January 11, 2008 3:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As a big fan of Dan Bricklin&amp;rsquo;s various work, I was dissapointed to find that your intriguing link above points to my blog instead. Shome mishtake shurely?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1541&#34;&gt;January 11, 2008 3:49 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oops. Sorry. Corrected.&lt;/p&gt;
&lt;p&gt;You danbris confuse me.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Command line processing with the DITA Open Toolkit</title>
      <link>https://www.bobdc.com/blog/command-line-processing-with-t/</link>
      <pubDate>Wed, 09 Jan 2008 18:06:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/command-line-processing-with-t/</guid>
      
      
      <description><div>A new IBM developerWorks article.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.ibm.com/developerworks/library/x-tipditajavacmd.html&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/dw-home2.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;developerWorks logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The first time I installed the open source &lt;a href=&#34;http://dita-ot.sourceforge.net/&#34;&gt;DITA Open Toolkit&lt;/a&gt; I ran the demo transformation and thought &amp;ldquo;great, now what?&amp;rdquo; It was all very &lt;a href=&#34;http://ant.apache.org/&#34;&gt;Ant&lt;/a&gt;-driven, and while I know the basics of Ant I didn&amp;rsquo;t know where to start with the fairly complex demo script.&lt;/p&gt;
&lt;p&gt;I later found that the toolkit&amp;rsquo;s &lt;code&gt;ant&lt;/code&gt; subdirectory has some small, manageable templates for Ant build files, but meanwhile I discovered a jar file at the core of the toolkit that could be invoked from the command line with no need to write Ant files. This makes it easier to jump right in with using the toolkit with your own DITA documents, so I wrote up what I learned and the article &lt;a href=&#34;https://web.archive.org/web/20080628064854/http://www.ibm.com/developerworks/library/x-tipditajavacmd.html&#34;&gt;Easy command line processing with the DITA Open Toolkit&lt;/a&gt; just went live on IBM developerWorks.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
    </item>
    
    <item>
      <title>Information wants to be expensive</title>
      <link>https://www.bobdc.com/blog/information-wants-to-be-expens/</link>
      <pubDate>Sun, 06 Jan 2008 11:40:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/information-wants-to-be-expens/</guid>
      
      
      <description><div>And, out of context, it can mislead.</div><div>&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Stewart_Brand&#34;&gt;&lt;img src=&#34;http://media.ted.com/images/ted/8497_254x191.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;160px&#34; alt=&#34;Stewart Brand&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;How many people know that one of the IT world&amp;rsquo;s tritest catch phrases is only half of a much more sensible description of two opposing forces? My title above may appear to contradict the well-known truism that &amp;ldquo;information wants to be free&amp;rdquo;, but it&amp;rsquo;s actually the other half of Stewart Brand&amp;rsquo;s &lt;a href=&#34;http://www.anu.edu.au/people/Roger.Clarke/II/IWtbF.html&#34;&gt;original quote&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On the one hand information wants to be expensive, because it&amp;rsquo;s so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note &lt;em&gt;why&lt;/em&gt; information wants to be free: &amp;ldquo;because the cost of getting it out is getting lower and lower all the time.&amp;rdquo; He&amp;rsquo;s describing a trend in publishing by anthropomorphizing information to make his image more vivid, not ascribing volition to an abstract concept. (Being an abstraction, information can&amp;rsquo;t really &lt;em&gt;want&lt;/em&gt; anything.) Brand was saying that, while information can be very valuable, advances in publishing technology create a force that continually lowers the cost of distributing it, and &amp;ldquo;you have these two [forces] fighting against each other&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;The expression &amp;ldquo;but information wants to be free&amp;rdquo; gets over 900 hits on Google, showing that many people think that it&amp;rsquo;s a valid point in an argument. The next time you hear someone use it, ask them &amp;ldquo;why does it want to be free?&amp;rdquo; or even &amp;ldquo;says who?&amp;rdquo; Call me cynical, but I don&amp;rsquo;t think you&amp;rsquo;re going to get good answers.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By Rick Jelliffe on &lt;a href=&#34;#comment-1531&#34;&gt;January 7, 2008 8:33 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And if they do know it was Stewart Brand, you could still stump them by asking them to sing the Karl Bartos song &amp;ldquo;Information wants to be free&amp;rdquo; :-)&lt;/p&gt;
&lt;p&gt;Actually, I was wondering about this phrase just last week&amp;hellip;Bob the Psychic!&lt;/p&gt;
&lt;p&gt;By Ontic on &lt;a href=&#34;#comment-1532&#34;&gt;January 8, 2008 10:35 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;From my understanding of Heidegger, I believe he would state that information is free (or has a low marginal cost) it is knowledge (skillful doing) that is expensive and that has high marginal cost to obtain.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-1533&#34;&gt;January 8, 2008 10:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This depends on Heidegger&amp;rsquo;s original wording and how we choose to translate it. I don&amp;rsquo;t want to debate ideas about information and its cost as much as I want to restore the overused five-word Brand quote to its original context.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2008">2008</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Lazyweb grants a Christmas wish</title>
      <link>https://www.bobdc.com/blog/lazyweb-grants-a-christmas-wis/</link>
      <pubDate>Mon, 31 Dec 2007 17:26:15 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/lazyweb-grants-a-christmas-wis/</guid>
      
      
      <description><div>An epub-reading ebook reader for the OLPC laptop.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.teleread.org/blog/2007/12/28/eureka-fbreader-already-running-on-olpc-laptop-epub-books-in-time-for-one-laptop-per-child-kids/&#34;&gt;&lt;img src=&#34;http://upload.wikimedia.org/wikipedia/en/8/8f/OLPC_logo.png&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;OLPC logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently &lt;a href=&#34;http://groups.google.com/group/fbreader/browse_thread/thread/f464cc63bb10c669/d3c223ba1ecaf2d0?#d3c223ba1ecaf2d0&#34;&gt;asked&lt;/a&gt; on an &lt;a href=&#34;http://www.fbreader.org/&#34;&gt;FBReader&lt;/a&gt; mailing list about the possibility of porting this ebook reading program to the &lt;a href=&#34;http://www.laptop.org/&#34;&gt;One Laptop Per Child&lt;/a&gt; XO computer. A good ebook-reading program capable of reading &lt;a href=&#34;http://www.idpf.org/&#34;&gt;epub&lt;/a&gt; files was an obvious application for this machine, and once I got FBReader running on my Ubuntu laptop it seemed like the best candidate for an XO ebook reader.&lt;/p&gt;
&lt;p&gt;The ensuing thread shows that David Rothman of the &lt;a href=&#34;http://www.teleread.org/blog/&#34;&gt;Teleread&lt;/a&gt; weblog (&amp;ldquo;News &amp;amp; views on e-books, libraries, publishing and related topics&amp;rdquo;) joined in, and he &lt;a href=&#34;http://www.teleread.org/blog/2007/12/21/needed-asap-on-the-100-laptop-fbreader-and-easy-opera-installation/&#34;&gt;blogged&lt;/a&gt; the issue. Friday, David &lt;a href=&#34;http://www.teleread.org/blog/2007/12/28/eureka-fbreader-already-running-on-olpc-laptop-epub-books-in-time-for-one-laptop-per-child-kids/&#34;&gt;announced&lt;/a&gt; that Bennett Todd had worked out how to install the FBReader on the XO and posted the instructions. It still requires a comfort level with the command line, so installation won&amp;rsquo;t be easy for most XO owners at this point, but the fact that it&amp;rsquo;s running at all shows that the biggest hurdle has been cleared.&lt;/p&gt;
&lt;p&gt;I take back &lt;a href=&#34;https://www.bobdc.com/blog/the-cheap-commodity-ebook-read&#34;&gt;what I said&lt;/a&gt; about comparisons of Amazon&amp;rsquo;s Kindle to the XO being a red herring, because &lt;a href=&#34;http://dubinko.info/blog/2007/12/24/olpc-is-here/&#34;&gt;according to Micah Dubinko&lt;/a&gt;, the screen resolution in the XO&amp;rsquo;s monochrome mode is appreciably better than the standard mode, so the machine looks like a good candidate for an ebook reading device after all.&lt;/p&gt;
&lt;p&gt;In fact, considering the target audience of the OLPC project, it&amp;rsquo;s a great candidate.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Apple marketing slogans</title>
      <link>https://www.bobdc.com/blog/apple-marketing-slogans/</link>
      <pubDate>Fri, 28 Dec 2007 08:36:59 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/apple-marketing-slogans/</guid>
      
      
      <description><div>Warm! Fuzzy!</div><div>&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/nice-parodies-of-mac-hipper-th&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/macguycryinggirl.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Mac guy and crying girl&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://youtube.com/watch?v=ZtPPFZERXyg&#34;&gt;The computer for the rest of us.&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Think_Different&#34;&gt;Think different.&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.theapplecollection.com/Collection/AppleMovies/mov/concert_144a.html&#34;&gt;Rip. Mix. Burn.&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://hubpages.com/hub/Apple-Sends-3rd-Grader-Cease-And-Desist-Letter&#34;&gt;Hey little nine-year-old girl&lt;/a&gt;, cease and desist sending product ideas about the iPod Nano to Steve Jobs like you did in that letter you sent three months ago. If you want to know why, see our legal policy on apple.com.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apparently the story is two years old, but it had enough of a surge in recent popularity to make &lt;a href=&#34;http://www.npr.org/programs/waitwait/&#34;&gt;Wait Wait Don&amp;rsquo;t Tell Me&lt;/a&gt; last Saturday. It sure gave me a happy holiday smirk.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1524&#34;&gt;December 28, 2007 10:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The charming part of the story you linked to is the &amp;ldquo;I&amp;rsquo;m sure Steve Jobs didn&amp;rsquo;t know&amp;rdquo;, which reminds me irresistibly of &amp;ldquo;If only the Tsar knew of our plight!&amp;rdquo; from &lt;em&gt;Fiddler on the Roof&lt;/em&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>ebooks, on ebook readers or not</title>
      <link>https://www.bobdc.com/blog/ebooks-on-ebook-readers-or-not/</link>
      <pubDate>Fri, 21 Dec 2007 08:50:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/ebooks-on-ebook-readers-or-not/</guid>
      
      
      <description><div>New gadgets can be fun, but there&#39;s a lot you can do with the computer you already own.</div><div>&lt;p&gt;While doing some Christmas shopping at Best Buy last Saturday I asked one of their &lt;a href=&#34;http://www.geeksquad.com/&#34;&gt;Geek Squad&lt;/a&gt; guys at the information desk if they had the Sony ebook reader, because I wanted to see what one looked like up close. This low-end Napoleon Dynamite snorted &amp;ldquo;They tried ebooks a few years ago, and it was a complete failure&amp;rdquo;. I was tempted to press him on what he meant by &amp;ldquo;they&amp;rdquo;, but I just smiled, told him &amp;ldquo;it&amp;rsquo;s a little more complicated than that&amp;rdquo;, and headed for the cash registers.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.feedbooks.com/&#34;&gt;&lt;img src=&#34;http://www.teleread.org/blog/wp-content/uploads/2007/12/feedbooks2.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;feedbooks logo&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Various &lt;a href=&#34;https://www.bobdc.com/blog/the-cheap-commodity-ebook-read&#34;&gt;specialized ebook reading devices&lt;/a&gt; have come out over the years, usually with less success than their makers had hoped, and now Sony and Amazon&amp;rsquo;s Kindle are trying to find out if 2007 is the right time for such an appliance. Meanwhile, many people have a broader idea of what &amp;ldquo;ebooks&amp;rdquo; are, and it&amp;rsquo;s worked out quite well for them. Along with (or instead of) electronic versions of books to read on a specialized reading device, more and more publishers are making them available for reading on regular, general-purpose PCs.&lt;/p&gt;
&lt;p&gt;These might be offered in specialized ebook formats such as &lt;a href=&#34;http://www.idpf.org/&#34;&gt;epub&lt;/a&gt;, or they could be PDF files, or they could be offered in several formats at once. The &lt;a href=&#34;http://www.feedbooks.com/&#34;&gt;feedbooks&lt;/a&gt; site offers many literary classics in epub, Mobipocket/Kindle, Sony Reader, iLiad, and two PDF formats. While I have no intention of reading &lt;a href=&#34;http://www.feedbooks.com/discover/view_book/83&#34;&gt;War and Peace&lt;/a&gt; off of a regular computer screen (although it was nice to see that the Ubuntu version of &lt;a href=&#34;http://www.fbreader.org/&#34;&gt;FBReader&lt;/a&gt; displayed the epub version of feedbooks&amp;rsquo; &amp;ldquo;War and Peace&amp;rdquo; just fine on my Lifebook laptop), commercial publishers are finding people willing to pay for electronic versions of their books to use as reference material. For example, an editor at Manning told me that &lt;a href=&#34;http://www.manning.com/about/ebooks.html&#34;&gt;their ebooks&lt;/a&gt; do well among tech consultants who want to bring multiple books along on consulting engagements for reference without straining their backs.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://innodata-isogen.com&#34;&gt;My employer&lt;/a&gt; has created PDF versions of backlist books for clients who are now well-positioned to distribute these titles as electronic books like Manning does and to also use them for print-on-demand delivery. (I&amp;rsquo;m supposed to be careful about throwing clients&amp;rsquo; names around, so contact me privately if you want to hear more. To summarize, doing this is pretty straightforward if your content is already in XML, and if not, we have plenty of experience converting odd formats to XML.)&lt;/p&gt;
&lt;p&gt;For online delivery, though, dedicated ebook formats have one serious advantage over PDF files: PDFs are ultimately designed for printed pages, so that viewing them on a screen is like moving a window around a page that&amp;rsquo;s larger than the window. If you change the size of your window, the page doesn&amp;rsquo;t care. Reading a page with multiple columns means scrolling down and up and down and up and down all to read a single page. Change the size of a dedicated ebook reader program and it reflows the content to accommodate the window instead of making the window work around the page layout. (If you want to play with this, &lt;a href=&#34;http://www.adobe.com/products/digitaleditions/&#34;&gt;Adobe Digital Editions reader&lt;/a&gt;, &lt;a href=&#34;http://www.fbreader.org/&#34;&gt;FBreader&lt;/a&gt;, &lt;a href=&#34;http://www.mobipocket.com/en/HomePage/default.asp?Language=EN&#34;&gt;Mobipocket&lt;/a&gt;, and &lt;a href=&#34;http://www.dotreader.com/site/index.php&#34;&gt;dotReader&lt;/a&gt; are all free ebook reading programs.)&lt;/p&gt;
&lt;p&gt;So while I do like the Geek Squad ties—although I&amp;rsquo;d wear one with a long sleeve, more cotton-based shirt—I find their consumer electronics perspective on what new technology is going where to be a bit narrow. People are doing more cool things with ebooks all the time, both for free and for money, and I look forward to seeing where it goes in the next few years.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Stopping phone spam</title>
      <link>https://www.bobdc.com/blog/stopping-phone-spam/</link>
      <pubDate>Sun, 16 Dec 2007 11:52:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/stopping-phone-spam/</guid>
      
      
      <description><div>In the U.S., at least.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.fcc.gov/cgb/donotcall/&#34;&gt;&lt;img src=&#34;http://www.fcc.gov/cgb/donotcall/images/donotcall-banner.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;some description&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A few weeks ago an autodialer at a &lt;a href=&#34;http://www.michaels.com&#34;&gt;crafts supply chain&lt;/a&gt; called us up and played a recording of some perky woman telling us about their fabulous new deals. I called the local store, asked for a manager, and told them that this was a bad thing to do. She was unaware of this bit of marketing. Two days ago I picked up the ringing phone and heard a recorded message from the car dealership where we once bought a car telling us about the &amp;ldquo;Holiday Express&amp;rdquo; shuttle service that would take people back and forth to the mall a half mile away while they waited for maintenance work on their cars.&lt;/p&gt;
&lt;p&gt;I called them to express my annoyance as well, and got a little worried: was this a trend? It could be far worse than email spam, because you can look through 100 spam candidates in a special folder of your email program in just a few seconds at your convenience, but when the phone rings, you may not know if it&amp;rsquo;s important or marketing nonsense, so you pick it up.&lt;/p&gt;
&lt;p&gt;As it turns out, it&amp;rsquo;s much easier to deal with phone spam than email spam because it&amp;rsquo;s much easier to regulate. The U.S. Federal Communications Commission has some &lt;a href=&#34;http://www.fcc.gov/cgb/consumerfacts/tcpa.html&#34;&gt;excellent background&lt;/a&gt; on the topic, explaining that these calls must include a phone number that you can call to tell them to remove you from their list. I learned that because we&amp;rsquo;re on the national &lt;a href=&#34;http://en.wikipedia.org/wiki/Do_not_call_list&#34;&gt;Do Not Call list&lt;/a&gt;, the crafts chain and the car dealership violated FCC regulations. The same web page links to a complaint form where you can tell the FCC about a violation, and I took plenty of satisfaction in doing so.&lt;/p&gt;
&lt;p&gt;(You&amp;rsquo;ve gotta love the picture included with the FCC&amp;rsquo;s information page, reproduced above—a few clues in it give me the impression that it&amp;rsquo;s older than the four-year-old Do Not Call Registry.)&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://war-on-error.blogspot.com&#34; title=&#34;http://war-on-error.blogspot.com&#34;&gt;Michael Friedman&lt;/a&gt; on &lt;a href=&#34;#comment-1519&#34;&gt;December 18, 2007 10:20 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Are there distinctions between cell phones and dedicated land lines for this, or is it just related to the phone number? When my wife and I moved to Dallas, we decided to drop our land line for good. Our &amp;ldquo;home&amp;rdquo; phone is our cell phone, and we&amp;rsquo;ve gotten along just fine in the last two years without a dedicated line. We rarely get calls to our cell phones, but occasionally the computer voice butts in and it&amp;rsquo;s particularly irritating because they are using up our (plentiful) minutes.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-1520&#34;&gt;December 18, 2007 10:59 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yes, the third mention of the word &amp;lsquo;wireless&amp;rsquo; on the &lt;a href=&#34;http://www.fcc.gov/cgb/consumerfacts/tcpa.html&#34;&gt;http://www.fcc.gov/cgb/consumerfacts/tcpa.html&lt;/a&gt; page that I pointed to says that it&amp;rsquo;s against FCC regulations to send these calls to wireless phones, even if you&amp;rsquo;re not on the Do Not Call registry, or &amp;ldquo;any other service for which the person being called would be charged for the call&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.TimothyHorrigan.com/election2008.html&#34; title=&#34;http://www.TimothyHorrigan.com/election2008.html&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-1522&#34;&gt;December 20, 2007 10:32 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I have been volunteering to do phone calls for a Presidential candidate in New Hampshire. In spite of the incredible hype which has been going on for a year now (actually more than a year&amp;hellip; 2004/2008 repeat candidates Kucinich and Edwards never really stopped running) about 40% of the registered voters will not vote. Another large faction takes pride in not deciding until election day. And my guy (OK, it&amp;rsquo;s Edwards, even if I did go to college with Obama) will get maybe 35% of the vote even if everything goes right. Anyway, we have to call about a million people to find the 100,000 or so who want to talk to us.&lt;/p&gt;
&lt;p&gt;A lot of those million gripe that they are on the no-call list. However, the No Call List rule actually doesn&amp;rsquo;t apply to political campaigns (although we sure don&amp;rsquo;t want to waste time calling people who don&amp;rsquo;t want to listen to us.) It also does not apply to charities (who have some of the most annoying telemarketers of all) or to companies you have an ongoing business relationship with (e.g., that annoying national crafts store chain.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Roy Head&#39;s &#34;Treat Her Right&#34;</title>
      <link>https://www.bobdc.com/blog/roy-heads-treat-her-right/</link>
      <pubDate>Mon, 10 Dec 2007 08:42:42 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/roy-heads-treat-her-right/</guid>
      
      
      <description><div>Somersaulting off the stage—in a buttoned-up double-breasted suit.</div><div>&lt;p&gt;Texan &lt;a href=&#34;http://en.wikipedia.org/wiki/Roy_Head&#34;&gt;Roy Head&lt;/a&gt; had several small and regional hits and one really big one: &amp;ldquo;Treat Her Right&amp;rdquo;, a single that was second only to the Beatles&amp;rsquo; &amp;ldquo;Yesterday&amp;rdquo; at its peak on the U.S. charts. The chorus doesn&amp;rsquo;t come until the very end, but what a chorus, and it&amp;rsquo;s well worth the wait through the long slow burn leading up to it. The song remains an R&amp;amp;B standard for many bar bands, at least among the hipper ones.&lt;/p&gt;
&lt;p&gt;Head, who surprised more than one interviewer and concert venue by being white when he showed up, was and still is known for a wild stage show, as the &lt;a href=&#34;http://www.youtube.com/watch?v=2BLQgQNuN-M&#34;&gt;following video&lt;/a&gt; amply demonstrates.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/2BLQgQNuN-M&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;As he described in a &lt;a href=&#34;http://www.wfmu.org/playlists/shows/24339&#34;&gt;recent interview&lt;/a&gt;, many of the moves that may look like they came from James Brown actually come from Jackie Wilson, a big influence on Brown, Head, Michael Jackson, and many others. YouTube has &lt;a href=&#34;http://www.youtube.com/watch?v=AXj8JLSVWvc&#34;&gt;another version&lt;/a&gt; of Head doing &amp;ldquo;Treat Her Right&amp;rdquo; whose slower burn and slightly different arrangement lead me to believe that it&amp;rsquo;s a live performance and not lip synching.&lt;/p&gt;
&lt;p&gt;In the same interview, he mentioned a son named Sundance who had gotten pretty far on American Idol and is still successful. I asked my wife and daughters about him, and my wife said &amp;ldquo;his parents are hippies or something&amp;rdquo;. The video above demonstrates otherwise, but naming your kid &amp;ldquo;Sundance Head&amp;rdquo; could give that impression. (It turns out that this is just a nickname, and his real first name is the more prosaic &amp;ldquo;Jason&amp;rdquo;.) My younger daughter insisted that I had seen Sundance on the show, but I don&amp;rsquo;t pay very close attention to Simon, Paula, et al. so I don&amp;rsquo;t remember.&lt;/p&gt;
&lt;p&gt;The same daughter likes very little of the music that I do and is suspicious of anything in black and white that doesn&amp;rsquo;t star Lucille Ball and Desi Arnaz, but when I showed her the video above she said &amp;ldquo;Wow, the father is &lt;em&gt;&lt;strong&gt;so&lt;/strong&gt;&lt;/em&gt; much better.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;To quote &lt;a href=&#34;http://www.hbo.com/entourage/cast/character/drama.html&#34;&gt;Johnny Drama&lt;/a&gt; from Entourage: &amp;ldquo;Victory!&amp;rdquo;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>XHTML 2 for authoring?</title>
      <link>https://www.bobdc.com/blog/xhtml-2-for-authoring/</link>
      <pubDate>Wed, 05 Dec 2007 17:50:33 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xhtml-2-for-authoring/</guid>
      
      
      <description><div>Suddenly it all made sense.</div><div>&lt;p&gt;On Monday I gave a talk at &lt;a href=&#34;http://2007.xmlconference.org/&#34;&gt;XML 2007&lt;/a&gt; titled &lt;a href=&#34;http://2007.xmlconference.org/public/schedule/detail/302&#34;&gt;XHTML 2 for Publishers: New opportunities for storing interoperable content and metadata&lt;/a&gt;. It used a lot of material from the article &lt;a href=&#34;http://www.ibm.com/developerworks/library/x-xhtml2now.html&#34;&gt;Put XHTML 2 to work now&lt;/a&gt; that I did for developerWorks, but with a greater focus on the potential value to publishers.&lt;/p&gt;
&lt;p&gt;The session that preceded mine in the Publishing Track room was &lt;a href=&#34;http://2007.xmlconference.org/public/schedule/detail/360&#34;&gt;Where are XML authoring tools today, where are they going, and what do we want?&lt;/a&gt;, in which Marc Jacobson of Really Strategies moderated a panel discussion on XML authoring by representatives of Just Systems XMetal, Xopus, and Adobe. A major theme in that discussion was how Microsoft Word has set people&amp;rsquo;s expectations for a writing environment, and a minor theme was how people learning about XML-based authoring can be intimidated by a huge number of elements to learn about possibly using for their document.&lt;/p&gt;
&lt;p&gt;I had an idea, and discussed it with several people during the question session after my talk: how about having people author editorial content in XHTML 2? While previous versions of HTML were useful for little more than shipping pages to browsers for display, the additional structure, semantics, and metadata that you can add to an XHTML 2 document make it a more reasonable option for authoring content that may end up being used in a variety of media. There are plenty of people who have created web pages and don&amp;rsquo;t expect to use Word to do so, so they don&amp;rsquo;t have the expectation of a &lt;strong&gt;B&lt;/strong&gt; button to bold text, an &lt;em&gt;I&lt;/em&gt; button to italicize, and change bars to identify revisions. They&amp;rsquo;re already familiar with the basic elements of HTML, and the few new XHTML 2 ones they&amp;rsquo;d have to learn (for example, &lt;code&gt;h&lt;/code&gt; for headers and the &lt;code&gt;section&lt;/code&gt; element) are intuitive enough for them to pick up without much trouble.&lt;/p&gt;
&lt;p&gt;If I were going to set up such an authoring environment, I&amp;rsquo;d customize the XHTML 2 schema to impose a few more constraints, none of which should appear illogical to people who&amp;rsquo;ve created web pages before and none of which would make the documents invalid XHTML 2. For example, instead of letting authors put &lt;code&gt;h&lt;/code&gt;, &lt;code&gt;h1&lt;/code&gt;, &lt;code&gt;h2&lt;/code&gt;, and other header elements anywhere they wanted, I&amp;rsquo;d remove the &lt;code&gt;h1&lt;/code&gt; through &lt;code&gt;h6&lt;/code&gt; elements from the schema and only allow &lt;code&gt;h&lt;/code&gt; elements as the first element of &lt;code&gt;body&lt;/code&gt; and &lt;code&gt;section&lt;/code&gt; elements. (Perhaps I&amp;rsquo;d even require &lt;code&gt;h&lt;/code&gt; as the first element of a &lt;code&gt;section&lt;/code&gt; element, depending on the content being authored.)&lt;/p&gt;
&lt;p&gt;To add additional semantics, I might make XHTML 2&amp;rsquo;s new &lt;code&gt;role&lt;/code&gt; attribute required for certain elements and specify a list of allowable values that could be entered there—again, depending on the content being authored. If an application that used this content needed XML that was not XHTML 2, these values could be used as hooks to transform the content to conform to another schema.&lt;/p&gt;
&lt;p&gt;Metadata would also depend on the needs of the shop setting up the authoring environment, and because mixed content isn&amp;rsquo;t an issue for this, XForms or InfoPath forms would be a sensible way to gather this information and then insert it as RDFa in the appropriate places in the document.&lt;/p&gt;
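&lt;p&gt;As a rough sketch of what a document written under such constraints might look like (the &lt;code&gt;role&lt;/code&gt; value and the &lt;code&gt;ex:&lt;/code&gt; vocabulary below are made-up examples, not part of any standard; only the Dublin Core namespace is real):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;html xmlns="http://www.w3.org/2002/06/xhtml2/"
      xmlns:ex="http://example.org/roles#"
      xmlns:dc="http://purl.org/dc/elements/1.1/"&amp;gt;
  &amp;lt;head&amp;gt;&amp;lt;title&amp;gt;Sample article&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;
  &amp;lt;body&amp;gt;
    &amp;lt;h&amp;gt;Sample article&amp;lt;/h&amp;gt;
    &amp;lt;section role="ex:sidebar"&amp;gt;
      &amp;lt;h&amp;gt;A required section heading&amp;lt;/h&amp;gt;
      &amp;lt;p property="dc:description"&amp;gt;Metadata gathered from a form could
      be inserted here as RDFa.&amp;lt;/p&amp;gt;
    &amp;lt;/section&amp;gt;
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;role&lt;/code&gt; values would give a downstream transformation its hooks for mapping sections to whatever schema the delivery application needs.&lt;/p&gt;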
&lt;p&gt;There are cases where this wouldn&amp;rsquo;t be a good idea, but there are cases where it could be a good idea. It fits in well with the main thesis of my talk: unlike all versions of HTML being developed before (or concurrently with) XHTML 2, XHTML 2 can be useful for more than just shipping pages to browsers for display.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.peterkrantz.com&#34; title=&#34;http://www.peterkrantz.com&#34;&gt;Peter Krantz&lt;/a&gt; on &lt;a href=&#34;#comment-1460&#34;&gt;December 6, 2007 3:17 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We have found XHTML2 to be perfect for legal documents. The defined elements in XHTML2 cover all structural requirements and RDFa allows the creator to add stuff from the legal domain.&lt;/p&gt;
&lt;p&gt;On a national level we define a basic vocabulary for law makers. Each government authority can then add their own domain specific information in the same document. In the end, the XHTML2 document carries a lot of information and can be parsed from different perspectives and be converted to HTML/PDF/Whatever.&lt;/p&gt;
&lt;p&gt;I foresee that XHTML2 may be great for this type of work but I doubt that it will ever become the preferred way to create web pages.&lt;/p&gt;
&lt;p&gt;On the downside (currently) is the lack of tool support. We implemented our own in-browser editor to create legal documents (e.g. it has buttons for &amp;ldquo;legal paragraph&amp;rdquo; and other things from the domain). We have also extended MS Word.&lt;/p&gt;
&lt;p&gt;What do you use when you want to provide an editor for people who shouldn&amp;rsquo;t see the XML (i.e. WYSIWYM)?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-1461&#34;&gt;December 6, 2007 3:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I haven&amp;rsquo;t assembled such a system for others, and for my own work I use Emacs+nxml (see &lt;a href=&#34;http://www.snee.com/bobdc.blog/2007/04/using_xhtml_2_schemas.html&#34;&gt;http://www.snee.com/bobdc.blog/2007/04/using_xhtml_2_schemas.html&lt;/a&gt; ).&lt;/p&gt;
&lt;p&gt;After nearly five years at LexisNexis, I found it very interesting to hear that you use XHTML 2 for legal documents. What governments are using this (or working toward doing so), and is the documentation available on the public web? Has it been a problem that XHTML 2 is still in Working Draft status?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>The cheap commodity eBook reader of the future</title>
      <link>https://www.bobdc.com/blog/the-cheap-commodity-ebook-read/</link>
      <pubDate>Wed, 28 Nov 2007 09:43:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-cheap-commodity-ebook-read/</guid>
      
      
      <description><div>If Kindle is the iPod of eBooks, I&#39;m waiting for the Sansa.</div><div>&lt;p&gt;My MP3 player is a &lt;a href=&#34;http://www.amazon.com/dp/B000IM9542/?tag=bobducharmeA&#34;&gt;Sandisk Sansa&lt;/a&gt;. It does everything I would want an iPod to do with a slightly clunkier interface for much less money. I can move music, podcasts, and playlists onto it from free software on both Windows and Linux machines, and it works with all standard MP3 accessories such as my noise-cancelling headphones and my in-car player. It can even record live room sound.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t think I&amp;rsquo;d even heard of Sandisk before I bought the device. I always assumed that they were based along the western edge of the Pacific Rim, but I &lt;a href=&#34;http://en.wikipedia.org/wiki/Sandisk&#34;&gt;just found out&lt;/a&gt; that they&amp;rsquo;re actually in California. No matter; the last time I needed an MP3 player, they had the best price/performance ratio in the local Best Buy, and they didn&amp;rsquo;t try to lock me into some distribution deal for the content I play on their device. (It may have included a CD and a deal with some music downloading service, but I ignored it and got along just fine.) Someday when I replace it, I might get something else, because I have no brand loyalty to it—I&amp;rsquo;m looking for good performance at a low price, not a fashion accessory.&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;Comparisons to the OLPC look like a red herring to me.&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m &lt;a href=&#34;https://www.bobdc.com/blog/ebook-hardware-readers-suddenl&#34;&gt;looking forward&lt;/a&gt; to getting an eBook reader, and the details of Amazon&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ASIN/B000FI73MA/ref=pd_sl_aw_manual-1_kindle1_40650458_1&#34;&gt;Kindle&lt;/a&gt; helped me realize what I really want in an eBook reader. It ain&amp;rsquo;t Kindle. While the term &amp;ldquo;delivery platform&amp;rdquo; has become an over-used buzz phrase, Amazon sure has created one: you pay them $400 for a device that only they can deliver content to.&lt;/p&gt;
&lt;p&gt;Some people (I&amp;rsquo;m sure to Amazon&amp;rsquo;s delight) have called it &amp;ldquo;the iPod of eBook readers&amp;rdquo;, but Apple never charged people to move MP3s from CDs that they already owned—or MP3s that they had recorded themselves of their own kids, bands or whatever—to the iPod. Once you&amp;rsquo;ve bought the Kindle, you can&amp;rsquo;t move your own content from your computer over a USB connection to your new eBook device to read it there; you have to mail it (if it&amp;rsquo;s on the approved list of formats such as Microsoft Word and JPEG) to your device via Amazon&amp;rsquo;s servers. No thanks.&lt;/p&gt;
&lt;p&gt;The key thing that Kindle has in common with its most well-known predecessor, &lt;a href=&#34;http://www.sonystyle.com/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10551&amp;amp;storeId=10151&amp;amp;langId=-1&amp;amp;productId=11038811&#34;&gt;Sony&amp;rsquo;s eBook reader&lt;/a&gt;, is the use of the &lt;a href=&#34;http://www.eink.com&#34;&gt;e-ink&lt;/a&gt; high-contrast, low-power display. What I&amp;rsquo;d really like to see is some commodity electronics manufacturer crank out an inexpensive machine around an e-ink display that lets us view &lt;a href=&#34;http://www.idpf.org/&#34;&gt;.epub&lt;/a&gt; eBooks and PDF files—the Sansa of eBook readers. Along with the low overall weight of the device, this high-contrast, low-power display is the most important feature of an eBook reader, because otherwise, why not just use the computers that we already own to read electronic documents? This is why comparisons to the &lt;a href=&#34;http://en.wikipedia.org/wiki/OLPC_XO-1&#34;&gt;OLPC&lt;/a&gt; look like a red herring to me—it doesn&amp;rsquo;t use much power, and I&amp;rsquo;m all for its general goals, but if the screen isn&amp;rsquo;t any better than the Dell my employer supplied me with or the Lifebook where I keep my own stuff, I don&amp;rsquo;t see the point of buying another computer just for reading.&lt;/p&gt;
&lt;p&gt;Publishers interested in the eBook market are &lt;a href=&#34;http://dearauthor.com/wordpress/2007/11/19/no-kindle-exclusivity-for-harlequin-readers/&#34;&gt;not giving Amazon exclusive distribution rights&lt;/a&gt;, which is encouraging to those of us who want to pick a reading device and then separately pick distribution channels (or even—just imagine!—make our own content for the device without relying on a big established distributor). I also want the ability to replace the device with another one from a different manufacturer and still use all the same content that I had paid for.&lt;/p&gt;
&lt;p&gt;It sure would be interesting to see who&amp;rsquo;s been buying the Linux-based $3,000 and $4,000 &lt;a href=&#34;http://www.eink.com/kits/index.html&#34;&gt;prototype kits&lt;/a&gt; from E-ink. Their website &lt;a href=&#34;http://www.eink.com/company/index.html&#34;&gt;says that&lt;/a&gt; Seiko and Casio have been working on E-ink timepieces. I&amp;rsquo;ve never known Seiko to make anything but watches, but Casio has a &lt;a href=&#34;http://www.casio.com/products/&#34;&gt;wide range of products&lt;/a&gt;. (I still have my old &lt;a href=&#34;http://hem.passagen.se/tkolb/art/synth/cz101_e.htm&#34;&gt;CZ-101&lt;/a&gt; keyboard.) I&amp;rsquo;d love to see Casio make a $150 E-ink reader, and then the Sandisks of the world could make their $99 versions.&lt;/p&gt;
&lt;p&gt;And if any of these devices can play MP3s through a headphone out jack, I promise not to complain that it&amp;rsquo;s trying to do too many things. Reading and listening to music often go well together.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.wowio.com&#34; title=&#34;http://www.wowio.com&#34;&gt;Paula Wellings&lt;/a&gt; on &lt;a href=&#34;#comment-1452&#34;&gt;December 3, 2007 10:25 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t understand the comment &amp;ldquo;Comparisons to the OLPC look like a red herring to me.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;My understanding was that the XO Laptop had a dual display&amp;ndash;regular rez colour and high rez black and white.&lt;/p&gt;
&lt;p&gt;Please explain. Many thanks.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-1453&#34;&gt;December 3, 2007 12:45 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I hadn&amp;rsquo;t heard about the high-res black and white. I look forward to checking it out.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>Customized cookbooks</title>
      <link>https://www.bobdc.com/blog/customized-cookbooks/</link>
      <pubDate>Tue, 20 Nov 2007 18:00:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/customized-cookbooks/</guid>
      
      
<description><div>Pay for professional recipes, or do it the XML geek way.</div><div>&lt;p&gt;The New York Times article &lt;a href=&#34;http://www.nytimes.com/2007/11/12/technology/12tastebook.html&#34;&gt;A Cookbook of One&amp;rsquo;s Own From the Internet&lt;/a&gt; (registration required) describes how Kamran Mohsenin, the founder of a photography web site, took an interesting step beyond personalized calendars: personalized cookbooks using recipes from &lt;a href=&#34;http://www.epicurious.com/&#34;&gt;epicurious.com&lt;/a&gt;, a web site with 25,000 recipes from Gourmet and Bon Appetit magazines. (I grew up with both of these magazines around the house, because my parents were big fans.) This reminds me of a quote I just read near the end of Stephen Colbert&amp;rsquo;s hilarious book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0446580503/bobducharmeA/&#34;&gt;I am America, and So Can You&lt;/a&gt;: &amp;ldquo;There&amp;rsquo;s a lot of repurposing of content yet to be done, believe me!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Thanksgiving_dinner&#34;&gt;&lt;img src=&#34;http://upload.wikimedia.org/wikipedia/en/thumb/8/8f/New_England_Thanksgiving_Dinner.jpg/300px-New_England_Thanksgiving_Dinner.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Thanksgiving menus&#34; width=&#34;200px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Mohsenin&amp;rsquo;s &lt;a href=&#34;http://www.tastebook.com/home&#34;&gt;TasteBook.com&lt;/a&gt; web site lets you pick 100 recipes, some cover images, and your own title, and then they&amp;rsquo;ll put it in a ring binder with a color hard cover for $34.95. It looks great, but someone who enjoys playing with free XML tools could do something similar without spending a dime. There&amp;rsquo;d be a few hours of work involved, but you&amp;rsquo;d have the bonus of something to add to your résumé.&lt;/p&gt;
&lt;p&gt;In two two-part articles that I wrote for XML.com (&amp;ldquo;Getting Started with XQuery&amp;rdquo; &lt;a href=&#34;http://www.xml.com/pub/a/2005/03/02/xquery.html&#34;&gt;part 1&lt;/a&gt;, &lt;a href=&#34;http://www.xml.com/pub/a/2005/03/23/xquery-2.html&#34;&gt;part 2&lt;/a&gt;; &amp;ldquo;Scaling up with XQuery&amp;rdquo; &lt;a href=&#34;http://www.xml.com/pub/a/2006/06/14/scaling-up-with-xquery-part-1.html&#34;&gt;part 1&lt;/a&gt;, &lt;a href=&#34;http://www.xml.com/pub/a/2006/06/21/scaling-up-with-xquery-part-2.html&#34;&gt;part 2&lt;/a&gt;), for sample data I used recipes from the &lt;a href=&#34;http://dsquirrel.tripod.com/recipeml/indexrecipes2.html&#34;&gt;Squirrel RecipeML archive&lt;/a&gt;, which has 10,000 public domain recipes marked up in XML. XQuery makes it easy to retrieve all the recipes meeting certain conditions, such as those that mention a certain ingredient, so dynamic generation of customized cookbooks would be simple. Acting as an HTTP server, XQuery applications typically retrieve XML, convert it to HTML, and deliver it to a browser, but they don&amp;rsquo;t have to. You could retrieve the XML and convert it to XSL-FO (maybe with a little XSLT along the way) and create a PDF book. From there, you could use &lt;a href=&#34;https://www.bobdc.com/blog/selfpublishing-bound-hardcopy&#34;&gt;lulu&lt;/a&gt; to create the bound version, but you&amp;rsquo;d have to spend a few dollars there. An &lt;a href=&#34;http://www.idpf.org/&#34;&gt;.epub&lt;/a&gt; eBook is another option; I &lt;a href=&#34;http://www.jedisaber.com/eBooks/tutorial.asp&#34;&gt;recently learned&lt;/a&gt; how easy these are to make.&lt;/p&gt;
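&lt;p&gt;To sketch the idea (the collection name here is just a placeholder, and the element names are my recollection of RecipeML&amp;rsquo;s structure, so check them against the actual DTD), an XQuery that lists the titles of every recipe mentioning a particular ingredient could look something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for $recipe in collection(&amp;quot;recipes&amp;quot;)//recipe
where some $item in $recipe/ingredients//item
      satisfies contains(lower-case($item), &amp;quot;cranberry&amp;quot;)
return $recipe/head/title
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Feed the recipes that such a query selects to a stylesheet that generates XSL-FO and you have the makings of the customized PDF cookbook.&lt;/p&gt;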
&lt;p&gt;If you really want to turn this into something for your résumé, a &lt;a href=&#34;https://www.bobdc.com/blog/praising-dita&#34;&gt;DITA&lt;/a&gt; angle would be nice. When people believe too much of the hype around DITA and think that it will automate all of their XML-related publishing, I tell them that DITA is designed around topic-oriented content, and that it may or may not be a good fit for their content. My favorite example of content that is a good fit is cookbooks: of the three basic topic types in the DITA architecture, &amp;ldquo;task&amp;rdquo; has a structure that fits around the {title, description, ingredient list, assembly step list, conclusion} structure of a typical recipe with just a few renames. Once you automate the conversion of RecipeML files into DITA-compliant recipes, the DITA Open Toolkit can turn them into HTML, PDF, RTF, troff, and several other formats.&lt;/p&gt;
&lt;p&gt;To be honest, as I noted in one of the articles mentioned above, some RecipeML recipes will require a bit of manual cleanup before you can feed them to your application, because volunteer XML data entry isn&amp;rsquo;t always as well-formed as you&amp;rsquo;d like. I did it for 291 of these recipes for &lt;a href=&#34;http://www.xml.com/pub/a/2006/06/14/scaling-up-with-xquery-part-1.html&#34;&gt;part 1 of the &amp;ldquo;Scaling Up&amp;rdquo;&lt;/a&gt; article, so I know that it&amp;rsquo;s not too much work. Perhaps splitting up the cleanup and the coding would be a nice project for an XML-related class. On the last day of class, everyone could cook a recipe from the collection. (Can you tell that I&amp;rsquo;m an American bearing down on the fourth Thursday of November?)&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.whatsyourmedstyle.com/medstyle/demo.aspx?cat=b&#34; title=&#34;http://www.whatsyourmedstyle.com/medstyle/demo.aspx?cat=b&#34;&gt;Jayne&lt;/a&gt; on &lt;a href=&#34;#comment-1414&#34;&gt;November 20, 2007 8:03 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I for one am a big cookbook fan! This sounds really good, I&amp;rsquo;ll have to look into it for sure&amp;hellip; My new favorite is Rocco DiSpirito&amp;rsquo;s new book Real Life Recipes. It is full of usefull everyday quick and easy foods that are actually a joy to make.. Along with his usual helpfull tips and suggestions&amp;hellip; Rocco is on a roll right now with easy cooking ideas, he&amp;rsquo;s also doing these great video blogs for Bertolli&amp;rsquo;s Mediterranean style frozen dinners. You can check them out at &lt;a href=&#34;http://www.whatsyourmedstyle.com/medstyle/demo.aspx?cat=b&#34;&gt;http://www.whatsyourmedstyle.com/medstyle/demo.aspx?cat=b&lt;/a&gt;&lt;br /&gt;
They are really light, and quick in prepare time.. 10 mins and dinner is on the table&amp;hellip; I work with them so I got the inside info, but I gotta pass it along, these are too good to not talk about!!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>The Nazz: &#34;Open My Eyes&#34;</title>
      <link>https://www.bobdc.com/blog/the-nazz-open-my-eyes/</link>
      <pubDate>Thu, 15 Nov 2007 09:16:46 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-nazz-open-my-eyes/</guid>
      
      
      <description><div>Vintage rockin&#39; psychedelia.</div><div>&lt;p&gt;When I was a teenager, Todd Rundgren had his one big hit, &amp;ldquo;Hello It&amp;rsquo;s Me&amp;rdquo;. Combining pop, jazz, and so-called &amp;ldquo;progressive rock&amp;rdquo; usually results in a self-indulgent mess instead of good pop music, but he hit the balance right for this one.&lt;/p&gt;
&lt;p&gt;I have the 45 RPM single of his first version of the song, a recording he did with his Philadelphia psychedelic sixties band &lt;a href=&#34;https://en.wikipedia.org/wiki/Nazz&#34;&gt;The Nazz&lt;/a&gt; in 1968. The B side, also available on &lt;a href=&#34;https://en.wikipedia.org/wiki/Nuggets:_Original_Artyfacts_from_the_First_Psychedelic_Era,_1965%E2%80%931968&#34;&gt;Nuggets&lt;/a&gt;, is &amp;ldquo;Open My Eyes&amp;rdquo;. After starting with the electric piano equivalent of the Kinks &amp;ldquo;You Really Got Me&amp;rdquo;, The Move&amp;rsquo;s &amp;ldquo;Do Ya&amp;rdquo;, and especially The Who&amp;rsquo;s &amp;ldquo;Can&amp;rsquo;t Explain&amp;rdquo;, the guitars kick in, and along with its own artsy touches, this thing really rocks. The lyrics seem to be about falling in love while tripping, a popular theme of the time (hell, even the squeaky-clean &lt;a href=&#34;http://www.youtube.com/watch?v=dYUQ131RYEU&#34;&gt;Cowsills did it&lt;/a&gt;). And there&amp;rsquo;s a &lt;a href=&#34;http://www.youtube.com/watch?v=llJG6pHhuBY&#34;&gt;YouTube video&lt;/a&gt; of the Nazz lip syncing it!&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/llJG6pHhuBY&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;You want flower power? How about a drummer using flowers instead of sticks?&lt;/p&gt;
&lt;p&gt;(Now that I&amp;rsquo;ve shown you this and the &lt;a href=&#34;https://www.bobdc.com/blog/the-13th-floor-elevators&#34;&gt;13th Floor Elevators&lt;/a&gt;, I promise not to search YouTube for every Nuggets band and then blog the results. This is left as an exercise for the reader, and there would be far worse ways to spend your time.) I&amp;rsquo;d be more curious to see Rundgren&amp;rsquo;s &lt;a href=&#34;http://www.rollingstone.com/news/story/9480060/the_cars_reform_with_rundgren&#34;&gt;current project&lt;/a&gt; of fronting the partially reunited &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Cars&#34;&gt;Cars&lt;/a&gt; if Rundgren and his former bandmates didn&amp;rsquo;t outnumber the actual former Cars in the group.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Querying DBpedia</title>
      <link>https://www.bobdc.com/blog/querying-dbpedia/</link>
      <pubDate>Fri, 09 Nov 2007 09:53:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/querying-dbpedia/</guid>
      
      
      <description><div>And looking forward to more.</div><div>&lt;p&gt;DBpedia, as its &lt;a href=&#34;http://dbpedia.org/&#34;&gt;home page&lt;/a&gt; tells us, &amp;ldquo;is a community effort to extract structured information from Wikipedia and to make this information available on the Web.&amp;rdquo; That&amp;rsquo;s &amp;ldquo;available&amp;rdquo; in the sense of available as data to programs that read and process it, because the data was already available to eyeballs on Wikipedia. This availability is a big deal to the semantic web community because it&amp;rsquo;s a huge amount of valuable (and often, fun) information that the public can now query with &lt;a href=&#34;http://www.w3.org/TR/rdf-sparql-query/&#34;&gt;SPARQL&lt;/a&gt;, the W3C standard query language that is one of the pillars of the semantic web.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://dbpedia.org/page/Chalkboard_gag&#34;&gt;&lt;img src=&#34;http://www.simpsoncrazy.com/product_pics/simpsons-hd-blackboard.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Bart at blackboard&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Although I&amp;rsquo;d &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.html&#34;&gt;dabbled in SPARQL&lt;/a&gt; and seen several &lt;a href=&#34;http://wiki.dbpedia.org/Datasets?v=lbf#h18-7&#34;&gt;sample SPARQL queries against DBpedia&lt;/a&gt; in action, I had a little trouble working out how to create my own SPARQL queries against DBpedia data. I finally managed to do it, so I thought I&amp;rsquo;d describe here how I successfully implemented my first use case. Instead of a &amp;ldquo;Hello World&amp;rdquo; example, I went with more of an &amp;ldquo;I will not publish the principal&amp;rsquo;s credit report&amp;rdquo; example: a list of things written by Bart on the school blackboard at the beginning of a collection of Simpsons episodes.&lt;/p&gt;
&lt;p&gt;For an example of the structured information available in Wikipedia, see the &lt;a href=&#34;http://wiki.dbpedia.org/Datasets?v=lbf#h18-8&#34;&gt;Infobox data&lt;/a&gt; on the right of the Wikipedia page for the 2001 Simpsons episode &lt;a href=&#34;http://en.wikipedia.org/wiki/Tennis_the_Menace&#34;&gt;Tennis the Menace&lt;/a&gt; and the Categories links at the bottom of the same page. The &lt;a href=&#34;http://dbpedia.org/page/Tennis_the_Menace&#34;&gt;DBpedia page&lt;/a&gt; for that episode shows the Infobox information with the property names that you would use in SPARQL queries; semantic web fans will recognize some of the property and namespace prefixes. I&amp;rsquo;m going to repeat this because it took a while for it to sink into my own head, and once it did it made everything much easier: most Wikipedia pages with fielded information have corresponding DBpedia pages, and those corresponding pages are where you find the names of the &amp;ldquo;fields&amp;rdquo; that you&amp;rsquo;ll use in your queries.&lt;/p&gt;
&lt;p&gt;Once I knew the following three things, I could create the SPARQL query:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The Simpson episode Wikipedia pages are the &lt;a href=&#34;http://wiki.dbpedia.org/Datasets?v=lbf#h18-4&#34;&gt;identified &amp;ldquo;things&amp;rdquo;&lt;/a&gt; that we would consider as the subjects of our RDF triples (or, put another way, as the objects in the {object, attribute name, attribute value} triplets that contain our data).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The bottom of the Wikipedia page for the &amp;ldquo;Tennis the Menace&amp;rdquo; episode tells us that it is a member of the Wikipedia category &amp;ldquo;The Simpsons (season 12) episodes&amp;rdquo;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The episode&amp;rsquo;s DBpedia page tells us that &lt;code&gt;http://dbpedia.org/property/blackboard&lt;/code&gt; is the property URI for the Wikipedia infobox &amp;ldquo;Chalkboard&amp;rdquo; field.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Knowing this, I created the following SPARQL query to list everything that Bart wrote on the blackboard in season 14 (revised 4/21/11 to account for changes in DBpedia vocabularies):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT * WHERE {
  ?episode &amp;lt;http://purl.org/dc/terms/subject&amp;gt;
   &amp;lt;http://dbpedia.org/resource/Category:The_Simpsons_%28season_14%29_episodes&amp;gt; .
  ?episode dbpedia2:blackboard ?chalkboard_gag .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can paste this into the form in &lt;a href=&#34;http://dbpedia.org/snorql/&#34;&gt;DBpedia&amp;rsquo;s SNORQL interface&lt;/a&gt; (which adds the namespace declarations that I&amp;rsquo;ve omitted above) to run it, or you can just click &lt;a href=&#34;http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B++%3Fepisode+%3Chttp://purl.org/dc/terms/subject%3E+++%3Chttp://dbpedia.org/resource/Category:The_Simpsons_%2528season_14%2529_episodes%3E+.++%3Fepisode+dbpedia2:blackboard+%3Fchalkboard_gag+.%7D&#34;&gt;here&lt;/a&gt; to execute the URL version of the query as created by the SNORQL interface.&lt;/p&gt;
&lt;p&gt;Of course this is just scratching the surface. More extensive use of SPARQL features (see Leigh Dodds&amp;rsquo; excellent &lt;a href=&#34;http://www.xml.com/pub/a/2005/11/16/introducing-sparql-querying-semantic-web-tutorial.html&#34;&gt;tutorial on XML.com)&lt;/a&gt; and more study of the available classes of &lt;a href=&#34;http://wiki.dbpedia.org/Datasets?v=lbf&#34;&gt;DBpedia data&lt;/a&gt; will turn up huge new possibilities. It&amp;rsquo;s like having a new toy to play with, and I know I&amp;rsquo;m going to have fun.&lt;/p&gt;
&lt;p&gt;For my next step, I was hoping to list what Bart wrote in all the episodes, not just a single season. The bottom of the &lt;a href=&#34;http://en.wikipedia.org/wiki/Category:The_Simpsons_episodes%2C_season_12&#34;&gt;Wikipedia page for season 12&lt;/a&gt; tells us that this category is part of the category &lt;a href=&#34;http://en.wikipedia.org/wiki/Category:The_Simpsons_episodes&#34;&gt;The Simpsons episodes&lt;/a&gt;, but I haven&amp;rsquo;t found a variation on the query above that makes the connection. I know that getting a handle on this category/subcategory distinction is going to open up what I can do with SPARQL and DBpedia in a lot more ways than just listing everything that Bart ever wrote on that blackboard.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://lespetitescases.net&#34; title=&#34;http://lespetitescases.net&#34;&gt;Got&lt;/a&gt; on &lt;a href=&#34;#comment-1392&#34;&gt;November 11, 2007 12:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice funny example of using Dbpedia.&lt;/p&gt;
&lt;p&gt;I developed some examples of using DBpedia. It&amp;rsquo;s in French, but these examples may interest you: &lt;a href=&#34;http://www.lespetitescases.net/dbpedia/&#34;&gt;http://www.lespetitescases.net/dbpedia/&lt;/a&gt;, e.g.:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a mashup with Google Maps (&lt;a href=&#34;http://www.lespetitescases.net/dbpedia/dbpedia-googlemaps.php?category=Category:Capitals_in_Europe&#34;&gt;http://www.lespetitescases.net/dbpedia/dbpedia-googlemaps.php?category=Category:Capitals_in_Europe&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;a list of people by birth city, using Exhibit (&lt;a href=&#34;http://www.lespetitescases.net/dbpedia/dbpedia-exhibit.php?place=http://dbpedia.org/resource/Amsterdam&#34;&gt;http://www.lespetitescases.net/dbpedia/dbpedia-exhibit.php?place=http://dbpedia.org/resource/Amsterdam&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m working on an English/French version.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://vaclav.synacek.com/&#34; title=&#34;http://vaclav.synacek.com/&#34;&gt;Vaclav Synacek&lt;/a&gt; on &lt;a href=&#34;#comment-1393&#34;&gt;November 11, 2007 3:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This works for all the seasons:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  SELECT ?season, ?episode,?chalkboard_gag WHERE {
  ?episode skos:subject ?season .
  ?season rdfs:label ?season_title .
  ?episode dbpedia2:blackboard ?chalkboard_gag .
  FILTER (regex(?season_title, &amp;quot;The Simpsons episodes, season&amp;quot;)) .
  }
  ORDER BY ?season
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is not a perfect query: there are some extra lines in the result, it&amp;rsquo;s not perfectly ordered, and relying on regex is not very semantically clean. But it works.&lt;/p&gt;
&lt;p&gt;Thanks for showing me a new toy.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc&#34; title=&#34;http://www.snee.com/bobdc&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-1394&#34;&gt;November 11, 2007 11:21 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Vaclav! It&amp;rsquo;s the funniest list of anything that I&amp;rsquo;ve read in a while. I was going to list some of my favorites here, but there are just too many. And, I&amp;rsquo;ve learned a little more about SPARQL and DBpedia.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/dbpedia">DBpedia</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Computing with our phones</title>
      <link>https://www.bobdc.com/blog/computing-with-our-phones/</link>
      <pubDate>Wed, 07 Nov 2007 19:52:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/computing-with-our-phones/</guid>
      
      
      <description><div>I want to believe...</div><div>&lt;p&gt;Soon we&amp;rsquo;ll do most of our computing on our phones (see AP article &lt;a href=&#34;http://biz.yahoo.com/ap/071105/japan_bye_bye_pcs.html&#34;&gt;PCs Losing Their Relevance in Japan&lt;/a&gt;, as noted on &lt;a href=&#34;http://radar.oreilly.com/archives/2007/11/japan_leading_t.html&#34;&gt;O&amp;rsquo;Reilly Radar&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;But not too soon, as noted in a recent &lt;a href=&#34;http://publishing2.com/2007/09/26/five-reasons-why-the-mobile-web-sucks/&#34;&gt;Publishing 2.0&lt;/a&gt; posting. It&amp;rsquo;s nice to see this dose of reality among all the hype.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1386&#34;&gt;November 7, 2007 9:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I suspect that moving computing to phones won&amp;rsquo;t last. As the mobile generation ages they&amp;rsquo;ll need bigger screens and bigger keyboards and pointing devices that don&amp;rsquo;t require such precision. And then even though your computer is your phone, it&amp;rsquo;ll still spend 9 to 5 docked to something very much like a laptop dock today, only smaller.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.dur.ac.uk/j.r.c.geldart/&#34; title=&#34;http://www.dur.ac.uk/j.r.c.geldart/&#34;&gt;Joe Geldart&lt;/a&gt; on &lt;a href=&#34;#comment-1387&#34;&gt;November 8, 2007 1:22 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;But hopefully the idea of device-dependent &lt;em&gt;access&lt;/em&gt; will survive that.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.dur.ac.uk/j.r.c.geldart/&#34; title=&#34;http://www.dur.ac.uk/j.r.c.geldart/&#34;&gt;Joe Geldart&lt;/a&gt; on &lt;a href=&#34;#comment-1388&#34;&gt;November 8, 2007 1:28 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Uh, I meant to say &amp;lsquo;independent&amp;rsquo; of course (self-confessed ubicomp supporter).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/technology-future">technology, future</category>
      
    </item>
    
    <item>
      <title>A nice free XML editor</title>
      <link>https://www.bobdc.com/blog/a-nice-free-xml-editor/</link>
      <pubDate>Sun, 04 Nov 2007 09:56:08 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-nice-free-xml-editor/</guid>
      
      
      <description><div>(Of course, I&#39;ll continue to use Emacs in nxml mode.)</div><div>&lt;p&gt;I always thought that free XML editors (and some commercial ones) were limited to two display modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A tag view, which essentially displays your document as-is, but with color-coding of tags and other kinds of markup. When XML first became popular, the more powerful programmer&amp;rsquo;s editors added an XML mode that did this to their list of modes for various programming languages. A proper XML editor also enforces dynamic validation, so that when you enter the command to insert an element the editor only offers you a choice of valid elements for the cursor position.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A tree view, which is easy to create with typical GUI widgets, and can be handy for transactional XML data (a.k.a. &lt;a href=&#34;http://www.snee.com/xml/xml2004paper.html&#34;&gt;data-oriented&lt;/a&gt; XML) but is lame for narrative (a.k.a. &amp;ldquo;document-oriented&amp;rdquo;) XML—who wants to see their paragraph shown as a node of a tree with text node children that have an &lt;code&gt;emphasis&lt;/code&gt; element as one of their siblings?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/xmlmind.jpg&#34; width=&#34;300px&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[XMLMind screen shot of this blog posting]&#34;/&gt;
&lt;p&gt;A word processor-like view, which distinguishes between block and inline elements and displays each according to some hopefully straightforward specification, seemed like the province of more expensive editors such as Arbortext and XMetaL, but I recently discovered Pixware&amp;rsquo;s &lt;a href=&#34;http://www.xmlmind.com/&#34;&gt;XMLmind&lt;/a&gt;. Their free &amp;ldquo;&lt;a href=&#34;http://www.xmlmind.com/xmleditor/persoedition.html&#34;&gt;Personal Edition&lt;/a&gt;&amp;rdquo; is &lt;a href=&#34;http://www.xmlmind.com/xmleditor/download.shtml&#34;&gt;available&lt;/a&gt; for both Windows machines and Macs. It uses CSS stylesheets to style the documents that you create as you edit them. Pixware offers a &lt;a href=&#34;http://www.xmlmind.com/xmleditor/proedition.html&#34;&gt;Professional Edition&lt;/a&gt; for $250 that has more features such as on-the-fly spell checking, but the free one still has a spell checker (with a wide choice of dictionaries, because it&amp;rsquo;s from a French company), and comes with CSS stylesheets for several popular document types.&lt;/p&gt;
&lt;p&gt;It worked just fine with my own HTML CSS stylesheet that I use for writing things like this. I might even suggest that someone uninterested in XML but who wants to learn about web page creation use it with the XHTML DTD and CSS. You can&amp;rsquo;t beat the Personal Edition&amp;rsquo;s price.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll &lt;a href=&#34;https://www.bobdc.com/blog/emacs-good-and-how-to-create-a&#34;&gt;continue to use Emacs&lt;/a&gt;, because I do want to see the markup, and I have too much affection for &lt;a href=&#34;http://www.thaiopensource.com/nxml-mode/&#34;&gt;nxml mode&lt;/a&gt;, all the other off-the-shelf Emacs features, and the ones that I&amp;rsquo;ve developed myself. To proofread, I add an &lt;a href=&#34;http://www.w3.org/1999/06/REC-xml-stylesheet-19990629/&#34;&gt;xml-stylesheet&lt;/a&gt; processing instruction and then view the document in Firefox. For people who are interested (or have friends or clients interested) in interactive editing of styled XML documents without spending any money, though, XML Mind is worth a good look.&lt;/p&gt;
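&lt;p&gt;For anyone who hasn&amp;rsquo;t tried that proofreading trick: it takes just one processing instruction near the top of the document, before the root element (the stylesheet filename here is made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;?xml-stylesheet type=&amp;quot;text/css&amp;quot; href=&amp;quot;docstyles.css&amp;quot;?&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the document in Firefox and it renders with the CSS applied.&lt;/p&gt;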
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By David Holden on &lt;a href=&#34;#comment-1381&#34;&gt;November 4, 2007 6:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hello Bob,&lt;/p&gt;
&lt;p&gt;you may also be interested in&lt;/p&gt;
&lt;p&gt;epcedit &lt;a href=&#34;http://www.epcedit.com/&#34;&gt;http://www.epcedit.com/&lt;/a&gt;,&lt;/p&gt;
&lt;p&gt;a now free SGML/XML editor. It can also be scripted using tcl.&lt;/p&gt;
&lt;p&gt;Dave.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1382&#34;&gt;November 4, 2007 10:23 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m impressed that it can edit SGML, but the website doesn&amp;rsquo;t show many screen shots, and the main one only shows a tree view and a tag view. Is there a formatted view?&lt;/p&gt;
&lt;p&gt;By David Holden on &lt;a href=&#34;#comment-1384&#34;&gt;November 5, 2007 6:45 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think it uses the SP libraries for its SGML capability; either way, I&amp;rsquo;ve had this editor correctly treat SGML documents where the Arbortext editor failed.&lt;/p&gt;
&lt;p&gt;There is a formatted view; you can style using its own XPath/CSS-like language, which can be saved as templates, and view with or without tags, though I would not say it allows the most sophisticated styling. It does have a CALS table editor.&lt;/p&gt;
&lt;p&gt;On a Linux platform it seems to lack antialiased fonts, although I think this is due to its Tk underpinnings.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Praising DITA</title>
      <link>https://www.bobdc.com/blog/praising-dita/</link>
      <pubDate>Thu, 25 Oct 2007 21:00:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/praising-dita/</guid>
      
      
      <description><div>For the wrong and right reasons.</div><div>&lt;p&gt;There are many reasons to like the &lt;a href=&#34;http://dita.xml.org/&#34;&gt;Darwin Information Typing Architecture&lt;/a&gt;, but much of the praise for it lately seems a bit misguided. For a lot of XML products and services companies, DITA is the new bottle in which to put their old wine. They talk about how DITA is great because it lets you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;write content once and then automate its use in multiple media (streamlining the publishing process, etc. etc.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mix and match blocks of content to create new products on the fly&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;reduce dependency on proprietary tools&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;select subsets of content based on metadata in attributes&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are all great things, but XML technology had them before DITA came along. Take a look at the boldface bullet points in the SOA World article &lt;a href=&#34;http://xml.sys-con.com/read/434465.htm&#34;&gt;Improving Customer&amp;rsquo;s SOA Experience with DITA&lt;/a&gt; (a product of the ever-opportunistic SYS-CON folk—let&amp;rsquo;s give them extra points for working two trendy acronyms into the same article title, but take one off for bad punctuation): every one applies to pre-DITA XML, and even to SGML. (Perhaps &amp;ldquo;Translation efficiency and acceleration&amp;rdquo; wouldn&amp;rsquo;t be as easy in SGML—people forget how much easier XML&amp;rsquo;s Unicode base made a lot of things.)&lt;/p&gt;
&lt;h2 id=&#34;rkFUet1SQ42vasxOCLxg5g&#34;&gt;Pre-DITA schema customization&lt;/h2&gt;
&lt;p&gt;While the idea of customizing DTDs isn&amp;rsquo;t new with DITA, to me DITA&amp;rsquo;s greatest contribution is the new possibilities it offers for DTD/schema customization.&lt;/p&gt;
&lt;p&gt;Most good XML and SGML schemas were customizable before. At the XML 2005 conference I did a presentation titled &lt;a href=&#34;http://www.idealliance.org/proceedings/xml05/abstracts/paper30.HTML&#34;&gt;Your schema and the industry-standard schema&lt;/a&gt; on how to evaluate the customizability of a standard schema as you consider adopting it. That paper goes into more detail on the syntax for the hooks typically used to allow customization, but to summarize, there are two basic techniques that can be mixed and matched. First, instead of defining an element&amp;rsquo;s content like this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!ELEMENT article (title,(paragraph|picture)+)&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;you can define it using a parameter entity and reference the parameter entity:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!ENTITY % article-content &amp;quot;title,(paragraph|picture)+&amp;quot;&amp;gt;
&amp;lt;!ELEMENT article (%article-content;)&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;DITA&#39;s greatest contribution is the new possibilities it offers for DTD/schema customization.&lt;/blockquote&gt;
&lt;p&gt;Your customized version can reference the DTD with these declarations and then redeclare the &lt;code&gt;article-content&lt;/code&gt; parameter entity to have any content model you like.&lt;/p&gt;
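&lt;p&gt;For example, a customizing DTD might look like this (the base DTD&amp;rsquo;s filename is made up). Because an XML parser uses the first declaration that it sees for any given entity, the customization&amp;rsquo;s version of &lt;code&gt;article-content&lt;/code&gt; overrides the one in the base DTD, as long as the customization also declares any new elements that it adds:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!ENTITY % article-content &amp;quot;title,abstract?,(paragraph|picture)+&amp;quot;&amp;gt;
&amp;lt;!ELEMENT abstract (#PCDATA)&amp;gt;
&amp;lt;!ENTITY % base-dtd SYSTEM &amp;quot;article.dtd&amp;quot;&amp;gt;
%base-dtd;
&lt;/code&gt;&lt;/pre&gt;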
&lt;p&gt;If a DTD doesn&amp;rsquo;t want to allow a complete replacement of its &lt;code&gt;article&lt;/code&gt; content model and wants to instead just provide a hook to add new things, it can use the second technique for customization, which looks more like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!ENTITY % article.content.cust &amp;quot;&amp;quot;&amp;gt;
&amp;lt;!ELEMENT article (title,(paragraph|picture %article.content.cust;)+)&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This still requires that an &lt;code&gt;article&lt;/code&gt; begin with a &lt;code&gt;title&lt;/code&gt; and allows &lt;code&gt;paragraph&lt;/code&gt; and &lt;code&gt;picture&lt;/code&gt; elements in the mix of what comes after that, but it lets you redefine the customization parameter entity &lt;code&gt;article.content.cust&lt;/code&gt; from its default value as an empty string to something like &amp;ldquo;&lt;code&gt;| warning&lt;/code&gt;&amp;rdquo;. Then, as long as your customization declares a &lt;code&gt;warning&lt;/code&gt; element somewhere, that element can be part of the mix of what follows the &lt;code&gt;article&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;title&lt;/code&gt;.&lt;/p&gt;
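&lt;p&gt;Spelled out, such a customization might look like this (again, the base DTD&amp;rsquo;s filename is made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!ENTITY % article.content.cust &amp;quot;| warning&amp;quot;&amp;gt;
&amp;lt;!ELEMENT warning (#PCDATA)&amp;gt;
&amp;lt;!ENTITY % base-dtd SYSTEM &amp;quot;article.dtd&amp;quot;&amp;gt;
%base-dtd;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The effective content model for &lt;code&gt;article&lt;/code&gt; then becomes &lt;code&gt;(title,(paragraph|picture | warning)+)&lt;/code&gt;.&lt;/p&gt;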
&lt;p&gt;As I described in the paper accompanying the XML 2005 presentation, well-behaved DTDs such as DocBook and &lt;a href=&#34;http://www.tei-c.org/P4X/MD.html&#34;&gt;TEI&lt;/a&gt; provide many such opportunities for customization; badly-behaved DTDs like NITF don&amp;rsquo;t.&lt;/p&gt;
&lt;h2 id=&#34;jXXTM7wcTu-rxoVAiRftwA&#34;&gt;DITA&amp;rsquo;s new approach to DTD customization&lt;/h2&gt;
&lt;p&gt;DITA defines a &lt;code&gt;topic&lt;/code&gt; element, with a structure suitable for technical documentation, and a &lt;code&gt;map&lt;/code&gt; element for assembling topics into a sequence, hierarchy, or whatever you need for your output media. DITA also declares three specialized versions of &lt;code&gt;topic&lt;/code&gt; called &lt;code&gt;concept&lt;/code&gt;, &lt;code&gt;task&lt;/code&gt;, and &lt;code&gt;reference&lt;/code&gt;, and a mechanism for creating your own specializations of &lt;code&gt;topic&lt;/code&gt;, &lt;code&gt;task&lt;/code&gt;, &lt;code&gt;reference&lt;/code&gt;, &lt;code&gt;concept&lt;/code&gt;, &lt;code&gt;map&lt;/code&gt;, and other elements.&lt;/p&gt;
&lt;p&gt;This specialization mechanism is DITA&amp;rsquo;s real contribution to the XML world, because it offers new levels of customizability—not new levels above and beyond the old levels, but new levels in between &amp;ldquo;just use the existing content models&amp;rdquo; and &amp;ldquo;rewrite the content models&amp;rdquo; that were the choices before. This is great news for those who found the other two choices a little too far apart. (Before I go further, Norm Walsh &lt;a href=&#34;http://norman.walsh.name/2005/10/21/dita#specialization&#34;&gt;has shown&lt;/a&gt; that all of these ideas can be implemented in DocBook, but by limiting DITA&amp;rsquo;s domain to be a little narrower than that of DocBook, its developers seem to have created something that appears less complex and therefore easier to use for people in that domain: topic-oriented technical documentation.)&lt;/p&gt;
&lt;p&gt;I won&amp;rsquo;t go into the mechanics of DITA specialization here, but their key advantage is that a processor for a base version of an element can process a specialization that it didn&amp;rsquo;t know existed. For example, if you have a &lt;code&gt;recipe&lt;/code&gt; element based on DITA&amp;rsquo;s &lt;code&gt;task&lt;/code&gt; element, a DITA-compliant XSLT stylesheet designed to create HTML versions of &lt;code&gt;task&lt;/code&gt; elements can do the same with &lt;code&gt;recipe&lt;/code&gt; elements, even if the stylesheet was written before anyone had the idea to create this &lt;code&gt;recipe&lt;/code&gt; specialization. It&amp;rsquo;s similar to the object-oriented technique of treating objects of derived classes as objects of the base classes, although the DITA analogy with OO development can get &lt;a href=&#34;http://www-128.ibm.com/developerworks/xml/library/x-dita1/#h4&#34;&gt;pushed too far&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This feature of DITA has its roots in a noble SGML concept known as &lt;a href=&#34;http://xml.coverpages.org/archForms.html&#34;&gt;architectural forms&lt;/a&gt; that never made much progress. The &amp;ldquo;conversion to HTML&amp;rdquo; part of my example above demonstrates why the DITA version of this concept got so much more traction than the original architectural forms idea: there&amp;rsquo;s a working implementation that can convert your DITA documents, regardless of their level of specialization, into HTML, PDF, Java help, troff, RTF, and more—the &lt;a href=&#34;http://dita-ot.sourceforge.net/&#34;&gt;DITA Open Toolkit&lt;/a&gt;. People are using it to get useful work done now.&lt;/p&gt;
&lt;p&gt;Watch out for products whose DITA support consists of bundling the DITA DTDs and the open-source DITA Open Toolkit with their product and then documenting this &amp;ldquo;support&amp;rdquo; by rehashing &lt;a href=&#34;http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=dita&#34;&gt;OASIS&lt;/a&gt; documents telling you what DITA is. Proper support of DITA means proper support of DITA specialization. For example, if you take your editor from a document using a DITA DTD to a document using a specialization of that DTD (for example, a &lt;code&gt;recipe&lt;/code&gt; document), you want to see all the DITA features still available in the editor to edit that second document. Eliot Kimber recently wrote about some related issues in &lt;a href=&#34;http://drmacros-xml-rants.blogspot.com/2007/10/automatic-handling-of-dita-docs-in-xml.html&#34;&gt;Automatic Handling of DITA Docs In XML Editors&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As much as I like this new approach to DTD customization (apparently, it works for W3C Schemas as well, but most support is a little behind there), it seems that a surprisingly small number of DITA users are actually customizing the base DTD. I guess DITA&amp;rsquo;s appeal for them is the straightforward recipe (no pun intended) for a topic-oriented organization of technical material and the free toolkit for converting this content to all the other formats, which, being open source, is easy to integrate into larger applications. (Don&amp;rsquo;t get too excited by the existence of these transformations—what I&amp;rsquo;ve seen of the output isn&amp;rsquo;t very slick looking, and would need some work before a professional publisher would want their content delivered to their customers looking like that.)&lt;/p&gt;
&lt;p&gt;Regardless of the reasons for its appeal, DITA is hot. Above I mentioned that many XML products and services companies are trumpeting their comfort level with DITA, and I do work for a company that offers XML services and has a lot of &lt;a href=&#34;http://www.innodata-isogen.com/knowledge_center/dita?dref=website&#34;&gt;DITA-related promotional material&lt;/a&gt; on their website. I had no hand in writing this material, but I do like its realistic tone of &amp;ldquo;DITA can be a big help, but it&amp;rsquo;s not a magic bullet, here are some important things to think about&amp;rdquo;, and I&amp;rsquo;ve recommended to new co-workers who want to learn about DITA that they read those web pages. The free and commercial software and acceptance by the tech writing community of this relatively new XML standard have given it good traction and a bright future.&lt;/p&gt;
&lt;h2 id=&#34;11-comments&#34;&gt;11 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://dita.xml.org/blog/25&#34; title=&#34;http://dita.xml.org/blog/25&#34;&gt;Michael Priestley&lt;/a&gt; on &lt;a href=&#34;#comment-1350&#34;&gt;October 26, 2007 3:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Excellent summary. I&amp;rsquo;ll add that I do think DITA&amp;rsquo;s relatively strict definition of topic, which in turn enables the use of DITA maps across a diverse range of content, is one of the reasons that DITA has been able to deliver scalable single-sourcing - something that has been traditionally promised by XML, but hard to achieve without a lot of planning and discipline (which DITA doesn&amp;rsquo;t replace, but does provide a headstart with).&lt;/p&gt;
&lt;p&gt;On the specialization front, anecdotally I&amp;rsquo;ve seen about half of DITA&amp;rsquo;s adopters using specialization, which sounds like more than you&amp;rsquo;ve seen. A typical adoption curve is to use DITA out-of-the-box for a year or two, then add some small-scale specializations, then begin building out additional specializations as you hit particular needs. In other words, you start simple, then evolve the architecture along with your understanding, rather than trying to do everything at once in the first adoption phase.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1351&#34;&gt;October 26, 2007 4:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Michael! I like the idea of the headstart that doesn&amp;rsquo;t replace planning and discipline. This could be a recurring theme when discussing a variety of aspects of DITA, ranging from the structure of maps to the transformations available in the DITA Open Toolkit.&lt;/p&gt;
&lt;p&gt;By Ramon M. Felciano on &lt;a href=&#34;#comment-1359&#34;&gt;October 27, 2007 6:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Can you expand a bit on this comment about DITA relative to Docbook: &amp;ldquo;by limiting DITA&amp;rsquo;s domain to be a little narrower than that of DocBook, its developers seem to have created something that appears less complex and therefore easier to use for people in that domain: topic-oriented technical documentation.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It was my impression that DocBook was actually more limited in the sense that is designed for a single technical narrative. It has a particular use case in mind, and is designed around that use case. DITA seems more like a toolkit approach for writing building blocks of documentation.&lt;/p&gt;
&lt;p&gt;Is that right? If so, how do you see DITA&amp;rsquo;s domain actually being narrower than DocBook?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1365&#34;&gt;October 27, 2007 7:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s what I meant by &amp;ldquo;topic-oriented technical documentation&amp;rdquo;. There are a lot of different elements that can serve as the root element of a Docbook document (book, article, etc.), so in my opinion it can be used in a broader range of cases than DITA. For example, a Docbook document could be organized by topic (or some rough equivalent) with less shoe-horning than would be necessary if using DITA for a narrative document.&lt;/p&gt;
&lt;p&gt;This reduced amount of flexibility in DITA makes it easier for people to get a handle on exactly what good it can do them&amp;ndash;sometimes extra flexibility is more work for people as they figure out what profiles are available and which fits their needs best.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dita.xml.org/blog/25&#34; title=&#34;http://dita.xml.org/blog/25&#34;&gt;Michael Priestley&lt;/a&gt; on &lt;a href=&#34;#comment-1369&#34;&gt;October 28, 2007 5:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think the distinction is appropriate (topic-oriented vs. narrative-oriented), but I&amp;rsquo;m not sure about which is broader these days, or which is easier to adapt towards the other.&lt;/p&gt;
&lt;p&gt;Topic-oriented info is pretty much the norm for any professional tech pubs group, and with the rise of wikis and component-oriented CMSs it&amp;rsquo;s rapidly becoming the norm for most new content on the Web or intranet. This doesn&amp;rsquo;t mean that narrative content goes away, but I do think the broader use case probably is topics these days.&lt;/p&gt;
&lt;p&gt;By Tony DaSilva on &lt;a href=&#34;#comment-1371&#34;&gt;October 29, 2007 12:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;While it&amp;rsquo;s hard to disagree with a man as likable as Mike Priestley, I believe the &amp;ldquo;DITA is for topics/DocBook is for narratives&amp;rdquo; meme is a false dichotomy. Nothing prevents you from writing effective, topic-oriented content with DocBook. As a technical communicator who uses DocBook every day, it&amp;rsquo;s the only way I work. DocBook gives you everything you need to write granular, topic-oriented content. You can detail a task, reference information or a concept, recurse the heck out of it within your document (online help, web page, pdf, or whatever) reuse it in other documents, or stick that bit of content in a CMS and do all those things you&amp;rsquo;d expect that system to do.&lt;/p&gt;
&lt;p&gt;Mr. DuCharme&amp;rsquo;s old wine, new bottle argument is spot on. I&amp;rsquo;ve long asserted that DITA&amp;rsquo;s &amp;ldquo;hotness&amp;rdquo; has more to do with the marketing dollars spent by IBM and the rest of the vendor community, than with any intrinsic feature or function DITA provides. While there may be specific reasons for one person to prefer DITA over DocBook, my experience tells me that it is largely about preference. One might prefer DITA because it seems simpler than DocBook (&amp;ldquo;Martha, just look at all those tags!&amp;rdquo;); seems more suited to online work (&amp;ldquo;DocBook? We don&amp;rsquo; nee no estinkin&amp;rsquo; books!&amp;rdquo;); or, because it seems more appropriate for writing topics (&amp;ldquo;Our help is organized by topics, except when we publish it as a PDF; then, it&amp;rsquo;s a narrative.&amp;rdquo;). All these are fine and appropriate reasons, but none of them have anything to do with one standard having more juice than the other.&lt;/p&gt;
&lt;p&gt;Yes, yes. I know. Except for specialization:&lt;/p&gt;
&lt;p&gt;DITAdroid: &amp;ldquo;That&amp;rsquo;s right. You can specialize with DITA. That way, when your entire enterprise writes everything in DITA, you&amp;rsquo;ll be able to use the Development team&amp;rsquo;s use cases in the Marketing team&amp;rsquo;s white papers.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You: &amp;ldquo;The whole enterprise writes in DITA? Even marketing?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;DITAdroid: &amp;ldquo;Yeah. Isn&amp;rsquo;t it great!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You: &amp;ldquo;Yeah, except, if you choose DITA because it is &amp;lsquo;easier,&amp;rsquo; then why would you wade in the deep waters of specialization?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;DITAdroid: &amp;ldquo;Because, silly, once you get experience with DITA, it&amp;rsquo;s SOOO much easier to specialize.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You: &amp;ldquo;Right, just like with DocBook. Once you gain experience with it, the full range of its possibilities for writing reusable, topic-oriented, online content, and even - gasp - narrative-oriented content destined for the landfill (we called them books) becomes easier to understand and implement.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;DITAdroid: &amp;ldquo;No.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You: &amp;ldquo;Why?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;DITAdroid: &amp;ldquo;Because&amp;hellip;Because the DITA guy told me that DocBook is for books.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You: &amp;ldquo;Right.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;DITA is cool. DITA rocks, rolls, and still loves you in the morning. Like any new love, she&amp;rsquo;s hotter than the old one. But, is she really all that different? Not so much, I think.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dita.xml.org/blog/25&#34; title=&#34;http://dita.xml.org/blog/25&#34;&gt;Michael Priestley&lt;/a&gt; on &lt;a href=&#34;#comment-1372&#34;&gt;October 29, 2007 2:53 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;IBM had topic-oriented authoring guidelines long before we had DITA; yet the move to DITA still caused shakeups and rewriting, because it forced teams to confront the issue of topics and chunking to fit their content in the architecture. One of the roles of XML is to enforce your content model - if your content model includes topic-orientation, it makes sense to use an XML architecture that includes that constraint, rather than punting to editorial guidelines on a key information model issue.&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t mention DITA maps, which is interesting - I see them as a key component to the architecture, as important as topics or specialization. And maps don&amp;rsquo;t work if you don&amp;rsquo;t have predictable, addressable topics.&lt;/p&gt;
&lt;p&gt;On the specialization front, generally people use specialization when they have business rules or information model constraints beyond just topic orientation that they want to enforce. That&amp;rsquo;s why concept, task, and reference exist as specializations, along with all the user-created specializations out there, and why DITA subcommittees are actively developing industry-specific specializations. I recommend you read a few DITA specialization case studies, or ask on the dita-users list, if you want more detailed examples.&lt;/p&gt;
&lt;p&gt;I like DocBook for what it does, but it doesn&amp;rsquo;t enforce topics, it doesn&amp;rsquo;t do maps, and it doesn&amp;rsquo;t do specialization. If these three things don&amp;rsquo;t matter to you, then you probably shouldn&amp;rsquo;t consider DITA. But clearly they do matter to a lot of DITA users who are making active use of all three capabilities.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1373&#34;&gt;October 29, 2007 4:55 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Just to add to what Michael said, Norm has shown that you could implement map- and topic-like constructs in Docbook, and it&amp;rsquo;s very customizable, but these particular constructs are easier in DITA because of the head start it gives you if your information fits well into these structures. (That&amp;rsquo;s why I liked his &amp;ldquo;head start&amp;rdquo; image.)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m sure that IBM has a budgeted strategy regarding its use of and participation in DITA work, but the term &amp;ldquo;marketing dollars&amp;rdquo; is a bit much&amp;ndash;it&amp;rsquo;s not an IBM product for sale like WebSphere or DB2, but a standard that they support and want to see grow. I&amp;rsquo;ve written entire books in DocBook, and may write more, but in my role as an employee of a consulting firm that specializes in standards-based automated publishing, I see plenty of situations where DITA is more appropriate.&lt;/p&gt;
&lt;p&gt;By Tony DaSilva on &lt;a href=&#34;#comment-1374&#34;&gt;October 29, 2007 10:06 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Mike, I appreciate your passion for DITA. I respect your deep knowledge of the standard and how it can support the work of technical communicators. I&amp;rsquo;ve been the beneficiary of your instruction and am happy to say that nearly everything (everything that&amp;rsquo;s correct, anyway) I know about DITA, I first learned from you. Just so we&amp;rsquo;re clear, I think DITA rocks. I don&amp;rsquo;t believe I&amp;rsquo;ve ever once suggested that it fails to deliver the goods for online content (and, this latest iteration is well on its way to giving us more of what we need for print). If you say folks are falling over themselves to specialize, well more power to them - specialize away. And, while I can do much the same with DocBook, DITA maps are pretty neat, too.&lt;/p&gt;
&lt;p&gt;My primary beef is that people who know better seem to consistently mischaracterize the capabilities of a mature, time-tested standard that continues to offer benefits to tens of thousands of users across the globe. These mischaracterizations end up misleading the uninitiated into believing that DocBook can&amp;rsquo;t do what it can and isn&amp;rsquo;t suited to do what it does. It may lead them to abandon an approach that gives them 99% of what they want and gives it to them as fast and as cheap as can be. I have a problem with that.&lt;/p&gt;
&lt;p&gt;You know we can write topic oriented content in DocBook, and we can write narrative content in DITA. We can be as strict or as loosey goosey as our hearts desire. Whatever topic enforcement is purpose built into DITA (or, is not in DocBook) is only as strong as the willingness of the writer to obey those restrictions. There&amp;rsquo;s no invisible hand guiding authors working in DITA compelling them to write tidy, compact and standalone topics. That&amp;rsquo;s a skill to be learned, and - as you say - an approach that requires planning and discipline. DITA&amp;rsquo;s &amp;ldquo;head start&amp;rdquo; might make it a bit easier, but it alone is not sufficient.&lt;/p&gt;
&lt;p&gt;All I am saying is that it is misleading to assert that DocBook is unsuited for topic oriented content. That&amp;rsquo;s demonstrably false. There may be plenty of reasons to choose DITA over DocBook (maps, specialization, hanging with the cool kids, whatever), but topic orientation is not an item where the two standards offer much of a difference. I&amp;rsquo;ve written books in DocBook, will write more, and as an owner of a consulting firm see situations where DocBook is more appropriate and situations where it is less appropriate. I don&amp;rsquo;t have a horse in the race, and I owe my clients my objective opinion derived from a study of their requirements and the technical landscape. Other folks do have a vested interest in the growth and adoption of one standard, and when these folks consistently misrepresent and minimize the capabilities of a - like it or not - competing standard, that gets my attention.&lt;/p&gt;
&lt;p&gt;As for IBM and marketing DITA, puh-leeze. Who employs Day, Priestley, Schell, Anderson, Hennum, Hunt, et al.? Who pays their travel and per diem at the dozen or so conferences they attend each year? Who sponsors these conferences, pays for developing the Toolkit and the Task Modeler, provides these folks time for writing articles in DeveloperWorks, etcetera? IBM. Every dollar they spend, compounded by the dollars spent by the dozens of smaller companies pushing their particular DITA &amp;ldquo;solution&amp;rdquo; is a marketing dollar for DITA. Why is it that so many folks of common sense and uncommon intelligence seem to squirm at the mere mention of this?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dita.xml.org/blog/25&#34; title=&#34;http://dita.xml.org/blog/25&#34;&gt;Michael Priestley&lt;/a&gt; on &lt;a href=&#34;#comment-1375&#34;&gt;October 30, 2007 12:21 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tony, I gave you a concrete example of the difference between system-based topic orientation versus guideline-based. I am not slamming DocBook, I am reporting personal experience. That experience has been validated by dozens of other early DITA adopters, who have had to shake up content to fit it into DITA, and have benefitted from the result.&lt;/p&gt;
&lt;p&gt;In terms of marketing dollars, my main job is not promoting DITA, nor is that the main job of the others you mentioned. We&amp;rsquo;re fortunate to have supportive management, and in some cases conferences that are willing to help with the cost of travel to get us where we&amp;rsquo;re needed. Does Norm Walsh&amp;rsquo;s work on DocBook represent &amp;ldquo;marketing dollars&amp;rdquo; from Sun?&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re going to have to spread the credit for DITA&amp;rsquo;s success much more widely. I&amp;rsquo;d start with the list of companies and individuals on the DITA TC page (which includes Sun, by the way).&lt;/p&gt;
&lt;p&gt;By Tony DaSilva on &lt;a href=&#34;#comment-1376&#34;&gt;October 30, 2007 9:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re so very right, Mike. Thanks for DITA. Thanks to &lt;a href=&#34;http://www.oasis-open.org/committees/membership.php?wg_abbrev=dita&#34;&gt;the DITA TC membership&lt;/a&gt; for all your work. Thanks also to&lt;/p&gt;
&lt;p&gt;Adobe Systems&lt;br /&gt;
BMC Software&lt;br /&gt;
Citrix Systems, Inc.&lt;br /&gt;
Comet Communication&lt;br /&gt;
Comtech Services, Inc.&lt;br /&gt;
IBM&lt;br /&gt;
Intel Corporation&lt;br /&gt;
Justsystems Corporation&lt;br /&gt;
Nokia Corporation&lt;br /&gt;
Oracle Corporation&lt;br /&gt;
PTC&lt;br /&gt;
Sun Microsystems&lt;br /&gt;
The Boeing Company&lt;br /&gt;
US Department of Defense&lt;/p&gt;
&lt;p&gt;for their continuing investment in the work of these good people.&lt;/p&gt;
&lt;p&gt;Thanks for DocBook, too. Thanks to &lt;a href=&#34;http://nwalsh.com/&#34;&gt;Norm&lt;/a&gt; and &lt;a href=&#34;http://www.sagehill.net/&#34;&gt;Bob&lt;/a&gt; and &lt;a href=&#34;http://www.oasis-open.org/committees/membership.php?wg_abbrev=docbook&#34;&gt;the DocBook TC membership&lt;/a&gt; for all your work. Thanks also to&lt;/p&gt;
&lt;p&gt;PTC&lt;br /&gt;
Reed Elsevier&lt;br /&gt;
Sun Microsystems&lt;/p&gt;
&lt;p&gt;for their continuing investment in the work of these good people.&lt;/p&gt;
&lt;p&gt;And, to all those people and organizations whose work benefits both communities, thanks and thanks again.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;ve helped me feed my brain, feed my kids, and feed my wife&amp;rsquo;s appetite for &lt;a href=&#34;http://www.neimanmarcus.com/store/catalog/templates/EntrySC.jhtml?itemId=cat000209&amp;amp;parentId=cat000199&amp;amp;masterId=cat000149&#34;&gt;designer shoes&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I owe you, man. Big time.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
    </item>
    
    <item>
      <title>What Shelley said</title>
      <link>https://www.bobdc.com/blog/what-shelley-said/</link>
      <pubDate>Sat, 20 Oct 2007 10:50:16 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/what-shelley-said/</guid>
      
      
      <description><div>What counts as a semantic web tool?</div><div>&lt;p&gt;&lt;a href=&#34;http://twine.com/&#34;&gt;Twine&lt;/a&gt; looks like fun to me, the standards support looks great, and I&amp;rsquo;ve applied to be a beta tester. Still, one point that &lt;a href=&#34;http://burningbird.net/technology/semantic-to-go/&#34;&gt;Shelley Powers made about it&lt;/a&gt; bears repeating and rereading:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To me, the semantic web means the web in the wild, not centralized in a specific tool or environment. If this becomes a &amp;ldquo;Facebook and Wikipedia mashup&amp;rdquo;, it might be successful, and it might be semantic, but it isn&amp;rsquo;t the web. The whole point of the semantic web technologies is for each of us to annotate our data, wherever we are, regardless of tool, and begin to really drive out the tiny threads of true meaning on a global scale. If we have to leave our places where we&amp;rsquo;re at and go elsewhere, this seems to create a disconnect, right from the start. I have this same quibble with the other &amp;lsquo;mainstream applications using semantic web technologies&amp;rsquo;, so the concern isn&amp;rsquo;t targeted specifically at Twine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.scheir.net&#34; title=&#34;http://blog.scheir.net&#34;&gt;peter&lt;/a&gt; on &lt;a href=&#34;#comment-1330&#34;&gt;October 20, 2007 12:42 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I’ve just applied for a Twine account, so what I am writing here is basically just guessing.&lt;br /&gt;
Twine is going to provide a SPARQL API to the data stored in it. So the data is on the web. Twine is a place to store data like a web server is a place to store web pages.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Command line from Windows Explorer and vice versa</title>
      <link>https://www.bobdc.com/blog/command-line-from-windows-expl/</link>
      <pubDate>Fri, 19 Oct 2007 08:28:56 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/command-line-from-windows-expl/</guid>
      
      
      <description><div>And an Xubuntu equivalent.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/cmdlineexplorer.jpg&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;I use the Windows command line &lt;a href=&#34;https://www.bobdc.com/blog/w2k-batch-file-programming&#34;&gt;a lot&lt;/a&gt;, but some things are easier with the graphical interface, such as deleting or moving a group of files that don&amp;rsquo;t have some obvious part of their name in common. I&amp;rsquo;ve known for years that you can start up Windows Explorer from the command line, and you can pass it an argument telling it which directory/folder to open in that window. The following command is a simple way to open up an explorer window for the current directory from the command line:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;explorer . 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I recently learned of a Microsoft &lt;a href=&#34;http://www.microsoft.com/windowsxp/downloads/powertoys/xppowertoys.mspx&#34;&gt;XP &amp;ldquo;Power Toy&amp;rdquo;&lt;/a&gt; that performs the opposite trick: when it&amp;rsquo;s loaded, a folder icon&amp;rsquo;s right-click menu in explorer includes an &amp;ldquo;Open Command Window Here&amp;rdquo; choice that brings you to a prompt with that folder as the current directory.&lt;/p&gt;
&lt;p&gt;I was going to write &amp;ldquo;Does anyone know of an equivalent that runs on Xubuntu?&amp;rdquo; but a bit of web searching turned up &lt;a href=&#34;http://roland65.free.fr/xfe/&#34;&gt;xfe&lt;/a&gt;. When you start it from the command line with no parameters, its default behavior is to open a window with icons of the files and subdirectories in your command line&amp;rsquo;s current directory. Its tool bar includes a little terminal icon that starts up a command line window in the directory for the folder you&amp;rsquo;re viewing, so you can have it both ways with this one utility.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.triplescape.com&#34; title=&#34;http://www.triplescape.com&#34;&gt;Brian Manley&lt;/a&gt; on &lt;a href=&#34;#comment-1325&#34;&gt;October 19, 2007 11:37 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a shorter equivalent:&lt;/p&gt;
&lt;p&gt;start .&lt;/p&gt;
&lt;p&gt;The nice thing about &amp;lsquo;start&amp;rsquo; is that, when given a file or URL, it will start the system default application to process the parameter:&lt;/p&gt;
&lt;p&gt;start mailto:foobar@snee.com&lt;br /&gt;
start &lt;a href=&#34;http://www.snee.com&#34;&gt;http://www.snee.com&lt;/a&gt;&lt;br /&gt;
start MyDocument.doc&lt;br /&gt;
etc..&lt;/p&gt;
&lt;p&gt;Cheers,&lt;br /&gt;
Brian&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-1326&#34;&gt;October 19, 2007 11:46 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been back on a Win32 machine a lot recently, and in the past have found Cygwin a must-have. This time even more so - someone suggested the rxvt package, seems an excellent lightweight terminal.&lt;/p&gt;
&lt;p&gt;I put xfe on a&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.megginson.com/blogs/quoderat/&#34; title=&#34;http://www.megginson.com/blogs/quoderat/&#34;&gt;David Megginson&lt;/a&gt; on &lt;a href=&#34;#comment-1327&#34;&gt;October 19, 2007 11:52 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you wanted to move to regular Ubuntu (with Gnome), then you could type&lt;/p&gt;
&lt;p&gt;nautilus .&lt;/p&gt;
&lt;p&gt;from the shell to get a window in the current directory. To go the other way, install the (tiny) nautilus-open-terminal package, which adds a menu entry to open a shell in the current directory from a Nautilus window.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1328&#34;&gt;October 19, 2007 12:50 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Danny and David: Thanks!&lt;/p&gt;
&lt;p&gt;Brian: that looks great, I&amp;rsquo;ll be using that instead of explorer.exe from now on.&lt;/p&gt;
&lt;p&gt;For your MyDocument.doc example, though, the &amp;ldquo;start&amp;rdquo; isn&amp;rsquo;t necessary&amp;ndash;if a document has an extension that has an app associated with it, merely typing the name of the document will start up the app. This proved very useful about an hour ago when I couldn&amp;rsquo;t figure out how to tell the Adobe Digital Editions ebook client how to open up a file I had sitting on a disk (they don&amp;rsquo;t need no stinkin&amp;rsquo; &amp;ldquo;File&amp;rdquo; menu with an &amp;ldquo;Open&amp;rdquo; choice) so I entered the name of the .epub ebook file at the command line and it opened up in the Adobe reader.&lt;/p&gt;
&lt;p&gt;By Ed Davies on &lt;a href=&#34;#comment-1329&#34;&gt;October 19, 2007 2:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;On Ubuntu I have a symlink:&lt;/p&gt;
&lt;p&gt;lrwxrwxrwx 1 root root 19 2007-08-11 13:11 /usr/local/bin/go -&amp;gt; /usr/bin/gnome-open&lt;/p&gt;
&lt;p&gt;which allows &amp;ldquo;go somedoc.pdf&amp;rdquo;, &amp;ldquo;go &lt;a href=&#34;http://www.snee.com&#34;&gt;http://www.snee.com&lt;/a&gt;&amp;rdquo; or whatever. Naturally, &amp;ldquo;go .&amp;rdquo; opens Nautilus on the current directory, not that I use it much. The only snag is that it will only do one thing at a time; &amp;ldquo;go *.xml&amp;rdquo; only opens the first XML file in the directory.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Playing with pull quotes</title>
      <link>https://www.bobdc.com/blog/playing-with-pull-quotes/</link>
      <pubDate>Wed, 17 Oct 2007 21:36:49 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/playing-with-pull-quotes/</guid>
      
      
<description><div>Now maybe they display properly in the full Atom feed.</div><div>&lt;blockquote class=&#34;pullquote&#34; style=&#39;width: 190px; font: bold 1.333em/1.125em &#34;Helvetica Neue&#34;, Helvetica, Arial, sans-serif; margin: 1.5em 0 1.5em 1.5em !important; padding: 0.6em 5px !important; background: none !important; border: 3px double #ddd; border-width: 3px 0; text-align: center; float: right; &#39;&gt;&lt;b&gt;Or at least look a little better.&lt;/b&gt;&lt;/blockquote&gt;
&lt;p&gt;I like to put an image or a &lt;a href=&#34;http://en.wikipedia.org/wiki/Pull_quote&#34;&gt;pull quote&lt;/a&gt; in my weblog postings to break up the gray of pure text. I also prefer pulling summary Atom feeds, rather than full ones, into my reader, but I offer both on my weblog&amp;rsquo;s &lt;a href=&#34;http://www.snee.com/bobdc.blog/&#34;&gt;home page&lt;/a&gt;. I recently saw in my logs that the Atom feed with my complete postings is more popular, and I knew that the pull quotes looked wrong there, so I&amp;rsquo;ve tried to fix that.&lt;/p&gt;
&lt;p&gt;Because the full Atom feed version of the postings doesn&amp;rsquo;t know about the HTML version&amp;rsquo;s style sheet, which has a specific style for pull quotes, the pull quotes look like an extra body text paragraph that happens to repeat something from one of the other paragraphs. To make them look more like pull quotes in those feeds, I&amp;rsquo;m now embedding the necessary CSS right in a &lt;code&gt;style&lt;/code&gt; attribute of the &lt;code&gt;blockquote&lt;/code&gt; element that holds the pull quote, so it should show up properly in the display of the full Atom feed version of each posting—or at least look a little better. Tunneling CSS in the content does feel like cheating, in a purist markup geek kind of way, but those &lt;code&gt;blockquote&lt;/code&gt; elements have their own &lt;code&gt;class&lt;/code&gt; attribute with a value of &amp;ldquo;pullquote&amp;rdquo; in case some automated process wants to just find them and get rid of them, so I guess it&amp;rsquo;s not too bad.&lt;/p&gt;
&lt;p&gt;It looks fine in Bloglines in recent releases of Firefox and IE. Let me know if it&amp;rsquo;s a mess elsewhere.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://kontrawize.blogs.com/kontrawize/&#34; title=&#34;http://kontrawize.blogs.com/kontrawize/&#34;&gt;Anthony B. Coates&lt;/a&gt; on &lt;a href=&#34;#comment-1317&#34;&gt;October 18, 2007 3:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In Opera (when using it as a feed reader), the pull text shows up as an indented paragraph (i.e. blockquote), but not with text wrapping around it as it appears on the Web page.&lt;br /&gt;
Cheers, Tony.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1321&#34;&gt;October 18, 2007 8:08 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Tony,&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s how it looked in Bloglines before I made this change.&lt;/p&gt;
&lt;p&gt;I just added &lt;b&gt;&lt;/b&gt; tags to make it a little clearer that it&amp;rsquo;s not a regular blockquote.&lt;/p&gt;
&lt;p&gt;See you in Boston!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/blogging-about-blogging">blogging about blogging</category>
      
    </item>
    
    <item>
      <title>Buying the new Radiohead album</title>
      <link>https://www.bobdc.com/blog/buying-the-new-radiohead-album/</link>
      <pubDate>Sun, 14 Oct 2007 16:25:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/buying-the-new-radiohead-album/</guid>
      
      
      <description><div>The price? &#34;IT&#39;S UP TO YOU&#34;</div><div>&lt;p&gt;If you follow the news of either hi-tech media or music, you already know that Radiohead is selling their new album &amp;ldquo;In Rainbows&amp;rdquo; to customers online for whatever they want to pay. A deluxe boxed set is also available for a fixed price of 40 pounds.&lt;/p&gt;
&lt;p&gt;Radiohead is on my short list of artists whose new work I buy as soon as it&amp;rsquo;s released without waiting for any reviews. To be honest, my first few Radiohead CDs were dupes of a friend&amp;rsquo;s copies, but I liked them enough to buy shrinkwrapped new copies of the next few that came out. The early stuff is a little too Junior U2 for me, and their later work seems like the kind of pretentious drama that should really irk me, but as with so many bands, if we take the lead singer&amp;rsquo;s histrionics with a grain of salt the package is really well done. I especially love Jonny Greenwood&amp;rsquo;s awareness that &amp;ldquo;electronic music&amp;rdquo; is not something that was invented for dance floors in the 1980s, but goes all the way back to the work of people such as &lt;a href=&#34;http://en.wikipedia.org/wiki/Pierre_Schaeffer&#34;&gt;Pierre Schaeffer&lt;/a&gt;, who was recording machinery and then cutting up and gluing bits of recording tape together in the 1940s, generations before any talk of sampling or mashups. Their balance of strange noises with crunchy power chords continues a legacy that goes back to Roxy Music&amp;rsquo;s first album, only with much better guitar playing; for a great example of a rocking Radiohead kick-in, see what happens at 2:49 of &lt;a href=&#34;http://www.youtube.com/watch?v=m_mMzOQpe0I&#34;&gt;this YouTube video&lt;/a&gt; of &amp;ldquo;Paranoid Android&amp;rdquo;. (The song is from &amp;ldquo;OK Computer,&amp;rdquo; an excellent Radiohead album to start with.)&lt;/p&gt;
&lt;p&gt;Regular trips to Oxford for the &lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;XML Summer School&lt;/a&gt; also make me a fan, because they&amp;rsquo;re an Oxford band—the city, not the university, contrary to what &lt;a href=&#34;http://www.allmusic.com/cg/amg.dll?p=amg&amp;amp;sql=11:fxfoxql5ld6e~T1&#34;&gt;allmusic.com&lt;/a&gt; says—and apparently they still live there. &lt;a href=&#34;http://www.xmlsummerschool.com/speakers/johnchelsom.html&#34;&gt;John Chelsom&lt;/a&gt; told me that while taking his son for a walk in the park, he passed Thom Yorke doing the same, and they gave each other the little &amp;ldquo;here we are, dads pushing our prams among all these moms and nannies&amp;rdquo; nod. I would trade both my stories of passing local boy Dave Matthews in downtown Charlottesville for that any day. (I guess he&amp;rsquo;s also a big fan—he &lt;a href=&#34;http://www.rollingstone.com/news/story/7248604/73_radiohead&#34;&gt;wrote in Rolling Stone magazine&lt;/a&gt; that &amp;ldquo;Listening to Radiohead makes me feel like I&amp;rsquo;m a Salieri to their Mozart&amp;rdquo;.)&lt;/p&gt;
&lt;p&gt;Go to &lt;a href=&#34;http://www.radiohead.com&#34;&gt;radiohead.com&lt;/a&gt; or &lt;a href=&#34;http://www.radiohead.co.uk&#34;&gt;radiohead.co.uk&lt;/a&gt; and you&amp;rsquo;ll be redirected to &lt;a href=&#34;http://www.inrainbows.com/&#34;&gt;inrainbows.com&lt;/a&gt;. After clicking through some screens to indicate that you want to download a copy, you get this:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/buyInRainbows1.jpg&#34; alt=&#34;[In Rainbows transaction screen 1]&#34; border=&#34;0&#34; vspace=&#34;30px&#34; /&gt;
&lt;p&gt;Clicking the question mark displays this:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/buyInRainbows2.jpg&#34; alt=&#34;[In Rainbows transaction screen 2]&#34; border=&#34;0&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;Another question mark, another reassurance:&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/buyInRainbows3.jpg&#34; alt=&#34;[In Rainbows transaction screen 3]&#34; border=&#34;0&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;I decided that my magic number for how much I wanted the album was $10, a nice round figure a few bucks less than what a hard copy CD and booklet would cost. Radiohead is so British that regardless of where you hit their website from, you have to pick a figure in pounds sterling. Unfortunately for us Yanks, the math is only too easy these days: divide the dollar figure by 2, so I entered 5.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m listening to the album now, and it was worth every p.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.rickyboy.squarespace.com&#34; title=&#34;http://www.rickyboy.squarespace.com&#34;&gt;Rick Schochler&lt;/a&gt; on &lt;a href=&#34;#comment-1324&#34;&gt;October 18, 2007 3:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi, Bob&amp;hellip;you&amp;rsquo;ve convinced me to jump onto the historical bandwagon&amp;hellip;I&amp;rsquo;ll download tonight. BTW, did you see the article (I found it on slashdot) that most downloads thus far have been for nil? It&amp;rsquo;ll be interesting to see how this plays out.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>ebook hardware readers: suddenly looking good</title>
      <link>https://www.bobdc.com/blog/ebook-hardware-readers-suddenl/</link>
      <pubDate>Thu, 11 Oct 2007 08:56:45 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/ebook-hardware-readers-suddenl/</guid>
      
      
      <description><div>So many PDFs to read...</div><div>&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Ebook&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/180px-Laptop-ebook.jpg&#34; alt=&#34;[wikipedia ebook picture]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I always said that a specialized device for reading electronic books wasn&amp;rsquo;t worth owning. For example, if you brought one to the beach, sand could get into the machine&amp;rsquo;s inner workings, the screen would be difficult to read in the glare of the sun, and someone could easily steal the device when you went in for a swim. I like to read a lot when I go to the beach, and hardcopy books are obviously superior for that.&lt;/p&gt;
&lt;p&gt;Having been familiar for many years with arguments for why electronic media should replace some but not all books, I should have realized that just because ebook reading devices aren&amp;rsquo;t appropriate for beach reading doesn&amp;rsquo;t mean that they&amp;rsquo;re useless. Several times a week I seem to find yet another PDF document that I want to read, but I don&amp;rsquo;t want to read it while sitting at my desk, I don&amp;rsquo;t want to balance a laptop on my lap and either plug it in or think about its battery life, and I don&amp;rsquo;t want to use up a lot of paper and toner to print something that I&amp;rsquo;m going to read once and throw away. &lt;a href=&#34;http://www.preterhuman.net/texts/science_and_technology/The%20Description%20Logic%20Handbook%20-%20Theory,%20Implementation%20and%20Applications%20(2003).pdf&#34;&gt;The book on Description Logics&lt;/a&gt; whose first chapter I &lt;a href=&#34;https://www.bobdc.com/blog/the-dl-in-owl-dl&#34;&gt;summarized last week&lt;/a&gt; is a good example.&lt;/p&gt;
&lt;p&gt;Many useful books and papers—especially technical ones—are sitting on public servers as PDF files, freely available for download, and I&amp;rsquo;m thinking that a black-and-white ebook reader with no keyboard or hard disk would use less power and be lighter than a typical laptop and save me from using up lots of paper and toner. Of course, I don&amp;rsquo;t want one badly enough to shell out $225, which seems to be a &lt;a href=&#34;http://search.ebay.com/search/search.dll?from=R40&amp;amp;_trksid=m37&amp;amp;satitle=ebook+reader&amp;amp;category0=&#34;&gt;typical ebay price&lt;/a&gt; for them, but my attitude toward them has changed, and I&amp;rsquo;m going to look at these devices differently from now on.&lt;/p&gt;
&lt;p&gt;There are alternative formats to read on these machines, especially &lt;a href=&#34;http://www.idpf.org/oebps/oebps1.2/download/oeb12-xhtml.htm&#34;&gt;OEBPS&lt;/a&gt;, which looks like a nicely done standard, but the existence of all those PDF files of documents that I want to read is driving my interest in ebook reading devices for now. Has anyone else had much experience reading PDFs off of one of these devices on a large scale? Does my lack of experience with them mean that I&amp;rsquo;m missing something in my assumptions about why it would be handy to have one?&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1304&#34;&gt;October 11, 2007 2:33 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s all about screen resolution and contrast: hard to beat paper yet for either one. I print everything and put the paper in the recycle bin: nothing much I can do about toner, though I do have my cartridges refilled rather than replacing them.&lt;/p&gt;
&lt;p&gt;By Ed Davies on &lt;a href=&#34;#comment-1305&#34;&gt;October 11, 2007 3:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One of the intriguing things about the OLPC machines is that they are designed to be pretty good e-book readers, without being &lt;em&gt;just&lt;/em&gt; e-book readers.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/ebooks">ebooks</category>
      
    </item>
    
    <item>
      <title>How college students really use Facebook</title>
      <link>https://www.bobdc.com/blog/how-college-students-really-us/</link>
      <pubDate>Sat, 06 Oct 2007 10:11:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/how-college-students-really-us/</guid>
      
      
      <description><div>&#34;Online community theater&#34;.</div><div>&lt;blockquote class=&#34;pullquote&#34;&gt;It turns out that they don&#39;t need software for social networking.&lt;/blockquote&gt;
&lt;p&gt;An op-ed piece by a recent Dartmouth graduate in today&amp;rsquo;s New York Times titled &lt;a href=&#34;http://www.nytimes.com/2007/10/06/opinion/06mathias.html&#34;&gt;The Fakebook Generation&lt;/a&gt; (registration required) tells us how college students really use Fakebook. It turns out that they don&amp;rsquo;t need software for social networking, because they have dorms and classes and libraries and sports teams and clubs and even parties for that. According to its author,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We log into the Web site because it&amp;rsquo;s entertaining to watch a constantly evolving narrative starring the other people in the library. I&amp;rsquo;ve always thought of Facebook as online community theater. In costumes we customize in a backstage makeup room — the Edit Profile page, where we can add a few Favorite Books or touch up our About Me section — we deliver our lines on the very public stage of friends&amp;rsquo; walls or photo albums. And because every time we join a network, post a link or make another friend it&amp;rsquo;s immediately made visible to others via the News Feed, every Facebook act is a soliloquy to our anonymous audience.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s all comedy: making one another laugh matters more than providing useful updates about ourselves, which is why entirely phony profiles were all the rage before the grown-ups signed in.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(The title of her piece was a bit confusing for someone who attempts to play jazz as a hobby—the term &lt;a href=&#34;http://en.wikipedia.org/wiki/Fake_book&#34;&gt;fakebook&lt;/a&gt; has meant something else for about sixty years, and I thought &amp;ldquo;now there&amp;rsquo;s a generation for it?&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;My Facebook experiment is to do as little as possible with &lt;a href=&#34;http://www.facebook.com/profile.php?id=623756409&#34;&gt;my page&lt;/a&gt; and see what happens. So far I have about twenty friends and have received two zombie invitations. It&amp;rsquo;s always funny to have Facebook tell me that I and someone I&amp;rsquo;ve known for years &amp;ldquo;are now friends&amp;rdquo;; it makes the relationship feel so validated. After my older daughter and I each found out over dinner that the other had a Facebook page, I tried to find hers and couldn&amp;rsquo;t, but I suppose it&amp;rsquo;s a Good Thing that a 47-year-old man can&amp;rsquo;t easily locate a particular teenage girl on Facebook.&lt;/p&gt;
&lt;p&gt;Facebook has also proved useful for sending a message to a friend whose spam filters kept eating my regular email. It&amp;rsquo;s actually a bit troubling to extrapolate this scenario out further, to a time when members-only walled gardens owned by specific corporations become a more reliable way to send email than the public internet.&lt;/p&gt;
&lt;p&gt;We IT-oriented adults each use Facebook for different things, experimental or otherwise. The Times op-ed piece is worth reading to see how Facebook&amp;rsquo;s original audience really uses it.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.TimothyHorrigan.com/2ndlife.html&#34; title=&#34;http://www.TimothyHorrigan.com/2ndlife.html&#34;&gt;Timothy Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-1299&#34;&gt;October 8, 2007 3:42 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For online community theatre, nothin&amp;rsquo; beats Second Life&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>The &#34;DL&#34; in &#34;OWL DL&#34;</title>
      <link>https://www.bobdc.com/blog/the-dl-in-owl-dl/</link>
      <pubDate>Thu, 04 Oct 2007 09:11:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-dl-in-owl-dl/</guid>
      
      
      <description><div>An interesting legacy that contributes many cool things to OWL.</div><div>&lt;p&gt;I drive a Honda Accord EX. To even write that, I had to look at the back of my car to remember the &amp;ldquo;EX&amp;rdquo; part, because it never meant anything to me. I try to remember to mention it when I call the dealership to ask about the availability of some part, because it might matter to them, but it doesn&amp;rsquo;t to me.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;DL&amp;rdquo; in &amp;ldquo;OWL DL&amp;rdquo; used to mean about as much to me as the &amp;ldquo;EX&amp;rdquo; in &amp;ldquo;Honda Accord EX&amp;rdquo;. I always thought of the &lt;a href=&#34;http://www.w3.org/TR/2004/REC-owl-features-20040210/#s1.3&#34;&gt;three versions of the OWL language&lt;/a&gt; as basically being small, medium, and large, and with most OWL tools going with the Goldilocks approach of selecting the one in the middle, I always worked with that one, even if I had no idea where the plural form &amp;ldquo;Description Logics&amp;rdquo; was used. After reading chapter 1 of the Nardi and Brachman 2003 book &amp;ldquo;An Introduction to Description Logics&amp;rdquo; (&lt;a href=&#34;http://www.inf.unibz.it/~franconi/dl/course/dlhb/dlhb-01.pdf&#34;&gt;PDF&lt;/a&gt;) I understand much more what Description Logics bring to OWL. The chapter never actually mentions OWL, but now I understand better what OWL DL has that OWL Lite doesn&amp;rsquo;t, where those extra features came from, and what they buy you. I&amp;rsquo;ll try to summarize the key points here, with ample quotations, and I heartily encourage you to read the whole thing. (I also encourage anyone to correct any misunderstandings in my attempts to summarize.)&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34;&gt;The &#34;DL&#34; in &#34;OWL DL&#34; used to mean about as much to me as the &#34;EX&#34; in &#34;Honda Accord EX&#34;.&lt;/blockquote&gt;
&lt;p&gt;According to the paper&amp;rsquo;s introduction,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Research in the field of knowledge representation and reasoning is usually focused on methods for providing high-level descriptions of the world that can be effectively used to build intelligent applications. In this context, &amp;ldquo;intelligent&amp;rdquo; refers to the ability of a system to find implicit consequences of its explicitly represented knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A good goal, and for me, one of the keys to the value of OWL: finding implicit consequences of explicitly represented knowledge.&lt;/p&gt;
&lt;p&gt;If I understand the paper correctly, Description Logics build on first-order logic, an alternative to the network- and frame-based approaches to knowledge representation that became popular in the 1970s. The second paragraph of &lt;a href=&#34;http://en.wikipedia.org/wiki/First_order_logic&#34;&gt;Wikipedia&amp;rsquo;s entry on First-order logic&lt;/a&gt; gives a good idea of what it does:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While &lt;a href=&#34;http://en.wikipedia.org/wiki/Propositional_logic&#34; title=&#34;Propositional logic&#34;&gt;propositional logic&lt;/a&gt; deals with simple declarative propositions, first-order logic additionally covers &lt;a href=&#34;http://en.wikipedia.org/wiki/Predicate_%28mathematics%29&#34; title=&#34;Predicate (mathematics)&#34;&gt;predicates&lt;/a&gt; and quantification. Take for example the following sentences: &amp;ldquo;Socrates is a man&amp;rdquo;, &amp;ldquo;Plato is a man&amp;rdquo;. In &lt;a href=&#34;http://en.wikipedia.org/wiki/Propositional_logic&#34; title=&#34;Propositional logic&#34;&gt;propositional logic&lt;/a&gt; these will be two unrelated propositions, denoted for example by &lt;em&gt;p&lt;/em&gt; and &lt;em&gt;q&lt;/em&gt;. In first-order logic however, both sentences would be connected by the same property: Man(x), where Man(x) means that x is a man. When x=Socrates we get the first proposition - &lt;em&gt;p&lt;/em&gt;, and when x=Plato we get the second proposition - &lt;em&gt;q&lt;/em&gt;. Such a construction allows for a much more powerful logic when quantifiers are introduced, such as &amp;ldquo;for every x&amp;hellip;&amp;rdquo; - for example, &amp;ldquo;for every x, if Man(x), then&amp;hellip;&amp;rdquo;. Without quantifiers, every valid argument in FOL is valid in propositional logic, and vice versa.&lt;/p&gt;
&lt;/blockquote&gt;
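The quoted Socrates/Plato example translates into a few lines of code; as a rough sketch only, with the domain and the mortal predicate invented here for illustration:

```python
# First-order-logic flavored sketch of the quoted example: one shared
# predicate, Man(x), applied to several individuals, plus a universally
# quantified claim over a small (invented) domain.
domain = {"Socrates", "Plato", "Fido"}
men = {"Socrates", "Plato"}

def man(x):
    return x in men

def mortal(x):
    # in this toy domain, everything happens to be mortal
    return x in domain

# "Socrates is a man" and "Plato is a man" share the predicate man(x):
print(man("Socrates"), man("Plato"))  # True True

# "For every x, if Man(x) then Mortal(x)":
print(all(mortal(x) for x in domain if man(x)))  # True
```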
&lt;p&gt;(The idea of &amp;ldquo;predicates&amp;rdquo; and two sentences being connected by the same property should be familiar to RDF users.) Nardi and Brachman tell us that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Owing to their more human-centered origins, the network-based systems were often considered more appealing and more effective from a practical viewpoint than the logical systems. Unfortunately they were not fully satisfactory because of their usual lack of precise semantic characterization. The end result of this was that every system behaved differently from the others, in many cases despite virtually identical looking components and even identical relationship names.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The article describes DL&amp;rsquo;s &amp;ldquo;humble origins in the late 1970&amp;rsquo;s as a remedy for logical and semantic problems in frame and semantic network representations&amp;rdquo;, although it wasn&amp;rsquo;t always known as Description Logics.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;research in the area of Description Logics began under the label &lt;em&gt;terminological systems&lt;/em&gt;, to emphasize that the representation language was used to establish the basic terminology adopted in the modeled domain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Coming up with a terminology to model a domain! That should be familiar to anyone who has worked on OWL ontologies.&lt;/p&gt;
&lt;p&gt;A particular terminological system is known as a &amp;ldquo;Description Logic&amp;rdquo; (for example, DAML+OIL, as described by &lt;a href=&#34;http://web.comlab.ox.ac.uk/oucl/work/ian.horrocks/Publications/download/2002/ieeede2002.pdf&#34;&gt;this pdf file&lt;/a&gt;) and &amp;ldquo;DL&amp;rdquo; refers to the family of Description Logics. Knowledge Representation Systems based on Description Logics are collectively known as DL-KRS. (Nardi and Brachman also tell us that &amp;ldquo;The ancestor of DL systems is KL-ONE [Brachman and Schmolze, 1985], which signaled the transition from semantic networks to more well-founded terminological (description) logics&amp;rdquo;—can we look forward to a Description Logic named &lt;a href=&#34;https://en.wikipedia.org/wiki/KRS-One&#34;&gt;KRS-One&lt;/a&gt;?)&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s their perspective on something else that makes OWL DL more powerful than OWL Lite—the ability to say that a string quartet has exactly four musicians or that a basketball team has at least five players:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Number restrictions are sometimes viewed as a distinguishing feature of Description Logics, although one can find some similar constructs in some database modeling languages (notably Entity-Relationship models).&lt;/p&gt;
&lt;/blockquote&gt;
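A number restriction boils down to a cardinality check on a role's fillers. Here is a toy sketch of the string quartet and basketball team examples; the function names and ensemble data are invented for illustration, not part of any OWL API:

```python
# Toy sketch of DL number restrictions as cardinality checks on the
# fillers of a role (e.g. hasMember). Data below is invented.
def at_least(n, fillers):
    """Minimum cardinality: a basketball team has at least 5 players."""
    return len(fillers) >= n

def exactly(n, fillers):
    """Exact cardinality: a string quartet has exactly 4 musicians."""
    return len(fillers) == n

quartet = ["violin 1", "violin 2", "viola", "cello"]
team = ["pg", "sg", "sf", "pf", "c"]
print(exactly(4, quartet))  # True
print(at_least(5, team))    # True
```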
&lt;p&gt;The section titled &amp;ldquo;Reasoning&amp;rdquo; begins with this introduction to another important concept that DL brings to OWL:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The basic inference on concept expressions in Description Logics is subsumption, typically written as C &lt;img src=&#34;https://www.bobdc.com/img/main/subsumes.jpg&#34; alt=&#34;subsumes&#34;/&gt; D. Determining subsumption is the problem of checking whether the concept denoted by D (the &lt;em&gt;subsumer&lt;/em&gt;) is considered more general than the one denoted by C (the &lt;em&gt;subsumee&lt;/em&gt;). In other words, subsumption checks whether the first concept always denotes a subset of the set denoted by the second one.&lt;/p&gt;
&lt;p&gt;For example, one might be interested in knowing whether Woman &lt;img src=&#34;https://www.bobdc.com/img/main/subsumes.jpg&#34; alt=&#34;subsumes&#34;/&gt; Mother. In order to verify this kind of relationship one has in general to take into account the relationships defined in the terminology&amp;hellip; Another typical inference on concept expressions is concept &lt;em&gt;satisfiability&lt;/em&gt;, which is the problem of checking whether a concept expression does not necessarily denote the empty concept.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That is, that there is at least one thing that it describes.&lt;/p&gt;
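Over a single concrete dataset, subsumption and satisfiability can be sketched as set operations. This is a big simplification (a real DL reasoner checks what holds in every possible model, not just one set of data), and the individuals below are invented:

```python
# Sketch: subsumption as a subset test over concept extensions, and
# satisfiability as non-emptiness, checked against one invented dataset.
woman = {"Marge", "Lois", "Lisa"}
mother = {"Marge", "Lois"}
unicorn_mother = set()   # a concept with no instances here

# Woman subsumes Mother: every Mother is a Woman.
print(mother.issubset(woman))     # True

# Satisfiability: does the concept denote at least one thing?
print(len(mother) > 0)            # True
print(len(unicorn_mother) > 0)    # False
```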
&lt;p&gt;The concepts of ABox and TBox used to confuse me, but I have a clearer idea now:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A DL knowledge base is analogously typically comprised by two components—&amp;ldquo;TBox&amp;rdquo; and an &amp;ldquo;ABox&amp;rdquo;. The TBox contains intensional knowledge in the form of a terminology (hence the term &amp;ldquo;TBox&amp;rdquo;, but &amp;ldquo;taxonomy&amp;rdquo; could be used as well) and is built through declarations that describe general properties of concepts&amp;hellip; The ABox contains extensional knowledge—also called assertional knowledge (hence the term &amp;ldquo;ABox&amp;rdquo;)—knowledge that is specific to the individuals of the domain of discourse.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;ABox assertions are about Individuals in the data (what object-oriented developers call &amp;ldquo;instances&amp;rdquo;—the actual data, as opposed to metadata) and TBox assertions build the terminology used to describe the classes. The assertion/taxonomy part makes for a nice mnemonic to keep ABoxes and TBoxes straight, similar to the &lt;a href=&#34;http://www.bartelby.org/68/97/5697.html&#34;&gt;ceiling/ground&lt;/a&gt; trick for remembering the difference between stalactites and stalagmites.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The basic reasoning task in an ABox is &lt;em&gt;instance checking&lt;/em&gt;, which verifies whether a given individual is an instance of (belongs to) a specified concept. Although other reasoning services are usually considered and employed, they can be defined in terms of instance checking. Among them we find &lt;em&gt;knowledge base consistency&lt;/em&gt;, which amounts to verifying whether every concept in the knowledge base admits at least one individual; &lt;em&gt;realization&lt;/em&gt;, which finds the most specific concept an individual object is an instance of; and &lt;em&gt;retrieval&lt;/em&gt;, which finds the individuals in the knowledge base that are instances of a given concept. These can all be accomplished by means of instance checking.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&amp;ldquo;Retrieval&amp;rdquo; sounds simple enough when you think in terms of a traditional database application, but when you use implicit as well as explicit knowledge to find individuals that are instances of a given concept, it&amp;rsquo;s not so simple anymore. In fact,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The presence of individuals in a knowledge base makes reasoning more complex from a computational viewpoint [Donini et al., 1994b], and may require significant extensions of some TBox reasoning techniques.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No wonder ontologies with no associated &amp;ldquo;&lt;a href=&#34;https://www.bobdc.com/blog/my-new-favorite-typo&#34;&gt;meatdata&lt;/a&gt;&amp;rdquo; are so popular!&lt;/p&gt;
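The TBox/ABox split and the retrieval task described above can be sketched concretely. The classes and individuals below are invented, and real reasoners do far more than this simple walk up a subclass hierarchy, but it shows how an individual can be retrieved as an instance of a concept it was never explicitly asserted to belong to:

```python
# Toy TBox (terminology: subclass-of edges) and ABox (assertions about
# individuals). All names are invented example data.
tbox = {"Mother": "Woman", "Woman": "Person"}
abox = {"Marge": "Mother", "Bart": "Person"}

def classes_of(individual):
    """All classes an individual belongs to, explicit and implied."""
    c = abox[individual]
    result = {c}
    while c in tbox:      # walk up the subclass hierarchy
        c = tbox[c]
        result.add(c)
    return result

def retrieve(concept):
    """Retrieval: every individual that is an instance of the concept."""
    return sorted(i for i in abox if concept in classes_of(i))

print(retrieve("Person"))  # ['Bart', 'Marge'] -- Marge only implicitly
print(retrieve("Woman"))   # ['Marge']
```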
&lt;blockquote&gt;
&lt;p&gt;Later, an approach that viewed the DL more as a component became evident; in this view the DL system acts as a component of a larger environment, typically leaving out functions, such those [sic] for data management, that are more effectively implemented by other technologies. The architecture where the component view is taken requires the definition of a clear interface between the components, possibly adopting different modeling languages, but focusing on Description Logics for the implementation of the reasoning services that can add powerful capabilities to the application.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is where the real promise of this technology lies: plugging in these &amp;ldquo;reasoning services&amp;rdquo; to the many existing systems that do data management, whether the latest RDF triplestore or a straightforward MySQL or Oracle RDBMS. The paper elaborates on the relationship to database managers again later:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;Description Logics as modeling languages overlap to a large extent with other modeling languages developed in fields such as Programming languages and Database Management. While we shall focus on this relationship later, we recall here that, when compared to modeling languages developed in other fields the characteristic feature of Description Logics is in the reasoning capabilities that are associated with it. In other words, we believe that, while modeling has general significance, the capability of exploiting the description of the model to draw conclusions about the problem at hand is a particular advantage of modeling using Description Logics.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;DL has another important connection to traditional database development:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In addition, Description Logics provide a formal framework that has been shown to be rather close to the languages used in semantic data modeling, such as the Entity-Relationship Model [Calvanese et al., 1998g]. Description Logics are equipped with reasoning tools that can bring to the conceptual modeling phase significant advantages, as compared with traditional languages, whose role is limited to modeling. For instance, by using concept consistency one can verify at design time whether an entity can have at least one instance, thus clearly saving all the difficulties arising from discovering such a situation when the database is being populated [Borgida, 1995].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper later describes a particular Description Logic and &amp;ldquo;a precise correspondence between the chosen DL and the Entity-Relationship model&amp;rdquo;. There&amp;rsquo;s an even more obvious correspondence to another popular concept in system development; the section on &amp;ldquo;Relationship to other fields of Computer Science&amp;rdquo; tells us that&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;the underlying ideas of concept/class and hierarchical structure based upon the generality and specificity of a set of classes have appeared in many other field [sic] of Computer Science, such as Database Management and Programming Languages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The article puts other related work into historical perspective, such as DAML+OIL, a language that derives from&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DAML-ONT [McGuinness et al., 2002], an ontology language for the Web inspired by object-oriented and frame-based languages, and OIL [Fensel et al., 2001], with a similar goal of expressing ontologies, but with a closer connection to Description Logics.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The appeal of using DL with more W3C-oriented technologies such as RDF and XML becomes more apparent here:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A more recent use of Description Logics is concerned with so-called &amp;ldquo;semi-structured&amp;rdquo; data models [Calvanese et al., 1998c], which are being proposed in order to overcome the difficulties in treating data that are not structured in a relational form, such as data on the Web, data in spreadsheets, etc. In this area Description Logics are sufficiently expressive to represent models and languages that are being used in practice, and they can offer significant advantages over other approaches because of the reasoning services they provide.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The use with RDBMS and &amp;ldquo;semi-structured&amp;rdquo; data leads to what could be the biggest payoff of all for DL:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another problem that has recently increased the applicability of Description Logics is information integration. As already remarked, data are nowadays available in large quantities and from a variety of sources. Information integration is the task of providing a unique coherent view of the data stored in the sources available. In order to create such a view, a proper relationship needs to be established between the data in the sources and the unified view of the data. Description Logics not only have the expressiveness needed in order to model the data in the sources, but their reasoning services can help in the selection of the sources that are relevant for a query of interest, as well as to specify the extraction process [Calvanese et al., 2001c].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper often returns to the theme that more than most knowledge representation systems, Description Logics were designed to be something that could be implemented—and not only implemented, but designed in such a way that specific features could have their effect on implementation efficiency measured and evaluated:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The way Description Logics were able to separate out the structure of concepts and roles into simple term-forming operators opened the door to extensive analysis of a broad family of languages. One could add and subtract these operators from the language and explore both the computational ramifications and the relationship of the resulting language to other formal languages in Computer Science, such as modal logics and data models for database systems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Or, as they say in their conclusion,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Perhaps, the most important aspect of work on Description Logics has been the very tight coupling between theory and practice. The exemplary give-and-take between the formal, analytical side of the field and the pragmatic, implemented side—notable throughout the entire history of Description Logics—has been a role model for other areas of AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The great part about it for the rest of us is that existing implementations of this knowledge representation system are not badly-documented demos sitting on a computer at the former university of whoever wrote the paper several years ago (which is often the case with knowledge representation implementations), but W3C-standardized open source software with a wide community of practice. So get out there and play with OWL DL, and perhaps Nardi and Brachman&amp;rsquo;s paper can give you the background to get even more out of these tools.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://mhausenblas.blogr.com&#34; title=&#34;http://mhausenblas.blogr.com&#34;&gt;Michael Hausenblas&lt;/a&gt; on &lt;a href=&#34;#comment-1293&#34;&gt;October 4, 2007 3:44 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Indeed this is an excellent book. However, I must admit that I actually grokked the whole KR/DL/etc. stuff not precisely right after this book, but rather after reading an excellent article by Levesque and Brachman [1].&lt;/p&gt;
&lt;p&gt;Please tell me what you think in case you read it ;)&lt;/p&gt;
&lt;p&gt;Cheers,&lt;br /&gt;
Michael&lt;/p&gt;
&lt;p&gt;[1] &lt;a href=&#34;http://www4.wiwiss.fu-berlin.de/dblp/page/record/journals/ci/LevesqueB87&#34;&gt;http://www4.wiwiss.fu-berlin.de/dblp/page/record/journals/ci/LevesqueB87&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1294&#34;&gt;October 4, 2007 7:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks, I&amp;rsquo;ll read it if I can find it&amp;ndash;all I can find is metadata about it. Is the whole paper available online anywhere?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Converting an XML document&#39;s encoding</title>
      <link>https://www.bobdc.com/blog/converting-an-xml-documents-en/</link>
      <pubDate>Fri, 28 Sep 2007 08:16:07 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/converting-an-xml-documents-en/</guid>
      
      
      <description><div>With a very brief XSLT stylesheet.</div><div>&lt;p&gt;A colleague recently asked about converting a collection of XML documents to the US-ASCII encoding (that is, to documents where everything is either a US ASCII character or a numeric character reference such as &amp;amp;#233; for the é character). I have several utility stylesheets for converting the encoding of XML documents, and a slight change to one of them gave me a new version that would create a US-ASCII version of any XML document:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot; version=&amp;quot;1.0&amp;quot;&amp;gt;


&amp;lt;xsl:output encoding=&amp;quot;us-ascii&amp;quot;/&amp;gt;


&amp;lt;xsl:template match=&amp;quot;@*|node()&amp;quot;&amp;gt;
  &amp;lt;xsl:copy&amp;gt;
    &amp;lt;xsl:apply-templates select=&amp;quot;@*|node()&amp;quot;/&amp;gt;
  &amp;lt;/xsl:copy&amp;gt;
&amp;lt;/xsl:template&amp;gt;


&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The stylesheet&amp;rsquo;s single template rule copies every node in the source document tree to the result document tree unchanged. The key to the stylesheet is the value of the &lt;code&gt;xsl:output&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;encoding&lt;/code&gt; attribute, which specifies the encoding to use when writing the result tree to a file. My similar stylesheets, which have names like latin1out.xsl and utf8out.xsl, are identical except for this &lt;code&gt;encoding&lt;/code&gt; attribute&amp;rsquo;s value.&lt;/p&gt;
&lt;p&gt;Your choices for what to put in this attribute are limited to what your XSLT processor can handle. While &lt;a href=&#34;http://xml.apache.org/xalan-j/&#34;&gt;Xalan&lt;/a&gt; is usually my third favorite XSLT processor, with Xerces as the XML parser underneath it can read and write &lt;a href=&#34;http://xerces.apache.org/xerces-j/faq-general.html#faq-8&#34;&gt;quite a few encodings&lt;/a&gt;. (I know it&amp;rsquo;s possible to tell &lt;a href=&#34;http://saxon.sourceforge.net/&#34;&gt;Saxon&lt;/a&gt; to use Xerces instead of the &lt;a href=&#34;http://saxon.sourceforge.net/aelfred.html&#34;&gt;Aelfred&lt;/a&gt; parser that it usually uses, but I&amp;rsquo;m too lazy to figure out how.)&lt;/p&gt;
&lt;p&gt;So if you need a simple tool to convert the encoding of one or more XML documents, find the encoding name in &lt;a href=&#34;http://xerces.apache.org/xerces-j/faq-general.html#faq-8&#34;&gt;this list&lt;/a&gt; (if the one you pick has a name in parentheses on that list, use that), make it the value of the &lt;code&gt;encoding&lt;/code&gt; attribute in the &lt;code&gt;xsl:output&lt;/code&gt; element above, and you&amp;rsquo;ll have a stylesheet that converts any well-formed XML document to that encoding.&lt;/p&gt;
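&lt;p&gt;If you don&amp;rsquo;t have an XSLT processor handy, the same basic conversion can be sketched with nothing but Python&amp;rsquo;s standard library (a rough equivalent I put together to illustrate the idea, not a replacement for the stylesheet): when ElementTree serializes to an encoding that can&amp;rsquo;t represent a character, it writes a numeric character reference instead.&lt;/p&gt;

```python
# Minimal sketch: serialize a document as US-ASCII with Python's
# standard library. The element is built in code here just to keep the
# example self-contained; normally you would parse an existing file.
import io
import xml.etree.ElementTree as ET

root = ET.Element("doc")
root.text = "caf\u00e9"   # contains e-acute, a non-ASCII character

buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="us-ascii", xml_declaration=True)
print(buf.getvalue().decode("ascii"))
```

&lt;p&gt;The output declares the us-ascii encoding and writes the é as the numeric character reference &amp;amp;#233;, just as the stylesheet above does.&lt;/p&gt;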
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By Dave Holden on &lt;a href=&#34;#comment-1278&#34;&gt;September 28, 2007 10:07 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;xmllint &lt;a href=&#34;http://xmlsoft.org/xmllint.html&#34;&gt;http://xmlsoft.org/xmllint.html&lt;/a&gt; has an &amp;ldquo;encode&amp;rdquo; option which is also useful for this.&lt;/p&gt;
&lt;p&gt;By David Carlisle on &lt;a href=&#34;#comment-1279&#34;&gt;September 28, 2007 11:00 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s possibly marginally quicker to use the single template&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:template match=&amp;quot;/&amp;quot;&amp;gt;
 &amp;lt;xsl:copy-of select=&amp;quot;.&amp;quot;/&amp;gt;
&amp;lt;/xsl:template&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to save the system doing a template match at every level&lt;/p&gt;
&lt;p&gt;Also, if you use xslt2 you can add omit-xml-declaration=&amp;ldquo;yes&amp;rdquo;&lt;br /&gt;
which is useful, especially with US-ASCII encoding, as you get the benefit of using ascii without the potential drawback of the document being rejected for an unknown encoding. (If I recall correctly, early msxml systems wanted &amp;ldquo;ASCII&amp;rdquo; rather than &amp;ldquo;US-ASCII&amp;rdquo;, and just putting nothing, letting the receiving system default to utf-8, works fine for ASCII documents.)&lt;/p&gt;
&lt;p&gt;David&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1280&#34;&gt;September 28, 2007 11:18 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;David,&lt;/p&gt;
&lt;p&gt;Nice idea, thanks. The template rule I did have is part of my starting point when I&amp;rsquo;m creating a new stylesheet, but yours is terser. (I try to avoid matching on &amp;ldquo;/&amp;rdquo; because there have been too many times when it&amp;rsquo;s come back to bite me after I added the document() function somewhere in the stylesheet.)&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>The 13th Floor Elevators</title>
      <link>https://www.bobdc.com/blog/the-13th-floor-elevators/</link>
      <pubDate>Fri, 21 Sep 2007 08:46:16 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-13th-floor-elevators/</guid>
      
      
      <description><div>Lip-syncing at a pool party!</div><div>&lt;p&gt;When people started retroactively applying the term &amp;ldquo;psychedelic punk&amp;rdquo; in the late seventies, the late sixties Texas band the 13th Floor Elevators were among the first to earn the title. Tom Verlaine&amp;rsquo;s band Television covered &amp;ldquo;Fire Engine&amp;rdquo; live, but the Texas band&amp;rsquo;s most well-known song (mostly because of the &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=B000E6ET1G/bobducharmeA/&#34;&gt;Nuggets&lt;/a&gt; compilation that everyone had) was &amp;ldquo;You&amp;rsquo;re Gonna Miss Me&amp;rdquo;. I&amp;rsquo;m not sure whether it&amp;rsquo;s annoying or just trippier that Dell used it in an ad recently, but it was a thrill for me to see &lt;a href=&#34;http://www.youtube.com/watch?v=lWCmx1QQgIQ&#34;&gt;them lip synching it at a pool party&lt;/a&gt; on YouTube.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/lWCmx1QQgIQ&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;They look like sixties frat boys, but Roky Erickson looks like a pretty intense frat boy, and he apparently got weirder and weirder from that point on, eventually ending up in a hospital for the criminally insane. According to &lt;a href=&#34;http://en.wikipedia.org/wiki/Roky_Erickson&#34;&gt;Wikipedia&lt;/a&gt;, Texas musicians ranging from Butthole Surfer King Coffey to ZZ Top&amp;rsquo;s Billy Gibbons have worked to get Erickson good treatment and a renewed musical career recently.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.rokyerickson.net/&#34;&gt;&lt;img src=&#34;http://www.tenpinmgt.com/picture_library/roky.jpg&#34; alt=&#34;[Roky Erickson picture]&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://timothyhorrigan.com&#34; title=&#34;http://timothyhorrigan.com&#34;&gt;Tim Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-1261&#34;&gt;September 22, 2007 11:21 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think I spotted three celebrity cameos: Jerry Lewis and Condi Rice are bopping and grooving poolside, and Dee Dee Ramone is a-rockin&amp;rsquo; in the rocking chair off to one side.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Metadata data entry</title>
      <link>https://www.bobdc.com/blog/metadata-data-entry/</link>
      <pubDate>Tue, 18 Sep 2007 23:07:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/metadata-data-entry/</guid>
      
      
      <description><div>Who (or what), and why.</div><div>&lt;p&gt;How do we assign metadata to data? Ontologies often say &amp;ldquo;here is some information about the metadata we&amp;rsquo;d like to have for our data&amp;rdquo;, but the actual assignment of metadata that conforms to an ontology is usually more work than developing the ontology. Who assigns this metadata, and why do they do it? You have three choices: people who do it because they&amp;rsquo;re paid to, people who do it because they want to, and automated processes. I&amp;rsquo;m reading up on doing it with automated processes and will hopefully be reporting on this soon.&lt;/p&gt;
&lt;p&gt;For now, the second choice is the most interesting because it&amp;rsquo;s the newest and people are still trying to get a handle on it. (Don&amp;rsquo;t miss the &lt;a href=&#34;http://blogs.talis.com/nodalities/2007/08/thomas_vander_wal_talks_with_t.php&#34;&gt;Talis interview with Thomas Vander Wal&lt;/a&gt;, who coined the term &amp;ldquo;folksonomies&amp;rdquo; and has thought very hard about their potential relationship to taxonomies.) In the early days of folksonomies, some didn&amp;rsquo;t like the idea that the assigned metadata might not conform to a specific taxonomy or ontology. Folksonomies trade query precision (if you can&amp;rsquo;t know all the terms that may have been assigned to a resource, you can&amp;rsquo;t be sure of finding it) for something that&amp;rsquo;s often more valuable: a lower barrier to getting volunteers to do free data entry. It reminds me a bit of Tim Berners-Lee&amp;rsquo;s acceptance of broken links when he designed a large-scale hypertext system: by questioning one of the original requirements for system integrity, you can end up with a large, inexpensive system that still helps people retrieve much of the information that they want.&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34;&gt;Folksonomies remind me of Tim Berners-Lee&#39;s acceptance of broken links.&lt;/blockquote&gt;
&lt;p&gt;The success of folksonomies and the web doesn&amp;rsquo;t prove the original &amp;ldquo;requirements&amp;rdquo; to be wrong; it just proves that similar systems that don&amp;rsquo;t meet those particular requirements can still be useful. Systems that do meet those requirements can be even more useful, but they&amp;rsquo;re more expensive to create and hence to use. Attorneys searching the legal cases stored in LexisNexis or Westlaw have every right to expect that all the links work and that the keywords assigned to each case belong to a carefully maintained taxonomy. This &lt;em&gt;is&lt;/em&gt; in their requirements, because if you plan to tell a judge &amp;ldquo;you should rule in favor of my client because another judge ruled in favor of a plaintiff in a nearly identical situation, and no one&amp;rsquo;s ever overturned that ruling&amp;rdquo;, you want to be really, really sure that no one&amp;rsquo;s ever overturned it. That&amp;rsquo;s why these products are expensive, and that&amp;rsquo;s why people cheering for comprehensive free online versions of the law (which I&amp;rsquo;m all for) don&amp;rsquo;t realize which features they&amp;rsquo;ll be giving up if they switch to the free ones.&lt;/p&gt;
&lt;p&gt;How do you get metadata assigned to a large volume of resources? Letting people assign arbitrary keywords might be almost enough, but you need to provide them with a little more incentive, such as making their own resources easier to find through the use of their own tags—for example, I can search for my own &lt;a href=&#34;http://www.flickr.com/photos/bobdc&#34;&gt;pictures on flickr&lt;/a&gt; or &lt;a href=&#34;http://del.icio.us/bobdc&#34;&gt;bookmarks on del.icio.us&lt;/a&gt; by searching the tags that I created for them. (Another incentive, which doesn&amp;rsquo;t play too well into the data integrity angle, is letting people assign silly funny metadata—check out the &lt;a href=&#34;http://www.amazon.com/Playing-Fire-Kevin-Federline/dp/tags-on-product/B000IU3YLY/ref=tag_dpp_cust_edpp_sa/105-7782200-2746866?ie=UTF8&amp;amp;qid=1185935519&amp;amp;sr=8-1&#34;&gt;tags assigned&lt;/a&gt; to Kevin &amp;ldquo;Mr. Britney Spears&amp;rdquo; Federline&amp;rsquo;s CD on Amazon).&lt;/p&gt;
&lt;p&gt;Paying other people to tag resources with keywords has the advantage of ensuring that the metadata conforms to a worked-out structure. This makes the metadata (and hence the data) more valuable and isn&amp;rsquo;t always as expensive as you might think, even when you need specific subject expertise in your metadata assigners. (Please excuse the brief plug for my employer, &lt;a href=&#34;http://www.innodata-isogen.com/services/data_capture_and_data_entry&#34;&gt;Innodata Isogen&lt;/a&gt;, and contact me if you&amp;rsquo;d like to know more.)&lt;/p&gt;
&lt;p&gt;While some metadata is free, such as the size of a file and the last time it was edited, creation of new metadata is never completely free. If you&amp;rsquo;re not paying people outright, you must come up with and then implement some system that makes people &lt;em&gt;want&lt;/em&gt; to do your metadata data entry without being paid. What if someone created a site that let users make up tags, and no one did so? There are plenty of examples of such sites, which jumped on the Web 2.0 bandwagon as if letting users tag contents was some sort of silver bullet. Despite what David Weinberger writes in &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0805080430/bobducharmeA/&#34;&gt;Everything is Miscellaneous&lt;/a&gt;, simply letting people add metadata isn&amp;rsquo;t enough. They need an incentive.&lt;/p&gt;
&lt;p&gt;If you don&amp;rsquo;t want to pay someone to paint your fence, you can do what Tom Sawyer did and convince others that it&amp;rsquo;s a fun privilege for them to do it for you. Weinberger and &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-project-ideas-num&#34;&gt;Don Tapscott&lt;/a&gt; help lead cheers that metadata data entry is fun, but it seems to be the most fun for people making fun of Kevin Federline, and if it doesn&amp;rsquo;t make your life more fun, it better do something to make your life easier. Coming up with that incentive is the real silver bullet, if you want to avoid writing a check for human labor or automated systems to do this work for you.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.jenitennison.com&#34; title=&#34;http://www.jenitennison.com&#34;&gt;Jeni Tennison&lt;/a&gt; on &lt;a href=&#34;#comment-1251&#34;&gt;September 19, 2007 2:48 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Another way is to make capturing metadata fun or competitive. See &lt;a href=&#34;http://www.espgame.org/&#34;&gt;The ESP Game&lt;/a&gt; for example, or &lt;a href=&#34;http://www.peekaboom.org/&#34;&gt;Peekaboom&lt;/a&gt;. These both come from &lt;a href=&#34;http://www.cs.cmu.edu/~biglou/&#34;&gt;Louis von Ahn&lt;/a&gt; and his team.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1252&#34;&gt;September 19, 2007 5:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yes, there was an &lt;a href=&#34;http://www.wired.com/techbiz/it/magazine/15-07/ff_humancomp&#34;&gt;interesting article about von Ahn&lt;/a&gt; in Wired recently.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Using Word for command line conversion of DOC files to XML</title>
      <link>https://www.bobdc.com/blog/using-word-for-command-line-co/</link>
      <pubDate>Fri, 14 Sep 2007 09:11:03 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-word-for-command-line-co/</guid>
      
      
      <description><div>Or to RTF, or to whatever.</div><div>&lt;p&gt;I&amp;rsquo;ve &lt;a href=&#34;http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html&#34;&gt;written before&lt;/a&gt; about using OpenOffice to convert Microsoft Office files to OpenOffice files (and hence XML) with a shell prompt command that starts up OpenOffice with the MS Office file, does a Save As, and then quits OpenOffice. Because it can be done from the command line, this makes conversion of multiple files with a batch file or shell script much easier.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/word2xml.jpg&#34; alt=&#34;[Word and Word XML icons]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;I recently had to do the same thing with Word to convert Word files to MS XML, and it turned out to be similar: you write a macro that does the SaveAs and then quits, and you start up Word from the command line naming the file to convert and the macro to do the conversion.&lt;/p&gt;
&lt;p&gt;The macro I wrote yesterday could use some refinement, but it works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Sub SaveAsXML()
NewFilename = (Replace(ActiveDocument.FullName, &amp;quot;.doc&amp;quot;, &amp;quot;.xml&amp;quot;))
ActiveDocument.SaveAs FileName:=NewFilename, FileFormat:=wdFormatXML
Application.Quit
End Sub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(It seems like I have to write a bit of VB code about every three years, so with any luck that&amp;rsquo;s it until 2010. I was sorry to hear that in my nephew&amp;rsquo;s first year at the University of Kansas, the &amp;ldquo;Intro to Programming&amp;rdquo; course uses VB. As I said to my sister, &amp;ldquo;But you&amp;rsquo;re not living in a Seattle suburb anymore!&amp;rdquo;) If you want this to save as something other than XML, see &lt;a href=&#34;http://infotools.ru/products/AXAPI/BRIZLGWORD/BRIZLGWORD/_DocSaveAs.htm&#34;&gt;the other options for the FileFormat parameter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My word2xml.bat batch file to tell Word to start up with a given file and run the macro looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;C:\Program Files\Microsoft Office\OFFICE11\winword&amp;quot; %1 /mSaveAsXML 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are &lt;a href=&#34;http://support.microsoft.com/kb/210565&#34;&gt;other command line options&lt;/a&gt; for winword.exe besides /m, but none looked very interesting to me.&lt;/p&gt;
&lt;p&gt;As with my command line trick for converting MS Office files to OpenOffice files, this technique can get filed with quick and dirty perl scripts: if you have a batch of files that need a one-time conversion some afternoon, it&amp;rsquo;s great, but it&amp;rsquo;s not really fast. If you&amp;rsquo;re building a production system that needs to perform this conversion every day, there are other options that will be more complex to set up but will run more quickly, because they won&amp;rsquo;t require starting up and shutting down the word processor for every document.&lt;/p&gt;
&lt;p&gt;As far as what to do with the Word XML files once I have them, well, &lt;a href=&#34;https://www.bobdc.com/blog/more-on-words-mediocre-xml&#34;&gt;don&amp;rsquo;t get me started&amp;hellip;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;8-comments&#34;&gt;8 Comments&lt;/h2&gt;
&lt;p&gt;By marcelo on &lt;a href=&#34;#comment-1236&#34;&gt;September 14, 2007 11:27 AM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The macro I wrote yesterday could use some refinement,&lt;br /&gt;
but it works:&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;thank you for the code&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As far as what to do with the Word XML files once I&lt;br /&gt;
have them, well, don&amp;rsquo;t get me started&amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;when will you give us another Word XML review? i found the old ones very insightful&lt;/p&gt;
&lt;p&gt;greetings&lt;/p&gt;
&lt;p&gt;marcelo&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1237&#34;&gt;September 14, 2007 11:45 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Marco!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;when will you give us another Word XML review?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let me put it this way: I look hard at certain technology for fun, and at other technology only because it&amp;rsquo;s related to something I&amp;rsquo;m being paid to do.&lt;/p&gt;
&lt;p&gt;Word XML does not fall in the &amp;ldquo;fun&amp;rdquo; category.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://evanlenz.net/blog&#34; title=&#34;http://evanlenz.net/blog&#34;&gt;Evan Lenz&lt;/a&gt; on &lt;a href=&#34;#comment-1238&#34;&gt;September 14, 2007 10:26 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;Maybe it&amp;rsquo;s because I already invested about a year of my life into WordML because I was paid to do it (writing for O&amp;rsquo;Reilly), but I think processing WordML can be fun too. It certainly gives me a lot of tough, real-world problems to try out XSLT 2.0&amp;rsquo;s more advanced facilities on. WordML&amp;rsquo;s format in itself isn&amp;rsquo;t terribly nice in general, and I touched on some of its idiosyncrasies in the Office 2003 XML book, but it does have a certain consistency to it. Also, I&amp;rsquo;ve found XML-config-file-driven invocations of xsl:for-each-group to be a very powerful, generic way of reconstituting the hierarchy that&amp;rsquo;s implicit in the relationship of flat lists of paragraph styles.&lt;/p&gt;
&lt;p&gt;Evan&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.aticorp.org&#34; title=&#34;http://www.aticorp.org&#34;&gt;Lynwood Hines&lt;/a&gt; on &lt;a href=&#34;#comment-1295&#34;&gt;October 5, 2007 9:17 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;&amp;hellip;there are some other options that will be more complex to set up but will run more quickly because they won&amp;rsquo;t require starting up and shutting down the word processor for every document&amp;rdquo;&lt;/p&gt;
&lt;p&gt;TEASE!!! What are these other options? I need to do this on hundreds of files on a regular basis. If you can&amp;rsquo;t write an article on how to accomplish this efficiently, could you drop some breadcrumbs to help me research it further?&lt;/p&gt;
&lt;p&gt;Also, many thanks for writing this article; it gets me one BIG step closer to a solution.&lt;/p&gt;
&lt;p&gt;Lynwood Hines&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1296&#34;&gt;October 5, 2007 10:27 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Lynwood,&lt;/p&gt;
&lt;p&gt;These other methods would involve telling an existing running process to open a file, save it as XML, close it, and then move on to the next file. If you were going to have OpenOffice do this, &lt;a href=&#34;http://api.openoffice.org&#34;&gt;http://api.openoffice.org&lt;/a&gt; would be a place to start, but I would look through and ask questions on the appropriate OpenOffice mailing list before I started serious coding.&lt;/p&gt;
&lt;p&gt;To do this with Word, I&amp;rsquo;m sure there&amp;rsquo;s some VB or C# way to tell a running Word instance to do this, either via COM from outside the instance, or with some macro from inside of Word. For example, the macro might look in a certain directory and convert everything it finds there. As with OpenOffice, I&amp;rsquo;m sure there&amp;rsquo;s a mailing list out there where you can find people to give you some more specific tips. I haven&amp;rsquo;t done it myself, but this is what would guide my research.&lt;/p&gt;
&lt;p&gt;By Lynwood Hines on &lt;a href=&#34;#comment-1298&#34;&gt;October 8, 2007 9:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Thank you for the quick response. I&amp;rsquo;ll look into the OA api and COM communication approaches first. If I come up with a useful recipe I&amp;rsquo;ll post it back here.&lt;/p&gt;
&lt;p&gt;LH&lt;/p&gt;
&lt;p&gt;By Lynwood Hines on &lt;a href=&#34;#comment-1312&#34;&gt;October 13, 2007 3:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I figured out how to use OLE to tell Word to convert documents to text format. This approach would only require a minor tweak to generate XML, RTF, or any other supported format.&lt;/p&gt;
&lt;p&gt;The example below is written in Perl and uses the Win32::OLE Perl package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;use Win32::OLE qw(in with);
use Win32::OLE::Const;
use Win32::OLE::Const 'Microsoft Word';

# Instantiate our very own MS Word process:
$wordApp = Win32::OLE-&amp;gt;new('Word.Application', 'Quit');
$wordApp-&amp;gt;{Visible} = 1;  # Set to 0 to hide

# Load &amp;quot;foo.doc&amp;quot; into our instance of MS Word. Terminate
# with an error message if something goes awry:
$wordApp-&amp;gt;Documents-&amp;gt;Open(&amp;quot;foo.doc&amp;quot;)
  or die(&amp;quot;Unable to open Word document: &amp;quot;, Win32::OLE-&amp;gt;LastError());

# Save the file as a text file. Delete the destination text file first
# so we don't have to contend with an overwrite warning window in Word:
unlink &amp;quot;foo.txt&amp;quot; if (-e &amp;quot;foo.txt&amp;quot;);
$wordApp-&amp;gt;ActiveDocument-&amp;gt;SaveAs
  ({
    FileName   =&amp;gt; &amp;quot;foo.txt&amp;quot;,
    FileFormat =&amp;gt; wdFormatDOSTextLineBreaks
  });

# Close the document, but leave Word running to speed future conversions:
$wordApp-&amp;gt;ActiveDocument-&amp;gt;Close();

# When you are finished doing conversions, kill the Word instance:
$wordApp-&amp;gt;Quit;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1313&#34;&gt;October 13, 2007 3:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Looks great, thanks!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Command prompt as an IM session with my computer?</title>
      <link>https://www.bobdc.com/blog/command-prompt-as-an-im-sessio/</link>
      <pubDate>Wed, 12 Sep 2007 08:40:41 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/command-prompt-as-an-im-sessio/</guid>
      
      
      <description><div>Cryptic abbreviations and scrolling text: not so old-fashioned after all.</div><div>&lt;img src=&#34;https://www.bobdc.com/img/main/cprompt.jpg&#34; alt=&#34;[C prompt]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;My daughters have always thought that the &amp;ldquo;black box&amp;rdquo; that I use so much (the command prompt shell) was hilarious. I type my cryptic little abbreviations, press Enter, and then more text goes scrolling up the window and off the top. I tell them how in the early days of PCs, and even pre-PC computers, the whole computer screen was just one big version of that window, and how before that people did the same thing with a keyboard and a printer, and the girls roll their eyes and patronizingly say &amp;ldquo;great, Dad!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Once, after IMing with a colleague about some work-related issue, I wanted to make my thumb drive the default drive in my command window, and I accidentally sent the colleague the instant message &amp;ldquo;f:&amp;rdquo;. Then I realized: the command prompt window is like IMing with your operating system. I pointed out to my daughters that one of their favorite applications had plenty in common with my command prompt window: they type their cryptic abbreviations (ROTFL!) and then wait for the response, which is text scrolling up and off the top of the window. I think their response was something along the lines of &amp;ldquo;great, Dad!&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.triplescape.com&#34; title=&#34;http://blog.triplescape.com&#34;&gt;Brian&lt;/a&gt; on &lt;a href=&#34;#comment-1231&#34;&gt;September 12, 2007 9:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;..:: \/\/31c0/\/\3 2 l33t ohS ::..&lt;/p&gt;
&lt;p&gt;$ d3l t.txt&lt;br /&gt;
OMG! r34lly d3l3t3 t.txt? y3s&lt;br /&gt;
w00t! i p0wn&amp;rsquo;d t.txt!&lt;/p&gt;
&lt;p&gt;$ ct4 stdout.log&lt;br /&gt;
n00b. wahts ct4&lt;/p&gt;
&lt;p&gt;$ c4t l0lkatz.l0g&lt;br /&gt;
i c4n haz ch33zb3rgr? i c4n haz ch33zb3rgr? i c4n haz ch33zb3rgr? i c4n haz ch33zb3rgr? i c4n haz ch33zb3rgr? i c4n haz ch33zb3rgr? i c4n haz ch33zb3rgr? i c4n haz ch33zb3rgr?&lt;/p&gt;
&lt;p&gt;$&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://xmlhacker.com/&#34; title=&#34;http://xmlhacker.com/&#34;&gt;M. David Peterson&lt;/a&gt; on &lt;a href=&#34;#comment-1232&#34;&gt;September 12, 2007 11:01 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Isn&amp;rsquo;t it funny how the more things change, the more they stay the same? What amazes me is how much eye-candy we have at our disposal these days, and yet by our very nature the primary thing we have interest in is the tools that enable us to communicate with one another using the most basic of all interfaces&amp;hellip;&lt;/p&gt;
&lt;p&gt;A command prompt. :)&lt;/p&gt;
&lt;p&gt;Great story! :D&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Tracking the Semantic Web Strategies conference</title>
      <link>https://www.bobdc.com/blog/tracking-the-semantic-web-stra/</link>
      <pubDate>Sat, 08 Sep 2007 10:13:32 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/tracking-the-semantic-web-stra/</guid>
      
      
      <description><div>A taxonomy with two categories.</div><div>&lt;p&gt;The Semantic Web Strategies conference is just a few weeks away, although there are still a few days before September 12th to get the early registration rate.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.semanticwebstrategies.com/index.php&#34;&gt;&lt;img src=&#34;http://www.semanticwebstrategies.com/images/logo_SWS_hdr.gif&#34; alt=&#34;[Semantic Web Strategies logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When we first thought about what to call the tracks, it was difficult to come up with two good category names that would each account for half the presentations, so we picked the names &amp;ldquo;The Past and Present of the Semantic Web&amp;rdquo; and &amp;ldquo;The Present and Future of the Semantic Web&amp;rdquo; as placeholders. The idea was that the former would present stories from the trenches about people&amp;rsquo;s experience planning and rolling out implementations, while presentations in the latter category would describe applications that people are assembling now and their hopes for where this work would take their business.&lt;/p&gt;
&lt;p&gt;Once we saw the submissions and picked the best ones, though, two new track names seemed obvious: &amp;ldquo;The Semantic Web and Your Applications&amp;rdquo; and &amp;ldquo;The Semantic Web and Your Users&amp;rdquo;. Of course all applications have users at some level, but with half the talks focusing on how to get semantic web tools and standards to best serve the sponsoring organization and the other half focusing on &lt;a href=&#34;http://www.semanticwebstrategies.com/conference/sessionsbyday.php#B1&#34;&gt;user interfaces&lt;/a&gt;, &lt;a href=&#34;http://www.semanticwebstrategies.com/conference/sessionsbyday.php#B2&#34;&gt;training&lt;/a&gt;, and especially user-related data, the new division of tracks was clearly a better idea.&lt;/p&gt;
&lt;p&gt;User-related data is an especially hot area. The people who added the tagging features to &lt;a href=&#34;http://www.flickr.com/photos/bobdc&#34;&gt;flickr&lt;/a&gt; and &lt;a href=&#34;http://del.icio.us/bobdc&#34;&gt;del.icio.us&lt;/a&gt; weren&amp;rsquo;t thinking &amp;ldquo;semantic web&amp;rdquo; when they did it, but they weren&amp;rsquo;t thinking &amp;ldquo;Web 2.0&amp;rdquo; either. They were thinking of something that would help both their customers and their businesses, and it&amp;rsquo;s great to see people thinking about what semantic web technologies can add to that. For example, Taylor Cowan will talk in the &amp;ldquo;Semantic Web and Your Users&amp;rdquo; track about &lt;a href=&#34;http://www.semanticwebstrategies.com/conference/sessionsbyday.php#B3&#34;&gt;Ontology-driven Travel Recommendations&lt;/a&gt; and how &lt;a href=&#34;http://www.bambora.com/&#34;&gt;bambora.com&lt;/a&gt; uses an ontology to get more value out of user-entered data, and Tony Hammond will give a talk titled &lt;a href=&#34;http://www.semanticwebstrategies.com/conference/sessionsbyday.php#B5&#34;&gt;Publishing Science on the Open Web: Enter the User&lt;/a&gt; about the effect of user data on one of the world&amp;rsquo;s most well-known science publishers.&lt;/p&gt;
&lt;p&gt;For making the applications serve an enterprise better, we have talks such as Melliyal Annamalai&amp;rsquo;s on &lt;a href=&#34;http://www.semanticwebstrategies.com/conference/sessionsbyday.php#A7&#34;&gt;Using Oracle RDF for Managing Customer Subscription Data&lt;/a&gt;. I knew that Oracle had done a lot of work in this area, and I had assumed that we could get them to send a marketing person to discuss their RDF work if we had asked, so I was very pleased that we could get a &amp;ldquo;Principal Member of Technical Staff&amp;rdquo; (her job title) with a computer science Ph.D. instead. While some Ph.D.&amp;rsquo;s might talk in abstract terms about prototype projects, Melliyal will describe how the Oracle semantic web products addressed a specific customer&amp;rsquo;s business needs.&lt;/p&gt;
&lt;p&gt;Businesses, applications and users have been around for a long time. I&amp;rsquo;m really looking forward to learning more in San Jose about what semantic web tools and standards can add to the relationships between these three things.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By Ramon on &lt;a href=&#34;#comment-1239&#34;&gt;September 15, 2007 1:11 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;All your links to the conference seem to be broken, ending at a Jupitermedia Sitemap page for the past 24 hours. Is their site down?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1240&#34;&gt;September 15, 2007 1:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Unfortunately, the conference has been postponed until April. I was waiting for the Jupiter people to make a more official announcement before I said anything about it.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://cantorva.com&#34; title=&#34;http://cantorva.com&#34;&gt;Simon Gibbs&lt;/a&gt; on &lt;a href=&#34;#comment-1322&#34;&gt;October 18, 2007 11:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Are we still getting the same speakers in April, or what is happening?&lt;/p&gt;
&lt;p&gt;The program page is currently &amp;ldquo;not yet configured&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Simon&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1323&#34;&gt;October 18, 2007 11:29 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I imagine that once Jupiter Media has some other things in place, contacting those speakers will be the next step, but I haven&amp;rsquo;t heard from them on this yet.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>David Bowie on Dinah Shore</title>
      <link>https://www.bobdc.com/blog/david-bowie-on-dinah-shore/</link>
      <pubDate>Wed, 05 Sep 2007 08:52:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/david-bowie-on-dinah-shore/</guid>
      
      
      <description><div>The Thin White Duke in his prime with a great band.</div><div>&lt;p&gt;I&amp;rsquo;ve sworn that I would never write about my new favorite album in this weblog—if I think that the White Stripes&amp;rsquo; &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=B000OYC3J8/bobducharmeA/&#34;&gt;latest album&lt;/a&gt; is even better than &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=B00097A5H2/bobducharmeA/&#34;&gt;Get Behind Me Satan&lt;/a&gt;, what do you care?—but at the recent XML Summer School in Oxford, Norm Walsh convinced me that more coverage of non-technical topics is a good thing. A few recent Tim Bray postings convinced me that it&amp;rsquo;s worthwhile to point at great musical performances on YouTube that people don&amp;rsquo;t know about. While it&amp;rsquo;s very easy to find people discussing the most recent White Stripes album, you probably didn&amp;rsquo;t even know about David Bowie&amp;rsquo;s 1975 stint on the Dinah Shore show.&lt;/p&gt;
&lt;p&gt;I stumbled across it while looking through YouTube for Louis Prima clips one night. (&lt;a href=&#34;http://www.youtube.com/watch?v=8K1InOOLEsQ&#34;&gt;Here&amp;rsquo;s a nice one&lt;/a&gt;, and &lt;a href=&#34;http://www.youtube.com/watch?v=QTO5jc71fbE&#34;&gt;here&amp;rsquo;s a very early one&lt;/a&gt; if you&amp;rsquo;re interested.) New Orleans&amp;rsquo; second most famous scat singing trumpet player named Louis—the Italian one—had been on a Dinah Shore TV show in 1958, and YouTube showed me popular related clips on the right side of the screen—David Bowie? In his Thin White Duke prime?&lt;/p&gt;
&lt;p&gt;You could watch him &lt;a href=&#34;http://www.youtube.com/watch?v=T1bRIwqv_tM&#34;&gt;describe his admiration for Fonzie&lt;/a&gt; (and Dinah Shore calling Fonzie &amp;ldquo;the David Bowie of &amp;lsquo;Happy Days&amp;rsquo;&amp;rdquo;), or &lt;a href=&#34;http://www.youtube.com/watch?v=jfTdyfwSB04&#34;&gt;take a Karate lesson&lt;/a&gt;, but if you want to skip the camp, check out his &lt;a href=&#34;http://www.youtube.com/watch?v=Gx8RNvhKTMc&#34;&gt;kickass version of &amp;ldquo;Stay&amp;rdquo;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;
&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/Gx8RNvhKTMc&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;
 &lt;a href=&#34;http://www.amazon.com/Funk-Drumming-Jim-Payne/dp/0871665115&#34;&gt;&lt;img src=&#34;http://g-ec2.images-amazon.com/images/I/51MDZKNSZ7L.jpg&#34; alt=&#34;[Funk Drumming cover]&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;160px&#34; align=&#34;right&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My daughter&amp;rsquo;s drum teacher has her using this &lt;a href=&#34;http://www.amazon.com/Funk-Drumming-Jim-Payne/dp/0871665115&#34;&gt;Mel Bay&amp;rsquo;s Funk Drumming&lt;/a&gt; book, and I&amp;rsquo;ve often joked that its author doesn&amp;rsquo;t look very funky. To show her a best case, I showed her Bowie on Dinah Shore. &lt;a href=&#34;http://www.drumsoloartist.com/Site/Drummers3/Dennis_Davis.html&#34;&gt;Dennis Davis&lt;/a&gt; was a Vietnam Vet who had studied with Max Roach and Elvin Jones, and boy, Bowie spent his money well. The whole band is amazing. Even with Earl Slick&amp;rsquo;s awful 70s hair and awful 70s white suit, he made a fine contribution to the song. Keep in mind as you watch this that it&amp;rsquo;s on a midday talk show whose closest modern equivalent would be &lt;a href=&#34;http://abc.go.com/daytime/theview/index&#34;&gt;The View&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(If you&amp;rsquo;re a big Bowie fan, don&amp;rsquo;t miss his &lt;a href=&#34;http://www.youtube.com/watch?v=Bed-pnf6oGY&#34;&gt;1974 performance of &amp;ldquo;Young Americans&amp;rdquo;&lt;/a&gt; on Dick Cavett, either.)&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://synklynk.com&#34; title=&#34;http://synklynk.com&#34;&gt;James Lynch III&lt;/a&gt; on &lt;a href=&#34;#comment-1208&#34;&gt;September 5, 2007 11:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Perhaps there should be a best of &amp;ldquo;Dinah,&amp;rdquo; as they did for the Dick Cavett show&amp;hellip; focusing on the music&amp;hellip; her &lt;a href=&#34;http://youtube.com/watch?v=Sr0EkGiwfS4&#34;&gt;Iggy interview&lt;/a&gt; is classic, and I&amp;rsquo;m sure there are some other gems amid the dreck&amp;hellip;&lt;/p&gt;
&lt;p&gt;I think I was home sick from high school when the bowie/iggy Dinah show first aired.&lt;/p&gt;
&lt;p&gt;BTW&amp;hellip; your TypeKey implementation isn&amp;rsquo;t working for me.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1209&#34;&gt;September 5, 2007 12:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Jim. I didn&amp;rsquo;t notice the TypeKey comment there, and took it out. I&amp;rsquo;ve been trying to figure out how to automatically have Movable Type approve comments from people I&amp;rsquo;ve approved before without forcing them to use a TypeKey ID, and I must have checked the wrong dialog box somewhere. I found a plugin at Movalog.com, but the link to the plugin to download is broken and they don&amp;rsquo;t answer their email.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.TimothyHorrigan.com&#34; title=&#34;http://www.TimothyHorrigan.com&#34;&gt;Tim Horrigan&lt;/a&gt; on &lt;a href=&#34;#comment-1218&#34;&gt;September 6, 2007 6:29 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob! You forgot to add your affiliate tag on the link to the Mel Bay book. Oh well, I don&amp;rsquo;t really need to buy anything from Amazon right now anyway. I can always click on HipsterGifts.com in the event I need anything, anyway. Not only that, if your audience is like my site&amp;rsquo;s audience, no one ever clicks on any ****ing links anyway, especially not the paying ones :-(&lt;/p&gt;
&lt;p&gt;I will remember to hit a few Google ads before I go!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/music">music</category>
      
    </item>
    
    <item>
      <title>Automated RDFa Output from DITA Open Toolkit</title>
      <link>https://www.bobdc.com/blog/automated-rdfa-output-from-dit/</link>
      <pubDate>Fri, 31 Aug 2007 07:56:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/automated-rdfa-output-from-dit/</guid>
      
      
      <description><div>A replacement module to make it easy.</div><div>&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/who-uses-metadata-from-html-he&#34;&gt;recently asked&lt;/a&gt; if anyone knew of applications that pull &lt;code&gt;meta[@name and @content]&lt;/code&gt; metadata out of HTML &lt;code&gt;head&lt;/code&gt; elements, and I got a few interesting answers. To extract such data, writing a short XSLT stylesheet that reads the output of John Cowan&amp;rsquo;s &lt;a href=&#34;http://ccil.org/~cowan/XML/tagsoup/&#34;&gt;TagSoup&lt;/a&gt; would be easy, but lately I&amp;rsquo;ve been thinking: with a slight change to those &lt;code&gt;meta&lt;/code&gt; elements, they&amp;rsquo;d be RDFa, which can store more versatile metadata that is easier to get out (see &lt;a href=&#34;http://www.xml.com/pub/a/2007/02/14/introducing-rdfa.html?page=2&#34;&gt;Getting Those Triples&lt;/a&gt;).&lt;/p&gt;
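&lt;p&gt;Such a stylesheet really is short. Here is a minimal sketch that prints each name/value pair as a line of text; it assumes that the TagSoup output has been put in no namespace, because matching elements in the default XHTML namespace takes a little more ceremony in XSLT 1.0:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;
                version=&amp;quot;1.0&amp;quot;&amp;gt;
  &amp;lt;xsl:output method=&amp;quot;text&amp;quot;/&amp;gt;
  &amp;lt;!-- print each head/meta name/value pair on its own line --&amp;gt;
  &amp;lt;xsl:template match=&amp;quot;/&amp;quot;&amp;gt;
    &amp;lt;xsl:for-each select=&amp;quot;html/head/meta[@name and @content]&amp;quot;&amp;gt;
      &amp;lt;xsl:value-of select=&amp;quot;@name&amp;quot;/&amp;gt;
      &amp;lt;xsl:text&amp;gt;: &amp;lt;/xsl:text&amp;gt;
      &amp;lt;xsl:value-of select=&amp;quot;@content&amp;quot;/&amp;gt;
      &amp;lt;xsl:text&amp;gt;&amp;amp;#10;&amp;lt;/xsl:text&amp;gt;
    &amp;lt;/xsl:for-each&amp;gt;
  &amp;lt;/xsl:template&amp;gt;
&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;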
&lt;blockquote class=&#34;pullquote&#34;&gt;For those interested in seeing more RDFa `meta` elements in their HTML `head` elements, the difficult work has already been done.&lt;/blockquote&gt;
&lt;p&gt;For those interested in seeing more RDFa &lt;code&gt;meta&lt;/code&gt; elements in their HTML &lt;code&gt;head&lt;/code&gt; elements, the difficult work has already been done. Many HTML generation routines out there have code to find the value that goes with a certain name (typically, a Dublin Core property name) and then insert the &lt;code&gt;meta&lt;/code&gt; element with the name/value pair. Minimal changes to this code can change it to output RDFa instead. For example, the &lt;a href=&#34;http://dita-ot.sourceforge.net/&#34;&gt;DITA Open Toolkit&lt;/a&gt; is an open-source package that converts base or specialized &lt;a href=&#34;http://dita.xml.org/&#34;&gt;DITA&lt;/a&gt; content to HTML, XHTML, Java help, RTF, troff, PDF, and more formats. The HTML generation part includes a get-meta.xsl stylesheet that inserts the &lt;code&gt;meta&lt;/code&gt; elements, and I&amp;rsquo;ve created a revised version called &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/get-meta-rdfa.xsl&#34;&gt;get-meta-rdfa.xsl&lt;/a&gt; that inserts RDFa &lt;code&gt;meta&lt;/code&gt; elements instead. If you point the &lt;a href=&#34;http://dita-ot.sourceforge.net/doc/ot-userguide13/xhtml/release_current/commandline_help.html&#34;&gt;DITA Open Toolkit jar file&lt;/a&gt; at a stylesheet that just has an &lt;code&gt;xsl:import&lt;/code&gt; instruction pointing at get-meta-rdfa.xsl, you&amp;rsquo;ll get all of the Toolkit&amp;rsquo;s default HTML generation with the RDFa &lt;code&gt;meta&lt;/code&gt; elements substituted for the default ones. For example, instead of this,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta name=&amp;quot;DC.Title&amp;quot; content=&amp;quot;My Topic&amp;quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;you get this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta property=&amp;quot;dc:title&amp;quot; content=&amp;quot;My Topic&amp;quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It also adds namespace declarations for Dublin Core, Dublin Core basic terms, and PRISM, because those were the most appropriate vocabularies for the terms being added. I didn&amp;rsquo;t see any opportunities to add triples that would have URLs as the objects, which would look more like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;dc:identifier&amp;quot; href=&amp;quot;http://www.snee.com/bobdc.blog/2007/08/who_uses_metadata_from_html_he.html&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see an example of the HTML created by the Toolkit with this module &lt;a href=&#34;http://www.snee.com/bobdc.blog/files/currywurst.html&#34;&gt;here&lt;/a&gt; (the look is pretty minimal—while you&amp;rsquo;re customizing the HTML generation code, you might want to point it at a CSS stylesheet as well) and the RDF triples extracted from that by triplr.org &lt;a href=&#34;http://triplr.org/rdfa-n3/http://www.snee.com/bobdc.blog/files/currywurst.html&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re a fan of RDFa, find some HTML generation code out there and write or revise a module to have it add some RDFa metadata. Like I said, the code has probably already been written to do the difficult part—actually identifying the name/value pairs—so you just need to revise that code to output the slightly different syntax and add a &lt;code&gt;meta&lt;/code&gt; element wrapper for the revised &lt;code&gt;meta&lt;/code&gt; elements to store the subject of the triples and the namespace declarations. (I did this for this weblog&amp;rsquo;s Movable Type templates &lt;a href=&#34;https://www.bobdc.com/blog/generating-rdfa-from-movable-t-1&#34;&gt;several months ago&lt;/a&gt;.)&lt;/p&gt;
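&lt;p&gt;For reference, a driver stylesheet of the kind described above needs nothing more than the import itself. Think of this as a sketch; use whatever file name and path fit your Toolkit setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot;?&amp;gt;
&amp;lt;xsl:stylesheet xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;
                version=&amp;quot;1.0&amp;quot;&amp;gt;
  &amp;lt;!-- substitute the RDFa meta elements for the default ones;
       everything else comes from the default HTML generation --&amp;gt;
  &amp;lt;xsl:import href=&amp;quot;get-meta-rdfa.xsl&amp;quot;/&amp;gt;
&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;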
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-1194&#34;&gt;August 31, 2007 12:34 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is very cool stuff, Bob! I also took from your example the RDFa I would need to insert in my Blogger template. I was hoping to co-exist the RDFa metadata and the Dublin Core recommendation for expressing metadata, but the Triplr.org app didn&amp;rsquo;t seem to like it:&lt;/p&gt;
&lt;p&gt;Parsing &lt;a href=&#34;http://shudson310.blogspot.com/index.html&#34;&gt;http://shudson310.blogspot.com/index.html&lt;/a&gt; content with &amp;lsquo;rdfa&amp;rsquo; parser failed with errors:&lt;br /&gt;
line 1: XML parser error: EntityRef: expecting &amp;lsquo;;&amp;rsquo;&lt;br /&gt;
line 1: XML parser error: EntityRef: expecting &amp;lsquo;;&amp;rsquo;&lt;/p&gt;
&lt;p&gt;Any ideas what the issue is here?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1195&#34;&gt;August 31, 2007 10:33 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Scott,&lt;/p&gt;
&lt;p&gt;Your index.html file isn&amp;rsquo;t well-formed. The permalink versions of my postings have a DOCTYPE declaration and fail validation because of the RDFa meta elements, but they&amp;rsquo;re still well-formed, so triplr is OK with them.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Who uses metadata from HTML head/meta @name and @content attributes?</title>
      <link>https://www.bobdc.com/blog/who-uses-metadata-from-html-he/</link>
      <pubDate>Sun, 26 Aug 2007 22:53:19 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/who-uses-metadata-from-html-he/</guid>
      
      
      <description><div>For example, name=&#34;DC.Title&#34; content=&#34;My Title&#34;?</div><div>&lt;p&gt;A view source on a lot of web pages out there shows something like this, which is from a web page created by the &lt;a href=&#34;http://dita-ot.sourceforge.net/&#34;&gt;DITA Open Toolkit&lt;/a&gt; from a DITA XML file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;html lang=&amp;quot;en-us&amp;quot; xml:lang=&amp;quot;en-us&amp;quot;&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;meta content=&amp;quot;text/html; charset=utf-8&amp;quot; http-equiv=&amp;quot;Content-Type&amp;quot; /&amp;gt;
    &amp;lt;meta name=&amp;quot;copyright&amp;quot; content=&amp;quot;(C) Copyright 2005&amp;quot; /&amp;gt;
    &amp;lt;meta name=&amp;quot;DC.rights.owner&amp;quot; content=&amp;quot;(C) Copyright a2005&amp;quot; /&amp;gt;
    &amp;lt;meta content=&amp;quot;recipe&amp;quot; name=&amp;quot;DC.Type&amp;quot; /&amp;gt;
    &amp;lt;meta name=&amp;quot;DC.Title&amp;quot; content=&amp;quot;My Topic&amp;quot; /&amp;gt;
    &amp;lt;meta name=&amp;quot;abstract&amp;quot; content=&amp;quot;Sample description of the topic.&amp;quot; /&amp;gt;
    &amp;lt;meta name=&amp;quot;description&amp;quot; content=&amp;quot;Sample description of the topic.&amp;quot; /&amp;gt;
    &amp;lt;meta content=&amp;quot;XHTML&amp;quot; name=&amp;quot;DC.Format&amp;quot; /&amp;gt;
    &amp;lt;meta content=&amp;quot;r1&amp;quot; name=&amp;quot;DC.Identifier&amp;quot; /&amp;gt;
    &amp;lt;link href=&amp;quot;commonltr.css&amp;quot; type=&amp;quot;text/css&amp;quot; rel=&amp;quot;stylesheet&amp;quot; /&amp;gt;
    &amp;lt;title&amp;gt;My Topic&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The HTML &lt;code&gt;head&lt;/code&gt; elements of many web pages have metadata in collections of name/value pairs stored in the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt; attributes of &lt;code&gt;meta&lt;/code&gt; elements like this. We&amp;rsquo;ve all seen that many HTML generation routines add these &lt;code&gt;head/meta&lt;/code&gt; elements, but what kind of applications actually pull these name/value pairs out and do something with them? Web-focused content management systems are the only candidate I can think of; can anyone confirm that one of those uses this data, or name some other kind of application that does?&lt;/p&gt;
&lt;p&gt;A funny side note: a &lt;a href=&#34;http://www.google.com/search?hl=en&amp;amp;q=meta+name+*+content&amp;amp;btnG=Search&#34;&gt;web search&lt;/a&gt; to find some numbers on this usage of &lt;code&gt;meta&lt;/code&gt; tags also displays a single paid ad with the title &amp;ldquo;Meta Tags Are Dead&amp;rdquo; that links to an ad for a $79 book called &amp;ldquo;Google Secrets: How to Get a Top 10 Ranking&amp;rdquo; at the clever domain name google-secrets.com. The ad title reminded me of a certain &lt;a href=&#34;http://www.amazon.com/Soul-Dead/dp/B000000HHR&#34;&gt;De La Soul album&lt;/a&gt; name.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://kontrawize.blogs.com/kontrawize/&#34; title=&#34;http://kontrawize.blogs.com/kontrawize/&#34;&gt;Anthony B. Coates&lt;/a&gt; on &lt;a href=&#34;#comment-1184&#34;&gt;August 27, 2007 1:12 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One limited usage, but very useful for me, is that when you bookmark a page in the Opera browser, it stores the description metadata as well as the page title and URL (if the page doesn&amp;rsquo;t have any description, you can add your own, and I often have to, by copying/pasting content from the page). The Opera UI is based on fast search for navigation, so if you are looking at your bookmarks, you just type in a relevant word or two, and it searches all of the bookmark info, including the description, to filter the bookmark tab to show only the relevant bookmarks. This works really well.&lt;br /&gt;
Cheers, Tony.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1185&#34;&gt;August 27, 2007 1:18 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Tony! This is just the kind of thing I was curious about. I see that they document it a bit &lt;a href=&#34;http://www.opera.com/docs/specs/opera7/html/index.dml&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://simonster.com/&#34; title=&#34;http://simonster.com/&#34;&gt;Simon Kornblith&lt;/a&gt; on &lt;a href=&#34;#comment-1186&#34;&gt;August 27, 2007 3:39 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://zotero.org/&#34;&gt;Zotero&lt;/a&gt; uses Dublin Core meta tags to import bibliographic metadata.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1187&#34;&gt;August 27, 2007 4:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Just to make sure I understand what you mean by &amp;ldquo;Dublin Core meta tags&amp;rdquo;: when you tell Zotero to save information about a web page, it looks for meta elements where the @name value begins with &amp;ldquo;DC.&amp;rdquo; and, for each that it finds, it saves the @name and @content values, right?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://code.google.com/p/zotero-for-lawyers/&#34; title=&#34;http://code.google.com/p/zotero-for-lawyers/&#34;&gt;bill mckinney&lt;/a&gt; on &lt;a href=&#34;#comment-1191&#34;&gt;August 30, 2007 3:17 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I scrape meta tags in many of the law-related Zotero translators I&amp;rsquo;ve contributed (Bob, you would be familiar with Cornell&amp;rsquo;s LII I think).&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also been recently shot down for suggesting that publishers use simple meta tags in order to be &amp;ldquo;Zotero friendly&amp;rdquo; - see: &lt;a href=&#34;http://groups.google.com/group/zotero-dev/browse_thread/thread/4a6f0190afc3e4a&#34;&gt;http://groups.google.com/group/zotero-dev/browse_thread/thread/4a6f0190afc3e4a&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;RDF and ontologies are way sexier!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1192&#34;&gt;August 30, 2007 4:48 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Definitely familiar with the Cornell LII. They&amp;rsquo;re doing great work, and not enough people realize that findlaw.com is owned by Thomson, so I think that the work at Cornell is the great hope for free access to online U.S. law.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;RDF and ontologies are way sexier!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;An important advantage of RDF is the ability to make the value of the name/value pair (the object) a URI, so that it can serve as the subject of other triples, so that you can start linking up triples to gain new information. I&amp;rsquo;ll be posting something that mentions doing this in HTML head elements shortly.&lt;/p&gt;
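&lt;p&gt;(A quick made-up illustration in Turtle, with hypothetical URIs: because the object of the first triple is a URI and not a literal, the second triple can use it as a subject, and the two link up:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix ex: &amp;lt;http://example.com/ns#&amp;gt; .

ex:doc1    ex:author ex:author3 .
ex:author3 ex:name   &amp;quot;Jane Smith&amp;quot; .
&lt;/code&gt;&lt;/pre&gt;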
&lt;p&gt;Ontologies are great, but you can get a lot done without them.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ldodds.com/blog&#34; title=&#34;http://www.ldodds.com/blog&#34;&gt;Leigh Dodds&lt;/a&gt; on &lt;a href=&#34;#comment-1193&#34;&gt;August 31, 2007 4:59 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We publish meta tags from the article pages on IngentaConnect. Google Scholar uses them as a way to get better bibliographic metadata than they can automatically harvest from the full-text.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>using owl:imports</title>
      <link>https://www.bobdc.com/blog/using-owlincludes/</link>
      <pubDate>Thu, 23 Aug 2007 08:42:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-owlincludes/</guid>
      
      
      <description><div>Like XInclude, or #include, or xsl:include and xsl:import, but trickier.</div><div>&lt;p&gt;I&amp;rsquo;ve had problems getting OWL&amp;rsquo;s import mechanism to work before, and once I got a simple demo of it to work I wanted to make it available. owl:imports is great because it helps make your ontologies more modular, even letting you separate your ontology from the data it describes, sort of like—&lt;a href=&#34;http://www.jibboo.com/beatles/help/helpline.htm&#34;&gt;dare I say it&lt;/a&gt;—a schema.&lt;/p&gt;
&lt;p&gt;To avoid a simple typo that I made, note that the OWL property uses the third person singular form of the verb &amp;ldquo;imports&amp;rdquo; instead of the second person &amp;ldquo;import&amp;rdquo; command used by XSLT (or the &amp;ldquo;include&amp;rdquo; used by XSLT and other programming languages). We&amp;rsquo;re not saying &amp;ldquo;Hey compiler or interpreter! Import this other file!&amp;rdquo; Instead we&amp;rsquo;re saying, in subject-predicate-object RDF fashion, &amp;ldquo;this ontology imports this other one&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;What makes owl:imports tricky is that we&amp;rsquo;re not just importing a file, but importing an ontology, and because the importing file, the imported file, the terms being defined, and, well, pretty much everything all have URLs to represent their full names, the use of relative names, prefixes, and xml:base can make things easier or add to the confusion. As a starting point, the example below does work.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the data file to import, which I called addressbook.rdf:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;rdf:RDF
    xmlns:rdf=&amp;quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;quot;
    xmlns:id=&amp;quot;http://www.snee.com/ns/id#&amp;quot;
    xmlns=&amp;quot;http://www.snee.com/ns/addressbook#&amp;quot;&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;id:NormaS&amp;quot;&amp;gt;
    &amp;lt;firstName&amp;gt;Norma&amp;lt;/firstName&amp;gt;
    &amp;lt;lastName&amp;gt;Smith&amp;lt;/lastName&amp;gt;
    &amp;lt;homePhone&amp;gt;(445) 138-6676&amp;lt;/homePhone&amp;gt;
    &amp;lt;workPhone&amp;gt;(326) 852-7714&amp;lt;/workPhone&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


  &amp;lt;rdf:Description rdf:about=&amp;quot;id:andy-g&amp;quot;&amp;gt;
    &amp;lt;firstName&amp;gt;Andy&amp;lt;/firstName&amp;gt;
    &amp;lt;lastName&amp;gt;Gibson&amp;lt;/lastName&amp;gt;
    &amp;lt;homePhone&amp;gt;(652) 348-2796&amp;lt;/homePhone&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;


&amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here&amp;rsquo;s a short ontology that I named addressbook.owl. It imports addressbook.rdf and adds some metadata asserting that the mobile, workPhone, and homePhone properties are subproperties of the datatype property &amp;ldquo;phone&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;rdf:RDF
    xmlns:rdf=&amp;quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;quot;
    xmlns:owl=&amp;quot;http://www.w3.org/2002/07/owl#&amp;quot;
    xmlns:rdfs=&amp;quot;http://www.w3.org/2000/01/rdf-schema#&amp;quot;&amp;gt;


  &amp;lt;owl:Ontology&amp;gt;
    &amp;lt;owl:imports&amp;gt;
      &amp;lt;owl:Ontology rdf:about=&amp;quot;addressbook.rdf&amp;quot;/&amp;gt;
    &amp;lt;/owl:imports&amp;gt;
  &amp;lt;/owl:Ontology&amp;gt;


  &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.snee.com/ns/addressbook#phone&amp;quot;/&amp;gt;


  &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.snee.com/ns/addressbook#mobile&amp;quot;&amp;gt;
    &amp;lt;rdfs:subPropertyOf&amp;gt;
      &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.snee.com/ns/addressbook#phone&amp;quot;/&amp;gt;
    &amp;lt;/rdfs:subPropertyOf&amp;gt;
  &amp;lt;/owl:DatatypeProperty&amp;gt;


  &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.snee.com/ns/addressbook#workPhone&amp;quot;&amp;gt;
    &amp;lt;rdfs:subPropertyOf&amp;gt;
      &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.snee.com/ns/addressbook#phone&amp;quot;/&amp;gt;
    &amp;lt;/rdfs:subPropertyOf&amp;gt;
  &amp;lt;/owl:DatatypeProperty&amp;gt;


  &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.snee.com/ns/addressbook#homePhone&amp;quot;&amp;gt;
    &amp;lt;rdfs:subPropertyOf&amp;gt;
      &amp;lt;owl:DatatypeProperty rdf:about=&amp;quot;http://www.snee.com/ns/addressbook#phone&amp;quot;/&amp;gt;
    &amp;lt;/rdfs:subPropertyOf&amp;gt;
  &amp;lt;/owl:DatatypeProperty&amp;gt;
&amp;lt;/rdf:RDF&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following &lt;a href=&#34;http://www.w3.org/TR/2007/CR-rdf-sparql-query-20070614/&#34;&gt;SPARQL&lt;/a&gt; query, stored in the file normaPhone.spq, asks for any phone numbers that NormaS has, even though the person issuing the query may not know which of her phone numbers are in the database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX t: &amp;lt;http://www.snee.com/ns/addressbook#&amp;gt;


SELECT ?phoneType ?value
WHERE {
    &amp;lt;id:NormaS&amp;gt; t:phone ?value.
    &amp;lt;id:NormaS&amp;gt; ?phoneType ?value
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following command line tells &lt;a href=&#34;http://pellet.owldl.com/&#34;&gt;pellet&lt;/a&gt; to run the normaPhone.spq query against addressbook.owl:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pellet -if addressbook.owl -qf normaPhone.spq
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pellet gives me the following answer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Query Results (2 answers):
phoneType  | value
=============================
:workPhone | &amp;quot;(326) 852-7714&amp;quot;
:homePhone | &amp;quot;(445) 138-6676&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(It also gives me a report on the non-&lt;a href=&#34;http://www.w3.org/TR/2004/REC-owl-guide-20040210/#OwlVarieties&#34;&gt;DL&lt;/a&gt; aspects of my ontology and what I can add to make it more DL compliant, but I&amp;rsquo;m not including that here.) The query is an important part of my owl:imports demo because it shows how the separate OWL ontology actually adds to the usefulness of the simple address book data that has no ontology information: it lets me get Norma&amp;rsquo;s phone numbers without knowing exactly which kind are stored. (As a bonus, this also demonstrates the value of the rdfs:subPropertyOf property.) Of course, the query also forces me to run the whole setup with some software that will complain if I didn&amp;rsquo;t do it properly.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m open to suggestions on ways to improve all of this.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.jenitennison.com/blog&#34; title=&#34;http://www.jenitennison.com/blog&#34;&gt;Jeni Tennison&lt;/a&gt; on &lt;a href=&#34;#comment-1162&#34;&gt;August 23, 2007 11:15 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Why do you include the data in the ontology rather than the other way around? Doesn&amp;rsquo;t this mean that if someone wants to reuse your ontology, they get all your data along for the ride?&lt;/p&gt;
&lt;p&gt;(Analogously, you point to a schema from an XML document rather than the other way around.)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1163&#34;&gt;August 23, 2007 11:28 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Jeni,&lt;/p&gt;
&lt;p&gt;I did it that way because I really like the OWL use case of creating an ontology around existing data: &amp;ldquo;for the data in such-and-such a file, here is some metadata to go with it.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t always point to a schema from an XML document; James Clark has some arguments against that that make sense to me. What makes the most sense to me, which would be easy enough using owl:includes, would be to have a separate skeleton document that has one pointer to the data document and another pointer to the ontology to say that, for a given processing need, these two are to be used together.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dowhatimean.net/&#34; title=&#34;http://dowhatimean.net/&#34;&gt;Richard Cyganiak&lt;/a&gt; on &lt;a href=&#34;#comment-1164&#34;&gt;August 23, 2007 12:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a bit confused, shouldn&amp;rsquo;t you say owl:imports rather than owl:includes throughout your post?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1165&#34;&gt;August 23, 2007 12:24 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Richard,&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re absolutely right. I&amp;rsquo;m suitably embarrassed and just corrected it. (I&amp;rsquo;ll blame my usage of XSLT, with its slightly different xsl:import and xsl:include commands, for the confusion.)&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Semantic Web Strategies program ready</title>
      <link>https://www.bobdc.com/blog/semantic-web-strategies-progra/</link>
      <pubDate>Mon, 20 Aug 2007 10:42:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-strategies-progra/</guid>
      
      
      <description><div>Lots of great speakers and talks.</div><div>&lt;p&gt;I&amp;rsquo;m very happy to announce that the &lt;a href=&#34;http://www.semanticwebstrategies.com/conference/conferencegrid.php&#34;&gt;program&lt;/a&gt; for the Semantic Web Strategies conference in San Jose September 30 - October 2nd is finished and available. For keynote speakers, we&amp;rsquo;ve got some well-known names who all bring a combination of experience and creativity to their semantic web work: Eric Miller, Nova Spivack, and Kingsley Idehen. We also have presentations on many interesting projects from large and small organizations, and well-known semantic web companies such as TopQuadrant, Zepheira, and Access Innovations (of DataHarmony fame) are among the sponsors.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.semanticwebstrategies.com/index.php&#34;&gt;&lt;img src=&#34;http://www.semanticwebstrategies.com/images/logo_SWS_hdr.gif&#34; alt=&#34;[Semantic Web Strategies logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Jupiter Events people have been very good about helping me keep pure product pitches out of the presentations. I&amp;rsquo;ve suggested to several vendors that a tag team presentation with a customer talking about a project that uses the vendor&amp;rsquo;s product would be OK, and we have a few of those. These actually fit well with the overall theme of the conference, which is to get people talking about what they had to do to connect the technology to their specific business needs.&lt;/p&gt;
&lt;p&gt;We received several submissions for talks that had something to do with semantics and something to do with the web, but less to do with the &amp;ldquo;semantic web&amp;rdquo; in the W3C sense of the term. While many of these looked interesting, the majority of the presentations that made the program are related to applications using metadata, ontologies and taxonomies. The use of standards is what makes it possible to hook these things up together to build larger applications, and W3C standards such as OWL and RDF give applications in this area a common language to work together and form these larger applications.&lt;/p&gt;
&lt;p&gt;As the popularity of the Semantic Web grows, many use it as an umbrella term similar to &amp;ldquo;Web n&amp;rdquo; (where n &amp;gt; 1) for this season&amp;rsquo;s Hot New Technologies, and technology companies want their products to be seen as hot and new. Oddly enough, the bulk of the pure marketing-driven submissions, in which PR reps were pushing clients whose products happened to mention &amp;ldquo;semantics&amp;rdquo; and &amp;ldquo;web&amp;rdquo; in the same press release, came after the deadline for submission. We may offer a third vendor track next time for these people—many are doing some cool things—but I&amp;rsquo;ll have a difficult enough time with the upcoming conference trying to hear two talks at once considering how many good ones we have competing with each other.&lt;/p&gt;
&lt;p&gt;So, if you want to learn more about making the connections between the latest semantic web work and your business needs, &lt;a href=&#34;http://www.semanticwebstrategies.com/register.php&#34;&gt;register&lt;/a&gt; for the Semantic Web Strategies conference and come join us in San Jose. The conference proper is October 1st and 2nd, and there are some pre-conference tutorials for people just getting started.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>How to tell if a forwarded email is a hoax</title>
      <link>https://www.bobdc.com/blog/how-to-tell-if-a-forwarded-ema/</link>
      <pubDate>Thu, 16 Aug 2007 08:53:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/how-to-tell-if-a-forwarded-ema/</guid>
      
      
      <description><div>Important warning! Please forward to everyone you know!</div><div>&lt;p&gt;I assume that people reading my weblog are pretty tech-savvy. Otherwise, they&amp;rsquo;d find most of what I write pretty boring. (That&amp;rsquo;s why no one in my family reads it.) The following advice will look like common sense to most of you, but after getting an email with a subject header of &amp;ldquo;Fw: FW: Fw: I M P O R T A N T W A R N I NG ! ! ! ! ! !&amp;rdquo; from a family member today, I thought I&amp;rsquo;d write this out in case it&amp;rsquo;s useful to anyone. You can send the URL to anyone who sends you such an email to save yourself some typing.&lt;/p&gt;
&lt;p&gt;Email that encourages you to forward it to lots of people shouldn&amp;rsquo;t be forwarded, for the same reason that spam is bad. These emails are often Dire Warnings that describe simple actions that can prevent or cause bad health problems&amp;mdash;for example, that you should clean off the top of your soda cans before opening them to avoid death from dried rat urine, or that you should be careful about microwaving food with certain brands of plastic wrap because some guy on the Today Show said that it would cause cancer. (The latter was actually forwarded by a family friend who is a doctor, and a cynical one at that.)&lt;/p&gt;
&lt;p&gt;When I receive any email that encourages its recipients to forward it to everyone they know, I pick a few phrases that distinguish it from other emails (this morning, it was &amp;ldquo;life is beautiful&amp;rdquo; and &amp;ldquo;power point&amp;rdquo;) and do a Google search. &lt;a href=&#34;http://www.google.com/search?q=%22power%20point%22%20%22life%20is%20beatiful%22&#34;&gt;Today&amp;rsquo;s search&lt;/a&gt; quickly revealed that this morning&amp;rsquo;s email was a hoax that had been forwarded around for over five years.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snopes.com/&#34;&gt;&lt;img src=&#34;http://www.snopes.com/graphics/header/snopes_02.gif&#34; alt=&#34;[snopes.com logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Your Google search may get a hit on &lt;a href=&#34;http://www.snopes.com/&#34;&gt;snopes.com&lt;/a&gt;, an urban legends clearing house that tracks such things. This morning&amp;rsquo;s email finished with &amp;ldquo;PASS IT ON IMMEDIATELY! THIS HAS BEEN CON FIRMED [sic] BY SNOPES&amp;rdquo;, which I thought was an interesting touch. (Over-use of upper-case letters adds more points to the &amp;ldquo;possible hoax&amp;rdquo; score.) Snopes actually identifies the &lt;a href=&#34;http://www.snopes.com/computer/virus/life.asp&#34;&gt;&amp;ldquo;life is beautiful&amp;rdquo; email as a hoax&lt;/a&gt;, so this addition to the forwarded email reveals it as the work of a deliberate hoaxer.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s not help these people by forwarding their emails. Remember: whenever you get an email encouraging you to forward it to lots of people, do a web search on a few phrases from it that won&amp;rsquo;t come up in other emails before adding more clutter to your friends&amp;rsquo; and family&amp;rsquo;s email.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Getting started with Subversion</title>
      <link>https://www.bobdc.com/blog/getting-started-with-subversio/</link>
      <pubDate>Tue, 14 Aug 2007 09:06:04 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started-with-subversio/</guid>
      
      
      <description><div>The basics of the popular version control system.</div><div>&lt;p&gt;Because the open source &lt;a href=&#34;http://subversion.tigris.org/&#34;&gt;Subversion&lt;/a&gt; version control system lets you assign fairly arbitrary keywords to resources, I had &lt;a href=&#34;https://www.bobdc.com/blog/dam-subversion-rdf-owl&#34;&gt;some ideas&lt;/a&gt; a few months ago about combining Subversion with an RDF triple store to track resource metadata. I never learned Subversion properly, though, and recently decided to keep my to-do lists, address book, and notes files in Subversion to get better accustomed to its important commands. Many introductions to Subversion are available, but none were quite what I wanted, so I decided to write up the basics myself once I worked them out. I do recommend Garrett Rooney&amp;rsquo;s &amp;ldquo;A Crash Course in Subversion&amp;rdquo; (&lt;a href=&#34;http://www.developer.com/tech/article.php/3499816&#34;&gt;part one&lt;/a&gt;, &lt;a href=&#34;http://www.developer.com/tech/article.php/10923_3503151_1&#34;&gt;part two&lt;/a&gt;), because it&amp;rsquo;s more detailed than mine, but it has too many details for a quick introduction, with digressions about how competing programs implement certain features and other things I wasn&amp;rsquo;t interested in.&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34;&gt;One problem I had with Subversion was my tendency at first to think of it as a version control system for files.&lt;/blockquote&gt;
&lt;p&gt;One problem I had with Subversion was my tendency at first to think of it as a version control system for files. Once I started thinking of it as a version control system for directories full of files, it was easier to understand its logic about certain things. For example, the &lt;code&gt;commit&lt;/code&gt; command commits the changes to any files in the current working directory to the repository, and the &lt;code&gt;update&lt;/code&gt; command updates a directory&amp;rsquo;s collection of files with any more recent versions from the repository. (These are simplifications of the default behavior—of course these commands can do more.) Maybe this approach is more intuitive to others, but it took me a while to learn to think that way.&lt;/p&gt;
&lt;p&gt;The most important tasks you do with Subversion are to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn01&#34;&gt;Create a repository&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn02&#34;&gt;Prepare the repository&lt;/a&gt; for your files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn03&#34;&gt;Add a directory full of files&lt;/a&gt; to the repository&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn04&#34;&gt;List the files&lt;/a&gt; in a directory in the repository&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn05a&#34;&gt;Create a working copy of a directory&lt;/a&gt; from the repository&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn05&#34;&gt;Add revised versions&lt;/a&gt; of your working directory&amp;rsquo;s files to the repository after editing them&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn11&#34;&gt;Find out what Subversion thinks&lt;/a&gt; of the files currently in your working directory (that is, what work it might need to perform on them)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn06&#34;&gt;Extract updated versions&lt;/a&gt; of your files after someone else updated them in the repository (or in my case, after I updated the repository version from another machine)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn07&#34;&gt;Let the repository know&lt;/a&gt; that you&amp;rsquo;ve renamed or deleted files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn09&#34;&gt;Resolve conflicts&lt;/a&gt; if different edits were made to different copies of the same file&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn10&#34;&gt;Revert your working copy&lt;/a&gt; back to an earlier version from the repository&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#svn08&#34;&gt;Look at old versions&lt;/a&gt; of files without reverting to them&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;svn00&#34;&gt;Basic background&lt;/h2&gt;
&lt;p&gt;As far as I can tell, Subversion works identically on Linux and Windows. I&amp;rsquo;ve been keeping a repository on a thumb drive and using it to keep certain directories of a (Xubuntu) Linux machine and a Windows machine in sync. So far, this has worked well.&lt;/p&gt;
&lt;p&gt;There are two command line programs you use for typical Subversion use: &lt;code&gt;svn&lt;/code&gt; and &lt;code&gt;svnadmin&lt;/code&gt;. When someone refers to the svn command &lt;code&gt;foobar&lt;/code&gt;, they generally mean something that you enter like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn foobar arg1 arg2 etc.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I&amp;rsquo;m using the dollar sign to represent the command prompt—in later examples, lines with no dollar sign that follow these commands will show you the results of the command at the dollar sign.) Entering &amp;ldquo;help&amp;rdquo; after &amp;ldquo;svn&amp;rdquo; or &amp;ldquo;svnadmin&amp;rdquo; lists the available commands and the potential abbreviations of those commands. Entering a command name after &amp;ldquo;help&amp;rdquo; tells you more about that command. If svn included a &lt;code&gt;foobar&lt;/code&gt; command, this would tell you more about it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn help foobar
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Subversion commands use URLs to refer to repositories and resources within repositories. In the simplest form of this arrangement, in which your repository sits on locally accessible disk, this means adding &amp;ldquo;file:///&amp;rdquo; before the path name. Fancier Subversion server arrangements let you reference files using &amp;ldquo;http://&amp;rdquo; and other URL prefixes.&lt;/p&gt;
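&lt;p&gt;For example, the same &lt;code&gt;list&lt;/code&gt; command might address a repository either way, depending on how the repository is served (the http:// path here is made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn list file:///media/disk/testrepos/trunk
$ svn list http://svn.example.com/testrepos/trunk
&lt;/code&gt;&lt;/pre&gt;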
&lt;h2 id=&#34;svn01&#34;&gt;Creating a repository&lt;/h2&gt;
&lt;p&gt;The following creates an empty repository as a subdirectory of /media/disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svnadmin create /media/disk/testrepos
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that it&amp;rsquo;s created, commands will refer to the repository as file:///media/disk/testrepos. Because it&amp;rsquo;s on a thumb drive, when I later move the thumb drive to a Windows machine that decides to call the thumb drive f:, my svn commands will refer to the repository as file:///f:/testrepos.&lt;/p&gt;
&lt;h2 id=&#34;svn02&#34;&gt;Preparing the repository for your files&lt;/h2&gt;
&lt;p&gt;The next step is to create a child of the repository root called &amp;ldquo;trunk&amp;rdquo;. Eventually, you might create siblings of &lt;code&gt;trunk&lt;/code&gt; called &lt;code&gt;tags&lt;/code&gt; and &lt;code&gt;branches&lt;/code&gt; to help organize alternate development branches and groups of files used together for a particular milestone such as a software release, but these are not issues for a quickstart guide. If you keep all of your directories full of files in descendants of &lt;code&gt;trunk&lt;/code&gt;, you&amp;rsquo;ll follow existing conventions and set yourself up to eventually move on to more sophisticated use of Subversion.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn mkdir -m &amp;quot;creating the trunk dir&amp;quot; file:///media/disk/testrepos/trunk 
&lt;/code&gt;&lt;/pre&gt;
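&lt;p&gt;The conventional layout that this sets you up for looks something like the sketch below; tags and branches are shown only as placeholders for later:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;testrepos/
  trunk/       &amp;lt;-- your directories of files go under here
  branches/    &amp;lt;-- alternate lines of development (later)
  tags/        &amp;lt;-- snapshots such as releases (later)
&lt;/code&gt;&lt;/pre&gt;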
&lt;p&gt;For certain commands, Subversion wants you to supply a comment, so if you don&amp;rsquo;t include &lt;code&gt;-m&lt;/code&gt; followed by a quoted string, it will try to start up an editor in which you enter the comment. I think it&amp;rsquo;s easier to add the comment to the command line with the &lt;code&gt;-m&lt;/code&gt; switch, and I don&amp;rsquo;t always include a comment&amp;mdash;an empty string between the quotation marks works just fine.&lt;/p&gt;
&lt;h2 id=&#34;svn03&#34;&gt;Adding a directory of files to the repository&lt;/h2&gt;
&lt;p&gt;For demonstration purposes, I have a directory called /home/bob/samplewd with two files in it: myfile1.txt and myfile2.txt. The following puts this directory and its contents into version control by importing it into Subversion:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn import -m &amp;quot;Adding first directory&amp;quot; /home/bob/samplewd file:///media/disk/testrepos/trunk/samplewd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first argument after the &lt;code&gt;-m&lt;/code&gt; comment is the directory to import and the second is the URL for the place in Subversion where I want to import it.&lt;/p&gt;
&lt;h2 id=&#34;svn04&#34;&gt;Listing files in the repository&lt;/h2&gt;
&lt;p&gt;The command to list the files in a directory in the repository is simple enough:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn list file:///media/disk/testrepos/trunk/samplewd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, when you first learn Subversion, it&amp;rsquo;s worth entering the help command for each new command that you try in order to learn more about it, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn help list
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;svn05a&#34;&gt;Creating a working copy of a directory from the repository&lt;/h2&gt;
&lt;p&gt;It would be nice if, after checking a directory of files into Subversion, you could then edit files in that directory and issue Subversion commands to pull the revised versions of those files into the repository, but that&amp;rsquo;s not the way Subversion works. It tracks files in a &amp;ldquo;working directory&amp;rdquo;, and it doesn&amp;rsquo;t know that the directory that it just copied into the repository is a working directory. You must tell Subversion to create a working directory, and you can have it do this anywhere you like. Some people like to delete their original directory and then check out the Subversion copy in its place to be the working directory. If you&amp;rsquo;re new to Subversion and doing this with files that matter to you, you&amp;rsquo;re better off renaming the existing directory as a backup and then checking out a new copy. I did this from my /home/bob directory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; mv samplewd samplewd.bkp    # or in Windows, rename instead of mv
 svn checkout file:///media/disk/testrepos/trunk/samplewd samplewd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second command here tells Subversion to create a child of the current directory called &lt;code&gt;samplewd&lt;/code&gt; as a working directory copy of the named directory in the Subversion repository.&lt;/p&gt;
&lt;h2 id=&#34;svn05&#34;&gt;Adding revised versions of files to the repository&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;commit&lt;/code&gt; command tells Subversion to pull any revised files from the working directory into the repository. After I checked the samplewd directory and its contents out of the repository as shown above, I edited myfile1.txt and entered this command from within the samplewd directory to commit the changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn commit -m &amp;quot;made first edits&amp;quot; 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This commits all the changes in the current working directory to the repository.&lt;/p&gt;
&lt;h2 id=&#34;svn11&#34;&gt;Finding out the status of your working directory files&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;status&lt;/code&gt; command tells Subversion to list the status of files in the current directory. The default behavior is to list the status of files that have modifications that haven&amp;rsquo;t been committed to the repository. After making a few edits to myfile1.txt and creating a new file called myfile3.txt, the &lt;code&gt;status&lt;/code&gt; command gives me this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn status
?      myfile3.txt
M      myfile1.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells me that myfile3.txt is not under version control and that myfile1.txt has been Modified. (&amp;ldquo;svn help status&amp;rdquo; tells you about other codes that may appear at the beginning of each line.) myfile2.txt isn&amp;rsquo;t listed because Subversion has nothing to say about it: it&amp;rsquo;s in version control and hasn&amp;rsquo;t been changed since the last time it was put there.&lt;/p&gt;
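&lt;p&gt;(Not covered above, but worth knowing at this point: to see the actual edits behind that &amp;ldquo;M&amp;rdquo;, the &lt;code&gt;diff&lt;/code&gt; command shows the changes in your working copy that haven&amp;rsquo;t been committed yet:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn diff myfile1.txt
&lt;/code&gt;&lt;/pre&gt;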
&lt;p&gt;To put myfile3.txt under version control, I use the &lt;code&gt;add&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn add myfile3.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(For filename arguments, svn usually accepts wildcards such as my*3.txt, which can save you some keystrokes. If you&amp;rsquo;re really interested in saving keystrokes, &amp;ldquo;svn help&amp;rdquo; shows you the abbreviations you can use for many commands, such as &lt;code&gt;ci&lt;/code&gt; instead of &lt;code&gt;commit&lt;/code&gt;.) Instead of actually putting myfile3.txt in the repository, this command only marks it for addition to the repository the next time you commit. The same &lt;code&gt;commit&lt;/code&gt; command that puts the edited version of myfile1.txt in will put myfile3.txt in as well, because when used with no arguments the &lt;code&gt;commit&lt;/code&gt; command applies to everything in the current directory under source control:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn commit -m &amp;quot;created new myfile3.txt file and edited myfile1.txt&amp;quot; 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, entering the following with no special parameters shows no output, which is always good to see when you&amp;rsquo;re finishing an editing session, because it means that your working directory is in sync with the repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn status
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;svn06&#34;&gt;Extracting updated versions of files from the repository&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;update&lt;/code&gt; command tells Subversion to update the current directory based on whatever&amp;rsquo;s in the repository. Let&amp;rsquo;s say I&amp;rsquo;ve revised several files on my Linux laptop and then committed the changes to the Subversion repository that I keep on a thumb drive. After I move the thumb drive to a Windows machine and make the corresponding directory my current working directory, I enter the following to update that directory from the repository copy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;c:\some\path\samplewd\&amp;gt;svn update
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After I do that, the directory should have the same contents that the Linux samplewd directory did the last time I committed its contents to the repository. If I edit or add to the files in the working directory on the Windows machine, I&amp;rsquo;ll commit those changes to the repository on the thumb drive and then update the working directory in the Linux laptop&amp;rsquo;s directory from the thumb drive the next time I use it.&lt;/p&gt;
&lt;h2 id=&#34;svn07&#34;&gt;The repository and renamed or deleted files&lt;/h2&gt;
&lt;p&gt;If I rename a file on my laptop, I want Subversion to know that I did, because the next time I update the corresponding directory on my Windows machine, I don&amp;rsquo;t want to find copies of this file with both the old and new names. If I use Subversion to do the rename, I&amp;rsquo;ll only have the renamed version in all updated versions of the directory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn rename myfile3.txt myfile3a.txt
A         myfile3a.txt
D         myfile3.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As the output shows, Subversion treats it as the addition of one file and the deletion of another, but this achieves what you want. Again, remember that the Subversion repository doesn&amp;rsquo;t really know that you did this until you commit your most recent changes. Once you do, Subversion knows that your myfile3a.txt file used to be your myfile3.txt file, and at which revision of your working directory the rename took place, so you can still go back and get access to the old pre-rename version.&lt;/p&gt;
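&lt;p&gt;For example, once you commit the rename, asking for the new file&amp;rsquo;s history will list the revisions from before the rename as well, because &lt;code&gt;log&lt;/code&gt; follows the copy history back. (The commit message here is just a made-up example; I&amp;rsquo;ve left the log output out because its revision numbers and dates would depend entirely on your own repository.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn commit -m &amp;quot;renamed myfile3.txt to myfile3a.txt&amp;quot;
$ svn log myfile3a.txt
&lt;/code&gt;&lt;/pre&gt;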
&lt;p&gt;The same logic applies to deletion of files: have Subversion do it for you, and it will remove it from all corresponding working directories while keeping the older version in the repository if you need it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn delete myfile2.txt
D         myfile2.txt
&lt;/code&gt;&lt;/pre&gt;
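&lt;p&gt;As with a rename, the deletion doesn&amp;rsquo;t actually reach the repository until you commit it. A hypothetical commit might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn commit -m &amp;quot;deleted myfile2.txt&amp;quot;
&lt;/code&gt;&lt;/pre&gt;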
&lt;h2 id=&#34;svn09&#34;&gt;Resolving conflicting versions of a file&lt;/h2&gt;
&lt;p&gt;What if Jane and Jack make different edits to the same file in two different working directories that are supposed to reflect the same directory in the repository? Jane can commit her edited version with no problem, because Subversion doesn&amp;rsquo;t know that there&amp;rsquo;s a problem yet. When Jack tries to commit his changes, Subversion knows that Jack was editing something from before Jane&amp;rsquo;s round of committed edits, and it won&amp;rsquo;t put Jack&amp;rsquo;s version into the repository. Instead, it lets Jack know that there&amp;rsquo;s a problem and gives him information to straighten out the problem.&lt;/p&gt;
&lt;p&gt;To demonstrate what happens, I made one edit to a copy of myfile1.txt in one working directory and another to a copy in a second working directory. I committed the first one with no problem. An attempt to commit the second looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn commit -m &amp;quot;testing conflicts part 2&amp;quot;
Sending        myfile1.txt
svn: Commit failed (details follow):
svn: Out of date: &#39;/trunk/samplewd/myfile1.txt&#39; in transaction &#39;8-1&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It looks like there&amp;rsquo;s a problem. The next step is to run Subversion&amp;rsquo;s &lt;code&gt;update&lt;/code&gt; command, which will help you sort the problem out.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn update
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This merges the two conflicting versions into a combined version in the current directory—if possible. In one of my tests, I added one new line at the beginning of one copy of the file and another new line at the end of another copy in a different working directory. After committing one, attempting to commit the other, and getting an error message similar to the one shown above, the &lt;code&gt;update&lt;/code&gt; operation in the second directory created a version of the file with the two new lines in their appropriate places. (This still needs to be committed to the repository.) As it does so, the &lt;code&gt;update&lt;/code&gt; command outputs the filename with a &amp;ldquo;G&amp;rdquo; status code for &amp;ldquo;merGed&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;What if the edits can&amp;rsquo;t be combined so easily and Subversion can&amp;rsquo;t merge the two versions? It outputs the filename with a status code of &amp;ldquo;C&amp;rdquo; for &amp;ldquo;Conflict&amp;rdquo; and provides you with plenty of information to help you fix the problem. For example, let&amp;rsquo;s say you checked out the file myfile1.txt from release 7 and then someone else committed a new version to the repository as release 8. When you unsuccessfully tried to commit your own revised version, Subversion would revise your myfile1.txt file to include some &lt;a href=&#34;http://en.wikipedia.org/wiki/Diff&#34;&gt;diff&lt;/a&gt; output (diff is the utility that shows the differences between versions) and it would create a few new files to help you fix the problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;myfile1.txt.r7&lt;/strong&gt; The release 7 version of the file, which was checked out and edited to create the two different conflicting versions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;myfile1.txt.r8&lt;/strong&gt; The release 8 version of the file, which was checked in from the other working directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;myfile1.txt.mine&lt;/strong&gt; The version of the file in the current directory that conflicted with the repository.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
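&lt;p&gt;The diff output that Subversion inserts into myfile1.txt itself marks each conflicting region with separator lines. In a hypothetical conflict between my working copy and release 8, the relevant part of the file might look something like this (the &amp;ldquo;.r8&amp;rdquo; label will match whatever revision you conflicted with):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; .mine
the line as I edited it
=======
the line as committed from the other working directory
&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; .r8
&lt;/code&gt;&lt;/pre&gt;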
&lt;p&gt;Use these to edit myfile1.txt until it looks the way you really want it, and then enter the following command to tell Subversion &amp;ldquo;I&amp;rsquo;ve resolved the problem and the current version of myfile1.txt is the one that I want as the official copy in the repository&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn resolved myfile1.txt
Resolved conflicted state of &#39;myfile1.txt&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course, it didn&amp;rsquo;t &lt;em&gt;really&lt;/em&gt; resolve the conflict; you must execute another &lt;code&gt;commit&lt;/code&gt; command before that&amp;rsquo;s complete. The &lt;code&gt;resolved&lt;/code&gt; command also removes the extra files that the &lt;code&gt;update&lt;/code&gt; command created (in this case, the .r7, .r8, and .mine files) instead of leaving them there to clutter up your directory.&lt;/p&gt;
&lt;h2 id=&#34;svn10&#34;&gt;Reverting your working copy back to an earlier version from the repository&lt;/h2&gt;
&lt;p&gt;The simplest case for reverting a file to an earlier version is to tell Subversion to throw out your recent edits and replace a working copy with the most recent committed version from the repository. Let&amp;rsquo;s say that one afternoon&amp;rsquo;s round of edits led you down a blind alley and you want to go back to where you were at lunch time. You issue a &lt;code&gt;status&lt;/code&gt; command, which shows you that myfile1.txt has been modified, and you tell Subversion to revert to the last committed version. You then enter another &lt;code&gt;status&lt;/code&gt; command, which shows you that no files in your directory are out of sync with your repository versions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn status
M      myfile1.txt
$ svn revert myfile1.txt
Reverted &#39;myfile1.txt&#39;
$ svn status
$ 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to go back to an earlier version of a file in the repository, you&amp;rsquo;ll use the same command that you used to get files out of the repository in the first place: &lt;code&gt;update&lt;/code&gt;. Before you use it, remember to either commit your current version to the repository or revert as shown above; Subversion doesn&amp;rsquo;t want you to lose any of your work, so if you pull an older version of a file out of the repository and Subversion sees uncommitted modifications in your file, it will create &lt;a href=&#34;#i2&#34;&gt;all the extra files&lt;/a&gt; that it uses to help you resolve any conflict between your existing working copy and the &amp;ldquo;official&amp;rdquo; one.&lt;/p&gt;
&lt;p&gt;The following tells Subversion to replace the working directory&amp;rsquo;s version of myfile1.txt with the one from revision 6:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn update -r 6 myfile1.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(When referencing a particular version, the space after the &amp;ldquo;-r&amp;rdquo; is optional.) The online help for the &lt;code&gt;update&lt;/code&gt; command lists some handy keywords that you can use instead of a number to specify a particular revision, such as &amp;ldquo;PREV&amp;rdquo; to get the version previous to the most recent committed version.&lt;/p&gt;
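&lt;p&gt;For example, assuming that myfile1.txt has no uncommitted modifications, this hypothetical command replaces the working copy with the version before the most recently committed one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn update -r PREV myfile1.txt
&lt;/code&gt;&lt;/pre&gt;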
&lt;h2 id=&#34;svn08&#34;&gt;Looking at old versions of files without reverting to them&lt;/h2&gt;
&lt;p&gt;If you only want to look at an older version of a file without making it your working copy version, a simpler command is available. The following command tells Subversion to display the contents of myfile1.txt from release 3 of the committed working directory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn cat -r 3 myfile1.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;cat&lt;/code&gt; command can use the same keywords as &lt;code&gt;update&lt;/code&gt; in place of specific version numbers.&lt;/p&gt;
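&lt;p&gt;Because &lt;code&gt;cat&lt;/code&gt; writes to standard output, you can also redirect it to save an old version under a scratch name without disturbing your working copy. (The output filename here is just one I made up for illustration.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ svn cat -r 3 myfile1.txt &amp;gt; myfile1-r3.txt
&lt;/code&gt;&lt;/pre&gt;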
&lt;h2 id=&#34;xwrezklKTMitZcDBpEPC7w&#34;&gt;Summary&lt;/h2&gt;
&lt;p&gt;Once you&amp;rsquo;ve put a directory of files (and potentially subdirectories) into a repository, then checked them out into a working directory, your typical procedure for a session of working with those files is to issue an &lt;code&gt;update&lt;/code&gt; command to ensure that the directory has the most recent versions, then make your edits, then issue a &lt;code&gt;commit&lt;/code&gt; command to put your revised versions into the repository. (I almost wrote &amp;ldquo;to put your &lt;em&gt;updated&lt;/em&gt; versions into the repository&amp;rdquo;, and it wouldn&amp;rsquo;t be the first time that re-use of Subversion vocabulary with slightly different semantics led me to a bit of confusion.) To reiterate one point I&amp;rsquo;ve already made several times, remember that many important Subversion commands have no effect until the next time you issue a &lt;code&gt;commit&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;Subversion offers many more commands than we&amp;rsquo;ve seen here. You can specify files in your working directory for it to ignore, you can list the user names of who made which changes when, you can store repositories on remote systems, you can change the URL used to reference a particular repository if it got moved to a different server, you can fork off a development effort into multiple distinct branches&amp;hellip; you can do all kinds of cool things. Garrett Rooney&amp;rsquo;s two-part article is a great place to learn more about these features. Also, remember that I&amp;rsquo;ve only skimmed the surface of the commands that I did cover, so don&amp;rsquo;t forget to enter &amp;ldquo;svn help command&amp;rdquo; for each of the above commands after you try them to learn about what else they can do.&lt;/p&gt;
&lt;p&gt;For some more interesting thoughts on Subversion, see &lt;a href=&#34;http://norman.walsh.name/2007/07/19/mercurial&#34;&gt;this recent blog posting by Norm Walsh&lt;/a&gt; and Joey Hess&amp;rsquo;s &lt;a href=&#34;http://www.onlamp.com/pub/a/onlamp/2005/01/06/svn_homedir.html&#34;&gt;Keeping Your Life in Subversion&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.peterkrantz.com&#34; title=&#34;http://www.peterkrantz.com&#34;&gt;Peter Krantz&lt;/a&gt; on &lt;a href=&#34;#comment-1130&#34;&gt;August 14, 2007 11:16 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A great way to get started with subversion is to install Mercurial :-) I had a look at Mercurial after a friend suggested it and after you get used to the distributedness of it, it is very pleasant. Subversion users will find commands familiar.&lt;/p&gt;
&lt;p&gt;By Simon Rozet on &lt;a href=&#34;#comment-1131&#34;&gt;August 14, 2007 11:30 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Peter Krantz wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A great way to get started with subversion is to install Mercurial :-)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Oh yes, definitely! hg is IMHO a lot more simpler than svn. You can create a new &amp;ldquo;repo&amp;rdquo; in a second, create a new branch for free and it have real branching and merging capabilities.&lt;/p&gt;
&lt;p&gt;You should really give it a try :-)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1132&#34;&gt;August 14, 2007 12:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s why I linked to Norm&amp;rsquo;s recent posting. (Although, according to a more &lt;a href=&#34;http://norman.walsh.name/2007/08/09/mercurial&#34;&gt;recent posting&lt;/a&gt;, he&amp;rsquo;s been having some problems with Mercurial.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Some great W3C explanations of basic ontology concepts</title>
      <link>https://www.bobdc.com/blog/some-great-w3c-explanations-of/</link>
      <pubDate>Mon, 06 Aug 2007 10:23:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/some-great-w3c-explanations-of/</guid>
      
      
      <description><div>Highlights from the OWL Use Cases and Requirements</div><div>&lt;p&gt;While reading the W3C Recommendation &lt;a href=&#34;http://www.w3.org/TR/2004/REC-webont-req-20040210/&#34;&gt;OWL Use Cases and Requirements&lt;/a&gt;, I was surprised at how many nice, succinct explanations of basic OWL and ontology-related concepts it had, so I thought I&amp;rsquo;d reproduce some highlights here. For example, take its definition of an ontology:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An ontology formally defines a common set of terms that are used to describe and represent a domain&amp;hellip;An ontology defines the terms used to describe and represent an area of knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class=&#34;pullquote&#34;&gt;What the hell is a conceptualization? &lt;/blockquote&gt;
&lt;p&gt;The most popular definition of ontology is &amp;ldquo;a formalization of a conceptualization&amp;rdquo;, as if the readers are going to sit back and say &amp;ldquo;OK! Formalization! Conceptualization! Got it! All set here!&amp;rdquo;, except that they&amp;rsquo;re really thinking &amp;ldquo;What the hell is a conceptualization?&amp;rdquo; I much prefer the OWL Use Cases explanation, which while not being as &amp;ldquo;formal&amp;rdquo;, at least explains what&amp;rsquo;s going on.&lt;/p&gt;
&lt;p&gt;Here are some more nice quotes from the Use Cases document, arranged as a fake FAQ:&lt;/p&gt;
&lt;p&gt;Why would anyone use an ontology?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Ontologies are used by people, databases, and applications that need to share domain information (a domain is just a specific subject area or area of knowledge, like medicine, tool manufacturing, real estate, automobile repair, financial management, etc.). Ontologies include computer-usable definitions of basic concepts in the domain and the relationships among them&amp;hellip;Ontologies can prove very useful for a community as a way of structuring and defining the meaning of the metadata terms that are currently being collected and standardized.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I keep seeing the word &amp;ldquo;ontology&amp;rdquo; used to describe different things. What&amp;rsquo;s up with that?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;(note that here and throughout this document, [the] definition [of ontology] is not used in the technical sense understood by logicians)&amp;hellip; The word ontology has been used to describe artifacts with different degrees of structure. These range from simple taxonomies (such as the Yahoo hierarchy), to metadata schemes (such as the Dublin Core), to logical theories.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What is the Semantic Web, and where do ontologies fit in?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Semantic Web is a vision for the future of the Web in which information is given explicit meaning, making it easier for machines to automatically process and integrate information available on the Web. The Semantic Web will build on XML&amp;rsquo;s ability to define customized tagging schemes and RDF&amp;rsquo;s flexible approach to representing data. The next element required for the Semantic Web is a web ontology language which can formally describe the semantics of classes and properties used in web documents. In order for machines to perform useful reasoning tasks on these documents, the language must go beyond the basic semantics of RDF Schema.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;How can OWL do this?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;a href=&#34;http://www.w3.org/2001/sw/WebOnt/charter&#34;&gt;Web Ontology Working Group charter&lt;/a&gt; tasks the group to produce this more expressive semantics and to specify mechanisms by which the language can provide &amp;ldquo;more complex relationships between entities including: means to limit the properties of classes with respect to number and type, means to infer that items with various properties are members of a particular class, a well-defined model of property inheritance, and similar semantic extensions to the base languages.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The Semantic Web needs ontologies with a significant degree of structure. These need to specify descriptions for the following kinds of concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Classes (general things) in the many domains of interest&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The relationships that can exist among things&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The properties (or attributes) those things may have&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don&amp;rsquo;t know much about &lt;a href=&#34;http://www.w3.org/TR/2004/REC-rdf-schema-20040210/&#34;&gt;RDF Schema&lt;/a&gt;. What does it let you do, and what does OWL add?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With RDF Schema, one can define classes that may have multiple subclasses and super classes, and can define properties, which may have sub properties, domains, and ranges. In this sense, RDF Schema is a simple ontology language. However, in order to achieve interoperation between numerous, autonomously developed and managed schemas, richer semantics are needed. For example, RDF Schema cannot specify that the Person and Car classes are disjoint, or that a string quartet has exactly four musicians as members.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The whole Use Cases document is worth checking out.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.michaelgaio.com&#34; title=&#34;http://www.michaelgaio.com&#34;&gt;Michael Gaio&lt;/a&gt; on &lt;a href=&#34;#comment-1112&#34;&gt;August 7, 2007 5:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know much (yet) about OWL or RDF, but I would imagine&amp;ndash;in addition to the semantic web needing defined Classes, Relationships (Connections), and Properties (Attributes)&amp;ndash;that it will also benefit from having Methods (or functions). Essentially, when the structural description of knowledge domains approach a more dynamic programic / computation element&amp;ndash;we will be establishing a &amp;ldquo;Web 3.0&amp;rdquo; framework for bringing rich semantics into the simulation worlds (virtual world arenas). Imagine an ontological knowledge domain (or taxonomy) for arborology (the study of trees). In a virtual environment, actual landscapes can be generated and informed (as an intersection of multiple ontologies). We may even find that other normally unrelated information ontologies with similar structures (or methodical structures) can borrow or interface with a &amp;ldquo;tree&amp;rdquo; ontology as a good fit for information visualization. In this way, virtual simulation becomes a most nuanced and dynamic way of communication.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1113&#34;&gt;August 8, 2007 11:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s really more about open data than a large, distributed OO system, which people have tried to do for years without success. By making lots of documented data available, people can then define whatever methods they want around that data to build their applications, and other people can define other methods instead of having everyone work around the same methods defined with the data.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Another great XML Summer School in Oxford</title>
      <link>https://www.bobdc.com/blog/another-great-xml-summer-schoo/</link>
      <pubDate>Mon, 30 Jul 2007 09:13:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/another-great-xml-summer-schoo/</guid>
      
      
      <description><div>Despite some flooding here and there.</div><div>&lt;p&gt;The eighth meeting of the &lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;XML Summer School&lt;/a&gt; sponsored by &lt;a href=&#34;http://www.csw.co.uk/&#34;&gt;The CSW Group&lt;/a&gt; at Oxford University was another great one, with Norm Walsh and Dan Connolly being excellent additions to the list of XML luminaries presenting. Norm started a &lt;a href=&#34;http://www.flickr.com/groups/xmlsummerschool2007/&#34;&gt;flickr group&lt;/a&gt; for the summer school, and I just added a few pictures.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/bobdc/946074390/&#34;&gt;&lt;img src=&#34;http://farm2.static.flickr.com/1386/946074390_69079b8355_m.jpg&#34; alt=&#34;[Cherwell punts at high water]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Judging by an email I received from my mother while I was in Oxford, the flooding in Oxfordshire and Gloucestershire was big enough news to be reported in the US, but it didn&amp;rsquo;t affect the Wadham College campus. We did hear tales of lost power, lost water, flooded homes, and bad transportation delays from several locals, though, and we saw ducks and a swan swimming around the giant pond that had been the Queens College cricket fields.&lt;/p&gt;
&lt;p&gt;The flooded train tracks between Oxford and Didcot prevented my family and me from visiting Bath, which was one of our tourism plans, and it also prevented Leigh Dodds from joining the &lt;a href=&#34;https://www.bobdc.com/blog/come-join-the-oxfordshire-sema&#34;&gt;Oxfordshire Semantic Web Interest Group&lt;/a&gt; meeting held Wednesday evening. There were still several interesting presentations at the SWIG, and I look forward to querying the structured data in Wikipedia (for example, the structured data about each Simpsons&amp;rsquo; episode shown in a &lt;a href=&#34;http://en.wikipedia.org/wiki/Old_Yeller_Belly&#34;&gt;box down the right of each page&lt;/a&gt;) using SPARQL after hearing about the &lt;a href=&#34;http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData&#34;&gt;Linking Open Data&lt;/a&gt; project from Dan Connolly&amp;rsquo;s presentation. The meeting reminded me how much face-to-face club meetings have been replaced by technology as a way for people with similar interests to share them; doing it the old-fashioned way has some nice advantages.&lt;/p&gt;
&lt;p&gt;This year also saw more offspring of XML people than ever before. In addition to my wife Jennifer and me bringing along our two daughters, Jeni Tennison, Lauren Wood, and Paul Prescod each brought a daughter, and John Chelsom&amp;rsquo;s wife Angela arranged a tea for them all at the famous old &lt;a href=&#34;http://www.randolph-hotel.com/&#34;&gt;Randolph Hotel&lt;/a&gt; that all the kids enjoyed. Angela was also kind enough to arrange for my daughters to go horseback riding at a local stable, but the torrential rains prevented the actual riding. The girls still managed to get some English mud on their riding boots, and they still have some funny stories about the trip to tell their friends at the barn where they ride at home in Virginia.&lt;/p&gt;
&lt;p&gt;Despite the floods and the transitional state of the CSW staff running the event, they all did a fine job, especially new members of the staff Rose Barnard and second-generation CSW employee Jo Nurse. For the Wednesday evening social events, we were all very happy to see the newly-married Kerry Poulter join us; she ran the summer schools before Rose joined this year and has since moved on to an energy company where they all seem surprised by the depth of her technical knowledge. We&amp;rsquo;ll all miss Omar Tamer, the tech guy and heavy lifter for the ladies, who waited for the end of the summer school before moving on to his new position at &lt;a href=&#34;http://www.o2.com/&#34;&gt;O&lt;sub&gt;2&lt;/sub&gt;&lt;/a&gt;. I hope that his replacement is as good at configuring wireless routers from both Windows and Linux machines. And as always, Sara Price serenely watched over the proceedings and provided calm assistance to all attendees and CSW employees who needed it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/bobdc/944107881/&#34;&gt;&lt;img src=&#34;http://farm2.static.flickr.com/1378/944107881_b6ceeb4c66_m.jpg&#34; alt=&#34;[reading Harry Potter on the way to Oxford]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Related reading material: I like to match my travel reading to my destination, and I heartily recommend Joe Queenan&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=031242521X/bobducharmeA/&#34;&gt;Queenan Country&lt;/a&gt; as reading material when traveling in England. His survey of trashy American culture &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0786884088/bobducharmeA&#34;&gt;Red Lobster, White Trash, and Blue Lagoon&lt;/a&gt; was one of the funniest books I&amp;rsquo;ve read in the last few years, and his account of his first big solo tour of the UK without his English wife is also hilarious. (I even learned that some of Elizabeth Taylor and Richard Burton&amp;rsquo;s earlier liaisons took place in the Oxford hotel where my family and I stayed.) Despite what the Amazon reviews say, it&amp;rsquo;s funnier and better written than Bill Bryson&amp;rsquo;s &amp;ldquo;Notes from a Small Island&amp;rdquo;, which I never even finished.&lt;/p&gt;
&lt;p&gt;We were in Germany the day that &amp;ldquo;Harry Potter and the Deathly Hallows&amp;rdquo; came out, and luckily the British edition was easy to find in Heidelberg. My daughters, who had just re-read the first six books in anticipation of the seventh, each bought a copy and stayed pretty focused until they finished it. The picture on the right shows them on the bus from Heathrow to Oxford.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-1095&#34;&gt;July 30, 2007 2:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bill Bryson&amp;rsquo;s books shouldn&amp;rsquo;t even be donated; they should be used as landfill. He makes at least one error per page.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.rafterjumpon.com&#34; title=&#34;http://www.rafterjumpon.com&#34;&gt;Emily&lt;/a&gt; on &lt;a href=&#34;#comment-1096&#34;&gt;July 30, 2007 2:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob-&lt;br /&gt;
I was very impressed with your up-to-date and knowledgeable blog. We are starting a new website and are looking for creative writers, journalists, and photojournalists to act as correspondants. Please go to &lt;a href=&#34;http://www.rafterjumpon.com&#34;&gt;http://www.rafterjumpon.com&lt;/a&gt; and submit an application today!&lt;br /&gt;
-Emily&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Come join the Oxfordshire Semantic Web Interest Group</title>
      <link>https://www.bobdc.com/blog/come-join-the-oxfordshire-sema/</link>
      <pubDate>Tue, 10 Jul 2007 19:47:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/come-join-the-oxfordshire-sema/</guid>
      
      
      <description><div>Rescheduling their meeting to accommodate the XML Summer School.</div><div>&lt;p&gt;Eamonn Neylon has kindly moved the monthly meeting of the &lt;a href=&#34;http://swig.networkedplanet.com/&#34;&gt;Oxfordshire Semantic Web Interest Group&lt;/a&gt; to later in the month than usual so that people who are in Oxford for the &lt;a href=&#34;http://www.xmlsummerschool.com/&#34;&gt;XML Summer School&lt;/a&gt; can attend. To make it even easier for them, the meeting will be at &lt;a href=&#34;http://www.wadham.ox.ac.uk/&#34;&gt;Wadham College&lt;/a&gt;, where the rest of the summer school takes place.&lt;/p&gt;
&lt;p&gt;Instead of one presenter, there will be a series of lightning talks, and he&amp;rsquo;s encouraging people to contact him at [his first name]@xmlopen.org if they&amp;rsquo;d like to do one of the talks. I&amp;rsquo;m not sure what I&amp;rsquo;ll speak about, but I&amp;rsquo;ll think of something.&lt;/p&gt;
&lt;p&gt;Even if you&amp;rsquo;ve never been to an Oxon SWIG meeting and you&amp;rsquo;re not taking part in the summer school, if you&amp;rsquo;re in the Oxford area and interested in semweb work, please come by. I&amp;rsquo;m looking forward to the chance to meet people who&amp;rsquo;ve been just names (and perhaps weblogs) to me so far. So come on by and speak or listen or hang out!&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ldodds.com/blog&#34; title=&#34;http://www.ldodds.com/blog&#34;&gt;Leigh Dodds&lt;/a&gt; on &lt;a href=&#34;#comment-1043&#34;&gt;July 11, 2007 4:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks for the reminder. The conjunction of so many geeks is just too good to miss!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Semantic Web project ideas number 6</title>
      <link>https://www.bobdc.com/blog/semantic-web-project-ideas-num-5/</link>
      <pubDate>Fri, 06 Jul 2007 09:27:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-project-ideas-num-5/</guid>
      
      
      <description><div>A form-driven front end to a SPARQL engine (as opposed to a SPARQL front end).</div><div>&lt;p&gt;I&amp;rsquo;ve only written up about half of my &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/semantic_web/project_ideas/&#34;&gt;list of semantic web project ideas&lt;/a&gt; first described &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-project-ideas-num#JbCDa2KDRty2dcZb7YDo0Q&#34;&gt;here&lt;/a&gt;, and the &lt;a href=&#34;http://www.mindswap.org/blog/2007/07/01/2007-semantic-web-challenge/&#34;&gt;2007 Semantic Web Challenge&lt;/a&gt; has just been announced, so I wanted to get another one of the ideas out there.&lt;/p&gt;
&lt;blockquote class=&#34;pullquote&#34;&gt;An RDF report generator could be the RDF killer app.&lt;/blockquote&gt;
&lt;p&gt;A GUI front end to SPARQL that generates SPARQL queries without requiring someone to know SPARQL—in short, an RDF report generator—could be the RDF killer app, because providing useful access to data that made more sense in a triplestore than in a relational database would provide a vivid demonstration of the value of the RDF data model to people who wanted to avoid hands-on experience with RDF or SPARQL syntax. The project isn&amp;rsquo;t ultimately about SPARQL, although obviously implementing it would require a high comfort level with the query language. It&amp;rsquo;s about proving the value of the RDF data model for data that typical end users need.&lt;/p&gt;
&lt;p&gt;Submitting queries to an RDF front end to relational data wouldn&amp;rsquo;t prove anything. For example, imagine that you put an RDF interface in front of a movie relational database application and then created a form-based interface to send SPARQL queries to that RDF interface. Jane Moviefan could fill out those forms to look up more movies by the director of the film she loved so much last night, but so what? Forms to generate the SQL queries to send to the relational database are easy enough to create, and additional moving parts inside the black box won&amp;rsquo;t make any difference to Jane.&lt;/p&gt;
&lt;p&gt;What would make a difference to the end users would be the ability to query against data that didn&amp;rsquo;t fit well into relational databases. These users may not care about relational versus non-relational storage, but they should have a shot at an &amp;ldquo;aha&amp;rdquo; moment when using this application with data that they care about and can&amp;rsquo;t conveniently gain access to elsewhere. What data fits in RDF triplestores better than it fits into normalized relational tables? Who wants to query this data apart from RDF geeks who are comfortable with SPARQL syntax? These questions will help lay out the use cases for this application.&lt;/p&gt;
&lt;p&gt;Reducing the domain scope would be important, because a general-purpose query form that worked for all RDF data but doesn&amp;rsquo;t require query language syntax would be difficult. In other words, a general-purpose front end to generate any SPARQL query for any domain would be too ambitious. Finding useful applications within a given domain for the following queries would be a good start—I wouldn&amp;rsquo;t be surprised if such a tool already exists in the world of biology and pharmaceuticals—because these queries would be easy enough to generate from a forms-based front end:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Show all data about the subject with a dc:title value of &amp;ldquo;Lost in Translation&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Show a subset of data about the subject with a particular dc:title based on entered criteria&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;List all subjects with an imdb:director value of &lt;a href=&#34;http://www.imdb.com/name/nm0001068/&#34;&gt;imdbn:nm0001068&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Show all data about subjects with an imdb:director value of imdbn:nm0001068&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Show selected predicate/object (name-value) pairs for subject imdbn:nm0001068&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
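&lt;p&gt;As a rough sketch of what would happen behind such a form (assuming the usual Dublin Core namespace for &lt;code&gt;dc:&lt;/code&gt;, with the title filled in from a text field), the generated SPARQL for the first query above might look something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;

# show all predicate/object pairs for the subject with this title
SELECT ?property ?value
WHERE {
  ?movie dc:title &amp;quot;Lost in Translation&amp;quot; .
  ?movie ?property ?value .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each of the other queries in the list is a similarly small variation on this pattern, which is why a forms-based front end could generate all of them without the user ever seeing the syntax.&lt;/p&gt;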
&lt;p&gt;I realize that after saying that a movie database would not be a good example because such data fits just fine into a relational database, I&amp;rsquo;m using movie data for use case queries. A good example database would take advantage of the added flexibility of RDF, so that drop-down selections of &amp;ldquo;fields&amp;rdquo; to query when retrieving &amp;ldquo;records&amp;rdquo; (while not technically correct terms, &amp;ldquo;field&amp;rdquo; and &amp;ldquo;record&amp;rdquo; could still be helpful metaphors for the end-user to understand how to use the query tool) would be based on the available data, not on some schema that predates the data—which I&amp;rsquo;m sure is the case with &lt;a href=&#34;http://www.imdb.com/&#34;&gt;IMDB&lt;/a&gt;. For flexible data that lots of people care about, I like to use &lt;a href=&#34;https://www.bobdc.com/blog/all-the-personal-data-you-want&#34;&gt;address book data&lt;/a&gt; in my own examples.&lt;/p&gt;
&lt;p&gt;Hooking up an XForms front end to an RDF triplestore back end would be an interesting place to start. Also check out Leigh Dodds&amp;rsquo; &lt;a href=&#34;http://www.ldodds.com/projects/twinkle/&#34;&gt;Twinkle&lt;/a&gt;; it&amp;rsquo;s an aid to constructing SPARQL queries to submit to a SPARQL engine rather than a forms-based interface that hides the actual queries from the user, but plenty of good work has gone into it. I haven&amp;rsquo;t had a chance to take a close look at Oracle&amp;rsquo;s RDF offerings, but if they&amp;rsquo;re as far along with it as they say, maybe they have a way to create forms that generate queries against the RDF for use by people who know nothing about RDF. I look forward to finding out.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://fgiasson.com/blog/&#34; title=&#34;http://fgiasson.com/blog/&#34;&gt;Fred&lt;/a&gt; on &lt;a href=&#34;#comment-1015&#34;&gt;July 6, 2007 3:03 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;I think you could be interested in the &lt;a href=&#34;http://fgiasson.com/blog/index.php/2007/03/26/zitgist-search-query-interface-a-new-search-engine-paradigm/&#34;&gt;Zitgist Query Builder&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It should be released some time this summer.&lt;/p&gt;
&lt;p&gt;Take care,&lt;/p&gt;
&lt;p&gt;Fred&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1016&#34;&gt;July 6, 2007 3:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Fred,&lt;/p&gt;
&lt;p&gt;It looks great, and I look forward to playing with it.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By Kendall Clark on &lt;a href=&#34;#comment-1017&#34;&gt;July 6, 2007 3:50 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is what our RDF client, JSpace, does: builds RDF queries (SPARQL, SeRQL, whatever &amp;ndash; this isn&amp;rsquo;t really the interesting part, as you note) in response to user input in a GUI, displays the results, rinse-and-repeat. :&amp;gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not totally mature yet, but getting there. We&amp;rsquo;ve got a lot more stuff to do it, including different column types (geo/maps, datatype literal filters, etc) and more expressivity, but the basic ideas are pretty much laid in.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve found that getting it in use by real people very early on has helped a lot.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://clarkparsia.com/projects/code/jspace/&#34;&gt;http://clarkparsia.com/projects/code/jspace/&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
      <category domain="https://www.bobdc.com//categories/project-ideas">project ideas</category>
      
    </item>
    
    <item>
      <title>XML Summer School in Oxford</title>
      <link>https://www.bobdc.com/blog/xml-summer-school-in-oxford-1/</link>
      <pubDate>Sun, 01 Jul 2007 09:18:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xml-summer-school-in-oxford-1/</guid>
      
      
      <description><div>Teaching and learning XML with old and new friends.</div><div>&lt;p&gt;I wrote &lt;a href=&#34;https://www.bobdc.com/blog/xml-summer-and-oxford&#34;&gt;last year&lt;/a&gt; about how much I was looking forward to going to the &lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;XML Summer School&lt;/a&gt; at Oxford University, and I&amp;rsquo;m looking forward to it even more this year, because my wife and daughters will come with me. (Not to the classes, but certainly to several of the social events, and there are &lt;a href=&#34;http://www.xmlsummerschool.com/social.html&#34;&gt;plenty of those&lt;/a&gt;.) It will be held at Wadham College again; the picture shows Wadham&amp;rsquo;s beautiful chapel, which adjoins the room where they usually hold the opening reception.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;&lt;img src=&#34;http://www.liedertafel.org/img/wadham1.jpg&#34; alt=&#34;[Wadham chapel]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Most of what I wrote last year still applies, and &lt;a href=&#34;http://www.xmlgrrl.com/blog/archives/2007/06/13/hot-xml-fun-in-the-summertime/&#34;&gt;Eve Maler&lt;/a&gt;, &lt;a href=&#34;http://kontrawize.blogs.com/kontrawize/2007/06/xml-summer-scho.html&#34;&gt;Tony Coates&lt;/a&gt;, and &lt;a href=&#34;http://www.laurenwood.org/anyway/archives/2007/06/27/summer-in-oxford/&#34;&gt;Lauren Wood&lt;/a&gt; already wrote about this year&amp;rsquo;s upcoming event. One great addition to the lineup this year is &lt;a href=&#34;http://www.xmlsummerschool.com/speakers/danconnolly.html&#34;&gt;Dan Connolly&lt;/a&gt;, who will join the &lt;a href=&#34;http://www.xmlsummerschool.com/curriculum/trendsandtransients.html&#34;&gt;Trends and Transients&lt;/a&gt; track.&lt;/p&gt;
&lt;p&gt;The distinguished personnel and fascinating topics of the XSLT-and-related track that I&amp;rsquo;m chairing will remain the same. One change to this track&amp;rsquo;s authors&amp;rsquo; accomplishments in their fields is that Priscilla Walmsley, who will teach the XSL-FO and XQuery classes again, just saw O&amp;rsquo;Reilly publish her &lt;a href=&#34;http://www.oreilly.com/catalog/9780596006341/&#34;&gt;XQuery book&lt;/a&gt;. Of the others teaching in this track, XSLT experts don&amp;rsquo;t get any more expert than Michael Kay and Jeni Tennison, and Paul Prescod&amp;rsquo;s explanation of XSLT use in popular AJAX applications taught me that there was a lot more XSLT under the hood of many applications I use often than I had realized, in addition to giving me a firmer grounding in what really goes into an AJAX application.&lt;/p&gt;
&lt;p&gt;The Trends and Transients track includes a section where each track chair gets to &amp;ldquo;rant&amp;rdquo; about something. I put &amp;ldquo;rant&amp;rdquo; in quotes because Ian Forrester put my five-minute talk from last year &lt;a href=&#34;http://www.youtube.com/watch?v=mTPA6YVwnCg&#34;&gt;on YouTube&lt;/a&gt;, and while I don&amp;rsquo;t think I was ranting that much, it&amp;rsquo;s still titled &amp;ldquo;Bob DuCharme rants&amp;rdquo; because that section of the program was billed as track chairs ranting. It went well, and people laughed at my jokes, but it was hot, and in addition to the fan that you see next to me there are fans near Ian, and they gave me a little too much competition for the audio attention of Ian&amp;rsquo;s video recorder. (I expanded on the topic of the rant &lt;a href=&#34;https://www.bobdc.com/blog/what-data-is-your-metadata-abo&#34;&gt;in this weblog&lt;/a&gt; about a month after the summer school.)&lt;/p&gt;
&lt;p&gt;The social events that are an integral part of the schedule are a lot of fun and always in very interesting settings. I know my kids will want to catch the newest event this year, a tour of the 400-year-old &lt;a href=&#34;http://en.wikipedia.org/wiki/Bodleian_Library&#34;&gt;Bodleian Library&lt;/a&gt;, although the library&amp;rsquo;s role as the Hogwarts infirmary in the first two Harry Potter movies will appeal to them more than the library&amp;rsquo;s age.&lt;/p&gt;
&lt;p&gt;There always seems to be plenty of wine with dinner and beer wherever we go afterwards, so it takes some discipline the night before presenting some sessions to go easy on the alcohol. For more on how much learning and fun the week provides, see also my &lt;a href=&#34;https://www.bobdc.com/blog/xml-summer-school-in-oxford&#34;&gt;wrapup&lt;/a&gt; from after the event last year. And it&amp;rsquo;s not too late to sign up and join us!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Women in computing: what about the cultural variable?</title>
      <link>https://www.bobdc.com/blog/women-in-computing-what-about/</link>
      <pubDate>Fri, 22 Jun 2007 08:57:37 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/women-in-computing-what-about/</guid>
      
      
      <description><div>Why do I see more women programmers among Eastern Europeans?</div><div>&lt;p&gt;A recent &lt;a href=&#34;http://www.devchix.com/2007/06/09/let%e2%80%99s-all-evolve-past-this-the-barriers-women-face-in-tech-communities/&#34;&gt;devchix&lt;/a&gt; blog post has inspired a lot of discussion about the low percentage of women software engineers out there. There&amp;rsquo;s been plenty of discussion in the XML community, as &lt;a href=&#34;http://www.tbray.org/ongoing/When/200x/2007/06/20/Women&#34;&gt;Tim Bray&lt;/a&gt;, &lt;a href=&#34;http://www.laurenwood.org/anyway/archives/2007/06/14/tech-women/&#34;&gt;Lauren Wood&lt;/a&gt;, &lt;a href=&#34;http://burningbird.net/technology/women-evidentally-dont-program/&#34;&gt;Shelley Powers&lt;/a&gt;, &lt;a href=&#34;http://www.jenitennison.com/blog/node/33&#34;&gt;Jeni Tennison&lt;/a&gt;, &lt;a href=&#34;http://times.usefulinc.com/2007/06/21-jeni&#34;&gt;Edd Dumbill&lt;/a&gt; and &lt;a href=&#34;http://www.megginson.com/blogs/quoderat/2007/06/21/maybe-the-women-are-right/&#34;&gt;David Megginson&lt;/a&gt; have contributed thoughtful comments. Everyone says that there are a lot fewer women than men writing code, especially in the US, the UK, and western Europe. OK, to be honest, I haven&amp;rsquo;t seen anyone include this qualifier, which is very interesting. (Note that Jeni, who is British, shows &lt;a href=&#34;http://www.jenitennison.com/blog/node/30&#34;&gt;graphs&lt;/a&gt; of male vs. female computer science degrees based on purely American data, or so I assume from the reference to &amp;ldquo;national data&amp;rdquo; in the &lt;a href=&#34;http://www.nsf.gov/statistics/nsf07307/tables/tab34.xls&#34;&gt;nsf.gov source&lt;/a&gt; of the data points.)&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;western Europe&amp;rdquo; part is particularly important, because I&amp;rsquo;d like to avoid the minefield of generalizations based on race or ethnicity. While it would be almost trite to say that Japanese attitudes about a culture of consensus are more in line with what devchix&amp;rsquo;s gloriajw is looking for, I&amp;rsquo;ve never personally known any programmers of either sex from that country. Professionally and personally, though, I&amp;rsquo;d say that at least half of the programmers I&amp;rsquo;ve known from former Soviet bloc countries are women, and I don&amp;rsquo;t expect to hear about a culture of consensus trumping pissing-contest swagger in those countries.&lt;/p&gt;
&lt;p&gt;When I was getting a computer science degree, it was always interesting in the first meeting of each large lecture course to do the mental pie graph of how many men and how many women were in each group. That&amp;rsquo;s when I first noticed that among those speaking Eastern European languages (not that I was very good at identifying exactly which languages were being spoken), there were plenty of women, perhaps even fifty percent. In workplaces since then, while I haven&amp;rsquo;t known dozens of Eastern European émigrés, women were well-represented among the ones I&amp;rsquo;ve known.&lt;/p&gt;
&lt;p&gt;Why is this? Does it stem from attitudes about the value of engineering as a profession in those countries? Is any of this a legacy of the Soviet system? Are there things that they&amp;rsquo;re doing right that we should emulate—beyond the obvious one of encouraging a good math education among both sexes—and how would such recommendations relate to the issues that the gloriajw and Jeni bring up? Am I making a mistake by basing my generalization on those who&amp;rsquo;ve emigrated to the US?&lt;/p&gt;
&lt;p&gt;My older daughter was in an after-school math club this year because her friend Diana begged her to join. Diana&amp;rsquo;s Romanian parents (her dad teaches chip design at the University of Virginia and her mom has been on maternity leave from a Java coding job) forced her to join, because they have strong ideas about the importance of a good math education. I&amp;rsquo;m going to have to ask Diana&amp;rsquo;s mom about the system that led her to becoming a software engineer the next time I see her.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By Anamaria Stoica on &lt;a href=&#34;#comment-999&#34;&gt;June 25, 2007 3:41 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi,&lt;br /&gt;
I&amp;rsquo;m a Romanian Computer Science student and also work as a programmer, and I&amp;rsquo;m a girl :)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d like to say something about the Romanian educational system: Math is considered one of the most important subjects there are in school. And something different that I&amp;rsquo;ve noticed, people that know Math are considered cool, at least that was like in my High School. Girls too.&lt;/p&gt;
&lt;p&gt;But, even though in my last year of High School we were about 50% boys and 50% girls(both very good at math &amp;amp; programming), not that many girls went to study Computer Science, and preferred instead Economical studies (Maybe it&amp;rsquo;s a trend right now here : girls go to economical universities, boys to engineering universities).&lt;/p&gt;
&lt;p&gt;At my faculty, we are about 25% girls right now. And in general the guys are very cool with the girls. They don&amp;rsquo;t make those kind of judgments based on gender. Although, there are ones that do(not many). BUT in almost every case I&amp;rsquo;ve encountered, those guys sucked as programmers, and in my opinion sucked as human beings too :) (Therefore their opinion is not so important)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d like to add something else about the guidance my parents gave me as growing up regarded to my studies. As both of them were engineers, they felt like a good math education for me and my brother was very important and always encouraged both(my brother and I) of us towards this direction equally.&lt;/p&gt;
&lt;p&gt;Thanks for this post and everyone else that posted about this subject. I think it&amp;rsquo;s very encouraging for young programmer girls that there is a concern about this. :) Not long ago I was looking for women in IT as models, and didn&amp;rsquo;t find that many, and started to worry about what chances did I have in this just because I was a girl.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snelson.org.uk&#34; title=&#34;http://www.snelson.org.uk&#34;&gt;John Snelson&lt;/a&gt; on &lt;a href=&#34;#comment-1000&#34;&gt;June 25, 2007 9:37 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Interesting point. I helped take a highly technical class at a company in Madrid a few years back, and was amazed that the attendees were about 50% female.&lt;/p&gt;
&lt;p&gt;Compare that to the UK, where I&amp;rsquo;ve only ever worked with one female software engineer&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.squidoo.com/best-airsoft-gun&#34; title=&#34;http://www.squidoo.com/best-airsoft-gun&#34;&gt;Jimmy&lt;/a&gt; on &lt;a href=&#34;#comment-1002&#34;&gt;June 26, 2007 1:37 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In my IT class there was only one girl:(( And not pretty at all :(((&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1004&#34;&gt;June 26, 2007 8:12 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You should read some of the posts I link to&amp;ndash;a public declaration of whether she&amp;rsquo;s pretty is part of the problem.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Time running out for Semantic Web Strategies talk proposals</title>
      <link>https://www.bobdc.com/blog/time-running-out-for-semantic/</link>
      <pubDate>Thu, 21 Jun 2007 16:56:20 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/time-running-out-for-semantic/</guid>
      
      
      <description><div>Nine days left!</div><div>&lt;p&gt;&lt;a href=&#34;http://www.semanticwebstrategies.com/index.php&#34;&gt;&lt;img src=&#34;http://www.semanticwebstrategies.com/images/logo_SWS_hdr.gif&#34; alt=&#34;[Semantic Web Strategies logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When we first set the proposal deadline of June 30th for speaker submissions, it seemed pretty far off, but it&amp;rsquo;s bearing down on us. Several people have told me that they have some ideas and plan to submit something, so it&amp;rsquo;s time to &lt;a href=&#34;http://www.semanticwebstrategies.com/speak.php&#34;&gt;fill out the form&lt;/a&gt;. It&amp;rsquo;s pretty simple and won&amp;rsquo;t take much time. If you have several ideas, don&amp;rsquo;t be afraid to try more than one.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Emacs: good (and how to create a foreign characters menu)</title>
      <link>https://www.bobdc.com/blog/emacs-good-and-how-to-create-a/</link>
      <pubDate>Mon, 18 Jun 2007 18:32:15 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/emacs-good-and-how-to-create-a/</guid>
      
      
      <description><div>Foreign to me, at least.</div><div>&lt;p&gt;I never tried to proselytize Emacs. Many times, when I saw someone using Windows Notepad, I told them &amp;ldquo;there are so many better, free alternatives out there, like &lt;a href=&#34;http://notepad-plus.sourceforge.net/uk/site.htm&#34;&gt;Notepad++&lt;/a&gt;&amp;rdquo;. These people inevitably responded by asking me if I used Notepad++, to which I replied &amp;ldquo;Well, you don&amp;rsquo;t want to know what I use. It&amp;rsquo;s a geek editor.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I thought that Emacs&amp;rsquo; learning curve would be too steep for most people, because although it&amp;rsquo;s highly configurable and adaptable, it won&amp;rsquo;t adapt too well to the typical Windows user&amp;rsquo;s expectation that Ctrl+X means &amp;ldquo;cut&amp;rdquo; and Ctrl+C means &amp;ldquo;copy&amp;rdquo;. Ctrl+X and Ctrl+C each begin enough important keystroke combinations in Emacs that you&amp;rsquo;re better off not redefining them, so you can never make Emacs too comfortable for Windows users.&lt;/p&gt;
&lt;p&gt;A December blog posting by Derek Slager titled &lt;a href=&#34;http://derekslager.com/blog/posts/2006/12/the-case-for-emacs.ashx&#34;&gt;The Case for Emacs&lt;/a&gt; makes a good case for non-Emacs users to switch, at least if you&amp;rsquo;re a programmer. (I really don&amp;rsquo;t do that much coding, by the way, and spend most of my time using Emacs in &lt;a href=&#34;http://www.thaiopensource.com/nxml-mode/&#34;&gt;nxml&lt;/a&gt; mode to write XML and XHTML. If I am coding, it&amp;rsquo;s usually XSLT, so of course I&amp;rsquo;m using nxml for that as well.) One of Slager&amp;rsquo;s key points is that when Emacs users learn a new programming language, they don&amp;rsquo;t need to learn a new IDE, because their existing development environment probably already has a full-featured mode for that language with keystrokes consistent with what they know from editing code in other languages. This makes them productive in their new language faster. As he put it,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you&amp;rsquo;re an Emacs user, you can be writing C# on Windows Monday, Ruby on Mac OS Tuesday, and Python on Linux on Wednesday. In each case, there are language-specific tools to use, but the place you spend the most time—your editor—is consistent across tools and platforms. Virtually all the time you invested in learning (and customizing) the editor comes along for the ride each time. Emacs lowers the bar.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/foreignchars.jpg&#34; alt=&#34;[emacs menu screenshot]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;Emacs just had its &lt;a href=&#34;http://lists.gnu.org/archive/html/info-gnu-emacs/2007-06/msg00000.html&#34;&gt;first significant upgrade&lt;/a&gt; in over four years. I haven&amp;rsquo;t had the time to look through the new features, but there&amp;rsquo;s one improvement that was immediately apparent to me: the little menus I had added to insert commonly needed foreign characters had worked correctly before, but displayed a little oddly, and now they both work and display properly.&lt;/p&gt;
&lt;p&gt;As the picture shows, my &amp;ldquo;non-ASCII&amp;rdquo; menu shows the vowels plus a few more common symbols. Picking a vowel displays a cascade menu of that vowel with different accents, and picking one of those inserts that character. I actually have mine inserting the numeric character references (for example, &lt;code&gt;&amp;amp;#228;&lt;/code&gt; for &lt;code&gt;ä&lt;/code&gt;) but it would be easy enough to modify the code to insert the actual characters.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require &#39;easymenu)


(easy-menu-define 
 non-ASCII-menu global-map &amp;quot;non-ASCII&amp;quot;
 &#39;(&amp;quot;non-ASCII&amp;quot;
    (&amp;quot;a&amp;quot;
    [&amp;quot;à&amp;quot; insert-agrave]
    [&amp;quot;á&amp;quot; insert-aacute]
    [&amp;quot;â&amp;quot; insert-acirc]
    [&amp;quot;ã&amp;quot; insert-atilde]
    [&amp;quot;ä&amp;quot; insert-auml]
    )
    (&amp;quot;e&amp;quot;
    [&amp;quot;è&amp;quot; insert-egrave]
    [&amp;quot;é&amp;quot; insert-eacute]
    [&amp;quot;ê&amp;quot; insert-ecirc]
    [&amp;quot;ë&amp;quot; insert-euml]
    )
    (&amp;quot;i&amp;quot;
    [&amp;quot;ì&amp;quot; insert-igrave]
    [&amp;quot;í&amp;quot; insert-iacute]
    [&amp;quot;î&amp;quot; insert-icirc]
    [&amp;quot;ï&amp;quot; insert-iuml]
    )
    (&amp;quot;o&amp;quot;
    [&amp;quot;ò&amp;quot; insert-ograve]
    [&amp;quot;ó&amp;quot; insert-oacute]
    [&amp;quot;ô&amp;quot; insert-ocirc]
    [&amp;quot;õ&amp;quot; insert-otilde]
    [&amp;quot;ö&amp;quot; insert-ouml]
    )
    (&amp;quot;u&amp;quot;
    [&amp;quot;ù&amp;quot; insert-ugrave]
    [&amp;quot;ú&amp;quot; insert-uacute]
    [&amp;quot;û&amp;quot; insert-ucirc]
    [&amp;quot;ü&amp;quot; insert-uuml]
    )
    [&amp;quot;ntilde&amp;quot; insert-ntilde]
    [&amp;quot;euro&amp;quot; insert-euro]
   )
)


(easy-menu-add non-ASCII-menu)


(defun insert-ntilde () (interactive) (insert &amp;quot;&amp;amp;#241;&amp;quot;) )
(defun insert-euro () (interactive) (insert &amp;quot;&amp;amp;#x20AC;&amp;quot;) )


(defun insert-agrave () (interactive) (insert &amp;quot;&amp;amp;#224;&amp;quot;) )
(defun insert-aacute () (interactive) (insert &amp;quot;&amp;amp;#225;&amp;quot;) )
(defun insert-acirc () (interactive) (insert &amp;quot;&amp;amp;#226;&amp;quot;) )
(defun insert-atilde () (interactive) (insert &amp;quot;&amp;amp;#227;&amp;quot;) )
(defun insert-auml () (interactive) (insert &amp;quot;&amp;amp;#228;&amp;quot;) )


(defun insert-egrave () (interactive) (insert &amp;quot;&amp;amp;#232;&amp;quot;) )
(defun insert-eacute () (interactive) (insert &amp;quot;&amp;amp;#233;&amp;quot;) )
(defun insert-ecirc () (interactive) (insert &amp;quot;&amp;amp;#234;&amp;quot;) )
(defun insert-euml () (interactive) (insert &amp;quot;&amp;amp;#235;&amp;quot;) )


(defun insert-igrave () (interactive) (insert &amp;quot;&amp;amp;#236;&amp;quot;) )
(defun insert-iacute () (interactive) (insert &amp;quot;&amp;amp;#237;&amp;quot;) )
(defun insert-icirc () (interactive) (insert &amp;quot;&amp;amp;#238;&amp;quot;) )
(defun insert-iuml () (interactive) (insert &amp;quot;&amp;amp;#239;&amp;quot;) )


(defun insert-ograve () (interactive) (insert &amp;quot;&amp;amp;#242;&amp;quot;) )
(defun insert-oacute () (interactive) (insert &amp;quot;&amp;amp;#243;&amp;quot;) )
(defun insert-ocirc () (interactive) (insert &amp;quot;&amp;amp;#244;&amp;quot;) )
(defun insert-otilde () (interactive) (insert &amp;quot;&amp;amp;#245;&amp;quot;) )
(defun insert-ouml () (interactive) (insert &amp;quot;&amp;amp;#246;&amp;quot;) )


(defun insert-ugrave () (interactive) (insert &amp;quot;&amp;amp;#249;&amp;quot;) )
(defun insert-uacute () (interactive) (insert &amp;quot;&amp;amp;#250;&amp;quot;) )
(defun insert-ucirc () (interactive) (insert &amp;quot;&amp;amp;#251;&amp;quot;) )
(defun insert-uuml () (interactive) (insert &amp;quot;&amp;amp;#252;&amp;quot;) )
&lt;/code&gt;&lt;/pre&gt;
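&lt;p&gt;To have the menu insert the actual characters instead of the numeric character references, redefine the insert functions along these lines (shown for one character; the others follow the same pattern):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;;; insert the literal character rather than its numeric reference
(defun insert-auml () (interactive) (insert &amp;quot;ä&amp;quot;) )
&lt;/code&gt;&lt;/pre&gt;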
&lt;p&gt;I look forward to finding out what else is new in Emacs 22.1.1.&lt;/p&gt;
&lt;p&gt;Something else nice for Emacs users: the weblog &lt;a href=&#34;http://emacs.wordpress.com/&#34;&gt;minor Emacs wizardry&lt;/a&gt; has lots of great tips. That&amp;rsquo;s where I learned that putting your cursor after any parenthesized expression and pressing &lt;code&gt;C-x C-e&lt;/code&gt; evaluates that expression. Just now I was wondering how much 18 tons of gravel would be for our driveway at $15 a ton, so I typed &lt;code&gt;(* 18 15)&lt;/code&gt; into the window I happened to be in, entered that keystroke combination, and had my answer.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>developerWorks article on XHTML 2</title>
      <link>https://www.bobdc.com/blog/developerworks-article-on-xhtm/</link>
      <pubDate>Wed, 13 Jun 2007 17:24:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/developerworks-article-on-xhtm/</guid>
      
      
      <description><div>Why I like XHTML 2.</div><div>&lt;p&gt;IBM developerWorks has just published an article I wrote called &lt;a href=&#34;http://www.ibm.com/developerworks/library/x-xhtml2now.html&#34;&gt;Put XHTML 2 to work now&lt;/a&gt;. I originally called it &amp;ldquo;XHTML 2: Useful Now&amp;rdquo;, the idea being that it&amp;rsquo;s worth doing some work with it now instead of waiting for it to become a Recommendation. They thought that this title might give the impression of &amp;ldquo;it&amp;rsquo;s finally become useful&amp;rdquo;, so I let them change it.&lt;/p&gt;
&lt;p&gt;This quote sums up the main idea of the article:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Many publishers that store content in XML have always known that using an existing, standard schema (by which I mean a W3C Schema, a RELAX NG schema, or a DTD) was better than creating their own from scratch. They looked at DocBook and found it too complex; they looked at HTML or XHTML 1 and found it too simple. For many of them, XHTML 2 will hit a sweet spot between the richness of DocBook and the simplicity of XHTML 1 that makes it a perfectly good format for storing content, whether that content has to be converted to other formats for delivery in various media or not.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://kfahlgren.com/blog&#34; title=&#34;http://kfahlgren.com/blog&#34;&gt;Keith Fahlgren&lt;/a&gt; on &lt;a href=&#34;#comment-970&#34;&gt;June 13, 2007 6:21 PM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;XHTML 2 will hit a sweet spot between the richness of DocBook and the simplicity of XHTML 1&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wonder if the work that the &lt;a href=&#34;http://wiki.docbook.org/topic/Publishers&#34;&gt;DocBook SubCommittee for Publishers&lt;/a&gt; is currently undertaking will be able to reduce (or, if you wanted to be kind, reduce in the culinary sense) DocBook&amp;rsquo;s richness into a useful, approachable subset?&lt;/p&gt;
&lt;p&gt;[Disclosure: I&amp;rsquo;m a subcommittee member]&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-971&#34;&gt;June 13, 2007 7:31 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Useful to whom&amp;rdquo; is always the difficult question when subsetting something that has lots of features for a wide variety of people, because it can be customized for some at the expense of others. Given the committee name, I assume you&amp;rsquo;ve already determined the audience for your subset, which is an important first step. I was a big fan of &lt;a href=&#34;http://www.docbook.org/schemas/simplified&#34;&gt;Simplified DocBook&lt;/a&gt; (docbook.org down as I write this), which has been around for a few years.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-972&#34;&gt;June 13, 2007 10:23 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Plug: &lt;a href=&#34;http://ccil.org/~cowan/xhtml2.odp&#34;&gt;Moving Toward XHTML 2.0&lt;/a&gt;, my ODF slide deck on how XHTML 2 works. Also available in &lt;a href=&#34;http://ccil.org/~cowan/xhtml2.ppt&#34;&gt;Powerpoint&lt;/a&gt; and &lt;a href=&#34;http://ccil.org/~cowan/xhtml2.pdf&#34;&gt;PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.grauw.nl/&#34; title=&#34;http://www.grauw.nl/&#34;&gt;Laurens Holst&lt;/a&gt; on &lt;a href=&#34;#comment-973&#34;&gt;June 14, 2007 4:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I’ve put XHTML2 to use in a commercial environment (read: company website and product documentation) for exactly these reasons. Had pretty good experiences with that, although it would have been nice if there was a WYSIWYG editor (or maybe rather WYSIWYM or whatever is the hip word nowadays).&lt;/p&gt;
&lt;p&gt;It was output to four different formats, with XSLT to HTML, XHTML+AJAX and Microsoft CHM files, and PDF using PrinceXML.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Semantic Web Strategies speaker submission form back up</title>
      <link>https://www.bobdc.com/blog/semantic-web-strategies-speake/</link>
      <pubDate>Mon, 11 Jun 2007 23:19:54 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-strategies-speake/</guid>
      
      
      <description><div>Ready for your ideas.</div><div>&lt;p&gt;The &lt;a href=&#34;http://www.semanticwebstrategies.com/speakerproposal.php&#34;&gt;proposal submission form&lt;/a&gt; for the Semantic Web Strategies conference was timing out over the weekend, but it&amp;rsquo;s back up and working properly now.&lt;/p&gt;
&lt;p&gt;I noticed that it won&amp;rsquo;t let you submit it unless you pick a &amp;ldquo;suggested track&amp;rdquo;; the two fairly arbitrary categories are &amp;ldquo;The Past and Present of the Semantic Web&amp;rdquo; and &amp;ldquo;The Present and Future of the Semantic Web&amp;rdquo;. The former is oriented more toward case studies of projects that have been completed, while the &amp;ldquo;Present and Future&amp;rdquo; one is for discussions of upcoming projects and developments. Don&amp;rsquo;t worry too much about which you pick.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve seen some interesting submissions, and we look forward to seeing more!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>More on Word&#39;s mediocre XML</title>
      <link>https://www.bobdc.com/blog/more-on-words-mediocre-xml/</link>
      <pubDate>Sun, 10 Jun 2007 09:55:11 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/more-on-words-mediocre-xml/</guid>
      
      
      <description><div>It&#39;s not just the index tag markup, but most of the &#34;Insert Field&#34; parts.</div><div>&lt;p&gt;After I &lt;a href=&#34;https://www.bobdc.com/blog/word-2003s-awful-xml-for-index&#34;&gt;wrote recently&lt;/a&gt; about the awful markup used to identify index entries when you save a Word 2003 file as XML, Jon Udell wrote to me to relay MS Office Program Manager &lt;a href=&#34;http://blogs.msdn.com/brian_jones/&#34;&gt;Brian Jones&lt;/a&gt;&amp;rsquo; query about whether I felt similarly about other markup in the XML version of a Word document. I haven&amp;rsquo;t had the time to do a comprehensive review of the XML, and I&amp;rsquo;ve &lt;a href=&#34;http://www.snee.com/bobdc.blog/2006/11/word_2003_xml.html&#34;&gt;written before&lt;/a&gt; about a pleasant surprise I found in it (and I was annoyed at the fuss over Microsoft &lt;a href=&#34;http://www.oreillynet.com/xml/blog/2007/01/an_interesting_offer.html&#34;&gt;paying Rick Jelliffe&lt;/a&gt; to add some perspective to the ODF/OOXML Wikipedia entries—it&amp;rsquo;s &lt;em&gt;Rick Jelliffe&lt;/em&gt;, for chrissake) but a bit more investigation let me generalize from my earlier negative comments, and after writing it out to Jon I thought I&amp;rsquo;d expand on it a bit and post it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/aa212812(office.11).aspx&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/wordml.jpg&#34; alt=&#34;[Word ML logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The project I&amp;rsquo;m writing doesn&amp;rsquo;t need the hyperlinks or table of contents markers in the Word XML, but from what I&amp;rsquo;ve seen of them, it looks like the XML representation of most of the Insert Field features seems to be an XML-ized version of the RTF: &lt;code&gt;&amp;lt;w:fldChar w:fldCharType=&amp;quot;begin&amp;quot;/&amp;gt;&lt;/code&gt;, then a &lt;code&gt;w:instrText&lt;/code&gt; element with some cryptic string such as &amp;rsquo; TOC \o &amp;ldquo;1-2&amp;rdquo; \n \p &amp;quot; &amp;quot; \h \z &amp;rsquo; for a table of contents marker, &amp;lsquo;HYPERLINK \l &amp;ldquo;_Toc135558539&amp;rdquo;&amp;rsquo; for a hyperlink, &amp;rsquo; XE &amp;ldquo;&amp;rsquo; for an index entry, and &lt;code&gt;&amp;lt;w:fldChar w:fldCharType=&amp;quot;end&amp;quot;/&amp;gt;&lt;/code&gt; to finish it.&lt;/p&gt;
&lt;p&gt;To test this theory, I created a sample document with about a dozen things added with different Insert Field selections and exported the result as an XML document. The XML versions of most of the field constructs begin and end with &lt;code&gt;w:r&lt;/code&gt; elements containing &lt;code&gt;w:fldChar&lt;/code&gt; elements with &lt;code&gt;w:fldCharType&lt;/code&gt; attribute values of &amp;ldquo;begin&amp;rdquo; and &amp;ldquo;end&amp;rdquo;. Some store their information in a &lt;code&gt;w:r&lt;/code&gt; child of a &lt;code&gt;w:fldSimple&lt;/code&gt; element instead. The &lt;code&gt;w:fldSimple&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;w:instr&lt;/code&gt; attribute seems to be the equivalent of the &lt;code&gt;w:instrText&lt;/code&gt; cousins of the &lt;code&gt;w:fldChar&lt;/code&gt; &amp;ldquo;begin&amp;rdquo; and &amp;ldquo;end&amp;rdquo; elements, with cryptic strings of uppercase keywords, punctuation, and quotation marks like the TOC one shown above to say something about their purpose. (To be fair, the &amp;ldquo;Hyperlink&amp;rdquo; field had an actual &lt;code&gt;w:hlink&lt;/code&gt; element to represent it.)&lt;/p&gt;
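&lt;p&gt;To make the begin/end pattern concrete, here is a hand-written sketch of what an index-entry field looks like (reconstructed from the pieces described above, not copied from an actual Word export; the entry text is made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;w:r&amp;gt;&amp;lt;w:fldChar w:fldCharType=&amp;quot;begin&amp;quot;/&amp;gt;&amp;lt;/w:r&amp;gt;
&amp;lt;w:r&amp;gt;&amp;lt;w:instrText&amp;gt; XE &amp;quot;mediocre markup&amp;quot; &amp;lt;/w:instrText&amp;gt;&amp;lt;/w:r&amp;gt;
&amp;lt;w:r&amp;gt;&amp;lt;w:fldChar w:fldCharType=&amp;quot;end&amp;quot;/&amp;gt;&amp;lt;/w:r&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nothing in the markup itself pairs the &amp;ldquo;begin&amp;rdquo; with its &amp;ldquo;end&amp;rdquo;; a program reading it has to track that state itself.&lt;/p&gt;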
&lt;p&gt;Indicating where the constructs begin and end with two separate, generic empty elements that have a &lt;code&gt;fldCharType&lt;/code&gt; attribute value of &amp;ldquo;begin&amp;rdquo; and &amp;ldquo;end&amp;rdquo; is much more difficult to work with than a matched pair of start- and end-tags. XML isn&amp;rsquo;t simply the representation of data with tags enclosed in angle brackets in such a way that Xerces doesn&amp;rsquo;t complain about it; much of the point of XML is to clearly indicate where things (and sub-things) begin and end using a matching pair of start- and end-tags. I suppose that an XML representation of a Word file must address the possibility of overlap—what if the document has &lt;strong&gt;bold text, &lt;em&gt;then bold italic,&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;then just italic?&lt;/em&gt;—but if the OpenOffice coders can parse the original Word file and turn it into good markup, we know it can be done.&lt;/p&gt;
&lt;p&gt;A new annoyance revealed by my further research is the fact that those &lt;code&gt;w:instrText&lt;/code&gt; elements store their cryptic strings of information such as &amp;rsquo; TOC \o &amp;ldquo;1-2&amp;rdquo; \n \p &amp;quot; &amp;quot; \h \z &amp;rsquo; as PCDATA. Using XSLT, it&amp;rsquo;s usually easy to check whether an element has no text content (regardless of the number of descendant elements it has) by checking whether &lt;code&gt;normalize-space(.)&lt;/code&gt; = &amp;ldquo;&amp;rdquo;, and when processing XML versions of Word there are often empty paragraphs and maybe even empty sections that you want to throw out, but these &lt;code&gt;w:instrText&lt;/code&gt; elements prevent this from working. I know that storing content in PCDATA and metadata in attributes is only a convention, but it&amp;rsquo;s a convention of document-oriented XML going back to SGML days, and an XML version of a Word file is certainly document-oriented XML. (More on this in the &lt;a href=&#34;https://www.bobdc.com/blog/word-2003s-awful-xml-for-index#comments&#34;&gt;comments&lt;/a&gt; to my earlier entry on the topic.)&lt;/p&gt;
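&lt;p&gt;As a sketch of the kind of check I mean (a hypothetical XSLT 1.0 template, not from a real stylesheet), tossing out empty paragraphs would normally be this simple:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- throw out paragraphs with no text content; the PCDATA
     in w:instrText defeats this test for field paragraphs --&amp;gt;
&amp;lt;xsl:template match=&amp;quot;w:p[normalize-space(.) = '']&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;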
&lt;p&gt;The kinds of things that a Word user picks &amp;ldquo;Insert Field&amp;rdquo; to add are often very important to what makes a Word or XML document richer than plain ASCII text with no markup, and it&amp;rsquo;s a shame that whoever designed the MS XML to represent these didn&amp;rsquo;t do a little more modeling of the data necessary to represent each field type and instead just mapped the RTF (or whatever internal structures that I&amp;rsquo;m sure the RTF reflects) to pointy brackets and strings full of internal codes. I&amp;rsquo;m sure it made their design work go more quickly, but the result is something that offers few good arguments for advocacy as a standard.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://blogs.msdn.com/brian_jones/&#34;&gt;Jones&amp;rsquo; blog&lt;/a&gt; has been talking up an open source API for processing the Office XML, and while it&amp;rsquo;s good that such a tool exists and is open source, it doesn&amp;rsquo;t address the issues I describe above. The &amp;ldquo;don&amp;rsquo;t worry about the data complexity, we have a tool that takes care of it&amp;rdquo; argument often presented in such cases leads to a software dependency, and the reason we use open data standards is to avoid dependency on specific tools. (A dirty little secret of the SGML world was that while we all preached the gospel of an open ISO data standard as a way to avoid dependency on specific software tools, most serious production work relied on &lt;a href=&#34;http://www.stilo.com/products/omnimark_overview.html&#34;&gt;Omnimark&lt;/a&gt;, a company that at the time was run by a man who would rather tell developers what they needed than listen to what they needed. One former employer of mine converted their SGML system to use XML purely to eliminate their dependency on Omnimark.) A dependency of a data format on a specific tool takes away from arguments toward making that data format a standard.&lt;/p&gt;
&lt;p&gt;The things that a Word doc file or an XML version of that doc file must represent can be complex, and I&amp;rsquo;m sure that further investigation of the XML, if I had the time, would reveal further pleasant surprises and further annoyances. So far the score, on balance, is pretty low.&lt;/p&gt;
&lt;h2 id=&#34;10-comments&#34;&gt;10 Comments&lt;/h2&gt;
&lt;p&gt;By Bruce D&amp;rsquo;Arcus on &lt;a href=&#34;#comment-960&#34;&gt;June 10, 2007 12:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;http://netapps.muohio.edu/blogs/darcusb/darcusb/archives/2006/06/08/citations-in-open-xml&#34;&gt;wrote about&lt;/a&gt; the field problem a while back WRT citations.&lt;/p&gt;
&lt;p&gt;On a related note, you might be interested to know that the OpenDocument Metadata SC has just wrapped up its proposal for enhanced metadata support. Based on RDF, it will include a generic field, whose logic is encoded using &amp;hellip; RDF.&lt;/p&gt;
&lt;p&gt;By bryan on &lt;a href=&#34;#comment-961&#34;&gt;June 10, 2007 1:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;I suppose that an XML representation of a Word file must address the possibility of overlap—what if the document has bold text, then bold italic, then just italic?—but if the OpenOffice coders can parse the original Word file and turn it into good markup, we know it can be done.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;not to mention if CSS and HTML can do it.&lt;/p&gt;
&lt;p&gt;By Rick Jelliffe on &lt;a href=&#34;#comment-962&#34;&gt;June 11, 2007 4:39 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks for the nice words! Here are some comments on the structure of Open XML, which I was working on for an article draft, but which may provide some extra general info for interested people.&lt;/p&gt;
&lt;p&gt;Open XML&amp;rsquo;s syntax is indeed odd at first, easily enough material for a year&amp;rsquo;s worth of blogs :-), but I have found that there are usually reasons: Open XML has been made using completely different tradeoffs than, say, DOCBOOK has, and consequently looks different.&lt;/p&gt;
&lt;p&gt;First, on the superficial syntax. Remember a few years ago when Michael Sperberg-McQueen was saying that the trouble with attributes was that you couldn&amp;rsquo;t have structured attributes, and the SML people were saying that there was no difference between an attribute and an element and that we should reduce our use of attributes to a minimum? That seems to have influenced MS&amp;rsquo; design choice behind their properties. They have systematically adopted a &amp;ldquo;head-body&amp;rdquo; approach to have properties in elements (this is hardly a new thing: I wrote about it in my 1998 book): there is a consistent naming convention of a &amp;ldquo;Pr&amp;rdquo; suffix used throughout.&lt;/p&gt;
&lt;p&gt;However, they also have the HTML-inspired approach that element content should only have searchable content in mind, so that searching doesn&amp;rsquo;t need to be schema-aware. (With the slight complication that you raise, that deleted text sections and fields still use data content, and the use of numeric indexes to shared string tables in SpreadsheetML.) Then they have decided against using mixed content, again influenced by the SML propaganda but also because it resolves one issue for documents loading into relational DBMS.&lt;/p&gt;
&lt;p&gt;Now I never cared for the SML ideas much: but the combination of allowing structured attributes, schema-less searches, and easy loading to DBMS are entirely respectable choices it seems to me. Which is not to say that DOCBOOK or ODF should adopt the same goals.&lt;/p&gt;
&lt;p&gt;Second, still at a fairly superficial level, terseness has been a goal of Open XML. This is particularly true in the oldest of the languages, SpreadsheetML, which goes back to the beginning of the decade. Not only does Open XML use short element names, it also reveals the internal optimizations of Excel: sparse matrixes, shared strings, SQL_DATE-style numeric indexes to dates, and so on.&lt;/p&gt;
&lt;p&gt;When we think of an application like Office, used by hundreds of millions of people worldwide, load/save/recalculate times are not a secondary issue; for example, saving 1 minute a day for one hundred million people is not nothing! Indeed, it then becomes a challenge to ODF, to say &amp;ldquo;Why don&amp;rsquo;t you support more optimized forms?&amp;rdquo; (No criticism of ODF intended: it is growing up.)&lt;/p&gt;
&lt;p&gt;Third, on the organizational level, Open XML uses its Open Packaging Conventions to recreate SGML&amp;rsquo;s entities: when referring to another document (part, eg equiv of entity) whether internal or external, an id is used (eg equiv of entity name), and each part that has such references has a relationships file (e.g. equiv of internal subset containing entity declarations) which maps the ids to URIs (internal or external). &amp;hellip;Indirection! Where is Eliot Kimber?&lt;/p&gt;
&lt;p&gt;The reason for this is obviously that in jettisoning DTDs for XML Schemas, you also jettison the mechanism for making compound documents, and the wheel needs to be reinvented. I don&amp;rsquo;t think there would be any need to mention to Bob how useful entities are for production purposes, for mid-sized documents. (For large documents, extra levels of indirection and managed IDs become practical; for smaller HTML-sized documents, direct markup of URLs is easier than an indirection mechanism; but for middle-sized documents, moving constants to headers helps manageability: for example, if you have a catalog with the same logo displayed 10,000 times, it is preferable to change it once in the relationship file (equiv to entity declaration) rather than in each of the 10,000 references.) Actually, OPC is something that I think ODF could well adopt.&lt;/p&gt;
&lt;p&gt;But OPC and relationships does make the markup a little more difficult to read, if you don&amp;rsquo;t know that they are there, because suddenly there is information held in different files. I think people coming from HTML will especially have this problem.&lt;/p&gt;
&lt;p&gt;Fourth, there is a difference at the design level too: as far as I can see, what MS were trying to do is to take a *completely* linear format and allow arbitrary interleaving of custom XML as the mechanism for *all* structuring. Office 2007 doesn&amp;rsquo;t do any structural implication that I know of (though I am not an expert in it.)&lt;/p&gt;
&lt;p&gt;So saying Open XML is like RTF-in-XML is not unfair, though to say that Open XML is *only* RTF-in-XML would be unfair. Nor would a comparison with HTML (a linear format where structures can be made by the user with DIV and SPAN) be unfair.&lt;/p&gt;
&lt;p&gt;Open XML is an &amp;ldquo;open&amp;rdquo; format in the sense that the zipper on a flasher&amp;rsquo;s pants is open: you may not like what you see, it may be less or more than you were expecting, but the functionality is exposed unadorned for all the world&amp;rsquo;s education: whether you are repelled or see opportunities is your business :-) The aim of Open XML is to expose everything that goes on inside Office 2007 not to mediate it according to some abstract/ideological view of the perfect document.&lt;/p&gt;
&lt;p&gt;So, in Word, a document is a list of blocks, and a block is either a list of runs or a table. Consequently, in WordprocessingML, a document contains a sequence of &lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;tbl&amp;gt;&lt;/code&gt; elements, and a &lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt; contains a sequence of &lt;code&gt;&amp;lt;r&amp;gt;&lt;/code&gt; run elements, which may contain a sequence of &lt;code&gt;&amp;lt;t&amp;gt;&lt;/code&gt; text runs and diagrams etc.&lt;/p&gt;
&lt;p&gt;The radical thing MS did was to take an interleaving approach to structure: you can open any schema, and use this with a context-sensitive editor (in Word) to wrap blocks, runs, rows and cells with &amp;ldquo;custom&amp;rdquo; elements from that schema. The schema is used to provide syntax direction, but not for subsequent validation; the created WordprocessingML document can still be validated against its usual schemas because the custom elements are marked up with one level of indirection, as values of customXml elements in the word-processing space. Now at the moment, this is not fully baked: you cannot key styles to customXml elements as far as I know: but the aim is to expose what Office 2007 does, not what it *may* or *should* do!&lt;/p&gt;
&lt;p&gt;In this way they are trying to turn the linear format from a flaw into a strength: if they had structures in place already (sections, lists, headings) they would have to figure out how not to clash with custom XML structures (which is a problem I expect ODF would have.)&lt;/p&gt;
&lt;p&gt;Fifth, I think Open XML is about the first consumer format I have seen which takes the separation of presentation from content in tables really seriously. This is something that Dave Peterson used to comment on, that tables are a presentation format which should link into tabular data held separately. So Open XML provides mapping controls to XPaths and also columns.&lt;/p&gt;
&lt;p&gt;Lou Burnard used to quote someone that all DTDs are theories about a document: Open XML is clearly a theory about office documents in which there is a hierarchy of 1) casual (linear) documents, then 2) linear documents containing links to highly structured XML data, then finally 3) structured documents. Each of these levels has a smaller user base than the preceding one, and the idea of Office XML is to expose what Word/Excel/Powerpoint does in attempting to add better support for the subsequent layers onto its linear roots: that the requirement for structured literary documents is less than the requirement for linking to structured data documents from unstructured literary documents.&lt;/p&gt;
&lt;p&gt;It is an interesting theory, and not one that can be sniffed at, I think. If we remember the Pinnacles DTDs, as used by chip makers in the early 90s, they had a database section in their header, for example for Vcc voltage levels, and the text used references to it. The value to the user was not the structuring of the information into sections (which can be done by stylesheets, i.e. by attributes/properties) but the ability to reference the database. Is that referencing capability (of XML) in fact more important for most users/uses than the explicitly hierarchical structuring capabilities?&lt;/p&gt;
&lt;p&gt;Now, all that being said, I hope one of the opportunities for the ISO process is to find out where there is some syntactic ugliness that causes real problems and to get the rationale explained and, if there is no good rationale and the issue is important, to get an improvement in the works. I don&amp;rsquo;t believe for a moment that Open XML is perfect, not that perfection is required for a standard of its type (i.e. one that exposes a particular deployed application), and the more that we can focus on real production issues of the kind that Bob is raising, rather than the parade of high-volume bogosity we have been treated to, the more chance that Open XML can be blocked as an ISO standard (if those real flaws reach a showstopping level) or improved (if those flaws can be explained or fixed) or accepted but with an understanding of its qualities and attributes.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-963&#34;&gt;June 11, 2007 8:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Rick,&lt;/p&gt;
&lt;p&gt;Thanks for all this! Two comments:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the idea of Office XML is to expose what Word/Excel/Powerpoint&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Is there a version of Powerpoint that can save as XML?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The aim of Open XML is to expose everything that goes on inside Office&lt;br /&gt;
2007 not to mediate it according to some abstract/ideological view of&lt;br /&gt;
the perfect document&amp;hellip; the aim is to expose what Office 2007 does&lt;br /&gt;
not what it *may* or *should* do!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can see the reasons for doing it this way, but how can MS advocate that an XML format designed to expose the internal workings of an aging binary format should be the standard adopted by governments and corporations around the world instead of one in which the abstractions were thought out first and the execution was modeled on those? Put another way, how can they suggest that the standard be for users to adapt their legislated norms to the quirks of one company&amp;rsquo;s tool instead of the other way around? (The answer, of course, is because it&amp;rsquo;s &lt;em&gt;their&lt;/em&gt; tool.)&lt;/p&gt;
&lt;p&gt;By Rick Jelliffe on &lt;a href=&#34;#comment-964&#34;&gt;June 11, 2007 9:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Is there a version of Powerpoint? Yes, Office 2007 and AFAIK Office 2000, 2003 and XP with the compatibility kits can save as XML-in-ZIP. (MS also has a save-everything-to-one-file-including-images kind of XML that was in Office 2003, but I hope they have not made that available.)&lt;/p&gt;
&lt;p&gt;But Bob, where on earth did you get the idea that anyone on the MS side has ever said that Open XML should be &amp;ldquo;&lt;strong&gt;the&lt;/strong&gt; standard adopted by governments and corporations around the world instead of one in which the abstractions were thought out first and the execution was modeled on those?&amp;rdquo; Boy, that is complete propaganda, and nothing like what I have ever advocated and nothing like what I have ever read or heard from any MS person: and I have been following the issue with more than casual interest. Indeed, MS voted &lt;strong&gt;for&lt;/strong&gt; ODF recently at ANSI/ICITS/V1 or whatever.&lt;/p&gt;
&lt;p&gt;Any sources for this? Or is it something that &amp;ldquo;everyone knows&amp;rdquo;? I think it comes from the idea that there cannot be overlapping standards, or that if there is a standard we are somehow forced to use it. I think it is words being put in MS&amp;rsquo; mouth. TCP is standard not because it is technically excellent or because it came as a result of great openness initially, but because the RFC fairly describes the pre-existing technology: people chose it (or didn&amp;rsquo;t) over the ISO standards because they were smart enough, not because they were confused by multiple standards or compelled to use it.&lt;/p&gt;
&lt;p&gt;ISO standards are voluntary, and the legislatures have by and large resisted the attempt to make them mandatory, especially when most of the mature implementations of ODF are proprietary (IBM word processor, Lotus&amp;rsquo; new Notes, Sun Star Office, MS Office) and the open source versions are notoriously ratty, in the 2006/2007 timeframe at least. You don&amp;rsquo;t fight a monopoly by creating a cartel. Norway has set a really good example recently for documents made public to the external population, by making ISO PDF mandatory for completed public documents, ISO ODF mandatory for incomplete public documents, HTML allowed for websites where appropriate, and any other ISO format allowed (i.e. future ISO Open XML) to provide parallel versions. That is great: a rich range of formats and the guidance about when to use each&amp;hellip;the Norwegian standards prudently leave out mandating what standards should be used internally in systems: that is really where you would expect Open XML to be positioned.&lt;/p&gt;
&lt;p&gt;Where MS is freaking out AFAICS is not where governments mandate ODF for public documents (few public documents are incomplete and so would be PDF or HTML anyway, and Office has a good ODF export/import story) but in the idea of disallowing the Office native format &lt;strong&gt;internally&lt;/strong&gt; inside production systems. They have put in a lot of XML-based features for which there is no equivalent: it makes Office much more competitive against, say, Crystal Reports or even Web forms and XMetal. It is that internal market for system developers and integrators and archive-openers to which Open XML is targetted, not the market of level-playing-field public document interchange between competing office suites. I think the competitors talk about public documents, but they are playing a bait-and-switch scam to try to block people from choosing Office for use in internal systems with its extra integrator-friendly features. In other words, MS wants to make Office a rich platform with features that go beyond ODF, MS&amp;rsquo; competitors want to prevent this and fence systems in to only use or exchange ODF (and any extensions that can be grafted on top of it.)&lt;/p&gt;
&lt;p&gt;The MS position, as I understand it from their public comments, is that ODF is fine for many simple document exchange uses but, taking a cold hard look despite our hopes and wishes otherwise, is not adequate (in its current form of ODF 1.0 and more so in its form of 2 years ago when the Office 2007 decisions were being made) for the most basic of requirements needed for the default format for Office: that you could save a document with all the information needed to reopen it unchanged. MS&amp;rsquo; choice was either to add what they needed to ODF 1.0 draft (not a good idea in view of embrace and extend concerns) or become mind readers and adopt two years ago features that ODF 1.3 or 1.4 will have in three years time (not a good idea due to the inaccuracy of clairvoyant technology.)&lt;/p&gt;
&lt;p&gt;However, I am not an MS spokesman (I haven&amp;rsquo;t even signed a non-disclosure agreement), so my view may be skewed. I just go on what their public comments and training material says.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-965&#34;&gt;June 11, 2007 10:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks, I hadn&amp;rsquo;t heard about the compatibility kits. I just saw &amp;ldquo;Save as XML&amp;rdquo; in Word and Excel but not PowerPoint.&lt;/p&gt;
&lt;p&gt;I guess I had an oversimplified view of the standardization issues. I thought it was a case of &amp;ldquo;let&amp;rsquo;s pick a format in which to store our content as we go forward,&amp;rdquo; with two sides saying &amp;ldquo;pick our format, not theirs&amp;rdquo;, which is often the case with disagreements over data standards. The idea of different levels, with a different format appropriate to each, makes sense.&lt;/p&gt;
&lt;p&gt;TCP was not a single product from a single company, and the standard gave a blueprint for implementers to work from to ensure interoperability. It sounds like Microsoft&amp;rsquo;s format functions more as an API to their product suite, and while I&amp;rsquo;m glad that it exists, what is the point of having it stamped as an ISO standard, besides the marketing advantages of being able to say &amp;ldquo;it&amp;rsquo;s an ISO standard&amp;rdquo;? In other words, what is the advantage to anyone outside of Microsoft of their XML format being an ISO standard? Wouldn&amp;rsquo;t implementers have to work around the interface decisions of this one company whether the documentation of this interface held &amp;ldquo;standard&amp;rdquo; status or not?&lt;/p&gt;
&lt;p&gt;By Rick Jelliffe on &lt;a href=&#34;#comment-966&#34;&gt;June 11, 2007 2:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The XML-in-ZIP is the default format for Office 2007 (Word, Excel, PowerPoint). If you do &amp;ldquo;Save as XML&amp;rdquo; you get the crazy all-in-one file format that adds a level of packaging elements instead of the ZIP package. To convert a .DOCX file to ZIP, just change the extension.&lt;/p&gt;
&lt;p&gt;They have given up binary formats as the default. You can still save in the old binary formats of course, or ODF and XHTML and ISO PDF/A if you want to, and Excel has a new fast binary format available as well. There are batch converters available to convert old repositories to Open XML too.&lt;/p&gt;
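&lt;p&gt;As a small illustration of that packaging point, here is a hypothetical sketch using Python&amp;rsquo;s zipfile module. The file content is a made-up stand-in; a real .docx has many more parts than this, but any ZIP tool can list them the same way once the extension is changed.&lt;/p&gt;

```python
# Hypothetical demo: an Office 2007 file is an ordinary ZIP package of
# XML parts, so ZIP tools can open it once the extension is changed.
# The part content here is a made-up stand-in, not a real document.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:        # stand-in for a real .docx
    z.writestr("word/document.xml", "stand-in for the real document part")
with zipfile.ZipFile(buf) as z:             # reopen it as a plain ZIP
    parts = z.namelist()                    # the package's internal files
print(parts)
```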
&lt;p&gt;TCP indeed pre-existed the RFC by about 7 years: see&lt;/p&gt;
&lt;p&gt;Cerf, V., and R. Kahn, &amp;ldquo;A Protocol for Packet Network&lt;br /&gt;
Intercommunication&amp;rdquo;, IEEE Transactions on Communications,&lt;br /&gt;
Vol. COM-22, No. 5, pp 637-648, May 1974.&lt;/p&gt;
&lt;p&gt;for the first description by researchers Vint Cerf and Robert Kahn. Then it went through about 8 incarnations at ARPA before it became an RFC. Almost all the fundamental internet technologies were developed as libraries/(=~applications/products) first then described later: it is not the blue-sky development method at all. It was the ISO OSI protocols that were developed based on blue sky thinking (a la Richard Gabriel&amp;rsquo;s The Right Way is the Wrong Way), to a large extent (or at least that is the mythology passed down.)&lt;/p&gt;
&lt;p&gt;Why should Open XML be an ISO standard? Well, my attitude is probably more &amp;ldquo;why shouldn&amp;rsquo;t it be?&amp;rdquo; ISO has to be fair, even to Microsoft. If a company can never win by playing the standards game, they never will; the trick is making sure that everyone wins. The basic reason something becomes a standard at ISO is that there is a market requirement for it: now if there isn&amp;rsquo;t a market requirement for MS to document and explain their formats it is hard to think of what else would pass the test&amp;hellip; Having Open XML will not stop the progress of ODF for public documents: the dynamics and sweet spots for both are too different.&lt;/p&gt;
&lt;p&gt;By marc on &lt;a href=&#34;#comment-974&#34;&gt;June 14, 2007 12:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If a company can never win by playing&lt;br /&gt;
the standards game, they never will&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;don&amp;rsquo;t be confused here:&lt;br /&gt;
&amp;ldquo;play the standards game&amp;rdquo; is not the same as&lt;br /&gt;
&amp;ldquo;game the standards system&amp;rdquo;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-975&#34;&gt;June 14, 2007 1:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;They&amp;rsquo;re close enough&amp;ndash;you&amp;rsquo;d be hard-pressed to make a list of companies that play the game with no interest in gaming the system and another list of companies that game the system without playing the game.&lt;/p&gt;
&lt;p&gt;By Bruce on &lt;a href=&#34;#comment-1005&#34;&gt;June 27, 2007 9:30 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Rick wrote:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;The basic reason something becomes a standard at ISO is that there is a market requirement for it: now if there isn&amp;rsquo;t a market requirement for MS to document and explain their formats it is hard to think of what else would pass the test&amp;hellip;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This seems to be the crux of the matter: what is &amp;ldquo;it&amp;rdquo; when we&amp;rsquo;re talking about standards?&lt;/p&gt;
&lt;p&gt;As I read through the ISO website looking for some hints, the conclusion I come to is that &amp;ldquo;it&amp;rdquo; is in fact an XML standard for office documents. On that basis, we already have one, and OOXML would be a competing standard, undermining the purpose of ISO.&lt;/p&gt;
&lt;p&gt;The notion that one should have two ISO standards for office documents thus seems to require Rick&amp;rsquo;s more narrow reading of OOXML (that &amp;ldquo;it&amp;rdquo; documents a specific dominant product, and that this is the &amp;ldquo;market requirement&amp;rdquo;).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Tutorial half-day added to Semantic Web Strategies conference</title>
      <link>https://www.bobdc.com/blog/tutorial-halfday-added-to-sema/</link>
      <pubDate>Sun, 03 Jun 2007 10:24:40 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/tutorial-halfday-added-to-sema/</guid>
      
      
      <description><div>A chance to learn about the semantic web from the ground up.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.semanticwebstrategies.com/index.php&#34;&gt;&lt;img src=&#34;http://www.semanticwebstrategies.com/images/logo_SWS_hdr.gif&#34; alt=&#34;[Semantic Web Strategies logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m happy to announce that we&amp;rsquo;re adding a half-day of tutorials the day before the &lt;a href=&#34;http://www.semanticwebstrategies.com/&#34;&gt;Semantic Web Strategies&lt;/a&gt; speaker sessions begin on October 1st. (I wrote more about this conference in a &lt;a href=&#34;https://www.bobdc.com/blog/chairing-a-new-semantic-web-co&#34;&gt;recent posting&lt;/a&gt;.) This will give people who are just starting to check out the world of the semantic web a chance to get some basic background in it before hearing about the various projects that will be discussed during the conference proper. It will also give people who are experienced at giving classes in this material a chance to make some money!&lt;/p&gt;
&lt;p&gt;Contact me if you&amp;rsquo;re interested in giving one of the classes on September 30th. And of course, please &lt;a href=&#34;http://www.semanticwebstrategies.com/speak.php&#34;&gt;submit a proposal&lt;/a&gt; for a talk to give during the conference about where semantic web technology fits into your organization&amp;rsquo;s past, present, or future work.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.semantic-web.at/&#34; title=&#34;http://www.semantic-web.at/&#34;&gt;Andreas Blumauer&lt;/a&gt; on &lt;a href=&#34;#comment-946&#34;&gt;June 4, 2007 2:40 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;my name is Andreas and I&amp;rsquo;m a co-founder of the Austria-based &amp;ldquo;Semantic Web School&amp;rdquo;. We&amp;rsquo;re an experienced team in giving tutorials on different aspects of the semantic web. We started our business in 2004 and have since introduced around 150 people from different industries, different countries, and organisations like Sun Microsystems, Siemens, or Sony to the semantic web.&lt;/p&gt;
&lt;p&gt;So we can give a perfect overview of goals, strategies, use-cases and of course also technologies in the semantic web. In our tutorial we would like to address business needs as well as technological aspects.&lt;/p&gt;
&lt;p&gt;We are an independent organisation in a network of experts from well known software providers as well as universities.&lt;/p&gt;
&lt;p&gt;We would like to offer you a half day tutorial with the title &amp;ldquo;Semantic Web for CEOs and CTOs in a nutshell&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Best regards!&lt;br /&gt;
Andreas&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Word 2003&#39;s awful XML for index elements</title>
      <link>https://www.bobdc.com/blog/word-2003s-awful-xml-for-index/</link>
      <pubDate>Thu, 31 May 2007 21:33:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/word-2003s-awful-xml-for-index/</guid>
      
      
      <description><div>My &#34;XML version of their RTF&#34; joke has become too real to be funny anymore.</div><div>&lt;p&gt;I&amp;rsquo;ve mostly watched the OpenOffice vs. Office Open XML debates as a spectator, but I have &lt;a href=&#34;http://www.xml.com/pub/a/2004/02/04/tr-xml.html&#34;&gt;dealt directly&lt;/a&gt; with OpenOffice XML with some nice results. I &lt;a href=&#34;https://www.bobdc.com/blog/word-2003-xml&#34;&gt;dabbled&lt;/a&gt; with Word&amp;rsquo;s XML a bit and found at least one nice surprise, but I hadn&amp;rsquo;t waded in too deeply until recently, and now that I have, I&amp;rsquo;m pretty disappointed. Basic paragraph markup is pretty messy, and the markup of index terms is awful.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/aa212812(office.11).aspx&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/wordml.jpg&#34; alt=&#34;[Word ML logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;w:p&lt;/code&gt; paragraph elements are split fairly arbitrarily into &lt;code&gt;w:r&lt;/code&gt; elements. A Microsoft &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/aa212812(office.11).aspx&#34;&gt;Overview of WordprocessingML&lt;/a&gt; tells us that &lt;code&gt;w:r&lt;/code&gt; stores &amp;ldquo;A contiguous set of WordprocessingML components with a consistent set of properties&amp;rdquo;. That&amp;rsquo;s all it tells us. Next to this definition is a link to a &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/aa223687(office.11).aspx&#34;&gt;special page for the r element&lt;/a&gt; that tells us nothing more about it but does tell us &amp;ldquo;For more information on this element, please refer to the VML Reference, located online in the Microsoft Developer Network (MSDN) Library&amp;rdquo;. It tells us this eleven times. Seriously. The &lt;code&gt;w:r&lt;/code&gt; elements are broken up, arbitrarily as far as I can tell, into &lt;code&gt;w:t&lt;/code&gt; elements, which &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/aa212812(office.11).aspx&#34;&gt;are defined as&lt;/a&gt; &amp;ldquo;a piece of text&amp;rdquo;. (Not like all those other XML elements!) I have to wonder what the famous six thousand pages of documentation for this format actually say.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;w:r&lt;/code&gt; and &lt;code&gt;w:t&lt;/code&gt; elements are annoying, but it&amp;rsquo;s not a lot of coding to just ignore them and concatenate their contents together. However, I don&amp;rsquo;t even want to try to write code that processes the XML version of index terms from a Word file. Here&amp;rsquo;s a sample, showing what happens when I inserted the index term &amp;ldquo;dogs&amp;rdquo; with a secondary term &amp;ldquo;beagles&amp;rdquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;w:r&amp;gt;&amp;lt;w:fldChar w:fldCharType=&amp;quot;begin&amp;quot;/&amp;gt;&amp;lt;/w:r&amp;gt;
&amp;lt;w:r&amp;gt;&amp;lt;w:instrText&amp;gt; XE &amp;quot;&amp;lt;/w:instrText&amp;gt;&amp;lt;/w:r&amp;gt;
&amp;lt;w:proofErr w:type=&amp;quot;spellStart&amp;quot;/&amp;gt;
&amp;lt;w:r wsp:rsidRPr=&amp;quot;00DD6A97&amp;quot;&amp;gt;
  &amp;lt;w:instrText&amp;gt;dogs:beagles&amp;lt;/w:instrText&amp;gt;
&amp;lt;/w:r&amp;gt;
&amp;lt;w:proofErr w:type=&amp;quot;spellEnd&amp;quot;/&amp;gt;
&amp;lt;w:r&amp;gt;&amp;lt;w:instrText&amp;gt;&amp;quot;&amp;lt;/w:instrText&amp;gt;&amp;lt;/w:r&amp;gt;
&amp;lt;w:r&amp;gt;&amp;lt;w:fldChar w:fldCharType=&amp;quot;end&amp;quot;/&amp;gt;&amp;lt;/w:r&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All of those &lt;code&gt;w:r&lt;/code&gt; elements are siblings of all the other &lt;code&gt;w:r&lt;/code&gt; elements in the same paragraph, so the only indication that the markup above is supposed to function as a single unit is the fact that one &lt;code&gt;w:r&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;w:fldChar&lt;/code&gt; child (which &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/aa213346(office.11).aspx&#34;&gt;the documentation&lt;/a&gt; says &amp;ldquo;Represents a field-delimiting character&amp;rdquo;) has a &lt;code&gt;w:fldCharType&lt;/code&gt; of &amp;ldquo;begin&amp;rdquo; and another has a value of &amp;ldquo;end&amp;rdquo;. Since a test in a separate Word document shows that Word recognizes the words &amp;ldquo;dogs&amp;rdquo; and &amp;ldquo;beagles&amp;rdquo; as spelled properly but doesn&amp;rsquo;t recognize the string &amp;ldquo;dogs:beagles&amp;rdquo;, I&amp;rsquo;m guessing that the two &lt;code&gt;w:proofErr&lt;/code&gt; elements are there because after Word put my primary and secondary index terms together with a colon delimiter, it didn&amp;rsquo;t recognize what it saw as a properly spelled word and marked the string as a misspelled one.&lt;/p&gt;
&lt;p&gt;Looking at the original Word file, I suppose that the field-delimiting characters in question are curly braces, and the value of the &lt;code&gt;w:instrText&lt;/code&gt; element (which &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/aa172854(office.11).aspx&#34;&gt;the doc&lt;/a&gt; says &amp;ldquo;Represents field instruction content&amp;rdquo;) of &amp;rsquo; XE &amp;ldquo;&amp;rsquo; tells us that it&amp;rsquo;s an indexing field. (Of course the double quote isn&amp;rsquo;t part of that—it goes with the one after the second &lt;code&gt;w:proofErr&lt;/code&gt; element!)&lt;/p&gt;
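&lt;p&gt;To make that begin/end grouping concrete, here is a rough sketch of how a script might stitch an XE field back together. This is a hypothetical illustration (not production code) using Python&amp;rsquo;s xml.etree.ElementTree against an already-parsed w:p element in the Word 2003 WordML namespace:&lt;/p&gt;

```python
# Hypothetical sketch: recover XE index entries from a parsed
# WordprocessingML w:p element. Assumes the Word 2003 namespace;
# w:proofErr siblings are skipped automatically because we only
# look at w:r runs.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

def index_entries(paragraph):
    """Return the text of each XE field found in a w:p element."""
    entries, in_field, buf = [], False, []
    for run in paragraph.findall(W + "r"):
        fld = run.find(W + "fldChar")
        if fld is not None:
            kind = fld.get(W + "fldCharType")
            if kind == "begin":
                in_field, buf = True, []
            elif kind == "end" and in_field:
                text = "".join(buf).strip()
                if text.startswith("XE"):
                    # drop the XE keyword and the surrounding quotes
                    entries.append(text[2:].strip().strip('"'))
                in_field = False
        elif in_field:
            instr = run.find(W + "instrText")
            if instr is not None and instr.text:
                buf.append(instr.text)
    return entries
```

&lt;p&gt;Run against the sample above, this would yield the string &amp;ldquo;dogs:beagles&amp;rdquo;; splitting on the colon would then separate the primary and secondary terms.&lt;/p&gt;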
&lt;p&gt;Has anyone written anything to parse through this mess? Some OpenOffice coders have written something to parse the original Word doc file, and they represent the same index tag with this single empty element:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;text:alphabetical-index-mark text:string-value=&amp;quot;collies&amp;quot; text:key1=&amp;quot;dogs&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The primary and secondary terms are stored as separately addressable values in less than one-fourth the text that Word used for its markup, we don&amp;rsquo;t need to guess where the markup showing the index terms starts and ends, and as an added bonus, the single element containing this has the word &amp;ldquo;index&amp;rdquo; in its name. (And of course, OpenOffice didn&amp;rsquo;t create a misspelled &amp;ldquo;word&amp;rdquo; and then identify it as misspelled.)&lt;/p&gt;
&lt;p&gt;In the Word version, the idea of curly brace field delimiters around the index markup brought up the specter of a ghost, and soon the ghost was hovering in front of me, moaning and rattling chains. To try to learn more about &amp;ldquo;XE&amp;rdquo; as a field instruction, I did a &lt;a href=&#34;http://www.google.com/search?q=word%20xml%20index%20xe%20%22field%20instruction%22&#34;&gt;Google search for &amp;lsquo;word xml index xe &amp;ldquo;field instruction&amp;rdquo;&amp;rsquo;&lt;/a&gt; (after several other fruitless searches) and the first of the only three hits was the file Word2007RTFSpec9.doc at &lt;a href=&#34;http://download.microsoft.com&#34;&gt;http://download.microsoft.com&lt;/a&gt;: the current spec for the original nemesis of Word interoperability, &amp;ldquo;Rich&amp;rdquo; Text &amp;ldquo;Format&amp;rdquo;. (This spec didn&amp;rsquo;t help me much.) I&amp;rsquo;d often joked that WordML was just an XML version of RTF; now I recognize that it really is, at least for indexing markup. I&amp;rsquo;ll try to look at the good side: at least if you forget a single delimiter with WordML, loading the bad document won&amp;rsquo;t cause a freeze-up that requires a hard reboot of your machine, as it often did if you forgot a single curly brace in RTF generated by a script you were working on.&lt;/p&gt;
&lt;p&gt;If that sequence of &lt;code&gt;w:r&lt;/code&gt; elements is Microsoft&amp;rsquo;s idea of a sensible standard for indexing markup, then they really don&amp;rsquo;t care about creating usable XML. Does anyone know of code out there that&amp;rsquo;s parsed Microsoft&amp;rsquo;s indexing XML to do something productive with it? My experiments all used Word 2003; has it been improved for the Word 2007 XML?&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By Bryan on &lt;a href=&#34;#comment-942&#34;&gt;June 1, 2007 4:34 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been arguing this stuff for the past year. The main thing is that OpenXML is constructed in such a way that, surprisingly enough, it does not work well with any of the common stack of XML processing technologies: DOM, XSLT, and SAX are all hampered in one way or another by the design decisions of the format.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.grauw.nl/&#34; title=&#34;http://www.grauw.nl/&#34;&gt;Laurens Holst&lt;/a&gt; on &lt;a href=&#34;#comment-944&#34;&gt;June 2, 2007 1:03 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Isn’t putting human-readable language in attributes trouble? Think subscripts, superscripts, etc. That’s why having separate elements for them seems better to me.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-945&#34;&gt;June 2, 2007 4:01 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The question of storing a given piece of information in an element or an attribute is as old as SGML; see &lt;a href=&#34;http://xml.silmaril.ie/developers/attributes/&#34;&gt;http://xml.silmaril.ie/developers/attributes/&lt;/a&gt; for a summary. The convention in document-oriented XML (and an XML representation of a Word file is pretty document oriented!) is to put document content as PCDATA in elements and processing metadata in attribute values.&lt;/p&gt;
&lt;p&gt;Regardless, if you spread information about a single construct across multiple elements, there should be a container element to show that they all go together, making it easier for processes to know when they&amp;rsquo;ve reached the beginning and end of such a construct. That&amp;rsquo;s much of the point of XML: start-tags and corresponding end-tags to show where things begin and end.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Chairing a new semantic web conference</title>
      <link>https://www.bobdc.com/blog/chairing-a-new-semantic-web-co/</link>
      <pubDate>Mon, 28 May 2007 11:40:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/chairing-a-new-semantic-web-co/</guid>
      
      
      <description><div>Come share your experiences!</div><div>&lt;p&gt;I&amp;rsquo;m very excited to announce a new semantic web conference, which I&amp;rsquo;ll be chairing: &lt;a href=&#34;http://www.semanticwebstrategies.com&#34;&gt;Semantic Web Strategies&lt;/a&gt;, which will be held in San Jose on October 1st and 2nd. &lt;a href=&#34;http://www.jupiterevents.com/&#34;&gt;Jupiterevents&lt;/a&gt;, a division of the venerable &lt;a href=&#34;http://www.jupitermedia.com/&#34;&gt;Jupitermedia&lt;/a&gt;, is doing all the infrastructure work of the conference, while I get to mostly stick to the fun parts.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.semanticwebstrategies.com/index.php&#34;&gt;&lt;img src=&#34;http://www.semanticwebstrategies.com/images/logo_SWS_hdr.gif&#34; alt=&#34;[Semantic Web Strategies logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/more-ways-to-make-money-from-t&#34;&gt;mentioned them&lt;/a&gt; when they first contacted me because they were also looking for someone to report on developments in the semantic web world. As I said then, it&amp;rsquo;s a good sign when organizations like theirs start paying attention to the field of semantic web work, because it means that they see some real growth ahead.&lt;/p&gt;
&lt;p&gt;There are already several conferences that address semantic web topics, and the presence of another pushes the field up the hype curve a little. What makes this conference different? I think its title distinguishes it from the &lt;a href=&#34;http://www.semantic-conference.com&#34;&gt;Semantic Technology Conference&lt;/a&gt;, which is probably the most comparable one out there—instead of focusing on the technology, this one will focus on semantic web strategies. We want to provide a forum for people to share stories about their past experiences, long-term plans, and short-term plans for semantic web-related applications and the relationship of these applications to their business plans. To string together some clichés, we want to hear lessons learned, war stories, success stories, and reports from the trenches about past, current, and future work.&lt;/p&gt;
&lt;p&gt;These stories don&amp;rsquo;t have to be about implemented projects directly tied to your organization&amp;rsquo;s business plan; they can be about skunkworks projects done under the radar (if you&amp;rsquo;ll pardon a few more clichés) that never got finished. Few projects are complete successes or complete failures, and assuming that yours falls in between, we want to hear about what worked, what didn&amp;rsquo;t, and what would have worked better if certain things had been in place—for example, tools or standards support that you didn&amp;rsquo;t see then but do now, or more support from certain areas of your organization that didn&amp;rsquo;t understand what you were doing. What decisions did you have to make? Which aspects turned out to be more, or less important than you originally anticipated?&lt;/p&gt;
&lt;p&gt;The metadata world has several examples of different groups working on similar projects with few connections to other groups doing related work, and as this work leads them to investigate semantic web tools, I&amp;rsquo;d like to have some panels that get representatives of these different groups together. For example, who&amp;rsquo;s working on metadata about images? the W3C&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/2001/sw/BestPractices/MM/&#34;&gt;Multimedia Annotation in the Semantic Web Task Force&lt;/a&gt;, the &lt;a href=&#34;http://www.chin.gc.ca/English/Standards/metadata_multimedia.html&#34;&gt;museum industry&lt;/a&gt;, the &lt;a href=&#34;http://www.loc.gov/standards/mix/&#34;&gt;Library of Congress&lt;/a&gt;, Idealliance&amp;rsquo;s &lt;a href=&#34;http://www.prismstandard.org/&#34;&gt;PRISM&lt;/a&gt; activity with their DIM2 work for the publishing industry, the insurance industry, and more. They each have different business needs, but they need to track similar resources, and as each of these groups starts investigating semantic web work, they can find ways to take advantage of each other&amp;rsquo;s work to benefit all the groups. People ask me what kind of audience the conference is shooting for, and I think it will have the greatest benefit if we gather different people from different worlds: the financial industry, biology and pharmaceutical research, the government, the military, the publishing industry, and the corresponding academic fields for each of these.&lt;/p&gt;
&lt;p&gt;In your own semantic web work, what are you proudest of? What do you regret the most? What do you know now that you wish you&amp;rsquo;d known when you started this work? Come share what you&amp;rsquo;ve learned and learn from others by &lt;a href=&#34;http://www.semanticwebstrategies.com/speak.php&#34;&gt;submitting a proposal&lt;/a&gt;. (No paper required!) I&amp;rsquo;m really looking forward to this.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Finding useful RDF data on the web</title>
      <link>https://www.bobdc.com/blog/finding-useful-rdf-data-on-the/</link>
      <pubDate>Tue, 22 May 2007 08:04:37 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/finding-useful-rdf-data-on-the/</guid>
      
      
      <description><div>At rdfdata.org or elsewhere.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.rdfdata.org&#34;&gt;&lt;img src=&#34;http://www.rdfdata.org/img/logo32x32.png&#34; alt=&#34;[rdfdata.org logo]&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;80&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My perennial rant that the world has too many ontologies and not enough useful data for those ontologies to describe goes back several years. At one point in 2004 I thought I&amp;rsquo;d look around the web for RDF data and compile a central list, and because the domain name &lt;a href=&#34;http://www.rdfdata.org&#34;&gt;rdfdata.org&lt;/a&gt; wasn&amp;rsquo;t taken, I grabbed it.&lt;/p&gt;
&lt;p&gt;I wasn&amp;rsquo;t interested in individual RSS 1.0 files, although I did create entries for &lt;a href=&#34;http://www.rdfdata.org/data.html#rss&#34;&gt;RSS collections&lt;/a&gt;. I also wasn&amp;rsquo;t interested in individual FOAF files, which were small and played a disproportionate role in discussions of the semantic web&amp;rsquo;s potential value at the time, but I did collect a list of &lt;a href=&#34;http://www.rdfdata.org/data.html#foaf&#34;&gt;larger FOAF resources&lt;/a&gt;. I created an RSS feed (1.0, of course) to announce new additions and added a &lt;a href=&#34;http://www.rdfdata.org/wiki/index.cgi?RDFDataContributions&#34;&gt;Wiki&lt;/a&gt; to the site for people to offer suggestions, but it usually got hijacked.&lt;/p&gt;
&lt;p&gt;After a few months, it became increasingly difficult to find new entries to post. A typical 40 minute search using automated Google API scripts—more on these below—might turn up nothing new besides a file that some student submitted with a school project, so I decided to give up. For my final entry on April Fool&amp;rsquo;s Day, 2005, I posted a link to an RDF file of information on available Elvis impersonators that I had created myself by scraping a &lt;a href=&#34;http://www.gigmasters.com/ElvisImpersonator/ElvisImpersonator.asp&#34;&gt;booking agency&amp;rsquo;s website&lt;/a&gt;. (When I submitted the &amp;ldquo;Database of Elvis Impersonators&amp;rdquo; to BoingBoing, Xeni Jardin actually &lt;a href=&#34;http://www.boingboing.net/2005/03/17/database_of_elvis_im.html&#34;&gt;wrote it up&lt;/a&gt; and credited me.) If you live in the U.S., just try to resist going to the booking agency&amp;rsquo;s website and doing a query for your city and state. (Canadians &lt;a href=&#34;http://www.gigmasters.com/ElvisImpersonator/ElvisImpersonator_Canada.asp&#34;&gt;have a page&lt;/a&gt; to query their city and province for Elvis impersonators, but all the pages seem to list the same five or six Americans who are apparently willing to travel pretty far to do their act.)&lt;/p&gt;
&lt;p&gt;I created a single &lt;a href=&#34;http://www.rdfdata.org/dat/rdfdata.rdf&#34;&gt;RDF file listing all the RDF sources&lt;/a&gt;, which may be useful to anyone looking for sample data. Of course, many of its references are now out of date.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re interested in the geekier details of how I tried to automate my searches for RDF files, read on.&lt;/p&gt;
&lt;h2 id=&#34;q0ZKA7AURxuh8fsbFtKk3w&#34;&gt;The scripts&lt;/h2&gt;
&lt;p&gt;I based the script that did the actual queries on googly.pl from O&amp;rsquo;Reilly&amp;rsquo;s &amp;ldquo;Google Hacks&amp;rdquo; book, which is &lt;a href=&#34;http://www.oreillynet.com/pub/h/2754&#34;&gt;available on the web&lt;/a&gt;. (Although mine is also a perl script, I renamed it &lt;a href=&#34;http://www.rdfdata.org/dat/findBigRDF.txt&#34;&gt;findBigRDF.txt&lt;/a&gt; for downloading purposes.) It&amp;rsquo;s pretty heavily commented, so it should be self-explanatory. Google allows up to 1000 queries a day with a given API key, so I set it to do less than that. After each query it loops through the results and ignores certain ones that I knew came up a lot. Its current state is the result of a lot of evolution as I had various ideas about finding RDF data files.&lt;/p&gt;
&lt;p&gt;Each day I would think of some query terms, write a batch file with lines like this, and run it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;perl findbigRDF.pl dc:creator   &amp;gt; findbigrdf.out
perl findbigRDF.pl dublin   &amp;gt;&amp;gt; findbigrdf.out
perl findbigRDF.pl subject   &amp;gt;&amp;gt; findbigrdf.out
perl findbigRDF.pl bioinformatics   &amp;gt;&amp;gt; findbigrdf.out
perl findbigRDF.pl gene   &amp;gt;&amp;gt; findbigrdf.out
perl findbigRDF.pl chromosome   &amp;gt;&amp;gt; findbigrdf.out
perl findbigRDF.pl commons   &amp;gt;&amp;gt; findbigrdf.out
perl findbigRDF.pl URL   &amp;gt;&amp;gt; findbigrdf.out
perl findbigRDF.pl topic   &amp;gt;&amp;gt; findbigrdf.out
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I&amp;rsquo;d sort the results of findbigrdf.out and run a script (python this time; the above script is perl because I found a perl Google API script to use as a model faster than I could find a python equivalent) to compare the results against URLs that were in my existing collection or in a notGoodURLs.xml file that I had also accumulated. Hopefully something interesting popped out at the end, but like I said, the results became skimpier and skimpier over time.&lt;/p&gt;
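&lt;p&gt;The core of that comparison step can be sketched in a few lines. This is a made-up fragment, not the original script; the parts that read the collection and the notGoodURLs.xml file are left out:&lt;/p&gt;

```python
# Hypothetical fragment of the filtering step: drop any URL already in
# the existing collection or in the accumulated "not good" list, so
# only genuinely new candidates are left for a manual look.
def new_urls(found, known, rejected):
    seen = set(known) | set(rejected)
    return sorted(u for u in set(found) if u not in seen)
```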
&lt;p&gt;I have no intention of adding any new entries to rdfdata.org, but web server logs show that the site is still surprisingly popular, so I wanted to write this up to give people some leads on existing RDF out there and some tools for finding more. And to everyone who made suggestions about resources to list on the website, I just wanted to say thank you, thank you very much.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.gigmasters.com/ElvisImpersonator/ElvisImpersonator.asp&#34;&gt;&lt;img src=&#34;http://www.gigmasters.com/images/musicians/12430.jpg&#34; alt=&#34;[Elvis impersonator]&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;200&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Keyboards for breakfast</title>
      <link>https://www.bobdc.com/blog/keyboards-for-breakfast/</link>
      <pubDate>Thu, 17 May 2007 07:57:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/keyboards-for-breakfast/</guid>
      
      
      <description><div>But would you want to pour syrup on it?</div><div>&lt;p&gt;My standard joke about not being able to program VCRs, my phone, etc. is that without a QWERTY keyboard I don&amp;rsquo;t know where to start. Finally, I can &lt;a href=&#34;http://www.treehugger.com/files/2007/05/my_type_of_appl.php&#34;&gt;program my breakfast&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.treehugger.com/files/2007/05/my_type_of_appl.php&#34;&gt;&lt;img src=&#34;http://www.treehugger.com/joeyroth/waffleiron3.jpg&#34; alt=&#34;[keyboard waffles]&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Years ago I took a darkroom course at this &lt;a href=&#34;http://www.schoolofvisualarts.edu/&#34;&gt;same institution&lt;/a&gt; and it was great.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Semantic Web project ideas number 5</title>
      <link>https://www.bobdc.com/blog/semantic-web-project-ideas-num-4/</link>
      <pubDate>Mon, 14 May 2007 09:34:37 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-project-ideas-num-4/</guid>
      
      
      <description><div>Use an existing ontology to make a web store easier to use.</div><div>&lt;p&gt;I was tempted to call this &amp;ldquo;Semantic Web project idea number 4a&amp;rdquo;, because it&amp;rsquo;s not a big leap from my &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-project-ideas-num-3&#34;&gt;last one&lt;/a&gt;. Perhaps if I generalize the idea more it will sound separate enough, but as you&amp;rsquo;ll see, my example builds on the last example.&lt;/p&gt;
&lt;p&gt;A big theme of semantic web evangelism is the value of combining multiple web resources into something greater than the sum of their parts, especially when one of those resources is in RDF. As I mentioned before, using this technology to help people find products they need is a great basis for an application, because in addition to benefiting the users it can earn a commission for the developer. Building an ontology from a taxonomy and then combining it with other resources has a lot of potential, but if we focus on the quest for tunes, there&amp;rsquo;s already at least one ontology that looks ready to use, and using an existing ontology for such an application would be a great demonstration of the value of the semantic web.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://metabrainz.org&#34;&gt;&lt;img src=&#34;http://metabrainz.org/images/mb-banner.png&#34; alt=&#34;[MusicBrainz logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For an online music store to incorporate into an application, &lt;a href=&#34;http://www.amazon.com/music-rock-classical-pop-jazz/b/ref=gw_br_mu/002-0807742-3960021?%5Fencoding=UTF8&amp;amp;node=5174&amp;amp;pf_rd_m=ATVPDKIKX0DER&amp;amp;pf_rd_s=left-nav-1&amp;amp;pf_rd_r=01JR4FXT9S2YB5NEENC6&amp;amp;pf_rd_t=101&amp;amp;pf_rd_p=285525001&amp;amp;pf_rd_i=507846&#34;&gt;Amazon&lt;/a&gt; is the most obvious choice, because they have an API and because they bought cdnow.com. Other obvious choices are &lt;a href=&#34;http://www.apple.com/itunes/&#34;&gt;iTunes&lt;/a&gt; and my new favorite, &lt;a href=&#34;http://www.jdoqocy.com/click-1973330-10364616&#34;&gt;emusic&lt;/a&gt;. (You won&amp;rsquo;t find the latest Justin Timberlake or Fergie hits there, but they have tons of great stuff from a wide variety of categories, and you get 30 DRM-free MP3s a month for $9.99 or 50 for $14.99 after the 25 free songs you get for signing up.)&lt;/p&gt;
&lt;p&gt;My first thought for a public resource to use for such an application was &lt;a href=&#34;http://musicbrainz.org/&#34;&gt;MusicBrainz&lt;/a&gt;, an open content database of album information, because I knew that they have an &lt;a href=&#34;http://wiki.musicbrainz.org/RDF&#34;&gt;RDF interface&lt;/a&gt;. It turns out that the RDF interface has been deprecated in favor of a &lt;a href=&#34;http://wiki.musicbrainz.org/WebService&#34;&gt;REST web service&lt;/a&gt;, which would still be invaluable to someone who wants to use a music database with a public API to help people find music.&lt;/p&gt;
&lt;p&gt;A better place to start, though, especially for a semweb geek, would be Frédérick Giasson and Yves Raimond&amp;rsquo;s &lt;a href=&#34;http://musicontology.com/&#34;&gt;Music Ontology Specification&lt;/a&gt;. They&amp;rsquo;ve even defined properties for &lt;a href=&#34;http://musicontology.com/#term_musicbrainz&#34;&gt;musicbrainz&lt;/a&gt;, &lt;a href=&#34;http://musicontology.com/#term_amazon_asin&#34;&gt;amazon&lt;/a&gt;, &lt;a href=&#34;http://musicontology.com/#term_myspace&#34;&gt;myspace&lt;/a&gt;, and other web-related music resources, which should make it easier to define relationships between these resources to connect information.&lt;/p&gt;
&lt;p&gt;Much of what I wrote in my last posting still applies here, especially the value of Amazon&amp;rsquo;s existing taxonomy as a resource. The advantages of working with an existing ontology instead of building one from a taxonomy should be obvious, but remember that many of the if-you-build-it-they-will-come OWL ontologies out there haven&amp;rsquo;t been applied to much real world data. By doing so yourself, you&amp;rsquo;ll be in a position to deliver feedback to the ontology designers that should be reflected as the ontology evolves. The Music Ontology has a &lt;a href=&#34;http://groups.google.com/group/music-ontology-specification-group&#34;&gt;Google discussion group&lt;/a&gt;, which will be handy for this.&lt;/p&gt;
&lt;p&gt;If you build an application around this ontology and some music, I think it would be easier to create some real value if you aim for depth more than breadth. Pick a category of music that you know and care about and create something for people who are interested but know less than you do about that music.&lt;/p&gt;
&lt;p&gt;Of course, your application doesn&amp;rsquo;t have to be about music. Find one or more related ontologies and one or more related e-commerce sites and build an application that shows that the former can add value to the latter. Perhaps, if you&amp;rsquo;re behind the development of one of these ontologies, you owe it to your ontology&amp;rsquo;s potential users to do such a project, if only on a small scale, to show that your work is more than an academic exercise.&lt;/p&gt;
&lt;p&gt;Using an existing ontology to make an e-commerce site easier to navigate will prove the potential value of the semantic web far more than any &amp;ldquo;Web n&amp;rdquo; essays for which n &amp;gt; 2.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-896&#34;&gt;May 18, 2007 9:21 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;On this topic, don&amp;rsquo;t miss Frederic&amp;rsquo;s &lt;a href=&#34;http://fgiasson.com/blog/index.php/2007/04/17/musicbrainz-relation-database-mapped-in-rdf-using-the-music-ontology/&#34;&gt;Musicbrainz Relation Database mapped in RDF using the Music Ontology&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/project-ideas">project ideas</category>
      
    </item>
    
    <item>
      <title>BoingBoing goofing on ontology designers</title>
      <link>https://www.bobdc.com/blog/boingboing-goofing-on-ontology/</link>
      <pubDate>Thu, 10 May 2007 17:32:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/boingboing-goofing-on-ontology/</guid>
      
      
      <description><div>Monotonicity constraints? Hilarious!</div><div>&lt;p&gt;I stopped reading BoingBoing some time ago, but my co-worker &lt;a href=&#34;http://drmacros-xml-rants.blogspot.com/&#34;&gt;Eliot Kimber&lt;/a&gt; just pointed me to a &lt;a href=&#34;http://www.boingboing.net/2007/05/10/pedantic_overanalysi.html&#34;&gt;BoingBoing item&lt;/a&gt; that makes fun of obsessive ontological design. It&amp;rsquo;s not every day that you see a &lt;a href=&#34;http://protege.stanford.edu/&#34;&gt;Protégé&lt;/a&gt; screenshot or references to monotonicity constraints in a BoingBoing humor piece. Ironically, the idea of classifying cute cats could get my younger daughter interested in using Protégé, but she&amp;rsquo;d probably be better off with &lt;a href=&#34;http://www.mindswap.org/2004/SWOOP/&#34;&gt;SWOOP&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And, &lt;a href=&#34;http://www.catsthatlooklikehitler.com/cgi-bin/seigmiaow.pl&#34;&gt;CatThatLooksLikeHitler&lt;/a&gt; is my new favorite class name.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Good XSLT advice</title>
      <link>https://www.bobdc.com/blog/good-xslt-advice/</link>
      <pubDate>Thu, 10 May 2007 10:23:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/good-xslt-advice/</guid>
      
      
<description><div>From two of the leading experts.</div><div>&lt;p&gt;Usually, when a tech friend starts a weblog, I try to say &amp;ldquo;hey, check it out&amp;rdquo; here, but I was a little behind on my news when I found out about Jeni Tennison&amp;rsquo;s blog. Plenty of other people had already pointed to it, and it was included in the Planet XML feed, so I figured that everyone who should know about it already did. But she just keeps delivering solid, useful XSLT advice, so a few weeks late, I&amp;rsquo;ll say it: if you write many XSLT stylesheets, you owe it to yourself to read &lt;a href=&#34;http://www.jenitennison.com/blog/&#34;&gt;Jeni&amp;rsquo;s musings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Of course you&amp;rsquo;ll also want to read Michael Kay&amp;rsquo;s &lt;a href=&#34;http://saxonica.blogharbor.com/blog&#34;&gt;Saxon diaries&lt;/a&gt;. He usually writes more from his perspective as an XSLT implementer, so it&amp;rsquo;s valuable the same way that studying compilers is valuable even if you&amp;rsquo;ll never write a compiler, because you have a better idea of what the computer does with these instructions that you write. Jeni&amp;rsquo;s focus on common stylesheet design questions (for example, match templates or named templates?) will be helpful to an even broader range of stylesheet authors.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>Semantic Web project ideas number 4</title>
      <link>https://www.bobdc.com/blog/semantic-web-project-ideas-num-3/</link>
      <pubDate>Mon, 30 Apr 2007 09:01:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-project-ideas-num-3/</guid>
      
      
      <description><div>Build an ontology and rules around a working taxonomy—and maybe make some money!</div><div>&lt;p&gt;Can a taxonomy help you buy a lightbulb? I didn&amp;rsquo;t think so, but when Ron Daniel of &lt;a href=&#34;http://www.taxonomystrategies.com&#34;&gt;Taxonomy Strategies&lt;/a&gt; told me how they helped a big box hardware store with the product taxonomy that drove their online store&amp;rsquo;s menus, I realized that taxonomies aren&amp;rsquo;t just for classifying content, as my publishing technology bias had led me to believe.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/s/ref=sr_nr_n_2/002-0807742-3960021?ie=UTF8&amp;amp;rh=n:4285,n:4959&#34;&gt;&lt;img src=&#34;http://ec1.images-amazon.com/images/I/51bxiMtPDiL._AA240_.jpg&#34; alt=&#34;[Cooking Apicius cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A taxonomy organizes concepts and puts them into a hierarchical relationship. An ontology lets you define whatever relationships you want between any concepts in your collection, and a taxonomy is a great head start to creating an ontology.&lt;/p&gt;
&lt;p&gt;If you only consider online stores, there are still a lot of taxonomies out there. Amazon is an obvious one, and the &lt;a href=&#34;http://www.browsenodes.com/&#34;&gt;BrowseNodes.com&lt;/a&gt; website (&lt;a href=&#34;http://www.browsenodes.com/node-283155.html&#34;&gt;Amazon book page&lt;/a&gt;) offers a machine-readable summary of Amazon&amp;rsquo;s taxonomy with lines like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Subjects|283155|1000
History|1000|9
Europe|9|4935
Italy|4935|4959
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the fourth line above, 4959 is the node number and 4935 is the parent node&amp;rsquo;s number, showing that 4959 is for Italian History. For a URL to represent a taxonomy node, add the number to the URL stub &lt;a href=&#34;http://www.amazon.com/exec/obidos/tg/browse/-/&#34;&gt;http://www.amazon.com/exec/obidos/tg/browse/-/&lt;/a&gt;. For example, &lt;a href=&#34;http://www.amazon.com/exec/obidos/tg/browse/-/4959&#34;&gt;http://www.amazon.com/exec/obidos/tg/browse/-/4959&lt;/a&gt; is the URL that represents Italian History, and following that URL takes you to a page full of books on this topic. Amazon also lets you do a boolean AND operation on taxonomy nodes in your URLs. With 4285 being the node for Italian Cooking, the URL &lt;a href=&#34;http://www.amazon.com/s/ref=sr_nr_n_2/002-0807742-3960021?ie=UTF8&amp;amp;rh=n:4285,n:4959&#34;&gt;http://www.amazon.com/s/ref=sr_nr_n_2/002-0807742-3960021?ie=UTF8&amp;amp;rh=n:4285,n:4959&lt;/a&gt; takes you to a list of books that fall in both the Italian Cooking and Italian History categories, and &lt;a href=&#34;http://www.amazon.com/s/ref=sr_nr_n_2/002-0807742-3960021?ie=UTF8&amp;amp;rh=n:3870,n:3957&#34;&gt;http://www.amazon.com/s/ref=sr_nr_n_2/002-0807742-3960021?ie=UTF8&amp;amp;rh=n:3870,n:3957&lt;/a&gt; shows books that fall in both the &amp;ldquo;C and C++&amp;rdquo; and the Algorithms categories.&lt;/p&gt;
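&lt;p&gt;To make the node/parent layout concrete, here&amp;rsquo;s a small sketch (my own illustration, not anything from BrowseNodes.com or an Amazon API) that parses lines in that format and rebuilds a node&amp;rsquo;s browse URL and category path:&lt;/p&gt;

```python
# Sketch: parse BrowseNodes-style lines ("Name|parentNode|node")
# and turn a node number into the browse URL described above.

STUB = "http://www.amazon.com/exec/obidos/tg/browse/-/"

def parse_taxonomy(lines):
    """Map each node number to a (name, parent node number) pair."""
    nodes = {}
    for line in lines:
        name, parent, node = line.strip().split("|")
        nodes[node] = (name, parent)
    return nodes

def browse_url(node):
    """Append the node number to the URL stub."""
    return STUB + node

def ancestry(nodes, node):
    """Walk parent links to list category names from root to leaf."""
    names = []
    while node in nodes:
        name, parent = nodes[node]
        names.append(name)
        node = parent
    return list(reversed(names))

lines = ["Subjects|283155|1000", "History|1000|9",
         "Europe|9|4935", "Italy|4935|4959"]
nodes = parse_taxonomy(lines)
print(browse_url("4959"))
print(" / ".join(ancestry(nodes, "4959")))
```

&lt;p&gt;Run on the four sample lines, this prints the Italian History URL and the path Subjects / History / Europe / Italy.&lt;/p&gt;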
&lt;p&gt;Because you can reference the nodes of this taxonomy with URLs, many pieces are in place to build an RDF/OWL ontology and even rules around the nodes. Why would you want to? In Amazon&amp;rsquo;s case, an ontology of relationships between product category nodes can help people find products that they&amp;rsquo;re interested in more easily. As a bonus, it might even make you some money.&lt;/p&gt;
&lt;p&gt;Of course, once you&amp;rsquo;ve located a product that you&amp;rsquo;re interested in on Amazon or another online store, they have algorithms and data in place to identify related products that you might like. The &amp;ldquo;Look for similar items by category&amp;rdquo; section at the bottom of an Amazon book page does this to some extent, sometimes even linking to multiple points in their subject taxonomy. For the &lt;a href=&#34;http://www.amazon.com/Lidias-Italy-Simple-Delicious-Recipes/dp/1400040361/ref=sr_1_1/002-0807742-3960021?ie=UTF8&amp;amp;s=books&amp;amp;qid=1177687356&amp;amp;sr=1-1&#34;&gt;Lidia&amp;rsquo;s Italy&lt;/a&gt; cookbook, this section has links to the sections on Italian Cooking, Italian History, and travel in Italy. Can semantic web technologies augment this attempt to find other products that the user might want? The data and tools to prove it are all freely available.&lt;/p&gt;
&lt;p&gt;Above you saw examples of URLs that take you right to Amazon pages for specific categories of books. If you build a tool that uses an ontology to help people find books and other products that they might like, a slight tweak to the URL that sends people from your tool to Amazon&amp;rsquo;s web site earns you a commission on anything they buy once they get there, whether it&amp;rsquo;s on the page you sent them to or not. I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/creating-an-affiliate-website&#34;&gt;written before&lt;/a&gt; about the Amazon Associates program, but I only recently learned that URLs with your associate ID will earn you a commission even when you send them to product category pages. For example, if you want to buy a book about Italian history, I&amp;rsquo;d rather that you followed &lt;a href=&#34;http://www.amazon.com/exec/obidos/redirect?link_code=ur2&amp;amp;camp=1789&amp;amp;tag=bobducharmeA&amp;amp;creative=9325&amp;amp;path=tg/browse/-/4959&#34;&gt;this link&lt;/a&gt; than the one shown earlier, because I&amp;rsquo;ll make a commission from it.&lt;/p&gt;
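&lt;p&gt;As a rough sketch of that URL tweak, here&amp;rsquo;s how the redirect link above breaks down into pieces. The parameter names are copied from that one example link; treat the exact layout as an assumption rather than documented Amazon behavior:&lt;/p&gt;

```python
# Sketch: build an Associates redirect link for a browse node so
# that referrals to a category page carry your associate tag.
# Parameter names mirror the example link in this post; they are
# not taken from any official documentation.
from urllib.parse import urlencode

def associate_link(node, tag):
    params = [
        ("link_code", "ur2"),
        ("camp", "1789"),
        ("tag", tag),
        ("creative", "9325"),
        ("path", "tg/browse/-/" + node),  # the category page to land on
    ]
    # safe="/-" keeps the slashes and hyphen in the path readable
    return ("http://www.amazon.com/exec/obidos/redirect?"
            + urlencode(params, safe="/-"))

print(associate_link("4959", "bobducharmeA"))
```

&lt;p&gt;The point is just that the node number rides along in the path parameter while the tag parameter identifies whose associate account gets the commission.&lt;/p&gt;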
&lt;p&gt;Plenty of other retail websites have both taxonomies and affiliate programs. Your ontology doesn&amp;rsquo;t have to limit itself to products on just one of these sites; you could link products from multiple sites.&lt;/p&gt;
&lt;p&gt;Semantic Web evangelism often describes the existence of a business potential for this technology, but rarely on a scale that can be implemented by one person with free tools. I think that there are a lot of possibilities here.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.cs.umd.edu/~hendler&#34; title=&#34;http://www.cs.umd.edu/~hendler&#34;&gt;Jim Hendler&lt;/a&gt; on &lt;a href=&#34;#comment-826&#34;&gt;April 30, 2007 2:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Great post &amp;ndash; there is a bias towards &amp;ldquo;big&amp;rdquo; ontologies, largely because small ones on the Web are relatively new. There&amp;rsquo;s a lot of potential in this area - and managing Web sites and the like is one of them &amp;ndash; take a look at &lt;a href=&#34;http://www.w3.org/TR/webont-req/&#34;&gt;http://www.w3.org/TR/webont-req/&lt;/a&gt; - the OWL Use Case and Requirements document at W3C for some other suggestions - a bit out of date, but still has lots of good ideas.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-827&#34;&gt;April 30, 2007 3:39 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Jim! Speaking of &amp;ldquo;small&amp;rdquo; ontologies, my idea might be even more practical for someone who limits their work to a subtree of the Amazon book taxonomy that is related to their area of interest&amp;ndash;e.g. Italy and related, programming languages, certain branches of law&amp;ndash;because this person would be more qualified to 1. identify useful new connections between nodes and 2. get word of their new tool&amp;rsquo;s existence to an audience that would be interested in using it.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/project-ideas">project ideas</category>
      
    </item>
    
    <item>
      <title>Semantic Web project ideas number 3</title>
      <link>https://www.bobdc.com/blog/semantic-web-project-ideas-num-2/</link>
      <pubDate>Fri, 20 Apr 2007 08:53:59 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-project-ideas-num-2/</guid>
      
      
<description><div>Planning those enterprise resources.</div><div>&lt;p&gt;For part 3 of my &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/semantic_web/project_ideas/&#34;&gt;series on semantic web project ideas&lt;/a&gt;, I was tempted to take &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-project-ideas-num-1&#34;&gt;part 2&lt;/a&gt; and do a global replace, substituting &amp;ldquo;ERP&amp;rdquo; for &amp;ldquo;CRM&amp;rdquo;. I&amp;rsquo;ll briefly recap what a semantic web add-on to an open source Enterprise Resource Planning package would have in common with a similar add-on to an open source Customer Relationship Management package:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I didn&amp;rsquo;t know that open source packages for this existed until I read the Tapscott/Williams book.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s about integration across organization boundaries, so semweb technology should have plenty to offer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When researching existing open source packages, look for discussion forums of users and developers to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Get ideas for what semweb technologies can add&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Find a user with real data that you can work with (maybe for money!)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Get some idea about how easy or difficult the code is to work with&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For the basics of what ERP is about, Wikipedia&amp;rsquo;s &lt;a href=&#34;http://en.wikipedia.org/wiki/Enterprise_resource_planning&#34;&gt;Enterprise resource planning&lt;/a&gt; page seems like a good overview. It describes ERP as an integration of manufacturing, supply chain, financial, CRM and warehouse management systems. Considering what the &amp;ldquo;R&amp;rdquo; in &amp;ldquo;ERP&amp;rdquo; stands for, semweb tools for tracking data about resources and finding new connections and relationships between them should have plenty to offer for such systems.&lt;/p&gt;
&lt;p&gt;A read-write interface might be too ambitious to start with, but wrapping SPARQL interfaces around a few ERP subsystems would be a great beginning. (Make sure to include a few queries that demonstrate productive tasks that couldn&amp;rsquo;t be done without the SPARQL interface; the interface alone isn&amp;rsquo;t enough. You want people who&amp;rsquo;ve never heard of SPARQL to say &amp;ldquo;I want that! I need that!&amp;rdquo;) There are &lt;a href=&#34;http://www.google.com/search?q=%22open%20source%22%20erp&#34;&gt;plenty of open source ERP&lt;/a&gt; packages out there—some even linked to CRM products from the same organizations—so there&amp;rsquo;s plenty to work with. One has the intriguing name of &lt;a href=&#34;http://tinyerp.com/&#34;&gt;Tiny ERP&lt;/a&gt;, and &lt;a href=&#34;http://www.tinyerp.com/download/stable/source/&#34;&gt;its source code&lt;/a&gt; is available in two gzipped files: one of client code and one of server code.&lt;/p&gt;
&lt;p&gt;Google searches for &lt;a href=&#34;http://www.google.com/search?q=erp%20ontology&#34;&gt;ERP ontology&lt;/a&gt; get a few interesting hits. There&amp;rsquo;s probably nothing ready to use as-is, but they look like a good head start. As with CRM semantic web development, I&amp;rsquo;d rather see someone combine iterative additions to an ontology with demos of working code than doing lots and lots of ontology work before building something that actually uses the ontology.&lt;/p&gt;
&lt;p&gt;In a comment to my posting on the possibilities of adding semantic web features to CRM systems, John Cowan pointed out that most open source CRM systems won&amp;rsquo;t completely let you take their code and run with it, but instead only let you re-use it if you distribute your new version with the original developer&amp;rsquo;s branding. This isn&amp;rsquo;t ideal, but it still offers opportunities—if you can get at the package&amp;rsquo;s source, you can add new features, such as the exposure of the package&amp;rsquo;s data through a SPARQL endpoint. I think it will be worth it, because a successful implementation could demonstrate the value of semweb technology to a lot of people who would really take notice.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-795&#34;&gt;April 20, 2007 5:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know if you are aware of the &lt;a href=&#34;http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData&#34;&gt;Linking Open Data project&lt;/a&gt;? Anyway, we are using this community to drive through work in this area (note the &lt;a href=&#34;http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/THALIATestbed&#34;&gt;THALIA integration benchmark&lt;/a&gt; effort).&lt;/p&gt;
&lt;p&gt;We already have an eCRM system that is part of &lt;a href=&#34;http://virtuoso.openlinksw.com/wiki/main/Main/OdsIndex&#34;&gt;ODS&lt;/a&gt;. The product is currently unreleased because we haven&amp;rsquo;t found a shared ontology that models the CRM domain. Anyway, we are not only working on RDF and SPARQL access to our eCRM effort. We are also looking to implement this atop Open Source CRM systems like SugarCRM. The same goals apply to ERP systems (of the Open Source variety).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-796&#34;&gt;April 20, 2007 5:51 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Kingsley,&lt;/p&gt;
&lt;p&gt;Great to hear! And I &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-project-ideas-num#uuMXH1_SRwWQ-RzX3t-SsA&#34;&gt;saw this coming&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/project-ideas">project ideas</category>
      
    </item>
    
    <item>
      <title>Tech</title>
      <link>https://www.bobdc.com/blog/tech/</link>
      <pubDate>Wed, 18 Apr 2007 10:51:35 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/tech/</guid>
      
      
      <description><div>UVA&#39;s country cousin.</div><div>&lt;p&gt;While in Dallas on business recently, I heard the sports news mention a &amp;ldquo;tech&amp;rdquo; basketball game, and I thought &amp;ldquo;What do they care about Virginia Tech?&amp;rdquo; Of course, they meant Texas Tech, but where I live, it means Virginia Tech.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.vt.edu/&#34;&gt;&lt;img src=&#34;http://upload.wikimedia.org/wikipedia/en/9/99/HokieHockeyBird.jpg&#34; alt=&#34;[hockey Hokie]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I live in Charlottesville, about two hours northeast of Blacksburg. Charlottesville is the home of the University of Virginia, where looking down on Tech as the goofy country cousin of &amp;ldquo;Mr. Jefferson&amp;rsquo;s University&amp;rdquo; has been a popular pastime for years. You can even buy &lt;a href=&#34;http://www.mincers.com/php-bin/ecomm4/products.php?category_id=&amp;amp;product_id=622&amp;amp;prev_id=&amp;amp;next_id=572&#34;&gt;T-shirts&lt;/a&gt; in UVA book and souvenir stores showing Tech students as gap-toothed hillbillies. Any football or basketball game between the two schools is a big deal; they even have a trophy for the most recent winner of the football game.&lt;/p&gt;
&lt;p&gt;Plenty of people in Charlottesville went to and root for Tech, and any kid here who wants to grow up to be a veterinarian wants to go there, because they&amp;rsquo;re famous for that. One of my daughters knows an undergraduate there studying &amp;ldquo;animal science,&amp;rdquo; as they call it, right now. The guy who cuts down any dangerously leaning trees in our yard quoted Tech forestry research to me when we were discussing whether it&amp;rsquo;s better to cut a branch off flush with the tree. I&amp;rsquo;ve never been to the campus, but my wife and daughters have when attending 4H-related horse events. Through one of those &amp;ldquo;whatever happened to&amp;rdquo; Google searches, I recently found that an old New York friend who was originally from Cleveland is now a sociology professor at Tech, and I&amp;rsquo;d been meaning to go with my family to one of the horse things and try to hook up with him.&lt;/p&gt;
&lt;p&gt;When my UVA law school graduate wife says that she&amp;rsquo;d rather see our daughters go to UVA than Tech—and she does this often—I usually add &amp;ldquo;unless they want to study computer science.&amp;rdquo; In the four years I&amp;rsquo;ve lived here, geeky news sites such as Slashdot or reddit have mentioned interesting projects at Tech several times, and I&amp;rsquo;ve never seen UVA mentioned there. Many of the victims Monday were computer science students.&lt;/p&gt;
&lt;p&gt;Two weeks ago, while listening in the car to a solo acoustic 1971 version of &amp;ldquo;Ohio&amp;rdquo; on the new Neil Young live in Massey Hall album, I tried to explain something about the Kent State shootings to my older daughter. I told her how on the one hand it was soldiers shooting students, but on the other hand it was twenty-ish Ohio kids shooting twenty-ish Ohio kids, and what an awful landmark it was in America&amp;rsquo;s relationship to its war in Vietnam. (I once heard guitar player Joe Walsh, a student there at the time, describe how he had many friends among both the students and the local National Guard.) I told her that while I&amp;rsquo;m sure it&amp;rsquo;s still a good school that many kids aspire to attend, for many Americans the simple name of the school will always conjure up the shooting and &lt;a href=&#34;http://en.wikipedia.org/wiki/Image:Kent_State_massacre.jpg&#34;&gt;one Life magazine picture&lt;/a&gt; in particular. I hope this doesn&amp;rsquo;t happen with Tech.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Using XHTML 2 schemas</title>
      <link>https://www.bobdc.com/blog/using-xhtml-2-schemas/</link>
      <pubDate>Fri, 13 Apr 2007 07:45:45 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-xhtml-2-schemas/</guid>
      
      
      <description><div>The RELAX NG kind, and maybe the XSD kind.</div><div>&lt;p&gt;I wanted to use Emacs+&lt;a href=&#34;http://www.thaiopensource.com/nxml-mode/&#34;&gt;nxml&lt;/a&gt; to create some XHTML 2 documents, so I went looking for an XHTML 2 schema. The &lt;a href=&#34;http://www.w3.org/TR/2006/WD-xhtml2-20060726/&#34;&gt;latest Working Draft&lt;/a&gt; says that it &amp;ldquo;includes an early implementation of XHTML 2.0 in &lt;a href=&#34;http://relaxng.org/&#34;&gt;RELAX NG&lt;/a&gt;, but does not include the implementations in DTD or XML Schema form. Those will be included in subsequent versions, once the content of this language stabilizes.&amp;rdquo; This schema&amp;rsquo;s location is not obvious, but a few web searches turned up &lt;a href=&#34;http://webheadstart.org/html/wwwlist/2006Aug/mail_3070.htm&#34;&gt;a pointer&lt;/a&gt; to the &lt;a href=&#34;http://www.w3.org/TR/2006/WD-xhtml2-20060726/xhtml2.zip&#34;&gt;ZIP archive version&lt;/a&gt; of the Working Draft mentioned in the spec&amp;rsquo;s header.&lt;/p&gt;
&lt;p&gt;When you unzip this file, you&amp;rsquo;ll find a collection of RELAX NG rng files in the xhtml2-20060726\RELAXNG subdirectory. The xhtml2.rng file looks like the &lt;a href=&#34;http://www.w3.org/TR/2006/WD-xhtml2-20060726/xhtml20_relax.html#a_rmodule_RELAX_NG_XHTML_2.0_Driver&#34;&gt;driver file&lt;/a&gt; mentioned in the Working Draft, so I tried parsing a simple XHTML 2 document against that with &lt;a href=&#34;http://www.thaiopensource.com/relaxng/jing.html&#34;&gt;jing&lt;/a&gt; and got some XForms-related error messages. I commented out the xhtml2.rng &lt;code&gt;div&lt;/code&gt; element that contained the XForms module and the sample document parsed just fine. (Make sure that your XHTML 2 document&amp;rsquo;s elements are in the &lt;a href=&#34;http://www.w3.org/2002/06/xhtml2/&#34;&gt;http://www.w3.org/2002/06/xhtml2/&lt;/a&gt; namespace.)&lt;/p&gt;
&lt;p&gt;I used &lt;a href=&#34;http://www.thaiopensource.com/relaxng/trang.html&#34;&gt;trang&lt;/a&gt; to convert the rng files to RELAX NG &lt;a href=&#34;http://relaxng.org/compact-tutorial-20030326.html&#34;&gt;Compact&lt;/a&gt; files so that I could use them with Emacs+nxml. I zipped these up and put them at &lt;a href=&#34;http://www.snee.com/xml/xhtml2rnc2005-07-27.zip&#34;&gt;http://www.snee.com/xml/xhtml2rnc2005-07-27.zip&lt;/a&gt; if anyone else is interested in using them. I also tried converting the RNG files to DTDs, but trang said that there were too many fancy RELAX NG constructs in there, which makes sense— the Working Group used RELAX NG instead of DTDs because it&amp;rsquo;s more expressive.&lt;/p&gt;
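&lt;p&gt;If you want to try the same thing, the invocations are simple; they look something like the following (with the jar names and paths adjusted for wherever you installed jing and trang, and your own document name in place of mydocument.xml):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -jar jing.jar xhtml2.rng mydocument.xml
java -jar trang.jar xhtml2.rng xhtml2.rnc&lt;/code&gt;&lt;/pre&gt;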
&lt;p&gt;The story with W3C Schemas was similar to the DTD one but not as bad. Trang converted the RNG files to XSDs with a few errors. I tried validating a sample document against xhtml2.xsd with &lt;a href=&#34;http://xml.apache.org/xerces-c/stdinparse.html&#34;&gt;stdinparse&lt;/a&gt; and had some luck, but I still got some error messages. I spent a few minutes trying to track down their cause and then quit. I&amp;rsquo;ve always felt that outside of the data typing, W3C Schemas are too much trouble, and this certainly didn&amp;rsquo;t change my mind.&lt;/p&gt;
&lt;p&gt;Despite the age of the RELAX NG schema, as indicated by the date in my zip filename, the rnc files worked well with Emacs+nxml. They didn&amp;rsquo;t even have problems with a sample document that included the new &lt;code&gt;about&lt;/code&gt;, &lt;code&gt;role&lt;/code&gt; and &lt;code&gt;property&lt;/code&gt; attributes described in my recent XML.com pieces about RDFa (&lt;a href=&#34;http://www.xml.com/pub/a/2007/02/14/introducing-rdfa.html&#34;&gt;Part 1&lt;/a&gt;, &lt;a href=&#34;http://www.xml.com/pub/a/2007/04/04/introducing-rdfa-part-two.html&#34;&gt;Part 2&lt;/a&gt;) except when an &lt;code&gt;about&lt;/code&gt; attribute value had square brackets to indicate that it was a CURIE. (I was going to link &amp;ldquo;CURIE&amp;rdquo; in that last sentence to my second article&amp;rsquo;s section on them, but somewhere in O&amp;rsquo;Reilly&amp;rsquo;s process for preparing these articles they took the &lt;code&gt;id&lt;/code&gt; attributes off of all of my block elements except for the &lt;code&gt;pre&lt;/code&gt; elements. I put these &lt;code&gt;id&lt;/code&gt; values in the block elements of what I write to make it easier to link to specific points—you know, web, linking, etc.—so it&amp;rsquo;s odd that they would take them out.) CURIEs are recent enough that I wouldn&amp;rsquo;t expect this version of the schema to support them.&lt;/p&gt;
&lt;p&gt;When the next Working Draft comes out, I know I&amp;rsquo;ll go straight to the schema in the zip file to try it out. Maybe it will have better XForms support; maybe there&amp;rsquo;ll be new features to play with. I look forward to it.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By Paul Everitt on &lt;a href=&#34;#comment-1044&#34;&gt;July 12, 2007 1:51 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I have struggled valiantly for over a year to get XHTML2 RNG schemas to work well (I&amp;rsquo;m using oXygen). Specifically, I would love to hear that anybody in the universe has actually used the XHTML2 RNGs to create a document with forms. Meaning, XForms.&lt;/p&gt;
&lt;p&gt;The problem seems to be a combination of how XForms got absorbed into XHTML2 as a host language, plus a set of schemas that had that part somewhat disabled.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-1046&#34;&gt;July 13, 2007 8:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;how XForms got absorbed into XHTML2 as a host language&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a word, badly, in the last iteration of the schema. I commented the XForms parts out when I used the schemas.&lt;/p&gt;
&lt;p&gt;I would work on the XForms advocates (e.g. Micah Dubinko) on this score, because it&amp;rsquo;s up to them to integrate XForms better into XHTML 2 if they want people to use it.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Semantic Web project ideas number 2</title>
      <link>https://www.bobdc.com/blog/semantic-web-project-ideas-num-1/</link>
      <pubDate>Mon, 09 Apr 2007 08:29:46 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-project-ideas-num-1/</guid>
      
      
      <description><div>Managing relationships with customers.</div><div>&lt;p&gt;In &lt;a href=&#34;https://www.bobdc.com/blog/semantic-web-project-ideas-num&#34;&gt;part one&lt;/a&gt; of this series I described how the Tapscott and Williams &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1591841380/bobducharmeA/&#34;&gt;Wikinomics&lt;/a&gt; book mentioned a few things that gave me ideas for semantic web projects. One was the concept of open source Customer Relationship Management packages, which I hadn&amp;rsquo;t heard of before. The book didn&amp;rsquo;t mention any specific ones, but a Google search on &lt;a href=&#34;http://www.google.com/search?q=%22open%20source%20crm%22&#34;&gt;&amp;ldquo;Open Source&amp;rdquo; CRM&lt;/a&gt; gets plenty of hits.&lt;/p&gt;
&lt;p&gt;What does CRM software do? Although the Wikipedia &lt;a href=&#34;http://en.wikipedia.org/wiki/Customer_Relationship_Management&#34;&gt;CRM entry&lt;/a&gt; contains the warning &amp;ldquo;This article or section appears to contain a large number of buzzwords and may require cleanup&amp;rdquo; (it&amp;rsquo;s nice to see that the Wikipedians worry about this), the following passage from it makes sense to me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;CRM software is essentially meant to address the needs of Marketing, Sales and Distribution, and Customer Service and Support divisions within an organization and allow the three to share data on prospects, customers, partners, competitors and employees.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Making it easier to share data across organizational boundaries is a big goal of the semantic web, and if businesses are going to get more value from semweb technology, this looks like a fertile place to plant something. The &amp;ldquo;R&amp;rdquo; in &amp;ldquo;CRM&amp;rdquo; can have some interesting implications in a semweb application; the ability to identify and track a company&amp;rsquo;s relationship(s) with a given customer (and related customers) in a way that improves those relationships would be quite a payoff for adding a cool new technology to an existing software infrastructure. If the software is open source, there should be hooks to add such new features.&lt;/p&gt;
&lt;p&gt;A few quick web searches don&amp;rsquo;t turn up an existing CRM ontology or taxonomy, so one may have to be built. Because these are database packages, a lot of the naming and relationships will already be taken care of, but you don&amp;rsquo;t want to extrapolate from the first package you look at to the whole CRM world. If building an ontology, don&amp;rsquo;t fall into the common semantic web developer trap of building a huge ontology and then telling the world to come and get it—you&amp;rsquo;re better off with the &lt;a href=&#34;http://agilemanifesto.org/principles.html&#34;&gt;agile&lt;/a&gt; approach of creating a small ontology, developing working code around it, demoing that, and then building from there.&lt;/p&gt;
&lt;p&gt;Before committing to a particular open source CRM package to work with, I&amp;rsquo;d look through the discussion forums for each and see what users are doing and trying to do with them, as well as what the developers working with the code have to say and what resources are available to answer their questions. To see positive results from the addition of new features, you&amp;rsquo;d want to work with real data, so the forums would provide candidates for partners to work with. If you come up with the right partner and proposal, maybe you could get paid for this work!&lt;/p&gt;
&lt;p&gt;A review of multiple CRM packages would provide good input for the development of a CRM ontology, and it would also broaden your perspective on what people want out of CRM systems and what different packages are doing to meet or exceed those needs. For that matter, you may as well look over the promises made by commercial packages. CRM is a big business with its own culture and &lt;a href=&#34;http://www.crm-daily.com/&#34;&gt;trade press&lt;/a&gt;, so there are plenty of places to do research.&lt;/p&gt;
&lt;p&gt;Keep asking yourself &amp;ldquo;if these people don&amp;rsquo;t know about semantic web technologies, what are they missing? What could it add here to make this software do more?&amp;rdquo; You can be the person who shows them.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-787&#34;&gt;April 9, 2007 2:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;Me again :-) We are embarking upon an effort (community based) to create Ontologies for XBRL Taxonomies. In the same vein, we will embark on a similar effort re. CRM. I already floated this idea to the &lt;a href=&#34;http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData&#34;&gt;Linking Open Data&lt;/a&gt; project&amp;rsquo;s mailing list last week as part of the &lt;a href=&#34;http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/THALIATestbed&#34;&gt;THALIA&lt;/a&gt; project enhancements re. incorporation of SPARQL, RDF, and OWL.&lt;/p&gt;
&lt;p&gt;I have been searching without success for an eCRM Ontology (can&amp;rsquo;t believe there isn&amp;rsquo;t one out there). Thus, we will build one should nothing show up from the public domain in the next week or so.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-788&#34;&gt;April 9, 2007 2:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Beware! Beware!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To a first approximation, there are no open-source CRM packages. There are many that claim to be so, but on investigation it turns out that they use the Mozilla Public License or some equivalent with the definitely non-Open-Source addition of rules saying that modified versions must keep their logos with specified content and minimum-size restrictions in place on every displayed screen.&lt;/p&gt;
&lt;p&gt;That is good business for them (it discourages competitors from grabbing the whole thing, hiking off the logo, and substituting their own), but it breaks the Open Source Definition requirement #10, which says that licensing can&amp;rsquo;t depend on the particular technology: if you try to reuse any of the code within a headless application, for example, you are screwed.&lt;/p&gt;
&lt;p&gt;The matter has been taken up on the &lt;a href=&#34;mailto:license-discuss@opensource.org&#34;&gt;license-discuss@opensource.org&lt;/a&gt; mailing list lately, and some CRM vendors are changing their tunes. But read carefully.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-789&#34;&gt;April 9, 2007 3:40 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;John: thanks.&lt;/p&gt;
&lt;p&gt;Kingsley: one of the other ideas I was going to write up was going to be about XBRL, because they&amp;rsquo;ve worked out so many taxonomies with direct business applications. Keep us posted on what you come up with&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/project-ideas">project ideas</category>
      
    </item>
    
    <item>
      <title>James Clark&#39;s weblog</title>
      <link>https://www.bobdc.com/blog/james-clarks-weblog/</link>
      <pubDate>Fri, 06 Apr 2007 22:49:11 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/james-clarks-weblog/</guid>
      
      
      <description><div>Read it, and pay close attention.</div><div>&lt;p&gt;James Clark has a &lt;a href=&#34;http://blog.jclark.com/&#34;&gt;weblog&lt;/a&gt;. I worry that, because his most recent large multi-year project was an organized effort to get the Thai government to make an official commitment to open source software, too many people who came to XML-related technology in the last few years won&amp;rsquo;t know who he is. Coming up with the acronym &amp;ldquo;XML&amp;rdquo; is only a footnote to his many achievements in the design and implementation of XML technology, and before that, SGML technology. I have to restrain myself from starting a list, because I won&amp;rsquo;t know where to stop, but a Google search on &lt;a href=&#34;http://www.google.com/search?q=%22james%20clark%22%20xml&#34;&gt;&amp;ldquo;James Clark&amp;rdquo; xml&lt;/a&gt; is pretty instructive. Let&amp;rsquo;s just say that when a weblog posting from him talks about what&amp;rsquo;s good and what&amp;rsquo;s bad, it&amp;rsquo;s worth taking very seriously.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Semantic Web project ideas number 1</title>
      <link>https://www.bobdc.com/blog/semantic-web-project-ideas-num/</link>
      <pubDate>Wed, 04 Apr 2007 20:11:13 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-project-ideas-num/</guid>
      
      
      <description><div>Hello, lazy semweb world.</div><div>&lt;p&gt;When I &lt;a href=&#34;https://www.bobdc.com/blog/metadata-and-metadata&#34;&gt;spoke at a conference recently&lt;/a&gt;, the speaker&amp;rsquo;s gift was a copy of the book that keynote speaker Don Tapscott write with Anthony D. Williams: &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1591841380/bobducharmeA/&#34;&gt;Wikinomics&lt;/a&gt;. The book is very biz-buzzwordy (from page 150: &amp;ldquo;consumer product companies can find ways to monetize customer-led ecosystems&amp;rdquo;—have these guys bookmarked the &lt;a href=&#34;http://www.snee.com/bobdc.blog/2007/03/instant_tech_marketing_copy.html&#34;&gt;Web Economy Bullshit Generator&lt;/a&gt;?), and they feel compelled to coin their own buzzwords, from the book&amp;rsquo;s title to terms like N-gen, B-web, ideagora, and prosumer. I&amp;rsquo;ll admit that I&amp;rsquo;m a little jealous, though; I wish I could come with some visionary lite tech book that people in suits would want to read on planes. As I write this, &amp;ldquo;Wikinomics&amp;rdquo; has an Amazon ranking of 164. I remember getting excited when &lt;a href=&#34;http://www.snee.com/bob/xsltquickly/index.html&#34;&gt;XSLT Quickly&lt;/a&gt; broke the 7,000 mark.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=1591841380/bobducharmeA/&#34;&gt;&lt;img src=&#34;http://ec2.images-amazon.com/images/P/1591841380.01._AA240_SCLZZZZZZZ_.jpg&#34; alt=&#34;[Wikinomics cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The book&amp;rsquo;s many case studies about the New Collaboration (actually, I don&amp;rsquo;t think the book used that phrase—maybe if I start using it a lot with a capital &amp;ldquo;N&amp;rdquo; and &amp;ldquo;C&amp;rdquo; I&amp;rsquo;ve got a title for my lite visionary book!) alerted me to some interesting projects such as &lt;a href=&#34;http://en.wikipedia.org/wiki/InnoCentive&#34; title=&#34;innocentive.com down on 2007-04-04&#34;&gt;Innocentive&lt;/a&gt;, &lt;a href=&#34;http://www.yet2.com/&#34;&gt;yet2.com&lt;/a&gt;, and &lt;a href=&#34;http://www.ninesigma.com/&#34;&gt;NineSigma&lt;/a&gt;. Many of the book&amp;rsquo;s topics sound like the kinds of things that people hope to see grow out of the semantic web, although the book doesn&amp;rsquo;t mention that at all. I started thinking about what semantic web technologies could add to the projects that the book does mention, and I got some ideas.&lt;/p&gt;
&lt;p&gt;I already have several ideas that I have no time to follow through on, so I thought I&amp;rsquo;d start throwing them out there for anyone who&amp;rsquo;s interested. None are quite PhD thesis material, but some might be master&amp;rsquo;s material. They could all be useful, and some could be popular if someone followed through on them. Here&amp;rsquo;s another new coinage: &amp;ldquo;Lazy Semweb&amp;rdquo;! That is, a &lt;a href=&#34;http://www.lazyweb.org/&#34;&gt;lazy web&lt;/a&gt; for semantic technologies. Danny Ayers recently &lt;a href=&#34;http://dannyayers.com/2007/03/28/using-those-profiles&#34;&gt;threw out an offering&lt;/a&gt;, although he didn&amp;rsquo;t use the term.&lt;/p&gt;
&lt;p&gt;For many of the things I suggest, I imagine that some people (for several of my ideas, probably &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen/&#34;&gt;Kingsley Idehen&lt;/a&gt;) will point out existing work already addressing my idea. That&amp;rsquo;s fine with me, because it provides further input for those seeking ideas. I&amp;rsquo;ll tag all the entries with a &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/semantic_web/project_ideas&#34;&gt;metadata/semantic_web/project_ideas&lt;/a&gt; tag to make it easier to find the gathered collection of ideas in this weblog.&lt;/p&gt;
&lt;p&gt;The most important thing for each project is that it should include a demonstration of how to get more value out of the data in question than would be possible without the semweb part. I&amp;rsquo;m not going to accuse Tapscott and Williams of falling short by not mentioning semantic web technology (although I&amp;rsquo;m sure that if they&amp;rsquo;d seen &lt;a href=&#34;http://www.technologyreview.com/video/semantic&#34;&gt;this video&lt;/a&gt; they would have run with it); I&amp;rsquo;m going to challenge semantic web advocates to prove to the Tapscotts and Williamses of the world that semweb technology adds value.&lt;/p&gt;
&lt;h2 id=&#34;Mlcx-bJsTa6qHLrcbeEa-A&#34;&gt;Google Desktop API + Semantic Web technology = ?&lt;/h2&gt;
&lt;p&gt;After all this introductory rambling, I&amp;rsquo;ll start with something short and simple for the first idea. Instead of building on a Tapscott/Williams topic, I&amp;rsquo;ll build on something I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/semantic-data-entry&#34;&gt;mentioned recently&lt;/a&gt; here. I wondered about semantic web tools that built on the way people choose to work instead of making them use new tools. One tool that builds very nicely on the way I work is Google Desktop. It&amp;rsquo;s free, for work-related issues I use it more than Google itself, and I don&amp;rsquo;t know why I waited so long before trying it. Unfortunately, it&amp;rsquo;s limited to use on Windows machines, but there are &lt;a href=&#34;http://www.theregister.co.uk/2006/01/31/google_goes_desktop_linux/&#34;&gt;rumors&lt;/a&gt; of an Ubuntu version.&lt;/p&gt;
&lt;p&gt;And it&amp;rsquo;s got an &lt;a href=&#34;http://desktop.google.com/dev/searchapi.html&#34;&gt;API&lt;/a&gt;! If Google Desktop retrieves a handful of metadata about each file in which it found your search string, what more can we do with that metadata? I&amp;rsquo;d love to see a program that takes a user&amp;rsquo;s query, reads some ontology rules, and then passes in a more sophisticated query to identify additional related resources that don&amp;rsquo;t fall within the exact parameters of the user&amp;rsquo;s original query. Or, someone could build a SPARQL endpoint around the API. Or, they could use the API to pull a bunch of this metadata into a triplestore, combine that with an ontology and other data&amp;hellip; I think there are some real possibilities here.&lt;/p&gt;
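&lt;p&gt;To make the ontology-aware query idea a little more concrete, here&amp;rsquo;s a rough sketch of the kind of SPARQL query I have in mind, using a completely made-up ex: vocabulary: instead of only matching files explicitly tagged with the user&amp;rsquo;s topic (XML, say), it also matches files tagged with any topic that the ontology declares to be narrower than it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX ex:   &amp;lt;http://example.org/terms/&amp;gt;
PREFIX skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt;

SELECT ?file
WHERE {
  ?file ex:topic ?topic .
  { ?file ex:topic ex:XML } UNION { ?topic skos:broader ex:XML }
}&lt;/code&gt;&lt;/pre&gt;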
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.whats-your.name&#34; title=&#34;http://blog.whats-your.name&#34;&gt;carmen&lt;/a&gt; on &lt;a href=&#34;#comment-784&#34;&gt;April 4, 2007 10:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;KDE and Gnome desktop environments both have triple-stores and faceted metadata recall/browsing/autocreation in experimental versions - check Nepomuk, Beagle, etc.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-785&#34;&gt;April 5, 2007 12:12 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;How about starting here:&lt;br /&gt;
&lt;a href=&#34;http://demo.openlinksw.com/DAV/JS/rdfbrowser/index.html&#34;&gt;http://demo.openlinksw.com/DAV/JS/rdfbrowser/index.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Just add URLs to the Data Source URI field and hit &amp;ldquo;Query&amp;rdquo;. For your Blog Data Space the TimeLine Tab is a nice place to perform the data link traversal (URI dereferencing) which basically results in some interesting meshups :-)&lt;/p&gt;
&lt;p&gt;Note: You can bookmark via the Permalinks.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/project-ideas">project ideas</category>
      
    </item>
    
    <item>
      <title>The state of the semantic web</title>
      <link>https://www.bobdc.com/blog/the-state-of-the-semantic-web/</link>
      <pubDate>Mon, 02 Apr 2007 07:54:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-state-of-the-semantic-web/</guid>
      
      
      <description><div>Lookin&#39; good!</div><div>&lt;p&gt;The W3C&amp;rsquo;s Ivan Herman recently gave a talk on the State of the Semantic Web in Bangalore, and he&amp;rsquo;s made the &lt;a href=&#34;http://www.w3.org/2007/Talks/0223-Bangalore_IH/&#34; title=&#34;State of the Semantic Web slides&#34;&gt;slides&lt;/a&gt; available online. Anyone remotely interested in the semantic web or RDF should look through the presentation; it may seem esoteric in places, with its talk of Horn rules and F-logic, but in general it&amp;rsquo;s a clear, up-to-date summary of the important current issues.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Graph&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/stateofthesemweb.jpg&#34; alt=&#34;[graphs]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One of my favorite slides is titled &lt;a href=&#34;http://www.w3.org/2007/Talks/0223-Bangalore-IH/Slides.html#(47)&#34;&gt;A major problem: messaging&lt;/a&gt; (which at first I thought referred to message passing, making me think &amp;ldquo;huh?&amp;rdquo;). It&amp;rsquo;s a good summary of misconceptions about the semantic web, which are addressed on subsequent slides: AI reincarnated, merely ugly XML, top-down ontologies trying to dictate everything&amp;hellip;: none of the above! The slide ends with the message &amp;ldquo;Some simple messages should come to the fore!&amp;rdquo;, but I would quibble that one of these messages isn&amp;rsquo;t so simple: &amp;ldquo;&lt;em&gt;People should &amp;rsquo;think&amp;rsquo; in terms of graphs&lt;/em&gt; [his italics and underscore], the rest is syntactic sugar!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;When most people think of graphs, they think of pictorial representations of data. Because I have a computer science degree, I know that by &amp;ldquo;graphs&amp;rdquo; he&amp;rsquo;s referring to a particular data structure for which pictorial representations are a nice option but not the data structure&amp;rsquo;s reason for being, and I worry that the use of this technical term will confuse the wider audience to whom we&amp;rsquo;re trying to evangelize the semantic web. No punchy, correct, superior alternative springs to mind, though—collections of relationship descriptions and attribute/value pairs?&lt;/p&gt;
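&lt;p&gt;For example, the graph that an RDF-aware application sees is really just a set of three-part statements, each naming a relationship or an attribute value. Here&amp;rsquo;s a small invented example in Turtle syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@prefix dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; .
@prefix ex: &amp;lt;http://example.org/&amp;gt; .

ex:doc1  dc:creator   ex:bob .
ex:doc1  dc:title     &#34;Using XHTML 2 schemas&#34; .
ex:bob   ex:worksFor  ex:snee .&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Draw circles around the subjects and objects and label the arrows between them with the predicates, and you have the picture; leave it as triples, and you have the data structure.&lt;/p&gt;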
&lt;p&gt;It was also nice to see the industries and famous brand names listed on the slides &lt;a href=&#34;http://www.w3.org/2007/Talks/0223-Bangalore-IH/Slides.html#(55)&#34;&gt;Some RDF deployment areas&lt;/a&gt;, &lt;a href=&#34;http://www.w3.org/2007/Talks/0223-Bangalore-IH/Slides.html#(56)&#34;&gt;The &amp;ldquo;corporate&amp;rdquo; landscape is moving&lt;/a&gt;, &lt;a href=&#34;http://www.w3.org/2007/Talks/0223-Bangalore-IH/Slides.html#(59)&#34;&gt;There has been lots of R&amp;amp;D&lt;/a&gt;, and &lt;a href=&#34;http://www.w3.org/2007/Talks/0223-Bangalore-IH/Slides.html#(60)&#34;&gt;Portals&lt;/a&gt;. This will help dispel one of the biggest misconceptions of all about the semantic web: that it&amp;rsquo;s limited to academic, ivory tower research projects. Now if we can just cut back on the use of technical computer science terms&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ldodds.com/blog&#34; title=&#34;http://www.ldodds.com/blog&#34;&gt;Leigh Dodds&lt;/a&gt; on &lt;a href=&#34;#comment-778&#34;&gt;April 2, 2007 10:32 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Why not &amp;ldquo;People should just &amp;rsquo;think&amp;rsquo; in terms of webs&amp;rdquo;?&lt;/p&gt;
&lt;p&gt;A web of resources is how I think of RDF graphs, the visual image is closer to the actual data structures, and it&amp;rsquo;s more positively associated with web/web2.0, simplicity, etc.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-779&#34;&gt;April 2, 2007 11:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;People should just think in terms of: Web Data or Web Data Sources or Data Sources :-)&lt;/p&gt;
&lt;p&gt;At the end of the day, URIs are pointers to Data :-) This is why dereferencing URIs is feasible and such a powerful concept once broadly understood :-)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.base4.net/blog.aspx?ID=356&#34; title=&#34;http://www.base4.net/blog.aspx?ID=356&#34;&gt;Alex James&lt;/a&gt; on &lt;a href=&#34;#comment-780&#34;&gt;April 3, 2007 12:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;All this brings me to a slide in Ivan Herman’s talk about “The state of Semantic Web” in Bangalore, highlighted by Bob DuCharme. The slide highlights, I think correctly&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.grauw.nl/&#34; title=&#34;http://www.grauw.nl/&#34;&gt;Laurens Holst&lt;/a&gt; on &lt;a href=&#34;#comment-781&#34;&gt;April 4, 2007 11:12 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In Dutch it’s easier: a ‘grafiek’ is a pictorial representation of data, and a ‘graaf’ is well, the mathematical type of graph :). Although it’s also the word for count in Dutch (as-in count Dracula, not count numbers :)), but I don’t think that’s likely to cause confusion.&lt;/p&gt;
&lt;p&gt;Unfortunately, graph probably still isn’t going to be understood by people without a slightly mathematical background. They might have had it mentioned in high school, but long forgotten.&lt;/p&gt;
&lt;p&gt;I think if you want to explain the ‘graphical’ (ehehe) nature of RDF, it is best done with an accompanying picture, and as little text and URIs as possible.&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
~Grauw&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-782&#34;&gt;April 4, 2007 11:17 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;But what does &amp;ldquo;snee&amp;rdquo; mean? (&lt;a href=&#34;http://www.snee.com/about.html&#34;&gt;Just kidding&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>New Eric van der Vlist book on Schematron out</title>
      <link>https://www.bobdc.com/blog/new-eric-van-der-vlist-book-on/</link>
      <pubDate>Wed, 28 Mar 2007 21:56:36 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/new-eric-van-der-vlist-book-on/</guid>
      
      
      <description><div>As part of O&#39;Reilly&#39;s &#34;Short Cuts&#34; series.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.oreilly.com/catalog/9780596527716/&#34;&gt;&lt;img src=&#34;http://www.oreilly.com/catalog/covers/9780596527716_cat.gif&#34; alt=&#34;[cover of Schematron book]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;O&amp;rsquo;Reilly has just released a book on Schematron by Eric van der Vlist as part of their Short Cuts series of short, inexpensive PDF books. I&amp;rsquo;m sure this book will do well, because Eric knows his stuff and Schematron is remarkably useful.&lt;/p&gt;
&lt;p&gt;When I was at LexisNexis, they were using XML all over the place, and I was always encouraging people there to use the latest advances in XML technology. Schematron was always the easiest sell. (I just heard recently that their use of it has increased since I left.) An ISO standard rule-based XML quality-checking diagnostic system with natural language error messages that only needs an XSLT engine to implement—what&amp;rsquo;s not to like?&lt;/p&gt;
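&lt;p&gt;To give a flavor of it, here&amp;rsquo;s a made-up rule of my own (not from Eric&amp;rsquo;s book): each rule is little more than an XPath test on a context node plus the natural language message to deliver when a document flunks the test.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;pattern xmlns=&#34;http://purl.oclc.org/dsdl/schematron&#34;&amp;gt;
  &amp;lt;rule context=&#34;chapter&#34;&amp;gt;
    &amp;lt;assert test=&#34;title&#34;&amp;gt;A chapter must have a title.&amp;lt;/assert&amp;gt;
    &amp;lt;report test=&#34;count(para) = 0&#34;&amp;gt;This chapter has no paragraphs.&amp;lt;/report&amp;gt;
  &amp;lt;/rule&amp;gt;
&amp;lt;/pattern&amp;gt;&lt;/code&gt;&lt;/pre&gt;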
&lt;p&gt;Its name gives the impression that it&amp;rsquo;s an alternative schema language, and strictly speaking it might be, but I see it more as a complement to existing ones. I&amp;rsquo;ve even &lt;a href=&#34;http://www.xml.com/pub/a/2002/05/15/schematron.html?page=1&#34;&gt;made a case&lt;/a&gt; that Schematron can add enough to a DTD-based system to let the system&amp;rsquo;s users postpone a transition to RELAX NG or W3C Schemas as a replacement for DTDs. And, when implemented as an additional layer to an existing system, a Schematron rollout can mean very little disruption to that system.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re interested in maintaining and improving the quality of your XML, you owe it to yourself to check out Schematron. Congratulations to Eric and to O&amp;rsquo;Reilly for getting the book out.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Clever video about Web 2.0 and XML</title>
      <link>https://www.bobdc.com/blog/clever-video-about-web-20-and/</link>
      <pubDate>Tue, 27 Mar 2007 08:48:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/clever-video-about-web-20-and/</guid>
      
      
      <description><div>Text in motion.</div><div>&lt;p&gt;Kansas State University anthropologist &lt;a href=&#34;http://www.ksu.edu/sasw/anthro/wesch.htm&#34;&gt;Michael Wesch&lt;/a&gt; has created an interesting four and a half minute video titled &lt;a href=&#34;http://www.youtube.com/watch?v=6gmP4nk0EOE&#34;&gt;&amp;ldquo;Web 2.0&amp;hellip; The Machine is Us/ing Us&amp;rdquo;&lt;/a&gt; that is available on YouTube. I love how his video communicates to the viewer using text: not as captions, titles, or animation, but as text manipulated by a (usually) unseen hand. If you&amp;rsquo;re a fan of text-based art like the work of Jenny Holzer, you&amp;rsquo;ll enjoy it.&lt;/p&gt;
&lt;p&gt;I was going to take a screen shot to include here and link it to the YouTube page, but the constant, constant motion of Wesch&amp;rsquo;s video as it throws ideas at you is such a big part of its appeal that no screen shot could do it justice. So, I&amp;rsquo;m taking my first shot at embedding a video in a weblog entry, since YouTube makes it so easy. Web 2.0 and all that.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/6gmP4nk0EOE&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;(&lt;a href=&#34;http://www.sfgate.com/cgi-bin/article.cgi?file=/comics/Zippy_the_Pinhead_Color.dtl&#34;&gt;Zippy&lt;/a&gt; says: am I mashed up yet?)&lt;/p&gt;
&lt;p&gt;The reason I&amp;rsquo;m writing about it here is that it&amp;rsquo;s the first time I&amp;rsquo;ve seen someone push the separation of form from content as a theme of Web 2.0. As an old SGML geek, I certainly won&amp;rsquo;t complain. Maybe developers just take it for granted now that something inside of &lt;code&gt;&amp;lt;title&amp;gt;&amp;lt;/title&amp;gt;&lt;/code&gt; tags has more possibilities for reuse than something inside of &lt;code&gt;&amp;lt;b&amp;gt;&amp;lt;/b&amp;gt;&lt;/code&gt; tags, and this kind of reuse is what drives mashups. Being taken for granted, there&amp;rsquo;s no need to push it as a theme in the general Web x (where x &amp;gt; 1) hype, and it took an anthropologist to notice and highlight the theme&amp;rsquo;s central role.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://plasmasturm.org/&#34; title=&#34;http://plasmasturm.org/&#34;&gt;Aristotle Pagaltzis&lt;/a&gt; on &lt;a href=&#34;#comment-765&#34;&gt;March 27, 2007 10:05 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;FWIW, that is only the first published draft; Wesch recently posted &lt;a href=&#34;http://youtube.com/watch?v=NLlGopyXT_g&#34;&gt;the final version&lt;/a&gt;. The blurb on that one also includes download links for high-quality versions of the clip.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Semantic data entry</title>
      <link>https://www.bobdc.com/blog/semantic-data-entry/</link>
      <pubDate>Fri, 23 Mar 2007 08:38:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-data-entry/</guid>
      
      
      <description><div>Instead of motivating users to use new tools, can we build on the tools that they&#39;re already motivated to use?</div><div>&lt;p&gt;I think that Tim O&amp;rsquo;Reilly is overly pessimistic about semantic web technology, but in a recent &lt;a href=&#34;http://radar.oreilly.com/archives/2007/03/different_appro_1.html&#34;&gt;O&amp;rsquo;Reilly Radar posting&lt;/a&gt; that was part of the freebase vs. semantic web technology debate bouncing around about two weeks ago, he brought up an important issue that&amp;rsquo;s often overlooked: what motivates a user to go to the extra trouble to indicate the semantics of a piece of data to a program that may read that data? For example, when you add &amp;ldquo;On April 2nd, breakfast will be served at 8&amp;rdquo; to a web page, any literate English speaker can understand it. What motivates you to attach the string &amp;ldquo;2007-04-02T08:00&amp;rdquo; and an indication of its type somewhere in there? The belief that you&amp;rsquo;re making a better world isn&amp;rsquo;t good enough—you have to believe that it will help the people interested in eating that breakfast.&lt;/p&gt;
&lt;p&gt;Tim&amp;rsquo;s posting and a recent conversation I had with Eric Miller about using semweb technology to accumulate knowledge in the workplace got me to wondering about why intranet wiki and SharePoint deployments that I&amp;rsquo;ve taken part in didn&amp;rsquo;t work too well. The first level answer is that the users felt no motivation to add the kind of information that those programs are good at accumulating. Instead of asking the obvious second level question (how do we motivate these users to use these programs) I have a different question: how can a knowledge-sharing system build on the users&amp;rsquo; current practices for storing knowledge?&lt;/p&gt;
&lt;p&gt;This brings up today&amp;rsquo;s big question: how do people store knowledge? For example, if Joe HR Guy brings his laptop to a meeting and types up meeting notes during the meeting, what program is he typing with? If Joe tells Jane Project Manager a URL for something that will help her with her current project, how does she remember this URL and what it&amp;rsquo;s for? (This assumes she wasn&amp;rsquo;t told by email, in which case we have one important answer to this question: she remembers that it&amp;rsquo;s in an email from Joe. In fact, if Joe was responsible for taking minutes at his meeting, he may have typed them into his email client so that he could send them to everyone invited to the meeting. I&amp;rsquo;m more interested in Joe&amp;rsquo;s personal notes about what&amp;rsquo;s worth remembering from the meeting.)&lt;/p&gt;
&lt;p&gt;Is anyone aware of research on this issue? If ten or fifteen people leave comments here about how they store such information, it won&amp;rsquo;t help much, because I want to know about a more representative sample of the population—while I&amp;rsquo;m sure that some of you use some combination of Emacs, nxml, and elisp macros to automate data entry on everyday topics like I do, I know that&amp;rsquo;s not representative. That&amp;rsquo;s why I asked about Joe HR Guy and Jane Project Manager. I want to know what accountants and assistant vice presidents in all kinds of industries use, not what other XML/metadata geeks use.&lt;/p&gt;
&lt;p&gt;For the most part, I&amp;rsquo;m sure they use Bill Gates&amp;rsquo; tool of choice: MS Word. I know one business process analyst who takes meeting notes using nested bulleted lists in Word, and it works out just fine for him and for everyone who has to read his notes. Many people, when going to a meeting where they have to work out A, B, C, and D for W, X, Y, and Z, record notes about the relationships by opening up a blank spreadsheet, writing &amp;ldquo;A B C D&amp;rdquo; across the top row and &amp;ldquo;W X Y Z&amp;rdquo; down the first column, and then filling in the spreadsheet as they talk about A&amp;rsquo;s relationship to W and C&amp;rsquo;s relationship to X. (Or, perhaps they create a tab in the worksheet for each of four categories to allow more three-dimensional accumulation of information.)&lt;/p&gt;
&lt;p&gt;Then there&amp;rsquo;s the third corner of the MS Office triumvirate: PowerPoint, which few people use to take notes on ongoing activities, but which many use to assemble knowledge for transmission to other people. We can all complain about a presentation that consists of bulleted lists, but ideas like &lt;a href=&#34;http://www.opml.org/&#34;&gt;OPML&lt;/a&gt; and its esteemed competition wouldn&amp;rsquo;t have gotten any traction if nested lists of items weren&amp;rsquo;t often a more straightforward, structured approach to storing and transmitting knowledge than paragraphs of prose.&lt;/p&gt;
&lt;p&gt;What do you see around you? What commonly available applications do less technical people use to store knowledge as they accumulate it, or can we just assume a default of email folders plus MS Office files scattered around their hard disks? Have you ever heard of any broad, systematic study of how people do this and patterns that may have shown up? I&amp;rsquo;m not sure what&amp;rsquo;s &amp;ldquo;semantic&amp;rdquo; about this particular data, except that it&amp;rsquo;s recorded knowledge that would benefit from aggregation with related knowledge to create a whole that&amp;rsquo;s greater than the sum of its parts, but that sounds pretty worthwhile.&lt;/p&gt;
&lt;h2 id=&#34;8-comments&#34;&gt;8 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ecs.soton.ac.uk/~das05r/&#34; title=&#34;http://www.ecs.soton.ac.uk/~das05r/&#34;&gt;Daniel Alexander Smith&lt;/a&gt; on &lt;a href=&#34;#comment-755&#34;&gt;March 23, 2007 9:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;You asked if there was any research in this area, and I&amp;rsquo;m happy to say there is.&lt;/p&gt;
&lt;p&gt;Firstly there is doingPad, which is a collaboration between MIT and the University of Southampton into ways to capture and recall &amp;ldquo;information scraps&amp;rdquo;, including determining the semantics of what the user has typed.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://mspace.fm/projects/doingPad/&#34;&gt;http://mspace.fm/projects/doingPad/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There is also Rich Tags, a University of Southampton project to utilise the &amp;ldquo;tagging&amp;rdquo; method of data entry as a way of semantically marking up resources, and hence building on the current tools that people use, as you say.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://mspace.fm/projects/richtags/&#34;&gt;http://mspace.fm/projects/richtags/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks,&lt;/p&gt;
&lt;p&gt;Daniel Alexander Smith&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://valentinzacharias.de/blog/&#34; title=&#34;http://valentinzacharias.de/blog/&#34;&gt;Valentin&lt;/a&gt; on &lt;a href=&#34;#comment-756&#34;&gt;March 23, 2007 11:21 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Talking about MS tools, you forget OneNote - the actual note-taking and sharing tool from MS (and popular with the non-cs people around here).&lt;/p&gt;
&lt;p&gt;And you might want to search for &amp;ldquo;Social Semantic Desktop&amp;rdquo; (Nepomuk project and others) - they address similar questions (although I don&amp;rsquo;t know how much empirical data they have/have published).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://blog.triplescape.com&#34; title=&#34;http://blog.triplescape.com&#34;&gt;Brian Manley&lt;/a&gt; on &lt;a href=&#34;#comment-757&#34;&gt;March 23, 2007 12:00 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You might check out some of the papers at &lt;a href=&#34;http://pim.ischool.washington.edu/breakouts.htm&#34;&gt;http://pim.ischool.washington.edu/breakouts.htm&lt;/a&gt;. While they probably won&amp;rsquo;t answer your question directly, there are a number of interesting papers. The ideas and authors might make for a good starting point if you were serious about doing some research on your own. Hope that helps! - Brian&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.cs.umd.edu/~hendler&#34; title=&#34;http://www.cs.umd.edu/~hendler&#34;&gt;jim hendler&lt;/a&gt; on &lt;a href=&#34;#comment-758&#34;&gt;March 23, 2007 12:16 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I generally agree with your contention that we need to go where the people are - in my ISWC keynote some years ago I suggested that we are wrong in thinking that semantics make it harder for people to author - the secret is to harness them to make it easier - I think Freebase is a good example of this direction - when you say someone is a &amp;ldquo;person&amp;rdquo; it suggests the properties you should fill in for a person - I think RDFS and OWL open a lot of potential for this, I&amp;rsquo;m a tad frustrated because I could never talk any of my grad students into doing this work - maybe it&amp;rsquo;ll be time to revisit this in my new lab&amp;hellip; For example, imagine a &amp;ldquo;home page creator&amp;rdquo; which would provide hints of topics people like on a homepage (hobby, family, etc.) - you could then enter a term (Scuba diving) and it would search for an ontology in that area - letting you then use it to enter info (or to extend it yourself, another idea freebase got right). I wasn&amp;rsquo;t thinking so much of retrofitting existing tools (although when I was at DARPA I proposed that adding DAML to clipart objects could make a great improvement in searching for powerpoint - another one worth revisiting?) &amp;ndash; anyway, this is all to say I think that you are right about needing to think a lot more about the value proposition and to figure out what would motivate people to do the right thing, by making it easier for them to do what they do anyway (or to create social worth in the annotation, sort of like why people created web pages in the first place)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.openlinksw.com/blog/~kidehen&#34; title=&#34;http://www.openlinksw.com/blog/~kidehen&#34;&gt;Kingsley Idehen&lt;/a&gt; on &lt;a href=&#34;#comment-759&#34;&gt;March 23, 2007 1:01 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob,&lt;/p&gt;
&lt;p&gt;When it comes to what I call the &amp;ldquo;Web 2.0&amp;rdquo; application profile, we have Blogs, Wikis, Shared Bookmark Managers, Feed Aggregators, Blog Rolls, Photo Galleries etc..&lt;/p&gt;
&lt;p&gt;I have attempted to address the fusion of Web 2.0 application profiles (in fact Distributed Collaborative Apps &amp;amp; Services) via our &lt;a href=&#34;http://virtuoso.openlinksw.com/wiki/main/Main/OdsIndex&#34;&gt;OpenLink Data Spaces&lt;/a&gt; (ODS) platform.&lt;/p&gt;
&lt;p&gt;In a nutshell, ODS gives the Web 2.0 user or developer an accelerated leap into the Data Web (or Web 3.0) without any RDF Tax.&lt;/p&gt;
&lt;p&gt;I have also authored a number of demonstrations via my blog which has been a live demonstration of all of this for a very long time :-)&lt;/p&gt;
&lt;p&gt;BTW - there are &lt;a href=&#34;http://demo.openlinksw.com/ods&#34;&gt;Live (Demonstration)&lt;/a&gt; and &lt;a href=&#34;http://myopenlink.net:8990/ods&#34;&gt;Live (serious experimentation)&lt;/a&gt; instances of ODS for anyone to evaluate today.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://juxtaprose.com/jay&#34; title=&#34;http://juxtaprose.com/jay&#34;&gt;Jay Fienberg&lt;/a&gt; on &lt;a href=&#34;#comment-760&#34;&gt;March 23, 2007 8:30 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a bit of a trap to think about this in terms of what &amp;ldquo;people do&amp;rdquo; to record knowledge / data, and Tim O&amp;rsquo;Reilly&amp;rsquo;s comments fall into that trap. There are a lot of different kinds of contexts which shape why people record information / data, and what is or isn&amp;rsquo;t a &amp;ldquo;record&amp;rdquo; is also affected by context.&lt;/p&gt;
&lt;p&gt;For example, I might be willing to use a calendar program at work, but not use one for my personal calendar at home. The different contexts shape my motivations towards the efforts of data entry, and I might have different standards of &amp;ldquo;semantically correct&amp;rdquo; calendars between home and work.&lt;/p&gt;
&lt;p&gt;So, a fully structured vCalendar event might be required at work, whereas &amp;ldquo;this week: Colin&amp;rsquo;s party Sat&amp;rdquo; might be all I need at home.&lt;/p&gt;
&lt;p&gt;There are many layers that factor into what people do&amp;ndash;everything from the physical entry devices to the concepts of a computer (e.g., vs a network) to the concept of an application to the concepts of sites / pages / files, etc. And, then, there are our motivations towards obligation, communication, gratification, etc.&lt;/p&gt;
&lt;p&gt;There are many kinds of designs that can make recording data (or more semantically elaborate data) easier or better for people. But, they&amp;rsquo;re generally only better relative to specific contexts of people&amp;rsquo;s motivations and needs, given the constraints of the computers.&lt;/p&gt;
&lt;p&gt;(By that &amp;ldquo;constraints of the computers&amp;rdquo; bit, I mean to imply that one could alternately create interfaces uniquely suited to specific types of semantic data entry, e.g., a physical device that looks like a wall calendar with icons for people and project names that makes it easy to generate data sets of FOAF + vCard-RDF + DOAP + etc.)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-761&#34;&gt;March 24, 2007 9:14 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks everyone, there are a lot of great leads here.&lt;/p&gt;
&lt;p&gt;Jay - I agree that trying to generalize this too much would lead to a mess, which is why both of my use cases are about using information that people record in order to get work done at their jobs. I think that&amp;rsquo;s a fine place to start.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-767&#34;&gt;March 28, 2007 10:57 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Going through some old bookmarks, I just found the &lt;a href=&#34;http://kftf.ischool.washington.edu/surveys.asp&#34;&gt;Keeping Found Things Found&lt;/a&gt; project at the University of Washington, which looks valuable for this research.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Instant tech marketing copy</title>
      <link>https://www.bobdc.com/blog/instant-tech-marketing-copy/</link>
      <pubDate>Mon, 19 Mar 2007 20:21:21 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/instant-tech-marketing-copy/</guid>
      
      
      <description><div>Monetize scalable supply-chains! Integrate granular users! Reintermediate 24/7 interfaces!</div><div>&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0691122946/bobducharmeA/&#34;&gt;&lt;img src=&#34;http://ec2.images-amazon.com/images/P/0691122946.01._AA240_SCLZZZZZZZ_V45418175_.jpg&#34; alt=&#34;[&#39;On Bullshit&#39; cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been cleaning up my Firefox bookmarks and moving some to &lt;a href=&#34;http://del.icio.us&#34;&gt;del.icio.us&lt;/a&gt;. (While reading an article somewhere, my wife recently asked me if I&amp;rsquo;d heard of del.icio.us, and after a quick check of the date on my first bookmark there, I could say &amp;ldquo;member since 2004!&amp;rdquo;) Revisiting some of my Firefox bookmarks reminded me of several funny sites out there such as dack.com&amp;rsquo;s &lt;a href=&#34;http://www.dack.com/web/bullshit.html&#34;&gt;web economy bullshit generator&lt;/a&gt;. Each click of the &amp;ldquo;make bullshit&amp;rdquo; button creates a new three-word marketing phrase: Strategize collaborative infomediaries! Repurpose B2C systems!&lt;/p&gt;
&lt;p&gt;My bookmark file says that I bookmarked it in Firefox in 2000. I find that difficult to believe, because it&amp;rsquo;s so up-to-date, but words like &amp;ldquo;vortals&amp;rdquo; do reveal its age a bit. It simply picks a random word from each of three columns, so there&amp;rsquo;s no fancy coding going on, but the generated phrases are frighteningly realistic. Using it, you can matrix leading-edge relationships! Brand collaborative technologies!&lt;/p&gt;
&lt;p&gt;If it were provided as a web service, it would be useful for an application that generated an automated business plan: &amp;ldquo;Our product will let our customers transition back-end e-commerce in order to harness extensible content. As we synergize B2C initiatives, we&amp;rsquo;ll begin to syndicate scalable platforms&amp;rdquo;. (For one final smile before I posted this entry, a spellcheck of it only flagged Dack-generated buzzwords: infomediaries, Reintermediate, Repurpose, Strategize, synergize, and vortals.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Metadata and metadata</title>
      <link>https://www.bobdc.com/blog/metadata-and-metadata/</link>
      <pubDate>Tue, 13 Mar 2007 09:58:27 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/metadata-and-metadata/</guid>
      
      
      <description><div>A conference about metadata, full of strangers.</div><div>&lt;p&gt;I consider myself a metadata geek. I &lt;a href=&#34;http://www.snee.com/bobdc.blog/metadata/&#34;&gt;write about it&lt;/a&gt;, I work for a &lt;a href=&#34;http://www.innodata-isogen.com&#34;&gt;company&lt;/a&gt; that helps manage it, I track and play with the related standards, and I have a network of friends who fall into several of these categories.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.wilshireconferences.com/MD2007/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/wilshire.jpg&#34; alt=&#34;[Wilshire conference logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Last week I went to a &lt;a href=&#34;http://www.wilshireconferences.com/MD2007&#34;&gt;conference devoted to metadata&lt;/a&gt; and I didn&amp;rsquo;t know a soul at this large, well-attended conference. I had heard of only one speaker, John Zachman, because a former co-worker was a big fan of his &lt;a href=&#34;http://www.zifa.com/&#34;&gt;framework&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It turns out that there&amp;rsquo;s a whole metadata industry separate from the metadata world I&amp;rsquo;d been following. The world that I know is focused on describing HTTP-addressable resources (the &amp;ldquo;RD&amp;rdquo; in &amp;ldquo;RDF&amp;rdquo;), whether on the public internet or private intranets. We think of schemas as metadata, of course, but for us the central issue is not managing huge numbers of schemas across an enterprise so much as associating metadata with resources in order to get more value out of those resources.&lt;/p&gt;
&lt;p&gt;At this conference, tracking of schemas across an enterprise was a big issue. &lt;a href=&#34;http://en.wikipedia.org/wiki/Data_Warehouse&#34;&gt;Data Warehousing&lt;/a&gt; and &lt;a href=&#34;http://en.wikipedia.org/wiki/Data_governance&#34;&gt;Data Governance&lt;/a&gt; were the context for it all; as one attendee told me at a reception, organizations like his financial institution want to be able to know where a report&amp;rsquo;s numbers came from.&lt;/p&gt;
&lt;p&gt;Presentation titles like &lt;a href=&#34;http://www.wilshireconferences.com/MD2007/Sessions/v2.html&#34;&gt;How to Perform Information Stewardship within Business Process Redesign&lt;/a&gt; and &lt;a href=&#34;http://www.wilshireconferences.com/MD2007/Sessions/j5.html&#34;&gt;Using Metadata to Support Compliance and Accountability Efforts&lt;/a&gt; give the flavor of what was on attendees&amp;rsquo; minds. I don&amp;rsquo;t think of my metadata interests as being limited to publishing applications, but if we take the word &amp;ldquo;publishing&amp;rdquo; in its broadest sense, maybe it is. The standards I use are typically about tracking the metadata associated with a resource that will be used to convey information to a user—or, to put it differently, a resource that will be published. (It&amp;rsquo;s not &lt;a href=&#34;https://www.bobdc.com/blog/navigating-the-library-metadat&#34;&gt;the first time&lt;/a&gt; that I&amp;rsquo;ve found another whole world of metadata separate from the kind I pay close attention to.)&lt;/p&gt;
&lt;p&gt;In the Venn diagram showing the concerns of the HTTP/publishing metadata world and the enterprise data governance metadata world, the overlap is OWL and semantic technologies, which both groups are getting more interested in. My &lt;a href=&#34;http://www.wilshireconferences.com/MD2007/Sessions/v6.html&#34;&gt;conference talk&lt;/a&gt; was on the basics of RDF, and I was definitely not preaching to the converted. I found myself explaining issues that I thought we&amp;rsquo;d all moved past, such as why the URIs that represent resources may look like web addresses but aren&amp;rsquo;t necessarily addresses at all.&lt;/p&gt;
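&lt;p&gt;A tiny made-up example of the kind of thing I covered (the URI and the vocabulary here are invented for illustration): the subject URI below identifies a person, not a web page, even though it looks like an address you could point a browser at.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;rdf:RDF xmlns:rdf=&#34;http://www.w3.org/1999/02/22-rdf-syntax-ns#&#34;
         xmlns:ex=&#34;http://example.com/vocab/&#34;&amp;gt;
  &amp;lt;!-- this URI names a person; nothing needs to live at it --&amp;gt;
  &amp;lt;rdf:Description rdf:about=&#34;http://example.com/staff/jsmith&#34;&amp;gt;
    &amp;lt;ex:jobTitle&amp;gt;Project Manager&amp;lt;/ex:jobTitle&amp;gt;
  &amp;lt;/rdf:Description&amp;gt;
&amp;lt;/rdf:RDF&amp;gt;&lt;/code&gt;&lt;/pre&gt;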
&lt;p&gt;It was particularly odd for me to see so much interest in OWL from people who weren&amp;rsquo;t that interested in RDF. I saw great coverage of both in a three-hour seminar by Dave McComb and Simon Robe on &lt;a href=&#34;http://www.wilshireconferences.com/MD2007/Sessions/x5.html&#34;&gt;Enterprise Ontology: Designing Your Core Model Using Semantic Web Technology&lt;/a&gt;. Unfortunately, I only saw the first hour because I had to leave for the airport. In a tag-team format, they covered many important issues very well considering how much of the audience was learning these concepts from scratch. (And now I finally understand the difference between TBoxes and ABoxes!) As with any such presentation, much of it was about how OWL models data, and I wish I could have stayed to see how people apply these models to their world of metadata. (Now that I think of it, some of &lt;a href=&#34;https://www.bobdc.com/blog/xml-2006-paper-done-and-availa&#34;&gt;my own experiments&lt;/a&gt; may relate more to the enterprise data world than the publishing world.) Any suggestions on good background resources for the world of enterprise metadata?&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Bifocals</title>
      <link>https://www.bobdc.com/blog/bifocals/</link>
      <pubDate>Sat, 10 Mar 2007 08:58:36 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/bifocals/</guid>
      
      
      <description><div>Getting accustomed to a new perspective.</div><div>&lt;p&gt;A few years ago, as a tourist in a touristy part of Rome a bit east of the &lt;a href=&#34;http://www.snee.com/panoramic/panopantheon50.html&#34;&gt;Pantheon&lt;/a&gt;, I was losing faith in my wife&amp;rsquo;s directional sense and asked her for the map. As I walked along and tried to read it through my prescription sunglasses, I couldn&amp;rsquo;t make out the street names. The print was too tiny. Or—and I really didn&amp;rsquo;t like this scenario—I was finally old enough that I was having trouble seeing small print.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/bifocals.jpg&#34; alt=&#34;[Google map east of Roman Pantheon]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;It turns out that I had lasted a few extra years before getting what one optometrist called &amp;ldquo;forty-itis&amp;rdquo;. Still, as a classic bit of irreversible natural physical deterioration that shows that you&amp;rsquo;re getting older, it put a damper on the birthday I had a few days later. On my next visit to the eye doctor, I told him that I sometimes had to take my glasses off to read small print, and I asked him when I&amp;rsquo;d need bifocals. His answer: &amp;ldquo;When you get tired of taking off your glasses.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;On my last visit, I quoted what he&amp;rsquo;d said before and told him that I was tired of taking off my glasses. He put together the prescription for my progressives. (They hardly use the word &amp;ldquo;bifocals&amp;rdquo;, which conjures up grandparent images. I believe the new name describes how the near-sightedness part of each lens gradually blends into the far-sightedness part without showing a line. And of course, the new name just sounds, well, progressive.) The optician said that the new lenses would take a few weeks, and that you can&amp;rsquo;t just pick up your first bifocals and run or have someone else get them for you, because an optician must teach you how to wear them.&lt;/p&gt;
&lt;p&gt;Five weeks later I went to pick them up. Matthew, the young man who helped me, was friendly, polite, positive, and provided me with very little useful information. I put the new glasses on and thought &amp;ldquo;I can see up close! I can see far away! It&amp;rsquo;s that simple, and I&amp;rsquo;m set!&amp;rdquo; Matthew agreed that I was set, so we finished up the paperwork and I went out the front door and almost fell over on their brick stoop.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a key fact about getting accustomed to bifocals that Matthew neglected to tell me, and in retrospect it seems obvious: when the bottom part of your lenses is focused a foot away from your eyes, anything farther away than that, like the little step in front of your feet as you leave the optician&amp;rsquo;s office, becomes harder to see through that part of the lens.&lt;/p&gt;
&lt;p&gt;My next errand of the day was the supermarket, where it became very clear (or rather, very obvious) that distant text well below your usual sightlines is harder to read with bifocals. Supermarkets have plenty of this, with a bottom shelf of products on either side of every aisle.&lt;/p&gt;
&lt;p&gt;As I walked through Charlottesville&amp;rsquo;s sunny downtown mall shortly afterward, I got another surprise: the glasses were becoming darker. While I certainly hadn&amp;rsquo;t asked for the kind that automatically mutates into sunglasses in the sun, that&amp;rsquo;s what they gave me. The slight tint that you would see in them after going back indoors gave me the appearance of a bit player on Miami Vice, and not a good guy bit player. (Being of bifocals-wearing age, I&amp;rsquo;m referring to the TV show Miami Vice, not the movie.)&lt;/p&gt;
&lt;p&gt;When I returned to the optometrist and explained to Matthew what had happened, I think he only heard the part about the sun shining on the downtown mall, and he made a polite, positive, vague response. An older optician was more understanding and apologetic and promised me new lenses within ten business days. I get to keep the sometimes-tinted lenses until then, so along with the short scruffy beard that I&amp;rsquo;ve grown to camouflage the stitches under my chin from a recent fall on the ice, I look pretty shady. Maybe, until I can pick up my new lenses and shave again, I&amp;rsquo;ll pick up some gold chains and unbutton a few shirt buttons and go all-out for the bifocaled coke dealer look. That should be a big hit among the dealers in Chinese antiques and greener-than-thou political action tables of the downtown Charlottesville mall.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-728&#34;&gt;March 10, 2007 11:35 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After everything my wife went through learning to use bifocals, I firmly decided not to get them until I could no longer read without my glasses. So far (almost 49) that hasn&amp;rsquo;t happened. Indeed, I enjoy being able to go without my glasses most of the time: I wear them now to watch TV, movies, and theatre, and when outside, and that&amp;rsquo;s about it &amp;ndash; a big relief after wearing them nonstop since I was 7.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.wasab.dk/morten/blog/&#34; title=&#34;http://www.wasab.dk/morten/blog/&#34;&gt;Morten Frederiksen&lt;/a&gt; on &lt;a href=&#34;#comment-729&#34;&gt;March 10, 2007 1:18 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been using contacts for a few years now (approaching the mid-thirties), but it only just now dawned on me that I might need &amp;ldquo;progressives&amp;rdquo; one day.&lt;/p&gt;
&lt;p&gt;When that day comes, it likely won&amp;rsquo;t be possible with contacts!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-730&#34;&gt;March 10, 2007 1:24 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Progressive contacts do exist. A Google search on the phrase gets 1.5 million hits.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-735&#34;&gt;March 11, 2007 6:40 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As an early-40s I could do with some mildly progressive lenses. I have an old pair of specs for myopia which are perfect for use at the keyboard, a newer pair that are good for TV distance and further, but which give me eyestrain if used at the keyboard. Dunno, maybe more to do with lifestyle than age, lazy-eye-tis.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s rather a coincidence that the people now in line to use progressives are of an age to grow up with the influence of Genesis, Yes and Emerson, Lake &amp;amp; Palmer&amp;hellip;&lt;/p&gt;
&lt;p&gt;By Eliot Kimber on &lt;a href=&#34;#comment-736&#34;&gt;March 12, 2007 10:44 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I got bifocals last year (43) and really felt old at that point. (Un)fortunately, those glasses got ground into the bottom of a pool and I went back to my old lenses. I&amp;rsquo;m finding I don&amp;rsquo;t mind having to lift up my glasses to read the small print&amp;ndash;I think it&amp;rsquo;s better than having to tilt my head back to read my computer screen.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>ERH Tired of Acrobat PDFs. Me too.</title>
      <link>https://www.bobdc.com/blog/erh-tired-of-acrobat-pdfs-me-t/</link>
      <pubDate>Mon, 05 Mar 2007 08:59:56 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/erh-tired-of-acrobat-pdfs-me-t/</guid>
      
      
      <description><div>PDF is a great format, and it&#39;s used way too much.</div><div>&lt;p&gt;I keep a file of notes for ideas for potential postings in this weblog. Here are the notes for one idea:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;tired of PDF (impedance mismatch between screen and page metaphor; printed pg numbers vs. Acrobat ones; marketing &amp;ldquo;fact sheets&amp;rdquo; in PDF&amp;ndash;why not HTML?)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a recent Mokka mit Schlag posting titled &lt;a href=&#34;http://www.elharo.com/blog/software-development/2007/02/25/pdf-killed-the-programing-language/&#34;&gt;PDF killed the Programming Language&lt;/a&gt;, Elliotte Rusty Harold has beaten me to it (managing to work in a nice &lt;a href=&#34;http://www.youtube.com/watch?v=LnWGWabxkKs&#34;&gt;Buggles&lt;/a&gt; pun as well), but I&amp;rsquo;ll add a few points.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m tired of people who create content on a computer screen and deliver it for viewers to see on a computer screen with that content optimized for the printout that only gets printed when it&amp;rsquo;s time for their boss or art director to review it. After all, real brochure-ware should look like a nice brochure when the customer wants to print it, right?&lt;/p&gt;
&lt;p&gt;Those customers don&amp;rsquo;t want to nearly as often as the art director may insist. Customers read product fact sheets because they want to see the facts about the product, not because they want a hi-res view of some designer&amp;rsquo;s design skills. Gratuitous use of PDF these days is in some ways worse than gratuitous use of Flash, because at least Flash is about creating things on a computer screen for viewing on a computer screen. PDFs are bad for viewing on a computer screen.&lt;/p&gt;
&lt;h2 id=&#34;jce7v7pFRDqE713nhr_uUw&#34;&gt;Good things about PDFs&lt;/h2&gt;
&lt;p&gt;Adobe&amp;rsquo;s &lt;a href=&#34;http://www.adobe.com/products/acrobat/adobepdf.html&#34;&gt;Why PDF?&lt;/a&gt; page lists some obvious advantages of PDF files. (And, bless them, this page is HTML, not a PDF.) They call it an &amp;ldquo;Open format&amp;rdquo;, but &amp;ldquo;open&amp;rdquo; here is software corporate-speak for &amp;ldquo;documented&amp;rdquo;, which is certainly a Good Thing, but not open in the sense of &amp;ldquo;to any outside influence&amp;rdquo;, which is what &amp;ldquo;open standard&amp;rdquo; means to most people, even (finally) to Sun.&lt;/p&gt;
&lt;p&gt;The same page also lists &amp;ldquo;Multiplatform&amp;rdquo;, &amp;ldquo;Accessible&amp;rdquo;, and &amp;ldquo;Searchable&amp;rdquo; as advantages. I can&amp;rsquo;t argue with the first. While I can&amp;rsquo;t argue with &amp;ldquo;Accessible&amp;rdquo; either, I&amp;rsquo;d be curious about the opinion of someone who studies these issues more closely. &amp;ldquo;Searchable&amp;rdquo; I&amp;rsquo;ll take with a grain of salt—I know I can&amp;rsquo;t write an application that could search through PDF files. (Advocates of so-called &amp;ldquo;binary XML&amp;rdquo; forget that a key advantage of XML as defined in &lt;a href=&#34;http://www.w3.org/TR/xml/#charsets&#34;&gt;its spec&lt;/a&gt; is that, as a text-based format, writing code to search and manipulate it is very, very easy.) Adobe also says that Acrobat files can &amp;ldquo;maintain information integrity&amp;rdquo;—I wouldn&amp;rsquo;t have worded it the same way, but it can be an advantage to know that page breaks will happen in the exact same place with any viewer on any platform, if you really care about page breaks. Many of my company&amp;rsquo;s clients really care about page breaks, because they&amp;rsquo;re publishers who have products that they send off to printing houses. PDF minimizes so many worries about page layout, fonts, and other printer-related issues that for this purpose it&amp;rsquo;s a wonderful thing. Still, while page fidelity is important in contracts and certain other documents, it shouldn&amp;rsquo;t matter to a tutorial or product fact sheet.&lt;/p&gt;
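&lt;p&gt;(To make that parenthetical point about text-based XML concrete, here is a minimal sketch; the &lt;code&gt;manual&lt;/code&gt;/&lt;code&gt;section&lt;/code&gt; document and element names are invented for illustration, but a few lines of standard-library Python really are enough to search and modify an XML document.)&lt;/p&gt;

```python
# A minimal sketch of the point above: because XML is plain text with a
# documented structure, searching and manipulating it takes very little code.
# The <manual>/<section> document here is invented for illustration.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<manual><section>Install the unit</section>"
    "<section>Check the page breaks</section></manual>")

# Search: find every section whose text mentions "page".
hits = [s.text for s in doc.iter("section") if "page" in s.text]

# Manipulate: append a new section, then serialize back to text.
ET.SubElement(doc, "section").text = "Revision history"
serialized = ET.tostring(doc, encoding="unicode")

print(hits)  # ['Check the page breaks']
```

&lt;p&gt;Doing the equivalent against a PDF or a binary format means first decoding an opaque byte stream, which is exactly the barrier the XML spec&amp;rsquo;s text-based design avoids.&lt;/p&gt;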
&lt;p&gt;In general, you can&amp;rsquo;t edit PDFs, although Adobe doesn&amp;rsquo;t list this as an advantage. That&amp;rsquo;s probably partly because it falls under &amp;ldquo;information integrity&amp;rdquo;, but more to the point, they sell software that lets you edit them. Few people own this software (compared with the number of people who own PDF readers), so when you send a PDF to multiple people, you know they won&amp;rsquo;t screw around with it. Unless you&amp;rsquo;re deliberately asking for revisions to a document, sending a PDF is almost always better than sending a Word file. If John sends a Word (or HTML) file to Jane and Jane forwards it to Jim, how does Jim know that Jane didn&amp;rsquo;t alter it en route?&lt;/p&gt;
&lt;p&gt;I suppose that outside of the publishing industry the key advantage to PDF files these days is that they&amp;rsquo;re not Microsoft Word files. In addition to not being able to edit them, people can&amp;rsquo;t add macros to them that will screw up your computer. When someone at my daughter&amp;rsquo;s school&amp;rsquo;s parent association sends a Word file of a flyer for some event, I wish that it was a PDF, but they don&amp;rsquo;t know any better. My wife was very happy when I showed her &lt;a href=&#34;http://sourceforge.net/projects/pdfcreator/&#34;&gt;PDFCreator&lt;/a&gt;, an open-source Windows program that appears in the printer menu of any program that can print and sends your output to a PDF file instead of a printer. For more advanced use, OpenOffice and free and commercial XSL-FO implementations offer more PDF creation choices. I think we can assume that the programmers that Elliotte mentions in his weblog posting are more technically sophisticated than the people running a typical primary school parents council, so they have less excuse for using PDF.&lt;/p&gt;
&lt;h2 id=&#34;RMs5B1GrTKG3JG4K_lEifA&#34;&gt;Bad things about PDFs&lt;/h2&gt;
&lt;p&gt;I want to focus on disadvantages of the PDF format itself, and not the Acrobat Viewer program. Acrobat Viewer is free, but it&amp;rsquo;s big and slow and bloated, and as Adobe&amp;rsquo;s foot in my door it spends too much time reminding me about &amp;ldquo;critical&amp;rdquo; upgrades and trying to sell me things I don&amp;rsquo;t need. However, you don&amp;rsquo;t have to use Acrobat Viewer to view your PDFs; last June I discovered a &lt;a href=&#34;https://www.bobdc.com/blog/a-nice-windows-alternative-to&#34;&gt;nice alternative&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you present printed pages on a glowing screen, there&amp;rsquo;s a certain impedance mismatch, and the advantages of fidelity to printed pages (for example, knowing that when you say &amp;ldquo;bottom of page 12&amp;rdquo;, it will mean the same to all viewers) are often outweighed by the disadvantages. I&amp;rsquo;m especially tired of looking at on-screen representations of the gap between the bottom of one page and the top of another; what good does this do for us?&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/tiredofpdf1.jpg&#34; alt=&#34;[PDF page break]&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;How about when Acrobat tells you that you&amp;rsquo;re looking at page 13 of a document, but the bottom of that page says &amp;ldquo;15&amp;rdquo;? Perhaps I could hunt through the twenty categories under &amp;ldquo;Edit Preferences&amp;rdquo; for some check box that lets me change this, but why do I need to? Come to think of it, when you say &amp;ldquo;bottom of page 12&amp;rdquo;, maybe it doesn&amp;rsquo;t mean the same thing to all viewers.&lt;/p&gt;
&lt;p&gt;In a comment on Elliotte&amp;rsquo;s posting, John Cowan &lt;a href=&#34;http://www.elharo.com/blog/software-development/2007/02/25/pdf-killed-the-programing-language/#comment-48143&#34;&gt;points out&lt;/a&gt; that an article about a programming language, when written for (or aspiring to) a peer-reviewed journal article, is taken more seriously when it&amp;rsquo;s in PDF format. That&amp;rsquo;s fine, but as Elliotte wrote of the offending web site, &amp;ldquo;it seems all their tutorials, manuals, white papers, and almost everything else are in PDF&amp;rdquo;. There&amp;rsquo;s just too much PDF out there. The &lt;a href=&#34;http://www.google.com/search?hl=en&amp;amp;lr=&amp;amp;safe=off&amp;amp;as_qdr=all&amp;amp;q=allintitle%3A+faq+filetype%3Apdf&#34;&gt;number of FAQs&lt;/a&gt; alone that are in PDF instead of HTML is just shameful.&lt;/p&gt;
&lt;p&gt;I think that companies that are large-scale publishers as a by-product of their main business are getting ready to move past PDF. You&amp;rsquo;ve heard of the jet fighter planes whose total printed documentation weighs more than one of the planes? Step one for easing the access of repair engineers to the relevant documentation—the process of putting it &amp;ldquo;online&amp;rdquo;, as they still call it—has been to create PDF versions of those books. A CD or DVD full of PDFs is more convenient than a wallful of books to someone crawling around the inside of a plane&amp;rsquo;s engine compartment, but pictures of pages with all the negative issues described above are not the best way to present the relevant information to these repair engineers. Designing a better interface for this content delivery is work, but more of these companies are realizing that the work is worth it.&lt;/p&gt;
&lt;p&gt;Reliance on the page metaphor may be a symptom of the historical moment as we spend a few decades completing the transition from hard copy books to more sensible online delivery for the appropriate content. (Note my little qualifier at the end of that sentence—I&amp;rsquo;m sure that for novels read on the beach, bound hardcopy books will remain the best delivery medium for years to come.) You could compare it to the state of movies before D.W. Griffith, when each film was little more than a single static shot of a filmed silent play.&lt;/p&gt;
&lt;p&gt;Many publishers are now picking up hints that it&amp;rsquo;s time to move on from the page metaphor. I recently attended some meetings in which publishers were discussing ways to control costs, and the first step is discussing ways to measure costs. Cost per page is a classic measure in the publishing world, but everyone in the meeting agreed that it&amp;rsquo;s increasingly meaningless as more content is delivered online.&lt;/p&gt;
&lt;p&gt;My former employer LexisNexis recently renamed the division of the company that creates books to &lt;a href=&#34;http://www.lexisnexis.com/productsandservices/solutionguide.asp?&amp;amp;name=value&amp;amp;CMP=IL15046&#34;&gt;Offline Products&lt;/a&gt;. Considering that everyone used to consider online delivery to be an alternative to print delivery, it&amp;rsquo;s interesting that someone would now define print delivery as essentially being &amp;ldquo;not online&amp;rdquo; delivery. Outside of diagrams accompanying patents, I don&amp;rsquo;t know of any content that LexisNexis delivers as PDFs, so they have the right idea. A brief check around &lt;a href=&#34;http://law.lexisnexis.com&#34;&gt;lexisnexis.com&lt;/a&gt; (which unlike lexis.com or nexis.com is more of a marketing web site than a product delivery one) didn&amp;rsquo;t turn up any PDFs either, and those marketing types are usually the quickest to think that PDF is better than HTML. Although this is a company that was founded to deliver content online, they&amp;rsquo;re still capable of being behind the curve on various technical issues; it&amp;rsquo;s nice to see that they have their PDF/HTML priorities straight.&lt;/p&gt;
&lt;p&gt;To summarize: PDF doesn&amp;rsquo;t make your content look slicker or more professional. It &lt;em&gt;may&lt;/em&gt; make the printed output look better, but make damn sure that most of your content&amp;rsquo;s readers intend to print it before reading it. And, make an honest effort to create an HTML version that prints nicely. Smart people are &lt;a href=&#34;http://www.w3.org/TR/css-print/&#34;&gt;making progress&lt;/a&gt; on this front all the time.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-720&#34;&gt;March 5, 2007 12:17 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The PDF format actually can deal with matching the page numbers in the PDF to the page numbers in the content; I have seen PDFs that do so, with introductory pages 1-8 followed by body pages 1-107 or whatever, and googling suggests that even roman page numbers can be handled.&lt;/p&gt;
&lt;p&gt;Of course, a PDF creator has to understand the content in order to generate such page numbers. A printer driver, for example, can&amp;rsquo;t; it doesn&amp;rsquo;t know which feature of the page being printed represents a page number. OpenOffice could but (as of 2.1) does not. Unfortunately, I can&amp;rsquo;t file a bug on this because I can&amp;rsquo;t persuade OO.o to validate me as a bug-tracker user.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://hackwrench.tripod.com&#34; title=&#34;http://hackwrench.tripod.com&#34;&gt;Robert Claypool&lt;/a&gt; on &lt;a href=&#34;#comment-721&#34;&gt;March 5, 2007 7:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;All I know is that Adobe Reader frequently &amp;ldquo;forgets&amp;rdquo; that I want to view documents fit to width and continuous pages. It&amp;rsquo;s still set in preferences. What I think is happening is that the Reader lets the documents hijack the settings. Why it would be so important to let the document enforce these settings is beyond me.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-722&#34;&gt;March 6, 2007 11:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The performance issue is huge. I&amp;rsquo;ve started using FoxIt Reader for this very reason. It loads PDFs blazing fast, and it&amp;rsquo;s also free.&lt;/p&gt;
&lt;p&gt;&amp;ndash;Scott&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>One namespace to rule them all</title>
      <link>https://www.bobdc.com/blog/one-namespace-to-rule-them-all/</link>
      <pubDate>Thu, 01 Mar 2007 01:59:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/one-namespace-to-rule-them-all/</guid>
      
      
      <description><div>Says who? A spam generation program.</div><div>&lt;p&gt;I try not to forward people spam that strikes me as funny because of its strange, autogenerated content (&amp;ldquo;Look! Andre Breton-surrealist-beatnik-acid-poetry!&amp;rdquo;) because there&amp;rsquo;s so much of it out there that there&amp;rsquo;s nothing special about any of it. (Please, no counter-examples.) When one such message got through my spam filters to appear in my inbox with a subject header of &amp;ldquo;One namespace to rule them all&amp;rdquo;, though, it certainly pushed a lot of buttons in the mind of a markup/metadata geek. My inner voice was already arguing with someone out there: &amp;ldquo;What? The whole point of namespaces is to have more than one so that you can distinguish the context of one use of a term from that of another use! Throwing them all together won&amp;rsquo;t scale! That&amp;rsquo;s why the microformats use of the class attribute&amp;hellip;&amp;rdquo; etc.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/chfr.jpg&#34; alt=&#34;[spam image excerpt]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;Then, under the bitmap picture plugging stock shares of the China Fruits Corporation, I saw a block of text assembled from random phrases out on the world wide internet. In an earlier draft of this, I quoted the whole thing, but I&amp;rsquo;m sure you see enough of this sort of thing.&lt;/p&gt;
&lt;p&gt;Before grammatically correct but semantically nonsensical auto-generated content was so prevalent, I used to be fascinated by it. I even got credit toward a master&amp;rsquo;s degree in computer science for &lt;a href=&#34;http://www.snee.com/bob/worksch.html#i1&#34;&gt;some LISP coding&lt;/a&gt; to do some. A few times in my undergraduate years, while taking notes during a lecture, I&amp;rsquo;d nod off and still manage to write a few more words. Upon realizing what I did, I always excitedly checked to see what I had written, knowing that this was the kind of thing that Breton actively &lt;a href=&#34;http://www.answers.com/topic/andr-breton&#34;&gt;sought to write&lt;/a&gt;, but I had never written anything interesting. Maybe it would have looked better in French. (Hell, even the word &amp;ldquo;metadata&amp;rdquo; looks better &lt;a href=&#34;http://fr.wikipedia.org/wiki/M%C3%A9tadonn%C3%A9e&#34;&gt;in French&lt;/a&gt;, although Google &lt;a href=&#34;http://www.google.com/search?q=metadonee&#34;&gt;asks&lt;/a&gt;: &amp;ldquo;Did you mean methadone?&amp;rdquo; More than one French poet would approve.)&lt;/p&gt;
&lt;p&gt;The spammers can use whatever algorithms they like to put text under their bitmap message about the money they want from you without triggering your spam detector, but the subject header of their message is more important. It must grab your attention so that you open the message to see what&amp;rsquo;s inside, as this one did for me. Was this subject header as random as the text in the message? Were the odds similar that my mother could have gotten spam with a subject about one namespace ruling them all? The one non-spam use of the phrase that I can find is a &lt;a href=&#34;http://adamv.com/dev/articles/hatevbs/vbscript&#34;&gt;weblog posting about VBScript&lt;/a&gt;, which is pretty far away from the topics I care about. The grand tone of how this namespace would &amp;ldquo;rule them all&amp;rdquo; was certainly part of the button-pushing effect; did it push any buttons for you when I used it?&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://h&#34; title=&#34;http://h&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-715&#34;&gt;March 1, 2007 9:45 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It certainly grabbed &lt;em&gt;me&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been peddling this to a few places lately:&lt;/p&gt;
&lt;p&gt;Data and procedures and the values they amass,&lt;br /&gt;
Higher-order functions to combine and mix and match,&lt;br /&gt;
Objects with their local state, the messages they pass,&lt;br /&gt;
A property, a package, the control point for a catch &amp;ndash;&lt;br /&gt;
In the Lambda Order they are all first-class.&lt;/p&gt;
&lt;p&gt;One Thing to name them all, One Thing to define them,&lt;br /&gt;
One Thing to place them in environments and bind them,&lt;br /&gt;
In the Lambda Order they are all first-class.&lt;/p&gt;
&lt;p&gt;(the abstract to the Revised Revised Report on Scheme)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-719&#34;&gt;March 2, 2007 5:12 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Me too&lt;/em&gt;, but the From: field helped (and I was expecting an 80&amp;rsquo;s guitarist yarn).&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Checking Out Yahoo Pipes</title>
      <link>https://www.bobdc.com/blog/checking-out-yahoo-pipes/</link>
      <pubDate>Sat, 24 Feb 2007 09:57:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/checking-out-yahoo-pipes/</guid>
      
      
      <description><div>Easy, quick, and useful.</div><div>&lt;p&gt;You&amp;rsquo;ve probably heard that Yahoo has this new, drag-and-drop tool to easily combine and manipulate RSS and Atom feeds. (Forgive me for omitting the exclamation point from their name—speaking of which, shouldn&amp;rsquo;t the logo for &lt;a href=&#34;http://es.yahoo.com/&#34;&gt;yahoo.es&lt;/a&gt; be &amp;ldquo;¡Yahoo!&amp;rdquo;?) Tim O&amp;rsquo;Reilly &lt;a href=&#34;http://radar.oreilly.com/archives/2007/02/pipes_and_filte.html&#34;&gt;called Yahoo pipes&lt;/a&gt; no less than &amp;ldquo;a milestone in the history of the internet.&amp;rdquo; Early reports mentioned load problems, and I was extra busy with work, so I waited a bit before trying it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://pipes.yahoo.com/&#34;&gt;&lt;img src=&#34;http://l.yimg.com/us.yimg.com/i/us/pps/logo_1.gif&#34; alt=&#34;[yahoo pipes logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s definitely cool. My to-do list has several low-priority entries under &amp;ldquo;check out the &lt;a href=&#34;http://www.feedparser.org/&#34;&gt;Universal Feed Parser&lt;/a&gt; and try to combine feeds X, Y, and Z into one feed and then search/sort/remove redundant entries from the combination&amp;rdquo;. Yahoo pipes makes this simple: you drag modules from a menu on the left of the screen onto a workspace where you hook up inputs, specialized processing modules, and then an output.&lt;/p&gt;
&lt;p&gt;A Fetch module lets you specify one or more feeds to grab, using Atom or any RSS flavor. (It took me a while to catch on to the &amp;ldquo;or more&amp;rdquo; part—at first I assembled messy combinations of single-URL Fetch modules all combined through Union operator modules.) Of the modules that you can add between your Fetch module(s) and the Pipe Output one, modules like &amp;ldquo;Content Analysis&amp;rdquo; and &amp;ldquo;For Each: Annotate&amp;rdquo; look interesting, but I couldn&amp;rsquo;t make them work after a few minutes of playing. Instead of worrying about it, I thought I&amp;rsquo;d just wait until the Yahoo Pipes documentation is less skimpy and the system&amp;rsquo;s teething problems are over, in case my difficulties are not my own fault.&lt;/p&gt;
&lt;p&gt;And, there&amp;rsquo;s plenty you can do with the simpler modules. For my first non-trivial pipe, I wanted to create something I could show at a talk I&amp;rsquo;ll be giving to a group of law librarians, so I found feed URLs for fourteen Intellectual Property &lt;a href=&#34;http://www.google.com/search?q=blawg&#34;&gt;blawgs&lt;/a&gt; (&amp;ldquo;law blogs&amp;rdquo;—get it?) on blawg.com&amp;rsquo;s &lt;a href=&#34;http://www.blawg.com/Listing.aspx?CategoriesID=14&amp;amp;page=0&#34;&gt;Intellectual Property&lt;/a&gt; blawg listing, then piped the combination of these feeds through a Filter module that only passes along the ones with the phrase &amp;ldquo;fair use&amp;rdquo; in them. Then, the pipeline goes through another module that sorts the entries by published date and finally to the output module. Now we have something that would be valuable to an IP lawyer working on a case where the issue of &amp;ldquo;fair use&amp;rdquo; plays an important role: &lt;a href=&#34;http://pipes.yahoo.com/pipes/OHEdcnfC2xG_dRz8e_gC8A/&#34;&gt;IP blawg entries mentioning fair use&lt;/a&gt;.&lt;/p&gt;
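&lt;p&gt;(The same fetch/filter/sort pipeline can be sketched outside Yahoo Pipes with a few lines of standard-library Python; the toy feeds and titles below are invented, and a real version would first fetch each blawg feed over HTTP, but the union, filter, and sort steps are the ones just described.)&lt;/p&gt;

```python
# A sketch of the pipe described above, standard library only: union
# several RSS feeds, keep items whose title contains a phrase, and sort
# newest-first by pubDate. The feeds are toy inline strings standing in
# for fetched blawg feeds.
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

FEEDS = [
    """<rss><channel>
         <item><title>Fair use and mashups</title>
           <pubDate>Sat, 24 Feb 2007 09:00:00 -0500</pubDate></item>
         <item><title>Patent filing news</title>
           <pubDate>Fri, 23 Feb 2007 08:00:00 -0500</pubDate></item>
       </channel></rss>""",
    """<rss><channel>
         <item><title>A fair use ruling</title>
           <pubDate>Sun, 25 Feb 2007 10:00:00 -0500</pubDate></item>
       </channel></rss>""",
]

def pipe(feeds, phrase):
    # Union: collect every <item> from every feed.
    items = [item for feed in feeds
             for item in ET.fromstring(feed).iter("item")]
    # Filter: pass along only items whose title contains the phrase.
    items = [i for i in items
             if phrase.lower() in i.findtext("title", "").lower()]
    # Sort by published date, newest first.
    items.sort(key=lambda i: parsedate_to_datetime(i.findtext("pubDate")),
               reverse=True)
    return [i.findtext("title") for i in items]

print(pipe(FEEDS, "fair use"))  # ['A fair use ruling', 'Fair use and mashups']
```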
&lt;p&gt;One great thing about the modules that use information from specific elements of an RSS feed is that they give you a menu of the available elements in the input you&amp;rsquo;ve chosen. For example, the Filter module lets you permit or block items that match any or all of the rules that you specify on that module. The default rule says &amp;ldquo;title contains [text]&amp;rdquo;, where you enter text such as &amp;ldquo;fair use&amp;rdquo;. Along with &amp;ldquo;contains&amp;rdquo; they offer five other choices such as &amp;ldquo;does not contain&amp;rdquo; and &amp;ldquo;Matches regex&amp;rdquo;. The choices besides &amp;ldquo;title&amp;rdquo; depend on your input; when you drag a little blue hose from the output of a module such as Fetch to a Filter module, the Title part briefly says &amp;ldquo;Updating&amp;hellip;&amp;rdquo; and then becomes a drop-down menu showing available elements in the input. The following shows an example of the elements that show up in the drop-down list for a given set of input.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/ypipes1.jpg&#34; alt=&#34;screen shot showing dropdown list&#34;/&gt;
&lt;p&gt;Once I was comfortable with all this, I set myself a goal of creating a version of the &lt;a href=&#34;http://www.boingboing.net/&#34;&gt;BoingBoing&lt;/a&gt; feed that filters out all entries by Cory Doctorow and Xeni Jardin, and I timed myself to see how long this took. I didn&amp;rsquo;t time the creation process down to the second, but from the time I clicked &amp;ldquo;New&amp;rdquo; until I saw a working test run was literally two minutes. That&amp;rsquo;s nice. I started writing out the details of how I created the pipe here, but if you have a Yahoo ID you can easily &lt;a href=&#34;http://pipes.yahoo.com/pipes/4rA1TrfC2xGZpUQlr8cPhQ/edit&#34;&gt;see the visual representation for yourself&lt;/a&gt;, and it saves me some typing.&lt;/p&gt;
&lt;p&gt;Of course the results of your pipe creation work are available as a feed that you can add to a feed reader and drop into other Yahoo pipes feeds, but unfortunately there&amp;rsquo;s no Atom output. On the website&amp;rsquo;s suggestion list, you can &lt;a href=&#34;http://suggestions.yahoo.com/srp/?prop=Pipes&amp;amp;query=atom&amp;amp;crumb=PZo9HZP7mQa&amp;amp;sort=date&amp;amp;filter=&amp;amp;resolved=0&amp;amp;vote_mode=1&#34;&gt;vote&lt;/a&gt; to make that a higher priority. (JSON output is available.)&lt;/p&gt;
&lt;p&gt;I look forward to figuring out and trying the more complex modules. And maybe I&amp;rsquo;ll start following BoingBoing again, now that I&amp;rsquo;ve found an easy way to improve their signal-to-noise ratio.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>More ways to make money from the semantic web</title>
      <link>https://www.bobdc.com/blog/more-ways-to-make-money-from-t/</link>
      <pubDate>Thu, 15 Feb 2007 09:49:42 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/more-ways-to-make-money-from-t/</guid>
      
      
      <description><div>For example, by helping the people who are trying to follow the money.</div><div>&lt;p&gt;Some of us geeky types play with certain technologies just because they&amp;rsquo;re fun and useful to us and to our friends in the little clubs that form around each technology. We debate about their potential use to a wider audience, and there&amp;rsquo;s certainly been plenty of this debate about the semantic web.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.jupitermedia.com&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/jmedia.jpg&#34; alt=&#34;[Jupiter Media logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When people who feel that there&amp;rsquo;s money to be made start sniffing around, it&amp;rsquo;s an odd feeling; my only real experience with that was watching XML&amp;rsquo;s transition from a side project of some SGML geeks to a pillar technology of an economic bubble. The semantic web may never get that big (personally, I feel that as with Artificial Intelligence, useful and valuable technologies will spin off from it and become part of the plumbing while people continue to debate the &amp;ldquo;success&amp;rdquo; and &amp;ldquo;failure&amp;rdquo; of the blanket term), but people who see a market for it are starting to look into it.&lt;/p&gt;
&lt;p&gt;Jupitermedia has been around for a while. They had the foresight to grab the domain names &lt;a href=&#34;http://www.internet.com/&#34;&gt;internet.com&lt;/a&gt; and &lt;a href=&#34;http://www.graphics.com/&#34;&gt;graphics.com&lt;/a&gt; way back when, and I had heard their name in association with the &lt;a href=&#34;http://www.devx.com/&#34;&gt;devX.com&lt;/a&gt; tech news and features site and their &lt;a href=&#34;http://www.jupiterevents.com/&#34;&gt;conferences and trade shows&lt;/a&gt;. They&amp;rsquo;re planning on a Semantic Web conference, and more importantly for now, they&amp;rsquo;re looking to hire someone to cover the semantic web for them on a part-time basis. They want two or three articles a week of 400 – 500 words each. Pay would be based on the author&amp;rsquo;s experience in semantic web-related work and his or her writing background. According to Gus Venditto of jupitermedia.com (gvenditto@), &amp;ldquo;What we expect is someone who will do the research—talking to people, following up on leads, reading the wires for the latest—and then write solid newsy articles.&amp;rdquo; I&amp;rsquo;d do it, but my job keeps me busy enough.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a great opportunity for someone interested in evangelizing the semantic web to get paid to do so. Of course, to track potential story ideas, you&amp;rsquo;d want to point &lt;a href=&#34;http://planetrdf.com/&#34;&gt;Planet RDF&lt;/a&gt; and some &lt;a href=&#34;http://digg.com/&#34;&gt;digg&lt;/a&gt;, &lt;a href=&#34;http://del.icio.us/&#34;&gt;del.icio.us&lt;/a&gt;, &lt;a href=&#34;http://www.blogpulse.com/&#34;&gt;blogpulse&lt;/a&gt;, and &lt;a href=&#34;http://reddit.com/search?q=semantic+web&#34;&gt;reddit&lt;/a&gt; feeds into a &lt;a href=&#34;http://pipes.yahoo.com/&#34;&gt;Yahoo pipe&lt;/a&gt; &amp;hellip; and the rest is left as an exercise for the writer.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re interested, get in touch with Gus.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Generating RDFa from Movable Type, Part 2</title>
      <link>https://www.bobdc.com/blog/generating-rdfa-from-movable-t-1/</link>
      <pubDate>Fri, 09 Feb 2007 21:15:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/generating-rdfa-from-movable-t-1/</guid>
      
      
      <description><div>Why generate metadata that&#39;s redundant with data?</div><div>&lt;p&gt;After I &lt;a href=&#34;https://www.bobdc.com/blog/generating-rdfa-from-movable-t&#34;&gt;wrote recently&lt;/a&gt; about tweaking a Movable Type template so that RDFa metadata would be automatically generated with the individual archive versions of each weblog posting, Ben Adida &lt;a href=&#34;http://rdfa.info/2007/02/01/rdfa-in-movable-type/&#34;&gt;suggested&lt;/a&gt; that it would be better if I had added the markup inline with the weblog entry instead of grouping it into a single block in the web page&amp;rsquo;s &lt;code&gt;head&lt;/code&gt; element.&lt;/p&gt;
&lt;p&gt;I put the metadata in the &lt;code&gt;head&lt;/code&gt; element because I&amp;rsquo;ve been pushing the RDFa development effort to consider the use case of metadata that describes content without being content, such as workflow information about a document or about a component of a document. (For an example, see the &lt;a href=&#34;http://www.w3.org/2006/07/SWD/RDFa/scenarios/20070109/#use-case-3&#34;&gt;Content Management Metadata example&lt;/a&gt; that I submitted to the W3C&amp;rsquo;s RDFa Use Cases document.) I realized, though, that for most of the RDFa metadata that I had added to my weblog template, Ben was right, because so much of that metadata was redundant with the web page&amp;rsquo;s data. The following &lt;code&gt;meta&lt;/code&gt; element uses the MovableType tag &lt;code&gt;$MTEntryTitle&lt;/code&gt; to plug the document&amp;rsquo;s title into the &lt;code&gt;content&lt;/code&gt; attribute,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta property=&amp;quot;dc:title&amp;quot; content=&amp;quot;&amp;lt;$MTEntryTitle encode_html=&amp;quot;1&amp;quot;$&amp;gt;&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;so that for a document like this one I&amp;rsquo;d end up with this &lt;code&gt;meta&lt;/code&gt; element:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta property=&amp;quot;dc:title&amp;quot; content=&amp;quot;Generating RDFa from Movable Type, Part 2&amp;quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since that title is already showing up elsewhere in the document as the title that you see when you read the web page, this second indication of the title is redundant with the displayed title. (It&amp;rsquo;s actually the third indication of the title—I&amp;rsquo;ve always hated how a typical HTML document needs to have its title specified in both the &lt;code&gt;title&lt;/code&gt; child of the &lt;code&gt;head&lt;/code&gt; element and in an &lt;code&gt;h1&lt;/code&gt; element near the top of the &lt;code&gt;body&lt;/code&gt;.)&lt;/p&gt;
&lt;p&gt;The Movable Type template that I use doesn&amp;rsquo;t store the main title in an &lt;code&gt;h1&lt;/code&gt; element, but I&amp;rsquo;m not going to screw around with its structure. I went with the RDFa philosophy of adding a few tags here and there to provide machine-readable clues about the meaning of the existing content. The bold text here shows how the template now generates the same dc:title triple that it generated before, without adding another copy of the title string to the web page.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;h3 class=&amp;quot;entry-header&amp;quot;&amp;gt;&amp;lt;meta property=&amp;quot;dc:title&amp;quot;&amp;gt;&amp;lt;$MTEntryTitle$&amp;gt;&amp;lt;/meta&amp;gt;&amp;lt;/h3&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The description was already in the document as well, so adding a few tags also made it serve double-duty as data and metadata.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;p&amp;gt;&amp;lt;b&amp;gt;&amp;lt;meta property=&amp;quot;dc:description&amp;quot;&amp;gt;&amp;lt;$MTEntryBody$&amp;gt;&amp;lt;/meta&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The other place where I added tags into the content includes an assignment of the author name as the &lt;code&gt;dc:creator&lt;/code&gt; and another more interesting case. The &lt;code&gt;meta&lt;/code&gt; element for the &lt;code&gt;dc:date&lt;/code&gt; predicate isn&amp;rsquo;t re-using data as metadata—it wraps the &amp;ldquo;Posted by&amp;rdquo; date with tags that include a &lt;code&gt;content&lt;/code&gt; attribute to provide the object of the triple instead of using the PCDATA content of the &lt;code&gt;meta&lt;/code&gt; element.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;span class=&amp;quot;post-footers&amp;quot;&amp;gt;Posted by &amp;lt;meta property=&amp;quot;dc:creator&amp;quot;&amp;gt;
&amp;lt;$MTEntryAuthorDisplayName$&amp;gt;&amp;lt;/meta&amp;gt; on 
&amp;lt;meta property=&amp;quot;dc:date&amp;quot; 
content=&#39;&amp;lt;$MTEntryDate format=&amp;quot;%Y-%m-%dT%H:%M:%S&amp;quot;&amp;gt;&#39;&amp;gt;
&amp;lt;$MTEntryDate$&amp;gt;&amp;lt;/meta&amp;gt;&amp;lt;/span&amp;gt; 
&amp;lt;span class=&amp;quot;separator&amp;quot;&amp;gt;|&amp;lt;/span&amp;gt; &amp;lt;a class=&amp;quot;permalink&amp;quot; 
href=&amp;quot;&amp;lt;$MTEntryPermalink$&amp;gt;&amp;quot;&amp;gt;Permalink&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result will look something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;span class=&amp;quot;post-footers&amp;quot;&amp;gt;Posted by &amp;lt;meta property=&amp;quot;dc:creator&amp;quot;&amp;gt;
Bob DuCharme&amp;lt;/meta&amp;gt; on 
&amp;lt;meta property=&amp;quot;dc:date&amp;quot; content=&#39;2007-01-31T03:45:05&#39;&amp;gt;
January 31, 2007 03:45 AM&amp;lt;/meta&amp;gt;&amp;lt;/span&amp;gt; 
&amp;lt;span class=&amp;quot;separator&amp;quot;&amp;gt;|&amp;lt;/span&amp;gt; &amp;lt;a class=&amp;quot;permalink&amp;quot; 
href=&amp;quot;http://www.snee.com/bobdc.blog/2007/01/the_economist_welcomes_the_sem.html&amp;quot;&amp;gt;
Permalink&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second &lt;code&gt;meta&lt;/code&gt; element here has a great reason for using the &lt;code&gt;content&lt;/code&gt; attribute to provide an alternative object for the triple about the Dublin Core date: it&amp;rsquo;s using an ISO 8601 version of the date, which is more useful as metadata than one that spells out the month name, because it&amp;rsquo;s easier to sort and to use in query criteria. I might have put this &lt;code&gt;meta&lt;/code&gt; element in the document&amp;rsquo;s &lt;code&gt;head&lt;/code&gt; (where I still have some metadata), but wrapping it around the displayed date directly associates it with that very relevant part of the document. If I did the same for the dates shown with the comments about a weblog entry, someone could search for comments after 2007-02-01T12:00:00 and before 2007-02-01T15:00:00 and find one posted with a displayed date-time stamp of &amp;ldquo;February 1, 2007 at 2:30 PM&amp;rdquo;.&lt;/p&gt;
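&lt;p&gt;To make that search concrete, here is a small Python sketch (the function and variable names are just illustrative) of why the ISO 8601 form is the useful one as metadata: once the displayed date is converted, sorting and range checks work as plain string comparisons.&lt;/p&gt;

```python
from datetime import datetime

def in_window(stamp, start, end):
    # ISO 8601 timestamps sort and compare correctly as plain strings
    return stamp > start and end > stamp

# The human-readable date-time stamp displayed with a comment
displayed = "February 1, 2007 at 2:30 PM"
iso = datetime.strptime(displayed, "%B %d, %Y at %I:%M %p").strftime("%Y-%m-%dT%H:%M:%S")
print(iso)  # 2007-02-01T14:30:00
print(in_window(iso, "2007-02-01T12:00:00", "2007-02-01T15:00:00"))  # True
```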
&lt;p&gt;I disagree with Ben&amp;rsquo;s assertion that &amp;ldquo;marking up the actual rendered data&amp;hellip; is what RDFa is &lt;em&gt;all&lt;/em&gt; about&amp;rdquo; [my emphasis]. The beauty of RDFa is that its design lets us do that and more with it. Providing metadata that is not redundant with content will serve a real business need in the publishing world. Inline RDFa is more difficult to generate automatically than a single block inserted into a document&amp;rsquo;s &lt;code&gt;head&lt;/code&gt; element (and what&amp;rsquo;s that &lt;code&gt;head&lt;/code&gt; element for, if not document metadata?), and I don&amp;rsquo;t think that hand-crafted RDFa use will add up too quickly.&lt;/p&gt;
&lt;p&gt;Also, for a block of metadata in the header, it&amp;rsquo;s easy to specify the full URL of the containing document as the subject of all those triples by wrapping the &lt;code&gt;meta&lt;/code&gt; elements with another one that has the URL in an &lt;code&gt;about&lt;/code&gt; attribute. The way I have it now, the subject of many of the triples is the empty string, which is understood to be the containing document. Understood by what, in the semantic machine-readable sense? If I write an app that pulls such triples from two different documents, it&amp;rsquo;s going to see the same empty string subject for all of those triples unless I write extra code to fix that. One trick to get around this is an &lt;code&gt;about&lt;/code&gt; attribute on the &lt;code&gt;html&lt;/code&gt; start-tag with the full URL of the document; the triples come out great, but because I&amp;rsquo;ve never seen this done before, I wonder whether it&amp;rsquo;s considered good or bad practice. Let me know if you have any ideas about this.&lt;/p&gt;
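&lt;p&gt;For what it&amp;rsquo;s worth, resolving that empty string is just ordinary relative-URI resolution: an empty reference resolves to the base document&amp;rsquo;s URI, so a consuming application only gets globally meaningful subjects if it knows that base. A quick Python illustration (the URL here is made up):&lt;/p&gt;

```python
from urllib.parse import urljoin

# An empty-string subject is resolved against the base (containing
# document) URI, so the consumer has to supply that base itself.
base = "http://www.snee.com/bobdc.blog/2007/02/some_entry.html"
print(urljoin(base, ""))  # prints the base URI itself
```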
&lt;p&gt;Meanwhile, I&amp;rsquo;m accumulating useful triples. Looking for ways to add RDFa to auto-generated HTML is fun; it looks like I already have a nice start.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-685&#34;&gt;February 10, 2007 1:54 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;OT but important: can you &lt;em&gt;please&lt;/em&gt; switch to black lettering? I have to magnify your text about five times before I can read it, and it&amp;rsquo;s still not easy.&lt;/p&gt;
&lt;p&gt;Thanks.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://internet-apps.blogspot.com/&#34; title=&#34;http://internet-apps.blogspot.com/&#34;&gt;Mark Birbeck&lt;/a&gt; on &lt;a href=&#34;#comment-686&#34;&gt;February 10, 2007 4:39 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nice work Bob!&lt;/p&gt;
&lt;p&gt;The use-case you give for the @content attribute is a good one&amp;ndash;that of making sure that search engines can find information that may not be present in the document in a form that a machine might recognise. Other examples would be to attach the date information to words like &amp;ldquo;tomorrow&amp;rdquo; or &amp;ldquo;last week&amp;rdquo;, to attach names to titles like &amp;ldquo;the Prime Minister&amp;rdquo; or &amp;ldquo;the President&amp;rdquo;, or full names to countries like &amp;ldquo;the US&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Another thing that might be of interest is that RDFa is acquiring the idea of a &amp;lsquo;profile&amp;rsquo; that allows non-QName information that is known about to be converted to have a QName. If that&amp;rsquo;s gobbledygook for anyone reading, I apologise&amp;hellip;it just means that for words that are not qualified by being in some namespace, a preprocessor will find recognised ones and make it so that they &lt;em&gt;are&lt;/em&gt; in a namespace. Using this page as an example, it uses things like @rel=&amp;ldquo;start&amp;rdquo; and @rel=&amp;ldquo;prev&amp;rdquo; which are defined in HTML; an RDFa parser that uses the HTML profile preprocessor (still being defined, but currently called hGRDDL) will see those as the predicates h:start and h:prev, giving you even more triples to play with. :)&lt;/p&gt;
&lt;p&gt;Great to see RDFa being put through its paces.&lt;/p&gt;
&lt;p&gt;Best regards,&lt;/p&gt;
&lt;p&gt;Mark&lt;/p&gt;
&lt;p&gt;PS I almost forgot to mention; the triples you get in head should already include the full URL for the document, and not an empty string. There may be a problem with the processor you are using&amp;hellip;which is it?&lt;/p&gt;
&lt;p&gt;&amp;ndash;&lt;br /&gt;
Mark Birbeck, formsPlayer&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;mailto:mark.birbeck@x-port.net&#34;&gt;mark.birbeck@x-port.net&lt;/a&gt; * +44 (0) 20 7689 9232&lt;br /&gt;
&lt;a href=&#34;https://www.formsPlayer.com&#34;&gt;www.formsPlayer.com&lt;/a&gt; * internet-apps.blogspot.com&lt;/p&gt;
&lt;p&gt;standards. innovation.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-687&#34;&gt;February 10, 2007 1:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;John,&lt;/p&gt;
&lt;p&gt;Done.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-688&#34;&gt;February 10, 2007 1:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Mark,&lt;/p&gt;
&lt;p&gt;I wouldn&amp;rsquo;t expect the INRIA stylesheet to put the full URL in rdf:Description/@rdf:about, because an XSLT stylesheet has no way to know the name of the input document, but it doesn&amp;rsquo;t work for RDFlib either when run locally or on Elias&amp;rsquo;s web service (compare &lt;a href=&#34;http://www.snee.com/temp/test1.html&#34;&gt;http://www.snee.com/temp/test1.html&lt;/a&gt; with &lt;a href=&#34;http://torrez.us/services/rdfa/?url=http%3A%2F%2Fwww.snee.com%2Ftemp%2Ftest1.html&#34;&gt;http://torrez.us/services/rdfa/?url=http%3A%2F%2Fwww.snee.com%2Ftemp%2Ftest1.html&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;So the full URL should be plugged in? That&amp;rsquo;s good news.&lt;/p&gt;
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>The Economist welcomes the Semantic Web</title>
      <link>https://www.bobdc.com/blog/the-economist-welcomes-the-sem/</link>
      <pubDate>Wed, 31 Jan 2007 03:45:05 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-economist-welcomes-the-sem/</guid>
      
      
      <description><div>And, perhaps, vice versa.</div><div>&lt;p&gt;Last Saturday morning, before getting on a plane from Frankfurt to Charlotte, North Carolina with a book that I knew wouldn&amp;rsquo;t last that far, I went into a newsstand to find a big, glossy, British Formula One magazine to supplement my reading. I didn&amp;rsquo;t find one, but I bought something else glossy and British: a special New Year edition of &lt;a href=&#34;http://www.economist.com/index.html&#34;&gt;The Economist&lt;/a&gt; called &amp;ldquo;The World in 2007&amp;rdquo; with summaries of where various countries and industries are now and where they may go in the coming year.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.economist.com/index.html&#34;&gt;&lt;img src=&#34;http://www.economist.com/images/economist_logo.png&#34; alt=&#34;[Economist logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This special edition included lots of single-page essays from government and industry leaders such as Angela Merkel, John McCain, Eric Schmidt, and Pascal Lamy. All of the essays that I started were predictable enough that I didn&amp;rsquo;t finish them, until I got to the last page of the magazine, which had an essay with something I hadn&amp;rsquo;t expected to see: a parenthesized list of the three acronyms RDF, OWL, and SPARQL.&lt;/p&gt;
&lt;p&gt;Huh? In a special edition of The Economist? The essay was called &amp;ldquo;Welcome to the Semantic Web,&amp;rdquo; and of course the tech leadership figure who wrote it was Tim Berners-Lee. While some of us think that bnodes, reification, and traction for competing alternatives to RDF/XML are Big Issues to worry about, his essay made for a nice pep talk on the real big issues that form the grand vision of the Semantic Web.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;ve probably read most of the essay&amp;rsquo;s main points in other pieces he&amp;rsquo;s written, but a newer one caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Consider the financial services industry. Successful investment strategies are based on finding patterns and trends in an increasingly diverse set of information sources&amp;hellip; Leading-edge providers of financial information are now developing services that allow users easily to integrate the data they have themselves—about their own portfolios or from their in-house market models—with the data delivered by the information service. The unique value creation lies in the integration service itself, not in the raw data on its own or even in the software tools, most of which will be built on open-source components.&lt;/p&gt;
&lt;p&gt;The key to this integration is to use common data formats that link the information with identifiable vocabularies&amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you follow &lt;a href=&#34;http://www.planetrdf.com&#34;&gt;Planet RDF&lt;/a&gt;, I&amp;rsquo;m sure you can imagine where this discussion led, but the part quoted above reminded me of something that I&amp;rsquo;ve been thinking about. The financial community certainly has a lot of data. I&amp;rsquo;ve noticed that much of the effort in the use of the Extensible Business Reporting Language (&lt;a href=&#34;http://www.xbrl.org/Home/&#34;&gt;XBRL&lt;/a&gt;) goes into careful development of machine-readable taxonomies, and I&amp;rsquo;ve wondered whether RDF/OWL technology can take advantage of these taxonomies to do something worthwhile. While it&amp;rsquo;s a cliche that metadata is data about data, lots of RDF/OWL ontology metadata isn&amp;rsquo;t about anything—&amp;ldquo;Ontologies for the sake of ontologies,&amp;rdquo; as Dan Connolly put it—but financial metadata has plenty of data to go with it, so there should be lots of room for interesting new metadata applications.&lt;/p&gt;
&lt;p&gt;Has anyone heard of any use of semantic web technology with financial data, with or without building on XBRL work?&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By Josh Tauberer on &lt;a href=&#34;#comment-660&#34;&gt;January 31, 2007 6:46 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve played with the SEC&amp;rsquo;s Edgar data source and turned some of it into N3 for fun to see what it would look like:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.govtrack.us/data/rdf/sec.n3.gz&#34;&gt;http://www.govtrack.us/data/rdf/sec.n3.gz&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;http://www.govtrack.us/rdfbrowse.xpd?uri=tag:govshare.info,2005:data/us/sec/cik0001308161&#34;&gt;http://www.govtrack.us/rdfbrowse.xpd?uri=tag:govshare.info,2005:data/us/sec/cik0001308161&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://kontrawize.blogs.com/kontrawize/&#34; title=&#34;http://kontrawize.blogs.com/kontrawize/&#34;&gt;Anthony B. Coates&lt;/a&gt; on &lt;a href=&#34;#comment-661&#34;&gt;January 31, 2007 8:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not aware of financial data being encoded using RDF, OWL, etc. It is under consideration as a technology for modelling business contexts, and perhaps whole business models, for a future version of the ISO 20022 standard for generating banking/finance XML Schemas from models, although I should stress that it is only under evaluation; it will be quite a while before any decision one way or another is made.&lt;/p&gt;
&lt;p&gt;It doesn&amp;rsquo;t make much sense to use RDF for data that fits neatly into a hierarchy, since XML does a perfectly good job of that. RDF might make sense for data where the structure needs to be flexible over time, e.g. perhaps for financial data distribution, since data vendors want to be able to add new information and new associations between information in a timely manner, without waiting for standards or Schemas to catch up. This kind of RDF data might also aid &amp;ldquo;mash ups&amp;rdquo; that integrate financial data with other data, or which integrate financial data from multiple sources, which is an interesting use case.&lt;/p&gt;
&lt;p&gt;Cheers, Tony.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-662&#34;&gt;January 31, 2007 11:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Tony,&lt;/p&gt;
&lt;p&gt;I hadn&amp;rsquo;t expected to find use of RDF in the financial community (you and I have discussed this before), but was more curious about people in the RDF/semweb community taking some financial data and building something around it. There is plenty of data and metadata out there in a nice machine-readable form, and I&amp;rsquo;m sure that it wouldn&amp;rsquo;t take much XSLT to turn it into RDF for such a mashup.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By Economoist peruser on &lt;a href=&#34;#comment-663&#34;&gt;January 31, 2007 12:33 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The SEC&amp;rsquo;s Edgar XBRL data feed is very low volume&amp;ndash;months pass between new items.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://yihongs-research.blogspot.com/&#34; title=&#34;http://yihongs-research.blogspot.com/&#34;&gt;Yihong Ding&lt;/a&gt; on &lt;a href=&#34;#comment-664&#34;&gt;January 31, 2007 8:01 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;There is a European project called &lt;a href=&#34;http://www.musing-project.eu/&#34;&gt;MUSING&lt;/a&gt;, which aims to provide innovative knowledge management solutions and services to support the execution of business intelligence activities, directly at the end-user premises. Part of the project includes representing XBRL data sources on Semantic Web ontologies. I participated in this work last summer in Innsbruck. For more detailed information, you may want to talk with &lt;a href=&#34;http://www.heppnetz.de/&#34;&gt;Martin Hepp&lt;/a&gt;, who is the leader of ontology development in this MUSING project.&lt;/p&gt;
&lt;p&gt;cheers,&lt;/p&gt;
&lt;p&gt;Yihong&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Generating RDFa from Movable Type</title>
      <link>https://www.bobdc.com/blog/generating-rdfa-from-movable-t/</link>
      <pubDate>Thu, 25 Jan 2007 14:23:17 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/generating-rdfa-from-movable-t/</guid>
      
      
      <description><div>Easy to generate, easy to use.</div><div>&lt;p&gt;Did you know that the default template for the Movable Type weblog publishing software adds metadata in commented-out RDF/XML to permalink pages? (By &amp;ldquo;permalink pages,&amp;rdquo; I mean the pages that store permanent versions of each weblog entry, as opposed to the versions on the main index page, which are only there for a few weeks.) For example, &lt;a href=&#34;http://www.bradchoate.com/weblog/2005/08/25/movable-type-32&#34;&gt;this weblog&lt;/a&gt; is one I picked at random after doing a Google search for &amp;ldquo;Movable Type&amp;rdquo;; doing a View Source on it will show the commented-out RDF. (For all I know, the weblog&amp;rsquo;s author never heard of RDF.) Being commented out, this RDF is not particularly useful, but this default habit of Movable Type inspired me to try to get Movable Type to put &lt;a href=&#34;http://rdfa.info/&#34;&gt;RDFa&lt;/a&gt; versions of the same metadata into this weblog&amp;rsquo;s permalink pages. It works, and you can now pull RDF/XML out of any permalink version of one of my weblog entries with a single URL, &lt;a href=&#34;http://www.w3.org/2005/08/online_xslt/xslt?xslfile=http%3A%2F%2Fwww-sop.inria.fr%2Facacia%2Fsoft%2FRDFa2RDFXML.xsl&amp;amp;xmlfile=http%3A%2F%2Fwww.snee.com%2Fbobdc.blog%2F2007%2F01%2Fgreat_survey_of_rdfweb_develop.html&amp;amp;content-type=&amp;amp;submit=transform&#34;&gt;like this&lt;/a&gt;. It&amp;rsquo;s much easier than trying to do something with commented-out RDF/XML.&lt;/p&gt;
&lt;p&gt;If you follow that link, which tells the &lt;a href=&#34;http://www.w3.org/2005/08/online_xslt/&#34;&gt;W3C online version of Saxon&lt;/a&gt; to process my last weblog entry with a stylesheet that Fabien Gandon of &lt;a href=&#34;http://www-sop.inria.fr/&#34;&gt;INRIA&lt;/a&gt; put on their server that converts the RDFa to RDF/XML, your browser will only display the PCDATA in the RDF/XML, which won&amp;rsquo;t look like much. Do a View Source on it to see how the INRIA stylesheet pulled the RDFa out of that weblog entry and converted it to RDF/XML. Better yet, follow &lt;a href=&#34;http://www.w3.org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Fwww.w3.org%2F2005%2F08%2Fonline_xslt%2Fxslt%3Fxslfile%3Dhttp%253A%252F%252Fwww-sop.inria.fr%252Facacia%252Fsoft%252FRDFa2RDFXML.xsl%26xmlfile%3Dhttp%253A%252F%252Fwww.snee.com%252Fbobdc.blog%252F2007%252F01%252Fgreat_survey_of_rdfweb_develop.html%26content-type%3D%26submit%3Dtransform&amp;amp;PARSE=Parse+URI%3A+&amp;amp;TRIPLES_AND_GRAPH=PRINT_TRIPLES&amp;amp;FORMAT=PNG_EMBED&#34;&gt;this link&lt;/a&gt; to see the extracted RDF/XML document validated and converted into more readable triples. (Scroll to the right of that output to see the predicate and object of each triple.) Both literals and URIs appear as objects in these triples, which is nice.&lt;/p&gt;
&lt;p&gt;The first five triples for each document are based on &lt;code&gt;meta&lt;/code&gt; tags already inserted by Movable Type. The remaining triples are there because of the following tags that I added to my Movable Type &amp;ldquo;Individual Entry Archive&amp;rdquo; template just before the &lt;code&gt;script&lt;/code&gt; element at the end of the &lt;code&gt;head&lt;/code&gt; element:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;meta about=&amp;quot;&amp;lt;$MTEntryPermalink$&amp;gt;&amp;quot;&amp;gt;
    &amp;lt;link rel=&amp;quot;trackback:ping&amp;quot; href=&amp;quot;http://madskills.com/public/xml/rss/module/trackback/&amp;quot;/&amp;gt;
    &amp;lt;link rel=&amp;quot;dc:identifier&amp;quot; href=&amp;quot;&amp;lt;$MTEntryPermalink$&amp;gt;&amp;quot;/&amp;gt;
    &amp;lt;meta property=&amp;quot;dc:creator&amp;quot; content=&amp;quot;Bob DuCharme&amp;quot;/&amp;gt;
    &amp;lt;meta property=&amp;quot;dc:title&amp;quot; content=&amp;quot;&amp;lt;$MTEntryTitle encode_html=&amp;quot;1&amp;quot;$&amp;gt;&amp;quot;/&amp;gt;
    &amp;lt;meta property=&amp;quot;dc:date&amp;quot; content=&amp;quot;&amp;lt;$MTEntryDate format=&amp;quot;%Y-%m-%dT%H:%M:%S&amp;quot;&amp;gt;&amp;quot;/&amp;gt;
    &amp;lt;meta property=&amp;quot;dc:description&amp;quot; content=&amp;quot;&amp;lt;$MTEntryExcerpt encode_html=&amp;quot;1&amp;quot;$&amp;gt;&amp;quot;/&amp;gt;
    &amp;lt;link rel=&amp;quot;dc:subject&amp;quot; href=&amp;quot;http://www.snee.com/ns/blogcat/&amp;lt;$MTCategoryLabel$&amp;gt;&amp;quot;/&amp;gt;
&amp;lt;/meta&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I have a version with fewer tags and hard-coded values on the weblog&amp;rsquo;s &lt;a href=&#34;http://www.snee.com/bobdc.blog/&#34;&gt;main index page&lt;/a&gt;, so if you&amp;rsquo;re reading this entry from there, a View Source won&amp;rsquo;t show as much metadata as the permalink pages have.) In addition to the new tags above, the &lt;code&gt;html&lt;/code&gt; start-tag in the Movable Type template needs declarations for any referenced namespaces—in this case, dc and trackback.&lt;/p&gt;
&lt;p&gt;To make it all nice and well-formed, I also replaced the template&amp;rsquo;s &lt;code&gt;&amp;amp;laquo;&lt;/code&gt; and &lt;code&gt;&amp;amp;raquo;&lt;/code&gt; entity references that put the « and » characters near the top with the numeric character references &amp;amp;#171; and &amp;amp;#187;. I tried commenting out the DOCTYPE declaration in the Movable Type template, because these new &lt;code&gt;meta&lt;/code&gt; elements make the document invalid XHTML 1.0 and I wouldn&amp;rsquo;t be parsing it against a DTD anyway, but some odd interactions with the CSS made certain parts of the page disappear, so I left the DOCTYPE declaration alone. &lt;a href=&#34;http://home.ccil.org/~cowan/XML/tagsoup/&#34;&gt;TagSoup&lt;/a&gt; had no problem with the extra RDFa metadata, and didn&amp;rsquo;t include the DOCTYPE declaration in its output, so you might want to use that if you&amp;rsquo;re processing HTML files that contain RDFa metadata. (If you&amp;rsquo;re processing HTML from the wild, you&amp;rsquo;ll want to use TagSoup anyway.)&lt;/p&gt;
&lt;p&gt;Why is it worth all this trouble? Why is RDFa cool? Because document metadata that looks like the following is easy to read, easy to write, and easy to convert to RDF/XML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta about= &amp;quot;http://www.snee.com/bobdc.blog/2007/01/great_survey_of_rdfweb_develop.html&amp;quot;&amp;gt;
  &amp;lt;link rel=&amp;quot;trackback:ping&amp;quot; href=&amp;quot;http://madskills.com/public/xml/rss/module/trackback/&amp;quot;/&amp;gt;
  &amp;lt;link rel=&amp;quot;dc:identifier&amp;quot; href=&amp;quot;http://www.snee.com/bobdc.blog/2007/01/great_survey_of_rdfweb_develop.html&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:creator&amp;quot; content=&amp;quot;Bob DuCharme&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:title&amp;quot; content=&amp;quot;Great survey of RDF/web development tools&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:date&amp;quot; content=&amp;quot;2007-01-17T08:33:38&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:description&amp;quot; content=&amp;quot;For both reading and writing RDF....&amp;quot;/&amp;gt;
  &amp;lt;link rel=&amp;quot;dc:subject&amp;quot; href=&amp;quot;http://www.snee.com/ns/blogcat/RDF/OWL&amp;quot;/&amp;gt;
&amp;lt;/meta&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While everyone, including me, loves to beat RDF/XML, it&amp;rsquo;s not quite a dead horse—as an exchange format for moving metadata between applications, it&amp;rsquo;s just fine, and because a stylesheet such as the INRIA one can convert RDFa to RDF/XML, you can easily use RDFa metadata in a wide variety of applications.&lt;/p&gt;
&lt;p&gt;A new data format isn&amp;rsquo;t useful until there&amp;rsquo;s enough data in that format to drive some applications. I think that RDFa will be very useful, and making this modification to a Movable Type template automates the generation of useful RDFa metadata. Because Movable Type regenerated all of the weblog&amp;rsquo;s permalink entries after I changed the template, I now have lots of RDFa to play with, and I&amp;rsquo;ll have more each time I write a new weblog entry. And thanks to Fabien Gandon, I didn&amp;rsquo;t have to do any coding to make it happen!&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-653&#34;&gt;January 25, 2007 3:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Very interesting! Is this compatible with the Dublin Core recommended encoding guidelines (&lt;a href=&#34;http://www.dublincore.org/documents/dcq-html/&#34;&gt;http://www.dublincore.org/documents/dcq-html/&lt;/a&gt;)?&lt;/p&gt;
&lt;p&gt;For example, on my blog today, you can view source and see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;link href=&amp;quot;http://purl.org/dc/elements/1.1/&amp;quot; rel=&amp;quot;schema.DC&amp;quot; /&amp;gt;
&amp;lt;link href=&amp;quot;http://purl.org/dc/terms/&amp;quot; rel=&amp;quot;schema.DCTERMS&amp;quot; /&amp;gt;
&amp;lt;meta name=&amp;quot;DC.language&amp;quot; content=&amp;quot;en-US&amp;quot; /&amp;gt;
&amp;lt;meta name=&amp;quot;DC.type&amp;quot; content=&amp;quot;blog&amp;quot; /&amp;gt;
&amp;lt;meta name=&amp;quot;DC.publisher&amp;quot; content=&amp;quot;Scott C. Hudson&amp;quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;etc.&lt;/p&gt;
&lt;p&gt;With your approach, would I add a separate section, or add to my existing entry:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta name=&amp;quot;DC.language&amp;quot; property=&amp;quot;dc:language&amp;quot; content=&amp;quot;en-US&amp;quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-654&#34;&gt;January 25, 2007 4:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Scott,&lt;/p&gt;
&lt;p&gt;I hadn&amp;rsquo;t thought about that. It looks like the link elements are similar, having href and rel attributes, but the meta elements aren&amp;rsquo;t, with their name attributes. And, the approach for qualifying the names is obviously different&amp;ndash;RDFa uses a namespace prefix and a colon, which many people don&amp;rsquo;t like in an attribute value, but I&amp;rsquo;ve worked with enough XSLT to be used to it. And, I know of software that can understand those qualifiers, which doesn&amp;rsquo;t apply to the Dublin Core approach.&lt;/p&gt;
&lt;p&gt;Why would you put name qualifiers (DC and DCTERMS) after the period in the @rel values? It looks like &lt;a href=&#34;http://www.dublincore.org/documents/dcq-html/&#34;&gt;http://www.dublincore.org/documents/dcq-html/&lt;/a&gt; has them before.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-655&#34;&gt;January 25, 2007 5:20 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Check out section 2.7 on that link. That&amp;rsquo;s where I got the example from. One other component I didn&amp;rsquo;t add to my posted example is that the head has a profile attribute:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;head profile=&amp;quot;http://dublincore.org/documents/dcq-html/&amp;quot;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So for completeness sake, should I add the property attribute to each of my meta elements?&lt;/p&gt;
&lt;p&gt;Does your RDFa extraction handle this type of meta info, or will I have to have a meta about wrapper?&lt;/p&gt;
&lt;p&gt;&amp;ndash;Scott&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://bnode.org/&#34; title=&#34;http://bnode.org/&#34;&gt;Benjamin Nowack&lt;/a&gt; on &lt;a href=&#34;#comment-656&#34;&gt;January 26, 2007 7:13 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Scott,&lt;/p&gt;
&lt;p&gt;if you&amp;rsquo;re looking for something that&amp;rsquo;s more in line with the DC encoding guidelines, you might want to check out eRDF (just google for &amp;ldquo;embedded RDF&amp;rdquo;), which also doesn&amp;rsquo;t invalidate your HTML or XHTML 1.0.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf">RDF</category>
      
    </item>
    
    <item>
      <title>Great survey of RDF/web development tools</title>
      <link>https://www.bobdc.com/blog/great-survey-of-rdfweb-develop/</link>
      <pubDate>Wed, 17 Jan 2007 08:33:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/great-survey-of-rdfweb-develop/</guid>
      
      
      <description><div>For both reading and writing RDF.</div><div>&lt;p&gt;Lee Feigenbaum&amp;rsquo;s recent posting &lt;a href=&#34;http://www.thefigtrees.net/lee/blog/2007/01/using_rdf_on_the_web_a_survey.html&#34;&gt;Using RDF on the Web: A Survey&lt;/a&gt; is worth reading for anyone considering any kind of RDF development work, web-based or otherwise. At first, I thought that he was limiting himself by requiring that applications and tools be capable of both reading and writing RDF data, but after reading his list, I&amp;rsquo;m glad he did. I also found the wide choice of JSON-related systems to be interesting—they could lead the way to something that definitively answers the question &amp;ldquo;_________ is to RDF/XML as XML is to SGML.&amp;rdquo; I like n3, but at this point more RDF/XML alternatives built into working applications will give us all a broader perspective on what works best.&lt;/p&gt;
&lt;p&gt;I know I&amp;rsquo;ll be re-reading Lee&amp;rsquo;s post each time I&amp;rsquo;m considering the development of an RDF-based application. I hope he moves it to a non-weblog page where he keeps it updated over time, because this kind of information evolves quickly.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Making up URIs</title>
      <link>https://www.bobdc.com/blog/making-up-uris/</link>
      <pubDate>Fri, 12 Jan 2007 08:40:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/making-up-uris/</guid>
      
      
      <description><div>Or not.</div><div>&lt;p&gt;I love this &lt;a href=&#34;http://www.mindswap.org/blog/2006/12/14/tales-from-the-dark-side-continued/&#34;&gt;recent quote&lt;/a&gt; from Jim Hendler:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you and I decide that we will use the term &amp;ldquo;&lt;a href=&#34;http://www.cs.rpi.edu/&#34;&gt;http://www.cs.rpi.edu/&lt;/a&gt;~hendler/elephant&amp;rdquo; to designate some particular entity, then it really doesn&amp;rsquo;t matter what the other blind men think it is, they won&amp;rsquo;t be confused when they use the natural language term &amp;ldquo;elephant&amp;rdquo; which is not even close, lexigraphically, to the longer term you and I are using. And if they choose to use their own URI, &amp;ldquo;&lt;a href=&#34;http://www.other.blind.guys.org/elephant&#34;&gt;http://www.other.blind.guys.org/elephant&lt;/a&gt;&amp;rdquo;, it won&amp;rsquo;t get confused with ours.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you wrote a schema in which you defined a namespace and documented specific terms to mean specific things within that namespace, I can use those URIs to have the same meanings in my own application—but it&amp;rsquo;s not always that simple.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say that there&amp;rsquo;s a metadata standard called xyz, and they&amp;rsquo;ve declared a schema somewhere. My document at &lt;a href=&#34;http://www.snee.com/docs/mydoc1.xml&#34;&gt;http://www.snee.com/docs/mydoc1.xml&lt;/a&gt;, which uses this standard, begins like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;document xmlns:xyz=&amp;quot;http://www.xyz.org/schemas/docmetadata/&amp;quot;&amp;gt;
    &amp;lt;xyz:header&amp;gt;
      &amp;lt;xyz:foo bar=&amp;quot;56H&amp;quot;&amp;gt;northwest&amp;lt;/xyz:foo&amp;gt;
    &amp;lt;/xyz:header&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I feel confident that the following RDF triple makes sense to say that this document has a foo value of &amp;ldquo;northwest&amp;rdquo;, with no ambiguity about whose definition of &amp;ldquo;foo&amp;rdquo; I&amp;rsquo;m using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  &amp;lt;http://www.snee.com/docs/mydoc1.xml&amp;gt;
  &amp;lt;http://www.xyz.org/schemas/docmetadata/foo&amp;gt;
  &amp;quot;northwest&amp;quot;.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I&amp;rsquo;m confident, that is, unless xyz:foo appears in another context in the same document (for example, as a child of another element besides xyz:header) with a different value. In a somewhat related issue, bar is an attribute above, and while the xyz.org people defined it as having a specific meaning in their schema, in my document above that conforms to their schema it doesn&amp;rsquo;t belong to any namespace, so it doesn&amp;rsquo;t feel right to create an RDF predicate for the bar value that begins with &lt;a href=&#34;http://www.xyz.org/schemas/docmetadata/&#34;&gt;http://www.xyz.org/schemas/docmetadata/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Is the best practice for defining URIs for such information to just make up my own URI for its predicate around a domain name that I have control over, and then use OWL to define an equivalence if the xyz.org people (or anyone else) define their own URI and triples for bar and I want to aggregate their triples with mine?&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m guessing that this is the case based on the output of the MIT Simile &lt;a href=&#34;http://simile.mit.edu/RDFizers/&#34;&gt;RDFizer&lt;/a&gt; project&amp;rsquo;s RDF version of a sample &lt;a href=&#34;http://www.loc.gov/standards/mods/&#34;&gt;MODS&lt;/a&gt; document. (Must&amp;hellip; fight&amp;hellip; temptation to link to something Quadrophenia-related&amp;hellip;) The MODS URI &lt;a href=&#34;http://www.loc.gov/mods/v3&#34;&gt;http://www.loc.gov/mods/v3&lt;/a&gt; doesn&amp;rsquo;t appear anywhere in the RDFizer representation of the MODS data, and most properties in the RDFizer version are in the namespace &lt;a href=&#34;http://simile.mit.edu/2006/01/ontologies/mods3&#34;&gt;http://simile.mit.edu/2006/01/ontologies/mods3&lt;/a&gt;#. So, just as the RDFizer folk built something around the &lt;a href=&#34;http://simile.mit.edu&#34;&gt;http://simile.mit.edu&lt;/a&gt; namespace that they had control over, I should probably do the same with my own domain name for information from the xyz.org schema instead of trying to make up my own conventions for representing attributes and different contexts for the same element from the &lt;a href=&#34;http://www.xyz.org/schemas/docmetadata/&#34;&gt;http://www.xyz.org/schemas/docmetadata/&lt;/a&gt; URI. For example, &lt;a href=&#34;http://www.xyz.org/schemas/docmetadata/foo/bar&#34;&gt;http://www.xyz.org/schemas/docmetadata/foo/bar&lt;/a&gt; makes sense to me as a way to represent the bar attribute, but xyz.org is not my domain name to build naming conventions around, so I&amp;rsquo;d be better off representing this as &lt;a href=&#34;http://www.snee.com/ns/xyz/foo/bar&#34;&gt;http://www.snee.com/ns/xyz/foo/bar&lt;/a&gt;. Right?&lt;/p&gt;
&lt;p&gt;(I made up xyz.org as an example domain name as I wrote this, and just now looked at &lt;a href=&#34;http://www.xyz.org/&#34;&gt;the actual website&lt;/a&gt;—they&amp;rsquo;re concerned with bigger issues than namespace URIs, such as the Knights Templar and the Secrets of the Bible.)&lt;/p&gt;
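The OWL-equivalence idea above can be sketched in a few lines. This is a toy illustration with hand-rolled tuples rather than a real RDF library or OWL reasoner, and it reuses the made-up URIs from this post; the xyz.org URI for bar is the hypothetical one discussed above.

```python
# Toy sketch (not a real OWL reasoner): if my predicate and xyz.org's
# are declared owl:equivalentProperty, restate statements across both
# URIs so that a query against either one finds the same data.
OWL_EQ = "http://www.w3.org/2002/07/owl#equivalentProperty"

triples = {
    # my triple for the bar attribute, under a domain I control
    ("http://www.snee.com/docs/mydoc1.xml",
     "http://www.snee.com/ns/xyz/foo/bar", "56H"),
    # hypothetical mapping, if xyz.org later coins its own URI for bar
    ("http://www.snee.com/ns/xyz/foo/bar", OWL_EQ,
     "http://www.xyz.org/schemas/docmetadata/foo/bar"),
}

def expand(triples):
    """Restate each triple with every predicate declared equivalent to its own."""
    equiv = {}
    for s, p, o in triples:
        if p == OWL_EQ:
            equiv.setdefault(s, set()).add(o)
            equiv.setdefault(o, set()).add(s)
    out = set(triples)
    for s, p, o in triples:
        for q in equiv.get(p, ()):
            out.add((s, q, o))
    return out

expanded = expand(triples)
# A query using xyz.org's predicate now finds my statement too:
print(("http://www.snee.com/docs/mydoc1.xml",
       "http://www.xyz.org/schemas/docmetadata/foo/bar", "56H") in expanded)
# prints True
```

An OWL reasoner such as Pellet does this kind of propagation (and much more) for you; the point here is only that the equivalence declaration is data, so the aggregation needs no custom code per vocabulary.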
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-642&#34;&gt;January 12, 2007 9:51 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s why you should use example.org, example.com, example.net, or just example: all are reserved domain names.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://kontrawize.blogs.com/kontrawize/&#34; title=&#34;http://kontrawize.blogs.com/kontrawize/&#34;&gt;Anthony B. Coates&lt;/a&gt; on &lt;a href=&#34;#comment-643&#34;&gt;January 12, 2007 10:31 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This immediately reminds me of the &amp;ldquo;topic merging&amp;rdquo; in topic maps, where once two topics in a topic map were identified as referring to the same thing, they were merged into a single topic. For this discussion, a topic map topic is sufficiently similar to an RDF URI.&lt;/p&gt;
&lt;p&gt;The interesting thing is that some topic map engines implemented topic merging by doing a real merge within their internal data model, and others did it just by keeping track of the equivalence, but maintaining the original separate topics internally. So I guess both can work.&lt;/p&gt;
&lt;p&gt;Keeping track of multiple topics or URIs for the same thing must introduce some processing overhead, but it does have the advantage that you can &amp;ldquo;unmerge&amp;rdquo; later if you find that two topics were merged but should not have been (either they weren&amp;rsquo;t really the same, or the user had fat thumbs).&lt;/p&gt;
&lt;p&gt;Cheers, Tony.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-644&#34;&gt;January 12, 2007 11:51 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Right, that&amp;rsquo;s why I mentioned the possibility of using OWL to define the equivalence. That&amp;rsquo;s the kind of thing that makes OWL reasoners such as Pellet so much fun, in a semweb-geek kind of way.&lt;/p&gt;
&lt;p&gt;By Ed Davies on &lt;a href=&#34;#comment-646&#34;&gt;January 13, 2007 1:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sorry, I&amp;rsquo;m confused; is the first XML fragment supposed to be RDF/XML or just some arbitrary XML?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-647&#34;&gt;January 13, 2007 2:29 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The idea is that it&amp;rsquo;s arbitrary XML, and I want to create separate RDF representations of information in that XML using the namespace defined by xyz.org&amp;rsquo;s metadata schema.&lt;/p&gt;
&lt;p&gt;By Ed Davies on &lt;a href=&#34;#comment-648&#34;&gt;January 13, 2007 4:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The idea is that it&amp;rsquo;s arbitrary XML&lt;/em&gt;&amp;hellip; OK, thanks.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;I&amp;rsquo;m confident, that is, unless xyz:foo appears in another context in the same document (for example, as a child of another element besides xyz:header) with a different value.&lt;/em&gt; Finding another xyz:foo element implying the same subject but with a different object need not knock your confidence in this triple&amp;ndash;it&amp;rsquo;s only in contradiction with the idea that xyz:foo is a functional property.&lt;/p&gt;
&lt;p&gt;&amp;hellip;&lt;em&gt;bar is an attribute above&lt;/em&gt;&amp;hellip;&lt;em&gt;it doesn&amp;rsquo;t belong to any namespace, so it doesn&amp;rsquo;t feel right to create an RDF predicate for the bar value that begins with &lt;a href=&#34;http://www.xyz.org/schemas/docmetadata/&#34;&gt;http://www.xyz.org/schemas/docmetadata/&lt;/a&gt;.&lt;/em&gt; I&amp;rsquo;d definitely agree with this; it would be very presumptuous of you and not at all in the spirit of the game to define a new URI using their domain.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Selling content on the Internet, part 1</title>
      <link>https://www.bobdc.com/blog/bitpasspt1/</link>
      <pubDate>Mon, 08 Jan 2007 20:02:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/bitpasspt1/</guid>
      
      
      <description><div>Cheap!</div><div>&lt;p&gt;&lt;em&gt;2007-01-19 update:&lt;/em&gt; It looks like Bitpass is going under. While there&amp;rsquo;s no mention of it on their website, I just got email from them saying that &amp;ldquo;due to circumstances beyond our control, we are discontinuing our operations.&amp;rdquo; If anyone knows of a comparable service or a service with different ideas about enabling small vendors to sell content on the internet, please let me know.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been researching ways to buy and sell content over the Internet, especially when you want one fee of a dollar or less to buy access to multiple files. PayPal&amp;rsquo;s fees make such a low price impractical, but I&amp;rsquo;ve found an interesting alternative called &lt;a href=&#34;http://www.bitpass.com&#34;&gt;BitPass&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.bitpass.com/&#34;&gt;&lt;img src=&#34;http://www.bitpass.com/corp/images/bit_top_logo.gif&#34; alt=&#34;[BitPass logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;According to some BitPass &lt;a href=&#34;http://www.bitpass.com/corp/products/self-service-services.php&#34;&gt;marketing literature&lt;/a&gt;, &amp;ldquo;Pricing for your content can be as low as just one cent and you can sell via email, instant messaging or a web site&amp;rdquo;. To buy something, you must first &lt;a href=&#34;https://www.bitpass.com/spend/newaccnt/&#34;&gt;register as a buyer&lt;/a&gt;, which includes the selection of a payment option. For most payment options, you need to have a minimum balance of something like $3.00 deposited, but when you select the BitPass PayThru? option, each charge is taken directly from the specified source, such as PayPal. For example, if you spend ten cents on something, you&amp;rsquo;ll see ten cents deducted from your PayPal account, and if you never make another BitPass purchase, you will never have spent more than those ten cents. (For some odd reason, the question mark is part of the name, which leads to odd-looking questions on the generally useful &lt;a href=&#34;http://www.bitpass.com/corp/imedia/buyers-faq.php&#34;&gt;FAQ&lt;/a&gt; like &amp;ldquo;What is BitPass PayThru??&amp;rdquo; and &amp;ldquo;How does PayThru? help me?&amp;rdquo; Having seen how the UK tech news site The Register often headlines Yahoo! stories (&lt;a href=&#34;http://www.theregister.co.uk/2006/12/18/yahoo_messenger_security_flap/&#34;&gt;[1]&lt;/a&gt;, &lt;a href=&#34;http://www.theregister.co.uk/2006/10/26/yahoo_semel_job/&#34;&gt;[2]&lt;/a&gt;), I look forward to the day when they build a headline around this payment option.)&lt;/p&gt;
&lt;p&gt;You can learn more about the experience of selling content on BitPass if you&amp;rsquo;re willing to get a hands-on appreciation of the BitPass buyer experience. As an experiment, I&amp;rsquo;ve put &lt;a href=&#34;https://www.bitpass.com/gateway/0000115B/bitpasspt2/&#34;&gt;part 2&lt;/a&gt; of this post on BitPass, where it will cost you ten cents to read it. Or &lt;a href=&#34;https://www.bitpass.com/gateway/0000115B/bitpasspt2/bitpasspt2.mp3&#34;&gt;hear&lt;/a&gt; it—part of my experiment is to sell an XHTML file and an accompanying MP3 file as a pair, so I just read the entry as a podcast and added a little music to the beginning and end. Whether you follow the link to the XHTML file or the MP3 first, BitPass will ask you to become a customer, and then you&amp;rsquo;ll have access to both for 28 days. There&amp;rsquo;s no DRM on the MP3 file, so you can do whatever you like with it. If you already have at least ten cents in a PayPal account, the whole process is pretty quick and simple.&lt;/p&gt;
&lt;p&gt;Wherever this experiment leads, it won&amp;rsquo;t lead to my charging for entries on this weblog. I started bobdc.blog on my own domain name to give me a platform to play with the related technology, and this is just this week&amp;rsquo;s experiment. I have unrelated projects that may benefit from using BitPass, so I wanted to try it out and get your opinions on it and on its potential competitors. Any comments?&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-639&#34;&gt;January 8, 2007 10:19 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, I&amp;rsquo;m just not gonna see the second half of your post, not because I grudge the ten cents, but because the barriers to entry are too high. It wants me to create an account, or remember my Microsoft Passport one, which I think I do but I&amp;rsquo;m not sure; and then it would be either use Paypal, which I refuse to do because of their bad reputation in &lt;a href=&#34;http://www.paypalsucks.com&#34;&gt;certain circles&lt;/a&gt;, or deposit $3.00 with BitPass, which I don&amp;rsquo;t much want to bother with either.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a quote from my rotating .sig file:&lt;/p&gt;
&lt;p&gt;Micropayment advocates mistakenly believe that efficient allocation of resources is the purpose of markets. Efficiency is a byproduct of market systems, not their goal. The reasons markets work are not because users have embraced efficiency but because markets are the best place to allow users to maximize their preferences, and very often their preferences are not for conservation of cheap resources. &amp;ndash;Clay Shirky&lt;/p&gt;
&lt;p&gt;And here&amp;rsquo;s the &lt;a href=&#34;http://www.openp2p.com/pub/a/p2p/2000/12/19/micropayments.html?page=2&#34;&gt;whole article&lt;/a&gt;; it&amp;rsquo;s worth reading.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-640&#34;&gt;January 8, 2007 11:06 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;John,&lt;/p&gt;
&lt;p&gt;I completely understand. I almost added something to the entry about another issue with micropayments: if it&amp;rsquo;s too easy to pay, it&amp;rsquo;s too easy to game the system and steal, but if it&amp;rsquo;s too difficult&amp;hellip; then it&amp;rsquo;s too difficult.&lt;/p&gt;
&lt;p&gt;Jumping through a hoop and a half doesn&amp;rsquo;t seem like much trouble when we&amp;rsquo;re spending $34 on Amazon and we don&amp;rsquo;t want the wrong people charging to our accounts, but jumping through any hoops at all for something worth ten cents is rarely worth the trouble.&lt;/p&gt;
&lt;p&gt;I, and I assume others, just happen to have a few dollars ($14? $23?) sitting in a PayPal account, so it&amp;rsquo;s little trouble to spend 10 cents of it. For people who avoid having a PayPal account&amp;ndash;and I know there are good reasons to do so&amp;ndash;there&amp;rsquo;s no argument that it&amp;rsquo;s too much trouble.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2007">2007</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Generating a single, globally unique ID</title>
      <link>https://www.bobdc.com/blog/generating-a-single-globally-u/</link>
      <pubDate>Fri, 29 Dec 2006 18:33:59 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/generating-a-single-globally-u/</guid>
      
      
      <description><div>One that&#39;s XML-compliant and not too long.</div><div>&lt;p&gt;When something has a unique ID, it has identity, and you can do more with it. For example, you can link to it, and you can add metadata to it from &lt;a href=&#34;http://www.snee.com/xml/rdf-drdobbs.html&#34;&gt;anywhere&lt;/a&gt;. I wanted to be able to assign a unique ID to something with one or two keystrokes in Emacs. I came up with something that works, although I&amp;rsquo;m sure there are ways to make it work better.&lt;/p&gt;
&lt;p&gt;By &amp;ldquo;unique&amp;rdquo;, I mean reasonably unique in the whole universe. &amp;ldquo;Reasonably unique&amp;rdquo; is one of those phrases that English teachers hate, because by definition a unique string can&amp;rsquo;t come up a second time. The Wikipedia entries for &lt;a href=&#34;http://en.wikipedia.org/wiki/GUID&#34;&gt;GUID&lt;/a&gt; and &lt;a href=&#34;http://en.wikipedia.org/wiki/UUID&#34;&gt;UUID&lt;/a&gt; give a good overview of the world of globally unique ID generation, and the related entry for &lt;a href=&#34;http://en.wikipedia.org/wiki/Universally_Unique_Identifier&#34;&gt;Universally Unique Identifier&lt;/a&gt; gets into some probability figures for generating the same ID twice. Those figures, which compare favorably with the chance of being hit by a meteorite, make the IDs unique enough for me.&lt;/p&gt;
&lt;p&gt;Routines for creating these IDs typically generate a 16-byte number and represent it as a hexadecimal number of 32 digits. I wanted something that would definitely begin with a letter, so that it would work as an XML ID value, and I also wanted something shorter than 32 characters if possible. I came up with something 22 characters long that I can insert into a document with two keystrokes in Emacs. Some samples: W4QZtu5rTsWnfyiixxIuFQ, Y_JxAODuQ5uZiiBljYoj0w, DJ6AQkYgTnGal8aR9CM_Ig.&lt;/p&gt;
&lt;p&gt;The key to getting it shorter was to use a base64 representation of the 16-byte value instead of a hexadecimal one. Most programming languages offer a simple function to do this; for the 64 different digits used to represent possible values, they use the upper-case and lower-case Latin alphabet, the ten numerical digits, and two more characters. The default two extras are usually the plus sign and the slash, but these functions typically let you override that, often citing the needs of us XML folk as a reason for doing so. I used the hyphen and underscore.&lt;/p&gt;
&lt;p&gt;To make sure that the value begins with a letter, I first prepended an &amp;ldquo;i&amp;rdquo; onto it, but now I just repeat the value generation until I get one that begins with a letter. Here&amp;rsquo;s the whole thing in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Generate a 22-character UUID beginning with a letter.
# (base64 encode the raw bytes from a 16 octet UUID.)

import uuid # from http://zesty.ca/python/uuid.html
import sys
import base64

b64uid = &#39;00000000&#39;

# Keep generating until we have one that starts with a letter.
while (b64uid[0:1] &amp;lt; &#39;A&#39;) or \
        (b64uid[0:1] &amp;gt; &#39;z&#39;) or \
        ((b64uid[0:1] &amp;gt; &#39;Z&#39;) and (b64uid[0:1] &amp;lt; &#39;a&#39;)):
    uid = uuid.uuid4()
    b64uid = base64.b64encode(uid.bytes, &#39;-_&#39;)

b64uid = b64uid[0:22] # lose the &amp;quot;==&amp;quot; that finishes a base64 value

sys.stdout.write(b64uid)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I used &lt;code&gt;sys.stdout.write&lt;/code&gt; instead of &lt;code&gt;print&lt;/code&gt; because I don&amp;rsquo;t want that extra carriage return, which seems to show up even when I add a comma after the string to print.&lt;/p&gt;
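For what it&#39;s worth, on a current Python 3 interpreter (an assumption on my part; the code above targets Python 2 with the then-external uuid module), the standard library alone can do the same job:

```python
import base64
import uuid

def short_uuid():
    """Return a 22-character, XML-ID-safe base64 form of a random UUID,
    regenerating until the first character is a letter."""
    while True:
        # 16 random bytes encode to 24 base64 characters ending in "==";
        # "-" and "_" stand in for "+" and "/" to keep the value name-safe
        b64 = base64.b64encode(uuid.uuid4().bytes, altchars=b"-_")
        candidate = b64[:22].decode("ascii")  # drop the "==" padding
        if candidate[0].isalpha():
            return candidate

print(short_uuid())
```

The name short_uuid is my own; the loop-until-letter approach, the altchars pair, and the padding trim are exactly the ones described above.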
&lt;p&gt;In Emacs, I wanted &lt;code&gt;Ctrl+C i&lt;/code&gt; to insert an ID value at the cursor location, and when editing an XML file, I wanted the same keystroke to insert &amp;quot; id=&#39;{id-value-here}&#39;&amp;quot;. The following in my .emacs file did it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defun b64uuid () ; Insert base64 UUID
  (interactive)
  (shell-command &amp;quot;python c:/util/b64uuid.py&amp;quot; t)
)
(define-key global-map &amp;quot;\C-ci&amp;quot; &#39;b64uuid)


(defun xml-id ()
  (interactive)
  (insert &amp;quot; id=&#39;&#39;&amp;quot;)
  (backward-char 1)
  (shell-command &amp;quot;python c:/util/b64uuid.py&amp;quot; t)
)


(defun nxml-mode-additional-keys ()
  &amp;quot;Key bindings to add to `nxml-mode&#39;.&amp;quot;
  (define-key nxml-mode-map &amp;quot;\C-ci&amp;quot; &#39;xml-id)
  (define-key nxml-mode-map &amp;quot;\C-co&amp;quot; &#39;sgml-comment)
  ; more xml-specific keybindings...
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The elisp code, like the python code, could probably be more efficient, but it works. Ideally, the whole thing would be done in elisp, but writing all that would have taken me a lot more time than it took to patch together the existing python code. I&amp;rsquo;d love to hear any suggestions about improving this.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>XML 2006 paper done and available</title>
      <link>https://www.bobdc.com/blog/xml-2006-paper-done-and-availa/</link>
      <pubDate>Sat, 23 Dec 2006 08:58:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xml-2006-paper-done-and-availa/</guid>
      
      
      <description><div>The slides, too.</div><div>&lt;p&gt;Better late than never, I&amp;rsquo;ve finished my &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.html&#34;&gt;paper&lt;/a&gt; and PowerPoint &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.ppt&#34;&gt;slides&lt;/a&gt; for the &amp;ldquo;Relational database integration with RDF/OWL&amp;rdquo; talk that I gave in Boston. It&amp;rsquo;s a summary of work I described in this weblog as an ongoing project (&lt;a href=&#34;https://www.bobdc.com/blog/rdfowl-for-data-silo-integrati&#34;&gt;[1]&lt;/a&gt;, &lt;a href=&#34;http://www.snee.com/bobdc.blog/2006/10/all_the_personal_data_you_want.html&#34;&gt;[2]&lt;/a&gt;, &lt;a href=&#34;http://www.snee.com/bobdc.blog/2006/10/integrating_relational_databas.html&#34;&gt;[3]&lt;/a&gt;, &lt;a href=&#34;http://www.snee.com/bobdc.blog/2006/11/mapping_relational_data_to_rdf.html&#34;&gt;[4]&lt;/a&gt;), with a little more detail about how I actually did it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.html&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/howididit.jpg&#34; alt=&#34;[Young Frankenstein book]&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;240px&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If I were writing the piece for a paying publication (I eventually did, for &lt;a href=&#34;https://web.archive.org/web/20120717044029/http://www.devx.com/semantic/Article/38700&#34;&gt;devx&lt;/a&gt;), I would have put it through a few more drafts, but it was already late and I didn&amp;rsquo;t want to put off finishing it until after Christmas. Also, because I was more concerned with demonstrating the ideas presented than explaining them perfectly, the paper may explain some concepts without using completely correct vocabulary, so I&amp;rsquo;m happy to make corrections.&lt;/p&gt;
&lt;p&gt;To quote from the introduction:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In some research on who had actually used RDF/OWL to implement such an integration, I learned of several examples, but none that could be examined closely, so I decided to do one myself. My primary goal in this project was to use RDF/OWL to integrate two relational databases and then perform queries against the aggregate collection to answer realistic questions that could not be answered without the addition of an RDF/OWL ontology. Secondary goals included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Doing it all with free, portable, open-source software&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Making all the relevant files available so that others could reproduce my results (contained in the &lt;a href=&#34;http://www.snee.com/xml/xml2006/owlrdbms.zip&#34;&gt;owlrdbms.zip&lt;/a&gt; file—see its readme.txt file for instructions on using its files to execute the steps described in this paper)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Doing it with address book data, which has relevance to everyone&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Doing it all without any specialized programming (I did write a short XSLT 1.0 stylesheet utility, which is included in the zip file, but managed to avoid any coding that required compiling and deployment)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Now, maybe I&amp;rsquo;ll have time to read &lt;a href=&#34;http://2006.xmlconference.org/programme/&#34;&gt;some of the other papers&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Navigating the library metadata landscape</title>
      <link>https://www.bobdc.com/blog/navigating-the-library-metadat/</link>
      <pubDate>Wed, 20 Dec 2006 09:05:29 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/navigating-the-library-metadat/</guid>
      
      
      <description><div>With a subway map!</div><div>&lt;p&gt;I&amp;rsquo;ve always been a bit confused by the various library-related metadata standards. Recently while researching one of them I found an excellent PowerPoint presentation summarizing most of them by the &lt;a href=&#34;http://www.oclc.org/&#34;&gt;OCLC&lt;/a&gt;&amp;rsquo;s Eric Childress called &lt;a href=&#34;http://www.oclc.org/research/presentations/childress/fedlink_20031118.ppt&#34;&gt;Metadata Standards&lt;/a&gt;. (While I&amp;rsquo;m on the subject of the OCLC, don&amp;rsquo;t miss The Onion&amp;rsquo;s &lt;a href=&#34;https://www.theonion.com/dewey-decimal-system-helpless-to-categorize-new-jim-bel-1819568608&#34;&gt;mention of them&lt;/a&gt; last August.) He has individual slides on MARC 21, MODS, METS, ONIX, EAD, MIX, and more. He gets bonus points for adding descriptive comments to his slides, which too few people do.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://mapageweb.umontreal.ca/turner/meta/english/metamap.html&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/metadatasubway.jpg&#34; alt=&#34;[metamap metadata standards map]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;His third slide&amp;rsquo;s list of five types of metadata (Descriptive, Technical and Structural, Administrative, Rights, and Management) is something that one could quibble with, but he&amp;rsquo;s talking about his domain, which he knows better than I do. Without even getting to grouping of metadata categories, just the definition of metadata can be very subjective—I was once in a meeting where, after one participant described what &amp;ldquo;metadata&amp;rdquo; meant to his group, someone else responded &amp;ldquo;That&amp;rsquo;s not metadata! That&amp;rsquo;s control data!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Childress&amp;rsquo;s sixth slide illustrates the relationship between various standards and organizations as a subway map with color-coded subway lines representing different groupings of standards, organizations, domains, roles, and other concepts. An SVG-based &lt;a href=&#34;http://mapageweb.umontreal.ca/turner/meta/english/metamap.html&#34;&gt;interactive version&lt;/a&gt; of the same map links each acronym (and of course, it&amp;rsquo;s pretty much all acronyms) to descriptive text and relevant links for that acronym.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a pretty handy reference. I know I&amp;rsquo;ll be coming back to it, if only when I get to my New Years resolution of &amp;ldquo;write GRDDL XSLT for a few standard metadata formats.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://w3future.com/weblog/&#34; title=&#34;http://w3future.com/weblog/&#34;&gt;Sjoerd Visscher&lt;/a&gt; on &lt;a href=&#34;#comment-628&#34;&gt;December 20, 2006 10:06 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Too bad the SVG is invalid XML. Firefox 2.0.0.1 refuses to show it.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-629&#34;&gt;December 20, 2006 11:30 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;With the only valid SVG that I tried displaying with Firefox 2.0, it showed the XML source, not the image. That same example, and the subway map, did display using IE with Adobe&amp;rsquo;s SVG add-in.&lt;/p&gt;
&lt;p&gt;By Pas B on &lt;a href=&#34;#comment-630&#34;&gt;December 21, 2006 9:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Firefox 1.5.0.8 no-likey, either:&lt;/p&gt;
&lt;p&gt;XML Parsing Error: prefix not bound to a namespace&lt;br /&gt;
Location: &lt;a href=&#34;http://mapageweb.umontreal.ca/turner/meta/english/meta_v12.svg&#34;&gt;http://mapageweb.umontreal.ca/turner/meta/english/meta_v12.svg&lt;/a&gt;&lt;br /&gt;
Line Number 309, Column 1:&lt;br /&gt;
[&lt;br /&gt;
^]{}&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>RDF versus XQuery</title>
      <link>https://www.bobdc.com/blog/rdf-versus-xquery/</link>
      <pubDate>Wed, 13 Dec 2006 21:01:32 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdf-versus-xquery/</guid>
      
      
      <description><div>Different tools for different problems.</div><div>&lt;p&gt;Danny Ayers recently emailed me about a &lt;a href=&#34;http://lists.w3.org/Archives/Public/public-sweo-ig/2006Dec/0080.html&#34;&gt;posting by IBM&amp;rsquo;s Lee Feigenbaum&lt;/a&gt; on the W3C&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/2001/sw/sweo/&#34;&gt;Semantic Web Education and Outreach Interest Group&lt;/a&gt; &lt;a href=&#34;http://lists.w3.org/Archives/Public/public-sweo-ig/&#34;&gt;mailing list&lt;/a&gt;. Lee had written about a colleague&amp;rsquo;s concerns about semantic web technologies, and Danny asked for my thoughts on the issue. I e-mailed him a few paragraphs, and since then I thought that I might as well post them here, with a bit of copy-editing and a few extra thoughts.&lt;/p&gt;
&lt;p&gt;Lee&amp;rsquo;s colleague&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;expressed concerns that SW technologies (and RDF / SPARQL in particular) may fall short in one prominent area in which XML / XQuery shines: dealing with content-oriented (often mixed content) documents. He was concerned about this given some of our claims about the value of RDF/SW technologies as a unifying environment for data and metadata.&lt;/p&gt;
&lt;p&gt;He gave various examples ranging from insurance policies to resumes to rental agreements, with the basic idea being that XQuery can easily answer questions that involve searching within a document (or, more-so, searching for text in a particular paragraph of a document, perhaps with emphasis added) which uses XML markup. He wondered aloud and we discussed what the SW approach to this would be, and we agreed that it&amp;rsquo;s lacking right now. He expressed worry that whereas XML can wrap data that might be best expressed as relational or RDF data (and then join that data in XQuery queries with document data), the RDF world may not have as nice a story.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Yes, RDF and related technologies fall short in areas where XML and XQuery shine, but XML and XQuery fall short in areas where RDF shines. (And they both fall short in areas where relational databases shine, and&amp;hellip; etc.) RDF is a data model. Certain problem domains map very well to that data model, especially large collections of assignments of values to objects that don&amp;rsquo;t normalize into relational tables or even a single XML schema well. An add-on like OWL makes it easier to define relationships between seemingly unrelated classes of information, making it easier to use the aggregate sources together.&lt;/p&gt;
&lt;p&gt;RDF can add a lot to a publishing system, but tracking the relationship between in-line elements and their containing block elements (that is, mixed content) is not something it can help with much. For example, it can be used to store metadata about document components and associations as document files move through a workflow. (So can plain XML as retrieved by XQuery, but RDF-based data from documents in different formats can be aggregated and used with less custom coding.)&lt;/p&gt;
&lt;p&gt;For some perspective on what RDF can contribute to an XML-based system, it helps to forget one thing (RDF/XML—everything I describe here would work just fine with other RDF syntaxes) and to remember something else: RDF&amp;rsquo;s ability to store metadata about anything with a URI means that it can be used to track information about any XML element with its own ID. In the case of block elements, this is useful for the publishing industry because if one block of a document stores a recipe, another a book excerpt, and another a picture, there will be separate metadata to store about each. (For this sort of thing, I think that RDFa will help to lure back people who were scared off by RDF/XML.) Even inline elements as independent units to track can have value added by RDF if they have an ID; a linking element may have a link type assigned, the date that the link&amp;rsquo;s validity was last verified, and other metadata. To take advantage of an inline element&amp;rsquo;s relationship to its text node siblings and their containing element, though, you&amp;rsquo;ll need something that can parse and read the combination such as an XSLT processor or, for sufficiently large XML, an XQuery processor.&lt;/p&gt;
&lt;p&gt;Searching within documents is certainly where XQuery shines, but unless you&amp;rsquo;re using an XQuery engine for pure substring search (for example, &amp;ldquo;show me which documents have the string &amp;lsquo;fireplace&amp;rsquo; in them&amp;rdquo;), the insurance policy and rental agreement examples would only work well with XQuery if all of the documents conformed to the same schema. The RDF/OWL strength that makes it popular for semantic web work is its ability to query collections of data in the same domain that aren&amp;rsquo;t necessarily all of identical structure. A collection of insurance policies from different companies will have some fields in common, some different fields, some fields that look different but mean the same thing&amp;hellip; treating them as a consistent collection will take a lot of XQuery custom coding, but with RDF + SPARQL, it will only take the application of an increasingly popular standard way of specifying the semantics of each company&amp;rsquo;s forms (OWL) to treat the collection as a single aggregate to query. If you add a set of insurance forms from another insurance company to the set, you only need to add a little more to your OWL, and you can leave your SPARQL queries alone. Done the XQuery way, accounting for this new data will mean checking all your FLWOR expressions to see whether they need revision.&lt;/p&gt;
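The insurance example above can be sketched in plain Python. This is not a real OWL or SPARQL engine, and every source, field, and record name below is invented; the point is only the shape of the approach: the &quot;ontology&quot; is a declarative mapping from each company&#39;s field names to one shared vocabulary, so adding a new source means extending the mapping, while the query itself never changes.

```python
# Toy illustration of OWL-style property equivalence (all names invented):
# each insurer's records use different field names, a declarative mapping
# rewrites them into one canonical vocabulary, and queries against the
# canonical names stay fixed as new sources are added.

# Canonical property for each source's local field name
# (loosely analogous to owl:equivalentProperty declarations).
ONTOLOGY = {
    "acme":   {"policy_no": "policyNumber", "covered": "insured_party"},
    "globex": {"PolicyID": "policyNumber", "Holder": "insured_party"},
}

def normalize(source, record):
    """Rewrite one record into the canonical vocabulary."""
    mapping = ONTOLOGY[source]
    return {mapping.get(field, field): value for field, value in record.items()}

def query(records, prop, value):
    """The query stays the same no matter how many sources are integrated."""
    return [r for r in records if r.get(prop) == value]

data = [
    normalize("acme",   {"policy_no": "A-1", "covered": "Alice"}),
    normalize("globex", {"PolicyID": "G-9", "Holder": "Alice"}),
]
print(query(data, "insured_party", "Alice"))  # both records match
```

Adding a third insurer only means adding one more entry to `ONTOLOGY`; the XQuery equivalent would mean revisiting every FLWOR expression that touches the data.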
&lt;p&gt;My &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/188.html&#34;&gt;XML 2006 talk&lt;/a&gt; was unfortunately in the same time slot as &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/57.html&#34;&gt;another one&lt;/a&gt; on integration of different data sources using RDF/OWL, and this other one used XQuery as well. I&amp;rsquo;m looking forward to finding out more about what Ken and Ronald did and how they did it; more information is available at a &lt;a href=&#34;http://www.rrecktek.com/xml2006/&#34;&gt;page they did for the project&lt;/a&gt;, although I haven&amp;rsquo;t had a chance to look closely at it yet.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By Conor Ryan on &lt;a href=&#34;#comment-626&#34;&gt;December 18, 2006 10:16 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Link to XML2006 project page for Ken and Ronald, &lt;a href=&#34;http://www.rrecktek.com/xml2006/,&#34;&gt;http://www.rrecktek.com/xml2006/,&lt;/a&gt; is not working.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-627&#34;&gt;December 19, 2006 1:52 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It looks like the whole site was down and is now back up.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Home from XML 2006</title>
      <link>https://www.bobdc.com/blog/home-from-xml-2006/</link>
      <pubDate>Sun, 10 Dec 2006 09:00:36 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/home-from-xml-2006/</guid>
      
      
      <description><div>New things for the future, interesting things from the past.</div><div>&lt;p&gt;Since my last posting, some weblogs have mentioned that I was blogging the XML 2006 conference, so I feel bad that I haven&amp;rsquo;t gotten to my second posting about it until after the end of the conference. Most of my time sitting at a computer in Boston was spent on a project for a client, and there was enough of this that I had to skip several talks that I wanted to see. (For a little multi-tasking, I reviewed some project documents while Jason Hunter discussed &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/50.html&#34;&gt;Web Publishing 2.0&lt;/a&gt;. Jason was more interesting.)&lt;/p&gt;
&lt;p&gt;There were plenty of presentations and conversations that gave me a lot of good ideas. Fabrice Desré&amp;rsquo;s talk &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/9.html&#34;&gt;Building Dynamic Applications With Mozilla, REX and XQuery&lt;/a&gt; has me looking forward to trying to build an application around Mozilla and the other components he described, and the discussion after my presentation gave me ideas about ways to build onto my OWL/RDBMS integration demo. An interesting point came up in the questions after my talk, when Claudia Lucía Jimenez-Guarin (who had spoken on a panel about Agile XML Development that I chaired) asked about trying to integrate data from a different ontology into your data. At first I wasn&amp;rsquo;t sure what to say, and then I remembered that much of the point of my talk was that OWL&amp;rsquo;s ability to describe relationships between data from different sources is a large part of its power, because (with the right query engine or other software that can understand those relationships) it&amp;rsquo;s the key to using data from different sources as an aggregate whole that is greater than the sum of its parts. Once the paper and slides for my talk are available on the web I&amp;rsquo;ll include links here.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m also going to stay in touch with Ken Sall and Ronald Reck to learn more about their &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/57.html&#34;&gt;Applying XQuery and OWL to The World Factbook, Wikipedia and Project Gutenberg&lt;/a&gt; project. Unfortunately, their presentation on integrating different data sources with OWL took place at the same time as my presentation on integrating different data sources with OWL. I&amp;rsquo;ll write more about theirs here once I learn more.&lt;/p&gt;
&lt;p&gt;The annual DocBook dinner organized by Norm Walsh was a lot of fun. When someone there mentioned that he wasn&amp;rsquo;t editing XML in Emacs with &lt;a href=&#34;http://www.thaiopensource.com/nxml-mode/&#34;&gt;nxml&lt;/a&gt; mode because of his fondness for the keystrokes in the psgml mode for editing SGML (I&amp;rsquo;ve written a full book chapter about this, and that chapter is &lt;a href=&#34;http://www.snee.com/bob/sgmlfree/&#34;&gt;available for free&lt;/a&gt;), I said that I had written some Emacs macros to make nxml fill in some of the psgml gaps. After reviewing my .emacs file, I&amp;rsquo;m not sure which macros those are, so I just posted &lt;a href=&#34;http://www.snee.com/xml/nxmlmacros.html&#34;&gt;several candidates&lt;/a&gt; if anyone&amp;rsquo;s interested.&lt;/p&gt;
&lt;p&gt;In Jon Bosak&amp;rsquo;s closing keynote of the conference, he told some great stories about XML&amp;rsquo;s birth from the inner core of the SGML community as he set the stage for a discussion of the current state of markup technology innovation. One story, concerning Charles &amp;ldquo;Father of SGML&amp;rdquo; Goldfarb&amp;rsquo;s insistence that even documents with no DOCTYPE declaration have an implied DTD, included references to Kant and Ben Jonson and had the old school SGML people laughing so hard that IBM&amp;rsquo;s Sharon Adler nearly choked on her drink. (When working on my &lt;a href=&#34;http://www.snee.com/bob/xmlann/index.html&#34;&gt;XML: The Annotated Specification&lt;/a&gt; book for Charles&amp;rsquo; Prentice Hall series, I remember long battles over SGML-rooted concepts that he insisted were implied in XML but that I kept pointing out were never mentioned in the specification that I was annotating. I certainly learned a lot from him though, while working on that and, before that one, a book on free SGML software that included the chapter on psgml Emacs mode.) There was lots of laughter and knowing nods from the veterans as Jon brought up SGML complexities that were painfully factored out by the working group. I wonder if those present at the dinner who hadn&amp;rsquo;t been working with XML as long were a bit puzzled by these reactions to talk of debates about DOCTYPE declaration syntax. They certainly didn&amp;rsquo;t double over at the mention of debates about whitespace handling like the people at the front tables did.&lt;/p&gt;
&lt;p&gt;In general, it was great to catch up at the conference with other old friends and some former and current LexisNexis employees, and to get confidential opinions from key players in the field on new developments. Some regulars and semi-regulars who were missed at the conference this year included Eve Maler, Tim Bray and Lauren Wood, Paul Prescod, Uche Ogbuji (although it was great to finally meet his brother Chimezie), Dale Waldt, Eric Freese (who has a &lt;a href=&#34;http://groups.yahoo.com/group/Adopt_5_More/&#34;&gt;grander project&lt;/a&gt; under way), Edd Dumbill, Sean McGrath, Rick Jelliffe, Henry Thompson&amp;hellip; it was a smaller conference, and with the opening and closing theme being the tenth anniversary of the announcement that XML even existed, this conference reminded me of those days, when the same annual gathering (then named &amp;ldquo;SGML 19yy&amp;rdquo;) was so much smaller than it got during the dot com boom.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Settled in at XML 2006</title>
      <link>https://www.bobdc.com/blog/settled-in-at-xml-2006/</link>
      <pubDate>Mon, 04 Dec 2006 17:55:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/settled-in-at-xml-2006/</guid>
      
      
      <description><div>The biggest annual XML event of them all.</div><div>&lt;p&gt;I just got into my Boston hotel room for XML 2006, a conference I&amp;rsquo;ve attended in one form or another every year since it was called SGML 95. On Thursday afternoon I&amp;rsquo;m giving a presentation titled &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/188.html&#34;&gt;Relational database integration with RDF/OWL&lt;/a&gt; on a project I&amp;rsquo;ve written about several times (&lt;a href=&#34;https://www.bobdc.com/blog/integrating-relational-databas&#34;&gt;[1]&lt;/a&gt;, &lt;a href=&#34;http://www.snee.com/bobdc.blog/2006/11/mapping_relational_data_to_rdf.html&#34;&gt;[2]&lt;/a&gt;) here; I&amp;rsquo;ll be sure to mention the help I got from the excellent comments for those entries of my weblog and several leading up to them. I was also asked to keep a presentation I have on XHTML2 and Publishing warm in case someone in the Publishing track doesn&amp;rsquo;t show up. It&amp;rsquo;s based on some research I did for the &lt;a href=&#34;http://www.prismstandard.org&#34;&gt;PRISM&lt;/a&gt; group on what XHTML2 could do for magazine publishers and other general interest content publishers. (Quite a bit, as it turned out.)&lt;/p&gt;
&lt;p&gt;On Tuesday afternoon I&amp;rsquo;ll host a panel on &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/155.html&#34;&gt;Agile XML Development&lt;/a&gt; featuring David Carver, Tony Coates, and Claudia Lucia Jimenez Guarin. (I hope that Tony brings his electric ukulele to Boston, but not necessarily to this panel.) I&amp;rsquo;ll host another panel on Thursday morning in the Enterprise XML track in which Ralph Hodgson will present on &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/119.html&#34;&gt;Ontology-Based XML Schemas for Interoperability between Systems and Tools&lt;/a&gt; (I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/schema-language-victory-and-ow#i110&#34;&gt;already mentioned&lt;/a&gt; how much I&amp;rsquo;m looking forward to that) and Cheryl Connors, Mary Ann Malloy, and Ed Masek of the MITRE Corporation will talk about &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/103.html&#34;&gt;Enabling Secure Interoperability among Federated National Entities&lt;/a&gt;. I almost ended up on a panel on the FEMA Common Alerting Protocol, but luckily I didn&amp;rsquo;t, because I barely know how to spell FEMA.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s always both fun and frustrating looking at the &lt;a href=&#34;http://2006.xmlconference.org/programme/&#34;&gt;program&lt;/a&gt; and picking out what I&amp;rsquo;m going to see. The frustration comes from seeing simultaneous talks that I want to attend—I&amp;rsquo;m already sorry that Michael Kay&amp;rsquo;s talk on &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/26.html&#34;&gt;Meta-stylesheets&lt;/a&gt; takes place at the same time as the talk from MITRE folk, and the &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/154.html&#34;&gt;XML Pipeline Processing panel&lt;/a&gt; is in the same time slot as the Agile XML Development panel. The only other talk on OWL is at the same time as my own, probably because mine was accepted as a &amp;ldquo;late-breaking&amp;rdquo; entry and missed the initial coordination of talk themes.&lt;/p&gt;
&lt;p&gt;Wednesday, I know that Marc Basch&amp;rsquo;s &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/89.html&#34;&gt;Case Study: Managing XML for a Global Content Delivery Platform&lt;/a&gt; will be good, because I did some peripheral work on that system while at LexisNexis, and Marc and a lot of sharp people have put together a good system that tackles some difficult (and common, for an international company) problems. Betty Harvey has taken a hard look at an issue that I and many others have wondered about: &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/73.html&#34;&gt;UML from an XML Perspective—Is the Hype Justified?&lt;/a&gt; As an Innodata Isogen employee, though, I should really go to the &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/150.html&#34;&gt;Panel on Content Management System APIs&lt;/a&gt; that competes with Betty.&lt;/p&gt;
&lt;p&gt;I won&amp;rsquo;t lay out my whole plan of what to see, because doing it on the fly and wandering from room to room is part of the fun. Just hanging out with people and finding out the real inside gossip on standards development, business relationships, and secret personal projects is the best part of the conference; I probably won&amp;rsquo;t get around to any tourism or evening work on my own personal projects, as I usually do on business trips, because I&amp;rsquo;d rather just hang out eating and drinking with people. I&amp;rsquo;ve often said that I&amp;rsquo;d rather go to this annual conference than a high school reunion, because I&amp;rsquo;ll see more old friends there. And now, back to mumbling through the slides for my presentation&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.megginson.com/blogs/quoderat/&#34; title=&#34;http://www.megginson.com/blogs/quoderat/&#34;&gt;David Megginson&lt;/a&gt; on &lt;a href=&#34;#comment-617&#34;&gt;December 4, 2006 9:34 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Welcome to Boston, Bob (and everyone else reading this). I apologize for not having a blog aggregation feed set up this year &amp;ndash; it looks like it would have been useful.&lt;/p&gt;
&lt;p&gt;By Keith Fahlgren on &lt;a href=&#34;#comment-618&#34;&gt;December 5, 2006 10:29 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;David: I&amp;rsquo;ll try to keep a list of the blogs I find (as I find them) here: &lt;a href=&#34;http://www.oreillynet.com/xml/blog/2006/12/xml_conf_2006_first_day.html&#34;&gt;http://www.oreillynet.com/xml/blog/2006/12/xml_conf_2006_first_day.html&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Quaint, old-world Europe</title>
      <link>https://www.bobdc.com/blog/quaint-oldworld-europe/</link>
      <pubDate>Sun, 03 Dec 2006 19:04:58 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/quaint-oldworld-europe/</guid>
      
      
      <description><div>A glimpse of some old technology that&#39;s still often useful.</div><div>&lt;p&gt;Gran Via is one of Madrid&amp;rsquo;s main streets, and while walking through the rain looking for its &lt;a href=&#34;http://www.museodeljamon.es/&#34;&gt;Museum of Ham&lt;/a&gt; (Madrid has six of these diner-like &amp;ldquo;museums&amp;rdquo;, and I&amp;rsquo;d already been to two that day, but neither had the gift shop with the crucial Museo del Jamon schwag) I passed this place:&lt;/p&gt;
&lt;img src=&#34;http://static.flickr.com/99/312312477_2a7ae460c5.jpg&#34; width=&#34;400px&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;Madrid sign advertising telnet&#34;/&gt;
&lt;p&gt;It looked like a video game arcade, but their advertisement that you could come in and telnet warmed my heart. I&amp;rsquo;ve pushed telnet to &lt;a href=&#34;http://www.xml.com/pub/a/2004/12/15/telnet-REST.html&#34;&gt;places it shouldn&amp;rsquo;t really go&lt;/a&gt; for fun and profit, and the ability of this lightweight, thirty-seven-year-old program to poke around into odd places is part of its appeal, even in a time when you rarely need to go far to find wi-fi access for all of your twenty-first century web applications. A lit-up sign on a main street of one of the world&amp;rsquo;s great cities is definitely one of these odd places.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/tourism">tourism</category>
      
    </item>
    
    <item>
      <title>Schema language victory (and OWL)</title>
      <link>https://www.bobdc.com/blog/schema-language-victory-and-ow/</link>
      <pubDate>Tue, 28 Nov 2006 21:21:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/schema-language-victory-and-ow/</guid>
      
      
      <description><div>Winning, losing, and influencing.</div><div>&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/The_Velvet_Underground&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/vu.jpg&#34; class=&#34;rightAlignedOpeningPicture&#34; alt=&#34;[Velvet Underground]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I was happy to see &lt;a href=&#34;http://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax&#34;&gt;Tim Bray&amp;rsquo;s endorsement&lt;/a&gt; of Elliotte Rusty Harold&amp;rsquo;s &lt;a href=&#34;http://cafe.elharo.com/xml/relax-wins/&#34;&gt;declaration that RELAX Wins&lt;/a&gt;, although given the size of Tim&amp;rsquo;s audience, I wouldn&amp;rsquo;t be surprised if a reasonable number of them barely knew that &lt;a href=&#34;http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/&#34;&gt;W3C Schema&lt;/a&gt; has any competition. It&amp;rsquo;s great that Tim reminded them about this competition, and of course RELAX NG is better, but still, W3C Schema is the context for most non-publishing XML development out there. (When I see how many publishing operations still use DTDs, I could make an argument that neither W3C Schema nor RELAX NG has won yet, because a lot of votes haven&amp;rsquo;t been cast.)&lt;/p&gt;
&lt;p&gt;Note that when Elliotte says that RELAX wins, he includes the qualifier &amp;ldquo;among the XML cognoscenti&amp;rdquo;. This reminded me of how Tim &lt;a href=&#34;http://lists.xml.org/archives/xml-dev/200203/msg00569.html&#34;&gt;once said&lt;/a&gt; that &amp;ldquo;SGML &amp;lsquo;mattered&amp;rsquo; in the same sense that Robert Johnson and the Velvet Underground matter to popular music, but nobody bought their records and 99% of the programming profession ignored SGML. Being important or mattering is not equivalent to success in the sense that trade-conference audiences think of it.&amp;rdquo; While giving a class on taking advantage of W3C schema typing from XSLT 2.0 and XQuery, Michael Kay said of W3C schema, based on his experience with clients, &amp;ldquo;I&amp;rsquo;m not selling it to you because it&amp;rsquo;s good; I&amp;rsquo;m selling it to you because it&amp;rsquo;s necessary.&amp;rdquo; He makes a similar good point in an &lt;a href=&#34;http://lists.xml.org/archives/xml-dev/200611/msg00189.html&#34;&gt;xml-dev discussion&lt;/a&gt; of this victory celebration.&lt;/p&gt;
&lt;p&gt;If the bad news is that the majority of XML developers have picked an ugly, convoluted syntax that is difficult to maintain when they store metadata about their types, the good news is that at least they&amp;rsquo;re storing metadata about their types in parsable XML. To get back to my running theme of using RDF/OWL technology to take advantage of existing data, I&amp;rsquo;m looking forward (speaking of trade show audiences) to chairing Ralph Hodgson&amp;rsquo;s &lt;a href=&#34;http://2006.xmlconference.org/programme/presentations/119.html&#34;&gt;Ontology-Based XML Schemas for Interoperability between Systems and Tools&lt;/a&gt; presentation at XML 2006 next week. With a co-worker at TopQuadrant and some people at NASA, Ralph put together a system to automate the creation of RDF and OWL structures from W3C schemas. It fits well with the increasingly important semantic web idea that we should do what we can with the data (and metadata) that&amp;rsquo;s out there instead of talking about the data that should be out there. And TopQuadrant is doing it with real data with a real client—the organization that put men on the moon.&lt;/p&gt;
&lt;p&gt;I won&amp;rsquo;t let Ralph off too easily, though; one of the best parts of chairing a session is that you have your own microphone and get to ask all the questions you want. I&amp;rsquo;m sure I&amp;rsquo;ll have several.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Word 2003 XML</title>
      <link>https://www.bobdc.com/blog/word-2003-xml/</link>
      <pubDate>Sun, 26 Nov 2006 22:46:46 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/word-2003-xml/</guid>
      
      
      <description><div>Better than I expected, but good enough for a production system?</div><div>&lt;p&gt;After styling some headers in a sample Word document as Heading 1, Heading 2, Heading 3, and so forth, I was pleased to see that when I saved the document as Word 2003 XML, sub-section container elements were wrapped around the appropriate elements, grouping a Heading 3 title and all block elements up to the next Heading 3 (or higher) block together, nested within the group that began with a Heading 2, etc. (Although Open Office 2.0 offers Word 2003 XML as a Save As choice, it does not add these containers.)&lt;/p&gt;
&lt;p&gt;I did this with a pretty simple example, so I don&amp;rsquo;t know how well it would work with more serious documents. Has anyone incorporated the use of this XML into a production system? How robust is it? I understand that getting users to use the styles consistently is a classic problem; for now, I&amp;rsquo;m more interested in how well the Word 2003 XML itself held up to the demands of a production XML system.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://surguy.net/&#34; title=&#34;http://surguy.net/&#34;&gt;Inigo&lt;/a&gt; on &lt;a href=&#34;#comment-600&#34;&gt;November 27, 2006 4:52 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yes, I&amp;rsquo;ve used WordML 2003 in several production systems. It largely works - what causes problems is the occasional unexpected exception.&lt;/p&gt;
&lt;p&gt;For example, hyperlinks can be represented as a w:hlink, or as a set of (w:fldCode begin, w:instrText HYPERLINK, w:fldCode end) items. If you click on a hyperlink in a document, and then save it, then the XML changes from using the first representation to the second representation!&lt;/p&gt;
&lt;p&gt;Word automatically puts in wx:subsection elements around headings (actually, it&amp;rsquo;s around blocks delimited by styles marked with an outlineLvl). This is great, and really useful for processing. However, if you&amp;rsquo;re using WordML&amp;rsquo;s capability of including XML in your own namespace in the document, then it doesn&amp;rsquo;t put the wx:subsections in at all, and all your code depending on them breaks!&lt;/p&gt;
&lt;p&gt;And, as &lt;a href=&#34;http://www.griffinbrown.co.uk/blog/PermaLink,guid,f19a3daa-6cbe-4621-8add-b64f532c6743.aspx&#34;&gt;Andrew of Griffin Brown&lt;/a&gt; discovered, Word Service Pack 2 changes the format slightly. The example he describes is not too bad, I don&amp;rsquo;t think - the real problem is that elements like&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[w:r]
  [w:instrText]HYPERLINK blahblahblah[/w:instrText]
[/w:r]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;can (seemingly randomly) change into&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[w:r]
  [w:instrText]HYPERL[/w:instrText]
[/w:r]

[w:r]
  [w:instrText]INK blahblahblah[/w:instrText]
[/w:r]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which is much harder to process.&lt;/p&gt;
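&lt;p&gt;One way to cope with such splits, sketched here in Python, is to join consecutive instruction-text runs back together before looking for the HYPERLINK keyword. (The (kind, text) pair representation of a run is an assumption standing in for whatever an earlier parsing step produces; it is not a real WordML API.)&lt;/p&gt;

```python
# Sketch: Word may split one field instruction ("HYPERLINK ...") across
# several w:instrText runs. Joining consecutive instruction-text runs
# restores the logical field code. The (kind, text) pairs here are a
# hypothetical representation produced by some earlier parse step.
def merge_instr_runs(runs):
    merged, buffer = [], []
    for kind, text in runs:
        if kind == "instrText":
            buffer.append(text)          # accumulate split pieces
        else:
            if buffer:
                merged.append(("instrText", "".join(buffer)))
                buffer = []
            merged.append((kind, text))
    if buffer:
        merged.append(("instrText", "".join(buffer)))
    return merged

runs = [("instrText", "HYPERL"),
        ("instrText", "INK blahblahblah"),
        ("text", "link text")]
print(merge_instr_runs(runs))
# [('instrText', 'HYPERLINK blahblahblah'), ('text', 'link text')]
```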
&lt;p&gt;So, in summary: yes, it&amp;rsquo;s usable. Yes, I&amp;rsquo;ve used it in production systems. But the documentation is poor, I&amp;rsquo;ve had to discover gotchas like these by experience, and it&amp;rsquo;s not as easy to process as ODF.&lt;/p&gt;
&lt;p&gt;The Word 2007 OOXML documentation is a useful reference, actually: the XML format hasn&amp;rsquo;t changed very much between the two, and the docs are much better.&lt;/p&gt;
&lt;p&gt;By Keith Fahlgren on &lt;a href=&#34;#comment-604&#34;&gt;November 29, 2006 10:40 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I ended up rewriting most of the conversion tools used for incoming manuscripts around Word 2003 XML. It was a relatively easy decision not because it was such a great format but rather because the non-XML-based tools were all horribly broken. I was thrilled at the thought of being able to attack the problem as one of document translation through XSLT2 rather than massaging the input to a (binary) Black Box.&lt;/p&gt;
&lt;p&gt;The end result was reasonably robust, although we have the luxury of authors who use a well-designed Word template with remarkable fidelity. Even after a few months of tweaking, it was never at the point where it could be run without supervision. &amp;ldquo;Fairly high quality&amp;rdquo;&amp;ndash;that&amp;rsquo;s about as good as I&amp;rsquo;d expect to get out of it.&lt;/p&gt;
&lt;p&gt;My experiments with translating Word 2003 XML into DocBook were much easier. However, that path was extremely fragile (often producing invalid DocBook that would have to be hand-validated).&lt;/p&gt;
&lt;p&gt;All that said, I wouldn&amp;rsquo;t build &lt;em&gt;anything&lt;/em&gt; against Word 2003 XML now that Word 2007 OOXML is around the corner (with the added promise of export plugins for older Word versions).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-605&#34;&gt;November 29, 2006 10:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Keith! I have a question, and I&amp;rsquo;m going to guess at the answer: why did you need XSLT2? Was it because Word&amp;rsquo;s algorithm for determining wx:subsection tag placement wasn&amp;rsquo;t good enough and you needed XSLT2&amp;rsquo;s grouping ability?&lt;/p&gt;
&lt;p&gt;(If you&amp;rsquo;re in Boston next week, I&amp;rsquo;d love to talk about this more.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Frankfurt tourism</title>
      <link>https://www.bobdc.com/blog/frankfurt-tourism/</link>
      <pubDate>Mon, 20 Nov 2006 08:46:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/frankfurt-tourism/</guid>
      
      
      <description><div>Germany, barbecue, Moroccan pop, and hiphop.</div><div>&lt;p&gt;I recently heard on short notice that I would have some time to kill in Frankfurt, and just as I was wondering what to do there, Tim Bray &lt;a href=&#34;http://www.tbray.org/ongoing/When/200x/2006/11/08/Frankfurt-Verticals&#34;&gt;posted&lt;/a&gt; something about a recent visit, so I asked him. He suggested the Sachsenhausen district, across the Main River. I booked a room in a hotel near there and wandered around a lot. It was great; I&amp;rsquo;d certainly do it again.&lt;/p&gt;
&lt;p&gt;Sachsenhausen has its own Museum Mile on their side of the river, with a Saturday morning bonus of a large flea market. The only museum I went to was the Communications Museum (&lt;a href=&#34;http://www.museumsstiftung.de/frankfurt/d311_rundgang.asp&#34;&gt;Museum für Kommunikation&lt;/a&gt;), and I highly recommend it: there&amp;rsquo;s plenty for geeks, plenty for kids (there were several birthday parties being hosted when I was there, and USA Today actually lists it under &lt;a href=&#34;http://www.usatoday.com/travel/extraday/frankfurt/worth.htm&#34;&gt;Fun for the kids&lt;/a&gt;), plenty for everyone.&lt;/p&gt;
&lt;p&gt;The histories of television and telephones provide many great exhibits, like this 1961 Kuba Komet TV:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://flickr.com/photos/bobdc/300392770/&#34;&gt;&lt;img src=&#34;http://static.flickr.com/117/300392770_7e0cae2909_m.jpg&#34; alt=&#34;[Museum für Kommunikation TV]&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The telegraphy history also had some cool stuff. This input device is truly a keyboard&amp;rsquo;s keyboard:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://flickr.com/photos/bobdc/300392772/&#34;&gt;&lt;img src=&#34;http://static.flickr.com/113/300392772_35302d8efb_m.jpg&#34; alt=&#34;[Museum für Kommunikation telegraph keyboard]&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The museum had a lot of communications-related art. Europeans seem better at making art-technology connections than English-speaking countries. (I know there are exceptions.) This part of the museum had everything from nineteenth-century paintings to works by Max Ernst, Salvador Dali, Joseph Beuys, and Christo. My schedule didn&amp;rsquo;t allow my planned visit to the &lt;a href=&#34;http://www.mmk-frankfurt.de/&#34;&gt;Museum für Moderne Kunst&lt;/a&gt;, so this part of the Communications Museum gave me my big city fix of modern art. Their temporary exhibition, &lt;a href=&#34;http://www.pong-mythos.net/&#34;&gt;pong.mythos&lt;/a&gt;, was particularly good at combining technical history with artistic interpretations of the roles and potential roles of the first video game to enter modern consciousness on a wide scale.&lt;/p&gt;
&lt;p&gt;One of the museum&amp;rsquo;s many &amp;ldquo;specialized consumer hardware through the ages&amp;rdquo; exhibits was a collection of remote control units that could have been duplicated with about 20 euros at the flea market across the street. (Reproducing the &amp;ldquo;cell phones through the ages&amp;rdquo; exhibit would have cost about 200 euros.) Among the many old tape recorders and laptops at the flea market was one old computer that I had to take a picture of for my friends at Sun:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://flickr.com/photos/bobdc/300392768/&#34;&gt;&lt;img src=&#34;http://static.flickr.com/109/300392768_59be20e315_m.jpg&#34; alt=&#34;[Sun blade at flea market]&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When I lived in a New York apartment, I got into the flea market habit of focusing on old photographs, because they take up very little room. While I was looking through some at the Frankfurt flea market, the guy selling CDs in the next booth was playing some North African pop in which a vocoder-like-Cher-&amp;ldquo;Believe&amp;rdquo;-thing was triggered by an EQ filter so that it only affected notes in a certain range, which sounded really cool as the singer did that fast melismatic thing they do in and out of the affected range. I bought the CD, by &amp;ldquo;Talbi One&amp;rdquo;, for five Euros (I later found out that YouTube has a &lt;a href=&#34;http://www.youtube.com/watch?v=zp5nksLPhzE&#34;&gt;video&lt;/a&gt; for the CD&amp;rsquo;s first tune), then went back to the old photos. To replace the CD that I had bought, the guy put on some more Moroccan pop. This tune was much more familiar, because the Chemical Brothers had sampled it for their hit collaboration &lt;a href=&#34;http://www.youtube.com/watch?v=H2hzVV2Nwfs&#34;&gt;Galvanize&lt;/a&gt; (&amp;ldquo;The time has come to&amp;hellip;&amp;rdquo;) with A Tribe Called Quest alumnus Q-Tip: &lt;a href=&#34;http://www.najataatabou.com/&#34;&gt;Najat Atabou&amp;rsquo;s&lt;/a&gt; &amp;ldquo;Hadi kedba bayna&amp;rdquo; (&amp;ldquo;Just Tell Me the Truth&amp;rdquo;). (If you&amp;rsquo;re interested in hiphop usage of North African pop, check out this &lt;a href=&#34;http://riddimmethod.net/?p=23&#34;&gt;background on and mashup of&lt;/a&gt; Jay Z&amp;rsquo;s &amp;ldquo;Big Pimpin&amp;rdquo; and Abdel-Halim Hafez&amp;rsquo;s original &amp;ldquo;Khosara&amp;rdquo;, the song that Timbaland &amp;ldquo;borrowed&amp;rdquo; from to create Big Pimpin&amp;rsquo;s main riff.) &amp;ldquo;Fünf Euro auch?&amp;rdquo; &amp;ldquo;Fünf Euro auch.&amp;rdquo; She&amp;rsquo;s really impressive, and the CD is much better than the Talbi One disk.&lt;/p&gt;
&lt;p&gt;While in Germany, I also had an interesting insight about a classic German dish. A German co-worker had explained to me that real sauerbraten is beef cooked slowly in a vinegar-based sauce until it&amp;rsquo;s almost ready to fall apart. When a co-worker from the Philippines (of the seven people on my project team there were two Filipinos and no two others from the same country) ordered a &amp;ldquo;barbecue&amp;rdquo; plate, it turned out to really mean &amp;ldquo;grilled&amp;rdquo;. As I tried to explain to the others what &amp;ldquo;barbecue&amp;rdquo; meant in the States, especially in the south, it hit me: the closest thing to barbecue—&lt;a href=&#34;http://www.northcarolinatravels.com/food/barbecue/index.htm&#34;&gt;North Carolina style&lt;/a&gt;, anyway, although they favor pork over beef—in Germany, and maybe in all of Europe, is sauerbraten! Web searches for sauerbraten recipes show other typical barbecue sauce ingredients such as catsup or tomato paste, garlic, peppercorns, onion&amp;hellip; and of course you wash the finished product down with beer.&lt;/p&gt;
&lt;p&gt;The next time I make some barbecue, I may have to crunch up some ginger snaps in the sauce, which seems to be common in sauerbraten recipes. And the sauerbraten I had &lt;a href=&#34;http://www.sachsenhaeuserwarte.de/&#34;&gt;tonight&lt;/a&gt; was just great.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/tourism">tourism</category>
      
    </item>
    
    <item>
      <title>DAM! Subversion! RDF? (OWL?)</title>
      <link>https://www.bobdc.com/blog/dam-subversion-rdf-owl/</link>
      <pubDate>Fri, 10 Nov 2006 07:45:56 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/dam-subversion-rdf-owl/</guid>
      
      
      <description><div>One nice thing about blogging is that if you don&#39;t have the spare time to code up an idea, you can at least describe the design issues and see what people think.</div><div>&lt;p&gt;I read Elliot Kimber&amp;rsquo;s &lt;a href=&#34;http://www.google.com/search?q=site%3Adrmacros-xml-rants.blogspot.com%20xcmtdmw&#34;&gt;series on XML content management software&lt;/a&gt; as it came out, and I&amp;rsquo;ve been re-reading it lately for work project reasons. We work at the same company, where content management issues come up a lot. Content Management Systems is also one of those software categories where many products claim to do it all, but what exactly constitutes &amp;ldquo;it all&amp;rdquo; is very vague. Each vendor makes up their own features and puts their own spin on the au courant buzzwords, making it difficult to compare different products. Elliot&amp;rsquo;s approach of treating a basic CMS as a source control system like &lt;a href=&#34;http://subversion.tigris.org/&#34;&gt;Subversion&lt;/a&gt; plus x, y, and z and then analyzing what x, y, and z should be make it easier to sort expectations of a CMS system.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://subversion.apache.org/&#34;&gt;&lt;img src=&#34;http://subversion.apache.org/images/svn-name-banner.jpg&#34; alt=&#34;[subversion logo]&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;200px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also been getting to know Subversion. One nice thing that I learned when &lt;a href=&#34;http://drmacros-xml-rants.blogspot.com/2006/07/xcmtdmw-subversion-cooler-than.html&#34;&gt;someone pointed it out to Elliot&lt;/a&gt; is that it can store arbitrary metadata, even passing my Arbitrary Metadata Test by letting me assign a goofinessFactor of 3.1416 to one file.&lt;/p&gt;
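&lt;p&gt;For anyone who wants to try the same test, the relevant commands are standard Subversion property operations; a session like the following (the file name is made up) sets such a property and reads it back:&lt;/p&gt;

```shell
# Attach an arbitrary property to a versioned file, commit, then read it back.
svn propset goofinessFactor 3.1416 logo.png
svn commit -m "record goofinessFactor"
svn propget goofinessFactor logo.png
svn proplist --verbose logo.png
```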
&lt;p&gt;Unfortunately, Subversion &lt;a href=&#34;http://svn.haxx.se/users/archive-2006-06/1092.shtml&#34;&gt;doesn&amp;rsquo;t let you&lt;/a&gt; search for files based on metadata values. I see the value of finding an object and then looking at its metadata values, but I want the ability to search the metadata values to find objects. There are ways to add this in, but first let&amp;rsquo;s address the always important question: why bother?&lt;/p&gt;
&lt;h2 id=&#34;i112&#34;&gt;Subversion + ? = (CMS | DAM)&lt;/h2&gt;
&lt;p&gt;I thought that I would learn a lot by adding whatever to Subversion to build a simplified CMS. (Subversion &lt;a href=&#34;http://svnbook.red-bean.com/nightly/en/svn.reposadmin.create.html#svn.reposadmin.create.hooks&#34;&gt;hook scripts&lt;/a&gt; make it easy to trigger python scripts upon events such as check-in.) Elliot &lt;a href=&#34;http://drmacros-xml-rants.blogspot.com/2006/07/xcmtdmw-import-is-everything.html&#34;&gt;makes it clear&lt;/a&gt; that link management is important in a CMS if you want to dynamically create documents from stored pieces and track dependency relationships—and for typical CMS use, you definitely want to do the former and probably the latter—but I didn&amp;rsquo;t want to add that much to Subversion, so I thought of some lower hanging fruit: a &lt;a href=&#34;http://www.google.com/search?q=%22digital%20asset%20manager%22&#34;&gt;Digital Asset Manager&lt;/a&gt;, a project that also gives me the benefit of a cheap pun to use. (Years ago, my future wife and I saw that upon finishing our group&amp;rsquo;s tour of the &lt;a href=&#34;http://en.wikipedia.org/wiki/Hoover_Dam&#34;&gt;Hoover Dam&lt;/a&gt;, the older gentleman leading the tour clearly enjoyed saying &amp;ldquo;Thanks for taking the Dam(n) tour!&amp;rdquo;) Like so many people doing semantic web related development, I could start by creating Yet Another Photo Management System Using RDF. I could also store to-do lists, XML files of all persuasions, Microsoft Office and Open Office files, and other &amp;ldquo;digital assets,&amp;rdquo; as I recently read about &lt;a href=&#34;http://www.onlamp.com/pub/a/onlamp/2006/11/02/personal_document_management.html&#34;&gt;Jason Hunter&lt;/a&gt; and &lt;a href=&#34;http://kitenet.net/~joey/svnhome.html&#34;&gt;Joey Hess&lt;/a&gt; doing.&lt;/p&gt;
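&lt;p&gt;As a sketch of the hook approach (the repository path and the parsing helper are illustrative, not from any real system), a post-commit hook could find out what changed at a revision and hand the paths off to whatever indexes the metadata:&lt;/p&gt;

```python
# Minimal sketch of what a Subversion post-commit hook could do.
# Subversion invokes the hook as "post-commit REPOS-PATH REVISION";
# the repository path and the metadata-indexing idea are assumptions.
def changed_paths_command(repos, revision):
    # The command a real hook would run (via subprocess) to learn
    # which paths a revision touched.
    return ["svnlook", "changed", "-r", str(revision), repos]

def parse_svnlook_changed(output):
    # svnlook prints lines like "U   docs/intro.html"; return
    # (action, path) pairs for the indexer.
    changes = []
    for line in output.splitlines():
        if line.strip():
            action, path = line.split(None, 1)
            changes.append((action, path.strip()))
    return changes

print(changed_paths_command("/var/svn/dam", 42))
print(parse_svnlook_changed("U   docs/intro.html\nA   img/logo.png\n"))
```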
&lt;p&gt;If I can forget about linking, I only need to add better metadata management to Subversion&amp;rsquo;s excellent storage and version control and add some of the x, y, and z features mentioned above. But do I really want to use RDF to store the metadata?&lt;/p&gt;
&lt;h2 id=&#34;i115&#34;&gt;The case for storing the metadata in MySQL&lt;/h2&gt;
&lt;p&gt;A relational database offers obvious benefits for storing data that fits easily into rows and columns, and you can have as many columns as you need. One great thing about Subversion&amp;rsquo;s metadata capabilities is that the metadata is versioned, like the files that you check in, so that you could say that at r9, the editor of a document was John Smith, but at r10 it was Jane Jones, and you could always go back and see who was the editor at r9. A simple relational table could store the document&amp;rsquo;s pathname as an ID, the property name (e.g. &amp;ldquo;editor&amp;rdquo;), the property value, and the release number. That&amp;rsquo;s four pieces of information, and therefore not a great fit for an RDF triplestore.&lt;/p&gt;
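&lt;p&gt;A minimal sketch of that table, using SQLite for convenience (the table and column names are made up), shows how the fourth piece of information lets you ask who the editor was at any given revision:&lt;/p&gt;

```python
import sqlite3

# Hypothetical versioned-metadata table mirroring the
# (path, property, value, revision) model described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (path TEXT, prop TEXT, value TEXT, rev INTEGER)")
conn.executemany("INSERT INTO meta VALUES (?, ?, ?, ?)", [
    ("doc/intro.html", "editor", "John Smith", 9),
    ("doc/intro.html", "editor", "Jane Jones", 10),
])

def prop_at(path, prop, rev):
    # The value in effect at a revision is the one with the highest
    # revision number not exceeding it.
    cur = conn.execute(
        "SELECT value FROM meta WHERE path = ? AND prop = ? "
        "AND rev BETWEEN 0 AND ? ORDER BY rev DESC LIMIT 1",
        (path, prop, rev))
    row = cur.fetchone()
    return row[0] if row else None

print(prop_at("doc/intro.html", "editor", 9))   # John Smith
print(prop_at("doc/intro.html", "editor", 10))  # Jane Jones
```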
&lt;h2 id=&#34;i117&#34;&gt;The case for storing the metadata in an RDF triplestore&lt;/h2&gt;
&lt;p&gt;It would be a bit kludgy to add the release number as a suffix to the file ID (e.g. doc/intro.htmlr9) and have a regular expression peel it off before presenting it for output, but at least it would squeeze these four pieces of information into a triple. It wouldn&amp;rsquo;t be that much trouble, and by helping to distinguish between two versions of a file, we could consider a release number to be part of the identifier anyway. (I&amp;rsquo;m sure others have thought harder about this than I have, so I&amp;rsquo;d appreciate any pointers.)&lt;/p&gt;
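&lt;p&gt;The kludge itself is trivial to sketch in Python (the helper names are mine, and the scheme assumes the revision digits always end the identifier):&lt;/p&gt;

```python
import re

# Pack a path and a revision number into one identifier string, and
# peel the revision back off with a regular expression before
# presenting the path for output.
def make_id(path, rev):
    return "%sr%d" % (path, rev)

def split_id(ident):
    # Greedy match: the final "r" followed by trailing digits is the
    # revision marker.
    m = re.match(r"(.*)r(\d+)$", ident)
    return m.group(1), int(m.group(2))

print(split_id(make_id("doc/intro.html", 9)))  # ('doc/intro.html', 9)
```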
&lt;p&gt;What would this little kludge, which lets us store RDF versions of the metadata, buy us? So far, I&amp;rsquo;ve thought of two things. First&amp;mdash;and this is what gave me the idea for the whole thing in the first place&amp;mdash;OWL reasoners could take advantage of the data. For example, if I want a picture of a logo, and I had declared that files with a type of JPG, JPEG, BMP, PNG, and TIF were image files, then I could easily search the metadata of just the image files without worrying about format. I&amp;rsquo;d love to hear more potential examples of useful, realistic OWL-based queries to do on such data, and probably won&amp;rsquo;t start any coding until I find some.&lt;/p&gt;
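&lt;p&gt;A sketch of those declarations could look like this in N3 (all the class names are hypothetical, and prefix declarations are omitted): each format-specific class becomes a subclass of one ImageFile class, so a reasoner can gather all image files regardless of format.&lt;/p&gt;

```turtle
# Hypothetical class declarations for the image-file example.
dam:JPEGFile rdfs:subClassOf dam:ImageFile .
dam:BMPFile  rdfs:subClassOf dam:ImageFile .
dam:PNGFile  rdfs:subClassOf dam:ImageFile .
dam:TIFFile  rdfs:subClassOf dam:ImageFile .
```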
&lt;p&gt;The second advantage of storing it in RDF is that more metadata extraction tools are already out there for the taking. For example, there are &lt;a href=&#34;http://www.ivan-herman.net/WebLog/WorkRelated/SemanticWeb/xmpextract.html&#34;&gt;free tools&lt;/a&gt; for pulling &lt;a href=&#34;https://www.bobdc.com/blog/using-or-not-using-adobes-xmp&#34;&gt;XMP&lt;/a&gt;-style RDF from JPEG and Adobe formats. More importantly, the &lt;a href=&#34;http://www.w3.org/2004/01/rdxh/spec&#34;&gt;GRDDL&lt;/a&gt; community is writing XSLT stylesheets to pull metadata from XML-based resources. I think that this community is a bit optimistic in hoping that movie theater and pizza shop web site owners will add processing instructions to their XHTML files that point to these stylesheets, but if the code is being written to do the extractions, there are all kinds of applications that can benefit. A Subversion-based DAM is one.&lt;/p&gt;
&lt;p&gt;Any other suggestions or ideas?&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.mediabeacon.com&#34; title=&#34;http://www.mediabeacon.com&#34;&gt;Alex M.&lt;/a&gt; on &lt;a href=&#34;#comment-591&#34;&gt;November 10, 2006 12:40 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Excellent write up. It&amp;rsquo;s great to see people being excited about RDF and XMP. We have been boiling in pretty much the same technologies for quite some time. We&amp;rsquo;ve also been a big proponent of the idea of embedding and encrypting arbitrary amounts of data and business logic/forms into the files and created a powerful library that can embed RDF/XMP data into any file type. :)&lt;/p&gt;
&lt;p&gt;If you ever feel bored give our office a buzz and somebody will give you a tour.&lt;/p&gt;
&lt;p&gt;Alex M., MediaBeacon, Inc.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Mapping relational data to RDF with D2RQ</title>
      <link>https://www.bobdc.com/blog/mapping-relational-data-to-rdf/</link>
      <pubDate>Mon, 06 Nov 2006 08:49:18 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/mapping-relational-data-to-rdf/</guid>
      
      
      <description><div>Getting more URIs into your triples&#39; objects, and why this is good.</div><div>&lt;p&gt;&lt;a href=&#34;https://www.bobdc.com/blog/integrating-relational-databas&#34;&gt;Last week&lt;/a&gt; I mentioned the role that &lt;a href=&#34;http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2rq/index.htm&#34;&gt;D2RQ&lt;/a&gt; played in a project I was working on, and I wanted to write a little more about this RDBMS/RDF interface if it&amp;rsquo;s any help to people who may use it. D2RQ is free, and it&amp;rsquo;s easy to use in its default setup, but I&amp;rsquo;m finding that the further you stray from the default setup, the more you can do with it.&lt;/p&gt;
&lt;p&gt;The main mistake I made at the beginning was that when I created a small dummy database for my first mapping attempt, I didn&amp;rsquo;t declare a key value for the database&amp;rsquo;s only table. SQL doesn&amp;rsquo;t require this, but D2RQ does, and for a very good reason: the key value becomes the subject of the triples that express the data values in each row of the table. The D2RQ mapping file lets you configure the URIs used in the triples, so a row from a book publishing database table where the ISBN column is the key and has a value of &amp;ldquo;0553213113&amp;rdquo; and the title column has a value of &amp;ldquo;Moby Dick&amp;rdquo; can generate a (subject, predicate, object) triple of (&lt;a href=&#34;http://foo-url/isbn/0553213113&#34;&gt;http://foo-url/isbn/0553213113&lt;/a&gt;, &lt;a href=&#34;http://bar-url/title&#34;&gt;http://bar-url/title&lt;/a&gt;, &amp;ldquo;Moby Dick&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;While RDF subjects and predicates must be URIs, objects can be string values like &amp;ldquo;Moby Dick&amp;rdquo;, but they can also be URIs. A URI as an object can serve as the subject of other triples, letting you link up information to get more value out of it. (More on this in an &lt;a href=&#34;http://www.snee.com/xml/rdf-drdobbs.html&#34;&gt;article I did for Dr. Dobb&amp;rsquo;s&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;D2RQ turns most database values into literal strings when it plugs them into the &amp;ldquo;object&amp;rdquo; slot of an RDF triple statement, but it does know one kind of value in a relational database that&amp;rsquo;s better off being represented as a URI when it&amp;rsquo;s the object of a triple: foreign key values. For example, let&amp;rsquo;s say that in a sales database you set the custID column of an orders table to be a foreign key referencing the custID column of the customers table, because you don&amp;rsquo;t want any orders entered with custID values that aren&amp;rsquo;t in your list of customers. When D2RQ asks your database package about that database, upon finding out that orders.custID is a foreign key, it sets up the mapping file so that RDF triples showing an order&amp;rsquo;s custID value will represent it in a triple&amp;rsquo;s object as a URI, not as a string like the &amp;ldquo;Moby Dick&amp;rdquo; object of the triple shown above. With custID as the object of triples from the orders table and the subject of triples from the customers table (and similarly, with the orders table&amp;rsquo;s second foreign key column itemID as the object of other triples from that table and the subject of additional triples from the items table), the right SPARQL query can show us that order o003, in which customer c002 ordered item i004, really means that customer John Lennon bought an Epiphone Casino guitar. (When I create a customer database of four names, I pick four obvious ones.)&lt;/p&gt;
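&lt;p&gt;The shape of such a query would be something like the following, with hypothetical prefixes and property names standing in for the ones D2RQ actually generates; each foreign-key URI joins the triples from one table to the triples from another:&lt;/p&gt;

```sparql
SELECT ?customerName ?itemName
WHERE {
  ?order db:orders_custID  ?cust .
  ?order db:orders_itemID  ?item .
  ?cust  db:customers_name ?customerName .
  ?item  db:items_name     ?itemName .
}
```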
&lt;p&gt;Of course, this is even easier in SQL—it&amp;rsquo;s what relational databases are designed for—but the point of my exercise is to set up an RDF version of relational data so that I can see if OWL lets us do useful things with the data that I couldn&amp;rsquo;t do when the data was strictly relational. Which leads to the next D2RQ objects-as-URIs trick: the mapping file&amp;rsquo;s d2rq:uriPattern property.&lt;/p&gt;
&lt;p&gt;In a comment on my last posting, Richard Cyganiak suggested that setting an e-mail address field in my address book databases to be an inverse functional property would add metadata that gave more value to the database. To cut to the chase, it worked; because of this property, &lt;a href=&#34;http://www.mindswap.org/2003/pellet/&#34;&gt;Pellet&lt;/a&gt; can now tell that Bobby Fisher of 2304 Eighth Lane and Robert L. Fisher of 2304 8th Ln. are the same person. (Despite my use of Beatle names in four-person databases, I swear that my &lt;a href=&#34;https://www.bobdc.com/blog/all-the-personal-data-you-want&#34;&gt;random personal data generator&lt;/a&gt; just happened to come up with the name of the wacky former chess champion.) The identification of potentially redundant names is a key feature of some &lt;a href=&#34;http://www.siperian.com/&#34;&gt;expensive products&lt;/a&gt; out there—you don&amp;rsquo;t like getting two copies of the same catalog with your name spelled slightly differently, and the retailer doesn&amp;rsquo;t want to pay for sending you two. So, the case for the value of owl:InverseFunctionalProperty is pretty clear.&lt;/p&gt;
&lt;p&gt;When you&amp;rsquo;re using OWL DL and you want to say that a property is an inverse functional property, the value must be a URI and not a literal string value. (Explanations of the &lt;a href=&#34;http://www.w3.org/TR/2004/REC-owl-features-20040210/#s1.3&#34;&gt;differences&lt;/a&gt; between OWL Lite, OWL DL, and OWL Full have too much knowledge representation jargon for me to remember those differences, but since OWL DL is more powerful than Lite and Full is apparently difficult to develop software for, the OWL software developers&amp;rsquo; Goldilocks approach of going with the one in the middle has me working with OWL DL.) While D2RQ can tell on its own that foreign key values should be represented as URIs, it needs to be told explicitly if you want e-mail addresses to be represented as URIs, and this means a few simple changes to the mapping file generated by D2RQ&amp;rsquo;s generate-mapping utility.&lt;/p&gt;
&lt;p&gt;The following shows the map file entry for the email1 field of the entries table in my MySQL eudora database. The mapping files use the N3 dialect of RDF. The first four lines were generated by D2RQ&amp;rsquo;s generate-mapping utility. I commented out the fourth line, which declares entries.email1 to be a data property, and added the two new lines below it. The d2rq:uriPattern line says that entries.email1 should be treated as a URI, and it plugs that value into a string beginning with &amp;ldquo;mailto:&amp;rdquo; so that it really is a URI. The line after that says to only bother with this mapping if the entries.email1 value is not equal to an empty string. (The &amp;ldquo;&amp;lt;&amp;gt;&amp;rdquo; here means &amp;ldquo;not equal to&amp;rdquo;, so don&amp;rsquo;t confuse it with the more common XML and N3 use of the pointy brackets.) Without this d2rq:condition property, D2RQ creates a URI of just &amp;ldquo;mailto:&amp;rdquo; for blank email1 values. Pellet, with good reason, doesn&amp;rsquo;t like these.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;emap:entries_email1 a d2rq:PropertyBridge;
   d2rq:belongsToClassMap emap:entries;
   d2rq:property eud:entries_email1;
#  d2rq:column &amp;quot;entries.email1&amp;quot;;
   d2rq:uriPattern &amp;quot;mailto:@@entries.email1@@&amp;quot;;
   d2rq:condition &amp;quot;entries.email1 &amp;lt;&amp;gt; &#39;&#39;&amp;quot;;
   .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the email1 values are treated as URIs, &lt;a href=&#34;http://www.mindswap.org/2004/SWOOP/&#34;&gt;SWOOP&lt;/a&gt; (one of the Goldilocks software packages I mentioned earlier) lets me set email1 to be an inverse functional property. As handy as the SWOOP interface is, it&amp;rsquo;s not that difficult to add the bolded line below to your OWL ontology using a text editor:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;owl:ObjectProperty rdf:about=&amp;quot;http://localhost:2020/resource/eudora/entries_email1&amp;quot;&amp;gt;
  &amp;lt;rdf:type rdf:resource=&amp;quot;http://www.w3.org/2002/07/owl#InverseFunctionalProperty&amp;quot; /&amp;gt;
  &amp;lt;rdfs:subPropertyOf rdf:resource=&amp;quot;http://localhost:2020/resource/entries/email&amp;quot; /&amp;gt;
&amp;lt;/owl:ObjectProperty&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(By the way, tools like SWOOP and Protégé are great for automating the writing and editing of RDF/OWL code, but only because the code is so verbose and redundant, not because it&amp;rsquo;s particularly difficult to understand. When I hear W3C Schema and XBRL advocates say &amp;ldquo;sure, the syntax is convoluted, but don&amp;rsquo;t worry about it, because the tools will take care of it&amp;rdquo; little alarms go off in my head—if I can&amp;rsquo;t understand the syntax used to model some information, I don&amp;rsquo;t want to have to take it on faith that the model is good. There&amp;rsquo;s no reason for RDF/OWL syntax to set off such alarms, and using these tools to generate the syntax and then reviewing the syntax is a great way to learn that syntax, but ultimately, you&amp;rsquo;ll get more work done more quickly if you use the tools.)&lt;/p&gt;
&lt;p&gt;To cut back to the chase I mentioned earlier, following these steps to designate email1 as an inverse functional property made it possible for Pellet to know that because the Bobby Fisher of 2304 Eighth Lane and the Robert L. Fisher of 2304 8th Ln in my MySQL database have the same email address, they&amp;rsquo;re the same person—giving me one more example of how RDF/OWL can help me to get more out of a traditional relational database. I&amp;rsquo;d love to hear more suggestions for things I can add to this ontology that let SPARQL queries find out things that straight SQL queries could not get from the same data.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Integrating relational databases with RDF/OWL</title>
      <link>https://www.bobdc.com/blog/integrating-relational-databas/</link>
      <pubDate>Mon, 30 Oct 2006 08:22:06 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/integrating-relational-databas/</guid>
      
      
      <description><div>Done, if on a fairly small scale.</div><div>&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/rdfowl-for-data-silo-integrati&#34;&gt;recently asked&lt;/a&gt; about the use of RDF/OWL to integrate databases, especially relational databases. The posting received many good comments, but no pointers to the kind of simple example I was hoping to find, so I&amp;rsquo;ve managed to create one myself.&lt;/p&gt;
&lt;p&gt;I loaded two different single-table address book databases into MySQL, with one based on Outlook&amp;rsquo;s structure and field names and the other based on Eudora&amp;rsquo;s. (I&amp;rsquo;ve already &lt;a href=&#34;https://www.bobdc.com/blog/all-the-personal-data-you-want&#34;&gt;written here&lt;/a&gt; about how I generated the data.) I used &lt;a href=&#34;http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2rq/index.htm&#34;&gt;D2RQ&lt;/a&gt; to treat the two databases as RDF and I used &lt;a href=&#34;http://www.mindswap.org/2004/SWOOP/&#34;&gt;SWOOP&lt;/a&gt; to generate an ontology, with equivalence rules such as that the eud:entries_workState field from one database was equivalent to the ol:entries_businessState field from the other.&lt;/p&gt;
&lt;p&gt;To find everyone in the address book with a business address in New York state, a SPARQL query using &lt;a href=&#34;http://www.mindswap.org/2003/pellet/&#34;&gt;Pellet&lt;/a&gt; can ask for all entries for which eud:entries_workState=&amp;ldquo;NY&amp;rdquo; and then also get the entries from the other database for which ol:entries_businessState=&amp;ldquo;NY&amp;rdquo;. By defining all the different phone number properties (home, work, mobile, fax, etc.) as subproperties of &amp;ldquo;phone&amp;rdquo;, I can also query for someone&amp;rsquo;s phone numbers without knowing exactly which kinds of phone numbers the database has for that person, and see them all listed. To me, these both demonstrate how metadata can add value to data, because they let me get answers to practical questions about my data that are more complete than these answers would have been without the metadata.&lt;/p&gt;
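&lt;p&gt;The phone subproperty trick means a query never has to enumerate the individual phone fields. A sketch of such a query (the property names here are illustrative; the namespace is the one D2RQ generated for my entries table):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX ab: &amp;lt;http://localhost:2020/resource/entries/&amp;gt;
SELECT ?phone
WHERE {
  ?entry ab:lastName &amp;quot;Fisher&amp;quot; ;
         ab:phone ?phone .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Without inference this returns nothing, because no triple uses the phone property directly; with the subproperty declarations applied, the home, work, mobile, and fax numbers all show up.&lt;/p&gt;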
&lt;p&gt;To use D2RQ as an interface to a MySQL database, you first run its utility that queries the database for database catalog information and generates a mapping file. Then, you start up the D2RQ server with the mapping file as a parameter, so that when you issue SPARQL queries against its server it can map your queries to the appropriate SQL queries to pull the data out. (I did this all on a Windows machine, by the way, and have no reason to believe that any of it would be different on a Linux machine.) To integrate two databases, I generated a mapping file for each and then combined the two mapping files into one that I used when I started the D2RQ server before making my Pellet queries. I had to rename some namespace prefixes, but the mapping file syntax was pretty easy to understand, as is the &lt;a href=&#34;http://sites.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/spec/&#34;&gt;spec&lt;/a&gt; that describes their syntax. I&amp;rsquo;d like to especially thank Richard Cyganiak, who patiently answered my questions on the &lt;a href=&#34;https://lists.sourceforge.net/lists/listinfo/d2rq-map-devel&#34;&gt;d2rq-map-devel mailing list&lt;/a&gt;. (I&amp;rsquo;m looking forward to checking out the new features of the D2RQ upgrade, which I just learned about this morning.)&lt;/p&gt;
&lt;p&gt;To query the data, I have a script pull all the latest data from the D2RQ server into a file where it&amp;rsquo;s added to the ontology information (which I created with SWOOP, as described below), and then Pellet queries that. Integrating additional RDF-based sources would be easy, whether they come via D2RQ or not; just add them to the same file before querying. This probably won&amp;rsquo;t scale way up, and some digging into the Pellet and D2RQ APIs should make it possible to integrate them more closely for this querying. My next step is to use this same routine to integrate some multi-table databases and to get some non-string data in there. At some point I&amp;rsquo;ll write up how I did all this in more detail and make all the files available so that someone can reproduce it on their own. Meanwhile, I&amp;rsquo;ll probably propose it as a conference topic someplace.&lt;/p&gt;
&lt;p&gt;I love how, when you pick &amp;ldquo;Load&amp;rdquo; and then &amp;ldquo;Ontology&amp;rdquo; from SWOOP&amp;rsquo;s File menu, if you load an RDF file with no ontology information defined, it declares all the predicates it finds as properties. You can save the file without doing anything else, open it up in a text editor, and see the properties all defined there in RDF/OWL syntax. Using SWOOP, you can then specify equivalence and property/subproperty relationships just by pointing and clicking, and then save again. After doing this with the combined address book databases, I pulled the ontology definitions out into a separate file so that when the relational data was updated, I could use D2RQ to pull updated RDF and apply the same ontology definitions to it when querying the new data.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d love to hear suggestions about additional OWL constructs that can let queries pull information from a database that they couldn&amp;rsquo;t have found without that OWL metadata. This last qualification is important—for example, while I understand the domain and range concepts (at least with pizza; &lt;a href=&#34;http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf%20&#34;&gt;PDF&lt;/a&gt;), I&amp;rsquo;d like to know a way that defining the domain or range of a property could let a query do more than it could before these were defined.&lt;/p&gt;
&lt;p&gt;For now, I&amp;rsquo;m pointing and clicking with a free tool to define an ontology that adds value to existing data, and I&amp;rsquo;m psyched.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.jogiles.co.nz&#34; title=&#34;http://www.jogiles.co.nz&#34;&gt;Jonathan Giles&lt;/a&gt; on &lt;a href=&#34;#comment-576&#34;&gt;October 30, 2006 4:10 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hello, good article (and I&amp;rsquo;m glad I stumbled across it on planet RDF). My question is, consider you have a client-side application that sends a large number of queries to the D2RQ server using SPARQL. Have you given any thought to there being some caching layer in between the relational database and the D2RQ server? Or does this already exist? Alternatively, is the response speed good enough such that a cache is not overly necessary?&lt;/p&gt;
&lt;p&gt;I was previously considering a solution where a Jena OWL ontology was populated from a database, which would act as a cache. It would still handle queries using SPARQL however. This blog post suggests a slightly different approach that I would be interested in learning more about.&lt;/p&gt;
&lt;p&gt;Cheers,&lt;br /&gt;
Jonathan Giles.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dowhatimean.net/&#34; title=&#34;http://dowhatimean.net/&#34;&gt;Richard Cyganiak&lt;/a&gt; on &lt;a href=&#34;#comment-577&#34;&gt;October 30, 2006 4:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cool stuff – so this is what you&amp;rsquo;ve been up to. We&amp;rsquo;ve been looking at integrating D2RQ instances through &lt;a href=&#34;http://darq.sourceforge.net/&#34;&gt;federated SPARQL queries&lt;/a&gt;, and it&amp;rsquo;s nice to see an inference-based approach working as well.&lt;/p&gt;
&lt;p&gt;Things you could do with OWL, related to data integration: With IFPs, you can infer that people with the same email address from two different DBs are the same person. With subclass relationships, you can infer that a tech report from DB1 and a press release from DB2 are both documents, and therefore queries for documents shall return both.&lt;/p&gt;
&lt;p&gt;Jonathan: If you have enough memory, then loading everything into an in-memory model before doing inference will always be much faster. Databases are OK for doing SPARQL queries, but are no good for doing OWL inference.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-578&#34;&gt;October 30, 2006 5:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Richard! Jonathan&amp;ndash;my plan is to continue my research more horizontally than vertically, i.e. to investigate the potential role of other tools before I try to add a lot of scale and efficiency to the system created with this particular set of tools. Mostly, I want to push the possible role that OWL can play in this system.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>Finding free content</title>
      <link>https://www.bobdc.com/blog/finding-free-content/</link>
      <pubDate>Mon, 23 Oct 2006 08:15:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/finding-free-content/</guid>
      
      
<description><div>People who should know better often think it&#39;s easy.</div><div>&lt;p&gt;A few weeks ago I wrote about &lt;a href=&#34;https://www.bobdc.com/blog/all-the-personal-data-you-want&#34;&gt;free personal data&lt;/a&gt; that was really just randomly generated names and contact information created for some tests. Coherent prose by knowledgeable people is something that you can&amp;rsquo;t generate with a Python script, and it&amp;rsquo;s interesting to see the schemes that some people have devised to find such content.&lt;/p&gt;
&lt;p&gt;In the dot com days, I heard from several organizations that were each putting together a Portal for All Things XML. They all told me how great it would be for me to write for them for free because it would be such great exposure. Other XML &amp;ldquo;experts&amp;rdquo; would be writing there as well, you see, making this the central place for anyone concerned with XML. Weblogs hadn&amp;rsquo;t caught on yet, but since &lt;a href=&#34;http://mailman.ic.ac.uk/pipermail/xml-dev/1997-February/000000.html&#34;&gt;early 1997&lt;/a&gt; the &lt;a href=&#34;http://www.xml.org/xml/xmldev.shtml&#34;&gt;xml-dev&lt;/a&gt; mailing list had already been to XML what the &lt;a href=&#34;http://groups-beta.google.com/group/comp.text.sgml/topics&#34;&gt;comp.text.sgml&lt;/a&gt; Usenet group had been to SGML: the central place where people who knew XML well or wanted to know more exchanged ideas and opinions on related issues. If I had an XML-related idea or question and wanted opinions from key players, I knew that that was the place to take it. Weblogs still hadn&amp;rsquo;t supplanted that, and &amp;ldquo;portals&amp;rdquo; never would.&lt;/p&gt;
&lt;p&gt;I did write for XML.com, which offered the best reason of all to write for them: they paid, as did IBM developerWorks and various short-lived print magazines with &amp;ldquo;XML&amp;rdquo; in their title. When O&amp;rsquo;Reilly started the O&amp;rsquo;Reilly Network, I thought that that was a nice place to try the weblogging thing, because it had an audience in place that would see what I wrote, and the idea of trying to convince people to subscribe to a feed of just me was intimidating. (When I started the bobdc.blog weblog, I immediately contacted &lt;a href=&#34;http://planet.xmlhack.com/&#34;&gt;Planet XMLhack&lt;/a&gt; and &lt;a href=&#34;http://planetrdf.com/&#34;&gt;Planet RDF&lt;/a&gt; about becoming part of those feeds—it was like getting a distribution deal.) The O&amp;rsquo;Reilly Network weblog didn&amp;rsquo;t pay, but there was no pressure, and I thought it would be an interesting experiment to have a weblog devoted to one topic: linking.&lt;/p&gt;
&lt;p&gt;It looks like O&amp;rsquo;Reilly has shifted their online publishing emphasis away from paid articles to the O&amp;rsquo;Reilly Network weblogs. XML.com used to have three new articles every week, and it was part of my Wednesday morning routine to see what they were; now they only have one, because they&amp;rsquo;re more interested in relying on the free content of the O&amp;rsquo;Reilly Network blogs. When Bloglines tells me that the Planet XML feed has a new entry &amp;ldquo;by XML.com&amp;rdquo;, it&amp;rsquo;s more likely to be some musings completely unrelated to XML than it is to be a new XML.com article. Meanwhile, &lt;a href=&#34;http://copia.ogbuji.net/blog/&#34;&gt;Uche&lt;/a&gt;, &lt;a href=&#34;http://times.usefulinc.com/&#34;&gt;Edd&lt;/a&gt;, &lt;a href=&#34;http://dubinko.info/blog/&#34;&gt;Micah&lt;/a&gt;, &lt;a href=&#34;http://clarkparsia.com/weblog/&#34;&gt;Kendall&lt;/a&gt; and others of us with less distinctive first names have started our own weblogs on our own domains instead of letting O&amp;rsquo;Reilly make advertising revenue from content that we provide to them for free. (Note to &lt;a href=&#34;http://www.oreillynet.com/pub/au/1712&#34;&gt;Rick Jelliffe&lt;/a&gt;: join us&amp;hellip; join us&amp;hellip;) It seems odd that O&amp;rsquo;Reilly would put more emphasis on the weblogs as people move away from putting their postings there; I guess for some people, the O&amp;rsquo;Reilly branding gives them &amp;ldquo;good exposure,&amp;rdquo; and for O&amp;rsquo;Reilly, free is better.&lt;/p&gt;
&lt;p&gt;Academic publishers have received quality free content for at least a hundred years, but their authors had an incentive, because it advanced their careers in a publish-or-perish sort of way. This model is starting to &lt;a href=&#34;http://math.ucr.edu/home/baez/journals.html&#34;&gt;break up&lt;/a&gt;, especially in the world of math, where the journal you publish in is not quite as important as what you publish. (In expensive fields such as biology, research grants are tied more closely to where you publish as an indicator of your work&amp;rsquo;s importance.)&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;good exposure&amp;rdquo; argument works for the prominent academic journals, it works for YouTube, and it has worked for &lt;a href=&#34;http://abc.go.com/primetime/americasfunniest/index.html&#34;&gt;America&amp;rsquo;s Funniest Home Videos&lt;/a&gt; since fourteen years before YouTube existed. The Publishing 2.0 blog &lt;a href=&#34;http://publishing2.com/2006/09/14/are-users-who-generate-content-receiving-equal-pay-for-equal-work/&#34;&gt;frets&lt;/a&gt; about whether people who contribute to Frito Lay&amp;rsquo;s new campaign are duly compensated, but the contributors get a shot at the most exposure in all of US media.&lt;/p&gt;
&lt;p&gt;Just this past Sunday I got an email about &lt;a href=&#34;http://www.scribd.com/word/index&#34;&gt;Scribd&lt;/a&gt;, which is &amp;ldquo;kind of like YouTube for publishing, and is meant to be an alternative to a blog&amp;rdquo;, according to the email. Much of the point of YouTube was to simplify the public posting of video; the public posting of prose was never that difficult. As an alternative to a blog, Scribd&amp;rsquo;s pitch is that the timestamped nature of blog postings makes them look stale after a while, so they hope that their alternative will appeal more to someone who posts two short stories in February and then another one in November. According to their &lt;a href=&#34;http://www.scribd.com/static/tour#&#34;&gt;30 second tour&lt;/a&gt;, they &amp;ldquo;get wide readership for your work by people who care,&amp;rdquo; although clicking &amp;ldquo;SEE MORE&amp;rdquo; after that line displays information about how Google indexes their content (really!) and Scribd tags and categorizes the content, with no real explanation of how the &amp;ldquo;people who care&amp;rdquo; will come to read your work on their site. I know the answer, and it gets back to the YouTube analogy: they hope people will be blogging and emailing that they saw a funny/interesting/wacky piece on Scribd, just as we all do now with YouTube. They have a chicken-egg situation with their building of a content collection and their building of an audience, but if they can hustle some good press it might work.&lt;/p&gt;
&lt;p&gt;One interesting attempt at garnering free content is the recent launch, complete with a &lt;a href=&#34;http://gilbane.com/gr_news_9.19.06.html&#34;&gt;press release&lt;/a&gt;, of a &lt;a href=&#34;http://gilbane.com/ctoblog/&#34;&gt;Content Technology CTO Blog&lt;/a&gt; by the Gilbane Group (&amp;ldquo;Content Technologies, Trends &amp;amp; Advice&amp;rdquo;). They provide it &amp;ldquo;as a service to the content and information technology community. The purpose of the blog is to facilitate ongoing discussion and debate on technologies, approaches and architectures relevant to enterprise content applications,&amp;rdquo; according to its &lt;a href=&#34;http://gilbane.com/ctoblog/archives/2006/07/the_new_content_technology_cto.html&#34;&gt;FAQ&lt;/a&gt; and first posting. Since that posting, there have been a total of four more postings from the thirteen listed contributors, with the last being over a month ago. I have a good topic for discussion on the Content Technology CTO Blog: what incentive can you give knowledgeable people to contribute content for free? The Gilbane Group can only hope that listed contributors such as &lt;a href=&#34;http://bill.cava.us/&#34;&gt;Bill Cava&lt;/a&gt; and &lt;a href=&#34;http://newton.typepad.com/&#34;&gt;John Newton&lt;/a&gt; add their opinions on this topic to the Gilbane blog and not their own.&lt;/p&gt;
&lt;p&gt;If anyone wants to write for me for free, let me know. We could call it &amp;ldquo;The Snee Report&amp;rdquo; or something, as long as I own &lt;a href=&#34;http://www.snee.com/about.html&#34;&gt;the domain name&lt;/a&gt;. Advertising revenue won&amp;rsquo;t be enough to even bother with splitting it up, so I&amp;rsquo;ll keep it myself&amp;hellip; but just think of the exposure!&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://radar.oreilly.com&#34; title=&#34;http://radar.oreilly.com&#34;&gt;Tim O&amp;rsquo;Reilly&lt;/a&gt; on &lt;a href=&#34;#comment-571&#34;&gt;October 24, 2006 8:48 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob &amp;ndash; You&amp;rsquo;re right to note that O&amp;rsquo;Reilly has cut back on the number of xml.com articles we publish. And I&amp;rsquo;m not happy about it.&lt;/p&gt;
&lt;p&gt;But when you imply that we&amp;rsquo;re doing this because we&amp;rsquo;re just hoping for a free ride, I think you&amp;rsquo;re being a bit unfair. Yes, we are hoping to keep xml.com a valuable gathering place by means of free content, but that&amp;rsquo;s because the paid articles have never made any money.&lt;/p&gt;
&lt;p&gt;The sad fact is that CPMs are very low on technical content, and that while the O&amp;rsquo;Reilly Network as a whole has always made a bit of money, it&amp;rsquo;s been because of a very small number of articles that paid for themselves hundreds of times over, and free content. When we analyze costs vs. revenue from ORN articles, we see that only a tiny percentage of articles actually cover their costs. (And I mean tiny, fewer than a hundred of the many thousands of paid articles we&amp;rsquo;ve published over the years.)&lt;/p&gt;
&lt;p&gt;At our current average ad rates, an article has to get something like 60-80,000 page views to break even&amp;ndash;50,000 to cover just what we pay the authors. We could have addressed this by whoring after page views with sensational content, or tilting our online publishing towards more advertiser-friendly higher-CPM topics. But we&amp;rsquo;ve wanted to continue serving hardcore technical communities, and we hoped we could do so by harnessing the enthusiasm of people who want to write about their topics just because they love them.&lt;/p&gt;
&lt;p&gt;So while you position this as if we&amp;rsquo;ve been trying to get a free ride, it&amp;rsquo;s actually the other way around. We&amp;rsquo;ve been trying to make a business out of something that probably should remain (as it is once again becoming) a gathering of enthusiasts.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re well aware that in the age of RSS, the advantages of aggregating content in one place are reduced, but we like to think that there is still a place for a site like xml.com, even if it only aggregates pointers to the excellent writing that folks like you, Edd, Kendall, and others are doing on your own blogs. And yes, we do think that for lesser known bloggers, having a blog on the O&amp;rsquo;Reilly Network will provide advantages of exposure and credibility. But it is unfortunate if those bloggers resort to random musings rather than on-topic postings.&lt;/p&gt;
&lt;p&gt;Online publishing is in transition. We&amp;rsquo;re all trying to find the best way forward. As you learn what works for you, I hope you&amp;rsquo;ll continue to report on it here.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-572&#34;&gt;October 24, 2006 9:42 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Tim. I know that you&amp;rsquo;re running a business with bills to pay. I don&amp;rsquo;t mean to sound accusing when I say &amp;ldquo;for O&amp;rsquo;Reilly, free is better,&amp;rdquo; because free is better for everyone, including people reading XML.com articles without paying for them. I&amp;rsquo;m just sorry that the business model can&amp;rsquo;t support the higher percentage of relevant, professionally written and edited content that used to be available.&lt;/p&gt;
&lt;p&gt;By Kendall on &lt;a href=&#34;#comment-573&#34;&gt;October 24, 2006 10:24 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hmm, while I might offer one or another divergent opinion here (most ironically: I&amp;rsquo;ve never published about XML on any weblog whatever, much less one not connected to XML.com. I&amp;rsquo;ve only ever written about XML for $$! :)), I will add another point that I know both Bob and Tim know:&lt;/p&gt;
&lt;p&gt;Yes, the online world is struggling to find a sustainable business model (non-porn, biz model, that is), but the &lt;em&gt;other&lt;/em&gt; issue that is confronting XML.com specifically is that XML has become infrastructure. Because of its success and ubiquity, the process of learning and teaching XML has been pushed off onto lots of other mechanisms (vendor training, academia, etc).&lt;/p&gt;
&lt;p&gt;XML was the hot thing for longer than most hot things, but even its star has faded. That&amp;rsquo;s right and fine and a mark of its success. So XML.com has to re-invent itself, and it&amp;rsquo;s done that in part by focusing on other, related things like AJAX, Web 2.0, etc.&lt;/p&gt;
&lt;p&gt;So, yes, everyone needs a new biz model. But we in the tech world better build into ours a set of assumptions about the cyclicality of technical issues, since that&amp;rsquo;s just an unavoidable fact of our working life.&lt;/p&gt;
&lt;p&gt;Oh, and I agree with Tim: I&amp;rsquo;m not happy about only publishing one piece per week either, but we are publishing more PDFs these days in the place of those articles. Of course those PDFs cost $$, so we&amp;rsquo;ll see if that&amp;rsquo;s something that the audience is interested in.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://norman.walsh.name/&#34; title=&#34;http://norman.walsh.name/&#34;&gt;Norman Walsh&lt;/a&gt; on &lt;a href=&#34;#comment-574&#34;&gt;October 26, 2006 8:01 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d be more interested if it was in XML, even XHTML, instead of PDF.&lt;/p&gt;
&lt;p&gt;ObOnTopic: Excellent post, Bob. Sorry for the tangential remark.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Somewhat customized mass publishing</title>
      <link>https://www.bobdc.com/blog/somewhat-customized-mass-publi/</link>
      <pubDate>Tue, 17 Oct 2006 07:57:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/somewhat-customized-mass-publi/</guid>
      
      
      <description><div>If you can find an audience of audiences.</div><div>&lt;p&gt;&lt;a href=&#34;http://realtytimes.com/c/sampleplus&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/yourname.jpg&#34; alt=&#34;[realty times demo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently received an email from a local friend who I know through his trumpet playing, and his email&amp;rsquo;s signature revealed his day job: he&amp;rsquo;s a real estate agent. The signature pointed to his &lt;a href=&#34;http://realtytimes.com/94/TomBibb&#34;&gt;&amp;ldquo;Realty Times&amp;rdquo; newsletter&lt;/a&gt; that greatly impressed me for about four seconds, as I thought: &amp;ldquo;He&amp;rsquo;s researching, writing, and finding illustrations for this many stories every month?&amp;rdquo; His picture is at the top, with a link to his business&amp;rsquo;s home page, and his phone number, email address, and postal address are all in the footer of the newsletter.&lt;/p&gt;
&lt;p&gt;Seeing &lt;code&gt;/94/TomBibb&lt;/code&gt; after the realtytimes.com domain name of his newsletter&amp;rsquo;s URL suggested to me that other people had a similar, if not identical newsletter, and the &lt;a href=&#34;http://www.google.com/search?hl=en&amp;amp;lr=&amp;amp;safe=off&amp;amp;q=site%3Arealtytimes.com+intitle%3A%22real+estate+update%22&amp;amp;btnG=Search&#34;&gt;right Google search&lt;/a&gt; showed that there were quite a few. The most interesting one is the newsletter for real estate agent &lt;a href=&#34;http://realtytimes.com/c/sampleplus&#34;&gt;Your Name&lt;/a&gt;, a blond woman whose hair adds five inches to her height. Before I saw it, I had planned to title this weblog posting &amp;ldquo;Fake customized publishing,&amp;rdquo; but Ms. Name&amp;rsquo;s newsletter showed me that the newsletter includes slots for &amp;ldquo;YOUR OWN ARTICLE&amp;rdquo; on the right and a &amp;ldquo;FEATURED LINK BOX&amp;rdquo; where you &amp;ldquo;Put a link of your choice here (i.e. listing of the week, helpful consumer site, etc.)&amp;rdquo; The realtytimes.com &lt;a href=&#34;http://realtytimes.com/&#34;&gt;main page&lt;/a&gt; does have news stories, but their main activity seems to be the accumulation of content so that they can charge others for rebranding it.&lt;/p&gt;
&lt;p&gt;Syndication (selling content to be rebranded and repackaged by your customers) is not a new idea. The Internet makes finding syndication customers and distributing to them easier, and it makes becoming a customer easier—it&amp;rsquo;s not very practical for a one-person business to become a customer of a grand old syndication network like the &lt;a href=&#34;http://www.ap.org/&#34;&gt;Associated Press&lt;/a&gt; or &lt;a href=&#34;http://www.kingfeatures.com/&#34;&gt;King Features&lt;/a&gt;. The Internet also makes syndication less necessary—why read a New York Times story on your local newspaper&amp;rsquo;s web site when you can read it at &lt;a href=&#34;http://www.nytimes.com/&#34;&gt;nytimes.com&lt;/a&gt;? To come up with a new, Internet-based syndication business model, as realtytimes.com did, you need to find customers with their own audiences and little reason to go after each other&amp;rsquo;s audiences. Local real estate markets are a great choice. Does anyone know of other examples of Internet-based syndication where someone found an audience of customers, each with their own separate audience to publish to?&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ccil.org/~cowan&#34; title=&#34;http://www.ccil.org/~cowan&#34;&gt;John Cowan&lt;/a&gt; on &lt;a href=&#34;#comment-566&#34;&gt;October 17, 2006 8:18 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Although &lt;a href=&#34;http://en.wikipedia.org/wiki/Associated_Press&#34;&gt;AP&lt;/a&gt; certainly began as a syndication network (indeed, as a &lt;a href=&#34;http://en.wikipedia.org/wiki/Virtual_corporation&#34;&gt;virtual corporation&lt;/a&gt;: it was originally just a bunch of agreements, some leased telegraph lines, and a few reporters covering the &lt;a href=&#34;http://en.wikipedia.org/wiki/Mexican%E2%80%93American_War&#34;&gt;Mexican War&lt;/a&gt;), the overwhelming majority of its content that is of more than local interest is now self-generated.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ebiquity.umbc.edu/blogger/&#34; title=&#34;http://ebiquity.umbc.edu/blogger/&#34;&gt;Tim Finin&lt;/a&gt; on &lt;a href=&#34;#comment-567&#34;&gt;October 18, 2006 8:22 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://ebiquity.umbc.edu/blogger/splog-software-from-hell/&#34;&gt;Splogs&lt;/a&gt; provide an example from even deeper into the dark side. Sometimes we just can&amp;rsquo;t escape the fact that we are from the third species of chimpanzees.&lt;/p&gt;
&lt;p&gt;By Paul Anderson on &lt;a href=&#34;#comment-568&#34;&gt;October 20, 2006 10:06 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This idea is quite common in UK politics. A national party will produce a single general newsletter describing all the good things MP X has done, provide it to each local party and the local party then replaces the label MP X with the name and photo of their local MP and then distributes. I&amp;rsquo;ve not seen this done with e-mail or MP&amp;rsquo;s website content though (yet!).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-569&#34;&gt;October 20, 2006 10:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Paul,&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s an excellent example. A national political party is a centralized organization with the resources to generate professional content that they can provide to geographically localized organizations for customization.&lt;/p&gt;
&lt;p&gt;Thinking about it as a marketing campaign that seeks to combine a consistent message with localized content, I realized that many wide-scale advertising campaigns for commercial products do something similar&amp;ndash;for example, Goodyear or Michelin might provide articles on tire care tips to auto repair shops to incorporate into their own hard copy or online publicity.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>All the personal data you want</title>
      <link>https://www.bobdc.com/blog/all-the-personal-data-you-want/</link>
      <pubDate>Wed, 04 Oct 2006 08:55:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/all-the-personal-data-you-want/</guid>
      
      
      <description><div>Except that it&#39;s all fake.</div><div>&lt;p&gt;I needed some sample address book data for a project that I&amp;rsquo;m working on. Because of the number of people who may see it, I didn&amp;rsquo;t want to use real address book entries, so I wrote some &lt;a href=&#34;http://www.snee.com/xml/misc/generateAddrBook.zip&#34;&gt;Python scripts&lt;/a&gt; to generate some.&lt;/p&gt;
&lt;p&gt;I spread it across a few scripts because I wanted to generate data for different schemas. I put the main data generation functions in one file and then call those functions and format the data in scripts that are specialized for their particular output format. You can use these to generate data for a relational database, XML, your favorite RDF flavor, or whatever you like. The basic library has functions such as &lt;code&gt;firstName()&lt;/code&gt; and &lt;code&gt;zipCode()&lt;/code&gt; to generate random values, with some, like &lt;code&gt;middleName()&lt;/code&gt; and &lt;code&gt;note()&lt;/code&gt;, sometimes returning nothing. I have two scripts that use the library: one generates a CSV file that emulates one exported from Microsoft Outlook 2003, and the other emulates the CSV address file exported by Eudora 7. (Did you know that Eudora can&amp;rsquo;t import the CSV files that it exports?)&lt;/p&gt;
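&lt;p&gt;The pattern behind the library is simple enough to sketch; this toy version (name lists abbreviated, probabilities my own) mimics the real scripts&amp;rsquo; habit of sometimes returning an empty middle name:&lt;/p&gt;

```python
import random

FIRST_NAMES = ["James", "Mary", "Robert", "Patricia", "John", "Jennifer"]
LAST_NAMES = ["Smith", "Johnson", "Williams", "Brown", "Jones"]

def firstName():
    return random.choice(FIRST_NAMES)

def middleName():
    # Like the real library, sometimes return nothing at all.
    return random.choice(FIRST_NAMES) if random.random() < 0.6 else ""

def zipCode():
    return "%05d" % random.randint(501, 99950)

def addressBookRow():
    """One row of fake address-book data, ready for CSV output."""
    return [firstName(), middleName(), random.choice(LAST_NAMES), zipCode()]
```

&lt;p&gt;A format-specific script then just calls the row function in a loop and writes the fields out in whatever column order Outlook or Eudora expects.&lt;/p&gt;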
&lt;p&gt;The data is pretty US-oriented, but a few tweaks should adapt it for other countries. It randomly picks first and middle names from the US census list of most popular &lt;a href=&#34;http://www.census.gov/genealogy/names/dist.male.first%20&#34;&gt;male&lt;/a&gt; and &lt;a href=&#34;http://www.census.gov/genealogy/names/dist.female.first&#34;&gt;female&lt;/a&gt; names and surnames from the census list of most popular &lt;a href=&#34;http://www.census.gov/genealogy/names/dist.all.last&#34;&gt;last names&lt;/a&gt;. It took very little web searching to find the most popular &lt;a href=&#34;http://www.santacruzpl.org/readyref/files/q-s/stnames.shtml&#34;&gt;street names&lt;/a&gt; and &lt;a href=&#34;http://www.hoobly.com/m/popular&#34;&gt;US Cities&lt;/a&gt;, and for employer names I went with the &lt;a href=&#34;http://en.wikipedia.org/wiki/List_of_Fortune_500#401-500&#34;&gt;last 100&lt;/a&gt; of the Fortune 500.&lt;/p&gt;
&lt;p&gt;The generated data has plenty of incongruities. Middle names are randomly picked separately from first names, so male and female names are often mixed. The same happens with city and state names, so that Albert Victoria Freeman Jr. may live in Baltimore, California. To convert an employer name to a domain name for a work email address, I just took out spaces and punctuation, converted to lower-case, and put &amp;ldquo;.com&amp;rdquo; at the end, which can result in some long domain names.&lt;/p&gt;
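The zip of scripts linked above has the real code; as a rough, hypothetical sketch of the shape of such a generator (with made-up name, street, city, and employer lists standing in for the census and Fortune 500 data), something like this works:

```python
import random

# Made-up stand-ins for the census-derived lists the real scripts use.
FIRST_NAMES = ["James", "Mary", "Robert", "Linda", "Albert", "Victoria"]
LAST_NAMES = ["Smith", "Johnson", "Freeman", "Jones"]
STREETS = ["Second", "Park", "Main", "Oak"]
CITIES = ["Baltimore", "Houston", "Columbus"]
STATES = ["California", "New York", "Texas"]
EMPLOYERS = ["Acme Products, Inc.", "Globex Corp.", "Initech"]

def middle_name():
    # Like middleName() in the post: sometimes returns nothing.
    return random.choice(FIRST_NAMES) if random.random() < 0.5 else ""

def work_email(first, last, employer):
    # The post's rule: drop spaces/punctuation, lower-case, append ".com".
    domain = "".join(ch for ch in employer if ch.isalnum()).lower() + ".com"
    return "%s.%s@%s" % (first.lower(), last.lower(), domain)

def record():
    first, last = random.choice(FIRST_NAMES), random.choice(LAST_NAMES)
    return {
        "name": " ".join(n for n in [first, middle_name(), last] if n),
        "street": "%d %s St." % (random.randint(1, 9999), random.choice(STREETS)),
        # City and state are picked independently, hence the incongruities.
        "city": random.choice(CITIES),
        "state": random.choice(STATES),
        "zip": "%05d" % random.randint(0, 99999),
        "email": work_email(first, last, random.choice(EMPLOYERS)),
    }

print(record())
```

Formatting the dictionary that `record()` returns is then the job of the per-schema scripts, whether the target is CSV, XML, or RDF.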
&lt;p&gt;I&amp;rsquo;ve &lt;a href=&#34;http://www.snee.com/bob/worksch.html#i1&#34;&gt;always enjoyed&lt;/a&gt; generating random content that faked the appearance of semantic value. One event in particular inspired me about twenty-three years ago, when the only programming languages I knew were Microsoft Basic and dBase II. I was in the early stages of a &amp;ldquo;poetry&amp;rdquo; generation program that only had seven or eight possible verbs, and all the nouns were pronouns, and it came out with this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;It thinks. 
It scares her.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Try to picture it on green and white paper in a dot matrix font.) The heart of all of these is the random function; when coding for fun, seeing different output each time is often more entertaining than consistent output. I&amp;rsquo;ve recently figured out how I can generate multi-part music from an XSLT script, which I&amp;rsquo;ll make public somewhere once I have the time to actually implement it and write it up.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-563&#34;&gt;October 4, 2006 10:18 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Fun stuff :)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve recently been wondering about making a hosted version of the Dada Engine (the system behind the Postmodernism generator you link to). I was actually thinking of it for language learning apps, after noticing that many language courses seem based around exercising one&amp;rsquo;s ability to translate variations on a theme (&amp;ldquo;I want to&amp;hellip;&amp;rdquo;, &amp;ldquo;You need to &amp;hellip;&amp;rdquo;, &amp;ldquo;We used to&amp;hellip;&amp;rdquo;, / &amp;ldquo;eat spaghetti&amp;rdquo; / &amp;ldquo;drink red|white|green wine&amp;rdquo; / &amp;ldquo;quickly&amp;rdquo; &amp;ldquo;slowly&amp;rdquo; / &amp;ldquo;tonight&amp;rdquo; &amp;ldquo;tomorrow&amp;rdquo; &amp;ldquo;every day&amp;rdquo; &amp;hellip;etc.)&lt;/p&gt;
&lt;p&gt;After playing a little (&lt;a href=&#34;http://spypixel.com/2006/spanglish/testme.cgi&#34;&gt;http://spypixel.com/2006/spanglish/testme.cgi&lt;/a&gt; &lt;a href=&#34;http://spypixel.com/2006/spanglish/&#34;&gt;http://spypixel.com/2006/spanglish/&lt;/a&gt; &amp;hellip;) I realised my grammar skills (machine and human language!) weren&amp;rsquo;t up to it, &amp;hellip; and maybe some sort of wiki or collaborative effort would let people write better dada engine scripts communally.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve a hunch that a hosted dada engine system could catch on, &amp;hellip; but what it really needs is some changes to give a bit of a UI for grammar creation, and to have some mechanisms for modularisation so that sentence-fragments can more easily be shared across a group of users.&lt;/p&gt;
&lt;p&gt;File under: ProcrastinationOpportunities :)&lt;/p&gt;
&lt;p&gt;By deltabob on &lt;a href=&#34;#comment-564&#34;&gt;October 5, 2006 7:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Ah&amp;hellip;green bar paper. I miss it so. I still have a box somewhere of old email printouts from college on green bar paper.&lt;/p&gt;
&lt;p&gt;I like the idea of the randomly generated name/address info. Sometimes when looking into the people finder databases, it feels like that&amp;rsquo;s how they were populated.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>w2k batch file programming</title>
      <link>https://www.bobdc.com/blog/w2k-batch-file-programming/</link>
      <pubDate>Wed, 20 Sep 2006 09:06:26 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/w2k-batch-file-programming/</guid>
      
      
      <description><div>Windows batch files have come a long way since I last paid attention.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.xmission.com/~comphope/sethlp.htm&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/cprompt.jpg&#34; alt=&#34;[cmd window icon]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Besides Emacs and Firefox, the program I use most is probably the Windows command prompt. My daughters make fun of it, and telling them that in the old days the entire computer screen was just one big version of that window is like telling them that at their age, I didn&amp;rsquo;t know that &amp;ldquo;The Wizard of Oz&amp;rdquo; went from black and white to color when Dorothy landed in Oz because our TV set showed everything in black and white. Still, the &amp;ldquo;DOS box&amp;rdquo; is the command line interface to the operating system that I need to use the most.&lt;/p&gt;
&lt;p&gt;Years ago I often tried to push the capabilities of the batch file language further, but it&amp;rsquo;s not a real scripting language. To get beyond simple tasks, I would just write a perl script to do what I needed or else make the argument that the process in question was better off running on a Unix-based box. Recently, however, I learned that the Windows 2000 command language can do more proper programming language tasks than I realized, such as finding substrings of variables, prompting for input, differentiating between string and numeric variables, and even generating random numbers.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll keep the description to a minimum and just demonstrate. First, substrings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   C:\&amp;gt;set teststring=&amp;quot;abcdefghijkl&amp;quot;

  
   C:\&amp;gt;set testsubstring=%teststring:~2,4%

  
   C:\&amp;gt;echo %testsubstring%
   bcde
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Prompting the user and storing the result in a variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   C:\&amp;gt;set /P color=&amp;quot;what color?&amp;quot;
   what color?red

  
   C:\&amp;gt;echo %color%
   red
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Numeric variables:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   C:\&amp;gt;set i=3

  
   C:\&amp;gt;set x=%i%+5

  
   C:\&amp;gt;echo %x%
   3+5

  
   C:\&amp;gt;set /a i=3
   3
   C:\&amp;gt;set /a x=%i%+5
   8
   C:\&amp;gt;echo %x%
   8
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My favorite thing in any programming language: a random function!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   C:\&amp;gt;echo %random%
   28278

  
   C:\&amp;gt;echo %random%
   3127

  
   C:\&amp;gt;set /a i=%random% % 3
   1
   C:\&amp;gt;echo %i%
   1
   C:\&amp;gt;set /a i=%random% % 3
   1
   C:\&amp;gt;set /a i=%random% % 3
   0
   C:\&amp;gt;set /a i=%random% % 3
   1
   C:\&amp;gt;set /a i=%random% % 3
   0
   C:\&amp;gt;set /a i=%random% % 3
   2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can stack commands on one line, and if you use two ampersands, the failure of one command means that the command processor won&amp;rsquo;t try to execute succeeding ones:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   C:\&amp;gt;cd \bin &amp;amp;&amp;amp; echo cd successful
   cd successful

  
   C:\bin&amp;gt;cd \xx &amp;amp;&amp;amp; echo cd successful
   The system cannot find the path specified.

  
   C:\bin&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Something that isn&amp;rsquo;t part of the command language, but which an Emacs geek like me was happy to learn about, is tabbing for completion of both directory and file names. In addition to completing a name, it often adds quotes for you:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   C:&amp;gt;cd doc[tab]
   C:\&amp;gt;cd &amp;quot;Documents and Settings&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I learned about the substring trick on a mailing list, and when I did a few searches to find out more, I learned about the others at two articles that I found: &lt;a href=&#34;http://malektips.com/windows_2000_and_dos_help_and_tips.html&#34;&gt;Windows 2000 and DOS Help and Tips&lt;/a&gt; and &lt;a href=&#34;http://www.xmission.com/~comphope/sethlp.htm&#34;&gt;Information about the SET command&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m guessing that for each person reading this, some of this is old news, but that one or two things are new to you. I know that bash, sh, and other Unix scripting shells are in &lt;a href=&#34;http://www.cygwin.com/&#34;&gt;cygwin&lt;/a&gt; and therefore available to Windows users, so that there&amp;rsquo;s still no reason to write complex scripts in the Windows batch command language. Still, it was fun to see that an old dog that I&amp;rsquo;d lived with for many years had learned a few new tricks when I wasn&amp;rsquo;t watching.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://kontrawize.blogs.com/kontrawize/&#34; title=&#34;http://kontrawize.blogs.com/kontrawize/&#34;&gt;Anthony B. Coates&lt;/a&gt; on &lt;a href=&#34;#comment-552&#34;&gt;September 25, 2006 5:21 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You should also check out the information about the FOR command, which since Win2K has been able to recurse over all the files in a directory tree, among other things.&lt;/p&gt;
&lt;p&gt;Cheers, Tony.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-553&#34;&gt;September 25, 2006 7:44 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Tony! Also, I didn&amp;rsquo;t mention the IF command, which doesn&amp;rsquo;t seem to have progressed over the years, but does give batch files a fairly basic programming language capability. (And, I suppose, GOTO, something that the younger folk may not even recognize.)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>UMD is number one!</title>
      <link>https://www.bobdc.com/blog/umd-is-number-one/</link>
      <pubDate>Tue, 12 Sep 2006 08:19:04 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/umd-is-number-one/</guid>
      
      
      <description><div>At more than one thing, apparently.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.mindswap.org/&#34;&gt;&lt;img src=&#34;http://www.umresearch.umd.edu/images2/webglobe.gif&#34; alt=&#34;[UMD logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you were going to do academic research on the semantic web, the University of Maryland would have to merit your serious attention, because a lot of important work has gone on there and continues to do so. Is it pure coincidence that &lt;a href=&#34;http://www.hightimes.com/ht/entertainment/content.php?bid=976&amp;amp;aid=24&#34;&gt;High Times magazine picked them&lt;/a&gt; as the number one school for &amp;ldquo;counterculture activity&amp;rdquo;? Keep in mind that, as the author of the article &amp;ldquo;The High Times Guide to Higher Education&amp;rdquo; put it, a key criterion for school selection is &amp;ldquo;Is this a good school for stoners?&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://ebiquity.umbc.edu/blogger/&#34; title=&#34;http://ebiquity.umbc.edu/blogger/&#34;&gt;Tim Finin&lt;/a&gt; on &lt;a href=&#34;#comment-537&#34;&gt;September 12, 2006 9:40 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sigh&amp;hellip; There is &lt;a href=&#34;http://umbc.edu/&#34;&gt;another University of Maryland campus&lt;/a&gt; that is also doing a lot of &lt;a href=&#34;http://ebiquity.umbc.edu/tag/semantic%20web&#34;&gt;semantic web research&lt;/a&gt;. Unfortunately, we are just not in the same league as &lt;a href=&#34;http://www.umcp.umd.edu/&#34;&gt;UMCP&lt;/a&gt;. Not only did we not even make the High Times list of &lt;a href=&#34;http://www.hightimes.com/ht/entertainment/content.php?bid=976&amp;amp;aid=24&#34;&gt;stoner schools&lt;/a&gt;, we don&amp;rsquo;t even have a football team. Worse yet, our most successful intercollegiate team is the &lt;a href=&#34;http://sta.umbc.edu/orgs/chess/&#34;&gt;Chess team&lt;/a&gt;. It&amp;rsquo;s hopeless &amp;ndash; we&amp;rsquo;ll never catch up.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-538&#34;&gt;September 12, 2006 9:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I must admit, I didn&amp;rsquo;t realize that Swoop and Swoogle came from two different UMD campuses until I did a few searches before writing this post. It sounds like there are a lot of people thinking deep thoughts all over UMD, for a variety of reasons&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ebiquity.umbc.edu/blogger/&#34; title=&#34;http://ebiquity.umbc.edu/blogger/&#34;&gt;Tim Finin&lt;/a&gt; on &lt;a href=&#34;#comment-539&#34;&gt;September 12, 2006 10:51 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yes, we are all high on the Semantic Web.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>A great feature in AltaVista, alltheweb, and Ask.com, but not in Google</title>
      <link>https://www.bobdc.com/blog/a-great-feature-in-altavista-a/</link>
      <pubDate>Fri, 08 Sep 2006 08:04:58 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-great-feature-in-altavista-a/</guid>
      
      
      <description><div>Searching by date.</div><div>&lt;p&gt;What if you wanted to find documents on the web from before September 11, 2001 that mention Osama Bin Laden? What if you wanted to find the web&amp;rsquo;s earliest mention of Beck Hansen? You can&amp;rsquo;t do this on Google, but you can on &lt;a href=&#34;http://www.altavista.com/web/adv&#34;&gt;AltaVista&amp;rsquo;s advanced search page&lt;/a&gt; (&lt;a href=&#34;http://www.altavista.com/web/results?itag=ody&amp;amp;pg=aq&amp;amp;aqmode=s&amp;amp;aqa=&amp;amp;aqp=osama+bin+laden&amp;amp;aqo=&amp;amp;aqn=&amp;amp;aqb=&amp;amp;kgs=1&amp;amp;kls=0&amp;amp;d2=0&amp;amp;dt=dtrange&amp;amp;dfr%5Bd%5D=1&amp;amp;dfr%5Bm%5D=1&amp;amp;dfr%5By%5D=1980&amp;amp;dto%5Bd%5D=10&amp;amp;dto%5Bm%5D=9&amp;amp;dto%5By%5D=2001&amp;amp;filetype=&amp;amp;rc=dmn&amp;amp;swd=&amp;amp;lh=&amp;amp;nbq=10&#34;&gt;Bin Laden&lt;/a&gt;, &lt;a href=&#34;http://www.altavista.com/web/results?itag=ody&amp;amp;pg=aq&amp;amp;aqmode=s&amp;amp;aqa=&amp;amp;aqp=beck+hansen&amp;amp;aqo=&amp;amp;aqn=&amp;amp;aqb=&amp;amp;kgs=1&amp;amp;kls=0&amp;amp;d2=0&amp;amp;dt=dtrange&amp;amp;dfr%5Bd%5D=1&amp;amp;dfr%5Bm%5D=1&amp;amp;dfr%5By%5D=1980&amp;amp;dto%5Bd%5D=6&amp;amp;dto%5Bm%5D=9&amp;amp;dto%5By%5D=1996&amp;amp;filetype=&amp;amp;rc=dmn&amp;amp;swd=&amp;amp;lh=&amp;amp;nbq=10&#34;&gt;Beck&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In between Yahoo first popularizing the idea of a web search engine and Google becoming the dominant one, AltaVista had its fifteen minutes as the trendy one among geeks. I now mostly use Google, but when I had free access to my former employer&amp;rsquo;s otherwise non-free search engine, I grew to appreciate the ability to limit a document search by date, especially when an unqualified search yielded an unmanageable number of hits. (A simple AltaVista search for &amp;ldquo;Osama Bin Laden&amp;rdquo; gets 23 million hits, and a Google search gets 28.5 million.)&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.alltheweb.com/advanced?advanced=1&#34;&gt;alltheweb&amp;rsquo;s advanced search page&lt;/a&gt; also offers this, as does &lt;a href=&#34;http://www.ask.com/webadvanced?o=0&amp;amp;l=dir&#34;&gt;Ask.com&amp;rsquo;s advanced search page&lt;/a&gt;. For some odd reason the &lt;a href=&#34;http://www.ask.com&#34;&gt;Ask.com main page&lt;/a&gt; doesn&amp;rsquo;t link to their advanced search page, but their Search Results page does. (&lt;a href=&#34;https://www.teoma.com&#34;&gt;www.teoma.com&lt;/a&gt; links to &lt;a href=&#34;http://search.ask.com/&#34;&gt;an Ask.com page&lt;/a&gt; that does include a link to the advanced search page.) When I filled out the date fields of Ask.com&amp;rsquo;s form, their system inserted the keyword &lt;code&gt;betweendate:&lt;/code&gt; into the query. It also suggested that the keyword it inserted was a spelling mistake, and that perhaps I had meant to insert the two words &amp;ldquo;between date.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The search engines&amp;rsquo; date range constraints don&amp;rsquo;t always work perfectly, but they&amp;rsquo;re often helpful. For example, as with similar searches on the &lt;a href=&#34;http://groups.google.com/advanced_search?hl=en&#34;&gt;Google Groups advanced search page&lt;/a&gt;, it&amp;rsquo;s fun to look for the earliest available mention of a term or name. I once saw an article decrying what the author called &amp;ldquo;nexis journalism,&amp;rdquo; in which someone simply used one of my former employer&amp;rsquo;s products to track the use of a term over time and then wrote an article describing their results; now you can do the same!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>RDFS without RDF/OWL?</title>
      <link>https://www.bobdc.com/blog/rdfs-without-rdfowl/</link>
      <pubDate>Fri, 01 Sep 2006 08:07:47 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdfs-without-rdfowl/</guid>
      
      
      <description><div>Has RDF Schema become merely a layer of RDF/OWL?</div><div>&lt;p&gt;&lt;a href=&#34;http://www.w3.org/TR/rdf-schema/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/rdfs.png&#34; alt=&#34;[RDF/OWL and RDF schema]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;First, I&amp;rsquo;d like to thank &lt;a href=&#34;http://dannyayers.com/&#34;&gt;Danny&lt;/a&gt;, &lt;a href=&#34;http://www.wasab.dk/morten/blog/&#34;&gt;Morten&lt;/a&gt;, Damian, &lt;a href=&#34;http://dowhatimean.net/&#34;&gt;Richard&lt;/a&gt;, &lt;a href=&#34;http://clarkparsia.com/&#34;&gt;Kendall&lt;/a&gt;, and &lt;a href=&#34;http://www.topquadrant.com/tq_management.htm&#34;&gt;Ralph&lt;/a&gt; for their insights about my &lt;a href=&#34;https://www.bobdc.com/blog/rdfowl-for-data-silo-integrati&#34;&gt;question&lt;/a&gt; last week on existing use of RDF/OWL for integration of separate, non-RDF databases. I&amp;rsquo;ve been doing some follow-up research via email, and will report back on what I find out.&lt;/p&gt;
&lt;p&gt;This week&amp;rsquo;s question about real-world use of RDF metadata: is anybody using RDF Schema for the sake of RDF Schema, or has RDFS become little more than a layer of RDF/OWL? For example, we use &lt;code&gt;rdfs:domain&lt;/code&gt; and &lt;code&gt;rdfs:range&lt;/code&gt; in our RDF/OWL ontologies, but have &lt;code&gt;owl:Class&lt;/code&gt; and &lt;code&gt;owl:subClassOf&lt;/code&gt; completely replaced the use of &lt;code&gt;rdfs:Class&lt;/code&gt; and &lt;code&gt;rdfs:SubClassOf&lt;/code&gt;? In other words, has RDF/OWL, as an extension of RDFS, replaced the use of RDFS by itself, or is anyone still creating and using RDF Schemas that use nothing from the owl namespace?&lt;/p&gt;
&lt;p&gt;My guess is that the answer is no; no one creates RDF Schemas for the sake of RDF Schemas anymore. Tools exist to ease the creation of ontologies without requiring much interaction with RDF/OWL syntax, letting people add all the metadata that RDFS allows and more. RDFS has become just another namespace used to define certain aspects of RDF/OWL ontologies.&lt;/p&gt;
&lt;p&gt;Corrections? Confirmations? Counter-examples?&lt;/p&gt;
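To make concrete what &amp;ldquo;using RDFS by itself&amp;rdquo; buys you, here is a rough sketch, in plain Python with made-up &lt;code&gt;ex:&lt;/code&gt; terms and no RDF toolkit assumed, of two of the core RDFS entailment rules: rdfs2 (domain) and rdfs9 (subClassOf). Note that rdfs:domain licenses an inference about a subject's type rather than acting as a constraint:

```python
# Tiny forward-chaining sketch of two RDFS entailment rules. The ex: terms
# are invented for illustration; triples are plain (subject, predicate,
# object) tuples.
DOMAIN, SUBCLASS, TYPE = "rdfs:domain", "rdfs:subClassOf", "rdf:type"

def entail(triples):
    """Apply rules rdfs2 (domain) and rdfs9 (subClassOf) to a fixpoint."""
    triples = set(triples)
    while True:
        new = set()
        for (s, p, o) in triples:
            # rdfs2: (p rdfs:domain C) + (s p o)  =>  (s rdf:type C)
            for (p2, rel, c) in triples:
                if rel == DOMAIN and p2 == p:
                    new.add((s, TYPE, c))
            # rdfs9: (B rdfs:subClassOf A) + (s rdf:type B) => (s rdf:type A)
            if p == TYPE:
                for (b, rel, a) in triples:
                    if rel == SUBCLASS and b == o:
                        new.add((s, TYPE, a))
        if new <= triples:
            return triples
        triples |= new

g = entail({
    ("ex:worksFor", DOMAIN, "ex:Person"),
    ("ex:Employee", SUBCLASS, "ex:Person"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
})
# The domain statement *infers* a type; nothing is "violated":
assert ("ex:bob", TYPE, "ex:Person") in g
# And nothing concludes that ex:bob is an ex:Employee; domain statements
# do not flow down to subclasses the way OO-minded readers often expect.
assert ("ex:bob", TYPE, "ex:Employee") not in g
```

Everything the schema vocabulary does by itself is of this inferential flavor, which is part of why the question of using it without OWL arises at all.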
&lt;h2 id=&#34;13-comments&#34;&gt;13 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.dfki.uni-kl.de/~grimnes/&#34; title=&#34;http://www.dfki.uni-kl.de/~grimnes/&#34;&gt;Gunnar Grimnes&lt;/a&gt; on &lt;a href=&#34;#comment-515&#34;&gt;September 1, 2006 9:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My impression, that has no real backing - apart from my gut feeling, is that less people use owl, as more and more realise that really understanding and implementing the inferencing required to make owl work is almost impossible. (AND if you manage it&amp;rsquo;s horribly slow)&lt;br /&gt;
I thought more people use RDFS plus a few selected properties from either OWL or the protege extensions, i.e. cardinality constraints and (inverse)functional properties. (which of course means they dont use the intended RDFS semantics either&amp;hellip; )&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-516&#34;&gt;September 1, 2006 11:03 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t want to begin a discussion here of whether OWL&amp;rsquo;s popularity is increasing or decreasing and why, so I&amp;rsquo;ll skip to your second point: I proposed that people are only using RDFS constraints in combination with OWL, and you said that &amp;ldquo;more people use RDFS plus a few selected properties from either OWL or the protege extensions to set up an ontology.&amp;rdquo; Are you saying that the use of RDFS+Protege extensions is a common use of RDFS without OWL?&lt;/p&gt;
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://copia.ogbuji.net/blog&#34; title=&#34;http://copia.ogbuji.net/blog&#34;&gt;Chimezie&lt;/a&gt; on &lt;a href=&#34;#comment-517&#34;&gt;September 1, 2006 11:09 AM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;but have owl:Class and owl:subClassOf completely replaced the use of rdfs:Class and rdfs:SubClassOf? In other words, has RDF/OWL, as an extension of RDFS, replaced the use of RDFS by itself, or is anyone still creating and using RDF Schemas that use nothing from the owl namespace?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m not aware of owl:subClassOf, and owl:Class (and OWL in general) is only needed if the level of expressiveness needed in an ontology is not sufficiently covered by RDFS. In my experience, people tend to use OWL reflexively since its name suggests that it is an ontology language whereas RDFS isn&amp;rsquo;t.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;RDFS has become just another namespace used to define certain aspects of RDF/OWL ontologies.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Yes, essentially RDFS is really just a convenient subset of RDF/OWL with specific limitations for decidability.&lt;/p&gt;
&lt;p&gt;see: &lt;a href=&#34;http://owl1-1.cs.manchester.ac.uk/Tractable.html#6_RDFS&#34;&gt;http://owl1-1.cs.manchester.ac.uk/Tractable.html#6_RDFS&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-518&#34;&gt;September 1, 2006 11:25 AM&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In my experience, people tend to use OWL reflexively since its name suggests that it is an ontology language whereas RDFS isn&amp;rsquo;t.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think you hit the nail on the head.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-519&#34;&gt;September 1, 2006 12:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Swoogle folks should set up an Amazon wishlist&amp;hellip; then we can bug them with questions like this without feeling bad.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d love to see stats.&lt;/p&gt;
&lt;p&gt;RDFS, by design, is just one way of making simple claims about RDF vocabulary. OWL mostly just extends this, but there are a few awkwardnesses, where you&amp;rsquo;re forced to either jump one way, or the other, &amp;hellip; or be verbose and a bit tricksy. E.g., you can say that something is both an owl:Class and an rdfs:Class. Staying in OWL DL is an additional complication, of course&amp;hellip; And we&amp;rsquo;ve got a rules language coming down the pipeline soon, so this really is a neverending story&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-520&#34;&gt;September 1, 2006 1:11 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The tough part for the Swoogle query would be to ask by date, i.e. my question would really be answered by figures that show the total number of RDF Schemas with no use of the OWL namespace vs. the number of OWL ontologies that do use RDFS, broken down by year.&lt;/p&gt;
&lt;p&gt;My gut reaction is that the former figure would go down from year to year as the latter one went up.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ebiquity.umbc.edu/blogger/&#34; title=&#34;http://ebiquity.umbc.edu/blogger/&#34;&gt;tim finin&lt;/a&gt; on &lt;a href=&#34;#comment-521&#34;&gt;September 1, 2006 2:56 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We wondered about that too and looked into it for a paper to appear in ISWC 2006. I just made a &lt;a href=&#34;http://ebiquity.umbc.edu/blogger/2006/09/01/rdfs-without-rdfowl/&#34;&gt;post&lt;/a&gt; with some data from our paper.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ll think about adding a wish list that people can use to suggest new Swoogle features and also pose questions that we might be able to answer from the underlying database.&lt;/p&gt;
&lt;p&gt;Tim&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-522&#34;&gt;September 1, 2006 8:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tim &amp;mdash; Re wishlist, &amp;hellip; even just a page in the ESW wiki might be useful. E.g., I&amp;rsquo;m definitely interested to learn more about FOAF property usage, especially the current mess around names (firstname, surname, given family etc etc). Nice work with the Geo writeup too btw, that was very handy.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/weblog&#34; title=&#34;http://clarkparsia.com/weblog&#34;&gt;Bijan Parsia&lt;/a&gt; on &lt;a href=&#34;#comment-524&#34;&gt;September 3, 2006 12:06 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We have a &lt;a href=&#34;http://www.mindswap.org/papers/2006/survey.pdf&#34;&gt;similar paper&lt;/a&gt; to Tim&amp;rsquo;s in ISWC, with a different focus/analysis.&lt;/p&gt;
&lt;p&gt;First, a minor point: there is no owl:subClassOf. There is only rdfs:subClassOf.&lt;/p&gt;
&lt;p&gt;Second, I never see a point in using rdfs:Class unless you want to (rather artificially) stay in RDFS (oh, ok, you may want to work with RDFS *cough* reasoners, but most of them will do the right thing if you *use* the class as a class, e.g., as the object of a type triple). If you ever want to make use of OWL, it&amp;rsquo;s going to be a bit of a PITA. Not a huge one, I guess, but, eh, why cause that pain.&lt;br /&gt;
&lt;br /&gt;
Third, RDFS is just plain silly. I made an LC (or similar) comment advocating its removal. Alas, that didn&amp;rsquo;t fly :) OWL may be silly too, in a variety of ways, but it at least is fairly substantive in its expressive power. The tractable fragments page&amp;rsquo;s inclusion of RDFS (disclaimer: I work with Bernardo and chatted with him about it) is more for historical reasons, i.e., to isolate the fragment of RDFS that is OWL DL compatible.&lt;/p&gt;
&lt;p&gt;The person who wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes, essentially RDFS is really just a convenient subset of RDF/OWL with specific limitations for decidability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Is not correct. I do not believe that decidability was ever a concern of the designers of RDFS (&amp;ldquo;simplicity&amp;rdquo; perhaps) (I&amp;rsquo;d welcome a pointer to contrary information). RDFS metamodeling is not easily extensible without going undecidable, which is a hint.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My impression, that has no real backing - apart from my gut feeling, is that less people use owl, as more and more realise that really understanding and implementing the inferencing required to make owl work is almost impossible. (AND if you manage it&amp;rsquo;s horribly slow)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sigh. Why claim so much if it&amp;rsquo;s backed by nothing? In point of fact, it&amp;rsquo;s not impossible or even almost impossible. It&amp;rsquo;s not remotely impossible. And OWL DL reasoning isn&amp;rsquo;t horribly slow (depending on what you do and what you mean by slow). Are there scalability issues? Bien sur! But so? It&amp;rsquo;s true for everything. And normal RDF entailment (with the actual semantics) is intractable too.&lt;/p&gt;
&lt;p&gt;I tend to object to the use of the word &amp;ldquo;constraint&amp;rdquo; with RDFS&amp;hellip;RDFS is not constraint minded at all. To be a constraint, there needs to be a meaningful way to &lt;em&gt;violate&lt;/em&gt; the constraint. Practically speaking, there isn&amp;rsquo;t one in RDFS.&lt;/p&gt;
&lt;p&gt;By Irene Polikoff on &lt;a href=&#34;#comment-526&#34;&gt;September 4, 2006 12:44 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A small nitpick - there is no owl:subClassOf statement. The idea was to layer OWL on top of RDFS, so OWL vocabulary does not add statements if an equivalent statement already exists in RDFS.&lt;/p&gt;
&lt;p&gt;As far as the use of owl:Class, why not? There are no clear disadvantages to using it as opposed to rdfs:Class.&lt;/p&gt;
&lt;p&gt;The problem with using RDFS on its own is that it has very limited utility. On the other hand, RDFS plus a few OWL axioms (namely, inverse, transitive and inverse functional) does quite a bit.&lt;/p&gt;
&lt;p&gt;Going forward I can also see people using RDFS plus rules, possibly by-passing OWL.&lt;/p&gt;
&lt;p&gt;Another problem with RDFS is that it has very unconventional semantics; most people just don&amp;rsquo;t understand it. By this I mean the interaction between domain statements and subclasses. The majority of people who are familiar with notions of classes and properties assume that if class A is in the domain of property p and class B is a subclass of A, then B is also in the domain of property p, because in their mind subclasses are supposed to &amp;lsquo;have&amp;rsquo; all the properties of their parents.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-527&#34;&gt;September 4, 2006 9:03 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;s/Dave Beckett recently suggested using a WIki/Dan Brickley recently suggested using a Wiki/ ;)&lt;/p&gt;
&lt;p&gt;(maybe Dave suggested it too&amp;hellip;)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ebiquity.umbc.edu/blogger/&#34; title=&#34;http://ebiquity.umbc.edu/blogger/&#34;&gt;tim finin&lt;/a&gt; on &lt;a href=&#34;#comment-529&#34;&gt;September 4, 2006 10:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sorry. It was late.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://copia.ogbuji.net&#34; title=&#34;http://copia.ogbuji.net&#34;&gt;chimezie&lt;/a&gt; on &lt;a href=&#34;#comment-534&#34;&gt;September 9, 2006 9:50 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bijan,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Is not correct. I do not believe that decidability was ever a concern of the designers of RDFS (&amp;ldquo;simplicity&amp;rdquo; perhaps) (I&amp;rsquo;d welcome a pointer to contrary information). RDFS metamodeling is not easily extensible without going undecidable, which is a hint.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If decidability wasn&amp;rsquo;t an explicit concern at the time of RDFS&amp;rsquo; inception, it would only be because a fully expressive DL was not on the roadmap for RDF at the time. But I don&amp;rsquo;t think it is a coincidence that RDFS *is* a specific DL language. And that it is a subset of OWL *is* irrefutable - I would think you shouldn&amp;rsquo;t need to be convinced of that.&lt;/p&gt;
&lt;p&gt;And it&amp;rsquo;s sort of tongue in cheek to welcome a contrary pointer when you have none along with your own assertions&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>RDF/OWL for data silo integration?</title>
      <link>https://www.bobdc.com/blog/rdfowl-for-data-silo-integrati/</link>
      <pubDate>Thu, 24 Aug 2006 09:48:35 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdfowl-for-data-silo-integrati/</guid>
      
      
      <description><div>Are the rumors true?</div><div>&lt;p&gt;Earlier this week I &lt;a href=&#34;https://www.bobdc.com/blog/what-data-is-your-metadata-abo&#34;&gt;wrote&lt;/a&gt; about my frustration with metadata as data about data that may never exist—data that ontology designers merely wish that someone else would create around their fabulous ontologies. Lately I&amp;rsquo;ve become interested in the more difficult but useful idea of designing ontologies around existing data in order to get more value from that data. In theory, RDF/OWL descriptions of separate, related data collections make it easier to use those collections together; how does this work in practice?&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.int-media.com/web_integration.asp&#34;&gt;&lt;img src=&#34;http://www.int-media.com/images/database.gif&#34; alt=&#34;[database integration diagram]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;An interesting aspect of looking into new job opportunities last spring was researching what&amp;rsquo;s really going on in the XML and metadata world(s). Consultancies are getting more RDF/OWL work than I realized, especially in work for the U.S. government. I have a theory about this that I welcome people to debunk or, hopefully, back up with examples: one finding of the September 11th commission was that too much data was hidden in disparate silos in different government agencies, making it so difficult to share this data that assembling the right pieces to prevent the wrong people from entering the country (or getting their boarding passes, or whatever) could have happened but didn&amp;rsquo;t. People interested in database integration technology could find &amp;ldquo;Homeland Security&amp;rdquo; funding, and it worked the other way as well: people with this problem to solve heard about a W3C standard technology that could help them and went looking into it.&lt;/p&gt;
&lt;h2 id=&#34;i111&#34;&gt;Government and academic work&lt;/h2&gt;
&lt;p&gt;XML 2004 attendees will remember Department of Homeland Security Metadata Program Manager Michael Daconta&amp;rsquo;s keynote speech, which clearly demonstrated this interest. The Government Symposium on Information Sharing and Homeland Security recently held their &lt;a href=&#34;http://www.ncsi.com/ishs05/index.shtml&#34;&gt;fourth annual symposium&lt;/a&gt;. SICoP, or The Semantic Interoperability Community of Practice (&amp;ldquo;a Special Interest Group within the Knowledge Management Working Group sponsored by the Best Practices Committee of the Chief Information Officers Council, in partnership with the Federal XML Working Group&amp;rdquo;) seems to be doing interesting work in this area, although the full title on their &lt;a href=&#34;http://web-services.gov/&#34;&gt;home page&lt;/a&gt; (&amp;ldquo;Semantic Interoperability (XML Web Services) Community of Practice&amp;rdquo;) conflates two areas that most of us would consider to be completely separate—XML Web Services may play a role in Semantic Interoperability implementations, but they&amp;rsquo;re very different concepts. Maybe they wanted to justify the use of the domain web-services.gov.&lt;/p&gt;
&lt;p&gt;Some web searches on database integration and RDF/OWL, DAML, or RDFS turned up some interesting work in academia. The paper &amp;ldquo;RDF/RDFS-based Relational Database Integration&amp;rdquo; (&lt;a href=&#34;http://ccnt.zju.edu.cn/projects/dartgrid/files/publication/Huajun-ICDE2006.pdf&#34;&gt;PDF&lt;/a&gt;) by four researchers at China&amp;rsquo;s Zhejiang University looks promising; it describes the rewriting of SPARQL queries over RDF into a set of SQL queries and an application of their work at the China Academy of Traditional Chinese Medicine. &amp;ldquo;Knowledge Integration to Overcome Ontological Heterogeneity: Challenges from Financial Information Systems&amp;rdquo; (&lt;a href=&#34;http://ebiz.mit.edu/bgrosof/paps/icis2002-final.pdf&#34;&gt;PDF&lt;/a&gt;) by the Sloan School of Management&amp;rsquo;s Aykut Firat, Stuart Madnick, and Benjamin Grosof discusses the use of RuleML and OWL&amp;rsquo;s predecessor, DAML+OIL, in MIT&amp;rsquo;s COIN (COntext INterchange) project, which seems pretty defunct as of today (&lt;a href=&#34;http://context.mit.edu/~coin/&#34;&gt;dead home page&lt;/a&gt;, &lt;a href=&#34;http://72.14.209.104/search?q=cache:xv80fLNLPlIJ:context.mit.edu/~coin/+%22context+interchange%22+coin&amp;amp;hl=en&amp;amp;gl=us&amp;amp;ct=clnk&amp;amp;cd=1&#34;&gt;Google cache of it&lt;/a&gt;).&lt;/p&gt;
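The core move in that kind of work, rewriting a SPARQL-style triple pattern into SQL over a mapped relational table, can be sketched in a few lines of Python. The table, columns, and ex: property names below are invented for illustration; this isn't from any of the papers:

```python
import sqlite3

# Sketch: treat each column of a relational table as an RDF property, and
# rewrite one triple pattern of the form (?s <property> ?o) into SQL.
# The mapping (table "patients", property -> columns) is made up.
MAPPING = {"ex:name": ("patients", "id", "name"),
           "ex:age": ("patients", "id", "age")}

def rewrite(predicate):
    """Rewrite the pattern (?s <predicate> ?o) into a SQL query string."""
    table, subj_col, obj_col = MAPPING[predicate]
    return f"SELECT {subj_col}, {obj_col} FROM {table}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, name TEXT, age INTEGER)")
conn.execute("INSERT INTO patients VALUES (1, 'Ada', 52)")
rows = conn.execute(rewrite("ex:name")).fetchall()

# Each row becomes a triple: (row URI, ex:name, value)
triples = [(f"ex:patient/{s}", "ex:name", o) for s, o in rows]
assert triples == [("ex:patient/1", "ex:name", "Ada")]
```

A real rewriter also has to join tables when a query's triple patterns share variables, which is where most of the complexity in those papers lives.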
&lt;h2 id=&#34;i114&#34;&gt;Practical work?&lt;/h2&gt;
&lt;p&gt;Database integration has been an important area of computer science research for decades, and finding more papers on this topic that mention RDF/OWL or its predecessors shouldn&amp;rsquo;t be difficult. The government work that I&amp;rsquo;ve found looks like mostly top-down, big-picture design work. This is important work to do, but I&amp;rsquo;d like to find smaller scale working examples of these ideas somewhere: for example, two or more separate collections of non-RDF data, an RDF/OWL ontology describing some of the data in those collections, and something that uses that ontology to use the separate collections together. (Using separate collections of RDF together is a piece of cake—that&amp;rsquo;s part of the point of RDF.)&lt;/p&gt;
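As a toy illustration of the kind of working example I'm asking for, here is a sketch: two non-RDF "collections" with different field names, a minimal field-to-property mapping standing in for an ontology, and one query that spans both. All the names (ex:mbox, the field names) are made up:

```python
# Two silos that store the same kind of fact under different field names.
hr_db = [{"emp_id": "e1", "mail": "ada@example.org"}]       # "relational" source
ldap_dump = [{"uid": "e2", "email": "bob@example.org"}]     # "directory" source

# The "ontology" here is just a mapping of source fields to a shared property.
MAPPINGS = [
    (hr_db, "emp_id", {"mail": "ex:mbox"}),
    (ldap_dump, "uid", {"email": "ex:mbox"}),
]

# Lift both sources into one set of triples using the shared vocabulary.
triples = set()
for source, key, fields in MAPPINGS:
    for record in source:
        for field, prop in fields.items():
            triples.add((record[key], prop, record[field]))

# One query over both silos: every known mailbox, regardless of source.
mboxes = sorted(o for (s, p, o) in triples if p == "ex:mbox")
assert mboxes == ["ada@example.org", "bob@example.org"]
```

The point of the sketch is that once both silos are lifted into the shared vocabulary, the querying side never needs to know which silo a fact came from.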
&lt;p&gt;I&amp;rsquo;d love to see two different relational databases that are used together with the help of RDF/OWL, but I&amp;rsquo;m not holding my breath, because database vendors have been solving this problem without RDF/OWL for years. If the problem held the interest of the Zhejiang University researchers, though, I&amp;rsquo;m not giving up hope. How about RDF/OWL to integrate two or more collections of XML? Some XML and some relational data? XML, relational data, and some industry-specific notation systems or proprietary formats?&lt;/p&gt;
&lt;p&gt;If you know of anything, please let me know in comments here or via private email (see the bottom of my &lt;a href=&#34;http://www.snee.com/bob&#34;&gt;home page&lt;/a&gt; for the address). A description of something going on behind a firewall somewhere is better than nothing, but I&amp;rsquo;d prefer to hear about projects that anyone can reproduce on their own using free software.&lt;/p&gt;
&lt;h2 id=&#34;9-comments&#34;&gt;9 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-499&#34;&gt;August 24, 2006 1:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I reckon things have got quite a bit further than you suggest. You might want to check out activity around D2RQ and SquirrelRDF, a couple more SPARQL/RDBMS bridges. Damien Steer has used the latter on HP&amp;rsquo;s internal LDAP directories, very cool.&lt;/p&gt;
&lt;p&gt;A smallish-team commercial contract job I&amp;rsquo;ve been on for a while now involves the merging of medical patient data from different sources (two completely different SQL DBs for starters). Mapping to a common RDF model seems to work pretty well for the integration problem, and SPARQL seems adequate for the querying facilities required.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-500&#34;&gt;August 24, 2006 2:26 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cool, thanks Danny!&lt;/p&gt;
&lt;p&gt;By Damian on &lt;a href=&#34;#comment-501&#34;&gt;August 24, 2006 4:03 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You might also want to look at &lt;a href=&#34;http://darq.sourceforge.net/&#34;&gt;darq&lt;/a&gt;, which does federated sparql queries. Bastien has been using this with SquirrelRDF over LDAP (iirc) and other data sources, some of which are RDB backed.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.wasab.dk/morten/blog/&#34; title=&#34;http://www.wasab.dk/morten/blog/&#34;&gt;Morten&lt;/a&gt; on &lt;a href=&#34;#comment-502&#34;&gt;August 24, 2006 5:39 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Regarding &amp;ldquo;database vendors have been solving this problem without RDF/OWL for years&amp;rdquo;:&lt;/p&gt;
&lt;p&gt;That is true (I know, I&amp;rsquo;m doing it on a daily basis), but that doesn&amp;rsquo;t mean that new and better ways and methods will never show up.&lt;/p&gt;
&lt;p&gt;I believe RDF/OWL provides a path to those new and better solutions, but it&amp;rsquo;s arguably not as solid and mature as (relational or otherwise) databases&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-503&#34;&gt;August 24, 2006 5:59 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Morten -&lt;/p&gt;
&lt;p&gt;Tell me more about this path&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dowhatimean.net/&#34; title=&#34;http://dowhatimean.net/&#34;&gt;Richard Cyganiak&lt;/a&gt; on &lt;a href=&#34;#comment-504&#34;&gt;August 24, 2006 6:22 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, you might be interested in &lt;a href=&#34;http://www.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/&#34;&gt;D2R Server&lt;/a&gt;, which builds on the previously mentioned D2RQ. I&amp;rsquo;ve toyed around with integrating multiple D2R Server instances using Bastian&amp;rsquo;s DARQ and it works, at least in principle, although there are still scalability and performance issues all over the place.&lt;/p&gt;
&lt;p&gt;A couple of other related projects are &lt;a href=&#34;http://esw.w3.org/topic/RdfAndSql&#34;&gt;linked on the ESW wiki&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/&#34; title=&#34;http://clarkparsia.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-505&#34;&gt;August 25, 2006 8:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Just to chime in, &lt;a href=&#34;http://clarkparsia.com/&#34;&gt;Clark &amp;amp; Parsia&lt;/a&gt; is pretty active in the US federal gov&amp;rsquo;t space doing exactly what you outline, Bob. Which should come as no surprise to you, given yr new employer. :&amp;gt;&lt;/p&gt;
&lt;p&gt;(Also, as to yr theory about the 9/11 stuff prompting integration: that&amp;rsquo;s pretty plausible; but inside of NASA, where we&amp;rsquo;ve done most of our RDF integration work, they&amp;rsquo;ve been worrying about their special data problems, including the silo issue, for a good long while. And the OMB&amp;rsquo;s DRM memos certainly have added additional impetus.)&lt;/p&gt;
&lt;p&gt;In particular we&amp;rsquo;ve been pretty active with NASA doing database integration using RDF (though not OWL yet): the BIANCA project integrates several data sources to manage NASA networked assets (networks, servers, applications, and dependencies between these); and our POPS project is integrating disparate data sources (6 right now with probably dozens more in the next 18 months) using a virtual RDF federation model and a (novel) display client, &lt;a href=&#34;http://clarkparsia.com/projects/code/jspace/&#34;&gt;JSpace&lt;/a&gt;, which is, essentially, a visual RDF query builder (for RDQL and iTQL, SPARQL to come later).&lt;/p&gt;
&lt;p&gt;So, yeah, this stuff is happening, but it&amp;rsquo;s all rather under the radar in some sense.&lt;/p&gt;
&lt;p&gt;As for OWL, I suspect we won&amp;rsquo;t really get into using it for db integration till we have some R&amp;amp;D funds to polish that approach first. There&amp;rsquo;s some good work done by some of the Europeans in using OWL to align database schemas as represented by UML. But that work isn&amp;rsquo;t trivial and, so far, the RDF approach has been sufficient.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.topquadrant.com&#34; title=&#34;http://www.topquadrant.com&#34;&gt;Ralph Hodgson&lt;/a&gt; on &lt;a href=&#34;#comment-506&#34;&gt;August 27, 2006 5:28 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hello Bob,&lt;/p&gt;
&lt;p&gt;We are also using RDF/OWL ontologies for database integration, also at Government agencies (FAA, NASA). We have recently tooled up TopBraid Composer to support proxy mappings - see Holger&amp;rsquo;s blog entry &lt;a href=&#34;http://composing-the-semantic-web.blogspot.com/2006/08/update-automated-database-import-into.html&#34;&gt;http://composing-the-semantic-web.blogspot.com/2006/08/update-automated-database-import-into.html&lt;/a&gt;&lt;br /&gt;
for more information.&lt;/p&gt;
&lt;p&gt;In one approach we are using datalog with OWL. A second approach uses SPARQL with OWL and D2RQ.&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
Regards, Ralph&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://ivas.nc3a.nato.int&#34; title=&#34;http://ivas.nc3a.nato.int&#34;&gt;Victor Rodriguez-Herola&lt;/a&gt; on &lt;a href=&#34;#comment-523&#34;&gt;September 2, 2006 6:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi,&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve not been doing db integration directly but somehow indirectly. We&amp;rsquo;ve been &lt;a href=&#34;http://ivas.nc3a.nato.int&#34;&gt;integrating several sources of information&lt;/a&gt; by wrapping them in very simple web services (SOAP-based and ReST-based). Each of the sources was providing data directly in RDF based on a one-to-one OWL mapping to their respective underlying data model (which in most of the cases was a db model). Then, by using cross-ontologies or interpretation ontologies, that is, connecting similar concepts from different ontologies with respect to some &amp;ldquo;generic&amp;rdquo; vocabulary, we were able to make queries.&lt;/p&gt;
&lt;p&gt;Those queries were based on the &amp;ldquo;generic&amp;rdquo; vocabulary and we were using Pellet and a specific algorithm to generate the proper target query that the different sources were able to understand. The point was to transform the source query to RDQL and then, from here, translate it to the target queries. Of course a limited subset of query syntax was used, and we focused on generating RDF queries (queries by example) and source-specific RDQL queries.&lt;/p&gt;
&lt;p&gt;Once data in RDF were served from different sources (again in RDF based on source-specific OWL), we populated the Pellet ABox and the classification did its job. From here, we could make further refinements of the query (either based on the generic query or based on some more source-specific vocabulary).&lt;br /&gt;
There are a couple of initiatives in our organisation aimed to encourage the use of RDF and OWL so any system or data provider will be able to wrap-up some of its services and provide information directly in RDF.&lt;/p&gt;
&lt;p&gt;Our point is not in making db integration directly but in giving systems and data provider developers a path to provide their information in RDF based on their own sort of &amp;ldquo;private&amp;rdquo; ontology. The trick then is to make the connections between those disparate models. But, at least, you don&amp;rsquo;t have to re-program again if you&amp;rsquo;ve got yet another source in town. Just create your OWL model and provide RDF, make the proper interpretation entries, let the inference engine know about them, and there you will have another data source to integrate with the rest.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not the most advanced proposal, but we&amp;rsquo;re trying to make the case for interoperability.&lt;/p&gt;
&lt;p&gt;Regards,&lt;br /&gt;
Victor&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/rdf/owl">RDF/OWL</category>
      
    </item>
    
    <item>
      <title>What data is your metadata about, and where is it?</title>
      <link>https://www.bobdc.com/blog/what-data-is-your-metadata-abo/</link>
      <pubDate>Mon, 21 Aug 2006 09:31:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/what-data-is-your-metadata-abo/</guid>
      
      
      <description><div>If metadata really is data about data...</div><div>&lt;p&gt;In a &lt;a href=&#34;https://www.bobdc.com/blog/xml-summer-school-in-oxford#i113&#34;&gt;recent posting&lt;/a&gt;, I mentioned that I&amp;rsquo;ve been thinking lately about how some people doing metadata work (in particular, people doing RDF Schema and RDF/OWL ontologies) don&amp;rsquo;t care much about whether there is any corresponding data to go with their metadata. We all dutifully define metadata as &amp;ldquo;data about data,&amp;rdquo; but a lot of metadata out there is not really about any existing, useful data. Dan Connolly &lt;a href=&#34;http://dig.csail.mit.edu/breadcrumbs/node/158&#34;&gt;called it&lt;/a&gt; &amp;ldquo;ontologies for the sake of ontologies.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;When describing this peeve of mine to Paul Prescod, he asked me &amp;ldquo;are you saying that people are creating metadata ontologies before they create the data and should do it the other way around?&amp;rdquo; I replied that too many people create a metadata ontology for data that doesn&amp;rsquo;t exist, announce its availability and the kind of data it would be good for, and then move on to create more ontologies. For example, a year or so ago on the &lt;a href=&#34;http://www.w3.org/2001/sw/interest/&#34;&gt;semantic web mailing list&lt;/a&gt;, someone posted an announcement about an RDF Schema that he had created with a description of the kind of data it would be useful for. I privately e-mailed him suggesting that he create a file of RDF triples as sample instance data that went with his schema to demonstrate how to use this schema, and he sent back a very appreciative email that thanked me and said that this was a great idea and that he would go ahead and do it. Here&amp;rsquo;s what I&amp;rsquo;m wondering: why did this have to be suggested to him? Why wasn&amp;rsquo;t it self-evident that he should create sample data before announcing his schema on the mailing list? Why is it so common for people working with RDF Schema and RDF/OWL to think that if they build it, the data will come, simply because they announced that their work is available? When people develop relational schemas or XML DTDs or schemas, creating sample data that conforms to them is a normal step in testing the suitability of their work; why is it less obvious to people doing ontology work that there should be &lt;a href=&#34;https://www.bobdc.com/blog/my-new-favorite-typo&#34;&gt;meatdata&lt;/a&gt; to go with their metadata?&lt;/p&gt;
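The sample-data step I suggested to him is cheap. Here is a sketch (with an invented ex: vocabulary, not his schema) of the basic sanity check: does the sample instance data actually exercise every property the schema declares?

```python
# Invented schema: the properties an announced vocabulary declares.
schema_properties = ["ex:title", "ex:author"]

# Invented sample instance data: triples that should demonstrate the schema.
sample_instance = [
    ("ex:doc1", "ex:title", "A sample document"),
    ("ex:doc1", "ex:author", "ex:someone"),
]

# Which declared properties does the sample actually use?
used = {p for (_, p, _) in sample_instance}
unexercised = [p for p in schema_properties if p not in used]

# An empty list is weak evidence the schema is usable as announced; a long
# one is the "might be useful to someone someday" smell described above.
assert unexercised == []
```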
&lt;p&gt;When I chose this topic for my five-minute rant at the XML Summer School, I quoted John Chelsom&amp;rsquo;s statement &lt;a href=&#34;http://www.xmlsummerschool.com/Content-Knowledge.asp&#34;&gt;earlier that week&lt;/a&gt; that the main purpose of metadata is to speed up and enrich the searching of resources. To give some credit to the people I&amp;rsquo;m ranting against, I wouldn&amp;rsquo;t agree with this assertion 100%; some people are doing valuable work with pure metadata about medical conditions and potential treatments as they use RDF/OWL tools to find new relationships, but I think too many are designing metadata for nonexistent data that they somehow think they will inspire someone else to create.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&#34;http://lists.w3.org/Archives/Public/semantic-web/2006Jul/thread.html#msg56&#34;&gt;typical discussions&lt;/a&gt; about the lack of RDF data on the web, some people point out the progress in the development of tools that let us treat non-RDF as triples, thereby adding this data to a potential semantic web. I think that this is great, but what I&amp;rsquo;d really like to see is RDF/OWL ontologies that describe this data so that we can get more value from that data. In his talk, John also described the concept of &amp;ldquo;turning content into knowledge by adding metadata and ontology.&amp;rdquo; This would make a great mission statement for someone, and it gives us a clue about the appeal of designing metadata for non-existent data: it&amp;rsquo;s easier. As with many IT projects, starting with a body of existing data and then creating a model that works well with it is messier and more difficult than starting with a blank slate, but from the potential semantic web to the internal systems of many, many, companies, the greatest opportunities for the use of metadata are in building metadata around existing data.&lt;/p&gt;
&lt;p&gt;In forthcoming postings here, I&amp;rsquo;ll write about (or, more likely, ask about) the creation of RDF/OWL ontologies for existing sets of data and how those ontologies add value to that data. Please let me know, in comments here or by private e-mail, of any projects you know of that do this.&lt;/p&gt;
&lt;h2 id=&#34;6-comments&#34;&gt;6 Comments&lt;/h2&gt;
&lt;p&gt;By Gunnar Aastrand Grines on &lt;a href=&#34;#comment-487&#34;&gt;August 21, 2006 10:28 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Why is it so common for people working with RDF Schema and RDF/OWL to think that if they build it, the data will come, simply because they announced that their work is available?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Because people like making abstractions; it&amp;rsquo;s an intellectually satisfying job&amp;hellip; Look at the programmers: every Java programmer loves creating frameworks, the more general the better, but they are not so keen to use them to build something useful.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://copia.ogbuji.net&#34; title=&#34;http://copia.ogbuji.net&#34;&gt;Uche&lt;/a&gt; on &lt;a href=&#34;#comment-490&#34;&gt;August 21, 2006 5:26 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When Bill de hÓra calls me out as one of the &amp;ldquo;markup people out there who can live in the RDF world&amp;rdquo; [1], I like to think he&amp;rsquo;s characterizing folks who work with RDF, but strictly as an auxiliary to XML. IOW, it&amp;rsquo;s exactly what you&amp;rsquo;re saying: worry about what the data is, then decide what metadata representation and processing is needed to enhance its value. As an aside, I&amp;rsquo;d tend to include you, Eric van der Vlist and Edd Dumbill in Bill&amp;rsquo;s list, and I&amp;rsquo;m not sure I&amp;rsquo;d include Shelley Powers, unless I misunderstand her position on certain things (quite possible).&lt;/p&gt;
&lt;p&gt;Anyway, I&amp;rsquo;ve always been able to tolerate what I consider metadata fundamentalism from some RDF/topic maps folks. By that I mean folks who prefer to encode everything they process in such metadata technologies. I&amp;rsquo;ve heard advocates say that people don&amp;rsquo;t need the semantic vagaries of XML formats when they can have the rigor of RDF/TM. I disagree, but what ultimately soured me on RDF was the related, but distinct problem of over-engineering (and over-theorizing) in the RDF world [2]. (Of course you know that, because you&amp;rsquo;re one of those who commented :-) )&lt;/p&gt;
&lt;p&gt;Good piece, Bob. Thanks.&lt;/p&gt;
&lt;p&gt;[1] &lt;a href=&#34;http://www.dehora.net/journal/2004/07/rdf_101.html&#34;&gt;http://www.dehora.net/journal/2004/07/rdf_101.html&lt;/a&gt;&lt;br /&gt;
[2] &lt;a href=&#34;http://copia.ogbuji.net/blog/2005-09-14/Is_RDF_mov&#34;&gt;http://copia.ogbuji.net/blog/2005-09-14/Is_RDF_mov&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ndash;Uche&lt;/p&gt;
&lt;p&gt;By Damian Steer on &lt;a href=&#34;#comment-492&#34;&gt;August 21, 2006 8:13 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The reason I like using RDF is because I can (infinitely) defer committing to regular structures. I take my data and add properties as and when I need them. I pick and choose existing properties and classes, or make my own up. The few &amp;lsquo;ontologies&amp;rsquo; I&amp;rsquo;ve written have been either loose collections of terms I couldn&amp;rsquo;t find elsewhere, or descriptions mined from the dataset (so people know what&amp;rsquo;s there).&lt;/p&gt;
&lt;p&gt;But working ontology first seems to throw that advantage away. They&amp;rsquo;re yearning for the shackles of XML schemas, when they ought to be enjoying the freedom of RDF schema.&lt;/p&gt;
&lt;p&gt;By Stephen De Gabrielle on &lt;a href=&#34;#comment-495&#34;&gt;August 22, 2006 5:32 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;turning content into knowledge by adding metadata and ontology&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Been into a library lately?&lt;/p&gt;
&lt;p&gt;Point taken about ontologies.&lt;/p&gt;
&lt;p&gt;Maybe there is an opportunity here. Even small libraries have huge datasets, often in electronic form. Most accept volunteer assistance. Maybe some of this intellectual energy could be devoted to real metadata projects in real libraries, especially small libraries with small budgets (towns under 300k pop for example, schools, small colleges, etc.).&lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
&lt;p&gt;Stephen&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-497&#34;&gt;August 23, 2006 4:44 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m as guilty as anyone of creating schemas without data, but I&amp;rsquo;m not sure it&amp;rsquo;s necessarily a problem per se. For example, ages ago I hacked out a &lt;a href=&#34;http://dannyayers.com/xmlns/project/&#34;&gt;project ontology&lt;/a&gt;. It included a lot of guesswork about what I thought I would need. When I finally did some coding around it, working with instance data, I found that I was only using a fraction of the terms I&amp;rsquo;d defined. Ok, so there was some wasted effort here, but if you considered the v0.1 of the ontology just a working sketch, it&amp;rsquo;s not that bad.&lt;/p&gt;
&lt;p&gt;So although I agree that the ratio of ontologies to instance data is pretty silly, I&amp;rsquo;m not convinced this is inherently bad. URIs aren&amp;rsquo;t exactly expensive. Duplication of ontologies is generally undesirable, but I think probably unavoidable in the first few passes.&lt;/p&gt;
&lt;p&gt;Perhaps (a big perhaps) when Damian mentions &amp;ldquo;yearning for the shackles of XML schemas&amp;rdquo; this also applies to the criticism of the ontology:data ratio. In the XML world schemas are treated as precious resources. Whereas in RDF, by making *any* statement you are making at least one schema/ontology level assertion (that the predicate resource is a property).&lt;/p&gt;
&lt;p&gt;On your metadata point, I reckon the more interesting side of RDF is where it&amp;rsquo;s talking about things in general, rather than about documents (about things). But given the doc-nature of the current web, there is a lot of low-hanging fruit on the metadata side.&lt;/p&gt;
&lt;p&gt;Re. ontologies for existing data - that&amp;rsquo;s half the motivation behind the &lt;a href=&#34;http://micromodels.org/&#34;&gt;micromodels&lt;/a&gt; stuff. People are creating material in HTML, the microformats folks provide a way of making the data explicit, XSLT offers the bridge to RDF.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-498&#34;&gt;August 23, 2006 8:07 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Danny -&lt;/p&gt;
&lt;p&gt;Did you announce to the world that you had made a project ontology available before or after working with related instance data and discovering that only a fraction of it was useful?&lt;/p&gt;
&lt;p&gt;More generally, is it better to announce to the world &amp;ldquo;I&amp;rsquo;ve written this ontology and made it available&amp;rdquo; as soon as possible, which in your case meant that the majority of it was not useful, or after putting it through some paces and identifying which parts are useful?&lt;/p&gt;
&lt;p&gt;Of course this is a rhetorical question; the classic ontology developer answer would be &amp;ldquo;but some of the other parts might be useful to someone someday.&amp;rdquo; This is just wishful thinking that only increases the amount of useless ontology work that people must wade through to find something that might work for them. I&amp;rsquo;m not looking for rigorous testing, but just a demo, or as you did, some coding around it.&lt;/p&gt;
&lt;p&gt;In attitudes about ontology development today, there&amp;rsquo;s way too much &amp;ldquo;might be useful to someone someday&amp;rdquo; with no desire to put a little effort into determining how useful it might be and to whom. Developers of other kinds of schema or software (&amp;ldquo;coding&amp;rdquo;!) with this attitude would have a difficult time being taken seriously. I&amp;rsquo;m sure that sourceforge and download.com are full of software that does nothing for anyone but their authors, but at least it did something useful for them.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>My new favorite typo</title>
      <link>https://www.bobdc.com/blog/my-new-favorite-typo/</link>
      <pubDate>Mon, 14 Aug 2006 20:28:53 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-new-favorite-typo/</guid>
      
      
      <description><div>Data about meat?</div><div>&lt;p&gt;&lt;a href=&#34;http://www.google.com/search?q=meatdata&#34;&gt;&lt;img src=&#34;http://www.uwex.edu/ces/agmarkets/DM51b.gif&#34; alt=&#34;[sample meat label]&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;200pt&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A slip of the fingers can make you type &amp;ldquo;meatdata&amp;rdquo; instead of &amp;ldquo;metadata,&amp;rdquo; and a &lt;a href=&#34;http://www.google.com/search?q=meatdata&#34;&gt;Google search for meatdata&lt;/a&gt; brings up almost 2,000 hits. My favorite is the &lt;a href=&#34;http://mail.asis.org/pipermail/asis-l/2002-August/000252.html&#34;&gt;2002 job posting&lt;/a&gt; for a &amp;ldquo;Meatdata librarian&amp;rdquo; position at the University of Tennessee. Apparently, the relationship between &amp;ldquo;Meatdata and Active Object-Model Pattern Mining Workshop&amp;rdquo; was a hot topic at OOPSLA in &lt;a href=&#34;http://www.joeyoder.com/Research/metadata/OOPSLA98MetaDataWkshop.html&#34;&gt;1998&lt;/a&gt;, &lt;a href=&#34;http://www.adaptiveobjectmodel.com/OOPSLA99/&#34;&gt;1999&lt;/a&gt;, and &lt;a href=&#34;http://www.adaptiveobjectmodel.com/ECOOP2000/&#34;&gt;2000&lt;/a&gt; (note the &lt;code&gt;/html/head/title&lt;/code&gt; title at the top of the browser window for each, not the &lt;code&gt;h1&lt;/code&gt; title). At ektron.com, chrisk &lt;a href=&#34;http://dev.ektron.com/forum.aspx?g=posts&amp;amp;t=5092&#34;&gt;wants to know&lt;/a&gt; if one can &amp;ldquo;setup a Parent - Child relationship between meatdata searchable properties.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;On a less sarcastic note, Rohit Khare used the term &lt;a href=&#34;http://www.4k-associates.com/moma.html&#34;&gt;on purpose&lt;/a&gt; as far back as 1999 to reference the data that the metadata is about. Perhaps for my standard rant that despite our dutiful definitions of metadata as &amp;ldquo;data about data,&amp;rdquo; there&amp;rsquo;s too much metadata out there that lacks any corresponding instance data, I can start using the eighties catch phrase &amp;ldquo;&lt;a href=&#34;http://en.wikipedia.org/wiki/Where%27s_the_beef&#34;&gt;Where&amp;rsquo;s the beef?&lt;/a&gt;&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.xmlgrrl.com/blog&#34; title=&#34;http://www.xmlgrrl.com/blog&#34;&gt;Eve M.&lt;/a&gt; on &lt;a href=&#34;#comment-480&#34;&gt;August 14, 2006 10:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yeah, I frequently mistype it as &amp;ldquo;meatdata&amp;rdquo; all the time, and it always reminds me of &lt;a href=&#34;http://www.terrybisson.com/meat.html&#34;&gt;They&amp;rsquo;re Made of Meat&lt;/a&gt;. I suspect &amp;ldquo;meat&amp;rdquo; is one of those inherently funny words&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Wikipedia: more useful than good</title>
      <link>https://www.bobdc.com/blog/wikipedia-more-useful-than-goo/</link>
      <pubDate>Mon, 07 Aug 2006 11:43:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/wikipedia-more-useful-than-goo/</guid>
      
      
      <description><div>There&#39;s so much wrong with it... and I use it all the time.</div><div>&lt;p&gt;The program for a recent presentation at my daughter&amp;rsquo;s school footnoted Wikipedia for a few definitions that were supposed to provide background for whatever cultural thing the kids and we were learning about. I thought this was pretty funny, because the point of footnotes is to show that you didn&amp;rsquo;t just make something up; you got it from a source that the reader can consult to follow up on the information. A source that anyone can edit, that makes the news every week for the many silly errors it has, is not a sensible thing to reference in a footnote.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Main_Page&#34;&gt;&lt;img src=&#34;http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Wikipedia-logo-en-big.png/98px-Wikipedia-logo-en-big.png&#34; alt=&#34;[wikipedia logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Nevertheless, I use Wikipedia several times a day. I often don&amp;rsquo;t get past the first few sentences of an entry, but those first few sentences can be remarkably useful, especially in certain categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Geek stuff&lt;/em&gt; When the phone company repair guy gave us a new DSL modem, he warned me that I might need to dig up certain dialog boxes to reset some things because my new modem was a &lt;a href=&#34;http://en.wikipedia.org/wiki/DHCP&#34;&gt;DHCP&lt;/a&gt; one, unlike my old &lt;a href=&#34;http://en.wikipedia.org/wiki/Dynamic_DNS&#34;&gt;Dynamic DNS&lt;/a&gt; one. I&amp;rsquo;d heard these acronyms before, but didn&amp;rsquo;t really understand them. Wikipedia straightened me out. (Of course, if either of these entries were full of errors, I&amp;rsquo;d have no way of knowing, but I&amp;rsquo;ll get to that.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Companies&lt;/em&gt; This is one category where Wikipedia is often better than a more official source. A typical company&amp;rsquo;s &amp;ldquo;About&amp;rdquo; web page offers a vague, buzzwordy extension of their already vague mission statement about the value-added synergy of the total solution lifecycle suite that they provide. The same company&amp;rsquo;s Wikipedia page probably starts off with a brief description of what they&amp;rsquo;re known for making or doing, where they&amp;rsquo;re based, how long they&amp;rsquo;ve been around, and their stock ticker symbols. (Of course, marketing communications people know that anyone can edit a Wikipedia entry, and you can often see their footprints there.) For example, Nike&amp;rsquo;s &lt;a href=&#34;http://www.nike.com/nikebiz/nikebiz.jhtml?page=3&#34;&gt;Company Overview&lt;/a&gt; page doesn&amp;rsquo;t mention that they&amp;rsquo;re a &amp;ldquo;sports and fitness&amp;rdquo; company until the third paragraph, and even then there&amp;rsquo;s nothing to distinguish them from a company that makes treadmills, skis, or lacrosse sticks, none of which they make. Their &lt;a href=&#34;http://en.wikipedia.org/wiki/Nike%2C_Inc.&#34;&gt;Wikipedia entry&lt;/a&gt; starts by saying &amp;ldquo;Nike, Inc. (Pronounced: NIGH-KEY) (NYSE: NKE) is a major American manufacturer of athletic shoes, Clothing/apparel, and sports equipment.&amp;rdquo; BP&amp;rsquo;s &lt;a href=&#34;http://www.bp.com/subsection.do?categoryId=5&amp;amp;contentId=2006530&#34;&gt;What we do&lt;/a&gt; page tells us that &amp;ldquo;[their] business is about finding, producing, and marketing the natural energy resources on which the modern world depends&amp;rdquo;; Wikipedia &lt;a href=&#34;http://en.wikipedia.org/wiki/Bp&#34;&gt;tells us&lt;/a&gt; that &amp;ldquo;BP plc (LSE: BP, NYSE: BP, TYO: 5051 ), originally British Petroleum, is a British energy company with headquarters in London, one of four vertically integrated private sector oil, natural gas, and petrol (gasoline) &amp;lsquo;supermajors&amp;rsquo; in the world, along with Royal Dutch Shell, ExxonMobil and Total&amp;rdquo;—far more useful information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Semi-famous people&lt;/em&gt; You hear a name, you&amp;rsquo;ve heard it before, you&amp;rsquo;re not sure why. If it&amp;rsquo;s a musician or band, I recommend &lt;a href=&#34;http://www.allmusic.com&#34;&gt;allmusic.com&lt;/a&gt;, but otherwise, Wikipedia will give you the general idea of why someone is famous. Jaron Lanier complained that his one short film seen by only a few people didn&amp;rsquo;t qualify him as a filmmaker, so he kept removing this title from &lt;a href=&#34;http://en.wikipedia.org/wiki/Jaron_Lanier&#34;&gt;his Wikipedia entry&lt;/a&gt;, and it kept re-appearing. (As of today, it&amp;rsquo;s not there.) Still, the first few sentences give a reasonable summary of why he&amp;rsquo;s famous in certain circles: &amp;ldquo;Jaron Lanier (born 1960) is an American musician and virtual reality developer. He claims to have popularized the term &amp;lsquo;Virtual Reality&amp;rsquo; (VR) in the early 1980s. At that time, he founded VPL Research, the first company to sell VR products.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Pop culture&lt;/em&gt; If &lt;a href=&#34;http://www.boingboing.net/&#34;&gt;BoingBoing&lt;/a&gt; starts throwing around some Japanese word that may be a kids card game, a video game, or some sex trick, I&amp;rsquo;m usually confident that Wikipedia will tell me what that thing really is. Looking at the &lt;a href=&#34;http://50.lycos.com/&#34;&gt;Lycos Top 50&lt;/a&gt; list recently, I had no idea what &amp;ldquo;&lt;a href=&#34;http://en.wikipedia.org/wiki/Naruto&#34;&gt;Naruto&lt;/a&gt;&amp;rdquo; at number 12 was; I just found out from Wikipedia that it&amp;rsquo;s a Japanese manga comic and TV series. Number 31, &amp;ldquo;&lt;a href=&#34;http://en.wikipedia.org/wiki/Limewire&#34;&gt;Limewire&lt;/a&gt;,&amp;rdquo; was also a mystery to me until I found out that it&amp;rsquo;s a Gnutella client. (Looking at the Lycos Top 50 just now, it seems broken; the actual list isn&amp;rsquo;t showing up.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Area codes&lt;/em&gt; This is more useful to Americans and Canadians, but others interested in the three-digit numbers sometimes repeated in hip-hop songs may find it handy. If I see a phone number and have no idea where in the country it is, I&amp;rsquo;ll find out from Wikipedia, and I&amp;rsquo;ll probably find out much more about it. For example, see their entry for &lt;a href=&#34;http://en.wikipedia.org/wiki/Area_code_718&#34;&gt;my last area code&lt;/a&gt;, when I was representin&amp;rsquo; the BK.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A &lt;a href=&#34;http://www.newyorker.com/printables/fact/060731fa_fact&#34;&gt;recent New Yorker&lt;/a&gt; article on Wikipedia (and don&amp;rsquo;t miss a recent &lt;a href=&#34;http://www.theonion.com/content/node/50902&#34;&gt;Onion piece&lt;/a&gt;), which is particularly good on the project&amp;rsquo;s subcultures, reminded me of something else about the quality of Wikipedia entries: the only way you know an article is really high-quality is if it&amp;rsquo;s telling you things you already know. One of the first times I looked at Wikipedia, I looked at their entry for &lt;a href=&#34;http://en.wikipedia.org/wiki/XSLT&#34;&gt;XSLT&lt;/a&gt;, and it was awful. Friends told me &amp;ldquo;Just fix it! That&amp;rsquo;s what Wikipedia is all about!&amp;rdquo; But, this entry needed replacement, not fixing, and it felt like bad etiquette to completely throw out an entry for a basic W3C standard, so I didn&amp;rsquo;t do it. Several months later, it had evolved to a more reasonable description of XSLT, so I felt comfortable making a few changes and additions—for example, adding a mention of the W3C.&lt;/p&gt;
&lt;p&gt;These two experiences with their XSLT entry form the yin and yang of my attitude about Wikipedia, reminding me that you have to take the entries with a grain of salt, but when you do, they can be pretty useful. Like I said, for all I know, their DHCP entry could be full of technical errors. If you&amp;rsquo;re doing serious research on something, Wikipedia is not a serious source, but because it may point to serious sources, it can provide a good starting point.&lt;/p&gt;
&lt;p&gt;And for God&amp;rsquo;s sake, don&amp;rsquo;t footnote it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>XML summer school in Oxford</title>
      <link>https://www.bobdc.com/blog/xml-summer-school-in-oxford/</link>
      <pubDate>Mon, 31 Jul 2006 07:57:32 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xml-summer-school-in-oxford/</guid>
      
      
      <description><div>The Hogwarts of XML.</div><div>&lt;p&gt;This summer was the seventh year that the &lt;a href=&#34;http://www.csw.co.uk/&#34;&gt;CSW Group&lt;/a&gt; sponsored their &lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;XML Summer School&lt;/a&gt; in Oxford, England. CSW&amp;rsquo;s offices are on the outskirts of Oxford, but they rent facilities within the ancient university to hold the classes. For the last few years they&amp;rsquo;ve held it at &lt;a href=&#34;http://www.wadham.ox.ac.uk/public&#34;&gt;Wadham College&lt;/a&gt;, with a beautiful campus that was first built in the early seventeenth century. (To hear my father describe it, I&amp;rsquo;m an Oxford don, so I try to explain &amp;ldquo;Dad&amp;hellip; it&amp;rsquo;s a consulting company renting several rooms and dorms in the university&amp;hellip;&amp;rdquo;) The main Wadham dining hall, where Christopher Wren and Isaac Newton once sat, has a real Hogwarts effect, and other buildings in the neighborhood either directly inspired or were &lt;a href=&#34;http://www.londontaxitour.com/london-taxi-tour-Harry-Potter-tour-oxford-film-locations.htm&#34;&gt;used as sets&lt;/a&gt; for several scenes in the Harry Potter movies.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;&lt;img src=&#34;http://www.wadham.ox.ac.uk/public/overview/frontquad&#34; alt=&#34;[Wadham front quad]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I chaired the two-day XSLT track, which began with two sessions on basic XSLT 1.0 from me. &lt;a href=&#34;http://www.datypic.com/index.html&#34;&gt;Priscilla Walmsley&lt;/a&gt; then did an introduction to XSL-FO, which led to some great discussion from people in the audience such as Jeni Tennison and Debbie Lapeyre on XSL-FO implementations they&amp;rsquo;ve done for clients. Paul Prescod did a very popular presentation titled &amp;ldquo;AJAX + XSLT = ?&amp;rdquo; on the past, current, and potential future role of XSLT in AJAX applications. When he and I came up with the idea for the talk, I had no idea of the role that XSLT already played in Google Maps and the web interface to Microsoft Outlook, but I learned this and a lot more from his talk.&lt;/p&gt;
&lt;p&gt;This track&amp;rsquo;s second day focused on the batch of XSLT-related specs that recently moved to Candidate Recommendation status. Priscilla gave an overview of XQuery, the subject of her &lt;a href=&#34;http://safari.oreilly.com/0596006349&#34;&gt;forthcoming O&amp;rsquo;Reilly book&lt;/a&gt;, and Jeni Tennison explained most of the new features that XSLT 2.0 adds to XSLT 1.0. She left the schema support parts of XSLT 2.0 (and XQuery) to &lt;a href=&#34;http://saxonica.blogharbor.com/blog&#34;&gt;Michael Kay&lt;/a&gt;, who discussed and demonstrated those. I didn&amp;rsquo;t realize how much of the design of XSLT 1.0 was planning ahead for the use of XSLT processors that were built into browsers (something that Paul Prescod and I both demonstrated, but which never caught on the way the original XSL Working Group had hoped for), and part of this was a philosophy that runtime error propagation was bad because propagated errors would confuse people browsing the web. A key advantage of type-aware XQuery queries and XSLT 2.0 stylesheets is that when something doesn&amp;rsquo;t work correctly—especially in the common case of no output when you expected some—the use of schema-awareness is more likely to result in an error message that gives you a useful clue about what went wrong. Michael is not an enthusiastic supporter of the W3C Schema Language, but strongly believes that any schema language that provides typing is better than DTDs, and because the large complex schemas being created and used these days use the W3C Schema language, that&amp;rsquo;s the one to use. In his words, &amp;ldquo;I&amp;rsquo;m not selling it to you because it&amp;rsquo;s good; I&amp;rsquo;m selling it to you because it&amp;rsquo;s necessary.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Anyone familiar with XSLT will recognize all of these names, so we obviously had an impressive lineup. I was sorry that we were scheduled opposite the web services track, where &lt;a href=&#34;http://www.xmlgrrl.com/blog/&#34;&gt;Eve Maler&lt;/a&gt; had put together a great list of speakers. Earlier in the week, I managed to see the second day of the &amp;ldquo;Content and Knowledge with XML&amp;rdquo; track put together by &lt;a href=&#34;http://silmaril.ie/cgi-bin/blog&#34;&gt;Peter Flynn&lt;/a&gt;, which was great. John Chelsom (the &amp;ldquo;C&amp;rdquo; in &amp;ldquo;CSW&amp;rdquo;) opened the day with a talk titled &amp;ldquo;Turning XML Content to Executable Knowledge Bases.&amp;rdquo; He began with some basic philosophical ideas about what we mean by &amp;ldquo;knowledge&amp;rdquo; and then worked his way to how technologies like RDF/OWL can help store usable knowledge in applications. His talk, and a talk later that day by CSW&amp;rsquo;s Rueben Wright on &amp;ldquo;Applying Knowledge to Support Business Processes,&amp;rdquo; convinced me that CSW is more concerned with connecting metadata to data to improve the value of that data than a lot of people who dutifully define metadata as &amp;ldquo;data about data&amp;rdquo; and then assume that someone else will come up with some data that goes with their brilliant metadata.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.laurenwood.org/anyway/&#34;&gt;Lauren Wood&lt;/a&gt; put together the next day&amp;rsquo;s &amp;ldquo;Trends and Transients&amp;rdquo; track, but couldn&amp;rsquo;t attend in person because of a &lt;a href=&#34;http://www.laurenwood.org/anyway/archives/2006/06/14/first-days-home/&#34;&gt;new arrival&lt;/a&gt; in her family. This track&amp;rsquo;s final session gave all the track chairs five minutes to rant about whatever we wanted. A double espresso and my thoughts from the previous day about the relative practicality of different people doing metadata work inspired me to do a talk whose title alone took up a chunk of my five minutes: &amp;ldquo;If metadata is data about data, what data is your metadata about, and where can I find some, and don&amp;rsquo;t just tell me that it would be great if someone created some.&amp;rdquo; Since it&amp;rsquo;s the sort of rant that&amp;rsquo;s typical for a weblog, I&amp;rsquo;m going to organize my notes and post it as a separate entry here.&lt;/p&gt;
&lt;p&gt;Peter and I have done the summer school every year that CSW has held it, so by now many other speakers and CSW employees are old friends. Having these friends spread throughout the XML world has been very handy, although we don&amp;rsquo;t end up discussing XML with the attendees at the scheduled evening social events as much as John Chelsom probably hoped. The evening events include &lt;a href=&#34;http://en.wikipedia.org/wiki/Punting&#34;&gt;punting&lt;/a&gt; up a picture-postcard River Cherwell to an idyllic pub and a scheduled pub crawl, complete with documentation, every year. Additional events this year included a tour of Oxford Castle/Prison, in use from 1009 to 1995 (John&amp;rsquo;s wife, a lawyer, told me how she had visited prisoners there) and a tour of, and an amazing dinner in, &lt;a href=&#34;http://en.wikipedia.org/wiki/Cecil_Rhodes&#34;&gt;Cecil Rhodes&lt;/a&gt;&amp;rsquo; home.&lt;/p&gt;
&lt;p&gt;Because of CSW&amp;rsquo;s work in the health care and pharmaceutical field, several tracks and many speakers and attendees come from that world. My chance to learn about this came in the more informal settings, as I tried to express the combination of professional services (XML and metadata-related design, architecture, and implementation) and content services (conversion and enrichment, either automated or by hand, of potentially massive amounts of content) that my &lt;a href=&#34;http://www.innodata-isogen.com&#34;&gt;new employer&lt;/a&gt; provides.&lt;/p&gt;
&lt;p&gt;Most speakers and many attendees end up at the indoor/outdoor college bar at one end of the back quad at the end of the night. I may have asked too many British XML summer school participants &amp;ldquo;why are you drinking Corona when you have much better beer in this country?&amp;rdquo; Observing the attendees of other unrelated programs as they hung out in the bar was also a popular sport, although some made me embarrassed to be from the same country as them—imagine tables of drunken college-age girls wearing their new &amp;ldquo;Oxford&amp;rdquo; T-shirts and shrieking at each other: &amp;ldquo;Oh my GAWD! That&amp;rsquo;s AWESOME!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;CSW&amp;rsquo;s Kerry Poulter, who as usual kept the frenetic week running smoothly, had dug up a few acoustic guitars in case any evening drinking led to a hootenanny, so one night in the back quad bar Eve got out her rollup keyboard, Tony Coates pulled out his electric ukulele, and we played until the bartender cut us off about sixteen bars into Led Zeppelin&amp;rsquo;s &amp;ldquo;Rock and Roll.&amp;rdquo; Our strange trio had some success with &amp;ldquo;Midnight Hour,&amp;rdquo; &amp;ldquo;It&amp;rsquo;s All Over Now,&amp;rdquo; and the Velvet Underground&amp;rsquo;s &amp;ldquo;Oh Sweet Nothing,&amp;rdquo; and I insisted that we start with &amp;ldquo;Wish You Were Here&amp;rdquo; to honor the recent passing of its subject, Syd Barrett. I&amp;rsquo;ve sat around with acoustics and alcohol and friends before, but never with a wireless laptop to double-check on chords and lyrics.&lt;/p&gt;
&lt;p&gt;I certainly learn a lot at the XML Summer School each year, and judging by the number of attendees who return for multiple years to attend different tracks each time, they&amp;rsquo;re learning a lot too. Oxford is beautiful, and a little independent tourism in the area always pays off—many people (Paul Prescod this year, &lt;a href=&#34;http://seanmcgrath.blogspot.com/&#34;&gt;Sean McGrath&lt;/a&gt; and Lauren Wood and &lt;a href=&#34;http://tbray.org/ongoing/&#34;&gt;Tim Bray&lt;/a&gt; last year, me for a few past years and hopefully next year, and some attendees as well) bring their families along. If you&amp;rsquo;re interested in going next year, sign up early, because they ended up turning away people for this summer&amp;rsquo;s session.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-476&#34;&gt;July 31, 2006 9:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any pointers on papers, docs? (In particular the RDF/OWL stuff &amp;amp; Ajax+XSLT).&lt;/p&gt;
&lt;p&gt;btw, it&amp;rsquo;s &amp;ldquo;Oh! Sweet Nuthin&amp;rsquo;&amp;rdquo; - only significant when you go hunting the tab ;-)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>So You Want to Write a Book (about software)?</title>
      <link>https://www.bobdc.com/blog/so-you-want-to-write-a-book-ab/</link>
      <pubDate>Thu, 20 Jul 2006 08:42:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/so-you-want-to-write-a-book-ab/</guid>
      
      
      <description><div>Advice from some people who know.</div><div>&lt;p&gt;This installment of my series on &lt;a href=&#34;http://www.snee.com/bobdc.blog/publishing/documenting_software/&#34;&gt;writing about software&lt;/a&gt; is brief: O&amp;rsquo;Reilly has an excellent online publication called &lt;a href=&#34;https://web.archive.org/web/20060719112206/http://www.oreilly.com/oreilly/author/&#34;&gt;So You Want to Write a Book?&lt;/a&gt;, and it&amp;rsquo;s full of good advice whether you plan to write a complete book or not. O&amp;rsquo;Reilly isn&amp;rsquo;t the only publisher of computer books out there, but they&amp;rsquo;re certainly one of the key ones, and this book offers excellent advice for preparing your content and for assembling all the necessary information if you want to move forward with publishing a complete book with them, with another publisher, or &lt;a href=&#34;https://www.bobdc.com/blog/selfpublishing-bound-hardcopy&#34;&gt;on your own&lt;/a&gt;. Even tech writers creating documentation for internal use at their company can learn something about book preparation from it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>Nice parodies of &#34;Mac hipper than PC&#34; ads</title>
      <link>https://www.bobdc.com/blog/nice-parodies-of-mac-hipper-th/</link>
      <pubDate>Wed, 12 Jul 2006 08:16:38 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/nice-parodies-of-mac-hipper-th/</guid>
      
      
      <description><div>The one in the T-shirt and hoodie is the cool one, not the one in the suit, right?</div><div>&lt;p&gt;I&amp;rsquo;ve found the recent U.S. TV ads of Groovy Mac Guy and Staid PC Suit Guy to be pretty annoying, especially considering the recent discussions by Mac people of how appealing Ubuntu has been looking to them (&lt;a href=&#34;http://www.megginson.com/blogs/quoderat/archives/2006/06/29/macos-vs-ubuntu-linux/&#34;&gt;Dave Megginson&lt;/a&gt;, &lt;a href=&#34;http://www.tbray.org/ongoing/When/200x/2006/06/15/Switch-From-Mac&#34;&gt;Tim Bray&lt;/a&gt;, and others they point to). It would be nice to add a Linux guy (who, to be honest, would probably look a lot like &lt;a href=&#34;http://cbg.nohomers.net/&#34;&gt;Comic Book Guy&lt;/a&gt;) discussing with the PC guy how much &lt;a href=&#34;http://www.theopencd.org/&#34;&gt;open source software&lt;/a&gt; they had in common, but I suppose Mac guy would then say that with OS X, they&amp;rsquo;re catching up there—not in so many words, of course.&lt;/p&gt;
&lt;p&gt;Anyway, YouTube has some &lt;a href=&#34;http://www.youtube.com/watch?v=UA3NyRr4Eng&#34;&gt;hilarious parodies&lt;/a&gt; of the Mac ads.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Opening Pandora&#39;s (music) box</title>
      <link>https://www.bobdc.com/blog/opening-pandoras-music-box/</link>
      <pubDate>Thu, 06 Jul 2006 15:55:01 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/opening-pandoras-music-box/</guid>
      
      
      <description><div>Free online music and algorithmic suggestions.</div><div>&lt;p&gt;I&amp;rsquo;ve been listening to music on the &lt;a href=&#34;http://www.pandora.com/&#34;&gt;Pandora&lt;/a&gt; internet radio site for a while now. After creating a free account, you define a &amp;ldquo;channel&amp;rdquo; by naming one or more artists, and then they play whatever music they have by that artist and music that they judge to be similar. Similarity rankings are based on attributes they&amp;rsquo;ve assigned to different artists and judgments by their listeners who&amp;rsquo;ve clicked the Thumbs Up or Thumbs Down icons available with each song. Of course, links are provided if you like something enough to buy it, and ads for music include links to let you add that artist as a new channel in your collection.&lt;/p&gt;
&lt;p&gt;Sometimes the attributes they judge by are not the attributes you&amp;rsquo;re interested in. I created a channel called &amp;ldquo;crunchy art rock&amp;rdquo; that started off with Radiohead and Beck because I like music that balances rocking guitars with less melodic electronic noise (I still don&amp;rsquo;t own Roxy Music&amp;rsquo;s first album, and really should pick up a copy), especially the old school analog electronics. When that channel started playing lots of quiet music with obscure lyrics that were &amp;ldquo;sensitive&amp;rdquo; to the point of whining, I realized that what Pandora found that Radiohead and Beck had in common was different from what I liked about them, so I didn&amp;rsquo;t listen to that channel much. Recently, adding Sonic Youth and Pere Ubu to that channel has helped, although I hear a lot of fairly generic &amp;ldquo;alternative&amp;rdquo; rock there now.&lt;/p&gt;
&lt;p&gt;An even funnier misjudgment came when I added Doris Day&amp;rsquo;s name to a channel of mostly 1940&amp;rsquo;s swing that I started with Artie Shaw&amp;rsquo;s name (I&amp;rsquo;ve recently gotten to appreciate him more, but not enough to run out and buy a CD), because what little I&amp;rsquo;d heard of her early singing was surprisingly good. Pandora seemed to consider Day&amp;rsquo;s later, better-known hits as representative, and put songs by the Carpenters and Barbra Streisand on the channel, so I swiftly removed Doris Day. I usually listen to the &amp;ldquo;jazz (mostly bass)&amp;rdquo; channel that I started with the names of bass players who have fronted groups; it&amp;rsquo;s dominated by small group hardbop and makes good listening while working.&lt;/p&gt;
&lt;p&gt;My brother suggested that I try &lt;a href=&#34;http://www.last.fm/&#34;&gt;last.fm&lt;/a&gt;, a similar service, to compare the wisdom of their algorithm for guessing what I&amp;rsquo;d like with Pandora&amp;rsquo;s. I didn&amp;rsquo;t like last.fm as much, but not because of their algorithm. First, they make you download and install some client software before using it, which would have seemed reasonable enough three years ago but already seems old-fashioned—Pandora has a much more Web 2.0 approach, keeping all the logic and interface on the server (and providing one exception to the rule that Flash sucks). Secondly, when you&amp;rsquo;ve listened to enough music on one of these services from an artist you&amp;rsquo;re familiar with, you realize that the service licensed whatever they could, which may or may not be a large sample of that artist&amp;rsquo;s music. It&amp;rsquo;s fun to talk and guess about the algorithm they use to suggest music to you, but when you listen for more than an hour or two, the size of the pool that they can draw from based on your criteria becomes much more important than the magic predictive algorithm. Otherwise, you end up listening to the same things over and over. I was also annoyed when a last.fm search for &amp;ldquo;Ray Brown&amp;rdquo; didn&amp;rsquo;t turn anything up, but an artist that they suggested as similar to Charlie Mingus was the &amp;ldquo;Ray Brown Trio&amp;rdquo;—if last.fm&amp;rsquo;s ability to search artist names couldn&amp;rsquo;t match up &amp;ldquo;Ray Brown&amp;rdquo; with &amp;ldquo;Ray Brown Trio,&amp;rdquo; then I can&amp;rsquo;t have much faith in their predictive algorithm or even the overall architecture of their system. (For fun, I asked last.fm what artists were similar to the Beatles, and they listed the Rolling Stones, Radiohead, Led Zeppelin, and the Eagles. I guess that either being English or having multi-platinum albums makes you similar to the Beatles.)&lt;/p&gt;
&lt;p&gt;Pandora knows that the size of their collection is an issue, because every now and then I get an email mentioning new material they&amp;rsquo;ve acquired that fits in with one of my channels, so I may go back and listen to that channel after ignoring it for a while. Another neat touch is that they make it easy to email anyone about a channel, and the email includes a URL that lets the recipient jump right in and listen to that channel. (Let me know if you&amp;rsquo;d like me to have them email you any of these channels.) They don&amp;rsquo;t seem to have any classical music, which would also be nice for work purposes. Even big-name performers like Gil Shaham and Yo Yo Ma get no hits, and when I searched for &amp;ldquo;Ludwig van Beethoven,&amp;rdquo; after having me confirm that it was an artist name, Pandora asked me if I meant &amp;ldquo;Camper Van Beethoven.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;If I was younger, I&amp;rsquo;d play one of those &amp;ldquo;look how eclectic I am&amp;rdquo; games on Pandora and create a channel with several artists from around the world who have nothing to do with each other to see what Pandora&amp;rsquo;s algorithm comes up with, but I&amp;rsquo;m more interested in finding good music to listen to while working that isn&amp;rsquo;t too distracting but may lead me to discover new music in a particular category. For discovering international music from outside of my usual categories, I&amp;rsquo;ll stick with the streaming audio of the &lt;a href=&#34;http://www.wfmu.org/&#34;&gt;WFMU&lt;/a&gt; shows &lt;a href=&#34;http://www.wfmu.org/Playlists/Doug/&#34;&gt;Give the Drummer Some&lt;/a&gt; and &lt;a href=&#34;http://www.wfmu.org/playlists/TP&#34;&gt;Transpacific Sound Paradise&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-450&#34;&gt;July 6, 2006 5:38 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Made my day: &amp;ldquo;Radiohead&amp;hellip;Beck&amp;hellip;adding Sonic Youth and Pere Ubu to that channel has helped&amp;hellip;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://xmlhacker.com&#34; title=&#34;http://xmlhacker.com&#34;&gt;M. David Peterson&lt;/a&gt; on &lt;a href=&#34;#comment-451&#34;&gt;July 7, 2006 1:42 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I haven&amp;rsquo;t clicked Danny&amp;rsquo;s trackback to see if he feels the same way I do, but from the lead-in it seems that he does&amp;hellip;&lt;/p&gt;
&lt;p&gt;To both read your analysis, and notice that you have such FANTASTIC taste in music brings a smile to my face. Anybody who can list out and compare Radiohead, Beck, Sonic Youth, etc&amp;hellip; and be able to understand how they are related, and how they are not is just fantastic in my book :)&lt;/p&gt;
&lt;p&gt;Nice! :D&lt;/p&gt;
&lt;p&gt;By Joe on &lt;a href=&#34;#comment-452&#34;&gt;July 8, 2006 2:39 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For what it&amp;rsquo;s worth, the Pandora folks openly acknowledge that it works better if you specify songs rather than artists for the radio stations, for reasons that your post suggests.&lt;/p&gt;
&lt;p&gt;Thus, if you don&amp;rsquo;t like Doris Day&amp;rsquo;s later work, you should specify a few specific early Doris Day songs that you do like. Also, the thumbs up/thumbs down thing is only relevant within a given channel, so you don&amp;rsquo;t necessarily have to feel like you&amp;rsquo;re permanently harming Pandora&amp;rsquo;s likelihood of giving you something you &amp;ldquo;sometimes like&amp;rdquo; &amp;ndash; if it doesn&amp;rsquo;t fit on your idea of a channel, give it a thumbs down and you can make another channel where it does belong.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not shilling for Pandora &amp;ndash; just sharing some stuff I&amp;rsquo;ve picked up along the way.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-454&#34;&gt;July 9, 2006 2:28 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Joe, this is useful stuff to know.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://plasmasturm.org/&#34; title=&#34;http://plasmasturm.org/&#34;&gt;Aristotle Pagaltzis&lt;/a&gt; on &lt;a href=&#34;#comment-462&#34;&gt;July 16, 2006 7:05 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My understanding is that last.fm suggests to you songs that other people like who listen to stuff you like – the Web 2.0 “wisdom of the crowd” thing and all that. Of course, the result of that approach is that all popular music is deemed similar to all the other popular music, within rough genres – hence the Beatles being similar to the Rolling Stones, Radiohead, Led Zeppelin, and the Eagles.&lt;/p&gt;
&lt;p&gt;Also, having to download software, to me, makes more sense than the Pandora approach. With Pandora, I’ve got to have a browser open, and I can’t listen to my own music. With the last.fm client (or one of the many third-party plugins to connect various music/media players to the service), your playlist just piles up as a side-effect of your regular music listening: no need to change habits. So both approaches have their pros and cons.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>TagSoup 1.0 released</title>
      <link>https://www.bobdc.com/blog/tagsoup-10-released/</link>
      <pubDate>Tue, 27 Jun 2006 08:22:49 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/tagsoup-10-released/</guid>
      
      
      <description><div>A milestone for a very useful open source XML utility.</div><div>&lt;p&gt;John Cowan recently &lt;a href=&#34;http://recycledknowledge.blogspot.com/2006/06/tagsoup-10-final-released.html&#34;&gt;announced&lt;/a&gt; the availability of release 1.0 of &lt;a href=&#34;http://home.ccil.org/~cowan/XML/tagsoup/&#34;&gt;TagSoup&lt;/a&gt;, his Open Source Java tool that parses even the ugliest HTML and lets you treat it like well-formed XML.&lt;/p&gt;
&lt;p&gt;This single jar file does a lot for 50K. Although it started off purely as a library with a programming API, it eventually acquired a command line interface. Enter a command like the following (without the carriage return) at your operating system prompt to create an XHTML version of the input:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -jar tagsoup-1.0.jar 
  http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The input in the example above is John&amp;rsquo;s file of particularly &amp;ldquo;evil&amp;rdquo; HTML; you can pass a local filename as the argument as well. The wide selection of optional command line parameters includes various options for dealing with &amp;ldquo;bogons,&amp;rdquo; or unknown elements: you can tell TagSoup to leave them alone (other than ensuring that they&amp;rsquo;re well-formed), delete them, or render them empty with their contents moved outside of their tag boundaries.&lt;/p&gt;
&lt;p&gt;Dave Raggett&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/People/Raggett/tidy/&#34;&gt;HTML Tidy&lt;/a&gt; program has been justly popular for cleaning up messy HTML, but it can be a bit picky. The TagSoup motto is &amp;ldquo;Keep On Truckin&amp;rsquo;,&amp;rdquo; and it will forge ahead to do its best with whatever you give it. (Try a View Source of the extreme.html file mentioned above to see the kind of HTML that it valiantly navigates.) The TagSoup home page further describes its differences from HTML Tidy.&lt;/p&gt;
&lt;p&gt;A companion to TagSoup is &lt;a href=&#34;http://mercury.ccil.org/~cowan/XML/tagsoup/tsaxon/&#34;&gt;TSaxon&lt;/a&gt;, a repackaging of version 6.5.3 of Michael Kay&amp;rsquo;s Saxon XSLT 1.0 implementation that includes TagSoup. (I would have posted about TagSoup 1.0 earlier, but John was straightening out some jar packaging problems with TSaxon.) Point TSaxon at an XSLT stylesheet and some ugly HTML, and the TagSoup parser will clean up the HTML before passing it along to Saxon to have the stylesheet applied. For example, the following (without the carriage returns), when using the TSaxon version of saxon.jar, adds &lt;code&gt;id&lt;/code&gt; attributes to block elements of the cleaned-up version of extreme.html:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;java -jar saxon.jar -H 
  http://home.ccil.org/~cowan/XML/tagsoup/extreme.html 
  http://www.snee.com/xml/xslt/addids.xsl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;John has worked on this for four years, and was careful not to push it along any faster than it deserved to go (I wanted to write about TSaxon in my XML.com column on XSLT, but he insisted that it wasn&amp;rsquo;t ready yet), so reaching 1.0 is really a milestone for TagSoup. It&amp;rsquo;s quite a gift to people who do screenscraping or any kind of beating into shape of messy HTML content.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comment&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://xam.de&#34; title=&#34;http://xam.de&#34;&gt;Max Völkel&lt;/a&gt; on &lt;a href=&#34;#comment-449&#34;&gt;July 6, 2006 9:12 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You should also have a look at CyberNeko; that library by Andy Clark is well-maintained and well-performing.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>A nice Windows alternative to Acrobat Reader</title>
      <link>https://www.bobdc.com/blog/a-nice-windows-alternative-to/</link>
      <pubDate>Tue, 20 Jun 2006 09:04:01 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-nice-windows-alternative-to/</guid>
      
      
<description><div>A recent Adobe Acrobat Reader upgrade gave me a little too much extra time for Adobe&#39;s own good.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.foxitsoftware.com/pdf/rd_intro.php&#34;&gt;&lt;img src=&#34;http://www.foxitsoftware.com/image/foxit_logo.gif&#34; alt=&#34;[foxit software logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tim Bray&amp;rsquo;s &lt;a href=&#34;http://www.tbray.org/ongoing/When/200x/2006/06/15/Switch-From-Mac&#34;&gt;recent posting&lt;/a&gt; on potentially moving away from the Mac to Ubuntu reminded me how much I like the small, lightweight alternatives to Adobe Acrobat Reader on Ubuntu. While trying to read a PDF file that same day on a Windows machine, I instead got yet another Adobe reminder about an Acrobat Reader upgrade and &amp;ldquo;related&amp;rdquo; useless add-ons that were far outside of Adobe&amp;rsquo;s core competencies (a photo album?). The Reader upgrade was marked as &amp;ldquo;critical,&amp;rdquo; and it also told me that I could continue using Reader while it installed, so I told it to go ahead.&lt;/p&gt;
&lt;p&gt;The latter part was a lie. It closed Reader, so during the upgrade I did a Google search for &amp;ldquo;pdf reader windows&amp;rdquo; and discovered &lt;a href=&#34;http://www.foxitsoftware.com/pdf/rd_intro.php&#34;&gt;Foxit Reader&lt;/a&gt;. It&amp;rsquo;s free, and there&amp;rsquo;s no installation routine; you just start up the one-meg EXE file and you&amp;rsquo;re using it. Quickly.&lt;/p&gt;
&lt;p&gt;I love it, and highly recommend it. At first Firefox still opened downloaded documents in Acrobat, but I found out &lt;a href=&#34;http://www.lifehacker.com/software/productivity/download-of-the-day-foxit-pdf-reader-109741.php&#34;&gt;from Lifehacker&lt;/a&gt; how to get Firefox to use Foxit instead. I&amp;rsquo;ve spent too much time on &lt;a href=&#34;http://sastools.com/b2/post/79394202&#34;&gt;various tricks&lt;/a&gt; to get Adobe Reader to start up faster, but no more. Foxit Reader will definitely add a few more free minutes to each of my working days, and I&amp;rsquo;ll be happy to never again have to look at an Adobe upgrade notice when I&amp;rsquo;m trying to read a PDF file.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Creating an affiliate website</title>
      <link>https://www.bobdc.com/blog/creating-an-affiliate-website/</link>
      <pubDate>Wed, 14 Jun 2006 08:53:06 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/creating-an-affiliate-website/</guid>
      
      
      <description><div>For fun and very little profit.</div><div>&lt;p&gt;If you click &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0262510871/bobducharmeA/&#34;&gt;this link&lt;/a&gt;, you&amp;rsquo;ll find that it leads to an Amazon web page where you can buy Abelson and Sussman&amp;rsquo;s &amp;ldquo;Structure and Interpretation of Computer Programs&amp;rdquo;. &lt;a href=&#34;http://www.amazon.com/gp/product/0262510871&#34;&gt;This link&lt;/a&gt; also links to an Amazon page where you can buy the classic computer science textbook, but I&amp;rsquo;d rather that you followed the first link if you&amp;rsquo;re going to buy it. The URL includes a parameter telling Amazon that you came there from a site created by someone with the Amazon affiliate ID bobducharmeA, so they&amp;rsquo;ll pay me a commission for the sale.&lt;/p&gt;
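&lt;p&gt;The tracking scheme is simple enough to sketch in a few lines. This toy function is my own illustration, not anything Amazon publishes; the URL layout is copied from the first link above, where the affiliate ID is just one more path segment in the product URL:&lt;/p&gt;

```python
def amazon_affiliate_url(isbn, affiliate_id):
    """Build an old-style Amazon associates URL for a given ISBN.

    The path layout matches the link in this post: the affiliate ID
    rides along as a trailing path segment, which is how Amazon knows
    whom to credit for the sale.
    """
    return f"http://www.amazon.com/exec/obidos/ISBN={isbn}/{affiliate_id}/"

# The book link from this post, rebuilt from its two pieces.
link = amazon_affiliate_url("0262510871", "bobducharmeA")
```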
&lt;p&gt;&lt;a href=&#34;http://www.hipstergifts.com&#34;&gt;&lt;img src=&#34;http://www.hipstergifts.com/img/hipstergifts.gif&#34; alt=&#34;[hipstergifts.com logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Amazon is the most famous web site with an affiliate program, but there are many others. I started keeping a running list of these sites when I noticed online stores with cool stuff that offered such a program, thinking that some day I could create a &amp;ldquo;virtual&amp;rdquo; store of nothing but links to products that I liked on a collection of affiliate sites. Somewhere in that time, I saw that the domain name hipstergifts.com wasn&amp;rsquo;t taken, and I couldn&amp;rsquo;t resist parking it.&lt;/p&gt;
&lt;p&gt;As a between-jobs project, I decided to follow through on this and created &lt;a href=&#34;http://www.hipstergifts.com&#34;&gt;hipstergifts.com&lt;/a&gt;. I wondered how little work and money was necessary to create a professional-looking website that would potentially make some money.&lt;/p&gt;
&lt;p&gt;The main work was registering as an affiliate on each site, selecting the products to point to, and trying to learn exactly what URL would point to each product while giving me credit for any resulting sales. Looking through gag gift web sites and hip hop discount bling jewelry and clothing sites (two common categories out there of affiliate sites with a sense of humor) was fun, although I think my wife got tired of me pointing out products like &lt;a href=&#34;http://www.anrdoezrs.net/click-1973330-610365?url=http%3A%2F%2Fwuwearshoes.com%2FMerchant2%2Fmerchant.mv%3FScreen%3DPROD%26Product_Code%3DUSZIDN%26Category_Code%3DWD%26Store_Code%3DW&#34;&gt;Wu Wear Baby Shoes&lt;/a&gt; and &lt;a href=&#34;http://www.prankplace.com/bullet.htm?KBID=3453&#34;&gt;fake bullet hole decals&lt;/a&gt; for cars. You won&amp;rsquo;t see too many entries on my &lt;a href=&#34;http://www.hipstergifts.com/vendors.html&#34;&gt;vendors page&lt;/a&gt; because applying to an affiliate program doesn&amp;rsquo;t automatically get you in—for some, I apparently didn&amp;rsquo;t make the grade.&lt;/p&gt;
&lt;p&gt;I stored the product and link information in an XML file so that I could generate the web pages with an XSLT stylesheet. To make the website look somewhat professional, I found an open source CSS stylesheet on &lt;a href=&#34;http://www.openwebdesign.org/&#34;&gt;Open Web Design&lt;/a&gt;, which I&amp;rsquo;ve written about here &lt;a href=&#34;https://www.bobdc.com/blog/easy-professionallooking-websi&#34;&gt;before&lt;/a&gt; (note the last line of that weblog posting in particular). I even spent $25 to have &lt;a href=&#34;http://www.gotlogos.com/&#34;&gt;gotlogos.com&lt;/a&gt; design a hipstergifts logo. I&amp;rsquo;m not completely thrilled with the logo they came up with, but considering what I spent, it&amp;rsquo;s fine and I have little right to complain.&lt;/p&gt;
&lt;p&gt;To keep the hipstergifts.com main page from looking too static, I have a cron job run a stylesheet that picks a random product from each category to recreate the index.html page each morning. For publicity, I bought a few related keywords on &lt;a href=&#34;https://adwords.google.com/select/&#34;&gt;Google AdWords&lt;/a&gt; and &lt;a href=&#34;http://searchmarketing.yahoo.com/&#34;&gt;Yahoo Sponsored Search&lt;/a&gt;, but the ads don&amp;rsquo;t rank too highly. &amp;ldquo;Gifts&amp;rdquo; is an expensive keyword.&lt;/p&gt;
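&lt;p&gt;The real version is an XSLT stylesheet run from cron, but the idea behind the daily shuffle can be sketched in a few lines of Python. The product data here is a made-up stand-in for my XML file, and seeding the generator with the date is just one way to make the &amp;ldquo;random&amp;rdquo; front page stable all day and different the next morning:&lt;/p&gt;

```python
import random

# Hypothetical stand-in for the product catalog; the real data lives in
# an XML file and is rendered by XSLT, so this only mirrors the idea.
products = {
    "gag gifts": ["fake bullet hole decals", "backwards clock"],
    "bling": ["Wu Wear baby shoes", "discount chain"],
}

def daily_picks(catalog, seed):
    # Seeding with the date means every run on the same day agrees on
    # the picks, and tomorrow's cron run gets a fresh selection.
    rng = random.Random(seed)
    return {category: rng.choice(items) for category, items in catalog.items()}

picks = daily_picks(products, seed="2006-06-14")
```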
&lt;p&gt;My fantasy was to set up this web site and then forget about it as money rolled in, but in the two months that the site has been up, I doubt that it has even paid for my logo. I could work harder at the &lt;a href=&#34;http://en.wikipedia.org/wiki/Search_engine_optimization&#34;&gt;search engine optimization&lt;/a&gt; part—there is an entire subculture of people who obsess over this, and some make very good money—but now that I have a real job again I have less time for that hobby.&lt;/p&gt;
&lt;p&gt;If you need any fake bullet holes for your car, though, or gimmicky shooting toys or a backwards clock or a fake computer mouse that gives people an electronic shock, please remember &lt;a href=&#34;http://www.hipstergifts.com&#34;&gt;hipstergifts.com&lt;/a&gt;. Especially at holiday time.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>RDF metadata in XHTML gets even easier</title>
      <link>https://www.bobdc.com/blog/rdf-metadata-in-xhtml-gets-eve/</link>
      <pubDate>Thu, 08 Jun 2006 08:39:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/rdf-metadata-in-xhtml-gets-eve/</guid>
      
      
      <description><div>Elias Torres did the hard part; join in with the fun part!</div><div>&lt;p&gt;I&amp;rsquo;ve &lt;a href=&#34;http://www.snee.com/cgi-sys/cgiwrap/bobd/managed-mt/mt-search.cgi?IncludeBlogs=2&amp;amp;search=rdf%2Fa&#34;&gt;written here before&lt;/a&gt; about RDF/A (now known as &lt;a href=&#34;http://www.w3.org/TR/xhtml-rdfa-primer/&#34;&gt;RDFa&lt;/a&gt;), the spec for embedding RDF triples into XHTML using existing XHTML markup. I&amp;rsquo;ve felt for a while that it holds great promise for making RDF easier to use and easier to incorporate into typical web pages, thereby allowing the creation of a real semantic web of RDF data. I had vague plans to write an XSLT stylesheet that would extract the RDF triples from an XHTML file&amp;rsquo;s RDFa markup, and for sample input I did put together a &lt;a href=&#34;http://www.snee.com/xml/rdfa/rdfa1.html&#34;&gt;test document&lt;/a&gt; that incorporates a lot of sample RDFa from a March version of the RDFa Primer.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.w3.org/TR/xhtml-rdfa-primer/&#34;&gt;&lt;img src=&#34;http://www.w3.org/RDF/icons/rdf_w3c_icon.128&#34; alt=&#34;[RDF logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;While I put off writing the stylesheet that would do this, Elias Torres wasn&amp;rsquo;t as lazy as me, and he went ahead and created an &lt;a href=&#34;http://torrez.us/rdfa&#34;&gt;RDFa Extractor&lt;/a&gt;. The REST interface makes it easy to extract RDF/XML triples from an existing document; check out the &lt;a href=&#34;http://torrez.us/services/rdfa/?url=http%3A%2F%2Fwww.snee.com%2Fxml%2Frdfa%2Frdfa1.html&#34;&gt;triples from my first test document&lt;/a&gt; that it pulled out.&lt;/p&gt;
&lt;p&gt;Many people are interested in RDFa for its ability to add semantics to existing data—for example, to add markup around a string of digits that already exist in a web page to indicate that the string is a &lt;a href=&#34;http://xmlns.com/foaf/0.1/phone&#34;&gt;http://xmlns.com/foaf/0.1/phone&lt;/a&gt; number. (As the &lt;a href=&#34;http://www.w3.org/TR/2006/WD-xhtml-rdfa-primer-20060516/&#34;&gt;primer&lt;/a&gt; tells us, &amp;ldquo;An important goal of RDFa is to achieve this RDF embedding without repeating existing XHTML content when that content is the metadata.&amp;rdquo;) Because of my work with the &lt;a href=&#34;http://www.prismstandard.org/&#34;&gt;PRISM&lt;/a&gt; group, I was interested in adding data that&amp;rsquo;s a little more meta, such as production workflow data, in which the actual metadata values are not part of the content. After a few questions on the &lt;a href=&#34;http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/&#34;&gt;rdf-in-xhtml&lt;/a&gt; mailing list, I found this pretty easy. To test this kind of metadata, I created &lt;a href=&#34;http://www.snee.com/xml/rdfa/rdfa2.html&#34;&gt;rdfa2.html&lt;/a&gt; yesterday to see what Elias&amp;rsquo;s program would do with it, and the &lt;a href=&#34;http://torrez.us/services/rdfa/?url=http%3A%2F%2Fwww.snee.com%2Fxml%2Frdfa%2Frdfa2.html&#34;&gt;results&lt;/a&gt; are great.&lt;/p&gt;
&lt;p&gt;There are several things that I like about these results. First, I put an empty string as the subject of the metadata about the document itself, as shown below, and Elias&amp;rsquo;s extractor created triples with the document&amp;rsquo;s full URL as the subject of the triples. It also created separate triples for the two dc:subject properties that I assigned to the document.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta about= &amp;quot;&amp;quot;&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:title&amp;quot; content=&amp;quot;Meta-metadata&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:date&amp;quot; content=&amp;quot;2006-06-06&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:subject&amp;quot; content=&amp;quot;metadata&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;dc:subject&amp;quot; content=&amp;quot;RDFa test document 2&amp;quot;/&amp;gt;
&amp;lt;/meta&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
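&lt;p&gt;The subject resolution described above follows the standard relative-URL rules, so it can be sketched with nothing but the Python standard library. This is just the resolution rule, not Elias&amp;rsquo;s extractor:&lt;/p&gt;

```python
from urllib.parse import urljoin

# An empty about="" resolves to the document's own URL, and about="#s1"
# resolves to a fragment of that URL, exactly the behavior the extractor
# showed for the meta elements above.
doc = "http://www.snee.com/xml/rdfa/rdfa2.html"

subject_for_document = urljoin(doc, "")     # the document itself
subject_for_section = urljoin(doc, "#s1")   # section s1 within it
```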
&lt;p&gt;Publishers in the PRISM group were concerned about the ability to assign out-of-line metadata to specific sections, such as a recipe or image within a larger document. To test this, the rdfa2.html document has the following in the content of its &lt;code&gt;body&lt;/code&gt; element (the &lt;code&gt;section&lt;/code&gt; element is a nice new XHTML 2 feature):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;section id=&amp;quot;s1&amp;quot;&amp;gt;
  &amp;lt;h2&amp;gt;Part one&amp;lt;/h2&amp;gt;
  &amp;lt;p&amp;gt;This document has very little data, but plenty of metadata.&amp;lt;/p&amp;gt;
  &amp;lt;p&amp;gt;It&#39;s my second RDFa test document. I created my 
   &amp;lt;a id=&amp;quot;l1&amp;quot; href=&#39;rdfa1.html&#39;&amp;gt;first one&amp;lt;/a&amp;gt; several months ago.&amp;lt;/p&amp;gt;
&amp;lt;/section&amp;gt;
&amp;lt;section id=&amp;quot;s2&amp;quot;&amp;gt;
  &amp;lt;h2&amp;gt;Part two&amp;lt;/h2&amp;gt;
  &amp;lt;p&amp;gt;This concludes our test.&amp;lt;/p&amp;gt;
&amp;lt;/section&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It also has the following inside the &lt;code&gt;html/head&lt;/code&gt; element, where the &lt;code&gt;meta&lt;/code&gt; elements shown earlier with metadata about the document itself are stored:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta about=&amp;quot;#s1&amp;quot;&amp;gt;
  &amp;lt;meta property=&amp;quot;sn:goofinessFactor&amp;quot; content=&amp;quot;3.2&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;sn:direction&amp;quot; content=&amp;quot;south&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;sn:editor&amp;quot; content=&amp;quot;lj&amp;quot;/&amp;gt;
&amp;lt;/meta&amp;gt;


&amp;lt;meta about=&amp;quot;#s2&amp;quot;&amp;gt;
 &amp;lt;meta property=&amp;quot;sn:goofinessFactor&amp;quot; content=&amp;quot;4.3&amp;quot;/&amp;gt;
 &amp;lt;meta property=&amp;quot;sn:direction&amp;quot; content=&amp;quot;north&amp;quot;/&amp;gt;
 &amp;lt;meta property=&amp;quot;sn:editor&amp;quot; content=&amp;quot;tr&amp;quot;/&amp;gt;
&amp;lt;/meta&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(When I test the assignment of arbitrary metadata, I like to pick some pretty arbitrary metadata.) Again, Elias&amp;rsquo;s extractor extracted the triples I hoped to see. Right after those two sets of nested &lt;code&gt;meta&lt;/code&gt; elements, I had a third that assigned metadata to the &lt;code&gt;a&lt;/code&gt; link element inside of section s1:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta about=&amp;quot;#l1&amp;quot;&amp;gt;
  &amp;lt;meta property=&amp;quot;sn:cost&amp;quot; content=&amp;quot;0&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;sn:lastChecked&amp;quot; content=&amp;quot;2006-06-06T09:04&amp;quot;/&amp;gt;
  &amp;lt;meta property=&amp;quot;sn:type&amp;quot; content=&amp;quot;cite&amp;quot;/&amp;gt;
&amp;lt;/meta&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The history of advanced linking architectures is mostly a series of arguments over the appropriate metadata to store with the address (direct or indirect) of the link destination, the one piece of information that a link can&amp;rsquo;t do without. Different people have different ideas about what &amp;ldquo;typical&amp;rdquo; applications need, and a committee that comes up with a common set of additional metadata typically ends up with a mess. RDFa gives people the ability to add whatever metadata they like (with the precisely defined semantics that can come from property names in specific namespaces), which could enable some big advances in linking applications.&lt;/p&gt;
&lt;p&gt;This assignment of metadata to an entire document, to sections within it, and to a specific link within it was just some quick dabbling. There are many other ways that RDFa could be valuable, and Elias&amp;rsquo;s extractor makes them easy to test. So get out there and create new RDFa! Just take some existing web pages, or mock some up, and add semantics to them—movie schedules, directions, parts catalogs, home pages—and see what Elias&amp;rsquo;s RDFa extractor gets out of them.&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://captsolo.net/info/&#34; title=&#34;http://captsolo.net/info/&#34;&gt;CaptSolo&lt;/a&gt; on &lt;a href=&#34;#comment-437&#34;&gt;June 8, 2006 1:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Christoph Gorn has made a nice use of RDFa:&lt;/p&gt;
&lt;p&gt;He already has SIOC profiles with RDF metadata about his blog and its posts.&lt;/p&gt;
&lt;p&gt;Duplicating it all as embedded RDF would not make sense, but RDFa can be nicely used to create rdfs:seeAlso links to the profiles with more RDF data about a post. That&amp;rsquo;d be especially useful for pages with multiple posts per page, such as monthly archives.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://captsolo.net/info/&#34; title=&#34;http://captsolo.net/info/&#34;&gt;CaptSolo&lt;/a&gt; on &lt;a href=&#34;#comment-438&#34;&gt;June 8, 2006 1:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s more info:&lt;br /&gt;
&lt;a href=&#34;http://b4mad.net/datenbrei/archives/2006/06/07/seealso-for-sioc-hooked-in-page-via-rdfa/&#34;&gt;http://b4mad.net/datenbrei/archives/2006/06/07/seealso-for-sioc-hooked-in-page-via-rdfa/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.jasonkolb.com&#34; title=&#34;http://www.jasonkolb.com&#34;&gt;Jason Kolb&lt;/a&gt; on &lt;a href=&#34;#comment-439&#34;&gt;June 13, 2006 8:53 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is a very nice, elegant approach. The ONLY thing I don&amp;rsquo;t like about it as much as using microformats is that the data doesn&amp;rsquo;t stay with the text itself. How would say, a feed reader handle this, and would it be able to extract a brief &amp;ldquo;sample&amp;rdquo; text chunk from the text and carry the meta data with it?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-440&#34;&gt;June 13, 2006 10:11 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Jason,&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s actually the use case around which RDFa was designed, and what its designers more typically expect people to do. (I&amp;rsquo;m probably not the first to describe RDFa as &amp;ldquo;microformats done right.&amp;rdquo;) Because of publishing use cases from the PRISM group, I was more interested in seeing how well it worked with more out-of-line metadata, and as it turned out, it works fine. That&amp;rsquo;s why my examples focus on that.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Writing about software: lists</title>
      <link>https://www.bobdc.com/blog/writing-about-software-lists/</link>
      <pubDate>Thu, 01 Jun 2006 08:44:34 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/writing-about-software-lists/</guid>
      
      
      <description><div>Bulleted lists Numbered lists Other kinds of lists</div><div>&lt;p&gt;As part of my sporadic series on &lt;a href=&#34;http://www.snee.com/bobdc.blog/publishing/documenting_software/&#34;&gt;documenting software&lt;/a&gt;, I wanted to devote a posting to lists, because they play a much larger role in tech writing than they do in typical prose. People reading a chapter of software documentation are less likely to read it from beginning to end than they are to skim it looking for an answer, and if the question is &amp;ldquo;how do I accomplish task X&amp;rdquo;, the answer is probably a multi-part answer. When the instructions for task X are broken up into separate list items, they&amp;rsquo;re easier to find and easier to follow—after carrying out step 3, it&amp;rsquo;s easier to go back and find step 4 if it&amp;rsquo;s indented with a big &amp;ldquo;4&amp;rdquo; in front of it than if it&amp;rsquo;s in the middle of big dense paragraph.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/listicons.jpg&#34; alt=&#34;[list icons]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;The two basic kinds of lists, both so common that word processing programs usually offer toolbar buttons to quickly create each, are bulleted lists and numbered lists. The former is sometimes called an &amp;ldquo;unordered list,&amp;rdquo; giving us the &lt;code&gt;ul&lt;/code&gt; element name used by HTML to create them, and numbered lists are often called &amp;ldquo;ordered lists&amp;rdquo;, giving us the HTML &lt;code&gt;ol&lt;/code&gt; element.&lt;/p&gt;
&lt;p&gt;Sometimes people use one when they should use the other, especially in PowerPoint presentations. The test for which to use is simple:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Think about whether changing the order of your items would cause real problems. It may improve your presentation to save a certain point for last, but if that point came earlier and your audience still got all the information that they need, then that&amp;rsquo;s not a real problem. If you tell your readers to pour oil into the engine before telling them to remove the cap, that&amp;rsquo;s a real problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the ordering really matters, make it a numbered list. Numbered lists are for giving directions, whether you&amp;rsquo;re telling someone how to drive to your house, how to install a piece of software, or how to perform a particular task with that software.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I&amp;rsquo;ll admit, that was a pretty contrived example of a numbered list. Lists with only two items, especially numbered lists, are usually not worth formatting as lists. (There are more contrived examples to come.)&lt;/p&gt;
&lt;p&gt;Bulleted lists are popular for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Listing issues to keep in mind when considering a particular topic.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Highlighting the good points of something, which is why bulleted lists are popular in marketing literature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Naming the bad points of an entity or course of action.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Showing short examples of something.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Identifying future issues to think about that build on what the reader has seen so far.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the sample bulleted list above, note that none of the list items are complete sentences and that they all end with a period. If all the items in a list are complete sentences, or if none ends with punctuation, that&amp;rsquo;s fine, but remember to be consistent throughout a given list. I once saw a bulleted list on a resume from someone applying for a tech writer job in which some items ended with periods and some didn&amp;rsquo;t. If the applicant had wanted to be anything but a tech writer (or perhaps a proofreader or copy editor) this would have been acceptable, but in this case it wasn&amp;rsquo;t. We didn&amp;rsquo;t call him in for an interview.&lt;/p&gt;
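&lt;p&gt;That consistency rule is mechanical enough to check automatically. Here&amp;rsquo;s a toy sketch of my own (not part of any real style checker) that flags a list mixing the two styles:&lt;/p&gt;

```python
def punctuation_is_consistent(items):
    """Return True when either every list item ends with a period or
    none of them does; mixing the two is the mistake described above."""
    endings = [item.rstrip().endswith(".") for item in items]
    return all(endings) or not any(endings)

# Every item ends with a period: consistent.
good = ["Listing issues to keep in mind.", "Showing short examples."]
# One item ends with a period and one does not: the resume mistake.
bad = ["Wrote the user manual.", "Edited release notes"]
```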
&lt;p&gt;You don&amp;rsquo;t have to limit each list item to one phrase, sentence, or even one paragraph. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The first list item in this bulleted list has two paragraphs.&lt;/p&gt;
&lt;p&gt;Additional paragraphs in a list item let you give more background on a particular point without having a large, dense paragraph as your list item. If you put too many paragraphs in a single list item, though, that section of the list looks less and less like a list, which confuses the reader.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the content that you&amp;rsquo;re writing must be converted from one format to another, make sure that these multi-paragraph list items don&amp;rsquo;t get mangled. Some conversion programs assume that every paragraph in a list is a new item.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another advanced technique, which is pretty ubiquitous in PowerPoint presentations, is the nested list.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Inside a numbered list item:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A nested numbered list can break down a step from the main list into its constituent parts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A nested bulleted list can highlight certain issues to keep in mind when executing the step.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inside a bulleted list item:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A numbered list points out issues related to the list item where the order matters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A bulleted list breaks out the point into subissues.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(I warned you that many of these examples would be contrived.) I didn&amp;rsquo;t pick the separate bullet characters for the list and sublists above. Because I put a &lt;code&gt;ul&lt;/code&gt; element inside of another &lt;code&gt;ul&lt;/code&gt; element&amp;rsquo;s &lt;code&gt;li&lt;/code&gt; list item element when creating this HTML, browsers change the bullet automatically to show you the different nesting levels. To change the nesting level of list items in a word processor, select them and click the indent or outdent button on the toolbar.&lt;/p&gt;
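&lt;p&gt;Stripped down to its markup, the nesting above looks something like this (a shortened sketch, not the exact source of this page):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;Inside a numbered list item:
    &amp;lt;ul&amp;gt;
      &amp;lt;li&amp;gt;A nested numbered list can break down a step.&amp;lt;/li&amp;gt;
      &amp;lt;li&amp;gt;A nested bulleted list can highlight issues.&amp;lt;/li&amp;gt;
    &amp;lt;/ul&amp;gt;
  &amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
&lt;/code&gt;&lt;/pre&gt;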
&lt;p&gt;HTML offers a third type of list known as a definition list. Instead of demonstrating it, I&amp;rsquo;ll point you to the &lt;a href=&#34;http://www.docbook.org/tdg5/en/html/ch02.html#d0e2426&#34;&gt;list of eight list types offered by DocBook&lt;/a&gt;, which itself is a definition list. The DocBook DTD includes a wider choice of list types because it offers the ability to address technical documentation issues at a much finer-grained level of detail. It&amp;rsquo;s worth reading through the list describing the eight types.&lt;/p&gt;
&lt;p&gt;If you think that lists are not something to get passionate about, you&amp;rsquo;re absolutely right. They&amp;rsquo;re a tool for technical communication. I can&amp;rsquo;t finish up, though, without mentioning Information Mapping®, which has had an intermittent influence on tech writing over the years. Robert Horn, a psychologist at Columbia University, developed some theories, a methodology, and &lt;a href=&#34;http://www.infomap.com&#34;&gt;a company&lt;/a&gt; dedicated to more efficient organization and communication of information. If I can oversimplify their approach, I would say that whenever you can break something down into a list, you should. They call the process of breaking down information &amp;ldquo;chunking&amp;rdquo;, and I must say I do like the use of &amp;ldquo;chunk&amp;rdquo; as a verb. There&amp;rsquo;s a wide-eyed, evangelical zeal to the serious devotees&amp;rsquo; belief that &lt;em&gt;everything&lt;/em&gt; can be information mapped—when a former co-worker took a course in Information Mapping, the certificate he received for doing so was formatted as an Information Map. Of the only five questions on the &lt;a href=&#34;http://www.infomap.com/index.cfm/TheMethod/Mapping_FAQ&#39;s&#34;&gt;company&amp;rsquo;s FAQ&lt;/a&gt;, one is titled &amp;ldquo;Handling resistance to mapping.&amp;rdquo; All your information &lt;a href=&#34;http://en.wikipedia.org/wiki/All_your_base_are_belong_to_us&#34;&gt;are belong to us&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Information Mapping is grounded in some serious quantitative research, and there is certainly some value to it; a tech writer can learn a lot from it without drinking the Kool Aid. If their approach to modular content architecture didn&amp;rsquo;t have a direct influence on DITA, it certainly had a strong indirect influence, because &lt;a href=&#34;http://www.stc.org/&#34;&gt;serious tech writers&lt;/a&gt; had been aware of it for years.&lt;/p&gt;
&lt;p&gt;I can&amp;rsquo;t resist closing with a bulleted list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Take a closer look at how lists are used in the documentation you see and what choices their authors made in presenting them.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you explore the more advanced possibilities such as multi-paragraph list items or nested lists, check that conversion routines converted them properly. People who write these routines may not have considered the less simplistic possibilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you&amp;rsquo;re doing more complex technical writing using the DocBook DTD, get to know its other list types and make sure that your production routines can handle any that you use. Lots of the DocBook DTD is optional, and conversion routines often don&amp;rsquo;t address everything that may come up.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://infinitesque.net/&#34; title=&#34;http://infinitesque.net/&#34;&gt;John L. Clark&lt;/a&gt; on &lt;a href=&#34;#comment-431&#34;&gt;June 1, 2006 1:55 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;What about using numbered lists in order to give each item in an otherwise unsequential list an identity (even if only for the visual presentation of the current version of the list)? &amp;ldquo;I want to make statement `foo&amp;rsquo; about item 7 in that list.&amp;rdquo;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Should we use lists like this?&lt;/li&gt;
&lt;li&gt;How does this approach integrate with your suggestions above?&lt;/li&gt;
&lt;li&gt;How can we differentiate between numbering lists for reference and numbering lists for sequence, first to our readers and second in our markup?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-432&#34;&gt;June 1, 2006 9:14 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Good point: the use of numbers to provide an addressing system. I&amp;rsquo;d be more inclined to do what contracts and laws do (although they also use regular Arabic numerals): use letters or upper- or lower-case roman numerals to make it easier to reference specific paragraphs. Of course, sometimes the order does matter in those and sometimes it doesn&amp;rsquo;t, and they often use these alternate ways to identify nesting levels, just like the different bullets shown in my nested bulleted list.&lt;/p&gt;
&lt;p&gt;I just learned about the &lt;a href=&#34;http://www.w3.org/TR/CSS21/generate.html#lists&#34;&gt;list-style-type&lt;/a&gt; CSS property for the ol element, which makes this possible. As to differentiating between the two types you describe in the markup, the use of a class attribute to trigger the use of the list-style-type would do that&amp;ndash;just pick a good name for the attribute value!&lt;/p&gt;
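&lt;p&gt;A sketch of what I mean (the class name &amp;ldquo;reference&amp;rdquo; and the CSS rule are just an illustration of the idea, not markup from this page):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ol.reference { list-style-type: lower-roman; }

&amp;lt;ol class="reference"&amp;gt;
  &amp;lt;li&amp;gt;first addressable item&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;second addressable item&amp;lt;/li&amp;gt;
&amp;lt;/ol&amp;gt;
&lt;/code&gt;&lt;/pre&gt;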
&lt;p&gt;As I explained at &lt;a href=&#34;http://www.oreillynet.com/pub/wlg/4772&#34;&gt;http://www.oreillynet.com/pub/wlg/4772&lt;/a&gt;, I put IDs on all my block level elements so that I can link to them. For example, &lt;a href=&#34;https://www.bobdc.com/blog/writing-about-software-lists#i110&#34;&gt;this&lt;/a&gt; links to the third item in the first bulleted list above.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.la-grange.net/&#34; title=&#34;http://www.la-grange.net/&#34;&gt;karl&lt;/a&gt; on &lt;a href=&#34;#comment-433&#34;&gt;June 4, 2006 8:28 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s an interesting thing about the nature of XML and its infoset: it is an &lt;strong&gt;ordered&lt;/strong&gt; tree. What does that mean? It means that an &amp;ldquo;unordered list&amp;rdquo; is really a semantic definition and not the nature of XML itself; at the markup level, there is no difference between ul and ol. They are exactly the same by nature.&lt;/p&gt;
&lt;p&gt;The real difference is made when a user agent (browser, authoring tool, bot, indexing engine, semantics engine) has been specifically programmed to make sense of it. This is one of the big troubles of the Web right now. Most browsers are not semantic tools but (CSS) renderers with very little semantic capacity.&lt;/p&gt;
&lt;p&gt;A graph is often not ordered at all, so lists in RDF have no sequential reading unless you create a mechanism which says they have to be read in a certain order.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>XML: too flexible?</title>
      <link>https://www.bobdc.com/blog/xml-too-flexible/</link>
      <pubDate>Fri, 26 May 2006 09:01:45 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xml-too-flexible/</guid>
      
      
      <description><div>Some biologists really like relational databases.</div><div>&lt;p&gt;This week I gave a talk to some biology researchers at the University of Virginia. The basic thesis was that large databases typically need to fit into the neat rows and columns of relational tables, but that new XQuery/XML databases let you store and retrieve huge amounts of data with potentially much more complex structure, and that while this has obvious applications in the publishing world—the world that begat XML—it could have useful applications in other domains as well. I never got past ninth grade biology, but I&amp;rsquo;d read a little on bioinformatics recently, and these people are accumulating and combing through a lot of complex data.&lt;/p&gt;
&lt;p&gt;Going into the talk, I assumed that the listeners either were or weren&amp;rsquo;t familiar with XML, and for those who weren&amp;rsquo;t I&amp;rsquo;d explain the basics and we&amp;rsquo;d go from there. I didn&amp;rsquo;t count on people who considered themselves familiar with it but had some misconceptions based on their own use of it. XML is popular in the sciences as an interchange format, and one professor in particular had a difficult time believing that applications could be built with rigorously structured XML.&lt;/p&gt;
&lt;p&gt;He said several times (based on one of my slides showing sample XML) that if someone can put a &lt;code&gt;title&lt;/code&gt; element anywhere they want, then his application won&amp;rsquo;t know where to look for it, and I kept going back to my slide showing the declaration for the &lt;code&gt;chapter&lt;/code&gt; element that said that one and only one &lt;code&gt;title&lt;/code&gt; element had to go at the very beginning of each chapter. I&amp;rsquo;m probably misrepresenting some of what he said, because we went around in circles a few times without completely understanding each other, but one of his key points was that he likes how the normalization process forces someone (in his case, his graduate students) to really think through the relationships between the pieces of information they&amp;rsquo;re storing. This sounded to me like claiming that a well-designed relational database was better than a badly-designed XML database.&lt;/p&gt;
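&lt;p&gt;In DTD terms, the kind of declaration I kept pointing back to looks something like this (a simplified sketch, not the exact content model from my slide):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!ELEMENT chapter (title, para+)&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A validating parser rejects any chapter that doesn&amp;rsquo;t begin with exactly one title, so an application reading valid documents always knows where to look for it.&lt;/p&gt;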
&lt;p&gt;Thinking back on it, I realize that while XML is common outside of the publishing world, highly structured XML is not as common, except in cases where it&amp;rsquo;s an interchange format that maps directly to some pre-existing relational tables or Java classes, as is often the case with more transactional uses of XML. The biologists had heard of DTDs and schemas, but hadn&amp;rsquo;t bothered with them much, because looking at a handful of XML for a given data class showed them the structure they needed to know. Validation technologies such as Schematron and RELAX NG were understandably way off their radar.&lt;/p&gt;
&lt;p&gt;I did have a slide saying that an advantage of XML databases over object-oriented databases (the other technology that tried to take large databases beyond rows and columns) was that prototyping was a lot easier: you can just throw together some XML and start querying it, while the analysis and design of object-oriented systems can mean a lot of up-front work before you can actually do anything with your data—note how many big fat books there are just on OO analysis and just on OO design. While discussing this slide, I mentioned that for a serious production XML application you should create checkpoints for design review and analysis and so forth before you build too many application dependencies on your thrown-together data, but it looks like I need more slides to make it clearer that while XML can be as flexible as you want, the developer can have a lot of control over the degree of that flexibility, and that large, carefully controlled systems have been built that never would have worked with relational databases—in the print and online publishing worlds, at least.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Me as 80s New York lead guitarist</title>
      <link>https://www.bobdc.com/blog/me-as-80s-new-york-lead-guitar/</link>
      <pubDate>Wed, 24 May 2006 07:48:57 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/me-as-80s-new-york-lead-guitar/</guid>
      
      
      <description><div>A Christmas present from my brother: a CD of studio visits by a band I was in.</div><div>&lt;p&gt;In the first half of my twenty-five years of living in New York City, I played lead guitar in two serious bands and bass and miscellaneous in several fooling-around-with-friends bands. By &amp;ldquo;serious,&amp;rdquo; I mean gigs consisting of one or two forty-minute bam-bam-bam sets with over ninety per cent originals, and any covers better be obscure and cool enough that everyone who recognizes them says &amp;ldquo;ooh, nice&amp;rdquo; (e.g. &amp;ldquo;Glory&amp;rdquo; from &lt;a href=&#34;https://en.wikipedia.org/wiki/Adventure_(Television_album)&#34;&gt;Television&amp;rsquo;s second album&lt;/a&gt;. You play for minimal money in clubs with good sound and lights with the prime goal of showing that you sound and look professional and have the right material to be signed to a record label, and you can usually assume that there are label people around.&lt;/p&gt;
&lt;p&gt;The first band, the ExHusbands, began with the goal of being the most successful band at Columbia University and occasionally playing downtown clubs, and that was not too difficult. Becoming a big-time downtown band was much more difficult, although it took a big jump when we were discovered and managed by Danny Fields, who had worked with our idols (the Velvet Underground, Iggy and the Stooges, the MC5, the Doors, the Ramones) and whose roots in cool New York scenes went back to Andy Warhol&amp;rsquo;s Factory. We quit school and worked hard at it, and played Max&amp;rsquo;s Kansas City (the first time, shortly before my twentieth birthday, so I was happy to say I did it as a teenager), Hurrah&amp;rsquo;s, the Ritz, Danceteria, many now-defunct clubs, and mostly, CBGBs. We were considered a CB&amp;rsquo;s band. Although we never got a label deal, we went through a period of being treated like the Next Big Thing by people who seemed to know, and that was thrilling at that age. This was all around 1979 - 82, so you could call what we were doing New Wave, although outside of the Velvet Underground and Television (and especially for our singer, Iggy), we were mostly into sixties British Invasion bands. The first time I saw another CB&amp;rsquo;s band that has hit the big time since then, the Strokes, their poppy Velvet Underground approach and their choice of shirts reminded me of the ExHusbands.&lt;/p&gt;
&lt;p&gt;Four or so years later my second cousin Kris Woolsey started The Hunting Accident. We don&amp;rsquo;t remember the details of how we came up with the name, but a bottle of Jägermeister was involved. Today, I&amp;rsquo;d say we were shooting for post-&lt;a href=&#34;https://en.wikipedia.org/wiki/Big_Star&#34;&gt;Big Star&lt;/a&gt; rockin&amp;rsquo; pop, as were the dBs, Matthew Sweet, and the Replacements, who in particular seemed to define a time and place and sound and attitude for me and a lot of my friends. (I can complain about Wikipedia with the best of them, but I just looked at their entry for &lt;a href=&#34;http://en.wikipedia.org/wiki/Power_pop&#34;&gt;power pop&lt;/a&gt;, and it&amp;rsquo;s remarkably well done.) Kris has since gone on to do some work with Fountains of Wayne, which fits in well with my attempt to summarize the band&amp;rsquo;s sound.&lt;/p&gt;
&lt;p&gt;Why I&amp;rsquo;m writing all this: last Christmas, my brother made me a CD of the Hunting Accident&amp;rsquo;s two studio sessions, which makes up about an album&amp;rsquo;s worth of material. I made a &lt;a href=&#34;http://www.snee.com/music/ha&#34;&gt;web page of MP3s&lt;/a&gt; of it for the other guys in the band. As it says, I only wrote one of the songs, but it was one of the band&amp;rsquo;s more popular ones, and friends who were more serious about songwriting than I was liked it a lot, so I was pretty proud of it. We played the Bottom Line, Maxwell&amp;rsquo;s, Tramps, more now-defunct clubs, and CBGB a lot. One memory is standing on the stage at CB&amp;rsquo;s with the &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Del-Lords&#34;&gt;Del Lords&lt;/a&gt; at midnight counting down to the new year of 1988, followed by an only somewhat rehearsed version of Auld Lang Syne. (I also remember how my future wife, who I had met about two months earlier, had folded over the &amp;ldquo;PPY&amp;rdquo; on a &amp;ldquo;HAPPY NEW YEAR&amp;rdquo; paper tiara so that it said &amp;ldquo;HA NEW YEAR.&amp;rdquo;) The Hunting Accident also fell apart before getting a deal.&lt;/p&gt;
&lt;p&gt;Wandering around downtown New York the summer before last, I stopped into CB&amp;rsquo;s, and BG Berlin was still working the door. He looked up and said &amp;ldquo;Bob DuCharme!&amp;rdquo; (He was always very good with names, and I had hung out there a lot when I first moved downtown from the Columbia area.) I told him that my parents still tell the story of waiting through &lt;a href=&#34;https://en.wikipedia.org/wiki/DNA_(American_band)&#34;&gt;DNA&lt;/a&gt;&amp;rsquo;s set to see the ExHusbands play there one Saturday night; DNA was probably Arto Lindsay&amp;rsquo;s most dissonant effort, as a leading light of the &amp;ldquo;No Wave&amp;rdquo; scene, and with a post-Pere Ubu Tim Wright on bass and Ikue Mori on drums, it was really great, but a bit much for my parents. In those days, the four sets of the night were at midnight, 1, 2, and 3, with the headliners at 2, and we were playing at 3. (Each time my parents tell the story, our set gets later.) As I told BG this story in the summer of 2004, I realized that I was the same age that my mother had been when my parents came to see us that night twenty-six years ago. My parents were remarkably supportive of the overall venture, though, considering that dropping out of an Ivy League school to be in a full-time rock and roll band is traditionally unpopular with parents. I&amp;rsquo;m sure that my father&amp;rsquo;s New York acting career, which was cut short by the draft, played a role in their understanding.&lt;/p&gt;
&lt;p&gt;Now that I&amp;rsquo;m too old for that sort of thing, I&amp;rsquo;ve been working on playing jazz on an upright bass for about two and a half years. I&amp;rsquo;ve got all kinds of ideas for making music on computers, but I spend too much time on computers, so struggling with a big tactile vibrating box to play music that is gradually revealing itself to me better and will never go out of style seems like a healthier way to spend unpaid time than more typing and staring at glowing screens.&lt;/p&gt;
&lt;p&gt;Of course, if the right equipment, bass player and drummer were all at an XML-related event in the future, Eve Maler and I would be happy to front a Zeppelin cover band. We&amp;rsquo;ve discussed it, and did finish a geeky rock and roll free-for-all at the 2004 Oxford XML Summer School with &amp;ldquo;Rock and Roll&amp;rdquo; from Zep 4 once with Kal Ahmed on bass. If Tim Bray has a cello and pickup handy, perhaps we could take on &amp;ldquo;Kashmir.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By deltabob on &lt;a href=&#34;#comment-421&#34;&gt;May 24, 2006 11:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I love Matthew Sweet and the Replacements. I graduated high school in 1992 - in Appalachia - so those weren&amp;rsquo;t exactly popular sounds at that time and place.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s great to hear the names again though - I&amp;rsquo;ll have to go digging through my old cassette tape collection (provided my attic hasn&amp;rsquo;t completely ruined them).&lt;/p&gt;
&lt;p&gt;By Rick Jelliffe on &lt;a href=&#34;#comment-422&#34;&gt;May 24, 2006 11:55 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I wonder if we should make a podcast of music by XML Geeks? Len Bullard has sent me a CD he made, and of course Charles Goldfarb used to play piano professionally (in army bands, or for army audiences, IIRC). Dale Waldt plays guitar and I think Elliot Kimber used to organize some kind of musical events. A few other XML people have sent me comments on my Synthesizer articles at oreilly digital media.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-423&#34;&gt;May 24, 2006 12:24 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Rick,&lt;/p&gt;
&lt;p&gt;Tony Coates has had some ideas about some distributed cooperative music creation among XML geeks, which would work better for you synth guys than us string pluckers. (Of course, Tony qualifies as both, if the strings are a bit short.)&lt;/p&gt;
&lt;p&gt;Wouldn&amp;rsquo;t it be funny, though, if we spent more time discussing whether to allow CDATA sections in the podcast feed than in actually making music?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-424&#34;&gt;May 24, 2006 12:55 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Don&amp;rsquo;t forget about the brass guys! I&amp;rsquo;m a F Horn player, though not professionally&amp;hellip;&lt;/p&gt;
&lt;p&gt;&amp;ndash;Scotty&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-425&#34;&gt;May 24, 2006 4:15 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Marvellous history Bob.&lt;/p&gt;
&lt;p&gt;After a *long* break the wife and I just recently flipped the other way, from synths (initially analog) to guitar and bass - tactile is good. I&amp;rsquo;m now very much a tender-fingered newbie, but if you have any suggestions for not-over-tricky Zep songz to cover (ideally in Turtle notation), I&amp;rsquo;d be most grateful.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.tbray.org/ongoing/&#34; title=&#34;http://www.tbray.org/ongoing/&#34;&gt;Tim Bray&lt;/a&gt; on &lt;a href=&#34;#comment-428&#34;&gt;May 25, 2006 11:23 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Kashmir? Hell, I wanna play Whole Lotta Love.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-429&#34;&gt;May 26, 2006 7:30 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And I believe Page did break out the bow for that one.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Download as spreadsheet</title>
      <link>https://www.bobdc.com/blog/download-as-spreadsheet/</link>
      <pubDate>Tue, 16 May 2006 19:56:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/download-as-spreadsheet/</guid>
      
      
      <description><div>A one-line text file tells your web server to send a directory&#39;s HTML files as &#34;spreadsheets&#34;.</div><div>&lt;p&gt;I used to think that a website&amp;rsquo;s &amp;ldquo;download as spreadsheet&amp;rdquo; button triggered some back end process that created a binary Excel spreadsheet on the server and sent that to your browser, much like many &amp;ldquo;download as PDF&amp;rdquo; links do. It turns out that it&amp;rsquo;s much, much simpler than that.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/xml/wks/wks1.html&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/xls.jpg&#34; alt=&#34;[Excel icon logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The key is that the server doesn&amp;rsquo;t really send the file as a spreadsheet. It sends it as HTML with a MIME &lt;a href=&#34;http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7&#34;&gt;media type&lt;/a&gt; of &amp;ldquo;application/vnd.ms-excel&amp;rdquo;. In other words, the server adds an HTTP header field that tells your browser &amp;ldquo;here comes an Excel spreadsheet, so display this with whatever program you use to view those&amp;rdquo;, and Excel can open up HTML files. (Tell OpenOffice Calc to open an HTML file and it opens it up in OpenOffice Writer, the word processing program.) This way, users think that they&amp;rsquo;re downloading spreadsheets. Lots of HTML formatting is preserved and numbers get treated as numbers, so that if after downloading you add a function like &lt;code&gt;=sum()&lt;/code&gt; in a cell that references other cells, it does the math properly. You&amp;rsquo;ll want your HTML to have one or more tables in it—there&amp;rsquo;s not much point in sending a Shakespeare play to Excel.&lt;/p&gt;
&lt;p&gt;The first few times I did this, I wrote &lt;a href=&#34;http://www.oreilly.com/openbook/webclient/appa.html&#34;&gt;perl&lt;/a&gt; and &lt;a href=&#34;http://gnosis.cx/publish/programming/feature_5min_python.html&#34;&gt;python&lt;/a&gt; CGI scripts to send the HTTP header identifying the MIME type to the browser with the file, but if you&amp;rsquo;re using an Apache web server, an &lt;a href=&#34;http://httpd.apache.org/docs/1.3/howto/htaccess.html&#34;&gt;.htaccess&lt;/a&gt; file gives you an easier way to do it. Among other things, an .htaccess file lets you say &amp;ldquo;for files in this directory with extension foo, the web server should deliver them with MIME type bar.&amp;rdquo; I first learned about these files when someone told me that this weblog&amp;rsquo;s Atom feed wasn&amp;rsquo;t being delivered with the correct MIME type, and he recommended that I fix it with an .htaccess file. I created the following one-line file in my &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34;&gt;http://www.snee.com/bobdc.blog&lt;/a&gt; directory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;AddType application/atom+xml .atom
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To deliver HTML tables as spreadsheets with no CGI coding, I created a &lt;a href=&#34;http://www.snee.com/xml/wks&#34;&gt;http://www.snee.com/xml/wks&lt;/a&gt; directory and put the following .htaccess file in it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;AddType application/vnd.ms-excel .html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells the Apache server to deliver any files with an extension of html &lt;em&gt;that are in this directory&lt;/em&gt; (I certainly wouldn&amp;rsquo;t want this to apply to all HTML files!) with a MIME type of &amp;ldquo;application/vnd.ms-excel&amp;rdquo;. You can see a test result by looking at &lt;a href=&#34;http://www.snee.com/xml/wks/wks1.html&#34;&gt;http://www.snee.com/xml/wks/wks1.html&lt;/a&gt;. I threw in two copies of a table and some other HTML elements to try to confuse Excel, but it wasn&amp;rsquo;t confused. To see what the file really looks like, see &lt;a href=&#34;http://www.snee.com/xml/wks/wks1.xml&#34;&gt;http://www.snee.com/xml/wks/wks1.xml&lt;/a&gt;, an exact copy whose delivery is unaffected by the .htaccess file.&lt;/p&gt;
&lt;p&gt;So, if your website includes tables and you&amp;rsquo;d like to offer viewers the option to download them as spreadsheets, you can keep HTML copies in a directory with an .htaccess file like the one shown above and point your &amp;ldquo;download as spreadsheet&amp;rdquo; links there. Or, if you have a more complex system that generates pages on the fly, the content generation routines hopefully give you some way to set the Content-type in the HTTP header to &amp;ldquo;application/vnd.ms-excel&amp;rdquo; for selected HTML output.&lt;/p&gt;
&lt;p&gt;If I was writing this from home instead of a hotel room, I&amp;rsquo;d try it on our Mac and Linux machines to see if the file opens up in OpenOffice Calc. Of course, if I was at home, I&amp;rsquo;d have other things to do besides playing with MIME media type tricks.&lt;/p&gt;
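&lt;p&gt;For what it&amp;rsquo;s worth, the CGI version of this trick amounts to very little code. Here&amp;rsquo;s a minimal Python sketch of the idea (the function name and command-line handling are illustrative, not taken from my original scripts):&lt;/p&gt;

```python
#!/usr/bin/env python
# Minimal sketch of the CGI approach: emit a Content-type header naming the
# Excel media type, a blank line to end the headers, then the HTML file itself.
import sys

def send_as_spreadsheet(path, out=sys.stdout):
    # HTTP headers end with a blank line; header lines use CRLF endings.
    out.write("Content-type: application/vnd.ms-excel\r\n\r\n")
    with open(path) as html_file:
        out.write(html_file.read())

if __name__ == "__main__" and len(sys.argv) > 1:
    send_as_spreadsheet(sys.argv[1])
```

&lt;p&gt;The browser never sees a real .xls file either way; it just believes the header, exactly as with the .htaccess approach.&lt;/p&gt;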
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-411&#34;&gt;May 16, 2006 8:27 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Firefox 1.5.0.3 (on Windows XP) says &lt;a href=&#34;http://www.snee.com/xml/wks/wks1.html&#34;&gt;http://www.snee.com/xml/wks/wks1.html&lt;/a&gt; &amp;ldquo;is an HTML document&amp;rdquo; and offers to open it with an application called, er, Firefox&amp;hellip; Maybe they&amp;rsquo;re content sniffing?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-412&#34;&gt;May 16, 2006 9:07 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hmmm, I&amp;rsquo;m using Firefox 1.5.0.3 on XP too&amp;ndash;XP professional? It also works from IE, which interestingly enough tells me &amp;ldquo;You are downloading the file: wks1.xls from &lt;a href=&#34;https://www.snee.com&#34;&gt;www.snee.com&lt;/a&gt;&amp;rdquo; even though the file is named wks1.html.&lt;/p&gt;
&lt;p&gt;I just made another duplicate of the file and called it wks1.xls in the same directory, and I added the following line to .htaccess for good measure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;AddType application/vnd.ms-excel .xls
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Firefox and IE both send &lt;a href=&#34;http://www.snee.com/xml/wks/wks1.xls&#34;&gt;http://www.snee.com/xml/wks/wks1.xls&lt;/a&gt; to Excel without a problem when I try them. Let me know if it&amp;rsquo;s any better with your copy of 1.5.0.3.&lt;/p&gt;
&lt;p&gt;Of course, the real moral of the story is that anyone who does this for a serious production app instead of just playing around like me has a lot of testing to do.&lt;/p&gt;
&lt;p&gt;By Ed Davies on &lt;a href=&#34;#comment-413&#34;&gt;May 17, 2006 4:38 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Firefox 1.5.03 on XP Pro: I saw the same as Dan B for the .html file. For the .xls file it opened fine in OOo 1.1.2. It seems to me to be a bit naughty of Firefox to be looking inside the URL which is supposed to be opaque. curl --head confirms to me that the headers returned for the two files have no significant differences.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-414&#34;&gt;May 17, 2006 11:04 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That explains why I didn&amp;rsquo;t have this problem when doing it the CGI way: because the URL of the &amp;ldquo;resource&amp;rdquo; ended with .pl or .py.&lt;/p&gt;
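&lt;p&gt;The extension-to-media-type mapping that the AddType line sets up can be sketched with Python&amp;rsquo;s mimetypes module. This is only an analogy for illustration (Python isn&amp;rsquo;t what Apache runs to serve these files), but it shows the same idea of mapping a filename extension to a declared MIME type:&lt;/p&gt;

```python
# Sketch of the extension-to-media-type mapping behind Apache's
# AddType directive, illustrated with Python's mimetypes module.
# (An analogy for illustration only, not what the web server runs.)
import mimetypes

# Same idea as the .htaccess line: AddType application/vnd.ms-excel .xls
mimetypes.add_type("application/vnd.ms-excel", ".xls")

for name in ("wks1.xls", "wks1.html"):
    media_type, _encoding = mimetypes.guess_type(name)
    print(name, media_type)
```

&lt;p&gt;As the comments above show, the server&amp;rsquo;s declared type is only part of the story: a browser may still sniff the content or the URL and second-guess it.&lt;/p&gt;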
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Minor new email</title>
      <link>https://www.bobdc.com/blog/minor-new-email/</link>
      <pubDate>Fri, 12 May 2006 07:54:30 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/minor-new-email/</guid>
      
      
<description><div>A mail productivity trick.</div><div>&lt;p&gt;I check my email several times a day, and typically find two or three new messages each time. A mailing list on local issues is the source of a few too many of these, but I worried that if I routed these to their own folder I&amp;rsquo;d forget to check it. (I can go days without checking my xml-dev folder.) Because the messages are not made publicly available, I had reservations about &lt;a href=&#34;http://www.xml.com/pub/a/2005/11/23/hacking-ebay-turning-email-alerts-into-atom.html&#34;&gt;converting them to an Atom feed&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had an idea that&amp;rsquo;s worked out well. I created a folder called &amp;ldquo;minor new email&amp;rdquo; and set up rules that forward mail from several sources there: the local mailing list, frequent flier statements, Ticketmaster announcements about upcoming concerts, and mail from dell.com and other companies who have my email address because I once bought something from them.&lt;/p&gt;
&lt;p&gt;I do remember to check this folder once a day, and new messages now show up in my main inbox less frequently and are more likely to be good reasons to interrupt my real work of the day. I hate to delete a Ticketmaster email until I&amp;rsquo;ve looked through it for potentially interesting concerts and forwarded it to my wife if so, but that&amp;rsquo;s not worth losing my train of thought on more serious work, and the cumulative effect of similar emails spread out across the day can take a chunk out of your schedule.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s another nice email productivity trick that I&amp;rsquo;ve done for a few years and that Sean McGrath &lt;a href=&#34;http://seanmcgrath.blogspot.com/2005_08_21_seanmcgrath_archive.html#112487156301619634&#34;&gt;discovered&lt;/a&gt; last summer: when reviewing your spam, sort it by sender name. This groups messages in foreign character sets together, along with repeat offenders, making it easier to skim through quickly without missing false positives that your spam checker flagged.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>My new job</title>
      <link>https://www.bobdc.com/blog/my-new-job/</link>
      <pubDate>Mon, 08 May 2006 12:36:19 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/my-new-job/</guid>
      
      
      <description><div>Joining Innodata Isogen.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.innodata-isogen.com/&#34;&gt;&lt;img src=&#34;http://www.innodata-isogen.com/img/innodata_isogen_logo.gif&#34; alt=&#34;[Innodata Isogen logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m happy to announce that I&amp;rsquo;ve accepted a full-time position as a senior consultant with the professional services group at &lt;a href=&#34;http://www.innodata-isogen.com&#34;&gt;Innodata Isogen&lt;/a&gt;. I have known Isogen employees and alumni since SGML days; one still-current employee, Eliot Kimber (a.k.a. &lt;a href=&#34;http://drmacros-xml-rants.blogspot.com/&#34;&gt;Dr. Macro&lt;/a&gt;), was on the W3C Working Group that invented XML. The Isogen people have remained on the cutting edge of XML and related work both inside and outside of the publishing world, and their acquisition by Innodata has broadened the scope of the kinds of work that both Innodata and Isogen can do. The Innodata Isogen consulting arm that grew out of the Isogen acquisition is based in Dallas, but I&amp;rsquo;ll be working from home. The company has other offices and production and development centers in Hackensack, Austin, India, Sri Lanka, and the Philippines.&lt;/p&gt;
&lt;p&gt;When looking for a technology-related job, much of the search is an investigation into what&amp;rsquo;s going on out there, and that&amp;rsquo;s been an interesting project. There&amp;rsquo;s definitely a lot of cool work going on. I look forward to taking part in this work with Innodata Isogen and with their development partners.&lt;/p&gt;
&lt;p&gt;(And of course, opinions expressed in this weblog do not represent those of Innodata Isogen.)&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://seanmcgrath.blogspot.com&#34; title=&#34;http://seanmcgrath.blogspot.com&#34;&gt;Sean McGrath&lt;/a&gt; on &lt;a href=&#34;#comment-408&#34;&gt;May 8, 2006 1:40 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Best of luck!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-409&#34;&gt;May 8, 2006 2:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Congratulations on your new position. I hope it&amp;rsquo;s a good fit for you. I wish you the best of luck!&lt;/p&gt;
&lt;p&gt;Best regards,&lt;/p&gt;
&lt;p&gt;&amp;ndash;Scott&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>The W3C&#39;s web-based interface to Saxon 8.5</title>
      <link>https://www.bobdc.com/blog/the-w3cs-webbased-interface-to/</link>
      <pubDate>Tue, 02 May 2006 10:17:23 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-w3cs-webbased-interface-to/</guid>
      
      
      <description><div>Running XSLT 2 stylesheets with a URL.</div><div>&lt;p&gt;The W3C has made a web-based version of James Clark&amp;rsquo;s XT XSLT processor available since June of 2000, and Dan Brickley recently announced to the &lt;a href=&#34;mailto:semantic-web@w3.org&#34;&gt;semantic-web@w3.org&lt;/a&gt; mailing list that the W3C replaced the processor behind this service with Michael Kay&amp;rsquo;s &lt;a href=&#34;http://saxon.sourceforge.net/&#34;&gt;Saxon&lt;/a&gt;. You can use it by filling out &lt;a href=&#34;http://www.w3.org/2005/08/online_xslt/&#34;&gt;this form&lt;/a&gt; with URLs for your source document and stylesheet. When you click &amp;ldquo;transform&amp;rdquo;, in addition to running your stylesheet on your source document, you&amp;rsquo;ll see the URL that you could have entered to run the stylesheet with a REST interface.&lt;/p&gt;
&lt;p&gt;Because it&amp;rsquo;s Saxon 8.5, you can run XSLT 2.0 stylesheets on it. (I found out exactly which processor using &lt;a href=&#34;http://www.w3.org/2005/08/online_xslt/xslt?xslfile=http%3A%2F%2Fwww.snee.com%2Fxml%2Fxslt%2Fprocessorversion.xsl&amp;amp;xmlfile=http%3A%2F%2Fwww.snee.com%2Fxml%2Fxslt%2Fgroups.xml&amp;amp;content-type=&amp;amp;submit=transform&#34;&gt;this&lt;/a&gt; URL, which passes it a stylesheet that I described &lt;a href=&#34;http://www.xml.com/pub/a/2004/08/04/tr-xml.html&#34;&gt;here&lt;/a&gt;.) To see a 2.0 feature in action, &lt;a href=&#34;http://www.w3.org/2005/08/online_xslt/xslt?xslfile=http%3A%2F%2Fwww.snee.com%2Fxml%2Fxslt%2Fgrouping.xsl&amp;amp;xmlfile=http%3A%2F%2Fwww.snee.com%2Fxml%2Fxslt%2Fgroups.xml&amp;amp;content-type=&amp;amp;submit=transform&#34;&gt;this URL&lt;/a&gt; runs the &lt;a href=&#34;http://www.snee.com/xml/xslt/grouping.xsl&#34;&gt;http://www.snee.com/xml/xslt/grouping.xsl&lt;/a&gt; stylesheet against the &lt;a href=&#34;http://www.snee.com/xml/xslt/groups.xml&#34;&gt;http://www.snee.com/xml/xslt/groups.xml&lt;/a&gt; input file. This stylesheet, further described &lt;a href=&#34;http://www.xml.com/pub/a/2003/11/05/tr.html&#34;&gt;here&lt;/a&gt;, demonstrates XSLT 2.0&amp;rsquo;s grouping capabilities. Its plain text output won&amp;rsquo;t look like much on a browser, so do a View Source to see the grouped data.&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t have to do this from a browser. If you pass it your URL using a utility like &lt;a href=&#34;http://www.gnu.org/software/wget/&#34;&gt;wget&lt;/a&gt;, &lt;a href=&#34;http://www.penguin-soft.com/penguin/man/1/dog.html&#34;&gt;dog&lt;/a&gt;, or &lt;a href=&#34;http://curl.haxx.se/&#34;&gt;cURL&lt;/a&gt;, you can use the W3C&amp;rsquo;s Saxon processor from a line in a shell script or batch file, making this a real boon to REST app development. The page tells us that the service is &amp;ldquo;not to be utilized as a regular service by sites other than w3.org. [The W3C] will consider blocking high volume usage or any usage that causes a strain on our Web servers&amp;rdquo;, so it&amp;rsquo;s really a demo platform—one that lets you show off some great capabilities.&lt;/p&gt;
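&lt;p&gt;For example, a script could assemble the request URL itself before handing it to wget or cURL. A minimal sketch in Python (the xslfile and xmlfile parameter names are taken from the form&amp;rsquo;s generated URLs shown above):&lt;/p&gt;

```python
# Build the same kind of URL that the W3C form generates, so a
# script can pass it to wget or cURL. The xslfile and xmlfile
# parameter names come from the form's generated URLs.
from urllib.parse import urlencode

service = "http://www.w3.org/2005/08/online_xslt/xslt"
params = {
    "xslfile": "http://www.snee.com/xml/xslt/grouping.xsl",
    "xmlfile": "http://www.snee.com/xml/xslt/groups.xml",
}
url = service + "?" + urlencode(params)
print(url)
```

&lt;p&gt;(Remember the terms quoted above: this is for occasional demos, not regular production use.)&lt;/p&gt;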
&lt;p&gt;As Kanzaki Masahide pointed out to the same mailing list, this will also be a great resource for people doing &lt;a href=&#34;http://www.w3.org/2004/01/rdxh/spec&#34;&gt;GRDDL&lt;/a&gt; work; if RDF metadata can be pulled from a web document by simply specifying the right URL, there will be more incentive to add that metadata to XHTML documents. The same applies to &lt;a href=&#34;http://www.w3.org/2001/sw/BestPractices/HTML/2006-04-24-rdfa-primer&#34;&gt;RDFa&lt;/a&gt; documents.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ldodds.com/blog&#34; title=&#34;http://www.ldodds.com/blog&#34;&gt;Leigh Dodds&lt;/a&gt; on &lt;a href=&#34;#comment-406&#34;&gt;May 2, 2006 4:10 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s certainly great to see the W3C update this service, I&amp;rsquo;ve been a long-standing user of their original to do some quick online hacks and conversions.&lt;/p&gt;
&lt;p&gt;I released a similar service last week, also based on Saxon. Docs here:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://xmlarmyknife.org/docs/xslt/transform/&#34;&gt;http://xmlarmyknife.org/docs/xslt/transform/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m intending to extend this to support proper HTTP caching as well as having local copies of &amp;ldquo;popular&amp;rdquo; stylesheets, e.g. the EXSLT work. Suggestions for additional features gratefully appreciated!&lt;/p&gt;
&lt;p&gt;By Stephen De Gabrielle on &lt;a href=&#34;#comment-407&#34;&gt;May 3, 2006 10:36 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is obvious but still fun.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://validator.w3.org/check?uri=http%3A%2F%2Fwww.w3.org%2F2005%2F08%2Fonline_xslt%2Fxslt%3Fxslfile%3Dhttp%253A%252F%252Fwww.kanzaki.com%252Fparts%252Fxsltdoc.xsl%26xmlfile%3Dhttp%253A%252F%252Fwww.kanzaki.com%252Fparts%252Fxsltdoc.xsl%26content-type%3D%26submit%3Dtransform&amp;amp;charset=%28detect+automatically%29&amp;amp;doctype=Inline&amp;amp;ss=1&#34;&gt;Validation of the w3c saxon xslt processor&lt;/a&gt; using &lt;a href=&#34;http://www.kanzaki.com/parts/xsltdoc.xsl&#34;&gt;http://www.kanzaki.com/parts/xsltdoc.xsl&lt;/a&gt; to document itself.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>A nice XSLT documentation tool</title>
      <link>https://www.bobdc.com/blog/a-nice-xslt-documentation-tool/</link>
      <pubDate>Thu, 27 Apr 2006 15:13:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-nice-xslt-documentation-tool/</guid>
      
      
      <description><div>Taking &#34;self-documentation&#34; to a new level.</div><div>&lt;p&gt;I just learned from Uche and Chimezie&amp;rsquo;s &lt;a href=&#34;http://copia.ogbuji.net/blog/2006-04-26/del.icio.us.links&#34;&gt;del.icio.us bookmarks page&lt;/a&gt; about a very nice tool on &lt;a href=&#34;http://www.kanzaki.com/&#34;&gt;the Web Kanzaki&lt;/a&gt; for generating XSLT stylesheet documentation. When you go to the &lt;a href=&#34;http://www.kanzaki.com/parts/xsltdoc.xsl&#34;&gt;tool&amp;rsquo;s web page&lt;/a&gt;, you&amp;rsquo;re looking at the output of the tool run on itself. It looks great, due to two simple tricks, the first being quite clever:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.kanzaki.com/&#34;&gt;&lt;img src=&#34;http://www.kanzaki.com/parts/me.gif&#34; alt=&#34;[web kanzaki]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s &lt;a href=&#34;http://www.xml.com/pub/a/2003/02/05/tr.html&#34;&gt;easy enough&lt;/a&gt; to add a processing instruction to an XML file to tell a browser that, upon retrieving that file, it should run a particular XSLT stylesheet on it. That stylesheet typically creates HTML output, and the browser displays that HTML, while still showing the original XML when you do a &amp;ldquo;View Source&amp;rdquo;. (I&amp;rsquo;ve enjoyed doing this to &lt;a href=&#34;http://www.xml.com/pub/a/2003/03/05/tr.html&#34;&gt;prototype new link architectures&lt;/a&gt;.) Kanzaki Masahide has done this recursively, pointing the XSLT documentation stylesheet at itself, so that when you go to the URL &lt;a href=&#34;http://www.kanzaki.com/parts/xsltdoc.xsl&#34;&gt;http://www.kanzaki.com/parts/xsltdoc.xsl&lt;/a&gt; you see an xsltdoc.xsl report about the xsltdoc.xsl stylesheet.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To make it look even nicer, the result also points at a CSS stylesheet.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
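&lt;p&gt;The processing instruction that makes the first trick work is a single line at the top of the XML file, before the root element. Sketched here with the standard xml-stylesheet form (for the recursive trick, the href simply names the stylesheet&amp;rsquo;s own file):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;?xml-stylesheet type=&#34;text/xsl&#34; href=&#34;xsltdoc.xsl&#34;?&amp;gt;
&lt;/code&gt;&lt;/pre&gt;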
&lt;p&gt;If you have a lot of XSLT that isn&amp;rsquo;t well-documented, running it all through xsltdoc.xsl is an easy first step.&lt;/p&gt;
&lt;p&gt;While I&amp;rsquo;m getting to appreciate XQuery more and more lately, this is the kind of area where XSLT really outshines XQuery—the idea of using XQuery scripts to read or write other XQuery scripts is pretty far-fetched, while using XSLT stylesheets to read or write other XSLT stylesheets is simple and productive. (Well, not too simple if you don&amp;rsquo;t have a good handle on how to manipulate namespaces in XSLT, but it&amp;rsquo;s &lt;a href=&#34;http://www.xml.com/pub/a/2001/04/04/trxml/&#34;&gt;not too difficult&lt;/a&gt;.) This kind of automation isn&amp;rsquo;t just an XML geek party trick, but something that becomes increasingly useful as processing-intensive XML manipulation scales up. It&amp;rsquo;s the manipulation itself that I&amp;rsquo;m talking about scaling up; when you scale up the size of the XML content, XQuery—or, more specifically, XQuery engines—start to demonstrate their advantage over the XSLT processors that need all of their input to be in memory at once.&lt;/p&gt;
&lt;p&gt;Mr. Kanzaki&amp;rsquo;s webpage shows that in addition to interesting XSLT work, he&amp;rsquo;s done some cool RDF projects and also plays the upright bass. I&amp;rsquo;d love to meet him someday; we&amp;rsquo;d have a lot to talk about.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By Bruce on &lt;a href=&#34;#comment-402&#34;&gt;April 28, 2006 8:19 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.pnp-software.com/XSLTdoc/&#34;&gt;http://www.pnp-software.com/XSLTdoc/&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>Writing about software: bad words</title>
      <link>https://www.bobdc.com/blog/writing-about-software-bad-wor/</link>
      <pubDate>Mon, 24 Apr 2006 09:26:25 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/writing-about-software-bad-wor/</guid>
      
      
<description><div>Featuring the &#34;f&#34; word.</div><div>&lt;p&gt;As the third part of my series on &lt;a href=&#34;http://www.snee.com/bobdc.blog/publishing/documenting_software/&#34;&gt;writing about software&lt;/a&gt;, I&amp;rsquo;m writing about overused words to avoid because they&amp;rsquo;re nearly meaningless. For a start, read George Orwell&amp;rsquo;s essay &lt;a href=&#34;http://www.k-1.com/Orwell/index.cgi/work/essays/language.html&#34;&gt;Politics and the English Language&lt;/a&gt;. The problem he speaks of, in which people use bigger words than they need to because they think that it makes them sound well-informed and important, is particularly endemic to the tech world, and especially acute among the marketing and sales people at all but its highest levels. To quote the essay,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As soon as certain topics are raised, the concrete melts into the abstract and no one seems able to think of turns of speech that are not hackneyed: prose consists less and less of words chosen for the sake of their meaning, and more and more of phrases tacked together like the sections of a prefabricated henhouse.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anyone up for some enterprise-wide world-class functionality? &amp;ldquo;Functionality&amp;rdquo;, the &amp;ldquo;f&amp;rdquo; word, is my least favorite of all popular words in the tech world; I have never used it, except as a joke, since I began documenting software in 1989. (By &amp;ldquo;f word&amp;rdquo;, I didn&amp;rsquo;t mean &amp;ldquo;fuck&amp;rdquo;, a concise, punchy old Anglo Saxon word with many layers of meaning that nevertheless is only &lt;a href=&#34;http://norman.walsh.name/2006/02/23/whitespace&#34;&gt;rarely&lt;/a&gt; appropriate when discussing tech issues.) Take a noun, add a suffix to make it an adjective, then add another suffix to make it a bigger noun, and it sounds more important, right? Wrong. Saying that release 3.0 is better than release 2.2 because it has &amp;ldquo;more functionality&amp;rdquo; is worthless; saying that release 3.0 has more features than release 2.2 is less ridiculous, but still says practically nothing—an upgrade has more features by definition. Saying &amp;ldquo;the upgrade has more features&amp;rdquo; just to lengthen the bulleted list of improvements in 3.0 is meaningless padding.&lt;/p&gt;
&lt;p&gt;The many ways of claiming that a product is good tell the reader just as much as telling them that an upgrade does more than the previous release: it tells them nothing. A former co-worker at Moody&amp;rsquo;s Investors Service who had once done software marketing told me that at the time he had a sign over his desk saying &amp;ldquo;Don&amp;rsquo;t Say &amp;lsquo;powerful&amp;rsquo;&amp;rdquo;. I have been guilty of this myself, but it&amp;rsquo;s great advice. Saying that a program is good carries no information unless you qualify as a disinterested observer. If someone from Apple says that the latest iPod variation is good, that means nothing. If Walt Mossberg says that it&amp;rsquo;s good, that means something. Of course, people selling software rarely use the word &amp;ldquo;good&amp;rdquo;; that would be too simple. They&amp;rsquo;ll say that it&amp;rsquo;s world-class, best-of-breed, next-generation, enhanced, value-added, game-changing, and especially, powerful, but they&amp;rsquo;re all the same thing: they say that something is good, and as someone with a vested interest in the software&amp;rsquo;s success, you&amp;rsquo;re obligated to &lt;em&gt;show&lt;/em&gt; that it&amp;rsquo;s good, to prove &lt;em&gt;why&lt;/em&gt; it&amp;rsquo;s good, or you&amp;rsquo;re wasting the reader&amp;rsquo;s time.&lt;/p&gt;
&lt;p&gt;When tempted to use the &amp;ldquo;f&amp;rdquo; word, finding a superior alternative is not always as easy as just saying &amp;ldquo;more features&amp;rdquo;, but as Orwell writes, &amp;ldquo;By using stale metaphors, similes, and idioms, you save much mental effort, at the cost of leaving your meaning vague, not only for your reader but for yourself&amp;rdquo;. Some words that fail in their attempt to sound impressive are much easier to replace. Instead of &amp;ldquo;utilize&amp;rdquo;, say &amp;ldquo;use&amp;rdquo;. Instead of &amp;ldquo;necessitate&amp;rdquo;, say &amp;ldquo;require&amp;rdquo;. Instead of &amp;ldquo;facilitate&amp;rdquo;, say &amp;ldquo;ease&amp;rdquo; (or, if you&amp;rsquo;re talking about a meeting, say &amp;ldquo;run&amp;rdquo;). Instead of &amp;ldquo;heterogeneous&amp;rdquo;, say &amp;ldquo;varied&amp;rdquo;; instead of &amp;ldquo;orthogonal&amp;rdquo;, say &amp;ldquo;different&amp;rdquo;, unless you&amp;rsquo;re describing lines that form right angles or a specific kind of &lt;a href=&#34;http://en.wikipedia.org/wiki/Orthogonal_projection&#34;&gt;geometrical projection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;f&amp;rdquo; word also demonstrates a clue to watch for in avoiding bad tech words: too many suffixes. As I&amp;rsquo;ve quoted Orwell saying &lt;a href=&#34;https://www.bobdc.com/blog/measuring-information&#34;&gt;before&lt;/a&gt;, &amp;ldquo;Never use a long word where a short one will do&amp;rdquo;. When you see the words connectivity, utilization, orthogonality, or productionizing, you can&amp;rsquo;t just chop off the suffixes and move on in your writing, but the presence of all those suffixes should make you stop and think: is there a shorter word that can replace this entire word? It may take more than three or four seconds of thought to find that word, but if more than three or four people will be reading your sentence, you owe it to them.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0262521822/bobducharmeA/&#34;&gt;&lt;img src=&#34;http://images.amazon.com/images/P/0262521822.01._BO2,-64_AA240_SH20_SCLZZZZZZZ_.jpg&#34; alt=&#34;[technobabble cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As the title of Orwell&amp;rsquo;s essay tells us, he feels that political discussion is especially prone to the use of lazy, trite phrases. Technical discussion is even worse, because new terms are always cropping up and because of several issues common to marketing and sales people: they want to sound up-to-date, they don&amp;rsquo;t completely understand the technology they&amp;rsquo;re trying to sell, and they think that vagueness broadens the range of products and services that they might sell. In the early nineties I noticed that IBM had stopped describing their central hardware products as &amp;ldquo;computers&amp;rdquo; and instead started referring to them as &amp;ldquo;solutions&amp;rdquo;. For example, instead of the AS/400 being a &amp;ldquo;minicomputer&amp;rdquo;, it was a &amp;ldquo;midrange solution&amp;rdquo;. Selling a &amp;ldquo;solution&amp;rdquo; implies the selling of products plus the associated services (something IBM got good at long before electronic computers were invented), but it always amazes me how many tech vendors think that being vague will help them to sell more.&lt;/p&gt;
&lt;p&gt;Sometimes misuse of a technical term can drain it of its meaning. When a technology becomes hot, marketing and sales people misuse it to associate their products with it whether the products have earned this association or not, and words with specific meanings can lose those meanings. Fuzzy logic started off as a branch of set theory; now it means &amp;ldquo;vague reasoning&amp;rdquo;. &amp;ldquo;Artificial Intelligence&amp;rdquo; started off referring to a set of technologies that used specific tools to implement symbolic logic; the term &amp;ldquo;AI&amp;rdquo; got so overused that it became tainted, and semantic web advocates find themselves fighting off the associations with AI to avoid the taint. This loss of a technical term&amp;rsquo;s meaning can be temporary, which happened to &amp;ldquo;object-oriented&amp;rdquo; in the early nineties—I was told more than once that a system was object-oriented because an interface that consisted of clicking and dragging icons meant that I was &amp;ldquo;treating everything as an object&amp;rdquo;. Modern discussions of object-orientation seem back on track, probably because it&amp;rsquo;s not as hot as it used to be.&lt;/p&gt;
&lt;p&gt;An interesting 1991 book that attempted to address this vocabulary issue was &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0262521822/bobducharmeA/&#34;&gt;Technobabble&lt;/a&gt; by John A. Barry. It&amp;rsquo;s a bit dated now, but it provides good advice on avoiding bad technical jargon as well as other hints that can improve technical writing, such as how to use more of the active and less of the passive voice. The &lt;a href=&#34;http://www.buzzkiller.net/&#34;&gt;Buzzkiller&lt;/a&gt; weblog (&amp;ldquo;A few journalists trying to stop a few thousand buzzwords. Crusading writers from Forbes, Wired and elsewhere do their best to make the tech world safe for English majors&amp;rdquo;), while only rarely updated, also has good advice and is often hilarious.&lt;/p&gt;
&lt;p&gt;Earlier, I mentioned that this vocabulary problem is less of a problem at the highest levels of the tech world. I was thinking that companies with a bigger budget have slicker, more talented people doing the writing, but I just looked at &lt;a href=&#34;http://www.oracle.com&#34;&gt;oracle.com&lt;/a&gt; and the first sentence that I saw at the top was &amp;ldquo;Comprehensive customer relationship management solution brings enterprise-class functionality to customers of all sizes&amp;rdquo;. Oh well.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By Gunnar Grimnes on &lt;a href=&#34;#comment-401&#34;&gt;April 26, 2006 10:39 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Good post! Reading the Orwell essay I found his quote from Ecclesiastes much easier to read in the &amp;ldquo;translated&amp;rdquo; version - a sign that it&amp;rsquo;s too late for me, surely.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>The science of information</title>
      <link>https://www.bobdc.com/blog/the-science-of-information/</link>
      <pubDate>Wed, 19 Apr 2006 08:58:28 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/the-science-of-information/</guid>
      
      
      <description><div>&#34;Information: The New Language of Science&#34; by Hans Christian von Baeyer.</div><div>&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/measuring-information&#34;&gt;recently wrote about&lt;/a&gt; Claude Shannon and Warren Weaver&amp;rsquo;s book &amp;ldquo;The Mathematical Theory of Communication&amp;rdquo; and its insights into the idea of measuring information. I had planned to describe this book as an introduction to a review of a more recent, more easily digestible book, &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0674018575/bobducharmeA/&#34;&gt;Information: The New Language of Science&lt;/a&gt; by Hans Christian von Baeyer, but decided to write separate entries on the two books.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0674018575/bobducharmeA/&#34;&gt;&lt;img src=&#34;http://images.amazon.com/images/P/0674018575.01._SCLZZZZZZZ_.jpg&#34; alt=&#34;[&#39;Information&#39; cover]&#34; border=&#34;0&#34; align=&#34;right&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;200px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I enjoyed von Baeyer&amp;rsquo;s book a great deal, and recommend it to anyone interested in where the science of information has been (for example, Shannon and Weaver) and where it&amp;rsquo;s leading. The author, a physicist at Virginia&amp;rsquo;s College of William and Mary, has written &lt;a href=&#34;http://www.amazon.com/exec/obidos/search-handle-url/index=books&amp;amp;field-author-exact=Hans%20Christian%20von%20Baeyer&amp;amp;rank=-relevance%2C%2Bavailability%2C-daterank/103-0127035-3291047&#34;&gt;several&lt;/a&gt; popular science books on physics-related topics. This background leads him to draw an extended analogy throughout the book between the evolution of scientific approaches to dealing with information and historical approaches to dealing with energy:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The gradual crystallization of the concept of information during the last hundred years contrasts sharply with the birth of the equally abstract quantity called energy in the middle of the nineteenth century. Then, in the brief span of twenty years, energy was invented, defined, and established as a cornerstone, first of physics, then of all science. We don&amp;rsquo;t know what energy &lt;em&gt;is&lt;/em&gt;, any more than we know what information is, but as a now robust scientific concept we can describe it in precise mathematical terms, and as a commodity we can measure, market, regulate and tax it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He certainly knows the history of approaches to energy, and from Ludwig Boltzmann&amp;rsquo;s work to the present he describes many issues common to the quantification of both energy and information, such as entropy, randomness, probability, noise, and the relationship between logarithmic measurement and human perception. Toward the end of the book he shows how more recent subatomic issues in physics have more direct implications for information theory than the history of energy does, as he reviews qubits and Schrödinger&amp;rsquo;s ideas about how we can know what&amp;rsquo;s what inside of an atom.&lt;/p&gt;
&lt;p&gt;The concept of linking, which I think of as an expression of resource relationships, has always been &lt;a href=&#34;http://www.oreillynet.com/pub/au/1191&#34;&gt;close to my heart&lt;/a&gt;. After von Baeyer quotes Henri Poincaré saying &amp;ldquo;The aim of science is not things in themselves, as the dogmatists in their simplicity imagine, but the relations between things; outside those relations there is no reality knowable&amp;rdquo;, I found it particularly interesting when von Baeyer described information as &amp;ldquo;the communication of relationships&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Claude Shannon is one of the book&amp;rsquo;s heroes (von Baeyer writes that Shannon&amp;rsquo;s &amp;ldquo;A Mathematical Theory of Communication&amp;rdquo; has been &amp;ldquo;likened to the Magna Carta, Newton&amp;rsquo;s laws of motion, and the explosion of a bomb&amp;rdquo;) and von Baeyer makes an interesting prediction about Shannon and Weaver&amp;rsquo;s distinction between semantic information and information as a set of symbols whose successful transmission can be accurately measured:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;the word &amp;lsquo;information&amp;rsquo; has two different senses. The colloquial usage, as in &amp;lsquo;personal information&amp;rsquo; and &amp;lsquo;directory information&amp;rsquo;, refers to the meaning of a message of some sort. The technical sense, on the other hand, emphasizes the symbols used to transmit a message&amp;hellip; Eventually the two definitions of information should converge, but that hasn&amp;rsquo;t happened yet. When it does, we will finally know what information is; until then we have to make do with compromises.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;His prediction of this convergence is pretty exciting. Overall, &amp;ldquo;Information&amp;rdquo; is a fascinating book, never too technical, and will especially appeal to geeks interested in the future of publishing applications and any applications that manipulate content or data with semantic value.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/book-reviews">book reviews</category>
      
    </item>
    
    <item>
      <title>Large eBay items without the shipping cost</title>
      <link>https://www.bobdc.com/blog/large-ebay-items-without-the-s/</link>
      <pubDate>Fri, 14 Apr 2006 10:04:36 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/large-ebay-items-without-the-s/</guid>
      
      
      <description><div>And, look them over before bidding on them. Some patience required.</div><div>&lt;p&gt;Did you ever see something on eBay that would be a good deal if it wasn&amp;rsquo;t for the shipping costs? Last summer I created a saved eBay search for &amp;ldquo;nordic track virginia&amp;rdquo;, hoping to find a local one that I could pick up myself. I wanted it to exercise when it&amp;rsquo;s too hot or too cold out to go jogging, and I hoped to find someone local who had bought one, didn&amp;rsquo;t use it, and wanted to make some quick cash from it. The shipping on something that big wouldn&amp;rsquo;t be trivial, which is why I wanted someone local.&lt;/p&gt;
&lt;img src=&#34;https://www.bobdc.com/img/main/blackguitar.jpg&#34; alt=&#34;[black acoustic guitar from ebay]&#34; border=&#34;0&#34; align=&#34;right&#34; width=&#34;200px&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;
&lt;p&gt;This search turned up a Nordic Track at a local store called &lt;a href=&#34;http://www.i-soldit.com/your_store.asp&#34;&gt;iSOLDIt&lt;/a&gt; that sells things on eBay for anyone who drops them off and doesn&amp;rsquo;t mind a commission being taken. A similar store plays an integral part in the plot of the movie &amp;ldquo;The 40 Year Old Virgin&amp;rdquo;, as in-store displays at iSOLDIt now remind you. They had a simple Nordic Track with none of the optional electronics, so I went over to check it out. It was in pretty good shape, so I bid on it, won it, and picked it up a few days later.&lt;/p&gt;
&lt;p&gt;Next, I created a saved search email alert for acoustic guitars that show up at that particular iSOLDIt store. (These emails don&amp;rsquo;t go to my inbox, but to a script that converts them to an Atom feed, which I described in the XML.com article &lt;a href=&#34;http://www.xml.com/pub/a/2005/11/23/hacking-ebay-turning-email-alerts-into-atom.html&#34;&gt;Hacking eBay: Turning Email Alerts into Atom&lt;/a&gt;.) One of the tuning pegs on my existing acoustic is shot, and it&amp;rsquo;s not a nice enough guitar to justify the cost of a new set of decent pegs. A musical instrument, even more than a Nordic Track, is something that you want to hold in your hands and look over before offering money for it, so when a simple black acoustic showed up at the local iSOLDIt, I went over to take a look at it. I decided that it was worth about $100 and that I would bid up to $50 or so for it. I ended up winning it for $26, and I don&amp;rsquo;t have to pay shipping!&lt;/p&gt;
&lt;p&gt;So remember: these stores aren&amp;rsquo;t just a convenience for eBay sellers; they&amp;rsquo;re great for buyers as well. If there&amp;rsquo;s one near you, set up a saved search or two to let you know when they have something you&amp;rsquo;d like, especially if it&amp;rsquo;s something large enough to make the shipping costs discouraging to more distant buyers.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Joining the Ruby on Rails chorus</title>
      <link>https://www.bobdc.com/blog/joining-the-ruby-on-rails-chor/</link>
      <pubDate>Tue, 11 Apr 2006 09:49:06 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/joining-the-ruby-on-rails-chor/</guid>
      
      
      <description><div>I tasted the Kool-Aid, and I liked it.</div><div>&lt;p&gt;I finally got around to working my way through a Ruby on Rails tutorial, and was very, very impressed. If you don&amp;rsquo;t have time to install Rails and follow along with the steps in the tutorial, at least do a quick read through Curt Hibbs&amp;rsquo; &lt;a href=&#34;http://web.archive.org/web/20060412083540/http://www.onlamp.com/pub/a/onlamp/2005/01/20/rails.html&#34;&gt;Rolling with Ruby on Rails&lt;/a&gt; article in O&amp;rsquo;Reilly&amp;rsquo;s onlamp.com to get an idea of how easy it is to set up a useful application.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.rubyonrails.org/&#34;&gt;&lt;img src=&#34;https://rubyonrails.org/images/rails-logo.svg&#34; width=&#34;150&#34; alt=&#34;[Ruby on Rails logo]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I had the impression that Rails was about quickly building websites, but Hibbs&amp;rsquo; article showed me that it&amp;rsquo;s more of a way to quickly build database applications with a web-based front end. And when I say &amp;ldquo;quickly&amp;rdquo;, I mean really quickly. Once you define your tables, Rails automates the creation of screens and logic to let your application&amp;rsquo;s users create, read, update, and delete data in those tables, and it provides clearly defined places to customize this behavior if you want to get fancier with your HTML, SQL, or Ruby logic.&lt;/p&gt;
&lt;p&gt;I tried a Ruby tutorial a few years ago, but I had just begun using Python, and I stuck with Python for the same reason I take up nearly any new language: there were all these great Python libraries (for reaching inside of Microsoft Outlook and Office, for manipulating RDF, and many more) that I wanted to patch together into new applications. Now Ruby has its Killer App.&lt;/p&gt;
&lt;p&gt;Most discussions of Rails focus on its inroads into the high-end web development that many are doing with large, complex Java libraries. I can picture it also playing an increasing role in low-end database development. For example, picture someone whose computer knowledge is limited to basic Microsoft Office usage. I&amp;rsquo;ll call her Brenda. Brenda needs to develop some sort of multi-table (and perhaps multi-user) database for her office. The obvious answer in the last few years would have been &lt;a href=&#34;http://office.microsoft.com/en-us/FX010857911033.aspx&#34;&gt;Microsoft Access&lt;/a&gt;; other choices over the years include dBase, FoxPro, Paradox, FileMaker, and other products that commanded portions of the PC database manager market before Microsoft steamrolled in. Now, Brenda has a free alternative that works equally well on Windows, the Mac, and Linux: Ruby on Rails. The most difficult part is the marketing message; how do you find the Brendas and get the message across? &amp;ldquo;Maximize your productivity and forget about J2EE&amp;rdquo;, a traditional Rails marketing message, won&amp;rsquo;t mean much to Brenda.&lt;/p&gt;
&lt;p&gt;Hibbs&amp;rsquo; article is a little too Windows-centric, with its use of &lt;a href=&#34;http://www.mysqlfront.de/&#34;&gt;MySQL-front&lt;/a&gt; to do the database parts, which is a shame considering how cross-platform Rails is. &lt;a href=&#34;http://www.onlamp.com/pub/a/onlamp/2005/03/03/rails.html&#34;&gt;Part 2&lt;/a&gt; of his article includes the SQL code that you would enter at the &lt;code&gt;mysql&amp;gt;&lt;/code&gt; prompt if you&amp;rsquo;re using Rails on a Mac or Linux box. (If you&amp;rsquo;re going to install Rails on Ubuntu Linux, as I did, I found the weblog posting &lt;a href=&#34;http://fo64.com/articles/2005/10/20/rails-on-breezy&#34;&gt;Ruby, Rails, Apache2, and Ubuntu Breezy&lt;/a&gt; by a Joe somebody to be very helpful.) If you&amp;rsquo;re going to carry out all the steps in his tutorial, then before you start typing it&amp;rsquo;s worth reading both part 1 and part 2 and skimming the comments to learn about a few caveats that will speed your progress when you actually get to entering the (minimal) code that the tutorial asks of you. (One more quibble: too much of the article&amp;rsquo;s sample code is shown in screen shots, instead of inside of HTML &lt;code&gt;pre&lt;/code&gt; tags, which would have let readers copy and paste it instead of rekeying it.)&lt;/p&gt;
&lt;p&gt;Amy Hoy&amp;rsquo;s &lt;a href=&#34;http://www.slash7.com/articles/2005/01/24/really-getting-started-in-rails&#34;&gt;Really Getting Started in Rails&lt;/a&gt; fills in some gaps in Hibbs&amp;rsquo; article, and her other articles listed down the left of that screen also look like helpful background.&lt;/p&gt;
&lt;p&gt;Hibbs&amp;rsquo; more recent article &lt;a href=&#34;http://www.onlamp.com/pub/a/onlamp/2005/06/09/rails_ajax.html&#34;&gt;AJAX on Rails&lt;/a&gt; demonstrates how Rails&amp;rsquo; ability to simplify the implementation of useful features has been applied to AJAX&amp;rsquo;s use of XMLHttpRequest Javascript calls (or more accurately, to shield you from the need to deal directly with XMLHttpRequest calls) to update portions of your screen instead of the whole thing, making the client side of your apps more responsive. This article helped me to better understand both the mechanics of AJAX applications and the elegance of Rails&amp;rsquo; ability to incorporate those mechanics.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Measuring information</title>
      <link>https://www.bobdc.com/blog/measuring-information/</link>
      <pubDate>Wed, 05 Apr 2006 18:11:49 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/measuring-information/</guid>
      
      
      <description><div>A short but dense classic offers some solid background.</div><div>&lt;p&gt;I&amp;rsquo;ve always been fascinated by the idea of information as something quantifiable. When William Strunk (of &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=020530902X/bobducharmeA/&#34;&gt;Strunk and White&lt;/a&gt; fame) wrote &lt;a href=&#34;http://www.bartleby.com/141/strunk5.html#13&#34;&gt;omit needless words&lt;/a&gt;, and when George Orwell wrote &amp;ldquo;If it is possible to cut a word out, always cut it out&amp;rdquo; in &lt;a href=&#34;http://www.resort.com/~prime8/Orwell/patee.html&#34;&gt;Politics and the English Language&lt;/a&gt;, they affirmed that good writing packs more information into fewer words (or syllables, or even letters—in the same essay, Orwell wrote &amp;quot; Never use a long word where a short one will do&amp;quot;) than bad writing does. While I won&amp;rsquo;t tag this weblog entry as part of my &lt;a href=&#34;http://www.snee.com/bobdc.blog/publishing/documenting_software/&#34;&gt;Documenting Software&lt;/a&gt; series, the idea of more &amp;ldquo;efficient&amp;rdquo; sentences holds obvious appeal to a geek writer.&lt;/p&gt;
&lt;p&gt;The phrase &amp;ldquo;more information&amp;rdquo; here, though, does not refer to something quantifiable. For example, the second sentence below has more information than the first, but we can&amp;rsquo;t assign a number to the difference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;The Beatles&#39; album &amp;quot;Revolver&amp;quot; is really so, completely, totally, you know, like, awesome.&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;The Beatles recorded their &amp;quot;Revolver&amp;quot; album at Abbey Road studios from 4/7/66 to 6/17/66.&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Years ago, when I heard about Claude Shannon&amp;rsquo;s work on information theory, I sent away for the book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0252725484/bobducharmeA/&#34;&gt;The Mathematical Theory of Communication&lt;/a&gt; that he co-authored with Warren Weaver. This was so long ago that I only realized today that the NYU building where I took every class of my computer science degree was named for the same co-author; if I ever referred to WWH in an e-mail to Matthew Fuchs, who got his PhD there, he&amp;rsquo;d know immediately that I meant Warren Weaver Hall.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0252725484/bobducharmeA/&#34;&gt;&lt;img src=&#34;http://images.amazon.com/images/P/0252725484.01._BO2-64_AA240_SH20_SCLZZZZZZZ_.gif&#34; alt=&#34;[Shannon and Weaver cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Much of information theory came out of the study of communication, with the engineering problem being the loss of information. To know what percentage of transmitted information was received, you must be able to measure information, so it&amp;rsquo;s no surprise that Shannon did this work at Bell Telephone Labs. He wrote pages 36 to 125 of this book, and the math is over my head despite my CS degree. (I may claim that I&amp;rsquo;ve always specialized in getting computers to manipulate text, not numbers, but that&amp;rsquo;s a poor excuse considering how much of the payoff from Shannon&amp;rsquo;s work has been in applications that transmit and store text.)&lt;/p&gt;
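As a rough illustration of what this symbol-level measurement looks like (my own sketch, not an example from the book), Shannon&rsquo;s entropy formula assigns a message a number of bits per symbol based on how predictable its symbols are; here the symbol probabilities are simply estimated from character frequencies within the message itself:

```python
import math
from collections import Counter

def entropy_bits(message: str) -> float:
    """Estimate the Shannon entropy of a message in bits per symbol,
    using each character's frequency in the message as its probability."""
    counts = Counter(message)
    total = len(message)
    # H = -sum(p * log2(p)) over all symbol probabilities p
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Eight copies of one symbol carry no surprise at all; eight distinct,
# equally likely symbols carry the maximum of 3 bits per symbol (2^3 = 8).
print(entropy_bits("aaaaaaaa"))  # prints 0.0
print(entropy_bits("abcdefgh"))  # prints 3.0
```

Two messages of the same length can therefore differ in measured information even at this purely technical level, before any question of meaning arises.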
&lt;p&gt;Weaver&amp;rsquo;s 1949 26-page introduction to the book includes a few logarithmic expressions, but I can handle that, and his whole section is fascinating. Part 2 of Weaver&amp;rsquo;s essay, the longest part, is an interpretation of Shannon&amp;rsquo;s work; Part 1 raises questions that add much clarity to my ruminations about the difference in the amount of information in the two sample sentences above, and Part 3 revisits the questions in light of Shannon&amp;rsquo;s work.&lt;/p&gt;
&lt;p&gt;Part 1 describes three levels of communications problems, which he describes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;LEVEL A. How accurately can the symbols of communication be transmitted? (The technical problem.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LEVEL B. How precisely do the transmitted symbols convey the desired meaning? (The semantic problem.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LEVEL C. How effectively does the received meaning affect conduct in the desired way? (The effectiveness problem.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, the second of my two sentences about the Revolver album has more information in Level B terms (that is, more semantic information), while having an identical amount in Level A terms—with the two sentences being equal in length, my host provider&amp;rsquo;s server used the same energy to send each one to your computer. (I think that Level C concerns the receiving entity more than the received message, and Weaver doesn&amp;rsquo;t say much about it, so I won&amp;rsquo;t address it here.) Part 2 of Weaver&amp;rsquo;s piece is called &amp;ldquo;Communications Problems at Level A&amp;rdquo;; this is obviously where Shannon&amp;rsquo;s math has the most to offer. Weaver mostly discusses Levels B and C in terms of their relationship to Level A, and it&amp;rsquo;s an important relationship: they build on Level A, so problems at Level A cause problems in B and C. He suggests an interesting change to the following diagram, shown originally in Part 1:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.tcw.utwente.nl/theorieenoverzicht/Theory%20clusters/Communication%20and%20Information%20Technology/Information_Theory.doc/&#34;&gt;&lt;img src=&#34;http://www.tcw.utwente.nl/theorieenoverzicht/Theory%20clusters/Communication%20and%20Information%20Technology/Information_Theory.doc/Information_Theory-1.png&#34; alt=&#34;[communication system schematic]&#34; border=&#34;0&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One can imagine, as an addition to the diagram, another box labeled &amp;ldquo;Semantic Receiver&amp;rdquo; interposed between the engineering receiver (which changes signals to the messages) and the destination. This semantic receiver subjects the message to a second decoding, the demand on this one being that it must match the statistical &lt;em&gt;semantic&lt;/em&gt; characteristics [his italics] of the message to the statistical semantic capacities of the totality of receivers, or of that subset of receivers which constitute the audience one wishes to affect.&lt;/p&gt;
&lt;p&gt;Similarly one can imagine another box in the diagram which, inserted between the information source and the transmitter, would be labeled &amp;ldquo;semantic noise,&amp;rdquo; the box previously labeled as simply &amp;ldquo;noise&amp;rdquo; now being labeled &amp;ldquo;engineering noise.&amp;rdquo; From this source is imposed into the signal the perturbations or distortions of meaning which are not intended by the source but which inescapably affect the destination. And the problem of semantic decoding must take this semantic noise into account. It is also possible to think of an adjustment of original message so that the sum of message meaning plus semantic noise is equal to the desired total message meaning at the destination.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To &amp;ldquo;match the statistical semantic characteristics&amp;rdquo; sounds like quite a challenge, but I&amp;rsquo;m sure there are semantic web researchers out there reading up on their Shannon and Weaver.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve written all this as an introduction to a review of a more recent, pop science oriented book that I&amp;rsquo;ve just read and enjoyed very much, Hans Christian von Baeyer&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0753817829/bobducharmeA/&#34;&gt;Information&lt;/a&gt;, but this is already long enough, so I&amp;rsquo;ll discuss von Baeyer&amp;rsquo;s book at some future point. I&amp;rsquo;ll finish with my favorite quote from &amp;ldquo;The Mathematical Theory of Communication&amp;rdquo;, in which Weaver discusses the effect of the probability of the existence of a given symbol string on efficient compression:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip; anyone would agree that the probability is low for such a sequence of words as &amp;ldquo;Constantinople fishing nasty pink.&amp;rdquo; Incidentally, it is low, but not zero; for it is perfectly possible to think of a passage in which one sentence closes with &amp;ldquo;Constantinople fishing,&amp;rdquo; and the next begins with &amp;ldquo;Nasty pink.&amp;rdquo; And we might observe in passing that the unlikely four-word sequence under discussion &lt;em&gt;has&lt;/em&gt; occurred in a single good English sentence, namely, the one above.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A &lt;a href=&#34;http://www.google.com/search?q=%22Constantinople%20fishing%20nasty%20pink%22&#34;&gt;Google search&lt;/a&gt; on the phrase &amp;ldquo;Constantinople fishing nasty pink&amp;rdquo; today gets 246 hits, but of course Shannon and Weaver&amp;rsquo;s influence is much more extensive than that.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By Gavin Brelstaff on &lt;a href=&#34;#comment-385&#34;&gt;April 6, 2006 9:36 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Ezra Pound in his &amp;ldquo;ABC Of Reading&amp;rdquo; - Faber &amp;amp; Faber 1951&lt;br /&gt;
wrote p36&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Great literature is simply language charged with meaning to the utmost possible degree.&amp;rdquo;&lt;br /&gt;
DICHTEN = CONDENSARE&lt;/p&gt;
&lt;p&gt;p63&lt;br /&gt;
&amp;ldquo;Incompetence will show in the use of too many words&amp;rdquo;&lt;/p&gt;
&lt;p&gt;By Gavin Brelstaff on &lt;a href=&#34;#comment-398&#34;&gt;April 11, 2006 4:46 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Dear Bob&lt;/p&gt;
&lt;p&gt;I (also) think your thinking might be elucidated by the&lt;br /&gt;
Russian linguist:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/Roman_Jakobson&#34;&gt;http://en.wikipedia.org/wiki/Roman_Jakobson&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Jakobson distinguishes six communication functions, each associated with a dimension of the communication process:&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Dimensions&lt;br /&gt;
1 context&lt;br /&gt;
2 message&lt;br /&gt;
3 sender &amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash; 4 receiver&lt;br /&gt;
5 channel&lt;br /&gt;
6 code&lt;/p&gt;
&lt;p&gt;Functions&lt;br /&gt;
1 referential (= contextual information)&lt;br /&gt;
2 poetic (= autotelic)&lt;br /&gt;
3 emotive (= self-expression)&lt;br /&gt;
4 conative (= vocative or imperative addressing of receiver)&lt;br /&gt;
5 phatic (= checking channel working)&lt;br /&gt;
6 metalingual (= checking code working)&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/book-reviews">book reviews</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>Document Engineering</title>
      <link>https://www.bobdc.com/blog/document-engineering/</link>
      <pubDate>Mon, 03 Apr 2006 09:12:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/document-engineering/</guid>
      
      
      <description><div>An excellent book by Bob Glushko and Tim McGrath.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0262072610/bobducharmeA/&#34;&gt;&lt;img src=&#34;http://ec1.images-amazon.com/images/P/0262072610.01._AA240_SCLZZZZZZZ_.jpg&#34; alt=&#34;[Document Engineering cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob Glushko and Tim McGrath&amp;rsquo;s new book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0262072610/bobducharmeA/&#34;&gt;Document Engineering: Analyzing and Designing Documents for Business Informatics and Web Services&lt;/a&gt; describes &amp;ldquo;document engineering&amp;rdquo; as a new discipline. The discipline, if not the name, will sound familiar to people who work with XML in an automated publishing context, a web services context, or somewhere in between. (More on the &amp;ldquo;between&amp;rdquo; later.)&lt;/p&gt;
&lt;p&gt;I found the book&amp;rsquo;s subtitle to be misleading, because the book covers much more than the design of documents. As its introduction tells us,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The essence of Document Engineering is the analysis and design methods that yield:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Precise specifications or models for the information that business processes require.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rules by which related processes are coordinated, whether between different firms to create composite services or virtual enterprises or within a firm to streamline information flow between organizations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Document Engineering provides the concepts and methods needed to align business strategy and information technology, to bridge the gap between what we want to do and how to do it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The two authors describe documents as &amp;ldquo;self-contained package[s] of related information&amp;rdquo;, which obviously means much more than publishable content that gets set into specific fonts to be read by human eyeballs. In particular, they describe documents as the interfaces between business processes. These processes are typically rendered as services these days, what with service-oriented architectures (SOA) being such a hot topic in IT architecture discussions. Such a document could be an invoice, a bill of lading, or a specific information package designed for the interaction between processes at two business partners.&lt;/p&gt;
&lt;p&gt;Or, of course, it could be a novel or a user&amp;rsquo;s guide or a company&amp;rsquo;s annual report. There&amp;rsquo;s a common distinction in the XML world between &amp;ldquo;data-oriented&amp;rdquo; XML and &amp;ldquo;document-oriented XML&amp;rdquo; that I prefer to describe as transaction-oriented XML versus publishing-oriented XML (see &lt;a href=&#34;http://www.snee.com/xml/xml2004paper.html&#34;&gt;Documents vs. Data, Schemas vs. Schemas&lt;/a&gt;). It&amp;rsquo;s all data, and it&amp;rsquo;s all documents. The status of all well-formed XML as both data and documents is something that Glushko and McGrath take very seriously, and they&amp;rsquo;ve studied engineering techniques from both the business transaction and the publishing worlds to help the reader address issues in both ends of the spectrum and in the many cases in between. For example, in addition to reviewing the methodology proposed in Eve Maler and Jeanne El Andaloussi&amp;rsquo;s classic book &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0133098818/bobducharmeA/&#34;&gt;Developing SGML DTDs: From Text to Model to Markup&lt;/a&gt;, they present the first detailed approach I&amp;rsquo;ve seen to applying classic &lt;a href=&#34;http://en.wikipedia.org/wiki/Database_normalization&#34;&gt;database normalization&lt;/a&gt; techniques to documents.&lt;/p&gt;
&lt;p&gt;Along with the use of existing data engineering techniques, they also build on existing business processing models and standards such as ebXML. While the book will be very useful for business process people who want to learn more about the processing of non-relational data, a chapter like &amp;ldquo;Describing What Businesses Do and How They Do It&amp;rdquo; will be just as valuable to data people who want to understand the different classes of business processes, the different levels of abstraction used to discuss them, and potential interactions between them, which is especially important considering the primary role that Glushko and McGrath see their idea of &amp;ldquo;documents&amp;rdquo; playing in those interactions.&lt;/p&gt;
&lt;p&gt;Consultants who need to repeatedly perform business process analysis and related document workflow analysis at multiple clients will find this book to be particularly helpful, both to educate themselves and their clients. For example, the chapter &amp;ldquo;When Models Don&amp;rsquo;t Match: The Interoperability Challenge&amp;rdquo; enumerates ways that exchanged data may not fill the intended purpose. I liked the chart showing the potential problems caused by differences in content, encoding, structure, and semantics for a simple piece of information like &amp;ldquo;100 US Dollars&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;The extended use case that the book repeatedly returns to is an integrated event calendaring system for the University of California at Berkeley. Anyone who considers shared calendaring to be the next killer app should consider themselves lucky that Glushko and McGrath chose this as a use case; the clear connections that they draw between the details of the use case and the more abstract discussions that form the bulk of the book will give calendar app developers a big jump in their analysis work. Also handy for anyone&amp;rsquo;s analysis are the lists of questions that the book suggests you ask of someone about any document that they use or send to others.&lt;/p&gt;
&lt;p&gt;As a 702-page (with indexes and backmatter) hardcover, the book does weigh a bit; I read much of it on a series of plane trips in which I was not carrying a laptop. Carrying this book with a laptop and normal luggage might have given me back problems. The book is definitely worth getting, though, for just about anyone who deals with XML data that gets passed from one process to another, and that&amp;rsquo;s a lot of us.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://svg.startpagina.nl&#34; title=&#34;http://svg.startpagina.nl&#34;&gt;stelt&lt;/a&gt; on &lt;a href=&#34;#comment-377&#34;&gt;April 3, 2006 12:29 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Some very smart ideas about Documents being way more than a dead instance on a website or print-out:&lt;br /&gt;
&lt;a href=&#34;http://www.google.com/search?q=%22future+of+Science+Communication+and+Publishing%22&#34;&gt;http://www.google.com/search?q=%22future+of+Science+Communication+and+Publishing%22&lt;/a&gt; and&lt;br /&gt;
&lt;a href=&#34;http://www.google.com/search?q=%22hypermedia+for+science%22+datament&#34;&gt;http://www.google.com/search?q=%22hypermedia+for+science%22+datament&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And what about many DOMs continuously interacting ?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.sims.berkeley.edu/~glushko&#34; title=&#34;http://www.sims.berkeley.edu/~glushko&#34;&gt;Bob Glushko&lt;/a&gt; on &lt;a href=&#34;#comment-378&#34;&gt;April 3, 2006 3:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, I&amp;rsquo;m pleased that you like the book. Tim McGrath and I have set up a &lt;a href=&#34;http://docengineering.com/&#34;&gt;site at docengineering.com&lt;/a&gt; with a couple of sample chapters and various other talks and papers.&lt;/p&gt;
&lt;p&gt;-Bob Glushko&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
      <category domain="https://www.bobdc.com//categories/book-reviews">book reviews</category>
      
    </item>
    
    <item>
      <title>XML, summer, and Oxford</title>
      <link>https://www.bobdc.com/blog/xml-summer-and-oxford/</link>
      <pubDate>Sat, 25 Mar 2006 15:56:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xml-summer-and-oxford/</guid>
      
      
<description><div>Now in its sixth year!</div><div>&lt;p&gt;The &lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;XML Summer School&lt;/a&gt;, a week of seminars on a wide range of XML-related topics sponsored by the &lt;a href=&#34;http://www.csw.co.uk/&#34;&gt;CSW Group&lt;/a&gt; at a college of Oxford University, is being held for the sixth year. Peter Flynn and I hold the distinction of being the only people to have taught every year, but the list of people who have taught for most of those years and the roster of new people each year are both very distinguished.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://oxford.openguides.org/wiki/?Punting&#34;&gt;&lt;img src=&#34;http://www.xmlsummerschool.com/images/river1.jpg&#34; alt=&#34;[punts]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One of these distinguished members, Michael Kay, &lt;a href=&#34;http://saxonica.blogharbor.com/blog/_archives/2006/3/22/1834451.html&#34;&gt;recently wrote&lt;/a&gt; that &amp;ldquo;when you come in on the last day you get the impression of a strong sense of bonding that has taken place over the week.&amp;rdquo; There are people whom I once considered friends only because I ran into them at a conference each year and emailed occasionally; because of this event, where you get to spend a lot of time hanging out and discussing all kinds of things, I now consider them fairly close friends. (There is often some drinking involved once the day gets old enough; the schedule for the week includes &amp;ldquo;pub crawl&amp;rdquo; as an official event.) The marketing literature for the event has always pointed out that instead of just being lecturers talking through slides to attendees, it&amp;rsquo;s everyone hanging out together day and evening, with activities like &lt;a href=&#34;http://oxford.openguides.org/wiki/?Punting&#34;&gt;punting&lt;/a&gt; and private receptions held in local museums, and Michael&amp;rsquo;s comment shows that the system works.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been overseeing the track on XSLT and related technologies, which this year includes XSL-FO and XQuery. Michael Kay and Jeni Tennison will return to teach XSLT with me (I cover the easier parts), and this year Priscilla Walmsley joins us to teach XQuery and XSL-FO. Paul Prescod will cover a very interesting topic: the role that client-side XSLT can play in AJAX application development. (Michael&amp;rsquo;s talk will touch on both XSLT and XQuery issues, covering the contributions that schema awareness can bring to application development with these languages.)&lt;/p&gt;
&lt;p&gt;And that&amp;rsquo;s just one track. I should pay more attention to the Health Care and Drug Information tracks, because there&amp;rsquo;s always a good cast of presenters discussing interesting solutions to large problems. I always learn more about web services from people like Marc Hadley and John Kemp; Eve Maler makes some very dry issues about security and access control much more interesting, and there are always new things such as this year&amp;rsquo;s coverage of rich web clients by Paul Prescod and Chris Lilley. With people like Peter Flynn, Tony Coates, Peter Brown, and Sean McGrath, I&amp;rsquo;m never sure what I&amp;rsquo;ll be discussing, but it&amp;rsquo;s always interesting (which reminds me—there are often unscheduled pub crawls as well).&lt;/p&gt;
&lt;p&gt;To summarize, the CSW Oxford XML Summer School is a great opportunity to learn basic, advanced, old, and new aspects of XML, whether you&amp;rsquo;re starting the week as a beginner or as a long-time practitioner. Take a look through the &lt;a href=&#34;http://www.xmlsummerschool.com&#34;&gt;web site&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>XHTML 2 for storing content?</title>
      <link>https://www.bobdc.com/blog/xhtml-2-for-storing-content/</link>
      <pubDate>Sat, 18 Mar 2006 17:01:44 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/xhtml-2-for-storing-content/</guid>
      
      
      <description><div>People will use XHTML 2 for more than shipping pages to browsers.</div><div>&lt;p&gt;While discussing &lt;a href=&#34;http://www.w3.org/News/2006#item40/&#34;&gt;RDF/A&lt;/a&gt;, Shelley Powers &lt;a href=&#34;http://weblog.burningbird.net/2006/03/15/useful-semweb-posts/&#34;&gt;recently wrote&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I still believe that we don&amp;rsquo;t need to embed RDF directly into our web pages because many web sites are dynamic now. As such, if one accesses the page as a human, you get data formatted for human consumption through a browser; if you access the page as a webbot, by attaching /rdf to the end of the document, the same data is formatted for mechanical consumption. No need to clutter up web pages, or make page creation or generation that much harder.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think that XHTML 2 will be used for more than delivery of content to browsers, adding a lot of value to the ability to add RDF metadata to XHTML 2. People will take XHTML 2 more seriously as a format for actually storing content than they ever took its predecessors for several reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Improvements such as nested section elements with content-dependent &lt;code&gt;h&lt;/code&gt; headers instead of &lt;code&gt;h1&lt;/code&gt;, &lt;code&gt;h2&lt;/code&gt;, &lt;code&gt;h3&lt;/code&gt;, and their brethren will let us make documents structurally richer and therefore easier to slice and dice. The &amp;ldquo;Structuring Advantages&amp;rdquo; slide of Steven Pemberton&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/2005/Talks/05-steven-xtech/&#34;&gt;XHTML2: Accessible, Usable, Device Independent and Semantic&lt;/a&gt; slideshow makes some excellent points about this. The whole slideshow is worth reading.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I&amp;rsquo;ve noticed a trend among web designers psyched about CSS to take more and more presentation out of their (X)HTML and store it in CSS. The ability to apply different CSS stylesheets to the same XHTML and have it look nice seems to be a mark of professionalism for them now. (Old SGML geeks are tempted to say: &amp;ldquo;Woo-hoo! Professional web designers are finally taking separation of content from presentation seriously! We won!&amp;rdquo;) Maybe the web designers&amp;rsquo; move away from messy HTML is just a fringe benefit of their moves toward the &lt;a href=&#34;http://www.csszengarden.com/&#34;&gt;CSS Zen Garden&lt;/a&gt;. It&amp;rsquo;s still a huge benefit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Structurally, XHTML 1 wasn&amp;rsquo;t enough for some content applications, and DocBook—even DocBook Lite—was often too much. XHTML 2 will hit a sweet spot for a lot of applications, especially those involved in the interchange of content across workflow steps (which may cross business boundaries—think of it as B2B content). Content exchanged across workflow steps needs metadata; that&amp;rsquo;s often how you know which workflow steps have touched a document. Separate metadata means more documents to track. Embedded metadata is part of the appeal of &lt;a href=&#34;https://www.bobdc.com/blog/using-or-not-using-adobes-xmp&#34;&gt;XMP&lt;/a&gt;, which lets you embed (some) RDF into binary files such as PDF and JPEG files. Embedded metadata makes a lot of things easier, and RDF/A will do this for XHTML.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
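&lt;p&gt;To make the first of those points concrete, here is roughly what nested-section markup looks like in the XHTML 2 working drafts: one generic &lt;code&gt;h&lt;/code&gt; element whose heading level follows from its &lt;code&gt;section&lt;/code&gt; nesting (the sample content is invented):&lt;/p&gt;

```xml
<!-- Sketch of the XHTML 2 draft approach: heading depth comes from
     section nesting rather than from fixed h1/h2/h3 element names. -->
<section>
  <h>Birds</h>
  <p>Some introductory text about birds.</p>
  <section>
    <h>Waterfowl</h>
    <p>Ducks, geese, and swans.</p>
  </section>
</section>
```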
&lt;p&gt;If I&amp;rsquo;m writing something shorter than an entire book, I&amp;rsquo;m sure I&amp;rsquo;ll mostly use XHTML 2 once its schema gets more settled. The ability to put a list or &lt;code&gt;pre&lt;/code&gt; element inside of a &lt;code&gt;p&lt;/code&gt; element will be very handy for tech writing. If someone needs my content in some other format, it will be easy enough to transform. RDF/A looks to me like microformats done right, and it could even benefit the RDF community more than the XHTML community as it spreads RDF beyond the ivory towers where it&amp;rsquo;s been most comfortable.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m looking forward to XHTML 2, and RDF/A is a key reason. Again, read Steven&amp;rsquo;s slides.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://dubinko.info/blog/&#34; title=&#34;http://dubinko.info/blog/&#34;&gt;Micah Dubinko&lt;/a&gt; on &lt;a href=&#34;#comment-283&#34;&gt;March 18, 2006 6:22 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;100% Agree. -m&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Law metadata on the web</title>
      <link>https://www.bobdc.com/blog/law-metadata-on-the-web/</link>
      <pubDate>Mon, 13 Mar 2006 07:54:22 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/law-metadata-on-the-web/</guid>
      
      
      <description><div>US laws and court decisions: fertile ground for semantic integration projects.</div><div>&lt;p&gt;More and more primary law (court cases and actual laws passed by governments at any level, as opposed to secondary law such as treatises explaining the meaning of primary law) is available on the web. In the United States, the federal government and most state governments and court systems make it a regular practice to publish this information on their own dot-gov websites. Governments typically have laws requiring that all laws be available where citizens can see them, and doing so on the web costs less than the time-honored tradition of publishing bound books.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.hypergrove.com/legalrdf.org/index.html&#34;&gt;&lt;img src=&#34;http://www.hypergrove.com/legalrdf.org/LegalXHTML6.gif&#34; alt=&#34;Legal-RDF logo&#34; align=&#34;right&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;240px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Each state, though, has little incentive to encourage and then follow national standards for how this information is published. The web operation at a given state capital or court system is a nonprofit organization working on a limited budget, and if they can get readable HTML, PDF, Word, or even WordPerfect files up there, they&amp;rsquo;ve achieved their main goal. Much of what LexisNexis and WestLaw customers pay for is integrated access to normalized, indexed versions of such data, cross-linking between court decisions and laws, and a professionally-maintained taxonomy to guide them through laws that address particular subjects, and while many customers complain about the expense, the cost doesn&amp;rsquo;t surprise people who understand the work that goes into it.&lt;/p&gt;
&lt;p&gt;Some related efforts for standardizing law XML have been up and running for a while. The OASIS &lt;a href=&#34;http://www.legalxml.org/about/index.shtml&#34;&gt;LegalXML&lt;/a&gt; group is mostly concerned with the electronic exchange of legal data such as court filings and transcripts. Their &lt;a href=&#34;http://www.legalxml.org/members/index.shtml&#34;&gt;membership&lt;/a&gt; is an interesting array of for-profit and non-profit organizations from several countries. (It turns out that the &lt;a href=&#34;http://www.vscl.org.au/&#34;&gt;Victorian Society for Computers &amp;amp; the Law&lt;/a&gt; is based in the Australian state of Victoria and not concerned with potential applications of Charles Babbage&amp;rsquo;s Difference Engine to British law during the reign of Queen Victoria.) Joshua Tauberer&amp;rsquo;s &lt;a href=&#34;http://www.GovTrack.us&#34;&gt;www.GovTrack.us&lt;/a&gt; (see also his &lt;a href=&#34;http://www.xml.com/pub/a/2006/02/08/govtrack-us-public-data-semantic-web.html&#34;&gt;XML.com article&lt;/a&gt;) uses semantic web technologies to track ongoing US Government activity as it creates laws.&lt;/p&gt;
&lt;p&gt;A new organization called &lt;a href=&#34;http://www.legalrdf.org/&#34;&gt;Legal-RDF.org&lt;/a&gt; focuses more on something I&amp;rsquo;ve been wondering about: the use of semantic web technologies to allow for integrated use of the free primary law on the web. Integrating the similar yet often structurally different collections of federal and state US law sounds like an ideal semantic web project; US academic researchers in the field looking for projects that would attract grant money should take a close look at the possibilities.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.law.cornell.edu/&#34;&gt;&lt;img src=&#34;http://www.law.cornell.edu/images/tower.gif&#34; alt=&#34;Cornell Legal Information Institute logo&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently scraped some Supreme Court decision metadata from the excellent collection at Cornell&amp;rsquo;s &lt;a href=&#34;http://www.law.cornell.edu/&#34;&gt;Legal Information Institute&lt;/a&gt;. (I was going to use it to learn &lt;a href=&#34;http://www.w3.org/TR/rdf-sparql-query/&#34;&gt;SPARQL&lt;/a&gt; better, querying for things like how often two judges were on the same side of a concurrence or a dissent for a particular opinion, but this fell in priority as &lt;a href=&#34;https://www.bobdc.com/blog/im-available&#34;&gt;job search&lt;/a&gt; related projects rose in priority.) The HTML at the LII is a bit messy, but the court decisions and opinions have plenty of &lt;code&gt;META&lt;/code&gt; tags, and tools for cleaning up the HTML are easy enough to find, so I turned the metadata into an &lt;a href=&#34;http://www.snee.com/rdf/ussupremect.rdf&#34;&gt;RDF file&lt;/a&gt;. (For an example of the existing metadata, do a View Source on my favorite case, &lt;a href=&#34;http://www.law.cornell.edu/supct/html/92-1292.ZS.html&#34;&gt;510 U.S. 569&lt;/a&gt;, and Justice David Souter&amp;rsquo;s &lt;a href=&#34;http://www.law.cornell.edu/supct/html/92-1292.ZO.html&#34;&gt;opinion&lt;/a&gt;. Don&amp;rsquo;t miss Appendix B of the latter; here&amp;rsquo;s a sample quote: &amp;ldquo;Big hairy woman all that hair it ain&amp;rsquo;t legit, Cause you look like Cousin It&amp;rdquo;.)&lt;/p&gt;
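&lt;p&gt;As a rough sketch of that scraping step, Python&amp;rsquo;s standard HTML parser can pull the &lt;code&gt;META&lt;/code&gt; name/content pairs out of a page and emit them as simple triples. The sample page, subject URI, and &amp;ldquo;lawmeta&amp;rdquo; vocabulary below are all invented for illustration; the LII pages use their own metadata names:&lt;/p&gt;

```python
# Hypothetical sketch: extract META name/content pairs with Python's
# standard html.parser and emit them as N-Triples. The sample page,
# subject URI, and "lawmeta" vocabulary are invented for illustration.
from html.parser import HTMLParser

class MetaScraper(HTMLParser):
    """Collect (name, content) pairs from META tags."""
    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta.append((d["name"], d["content"]))

def to_ntriples(page_uri, pairs, vocab="http://example.org/lawmeta#"):
    """One N-Triples line per META pair, with the content as a literal."""
    lines = []
    for name, content in pairs:
        literal = content.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s%s> "%s" .' % (page_uri, vocab, name, literal))
    return lines

sample = ('<html><head>'
          '<meta name="CASENAME" content="Campbell v. Acuff-Rose Music">'
          '<meta name="COURT" content="U.S. Supreme Court">'
          '</head><body></body></html>')
scraper = MetaScraper()
scraper.feed(sample)
triples = to_ntriples("http://example.org/cases/510us569", scraper.meta)
```

&lt;p&gt;Real pages would need the HTML-tidying step mentioned above before parsing, but the basic name/content-to-triple mapping stays this simple.&lt;/p&gt;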
&lt;p&gt;One of the stated goals of Legal-RDF.org (2020-02-23 update: domain name now owned by an ambulance chaser law firm) is &amp;ldquo;to develop and publish domain vocabularies (that is, ontologies) used to label text within legal and related documents with their semantic meaning.&amp;rdquo; For the quick and dirty RDF that I created, I just made up namespace URLs, and Legal-RDF.org work like this ontology project will make it easier for projects like mine to use common namespaces so that they can integrate more easily with each other and, like, you know, form a web. (A folksonomy-oriented list of topics touched on by court cases would have a more difficult time being useful, considering that legal research is the classic use case in which &lt;a href=&#34;http://en.wikipedia.org/wiki/Information_retrieval#Recall&#34;&gt;recall&lt;/a&gt; is more important than &lt;a href=&#34;http://en.wikipedia.org/wiki/Information_retrieval#Precision&#34;&gt;precision&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;In general, Legal-RDF.org looks like a great place to get in touch with other people interested in taking on similar pieces of a project that looks like an obvious and useful application of semantic web technologies. I look forward to seeing the results of their work.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/legal-publishing">legal publishing</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Writing about software: Naming and spelling things correctly</title>
      <link>https://www.bobdc.com/blog/writing-about-software-naming/</link>
      <pubDate>Wed, 08 Mar 2006 09:51:53 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/writing-about-software-naming/</guid>
      
      
      <description><div>How do you check the spelling of something that isn&#39;t in the dictionary?</div><div>&lt;p&gt;(This is part two of an irregular series that I&amp;rsquo;m doing on writing about software. &lt;a href=&#34;https://www.bobdc.com/blog/writing-about-software-what-do&#34;&gt;What documentation does a product need?&lt;/a&gt; is part one.)&lt;/p&gt;
&lt;p&gt;New technologies often bring along new words and new uses of old ones. Perhaps you&amp;rsquo;ve made up new names for certain components of the application you&amp;rsquo;re documenting. More likely, someone else created new names for something that you must document. You also may be introducing acronyms to your reader, and acronyms less common than &amp;ldquo;HTML&amp;rdquo; should be spelled out the first time they&amp;rsquo;re used. Considering how many things &amp;ldquo;ATM&amp;rdquo; stands for these days, spelling out an acronym once gives the reader valuable context for its use.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0131428993/bobducharmeA&#34;&gt;&lt;img src=&#34;http://images.amazon.com/images/P/0131428993.01._BO2,204,203,200_PIlitb-dp-500-arrow,59_AA240_SH20_SCLZZZZZZZ_.jpg&#34; alt=&#34;Sun Read Me First cover&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Where can you find a definitive reference for these spellings, acronym expansions, capitalization and other usage rules? These terms usually come from two classes of sources: specifications and software companies. For a freely available spec like &lt;a href=&#34;http://www.w3.org/TR/2004/REC-xml-20040204/&#34;&gt;the XML Recommendation&lt;/a&gt;, there&amp;rsquo;s no excuse for saying that the acronym stands for &amp;ldquo;eXtensible Markup Language&amp;rdquo; or for forgetting the hyphens in &amp;ldquo;start-tag&amp;rdquo; and &amp;ldquo;end-tag&amp;rdquo;. Just look at the spec. Should the &amp;ldquo;p&amp;rdquo; in XPath be capitalized or not? Look at &lt;a href=&#34;http://www.w3.org/TR/1999/REC-xpath-19991116&#34;&gt;the spec&lt;/a&gt;. Is it more proper to refer to &amp;ldquo;a SQL query&amp;rdquo; or &amp;ldquo;an SQL query&amp;rdquo;? This is a tougher one, because the complete spec is not available for free, but it doesn&amp;rsquo;t take much detective work to find something authoritative: &lt;a href=&#34;http://en.wikipedia.org/wiki/SQL&#34;&gt;Wikipedia&amp;rsquo;s SQL entry&lt;/a&gt; includes links to PDFs of a subset of the ISO/ANSI SQL specification. (It turns out that the phrase &amp;ldquo;an SQL&amp;rdquo; does come up in the SQL spec, and &amp;ldquo;a SQL&amp;rdquo; doesn&amp;rsquo;t.) So, sometimes a bit of detective work is necessary, but it&amp;rsquo;s rarely much and always worth it.&lt;/p&gt;
&lt;p&gt;Should &amp;ldquo;internet&amp;rdquo; be capitalized? Is &amp;ldquo;filename&amp;rdquo; one word or two? There are many cases where multiple renditions can be considered proper, but professional-looking documents come from organizations that pick one variation of each such term and stick with their choice. You shouldn&amp;rsquo;t see &amp;ldquo;internet&amp;rdquo; in the middle of one sentence and &amp;ldquo;Internet&amp;rdquo; in the middle of another sentence in the same document, in the same documentation set, or, ideally, in two different documents coming from the same company.&lt;/p&gt;
&lt;p&gt;To implement this consistency, those in the publishing business (and remember, a software company that wants their documentation to be taken seriously is in the publishing business) have what they call &lt;a href=&#34;http://en.wikipedia.org/wiki/Style_guide&#34;&gt;style guides&lt;/a&gt;, which document their use of words that may offer a choice of usage. (A former co-worker who had worked at Penthouse Magazine—note that I&amp;rsquo;ve resisted the temptation to turn that phrase into a link—once told me of a screaming fight in an editorial meeting over whether &amp;ldquo;butt cheeks&amp;rdquo; is one word or two. I guess every profession has its professionalism.) You can buy the &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=081296389X/bobducharmeA&#34;&gt;New York Times Manual of Style and Usage&lt;/a&gt; at Amazon or any large bookstore, but even better for people documenting software, large software companies often publish their own style guides and sometimes make them available for free.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0735617465/bobducharmeA&#34;&gt;&lt;img src=&#34;http://images.amazon.com/images/P/0735617465.01._BO2,204,203,200_PIsitb-dp-500-arrow,-64_AA240_SH20_SCLZZZZZZZ_.jpg&#34; alt=&#34;MS Manual for Style cover&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sun publishes &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0131428993/bobducharmeA&#34;&gt;Read Me First! A Style Guide for the Computer Industry&lt;/a&gt; and Microsoft offers the &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0735617465/bobducharmeA&#34;&gt;Microsoft Manual of Style for Technical Publications&lt;/a&gt;. (The latter was once available for free on their website, but my link to it no longer works, and searches don&amp;rsquo;t turn anything up.) The &lt;a href=&#34;http://developer.apple.com/documentation/UserExperience/Conceptual/APStyleGuide/AppleStyleGuide2006.pdf&#34;&gt;Apple Publications Style Guide&lt;/a&gt; (PDF) is available for free download, and Oracle&amp;rsquo;s &lt;a href=&#34;http://www.oracle.com/technology/tech/blaf/index.html&#34;&gt;Browser Look and Feel Guidelines&lt;/a&gt; is available online.&lt;/p&gt;
&lt;p&gt;Some, like the Oracle one, concern application development more than documentation development, but they still provide the prose writer with a good reference for terminology. The Microsoft and Apple guides are particularly important when writing about GUI applications running on those platforms. When you tell a reader how to fill out a particular dialog box that pops up on a screen (and remember, a &lt;a href=&#34;http://dictionary.reference.com/search?q=dialog&#34;&gt;dialog&lt;/a&gt; is a conversation between two or more people—a pop-up window with a form to fill out is a dialog box, not a dialog) you need a solid handle on the correct vocabulary for describing the potential parts of a dialog box. &amp;ldquo;Combo box&amp;rdquo; is a proper technical Microsoft term, and combo boxes are different from list boxes, and you should look up the difference if you have to refer to either in documentation.&lt;/p&gt;
&lt;p&gt;Don&amp;rsquo;t underestimate the value of creating your own house style, especially if there is a team of more than one writer working together on documentation. When you do choose between &amp;ldquo;Internet&amp;rdquo; and &amp;ldquo;internet&amp;rdquo; or &amp;ldquo;filename&amp;rdquo; and &amp;ldquo;file name&amp;rdquo;, document it and make sure that everyone who does any writing knows where to find the list and what to do if they have something to discuss and add to the list. Terms related to your own software are especially important to cover on this list.&lt;/p&gt;
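&lt;p&gt;Once those choices are written down, they&amp;rsquo;re easy to enforce mechanically. Here&amp;rsquo;s a toy sketch of such a checker; the rejected/preferred pairs are invented examples standing in for a real house style list:&lt;/p&gt;

```python
# A toy house style checker: a documented list of term choices,
# enforced mechanically. The rejected/preferred pairs are invented
# examples standing in for a real style guide's decisions.
import re

HOUSE_STYLE = {
    # rejected variant (as a regex) -> preferred form
    r"\bfile name\b": "filename",
    r"\bweb site\b": "website",
    r"\bdialog\b(?! box)": "dialog box",  # a dialog is a conversation
}

def check_style(text):
    """Return one (line_number, rejected, preferred) tuple per violation."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, preferred in HOUSE_STYLE.items():
            for match in re.finditer(pattern, line, re.IGNORECASE):
                problems.append((lineno, match.group(0), preferred))
    return problems

doc = "Save the file name in a dialog.\nOur web site explains the rest."
violations = check_style(doc)  # three violations across the two lines
```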
&lt;p&gt;Call me picky if you like, but consistency (both internally and with authoritative material outside of your control) is a mark of professionalism, and if you&amp;rsquo;re charging money for your product, you want it to look professional.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://danbri.org/&#34; title=&#34;http://danbri.org/&#34;&gt;Dan Brickley&lt;/a&gt; on &lt;a href=&#34;#comment-248&#34;&gt;March 8, 2006 2:17 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Kinda related &amp;mdash; there are some glossaries on the W3C site, &lt;a href=&#34;http://www.w3.org/2003/glossary/&#34;&gt;http://www.w3.org/2003/glossary/&lt;/a&gt; with details here: &lt;a href=&#34;http://www.w3.org/QA/2003/01/Glossary&#34;&gt;http://www.w3.org/QA/2003/01/Glossary&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I couldn&amp;rsquo;t find the RDF, but it&amp;rsquo;s around there somewhere&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-249&#34;&gt;March 8, 2006 4:04 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Dan. Considering how many of those words have different meanings in different contexts (even when limited to the domain of information systems), these glossaries are very valuable supplements to the W3C specs themselves.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>Easy, professional-looking websites with open source CSS</title>
      <link>https://www.bobdc.com/blog/easy-professionallooking-websi/</link>
      <pubDate>Fri, 03 Mar 2006 09:22:21 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/easy-professionallooking-websi/</guid>
      
      
      <description><div>And, they work well with straightforward XHTML.</div><div>&lt;p&gt;Monday night I went to my first meeting of the &lt;a href=&#34;http://www.neonguild.org/&#34;&gt;Neon Guild&lt;/a&gt;, an association of local Charlottesville technology professionals. (Web designers seemed to dominate, perhaps due to the theme of this month&amp;rsquo;s meeting.) I learned something very valuable about web design: that free, open source CSS stylesheets are available at &lt;a href=&#34;http://www.openwebdesign.org/&#34;&gt;Open Web Design&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://openwebdesign.org/userinfo.phtml?user=kpgururaja&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/slick.jpg&#34; alt=&#34;[NoProbs thumbnail]&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;20px&#34; vspace=&#34;20px&#34; width=&#34;240px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are additional open source CSS stylesheets (and many of the same ones) at &lt;a href=&#34;http://www.oswd.org/&#34;&gt;Open Source Web Design&lt;/a&gt;, but after a quick skim I saw more designs that I liked at Open Web Design, which offers nearly 1500 to choose from. Each stylesheet comes with an index.html page to demonstrate it, and it&amp;rsquo;s always a pleasant surprise to see the simplicity of these files. They&amp;rsquo;re usually valid, vanilla XHTML with basic sections such as the main content, side column, menu, and footer wrapped with &lt;code&gt;div&lt;/code&gt; elements that have specific &lt;code&gt;id&lt;/code&gt; values to provide a handle to the CSS stylesheet. (Aging SGML alumni might be tempted to shout &amp;ldquo;We won! People who used to resist the separation of structure from content finally see its value!&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;I looked through the highest-rated ones on the site and ended up taking one called NoProbs, written by a guy in Bangalore who goes by both kpgururaja and Gururaj, and using his CSS stylesheet and index.html model to redo &lt;a href=&#34;http://www.snee.com/bob&#34;&gt;my own home page&lt;/a&gt;. If you compare it to &lt;a href=&#34;http://www.snee.com/bob/index-old.html&#34;&gt;my old one&lt;/a&gt;, which has been through different color schemes and the same design for several years now, I&amp;rsquo;m sure you&amp;rsquo;ll see the improvement. My original one was still a collection of individually edited static pages, and this time I put everything I wanted in &lt;a href=&#34;http://www.snee.com/bob/homepage.xml&#34;&gt;one XML file&lt;/a&gt; and used an &lt;a href=&#34;http://www.snee.com/bob/noprobcss.xsl&#34;&gt;XSLT stylesheet&lt;/a&gt; to convert that to the collection of pages with the right &lt;code&gt;div&lt;/code&gt; wrappers and &lt;code&gt;div/@id&lt;/code&gt; attributes to work correctly with the NoProbs CSS stylesheet. I&amp;rsquo;m sure the same XSLT stylesheet could create web pages for other Open Web Design CSS stylesheets with only minor tweaks. The XML of pre-transformation content also includes a bit of XInclude to incorporate blurbs from the Atom feeds about this weblog and about my recent XML.com articles.&lt;/p&gt;
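&lt;p&gt;The same single-source idea can be sketched in a few lines of Python using only the standard library (which has no XSLT processor). The element names and &lt;code&gt;id&lt;/code&gt; values below are invented placeholders, not those of the actual NoProbs stylesheet:&lt;/p&gt;

```python
# Rough sketch of the single-source idea: read one XML file of page
# content and wrap each piece in the div/@id structure that an
# off-the-shelf CSS stylesheet hooks onto. Element names and id values
# are invented placeholders, not the actual NoProbs ones.
import xml.etree.ElementTree as ET

SOURCE = """<site>
  <page name="index"><title>Home</title><body>Welcome.</body></page>
  <page name="writing"><title>Writing</title><body>Books and articles.</body></page>
</site>"""

def page_to_html(page):
    """Build one HTML page whose div ids give the CSS its handles."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = page.findtext("title")
    body = ET.SubElement(html, "body")
    content = ET.SubElement(body, "div", {"id": "content"})
    ET.SubElement(content, "h1").text = page.findtext("title")
    ET.SubElement(content, "p").text = page.findtext("body")
    footer = ET.SubElement(body, "div", {"id": "footer"})
    footer.text = "Generated from one XML file"
    return ET.tostring(html, encoding="unicode")

site = ET.fromstring(SOURCE)
pages = {p.get("name"): page_to_html(p) for p in site.findall("page")}
```

&lt;p&gt;Adding a page then means adding one element to the source file and rerunning the script, which is the same property that makes the XSLT version easy to maintain.&lt;/p&gt;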
&lt;p&gt;The whole setup will make it very easy to add a new page and see it show up in the menu or to change details of menu entries without requiring the installation and use of any &amp;ldquo;framework.&amp;rdquo; And, if I ever need to put together a slick-looking web site for another project, I know where my first stop will be.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Writing about software: what documentation does a product need?</title>
      <link>https://www.bobdc.com/blog/writing-about-software-what-do/</link>
      <pubDate>Mon, 27 Feb 2006 09:49:47 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/writing-about-software-what-do/</guid>
      
      
      <description><div>Helping users get the most value from a piece of software.</div><div>&lt;p&gt;Based on my experience as a tech writer and a few books I&amp;rsquo;ve written since then, someone recently asked me for advice on documenting a complex software product, and I thought I&amp;rsquo;d share my advice here. After writing it I remembered that two years ago I submitted a proposal to O&amp;rsquo;Reilly for a book on writing about software. After kicking the idea around with Simon St. Laurent a bit, I decided not to go through with it. One thing that put me off was his warning that a place like O&amp;rsquo;Reilly was obviously full of people with very specific opinions on the subject, so a settled outline alone was starting to sound like too much work. Since I had gathered some potentially useful material, and this is a weblog where I can put my opinions and leave others&amp;rsquo; opinions to their own weblogs, I&amp;rsquo;m going to make this a &lt;a href=&#34;http://www.bobdc.com/categories/documenting-software/&#34;&gt;series&lt;/a&gt;. If you&amp;rsquo;re at a small software company, or thinking of launching a product, or just interested in augmenting your coding skills with some complementary prose writing skills, it may be useful. Big software companies already know all this (in fact, we&amp;rsquo;ll see that they provide some very handy resources), and everyone else won&amp;rsquo;t really care. This first installment addresses the question that I was recently asked: what documentation does a software product need?&lt;/p&gt;
&lt;p&gt;The most popular approach since the beginning of the PC era is to think in terms of three volumes, or, if you&amp;rsquo;re not creating bound, hardcopy books, three sets of information that should be easily available to the user. To let my tech writer roots show, I&amp;rsquo;ll put them in a bulleted list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Getting Started&lt;/strong&gt;: (a.k.a. Tutorial) Assuming that the product in question lets the user create software widgets, this leads a user with only a vague idea of the product&amp;rsquo;s use through the minimal amount that he or she needs to know to get some work done. It may start by walking you through the loading of a widget into the product and demonstrating basic tasks that you can do with a loaded one. Then, it leads you through the creation and use of a new, much simpler widget. Finally, it describes the other things that you can do and where to find out more about these tasks in the other two volumes. The combination of this Getting Started guide and a free version of the product makes an excellent marketing tool.&lt;/p&gt;
&lt;p&gt;When putting something like this online with no hardcopy version available, remember: the classic advantages of hypertext fall short here, because after a given section, users don&amp;rsquo;t want a choice of where to go next; they won&amp;rsquo;t know where to go. They want to be shown. That&amp;rsquo;s why they&amp;rsquo;re reading the &amp;ldquo;Getting Started&amp;rdquo; manual.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;User Guide&lt;/strong&gt;: (a.k.a. User/Users Manual) a task-oriented approach to using the product. For each &amp;ldquo;how do I do X?&amp;rdquo; that may come up in the user&amp;rsquo;s mind, there should be a corresponding entry in the User Guide table of contents.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reference Guide&lt;/strong&gt;: (a.k.a. Reference Manual) everything the user might want to know, organized more like the product itself. For example, every menu entry and dialog box should have an index entry. This volume should answer every &amp;ldquo;What&amp;rsquo;s that for?&amp;rdquo; question a user might have while using the product. It should also make it clear how to get to those dialog boxes and menu choices—I&amp;rsquo;ve seen too much documentation that tells me to set something by filling out a particular field on some dialog box without giving me a clue as to where to find that dialog box.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The User Guide is the most work here, because the Getting Started manual can be written in a weekend and the organization and content coverage of the Reference Manual will be self-evident from a detailed review of the product. The User Guide must solve the users&amp;rsquo; problems, and assembling a reasonably complete list of potential problems is much more difficult than listing all the menu choices and dialog boxes.&lt;/p&gt;
&lt;p&gt;When I first decided to write &lt;a href=&#34;http://www.snee.com/bob/xsltquickly/&#34;&gt;XSLT Quickly&lt;/a&gt;, which took a user guide approach to the language after the tutorial in the first twenty pages, I deliberately drafted a table of contents before I had learned much of XSLT. From years of writing Omnimark programs to process SGML, I knew the important tasks faced by someone developing markup conversion programs. I knew that a section title like &amp;ldquo;Getting the most out of xsl:output&amp;rdquo; would be useless to someone learning XSLT, so I had titles like &amp;ldquo;Non-XML Output&amp;rdquo; and &amp;ldquo;Whitespace: preserving and controlling.&amp;rdquo; (Lord knows the latter is a &lt;a href=&#34;http://norman.walsh.name/2006/02/23/whitespace&#34;&gt;common problem&lt;/a&gt; with XML processing.)&lt;/p&gt;
&lt;p&gt;Should you deliver documentation as printed, bound books? As online help included with the product? As HTML files on your company&amp;rsquo;s website? Be prepared to deliver all of them, as well as formats that haven&amp;rsquo;t been invented yet. This is why people store such documentation in &lt;a href=&#34;http://www.docbook.org&#34;&gt;DocBook&lt;/a&gt; XML and, increasingly, &lt;a href=&#34;http://en.wikipedia.org/wiki/DITA&#34;&gt;DITA&lt;/a&gt;; it&amp;rsquo;s also why so many first-generation XML geeks started off as SGML geeks employed as tech writers at software companies: they had to create content that could be mixed and matched and published in multiple media. At &lt;a href=&#34;http://www.informationbuilders.com/&#34;&gt;Information Builders&lt;/a&gt;, I developed scripts to convert documentation into online help for mainframes, OS/2, Unix flavors, and Windows winhelp—remember &lt;a href=&#34;http://msdn2.microsoft.com/en-us/library/5y1h2zxw.aspx&#34;&gt;hlp&lt;/a&gt; files? The main target for IBI at the time, though, was always printed bound books. If you&amp;rsquo;re planning to develop these three volumes and online help, the Reference Guide will provide the bulk of the on-line help material. Clicking the Help button on the Foobar dialog box should lead to the same information that the &amp;ldquo;Foobar dialog box&amp;rdquo; index entry points to in the printed Reference Guide.&lt;/p&gt;
&lt;p&gt;Once, when I complained about the difficulty of using a product&amp;rsquo;s documentation, a saleswoman from the product&amp;rsquo;s vendor kept saying &amp;ldquo;but it&amp;rsquo;s all online!&amp;rdquo; as if that alone made it wonderful. I replied &amp;ldquo;Yes, but the online documentation is badly organized, and I can never find the answers I need.&amp;rdquo; (Have you noticed lately how a Google search on the use of a particular Microsoft Word feature retrieves a good answer faster than Word&amp;rsquo;s built-in online help does?) A lack of paper documentation does not absolve the designer of a product costing hundreds of dollars from an obligation to help users quickly find answers to questions that fall into three categories: How do I get up and running with this product? How do I make the product do this particular task? What will that aspect of the product do for me? These answers may or may not be grouped under the headers &amp;ldquo;Getting Started,&amp;rdquo; &amp;ldquo;User Guide,&amp;rdquo; and &amp;ldquo;Reference Guide,&amp;rdquo; but a clear representation of these three concepts will help users get more efficient work done with the software product, turning them into happier customers.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/documenting-software">documenting software</category>
      
    </item>
    
    <item>
      <title>Googling DITA</title>
      <link>https://www.bobdc.com/blog/googling-dita/</link>
      <pubDate>Thu, 23 Feb 2006 18:34:10 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/googling-dita/</guid>
      
      
      <description><div>Information Typing Architecture or Fetish Star?</div><div>&lt;p&gt;For an upcoming post, I was going to link the acronym DITA to a good page for more information about the Darwin Information Typing Architecture, a DTD that is growing in popularity for representing technical information. Not for the first time, I did a Google search on &amp;ldquo;DITA&amp;rdquo; and laughed out loud upon seeing what the first hit was. I don&amp;rsquo;t want to spoil it for you, but suffice to say that I can&amp;rsquo;t tell you much anyway because I&amp;rsquo;m at work and the Websense filter won&amp;rsquo;t let me follow the link. Try it yourself: &lt;a href=&#34;http://www.google.com/search?q=dita&#34;&gt;http://www.google.com/search?q=dita&lt;/a&gt;. (2020-02-23 update: at the time I wrote that, the number one hit was &lt;a href=&#34;http://www.dita.net/&#34;&gt;dita.net&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id=&#34;5-comments&#34;&gt;5 Comments&lt;/h2&gt;
&lt;p&gt;By Erik Hennum on &lt;a href=&#34;#comment-237&#34;&gt;February 23, 2006 9:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Um, DITA enthusiasts (those of a topic orientation) often suggest searching for &amp;ldquo;DITA XML.&amp;rdquo; A modest demonstration of the need for semantic search?&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bobdc.blog&#34; title=&#34;http://www.snee.com/bobdc.blog&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-238&#34;&gt;February 23, 2006 11:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That would certainly imply different semantics for the &amp;ldquo;I&amp;rsquo;m Feeling Lucky&amp;rdquo; button.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By Eliot Kimber on &lt;a href=&#34;#comment-239&#34;&gt;February 24, 2006 7:28 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That is very funny. Could certainly make the time spent on the DITA TC conference calls pass a little more quickly&amp;hellip;.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://norman.walsh.name/&#34; title=&#34;http://norman.walsh.name/&#34;&gt;Norman Walsh&lt;/a&gt; on &lt;a href=&#34;#comment-240&#34;&gt;February 24, 2006 10:30 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For better or worse, I tend to attach DITA to its Wikipedia page when I mention it. In fact, I use links to Wikipedia in constructing the &amp;ldquo;subject index&amp;rdquo; page for my site, but that&amp;rsquo;s a different topic altogether :-)&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.billtrippe.com&#34; title=&#34;http://www.billtrippe.com&#34;&gt;Bill Trippe&lt;/a&gt; on &lt;a href=&#34;#comment-241&#34;&gt;February 26, 2006 10:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And she&amp;rsquo;s Marilyn Manson&amp;rsquo;s main squeeze, which, somehow, makes it even better.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/dita">DITA</category>
      
    </item>
    
    <item>
      <title>Making hot sauce</title>
      <link>https://www.bobdc.com/blog/making-hot-sauce/</link>
      <pubDate>Mon, 20 Feb 2006 17:08:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/making-hot-sauce/</guid>
      
      
      <description><div>Learning a new technology.</div><div>&lt;p&gt;When learning any new technology, I like to start with the smallest, most stripped-down &amp;ldquo;&lt;a href=&#34;http://en.wikipedia.org/wiki/Hello_world&#34;&gt;hello world&lt;/a&gt;&amp;rdquo; app I can. I want to create the most minimal demonstration that qualifies as a working example and then build from there so that it&amp;rsquo;s absolutely clear to me what is truly necessary and what each extra adds. As it turns out, this makes plenty of sense when learning how to make hot sauce, and the &amp;ldquo;hello world&amp;rdquo; of hot sauce is remarkably simple.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.texaspete.com&#34;&gt;&lt;img src=&#34;http://secure.mycart.net/product_images/catalog7805/picname1097375.jpg&#34; alt=&#34;[Texas Pete bottles]&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Small children like to think of putting hot sauce on their pizza as a dangerous, exciting, grown-up thing to do. &lt;a href=&#34;http://www.texaspete.com/&#34;&gt;Texas Pete&lt;/a&gt; is a good one to start with because while it&amp;rsquo;s not very hot, it&amp;rsquo;s still a red sauce in a bottle with a red and yellow label that implies burning heat. My younger daughter has moved beyond Texas Pete, because a store near us has such a good selection that we love picking out new ones, and when she pointed out the &lt;a href=&#34;http://scientificsonline.com/Product.asp_Q_pn_E_3097700&#34;&gt;Hot Sauce Kit&lt;/a&gt; in the Edmund Scientific catalog, we got it for her for Christmas.&lt;/p&gt;
&lt;p&gt;Stories of the &lt;a href=&#34;http://tabasco.com/tabasco_history/hot_pepper.cfm?xcode=history_aged_oak&#34;&gt;Tabasco&lt;/a&gt; company packing peppers in salt to ferment them for years before using them always gave me the impression that making hot sauce was a complicated process, but it doesn&amp;rsquo;t have to be. The Edmund kit turned out to be bottles, dried spices, and directions. If you skip the kit, the most difficult part to acquire is the bottles, which you can get by cleaning out existing ones as you use up store-bought hot sauce. (A small funnel to get your product into the bottle is handy.) We started with the simplest recipe in the kit, so here&amp;rsquo;s my approximation, the &amp;ldquo;hello world&amp;rdquo; of hot sauce:&lt;/p&gt;
&lt;p&gt;Cut the stems off of some hot peppers and blanch them in boiling white vinegar for two or three minutes. Put the peppers, half a cup of the hot vinegar, and a teaspoon of salt into a food processor or minichopper, puree it, and put it into a bottle.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s it. There are hundreds of optional steps, with our first tier being:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://scientificsonline.com/Product.asp_Q_pn_E_3097700&#34;&gt;&lt;img src=&#34;http://scientificsonline.com/images/250/30977-00-silo.jpg&#34; alt=&#34;[Some Like It Hot kit]&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Wait a week before extensive consumption, because it does improve with age.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For the brief cooking of the peppers, grilling them adds to the flavor, but you should still boil (or just microwave) the vinegar a bit before pureeing the ingredients together.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A little sugar is a typical ingredient in some of the hotter sauces.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For other optional ingredients, look at the label of your favorite hot sauces and do some web searches. The Edmund kit came with spices such as garlic powder and dried ginger, but fresh garlic and ginger are obviously better, and lime juice is great. We put enough ingredients (including peppers that weren&amp;rsquo;t too hot) into our first few hot sauces that the result was gloppier than your typical Tabasco-type sauce, so yesterday we used habaneros (hot enough that you should use rubber gloves when cutting them) to make a sauce that would be easier to shake out of a bottle while still being hot. Two orange habaneros made us a bottle that wasn&amp;rsquo;t quite as hot as a typical new bottle of standard Tabasco sauce. We also put slices of ginger in the vinegar when we heated it up and added sugar, lime juice, and chopped garlic to the minichopper mix.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s an excellent parent-kid project, especially when evaluating optional ingredients to add. If any of my relatives are reading this, please act surprised next Christmas when you receive bottles of hot sauce with elaborately designed labels as presents. And, as with making a few batches of beer or taking lessons on a musical instrument for just a few months, the experience gives you a better appreciation of which professionals are particularly good at what they do.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://planb.nicecupoftea.org/&#34; title=&#34;http://planb.nicecupoftea.org/&#34;&gt;Libby&lt;/a&gt; on &lt;a href=&#34;#comment-226&#34;&gt;February 21, 2006 8:18 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob, perhaps you&amp;rsquo;d also like &lt;a href=&#34;http://www.flickr.com/photos/nicecupoftea/68746990/&#34;&gt;open source cola&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By Bob DuCharme on &lt;a href=&#34;#comment-227&#34;&gt;February 21, 2006 8:44 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Libby!&lt;/p&gt;
&lt;p&gt;In my posting I forgot to mention &lt;a href=&#34;http://www.cookingforengineers.com&#34;&gt;Cooking for Engineers&lt;/a&gt;, a web site that should appeal to hungry geeks.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/neat-tricks">neat tricks</category>
      
    </item>
    
    <item>
      <title>Linking in</title>
      <link>https://www.bobdc.com/blog/linking-in/</link>
      <pubDate>Fri, 17 Feb 2006 12:57:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/linking-in/</guid>
      
      
      <description><div>The &#34;social networking&#34; site turns out to be useful and practically fun.</div><div>&lt;p&gt;I answered my first few &lt;a href=&#34;http://www.linkedin.com&#34;&gt;LinkedIn&lt;/a&gt; invitations with an RDF geek response: &amp;ldquo;look, I&amp;rsquo;ll point to your &lt;a href=&#34;http://www.foaf-project.org/&#34;&gt;FOAF&lt;/a&gt; file if you want to point one at &lt;a href=&#34;http://www.snee.com/bob/foaf.rdf&#34;&gt;mine&lt;/a&gt;.&amp;rdquo; When I started gathering information for a job search, &lt;a href=&#34;http://ourworld.compuserve.com/homepages/Ken_North/&#34;&gt;Ken North&lt;/a&gt; suggested that I reconsider my attitude about LinkedIn, so I joined up. The first surprise was how many RDF geeks I saw there. The XML community in general is pretty well-represented.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.linkedin.com&#34;&gt;&lt;img src=&#34;https://www.linkedin.com/img/logos/logo.gif&#34; alt=&#34;[LinkedIn logo]&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The next surprise was how much fun it could be to troll around, checking out my contacts&amp;rsquo; contacts and finding out how many friends I didn&amp;rsquo;t think knew each other actually do. For example, Priscilla Walmsley, Micah Dubinko, Dale Waldt, Betty Harvey, Eve Maler, Tim Bray, and Zarella Rendon all know &lt;a href=&#34;http://www.billtrippe.com/&#34;&gt;Bill Trippe&lt;/a&gt;. I only met Bill briefly once after a talk he gave in New York, but my brother once worked for him at INSO, my LexisNexis boss Chet Ensign interviewed him when he did an SGML book for Prentice-Hall, I own a copy of Bill&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0764548891/bobducharmeA/&#34;&gt;book on DRM technology&lt;/a&gt;&amp;hellip; apparently there were far more connections than I realized. So I sent him a LinkedIn invitation, and he &lt;a href=&#34;http://www.billtrippe.com/archives/2006/02/linkedin.html&#34;&gt;mentioned it&lt;/a&gt; in his blog.&lt;/p&gt;
&lt;p&gt;If there&amp;rsquo;s someone you want to meet, LinkedIn tells you someone you know (in my case, usually former RIA co-worker &lt;a href=&#34;http://www.axtiveminds.com/&#34;&gt;Dale Waldt&lt;/a&gt;) who is one degree closer to the person you want to meet and offers to route a request for an introduction. You write a cover message to the person you want to meet and another to the person you know (&amp;ldquo;Yo, Dale, please forward this&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;product&amp;rdquo; LinkedIn offers is interesting: a collection of data, stepped access to it, and the ability to run one of the most classic computer science algorithms against it. (Well, actually, one from a &lt;a href=&#34;http://www.nist.gov/dads/HTML/shortestpath.html&#34;&gt;family of algorithms&lt;/a&gt;—if you came up with a good new member for the family, you would be considered quite the computer scientist.) This is all free, so far; they make their money from advertising and from offering memberships with greater access to data and features.&lt;/p&gt;
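&lt;p&gt;The simplest member of that family is breadth-first search over an unweighted contact graph, which is all that a degrees-of-separation lookup needs. Here is a minimal sketch in Python, with an invented four-person network standing in for real contact data:&lt;/p&gt;

```python
from collections import deque

def introduction_chain(contacts, start, target):
    """Breadth-first search over an unweighted contact graph: returns
    the shortest chain of people linking start to target, or None if
    they are not connected. This is the simplest member of the
    shortest-path family of algorithms."""
    queue = deque([[start]])  # each queue entry is a path so far
    seen = {start}
    while queue:
        path = queue.popleft()
        person = path[-1]
        if person == target:
            return path
        for friend in contacts.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None

# A made-up four-person network, not anyone's real contact list:
contacts = {
    "Bob": ["Dale"],
    "Dale": ["Bob", "Carol"],
    "Carol": ["Dale", "Ted"],
    "Ted": ["Carol"],
}
print(introduction_chain(contacts, "Bob", "Ted"))  # ['Bob', 'Dale', 'Carol', 'Ted']
```

&lt;p&gt;The chain that comes back is exactly the list of people who would have to forward your request for an introduction.&lt;/p&gt;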
&lt;p&gt;&lt;a href=&#34;http://www.foaf-project.org/&#34;&gt;&lt;img src=&#34;http://www.foaf-project.org/images/foaflets.jpg&#34; alt=&#34;[FOAF logo]&#34; width=&#34;120px&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;FOAF files are fun, and interesting demo apps have been built around them, but instead of listing our actual friends they tend to only list friends of ours with FOAF files. I&amp;rsquo;ve joked that it might be more accurate to call them FOED files, for &amp;ldquo;Friends Of Edd Dumbill,&amp;rdquo; because Edd shows up in so many of the ones I&amp;rsquo;ve seen. (I added the URL for &lt;a href=&#34;http://www.snee.com/bob/img/EddBob12-03.jpg&#34;&gt;this picture&lt;/a&gt; to my FOAF link to Edd to give my FOAF file a little &lt;a href=&#34;http://rdfweb.org/2002/01/photo/&#34;&gt;co-depiction&lt;/a&gt; juice. I don&amp;rsquo;t know what the strange bowl of liquid near the sugar packets in the picture is, but I suppose if I assign it a URL then Edd and I will be linked to it.)&lt;/p&gt;
&lt;p&gt;It would be unfair to the FOAF project to draw too many comparisons to LinkedIn. I&amp;rsquo;ll just say that as a dot com with some real money behind them, LinkedIn has built a useful application, and if they opened it up with an API, the results would be fascinating. And thanks, Ken, for pushing me to join!&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://plindenbaum.blogspot.com&#34; title=&#34;http://plindenbaum.blogspot.com&#34;&gt;Pierre&lt;/a&gt; on &lt;a href=&#34;#comment-221&#34;&gt;February 17, 2006 4:49 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi, the problem with FOAF files is that it seems to me that the only people using them are computer geeks. Of course most people don&amp;rsquo;t know anything about RDF, how to install a FOAF file on a server, what to do with it, how to modify it, etc., whereas creating a profile on LinkedIn is straightforward and it is really easy to search in your network (the problem is to convince non-computer friends to accept my invitation and to extend their network :-) &amp;hellip;). Other, less well-known social networks generate FOAF from your profile:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;O&amp;rsquo;Reilly (&lt;a href=&#34;http://connection.oreilly.com/&#34;&gt;http://connection.oreilly.com/&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://videntity.org/&#34;&gt;http://videntity.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;?&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On my side, I&amp;rsquo;ve been trying to generate and display FOAF files for Biologist: see &lt;a href=&#34;http://www.urbigene.com/foaf/&#34;&gt;SciFOAF&lt;/a&gt; and &lt;a href=&#34;http://www.urbigene.com/foafexplorer/&#34;&gt;MyFOAFExplorer&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://planb.nicecupoftea.org/&#34; title=&#34;http://planb.nicecupoftea.org/&#34;&gt;Libby&lt;/a&gt; on &lt;a href=&#34;#comment-222&#34;&gt;February 17, 2006 6:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Pierre, also if you&amp;rsquo;ve not seen it: &lt;a href=&#34;http://rdfweb.org/topic/DataSources&#34;&gt;http://rdfweb.org/topic/DataSources&lt;/a&gt; has a bunch of exporters of foaf. Your tools are very nice. Care to add something to &lt;a href=&#34;http://esw.w3.org/topic/SemanticWebDOAPBulletinBoard&#34;&gt;http://esw.w3.org/topic/SemanticWebDOAPBulletinBoard&lt;/a&gt; for us?&lt;/p&gt;
&lt;p&gt;Bob, I can&amp;rsquo;t work out how to &amp;lsquo;friend&amp;rsquo; you in linkedin - since I know you already it seems a bit weird to persuade one of my friends to introduce us. Guess I need to upgrade my account&amp;hellip;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>I&#39;m available</title>
      <link>https://www.bobdc.com/blog/im-available/</link>
      <pubDate>Sun, 12 Feb 2006 19:07:09 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/im-available/</guid>
      
      
      <description><div>Moving on from LexisNexis.</div><div>&lt;p&gt;After five years of working on XML architecture and metadata issues on a huge scale at perhaps the world&amp;rsquo;s oldest commercial online information provider, I&amp;rsquo;ll be moving on in late March, and I&amp;rsquo;m looking for interesting new opportunities. (&amp;ldquo;A-ha!&amp;rdquo; say all the friends wondering about my recent interest in LinkedIn.)&lt;/p&gt;
&lt;p&gt;As a wildly successful standard, XML lets us assemble tools and technologies from very different sources to create new possibilities for the creation and use of data, and these new possibilities are what excite me. Some in the XML world feel that XML&amp;rsquo;s 2006 status as &amp;ldquo;the new ASCII&amp;rdquo; that is simply &amp;ldquo;part of the plumbing&amp;rdquo; has made it boring, but to me, plumbing is how you hook systems together, and each new XML-related technology and HTTP-capable device or platform that crops up around us adds to the combination of things that we can hook up, and therefore to the number of cool applications that we can create. These applications can lead to new products from your company, new features in existing products, or to new, more efficient internal systems if you&amp;rsquo;re not in the business of selling software or information products.&lt;/p&gt;
&lt;p&gt;With my software development roots in the SGML world, information products are where I&amp;rsquo;ve had the most experience. Content is a business asset, and new technology that lets us sort, index, cross-reference, and slice and dice that content adds value to it because it lets people use that information to serve more purposes, both for content owners and for their customers. I&amp;rsquo;m particularly fascinated by the growing ability to do this with free software, because it puts so many opportunities into so many hands.&lt;/p&gt;
&lt;p&gt;I have extensive background in learning new technology and evaluating its potential for both end-user and management audiences. I&amp;rsquo;m happy to speak to small or large audiences in person or over the phone, and while I&amp;rsquo;ve never been a journalist, my tech writing background has taught me how to do professional writing quickly. This background also gives me a classic &amp;ldquo;doc-head&amp;rdquo; background, for those who follow the &lt;a href=&#34;http://www.snee.com/xml/xml2004paper.html&#34;&gt;document-oriented/data-oriented distinction&lt;/a&gt; in XML architectures, but since then, I received a Masters degree in computer science, giving me a broader system architecture perspective than that of more typical former tech writers.&lt;/p&gt;
&lt;p&gt;My family and I want to stay in Charlottesville, but I&amp;rsquo;m not averse to travel, and we are an afternoon&amp;rsquo;s drive away from Washington D.C. I do have plenty of telecommuting experience, including the giving of presentations to people I can&amp;rsquo;t see who are spread around the world.&lt;/p&gt;
&lt;p&gt;So send me an email (&lt;a href=&#34;mailto:bob@snee.com&#34;&gt;bob@snee.com&lt;/a&gt;) if you have any ideas, and feel free to let people know about my &lt;a href=&#34;http://www.snee.com/bob/resume.pdf&#34;&gt;resume&lt;/a&gt; (Acrobat format; for more information see my &lt;a href=&#34;http://www.snee.com/bob/&#34;&gt;home page&lt;/a&gt; under &lt;a href=&#34;http://www.snee.com/bob/xmlsgml.html&#34;&gt;xml, sgml&lt;/a&gt; and &lt;a href=&#34;http://www.snee.com/bob/worksch.html&#34;&gt;work, school&lt;/a&gt;). I&amp;rsquo;m excited by many industry and non-industry developments I&amp;rsquo;ve been hearing about, and I look forward to hearing more.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/miscellaneous">miscellaneous</category>
      
    </item>
    
    <item>
      <title>Pulling data out of computers in the mid-twentieth and early twenty-first centuries</title>
      <link>https://www.bobdc.com/blog/pulling-data-out-of-computers/</link>
      <pubDate>Wed, 08 Feb 2006 17:48:02 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/pulling-data-out-of-computers/</guid>
      
      
      <description><div>Report generation in the 1950s and the future of RDF.</div><div>&lt;p&gt;I&amp;rsquo;ve &lt;a href=&#34;https://www.bobdc.com/blog/25-years-of-database-history-s-1#more&#34;&gt;written before&lt;/a&gt; about W.C. McGee&amp;rsquo;s 1981 article in IBM&amp;rsquo;s Journal of Research and Development covering the history of database systems from 1955 to 1980, and I left off saying that I&amp;rsquo;d devote a separate entry to his history of report generation. The creation of reports may sound mundane, but throughout the history of computers the pulling of data meeting specific criteria is the most important thing we do with computers. (Why put data in or calculate new data in the first place?) Since writing that, I&amp;rsquo;ve found a more primary source, written by McGee in 1959 when he was at GE&amp;rsquo;s Hanford Atomic Products Operation, titled &lt;a href=&#34;http://portal.acm.org/citation.cfm?id=320955&#34;&gt;Generalization: Key to Successful Electronic Data Processing&lt;/a&gt; (unfortunately, this requires ACM membership or a fee to access).&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;generalization&amp;rdquo; he writes of is the creation of routines that can be re-used in multiple applications, such as a sorting routine. &amp;ldquo;Thus, by suitable &lt;em&gt;generalization&lt;/em&gt; [his italics] it is possible to design a sorting routine that will sort &lt;em&gt;any&lt;/em&gt; file, regardless of the data it contains.&amp;rdquo; While many take the principle of increased abstraction to promote code re-use for granted today, Harold Abelson and Gerald Jay Sussman don&amp;rsquo;t in their classic &lt;a href=&#34;http://mitpress.mit.edu/sicp/full-text/book/book.html&#34;&gt;Structure and Interpretation of Computer Programs&lt;/a&gt;, devoting plenty of pages to why it&amp;rsquo;s a Good Thing and the best way to go about it.&lt;/p&gt;
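&lt;p&gt;In today&amp;rsquo;s terms, McGee&amp;rsquo;s generalization is just parameterization: factor out the parts that vary so that one routine serves every application. A minimal sketch in Python, with made-up records and field names, of a sort routine that works on &amp;ldquo;any file, regardless of the data it contains&amp;rdquo;:&lt;/p&gt;

```python
def sort_file(records, field, descending=False):
    """A 'generalized' sort in McGee's sense: one routine that can sort
    any file of records on any field, because the field to sort on is a
    parameter rather than being wired into the code (or the plugboard)."""
    return sorted(records, key=lambda rec: rec[field], reverse=descending)

# Hypothetical payroll records for illustration:
payroll = [
    {"name": "Baker", "hours": 32},
    {"name": "Able", "hours": 40},
]
print(sort_file(payroll, "name"))                    # Able sorts first
print(sort_file(payroll, "hours", descending=True))  # 40 hours sorts first
```

&lt;p&gt;The same routine, handed a different file and a different field name, produces a different report; nothing inside it knows about payroll.&lt;/p&gt;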
&lt;p&gt;&lt;a href=&#34;http://portal.acm.org/citation.cfm?id=320955&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/rptgen.jpg&#34; alt=&#34;[diagram of records on magnetic tape]&#34; align=&#34;right&#34; border=&#34;0&#34; class=&#34;rightAlignedOpeningPicture&#34; width=&#34;450px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As an example of these routines, McGee describes &amp;ldquo;the generalized report generation and file maintenance routines [that] have been available to Data Processing planning personnel a little less than three months at the time of this writing.&amp;rdquo; Try to imagine the first ever techniques for modular generation of parameterized reports being more recent than &lt;a href=&#34;http://en.wikinews.org/wiki/Apple_introduces_new_iPod_with_video_playback_capabilities&#34;&gt;Apple&amp;rsquo;s introduction of the video iPod&lt;/a&gt; is today. The automated generation of printed reports based on mechanically stored data had already been around for decades in the use of punched card manipulation machines, but with no software to speak of, redesigning those reports meant rewiring plugboards. In McGee&amp;rsquo;s description of doing this with software, it&amp;rsquo;s interesting to read such a primary source on some of the earliest uses of terms such as file, record, and field. (Some of the earliest that I know of, anyway—I&amp;rsquo;d love to see pointers to earlier use of these terms.)&lt;/p&gt;
&lt;p&gt;McGee&amp;rsquo;s 1981 paper describes developments in report generation after his Hanford work, particularly in the early 1960s at IBM on their &lt;a href=&#34;http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP1401.html&#34;&gt;1401&lt;/a&gt; machine. This &amp;ldquo;Report Program Generator&amp;rdquo; program built on the Hanford work, and evolved into RPG II and RPG III, programs I remember seeing mentioned in job listings when I was younger. And now I know what &amp;ldquo;RPG&amp;rdquo; stands for!&lt;/p&gt;
&lt;p&gt;In the Hanford days, when people considered the use of alphanumeric assembly language symbols to represent machine opcodes to be a leap forward in ease and usability, a one-day turnaround for the design, implementation, and generation of reports added a huge amount of value to computers and to the data on them. As data models and delivery platforms have evolved since then, the ability of less technical users to more easily get the data they want has driven the adoption of many new platforms and data models, and, extrapolating to the future, I think that a report generator will be the killer app that RDF is waiting for. (Either that or a simplification of RDF comparable to what XML did to SGML.)&lt;/p&gt;
&lt;p&gt;While playing with Leigh Dodds&amp;rsquo; &lt;a href=&#34;http://www.ldodds.com/projects/twinkle/&#34;&gt;Twinkle&lt;/a&gt; SPARQL query tool, which I would describe as a simple IDE for the creation and running of queries and viewing of results, I was thinking about the possibility of a GUI tool that generates and runs queries against RDF data sets. (OK, RDF &amp;ldquo;graphs.&amp;rdquo; I don&amp;rsquo;t like this term because, while I know it refers to a class of data structures and not a class of pictures, it&amp;rsquo;s the kind of technical alternative use of plain English terms that has helped to confine most RDF use to academia.) This tool would allow a user unfamiliar with SPARQL or RDF syntax to fill out dialog boxes to show which data he or she wants, and it would then generate and run a SPARQL query without that user ever seeing that query. I think that a tool like this would help people appreciate the value of the flexibility of RDF, and Leigh agrees. Users of such a tool will miss out on several things, just as the legions of people using Crystal Report Writer who aren&amp;rsquo;t fluent in SQL are missing out on a few things, but like those people generating payroll and inventory reports from the relational databases, people using such an RDF query tool will get more useful work done than they would if they weren&amp;rsquo;t using data stored according to this model.&lt;/p&gt;
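&lt;p&gt;Under the hood, such a tool is mostly a query template filler: the user&amp;rsquo;s dialog-box choices become the variables and filters of a generated SPARQL query that the user never sees. A rough Python sketch of just the generation step, with invented property names (a real tool would also prepend the necessary PREFIX declarations and actually run the query):&lt;/p&gt;

```python
def build_sparql(fields, conditions):
    """Turn GUI selections into the SELECT/WHERE portion of a SPARQL query.
    fields: properties the user checked off to display; conditions:
    (property, value) pairs from filter dialogs. The 'ex:' prefix and all
    property names here are illustrative, not from any real vocabulary."""
    variables = " ".join(f"?{f}" for f in fields)
    # One triple pattern per displayed property:
    patterns = [f"?item ex:{f} ?{f} ." for f in fields]
    # One FILTER per dialog-box condition:
    filters = [f'FILTER (?{prop} = "{value}")' for prop, value in conditions]
    body = "\n  ".join(patterns + filters)
    return f"SELECT {variables}\nWHERE {{\n  {body}\n}}"

print(build_sparql(["title", "author"], [("author", "DuCharme")]))
```

&lt;p&gt;Checking off two properties and filling out one filter dialog yields a three-line WHERE clause; the user only ever sees the result table.&lt;/p&gt;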
&lt;p&gt;I&amp;rsquo;m going to try to prototype such a tool. So, if you see a lower frequency of postings on my weblog for a little while, rest assured that I&amp;rsquo;m doing something more productive than reading ancient computer science literature.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://codinginparadise.org&#34; title=&#34;http://codinginparadise.org&#34;&gt;Brad Neuberg&lt;/a&gt; on &lt;a href=&#34;#comment-199&#34;&gt;February 9, 2006 3:24 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cool post; I love computer history, especially finding out the roots of concepts I take for granted.&lt;/p&gt;
&lt;p&gt;I also like how you mentioned someone doing to RDF what XML did to SGML&amp;hellip;. that&amp;rsquo;s been a long overdue thing that I&amp;rsquo;ve expected to happen, but hasn&amp;rsquo;t. You should do a drastic refactoring and paring down (or a reconceptualization) of RDF to make it more amenable to the mainstream and easier to work with. You&amp;rsquo;re right about how technical language can also create barriers to adoption; when the RDF community throws around terms like ontology markup languages, you bet this scares off your standard enterprise developer.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://clarkparsia.com/&#34; title=&#34;http://clarkparsia.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-200&#34;&gt;February 9, 2006 3:37 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hey Bob,&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve done such a thing: &lt;a href=&#34;http://clarkparsia.com/projects/code/jspace/&#34;&gt;Jspace&lt;/a&gt; is a visual query builder for RDF databases with an interesting UI wrapped around it. Basically a &amp;ldquo;polyarchical browser&amp;rdquo;. There&amp;rsquo;s an in-browser version cooked up by the folks at Southampton, mSpace, which we&amp;rsquo;ve reimplemented as a Java app and used for the problem of database integration for NASA. They&amp;rsquo;re using it to do expertise location.&lt;/p&gt;
&lt;p&gt;So, basically, you convert existing databases into RDF, federate them, and then run JSpace against the result &amp;ndash; what you end up with is a pretty useful tool which, behind the scenes, is doing exactly what yr article suggests: building queries based on a user&amp;rsquo;s arbitrary navigation through an information space and presenting the results, i.e., reports, to the user in some novel ways. (Okay, I&amp;rsquo;m exaggerating a tiny bit: you also have to write a JSpace browser &amp;ldquo;model&amp;rdquo; which tells the tool how to relate the federated bits together, but it&amp;rsquo;s not very difficult to write and it&amp;rsquo;s just another RDF graph thingie.)&lt;/p&gt;
&lt;p&gt;For NASA we threw in social network graphs (to locate an expert in a rolodex culture, yr next best move is to call someone who is closer to that expert in the social network) just for fun! :&amp;gt;&lt;/p&gt;
&lt;p&gt;There are more features to add, including Atom channel generation for arbitrary queries, etc., but you might want to download and play with it some. It&amp;rsquo;s GPL&amp;rsquo;d. ;&amp;gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/technology-future">technology, future</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>Self-publishing bound, hardcopy books</title>
      <link>https://www.bobdc.com/blog/selfpublishing-bound-hardcopy/</link>
      <pubDate>Fri, 03 Feb 2006 12:31:28 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/selfpublishing-bound-hardcopy/</guid>
      
      
      <description><div>lulu.com plus free XML technology makes it pretty easy these days; I just did it.</div><div>&lt;p&gt;My first book was a crash course in basic end user tasks for using the MVS, VM/CMS, OS/400, VMS, and Unix operating systems: logging in, navigating the file system, using e-mail, using the text editor, listing, creating, and deleting files, and so forth. I wanted to call it &amp;ldquo;Fake Your Way Through Minis and Mainframes&amp;rdquo; but the McGraw-Hill Professional Book Division decided that &amp;ldquo;The Operating Systems Handbook&amp;rdquo; sounded more, well, professional, especially for a $49.50 hardcover. It was published in 1994 and sold a few thousand copies.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/bob/opsys.html&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/opsys75.jpg&#34; alt=&#34;[Operating Systems Handbook cover]&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34; width=&#34;180px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When it went out of print, I had McGraw-Hill revert the rights to me and put free Acrobat files of the book&amp;rsquo;s contents on its &lt;a href=&#34;http://www.snee.com/bob/opsys.html&#34;&gt;home page&lt;/a&gt;. I had used &lt;a href=&#34;http://archive.salon.com/21st/feature/1998/08/25feature.html&#34;&gt;XyWrite&lt;/a&gt; to write the book, so a few perl scripts converted the XyWrite files to &lt;a href=&#34;https://docbook.org/&#34;&gt;DocBook&lt;/a&gt; XML, and Norm Walsh&amp;rsquo;s &lt;a href=&#34;http://docbook.sourceforge.net/projects/xsl/&#34;&gt;DocBook stylesheets&lt;/a&gt; and &lt;a href=&#34;http://xmlgraphics.apache.org/fop/&#34;&gt;FOP&lt;/a&gt; made it pretty easy to create the Acrobat files. People download them, and the MVS section is particularly popular—I get several e-mails a year, some from ibm.com, thanking me for making it available.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve had some fun with &lt;a href=&#34;http://www.cafepress.com/cp/info/&#34;&gt;cafepress.com&lt;/a&gt;&amp;rsquo;s ability to put any image on a T-shirt, a coffee mug, a clock, a lunchbox&amp;hellip; the list has grown over time, and when they added the ability to create books from PDF files, I started to think about making a new, inexpensive bound hardcopy version of &amp;ldquo;The Operating Systems Handbook&amp;rdquo; available. After some research, I discovered &lt;a href=&#34;http://www.lulu.com&#34;&gt;lulu.com&lt;/a&gt;, which is more focused on book publishing than cafepress and makes it even less expensive. As with cafepress, you don&amp;rsquo;t spend a penny on anything unless you want to buy something you&amp;rsquo;ve designed for them to sell, and you can charge a little extra and keep the difference if you&amp;rsquo;re willing to do some extra paperwork. Of course their main clientele is people with novels and poetry that no publisher was interested in, but they&amp;rsquo;re handy for republishing out-of-print books about mainframes and minicomputers. (The fact that you can publish a single copy of a hardcover book with a color cover of your own design for $18 inspires the gag gift reflex in me, too.) I decided to do the whole book and a smaller, less expensive version with just the MVS part.&lt;/p&gt;
&lt;p&gt;I had to revise the stylesheet a little and generate new PDF files, because a bound book needs a wider left margin on odd-numbered pages and a wider right margin on even-numbered pages. While I was at it, I tweaked various other font and margin settings. Customizing the DocBook stylesheets is the classic use case for the difference between &lt;code&gt;xsl:include&lt;/code&gt; and &lt;code&gt;xsl:import&lt;/code&gt; in XSLT: instead of revising the actual DocBook stylesheets, you&amp;rsquo;re better off creating a new stylesheet that uses &lt;code&gt;xsl:import&lt;/code&gt; to import the main DocBook stylesheet, copying the parts that you want to revise into your new stylesheet, and then revising them there, where they&amp;rsquo;ll override the original DocBook stylesheet code. This way, when the DocBook stylesheets get upgraded, you can just import the new ones without worrying about losing your edits. (More on this in &lt;a href=&#34;http://www.xml.com/pub/a/2000/11/01/xslt/index.html&#34;&gt;this XML.com column&lt;/a&gt;.)&lt;/p&gt;
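&lt;p&gt;As a rough sketch, a customization layer along those lines can be as small as this (the &lt;code&gt;href&lt;/code&gt; value is illustrative and depends on where the DocBook stylesheets are installed, and the parameter value is just an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;?xml version=&#34;1.0&#34;?&amp;gt;
&amp;lt;!-- mybook.xsl: a customization layer; overrides here win over
     anything pulled in by xsl:import --&amp;gt;
&amp;lt;xsl:stylesheet xmlns:xsl=&#34;http://www.w3.org/1999/XSL/Transform&#34;
                version=&#34;1.0&#34;&amp;gt;
  &amp;lt;xsl:import href=&#34;docbook-xsl/fo/docbook.xsl&#34;/&amp;gt;
  &amp;lt;!-- copied from param.xsl and revised --&amp;gt;
  &amp;lt;xsl:param name=&#34;page.margin.top&#34;&amp;gt;0.5in&amp;lt;/xsl:param&amp;gt;
&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You then run your transforms against mybook.xsl instead of the stock DocBook stylesheet, and upgrading the DocBook stylesheets later never touches your edits.&lt;/p&gt;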
&lt;p&gt;&lt;a href=&#34;http://www.lulu.com/content/224105&#34;&gt;&lt;img src=&#34;http://www.lulu.com/author/display_thumbnail.php?fCID=224105&amp;fSize=zoom_&#34; alt=&#34;[Fake Your Way Through Minis and Mainframes cover]&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34; width=&#34;180px&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here are three important hints if you embark on this yourself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Get to know the param.xsl file that comes with the DocBook stylesheets. This gives you hooks to customize a lot of the stylesheet behavior without requiring you to dig deeply into the stylesheet code where the real programming logic is. For example, to reset the top margin of the printed pages, I just copied the &lt;code&gt;xsl:param&lt;/code&gt; element that created a &amp;ldquo;page.margin.top&amp;rdquo; parameter from param.xsl into my new stylesheet and assigned the value that I wanted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bob Stayton&amp;rsquo;s book &lt;a href=&#34;http://www.sagehill.net/book-description.html&#34;&gt;DocBook XSL: The Complete Guide&lt;/a&gt; lives up to its name and is an excellent reference for the tricky parts. You can buy it (Norm is quoted as saying &amp;ldquo;Buy this book&amp;rdquo;) and it&amp;rsquo;s also available online. The &lt;a href=&#34;http://www.sagehill.net/docbookxsl/PrintOutput.html&#34;&gt;Printed Output&lt;/a&gt; section was invaluable for my stylesheet tweaking. It especially helped me with the embedding of fonts in the PDF files, which lulu requires. Once you know how to do this, you&amp;rsquo;re no longer limited to the Courier, Helvetica, and Times fonts that default FOP usage allows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A &lt;code&gt;-Xmx128m&lt;/code&gt; parameter on the java command line raises the maximum Java heap size to 128 megabytes, which FOP needed to handle the full-sized version of my lulu book.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I didn&amp;rsquo;t get too fancy with the book covers, choosing to pick one of their built-in backgrounds and then playing with the font size and color a little for the cover and spine text. Some of the &lt;a href=&#34;http://www.lulu.com/author/cover_browse.php&#34;&gt;cover image choices&lt;/a&gt; are pretty funny, intended for a mystery novel or something—a photo of a human skull partially obscured by shadow? A shiny knife blade dripping blood on a white rose? You can upload your own cover image if you like.&lt;/p&gt;
&lt;p&gt;So, if you wrote a cookbook, a mystery novel, some other kind of novel, or you just want to create a single copy of &amp;ldquo;The Wit and Wisdom of [name of testimonial dinner subject here],&amp;rdquo; look into lulu.com. If your content is stored in DocBook XML or something that can easily be converted to DocBook, most of the necessary work has already been done for you. If you&amp;rsquo;re interested in the basics of MVS, check out the free Acrobat version or the $9.98 paperback version of &amp;ldquo;Fake Your Way Through MVS,&amp;rdquo; or find out about all the mini and mainframe operating systems from the PDF file or from the $19.98 printed, bound book shown above. All these are available from the &lt;a href=&#34;http://www.snee.com/bob/opsys.html&#34;&gt;book&amp;rsquo;s home page&lt;/a&gt;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/publishing">publishing</category>
      
    </item>
    
    <item>
      <title>Putting semantics on the web</title>
      <link>https://www.bobdc.com/blog/putting-semantics-on-the-web/</link>
      <pubDate>Tue, 31 Jan 2006 09:17:35 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/putting-semantics-on-the-web/</guid>
      
      
      <description><div>Painlessly adding RDF-compatible semantics to XHTML.</div><div>&lt;p&gt;University of Maryland semantic web researcher Jim Hendler (XML 2005 attendees will remember his keynote speech in Atlanta) closed a &lt;a href=&#34;https://www.mindswap.org/blog/2006/01/26/thnking-about-the-semantic-web/&#34;&gt;recent mindswap weblog entry&lt;/a&gt; by writing this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I worry that most of the Semantic Web community is doing work in Semantics, most of the rest are looking at Web apps, and hardly anyone is actually looking at the &amp;ldquo;Semantic Web&amp;rdquo; that I really care about&amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve felt for a while that too much semantic web work is about building ontologies and schemas and that not enough is about creating actual instance data to use. Imagine if, in the early days of XML, everyone spent most of their time creating DTDs and assumed that others would create XML documents to conform to their DTDs, while those others were just creating more DTDs themselves.&lt;/p&gt;
&lt;p&gt;More semantic web data would give the people doing semantic web research more opportunities to create interesting applications, and the W3C&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/2001/sw/BestPractices/HTML/&#34;&gt;RDF in XHTML Taskforce&lt;/a&gt;, chaired by Ben Adida, is doing some great work here. While their earlier work focused on convincing the RDF crowd of the value of this RDF-XHTML hybrid, recent drafts of the &lt;a href=&#34;http://www.w3.org/2001/sw/BestPractices/HTML/2006-01-24-rdfa-primer&#34;&gt;RDF/A Primer&lt;/a&gt; take on the bigger task of demonstrating this value to people on the other side of the RDF-XHTML fence. For example, the Primer demonstrates how adding a few new attributes here and there in your existing XHTML lets automated programs pull out your contact information and update the departmental directory or pull picture metadata from your web page about your recent trip and add it to your central database of photo information.&lt;/p&gt;
&lt;p&gt;In a posting today on &lt;a href=&#34;http://www.bnode.org/archives2/53&#34;&gt;Cologne&amp;rsquo;s 2nd Web Montag&lt;/a&gt;, Benjamin Nowack writes that a common question about the semantic web is &amp;ldquo;&amp;lsquo;where is the connection to my HTML pages?&amp;rsquo; (link between the clickable web and the semantic web).&amp;rdquo; I think we have our answer in RDF-XHTML. Airlines and movie theater chains are already creating web versions of the data we want, and a few new attributes in that HTML will be a lot less trouble for them than creating parallel RDF/XML files or most other options for offering machine-readable semantic data on the web. I look forward to the new possibilities that this opens up for both developers and consumers.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.vanschklift.com/blog&#34; title=&#34;http://www.vanschklift.com/blog&#34;&gt;biou&lt;/a&gt; on &lt;a href=&#34;#comment-167&#34;&gt;January 31, 2006 10:28 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;perhaps an alternative can be found in the duo GRDDL / microformats. It is another approach, perhaps less expressive, but with practical implications which make a direct link with the classical web.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-168&#34;&gt;January 31, 2006 11:05 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s definitely good work going on with GRDDL as well, but while it will be possible for different websites to use a shared library of stylesheets, the XSLT stylesheet used for extraction is another moving part for each web site to contend with.&lt;/p&gt;
&lt;p&gt;I think that RDF/A takes a rather microformat approach itself.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/semantic-web">semantic web</category>
      
    </item>
    
    <item>
      <title>Metadata since the nineteenth century</title>
      <link>https://www.bobdc.com/blog/metadata-since-the-nineteenth/</link>
      <pubDate>Fri, 27 Jan 2006 11:29:43 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/metadata-since-the-nineteenth/</guid>
      
      
<description><div>Two good books.</div><div>&lt;p&gt;After following Dave Beckett&amp;rsquo;s &lt;a href=&#34;http://journal.dajobe.org/journal/posts/2006/01/19/reading-list/&#34;&gt;pointer&lt;/a&gt; to Stefano Mazzocchi&amp;rsquo;s essay &lt;a href=&#34;http://www.betaversion.org/~stefano/linotype/news/95/&#34;&gt;On the Quality of Metadata&lt;/a&gt; last week, I remembered that while we have people like Stefano and Bruce D&amp;rsquo;Arcus among us with stronger backgrounds in more classical approaches to &lt;a href=&#34;http://technorati.com/blogs/metadata&#34;&gt;metadata&lt;/a&gt;, most geeks think that technology from ten years ago is ancient history. I&amp;rsquo;d like to recommend two books I&amp;rsquo;ve read recently for the historical background they provide on the creation, organization, and use of metadata to locate information: Peter Morville&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0596007655/bobducharmeA/&#34;&gt;Ambient Findability&lt;/a&gt; and Elaine Svenonius&amp;rsquo;s &lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0262194333/bobducharmeA/&#34;&gt;The Intellectual Foundation of Information Organization&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0596007655/bobducharmeA/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/0596007655.01._AA240_SCLZZZZZZZ_.jpg&#34; alt=&#34;[Morville book cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Morville&amp;rsquo;s book focuses on &amp;ldquo;&lt;a href=&#34;http://technorati.com/blogs/findability&#34;&gt;findability&lt;/a&gt;&amp;rdquo; as an engineering discipline. When you create something on the web, it&amp;rsquo;s no use to anyone if they can&amp;rsquo;t find it. While there is a seamier side to the search engine optimization efforts of people who see it as a way to get rich quickly (with yet another technology trail blazed by the porn industry), it&amp;rsquo;s a real problem for respectable companies with serious offerings. He writes that&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Hewlett-Packard has taken findability a step further by defining a &amp;ldquo;Findability Group&amp;rdquo; that includes an interdisciplinary team responsible for user interface design, information architecture, and search, thereby creating a vital bridge across vertical silos. Perhaps we will see more findability engineers and findability teams in the coming years.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He focuses on metadata and classification as ways to improve findability, with an eye on the implications of the new information delivery technologies cropping up around us. While many discussions of metadata that you read briefly mention card catalogs before plowing into talk of RDF and the semantic web, Morville&amp;rsquo;s &lt;a href=&#34;http://technorati.com/blogs/library%20science&#34;&gt;library science&lt;/a&gt; background gives him a broader perspective on the work that&amp;rsquo;s gone on for over a hundred years to create usable metadata. His breathless buzzword slinging (&amp;ldquo;RFID is a disruptive technology poised to shift paradigms&amp;rdquo;&amp;hellip;&amp;ldquo;Millions of bloggers swap memes in exchange for karma, whuffie, and other tokens of a reputation economy&amp;rdquo;) makes the book read like something from Wired magazine and will make it look dated pretty quickly, but his efforts to draw on pioneers of &lt;a href=&#34;http://technorati.com/blogs/information%20science&#34;&gt;information science&lt;/a&gt; to inform our approaches to new classes of content delivery systems make his book worth reading.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.amazon.com/exec/obidos/ISBN=0262194333/bobducharmeA/&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/svenoniuscover.jpg&#34; alt=&#34;[Svenonius book cover]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;While &amp;ldquo;Ambient Findability&amp;rdquo; may be more of a beach read than the MIT Press book &amp;ldquo;The Intellectual Foundation of Information Organization,&amp;rdquo; I had no problem reading Svenonius&amp;rsquo;s book on the beach at Cape May last summer. While Morville was a student of library and information science, Elaine Svenonius is a professor of library information science at UCLA, and provides a more sober, rigorous treatment of information organization issues in a book that, without backmatter, is roughly the same length as Morville&amp;rsquo;s.&lt;/p&gt;
&lt;p&gt;Svenonius describes practical and theoretical background in the development and use of metadata with a good historical context. The book covers milestones such as Anthony Panizzi&amp;rsquo;s mid-nineteenth century plan for organizing books in the British Library, the beginning of library &amp;ldquo;science&amp;rdquo; in the 1930s (now known as &amp;ldquo;Information Science&amp;rdquo;) at the Chicago Graduate Library School, Cyril Cleverdon&amp;rsquo;s invention of the precision/recall distinction in the late 1950s, and the development of &lt;a href=&#34;http://dublincore.org/&#34;&gt;Dublin Core&lt;/a&gt;. Did you know that Colon Classification (keep your intestinal jokes to yourself and see &lt;a href=&#34;http://dir.yahoo.com/Business_and_Economy/Business_to_Business/Food_and_Beverage/Beverages/Alcohol_and_Spirits/Beer/Organizations/&#34;&gt;Yahoo&lt;/a&gt; for an example, where they use a greater-than character instead of a colon) was invented in 1924? She shows examples of how the past can teach us lessons about dealing with new technology, especially regarding the politics of standardization:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the 1940s and 1950s, the Library of Congress also turned to specialists to draft rules for its growing collection of motion pictures, sound recordings, and pictures. The Library of Congress rules proved difficult to use and, as a result, were rejected by most school and public libraries. This led to a proliferation of locally developed manuals to describe nonbook materials, simultaneously abrogating the standardization principle and that of integration.&lt;/p&gt;
&lt;p&gt;Early in the 1970s reaction set in, and a swing began away from specialization and toward integration. Committees first in Canada and then in England and the United States began to formulate rules for nonbook materials that would be compatible with those used for books.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&amp;rsquo;s not all libraries and Dewey Decimal Systems; she covers many topics that are important to data geeks such as keyword searching, faceted classification, issues around creating and imposing controlled vocabularies, mapping of different names for the same entity such as &amp;ldquo;Mark Twain&amp;rdquo; and &amp;ldquo;Samuel Clemens,&amp;rdquo; and especially bibliographies. The word &amp;ldquo;bibliography&amp;rdquo; may bring to mind some of the drearier aspects of reading boring books to write boring papers in school, but bibliographies are ultimately about the creation and organization of metadata to make a work easier to find, and a lot of sharp people have thought hard about this for a long time.&lt;/p&gt;
&lt;p&gt;Her afterword shows how well her perspective extends to the future:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Two trends appear to be dominating current research and development. One is the increasing formalization of information organization as an object of study through mathematical and entity-relationship modeling, linguistic conceptualization, definitional analysis of theoretical constructs, and empirical research. The second is the increasing reach of automation to develop new means to achieve the traditional bibliographic objectives, to design intelligent search engines, and to aid in the work of cataloging and classification.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Her book doesn&amp;rsquo;t read like a Wired magazine article, but it&amp;rsquo;s not long, and it provides great background in how we got where we are with metadata, which is important to know if you&amp;rsquo;re interested in where it&amp;rsquo;s going.&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://xml.com/&#34; title=&#34;http://xml.com/&#34;&gt;Kendall Clark&lt;/a&gt; on &lt;a href=&#34;#comment-176&#34;&gt;February 1, 2006 4:45 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;FWIW, I read Elaine S.&amp;rsquo;s great book a few years back, can&amp;rsquo;t remember how I got onto it, except that I had a pretty intense self-study program going for a while (in the 2000-2002 range) on library and information science, and Svenonius&amp;rsquo;s book is &lt;em&gt;so&lt;/em&gt; foundational, I just had to read it. Anyone interested in metadata and information architecture should read it, IMO.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/book-reviews">book reviews</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>All Your Google Base metadata taxonomy are belong to us</title>
      <link>https://www.bobdc.com/blog/all-your-google-base-metadata/</link>
      <pubDate>Tue, 24 Jan 2006 10:17:24 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/all-your-google-base-metadata/</guid>
      
      
      <description><div>Google gives us a taxonomy.</div><div>&lt;p&gt;&lt;a href=&#34;http://en.wikipedia.org/wiki/All_your_base_are_belong_to_us&#34;&gt;&lt;img src=&#34;https://www.bobdc.com/img/main/Aybabtu.png&#34; alt=&#34;[All your base!]&#34; border=&#34;0&#34; align=&#34;right&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When you upload data to Google Base, you can name your attributes whatever you like, but Google has given you a head start by providing a taxonomy of attribute names for the information that people are likely to upload, such as course schedules, jobs, and housing listings. Don&amp;rsquo;t worry about whether to store your magazine name as PubName or PublicationName; the Google Base page documenting &lt;a href=&#34;http://base.google.com/base/attribute_list.html&#34;&gt;XML attributes&lt;/a&gt; shows that you&amp;rsquo;re best off storing it as a g:publication_name. (Note to XML geeks: they mean &amp;ldquo;attribute&amp;rdquo; in the database sense here, not the XML sense, despite the &amp;ldquo;XML attributes&amp;rdquo; title of the page documenting their taxonomy. See the &lt;a href=&#34;http://openrecord.org/dojo/2006-01-09/data_model_comparison.html&#34;&gt;Data Model Comparison Table&lt;/a&gt; mentioned in my last blog entry to compare Google Base data modeling terms with others.)&lt;/p&gt;
&lt;p&gt;Hardcore markup people know that the &amp;ldquo;g&amp;rdquo; prefix isn&amp;rsquo;t really part of the name. It&amp;rsquo;s standing in for a URI that is the real identifier for a particular collection of names, and if a document declares &amp;ldquo;xxx&amp;rdquo; as the prefix that represents the same URI, then an application should treat xxx:publication_name the same as it would treat g:publication_name from one of Google&amp;rsquo;s sample Google Base documents. (If you&amp;rsquo;re not sure why, see Ron Bourret&amp;rsquo;s &lt;a href=&#34;http://www.rpbourret.com/xml/NamespacesFAQ.htm&#34;&gt;XML Namespace FAQ&lt;/a&gt;, which I try to reread at least once a year.)&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;http://base.google.com/base/provider_module.html&#34;&gt;Google Base - Provider Namespace&lt;/a&gt; documentation tells us that &amp;ldquo;The &amp;lsquo;g:&amp;rsquo; prefix is reserved for the Google Base XML module and should not be used,&amp;rdquo; which shows that someone got sloppy in coding the Google Base system somewhere. I took one of their sample documents, changed the namespace declaration to &lt;code&gt;xmlns:xxx=&#34;http://base.google.com/cns/1.0&#34;&lt;/code&gt;, changed all the g: prefixes to xxx:, and uploaded the document, and Google Base did the right thing and recognized names from that namespace even with the xxx prefix. It still worries me a bit that it was much easier to find the use of the g prefix at base.google.com than it was to find the URI that it represents, because too many people still think that the prefix name is the namespace name, not a temporary stand-in for the full name to reduce markup bulk.&lt;/p&gt;
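&lt;p&gt;In other words, the two declarations sketched below (using the namespace URI from the experiment above; the element content is elided) are equivalent as far as a namespace-aware processor is concerned, because each prefix is only a local stand-in for the URI:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- both bind their prefix to the same namespace URI, so
     g:publication_name and xxx:publication_name are the same name --&amp;gt;
&amp;lt;entry xmlns:g=&#34;http://base.google.com/cns/1.0&#34;&amp;gt; ... &amp;lt;/entry&amp;gt;
&amp;lt;entry xmlns:xxx=&#34;http://base.google.com/cns/1.0&#34;&amp;gt; ... &amp;lt;/entry&amp;gt;
&lt;/code&gt;&lt;/pre&gt;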
&lt;p&gt;There&amp;rsquo;s another important issue about Google Base for people interested in metadata and taxonomies to consider: I mentioned earlier that you don&amp;rsquo;t have to worry about whether to store your magazine name as PubName or PublicationName, because Google has already picked a name. If Google Base gets legs, this taxonomy will get legs. Those of us who think of &lt;a href=&#34;http://dublincore.org/documents/dces/&#34;&gt;Dublin Core&lt;/a&gt; names such as dc:creator as pervasive and well-understood may eventually see more g:author elements than dc:creator elements out there.&lt;/p&gt;
&lt;p&gt;Sam Ruby has &lt;a href=&#34;http://www.intertwingly.net/blog/2005/11/20/Google-Base-Format-Review&#34;&gt;pointed out some sloppiness&lt;/a&gt; in the taxonomy and its documentation, and I&amp;rsquo;ve found a bit myself. For example, the first sample file I clicked on in section 4 of the &lt;a href=&#34;http://base.google.com/base/atom_specs.html&#34;&gt;Atom 0.3 Specification&lt;/a&gt; documentation, &lt;a href=&#34;http://base.google.com/base/news-atom-template.xml&#34;&gt;news-atom-template.xml&lt;/a&gt;, wasn&amp;rsquo;t well-formed—the first tag&amp;rsquo;s second attribute value was missing a closing quote, which is a pretty unprofessional mistake when a major brand name is presenting a model for people to follow.&lt;/p&gt;
&lt;p&gt;Google Base also lets you &lt;a href=&#34;http://base.google.com/support/bin/answer.py?answer=27882&#34;&gt;create your own attributes&lt;/a&gt;, which is nice, and I&amp;rsquo;m sure that their metadata experts will look closely at this folksonomy as it develops. And meanwhile, whether Google Base takes off or not, taxonomy specialists should prepare for the possibility that this &amp;ldquo;Google Core&amp;rdquo; namespace may show up in petabytes of data.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://lylejohnson.name/blog/&#34; title=&#34;http://lylejohnson.name/blog/&#34;&gt;Lyle Johnson&lt;/a&gt; on &lt;a href=&#34;#comment-161&#34;&gt;January 27, 2006 2:39 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Does Google Base even work at this point? I just tried &amp;ldquo;publishing&amp;rdquo; a few items (specifically, events) and even though their status is &amp;ldquo;Published&amp;rdquo;, they don&amp;rsquo;t show up in any searches.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-162&#34;&gt;January 27, 2006 2:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When I tried &amp;ldquo;bulk&amp;rdquo; uploading two 2K files, they were in a &amp;ldquo;pending&amp;rdquo; status before they showed up as regular items, but there was an indication of their pending status.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Meta-metadata</title>
      <link>https://www.bobdc.com/blog/metametadata/</link>
      <pubDate>Fri, 20 Jan 2006 16:22:03 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/metametadata/</guid>
      
      
<description><div>If metadata is data about data, then here&#39;s some handy data about metadata.</div><div>&lt;p&gt;I haven&amp;rsquo;t looked too hard at &lt;a href=&#34;http://dojotoolkit.org/&#34;&gt;dojo&lt;/a&gt;, an open source JavaScript toolkit, but on &lt;a href=&#34;http://www.robotwisdom.com/&#34;&gt;robotwisdom&lt;/a&gt; I found out about the &lt;a href=&#34;http://openrecord.org/dojo/2006-01-09/data_model_comparison.html&#34;&gt;Data Model Comparison Table&lt;/a&gt; on dojo&amp;rsquo;s website. The page&amp;rsquo;s multiple tables compare various dojo data model and metadata concepts with comparable concepts in RDF, XML, SQL, spreadsheets, CSV files, Google Base, Ning, and more. If you don&amp;rsquo;t care about dojo and take its columns out of the table, the comparison of the remaining columns is still very interesting. I&amp;rsquo;ll bet this ends up on some cubicle walls.&lt;/p&gt;
&lt;h2 id=&#34;3-comments&#34;&gt;3 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://times.usefulinc.com/&#34; title=&#34;http://times.usefulinc.com/&#34;&gt;Edd Dumbill&lt;/a&gt; on &lt;a href=&#34;#comment-105&#34;&gt;January 20, 2006 5:37 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Handy! Could do with some updating re: RDF typing.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-106&#34;&gt;January 20, 2006 5:41 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Edd -&lt;/p&gt;
&lt;p&gt;Given the ambition of such a collection of charts, I figured that some of its details would be wrong. If you sent in some suggested edits, they&amp;rsquo;d be crazy not to make them.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://dannyayers.com&#34; title=&#34;http://dannyayers.com&#34;&gt;Danny&lt;/a&gt; on &lt;a href=&#34;#comment-107&#34;&gt;January 20, 2006 6:09 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Great find! &amp;lsquo;Tis a bit feeble on the RDF side, but the rest just happens to be something that&amp;rsquo;ll help with a write-up I&amp;rsquo;m working on - ta Bob.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Bill Kent</title>
      <link>https://www.bobdc.com/blog/bill-kent/</link>
      <pubDate>Tue, 17 Jan 2006 09:16:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/bill-kent/</guid>
      
      
      <description><div>Bill Kent combined timeless abstract ideas about data modeling with practical advice, and always with a sense of humor.</div><div>&lt;p&gt;&lt;a href=&#34;http://www.authorhouse.com/BookStore/ItemDetail.aspx?bookid=2713&#34;&gt;&lt;img src=&#34;http://www.authorhouse.com/BookStore/Covers/2713.jpg&#34; alt=&#34;[Data and Reality cover]&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bill Kent passed away on December 17th at the age of 69. Six years ago, in a discussion of a generalized definition for the word &amp;ldquo;schema&amp;rdquo; on xsl-list, Ben Pickering &lt;a href=&#34;http://www.biglist.com/cgi-bin/wilma/wilma_hiliter/xsl-list/200007/msg00368.html&#34;&gt;asked&lt;/a&gt; &amp;ldquo;are there any books or online documents dealing with the theory of &amp;lsquo;information structures&amp;rsquo;? Some kind of description of the ways in which information may be structured, and the advantages of doing it a particular way?&amp;rdquo; Michael Kay &lt;a href=&#34;http://www.biglist.com/cgi-bin/wilma/wilma_hiliter/xsl-list/200007/msg00381.html&#34;&gt;replied&lt;/a&gt; &amp;ldquo;Yes! Though most of the ones I know of are written in the &amp;lsquo;database&amp;rsquo; context rather than the &amp;lsquo;document&amp;rsquo; context. Some are very academic / mathematical / philosophical, some more oriented to the practitioner. One of the best in my view, but very hard to get now, is Bill Kent&amp;rsquo;s &amp;lsquo;Data and Reality&amp;rsquo;.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Between the book&amp;rsquo;s excellent title and the source of the recommendation, I had to seek it out, and found a 1979 North Holland printing at &lt;a href=&#34;http://www.powells.com/&#34;&gt;Powells&lt;/a&gt; completely set in a Courier typeface. (It has since become available from &lt;a href=&#34;http://www.authorhouse.com/BookStore/ItemDetail.aspx?bookid=2713&#34;&gt;authorhouse&lt;/a&gt; in both electronic form and as a paperback.) I have e-mailed Kent since then, and was thrilled to receive a reply and to later correspond with him about the possibility of compiling his other writings. I was also very sorry to miss the Extreme XML 2003 conference in Montreal, because he was the keynote speaker there. I&amp;rsquo;m sure he had some insightful things to say about where XML data modeling issues fit into the larger data modeling issues that he had thought so much about over the years.&lt;/p&gt;
&lt;p&gt;In a field where a ten-year-old book can look hopelessly out of date, &amp;ldquo;Data and Reality&amp;rdquo; has plenty of clear, prescient advice for those of us working nearly three decades after it was written. He talks of semiotics, set theory and realistic examples of data modeling problems that you often don&amp;rsquo;t realize are problematic until he explains why. Much of what he writes anticipates fundamental notions of object-oriented development, and upon my first reading I couldn&amp;rsquo;t help but wonder how he reacted to OO ideas when they came along. You don&amp;rsquo;t have to wonder much; he had plenty to say, much of which you can find in the &amp;ldquo;Object orientation&amp;rdquo; section of the essays he&amp;rsquo;s made available in the &lt;a href=&#34;http://www.bkent.net/catalogsource.htm&#34;&gt;Document List&lt;/a&gt; section of his website. He wasn&amp;rsquo;t simply pointing out problems that OO modeling would solve, though—&amp;ldquo;Data and Reality&amp;rdquo; also mentions issues that would point to problems with the OO model.&lt;/p&gt;
&lt;p&gt;One important issue in his writing is object identity. For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What does &amp;ldquo;catching the same plane every Friday&amp;rdquo; really mean? It may or may not be the same physical airplane. But if a mechanic is scheduled to service the same plane every Friday, it had better be the same physical airplane.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another issue is object boundaries. How do we, and when should we, represent two concepts separately? Once a concept is represented in software, what is the effect of the passing of time on it?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At what point is it appropriate to introduce a new representative into the system, because change has transformed something into a new and different thing?&lt;/p&gt;
&lt;p&gt;The problem is one of identifying or discovering some essential invariant characteristic of a thing, which gives it its identity. That invariant characteristic is often hard to identify, or may not exist at all.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;An important theme in &amp;ldquo;Data and Reality&amp;rdquo; is that the way we represent something has more to do with how it&amp;rsquo;s used than any intrinsic qualities of the thing:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The category of a thing (i.e., what it is) might be determined by its position, or environment, or use, rather than by its intrinsic form and composition. In the set of plastic letters my son plays with, there is an object which might be an &amp;ldquo;N&amp;rdquo; or a &amp;ldquo;Z&amp;rdquo;, depending on how he holds it. Another one could be a &amp;ldquo;u&amp;rdquo; or an &amp;ldquo;n&amp;rdquo;, and still another might be a &amp;ldquo;b&amp;rdquo;, &amp;ldquo;p&amp;rdquo;, &amp;ldquo;d&amp;rdquo;, or &amp;ldquo;q&amp;rdquo;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As he put it elsewhere, &amp;ldquo;we are not modeling reality, but the way information about reality is processed, by people.&amp;rdquo; Today, whether we&amp;rsquo;re talking about open data shared by anyone who wants it, protected data shared by business partners according to the terms of a strict contract, or any data sharing that falls between these two scenarios, the issues Kent describes are even more important than they were when he wrote these words 29 years ago.&lt;/p&gt;
&lt;p&gt;I drafted that last paragraph before I came across this, the ending of his book:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In an absolute sense, there is no singular objective reality. But we can share a common enough view of it for most of our working purposes, so that reality does appear to be objective and stable.&lt;/p&gt;
&lt;p&gt;But the chances of achieving such a shared view become poorer when we try to encompass broader purposes, and to involve more people. This is precisely why the question is becoming more relevant today: the thrust of technology is to foster interaction among greater numbers of people, and to integrate processes into monoliths serving wider and wider purposes. It is in this environment that discrepancies in fundamental assumptions will become increasingly exposed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He knew where things were headed, and he had a lot of great advice for people processing information in the seventies, today, and years from now. Take a look at his &lt;a href=&#34;http://www.bkent.net/obituary.htm&#34;&gt;obituary&lt;/a&gt; and &lt;a href=&#34;http://www.bkent.net/&#34;&gt;web site&lt;/a&gt;, particularly his nature photography, &amp;ldquo;Data and Reality&amp;rdquo; excerpts, and other writing since then. He will live on in this work.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>Selling IBM mainframes in 1979—the colors! The outfits!</title>
      <link>https://www.bobdc.com/blog/selling-ibm-mainframes-in-1979-1/</link>
      <pubDate>Fri, 06 Jan 2006 17:31:04 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/selling-ibm-mainframes-in-1979-1/</guid>
      
      
      <description><div>A machine with half a meg of memory cost $62,500... but it supported multiple concurrent users.</div><div>&lt;p&gt;&lt;a href=&#34;http://blog.raymondfrohlich.com/2006/01/retro-office-pictures-ibm-4331_05.html&#34;&gt;&lt;img src=&#34;http://blog.raymondfrohlich.com/images/post7-ibm4331/4.jpg&#34; alt=&#34;[4331 promo picture]&#34; width=&#34;200px&#34; align=&#34;right&#34; border=&#34;0&#34; hspace=&#34;30px&#34; vspace=&#34;30x&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Raymond Frohlich, an artist and master&amp;rsquo;s candidate at NYU&amp;rsquo;s Tisch School of the Arts, has posted scanned pictures of a 1979 marketing publication for the IBM 4331, a low-end mainframe of the time (&lt;a href=&#34;http://blog.raymondfrohlich.com/2006/01/retro-office-pictures-ibm-4331.html&#34;&gt;part 1&lt;/a&gt;, &lt;a href=&#34;http://blog.raymondfrohlich.com/2006/01/retro-office-pictures-ibm-4331_05.html&#34;&gt;part 2&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.boingboing.net/2006/01/05/scans_from_1979_ibm_.html&#34;&gt;BoingBoing&lt;/a&gt; and others who&amp;rsquo;ve noticed it clearly love the retro hair and outfits of the models as much as the retro price/performance ratio, but I looked a little more into the 4331. A Google search found a &lt;a href=&#34;http://twistypuzzles.com/cgi-bin/puzzle.cgi?pid=973&#34;&gt;4331 schwag Rubik&amp;rsquo;s cube&lt;/a&gt; (did I mention the retro seventies appeal?) and an &lt;a href=&#34;http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP4331.html&#34;&gt;IBM historical archive entry&lt;/a&gt; on the computer. It&amp;rsquo;s too bad the illustration there is in black and white; full-color renditions of the minimalist seventies office artwork on the wall would add much to the picture.&lt;/p&gt;
&lt;p&gt;These IBM archives were a real find for me. They have a lot of fascinating material, such as audio recordings of Thomas Watson Jr. from 1993 and 1938 and one from his father in 1915, and a 1931 recording of employees singing the song &amp;ldquo;Ever Onward IBM&amp;rdquo;, all on the &lt;a href=&#34;http://www-03.ibm.com/ibm/history/multimedia/&#34;&gt;multimedia&lt;/a&gt; page. There&amp;rsquo;s a 100-page FAQ (&lt;a href=&#34;http://www-03.ibm.com/ibm/history/documents/pdf/faq.pdf&#34;&gt;pdf&lt;/a&gt;) with questions like &amp;ldquo;What was the IBM 46?&amp;rdquo; and &amp;ldquo;What was the IBM 47?&amp;rdquo; (Apparently the question &amp;ldquo;What was the IBM 4331?&amp;rdquo; is not frequently asked.) Did you know that before Watson Sr. streamlined the business, they also made clocks, including one that would probably be considered an art deco masterpiece today (&lt;a href=&#34;http://www-03.ibm.com/ibm/history/exhibits/cc/pdf/cc_2407M351.pdf&#34;&gt;pdf&lt;/a&gt;)? I especially love the &lt;a href=&#34;http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_album.html&#34;&gt;Mainframes photo album&lt;/a&gt; and the color-coding of some of the big systems—would you prefer your mainframe in &lt;a href=&#34;http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_2423PH3031.html&#34;&gt;yellow&lt;/a&gt;, &lt;a href=&#34;http://www-03.ibm.com/ibm/history/exhibits/3033/3033_241902.html&#34;&gt;red&lt;/a&gt;, or &lt;a href=&#34;http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_2423PH3084.html&#34;&gt;blue&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;To set a period mood while looking at Frohlich&amp;rsquo;s pictures, I suggest that you put on Armed Forces, Elvis Costello&amp;rsquo;s 1979 album, and remember that two albums earlier he was recording &amp;ldquo;Watching the Detectives&amp;rdquo; and &amp;ldquo;Alison&amp;rdquo; at night and doing mainframe data entry by day.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By Len Fischer on &lt;a href=&#34;#comment-217&#34;&gt;February 14, 2006 3:06 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know if this was added after your blog entry but there is a What was the IBM 4331 entry on page 57 of the IBM history faq.pdf.&lt;/p&gt;
&lt;p&gt;By Merideth Carleton on &lt;a href=&#34;#comment-392&#34;&gt;April 9, 2006 12:03 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Have you seen this before? It&amp;rsquo;s a number guessing game: &lt;a href=&#34;http://www.amblesideprimary.com/ambleweb/mentalmaths/guessthenumber.html.&#34;&gt;http://www.amblesideprimary.com/ambleweb/mentalmaths/guessthenumber.html.&lt;/a&gt; I guessed 34355, and it got it right! Pretty neat.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>After Web 2.0? Web 2.0 2.0</title>
      <link>https://www.bobdc.com/blog/after-web-20-web-20-20/</link>
      <pubDate>Tue, 03 Jan 2006 14:17:39 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/after-web-20-web-20-20/</guid>
      
      
      <description><div>The next step for the web.</div><div>&lt;p&gt;The original &lt;a href=&#34;http://www.oreillynet.com/lpt/a/6228&#34;&gt;Web 2.0&lt;/a&gt; conference was held in October of 2004, and now that we&amp;rsquo;ve reached the second half of the decade, Web 2.0 already seems so&amp;hellip; turn-of-the-century. A group of bold visionaries is already creating a whole new Web 2.0 that I like to call Web 2.0 2.0.&lt;/p&gt;
&lt;p&gt;Like anyone hyping a hot new technology trend, I won&amp;rsquo;t define what I&amp;rsquo;m talking about, but instead show an arbitrary yet evocative list:&lt;/p&gt;
&lt;table&gt;
&lt;tr id=&#34;i12&#34; class=&#34;odd&#34;&gt;&lt;th&gt;&lt;b&gt;Web 2.0&lt;/b&gt;&lt;/th&gt;&lt;th &gt;&lt;b&gt;Web 2.0 2.0&lt;/b&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;tr id=&#34;i13&#34; class=&#34;even&#34;&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://en.wikipedia.org/&#34;&gt;Wikipedia&lt;/a&gt;&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://uncyclopedia.org/wiki/Main_Page&#34;&gt;Uncyclopedia&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i14&#34; class=&#34;odd&#34;&gt;&lt;td &gt;overuse of trendy music terms (remix, mashup)&lt;/td&gt;&lt;td &gt;overuse of French cinematic terms (mise-en-sc&amp;#232;ne, montage, v&amp;#233;rit&amp;#233;)&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i15&#34; class=&#34;even&#34;&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://www.cnn.com/2004/TECH/ptech/01/09/bus2.feat.geek.camp/&#34;&gt;foo&lt;/a&gt; camp, &lt;a href=&#34;https://web.archive.org/web/20180524011022/http://barcamp.org/&#34;&gt;bar&lt;/a&gt; camp, &lt;a href=&#34;https://web.archive.org/web/20180524011022/http://longtailcamp.org/&#34;&gt;long tail&lt;/a&gt; camp&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://pages.zoom.co.uk/leveridge/sontag.html&#34;&gt;on&lt;/a&gt; camp&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i16&#34; class=&#34;odd&#34;&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://iclock.org/ &#34;&gt;iClock&lt;/a&gt;&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://www.humanclock.com/&#34;&gt;human clock&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i17&#34; class=&#34;even&#34;&gt;&lt;td &gt;Google&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://www.gizoogle.com/&#34;&gt;Gizoogle&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i18&#34; class=&#34;odd&#34;&gt;&lt;td &gt;platforms as frameworks&lt;/td&gt;&lt;td &gt;frameworks as platforms&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i19&#34; class=&#34;even&#34;&gt;&lt;td &gt;real weblogs by unknown people&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://stuckinrehabwithpatobrien.blogspot.com/#111144582372661444&#34;&gt;fake weblogs about famous people&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i20&#34; class=&#34;odd&#34;&gt;&lt;td &gt;craigslist&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://www.kasperhauser.com/khmc/&#34;&gt;khraigslist &lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i21&#34; class=&#34;even&#34;&gt;&lt;td &gt;RSS feeds of content other than blogs and news articles&lt;/td&gt;&lt;td &gt;Atom 1.0 feeds of content other than blogs and news articles&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i22&#34; class=&#34;odd&#34;&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://labs.google.com/&#34;&gt;hot new Google apps&lt;/a&gt;&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://www.technorati.com/search/%22google+rumor%22&#34;&gt;rumors&lt;/a&gt; about Google apps that don&#39;t exist yet&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i23&#34; class=&#34;even&#34;&gt;&lt;td &gt;folksonomies&lt;/td&gt;&lt;td &gt;Fahrvergn&amp;#252;gen&lt;/td&gt;&lt;/tr&gt; 
&lt;tr id=&#34;i24&#34; class=&#34;odd&#34;&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://www.flickr.com/&#34;&gt;flickr&lt;/a&gt;, &lt;a href=&#34;https://web.archive.org/web/20180524011022/http://del.icio.us/&#34;&gt;del.icio.us&lt;/a&gt;&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://www.tagtagger.com/&#34;&gt;tagtagger&lt;/a&gt;, &lt;a href=&#34;https://web.archive.org/web/20180524011022/http://supr.c.ilio.us/&#34;&gt;supr.c.ilio.us&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr id=&#34;i25&#34; class=&#34;even&#34;&gt;&lt;td &gt;Creative Commons Licensing&lt;/td&gt;&lt;td &gt;&lt;a href=&#34;https://web.archive.org/web/20180524011022/http://supr.c.ilio.us/un/hu-lb-dr/01/&#34;&gt;uncreative uncommons licensing&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;These last two are instructive. While del.icio.us lets people create tags about web sites and flickr lets people create tags about pictures, the mise-en-scène of supr.c.ilio.us takes it to the next level, letting people create tags about social bookmarking sites. Tagging about tagging! Very Web 2.0 2.0. And, they&amp;rsquo;ve even come up with a new licensing scheme for it: &lt;a href=&#34;http://supr.c.ilio.us/un/hu-lb-dr/01/&#34;&gt;uncreative uncommons&lt;/a&gt;. Free as in &amp;ldquo;association!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;If Web 2.0 was both &lt;a href=&#34;http://www.paulgraham.com/web20.html&#34;&gt;meaningless and meaningful&lt;/a&gt;, Web 2.0 2.0 is even more meaningless—and even more meaningful. Web 2.0 2.0 practitioners aren&amp;rsquo;t tied to the old, 2004 models of community building, perpetual betas, and XMLHttpRequest calls. They&amp;rsquo;re thinking big by thinking small, by thinking big, and we&amp;rsquo;re seeing a qualitative difference in the role of the web in our lives. It&amp;rsquo;s no longer about &amp;ldquo;us,&amp;rdquo; but about &amp;ldquo;we,&amp;rdquo; and you can&amp;rsquo;t spell &amp;ldquo;web&amp;rdquo; without &amp;ldquo;we.&amp;rdquo; (You also can&amp;rsquo;t spell &amp;ldquo;asshole&amp;rdquo; without &amp;ldquo;&lt;a href=&#34;http://www.lightreading.com/document.asp?doc_id=83507&#34;&gt;aol&lt;/a&gt;,&amp;rdquo; but that&amp;rsquo;s a separate topic.)&lt;/p&gt;
&lt;p&gt;Web 2.0 2.0 technologies like &lt;a href=&#34;http://www.technicalpursuit.com/ajax_indepth.htm%20&#34;&gt;HWAJAX&lt;/a&gt; (Hand Waving + Asynchronous JavaScript + XML) are transforming the value exchange paradigm as we know it. While Web 2.0 (or, as the French call it, &amp;ldquo;&lt;a href=&#34;http://fr.wikipedia.org/wiki/Web_2.0&#34;&gt;web deux point eaux&lt;/a&gt;&amp;rdquo;) was about two-way communication between providers and consumers, Web 2.0 2.0 is about three-way communication, between providers, consumers, and self-styled pundits like myself, all centered around &amp;ldquo;value&amp;rdquo; as a &amp;ldquo;meme.&amp;rdquo; A diagram will make it easier to understand:&lt;/p&gt;
&lt;img id=&#34;i10&#34; src=&#34;https://www.bobdc.com/img/main/web2020.png&#34; alt=&#34;Web 2.0 2.0&#34;/&gt;
&lt;p&gt;If you haven&amp;rsquo;t seen any examples of what I&amp;rsquo;m talking about, then I guess you just don&amp;rsquo;t read the right blogs, do you? Web 2.0 2.0 is truly the fulfillment of Tim Berners Lee&amp;rsquo;s &lt;a href=&#34;http://www.w3.org/People/Berners-Lee/Kids&#34;&gt;original vision&lt;/a&gt; of the World Wide Web: &amp;ldquo;Often it was just easier to go and ask people when they were having coffee.&amp;rdquo; Or, put another way, the Web is the network is the computer.&lt;/p&gt;
&lt;p&gt;William Gibson, who is some kind of prophet or something, once said &amp;ldquo;The future is here. It&amp;rsquo;s just not evenly distributed yet.&amp;rdquo; This is apparently a very cool thing to quote in vague, utopian blog postings, probably because the word &amp;ldquo;distributed&amp;rdquo; alludes to distributed computing, which is much hipper than centralized systems. Arthur C. Clarke wrote that &amp;ldquo;Any sufficiently advanced technology is indistinguishable from magic,&amp;rdquo; which proves my point just about as well. The use of at least one of these quotes is mandatory in any essay that&amp;rsquo;s going for a technology visionary effect, and I have both of them, so I have a particularly clear grasp of where the web is going.&lt;/p&gt;
&lt;p&gt;But in the words of filmmaker and Web 2.0 2.0 patron saint Marty DiBergi, enough of my yakkin&amp;rsquo; : get out there and start creating Web 2.0 2.0 applications! There&amp;rsquo;s money to be made—and this time we&amp;rsquo;ll do it right, without creating the mess that &lt;a href=&#34;http://bubble20.blogspot.com/index.html&#34;&gt;Bubble 2.0&lt;/a&gt; became.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://supr.c.ilio.us/un/hu-lb-dr/01/&#34;&gt;&lt;img src=&#34;http://supr.c.ilio.us/un/i/un-denied.png&#34; alt=&#34;most rights denied&#34; border=&#34;off&#34;/&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/web%202.0%202.0&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/web%202.0%202.0&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/web%202.0&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/web%202.0&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://supr.c.ilio.us/blog/&#34; title=&#34;http://supr.c.ilio.us/blog/&#34;&gt;Eran&lt;/a&gt; on &lt;a href=&#34;#comment-62&#34;&gt;January 3, 2006 4:17 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hey Bob,&lt;/p&gt;
&lt;p&gt;Good article. I really think this web 2.0 2.0 thing will catch on. Not sure about that whole french thing though but that&amp;rsquo;s mostly because i&amp;rsquo;ve no clue what all them words mean.&lt;/p&gt;
&lt;p&gt;PS. I think you meant Uncreative _Un_commons.&lt;/p&gt;
&lt;p&gt;Eran.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-63&#34;&gt;January 3, 2006 4:36 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks Eran, I fixed it.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2006">2006</category>
      
      <category domain="https://www.bobdc.com//categories/technology-future">technology, future</category>
      
    </item>
    
    <item>
      <title>Technorati tags as metadata: making them more meta</title>
      <link>https://www.bobdc.com/blog/technorati-tags-as-metadata-ma/</link>
      <pubDate>Tue, 27 Dec 2005 15:21:52 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/technorati-tags-as-metadata-ma/</guid>
      
      
      <description><div>Is it really metadata when you have to announce &#34;here are my posting&#39;s Technorati tags&#34;?</div><div>&lt;p&gt;More bloggers are embedding &lt;a href=&#34;http://www.technorati.com/help/tags.html&#34;&gt;Technorati Tags&lt;/a&gt; into their postings, and it&amp;rsquo;s great to see a major league app use user-created metadata for anything. The popular convention of announcing what your Technorati tags are as part of your blog&amp;rsquo;s content, though, made me wonder: can we add these tags where the casual reader doesn&amp;rsquo;t see them, so that they really are metadata instead of being additional tagged content? For example, if I write a posting about RDF and XMP without actually using the word &amp;ldquo;metadata&amp;rdquo;, how can I get a Technorati tag search on the word metadata to find that entry?&lt;/p&gt;
&lt;p&gt;A View Source on my &lt;a href=&#34;https://www.bobdc.com/blog/a-news-reader-wish-granted&#34;&gt;last posting here&lt;/a&gt; will reveal some experiments I did with Technorati tags that had no content between the &lt;code&gt;a&lt;/code&gt; elements&amp;rsquo; start- and end-tags. (Experiments &lt;a href=&#34;http://www.snee.com/sneetard/2005/12/another_technorati_tagging_tes_1.html&#34;&gt;elsewhere&lt;/a&gt; showed that a single-tag empty &lt;code&gt;a&lt;/code&gt; element got treated as a start-tag with no end, so that the text after it was underlined as if it were a link anchor.) Of the following three &lt;code&gt;a&lt;/code&gt; elements, the first resulted in an entry at &lt;a href=&#34;http://technorati.com/tag/bobtest4&#34;&gt;http://technorati.com/tag/bobtest4&lt;/a&gt; and the third in an entry at &lt;a href=&#34;http://www.technorati.com/blogs/bobtest6&#34;&gt;http://www.technorati.com/blogs/bobtest6&lt;/a&gt; (along with the comment &amp;ldquo;Hey Bob! Is your blog about bobtest6?&amp;rdquo; on the right):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;a href=&amp;quot;http://technorati.com/tag/bobtest4&amp;quot; rel=&amp;quot;tag&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
&amp;lt;a href=&amp;quot;http://technorati.com/tag/bobtest5&amp;quot; rel=&amp;quot;directory&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
&amp;lt;a href=&amp;quot;http://technorati.com/tag/bobtest6&amp;quot; rel=&amp;quot;tag directory&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Considering that a &lt;code&gt;rel&lt;/code&gt; attribute is &lt;a href=&#34;http://www.w3.org/TR/xhtml-modularization/abstraction.html#dt_LinkTypes&#34;&gt;supposed to&lt;/a&gt; hold &amp;ldquo;a space-separated list of link types,&amp;rdquo; and that &amp;ldquo;White space characters are not permitted within link types,&amp;rdquo; I would think that the second line above would put a bobtest5 entry at &lt;a href=&#34;http://www.technorati.com/blogs/bobtest5&#34;&gt;http://www.technorati.com/blogs/bobtest5&lt;/a&gt; and that the third would put entries at both &lt;a href=&#34;http://www.technorati.com/blogs/bobtest6&#34;&gt;http://www.technorati.com/blogs/bobtest6&lt;/a&gt; and &lt;a href=&#34;http://www.technorati.com/tag/bobtest6&#34;&gt;http://www.technorati.com/tag/bobtest6&lt;/a&gt;, but it looks like Technorati&amp;rsquo;s crawler treats &amp;ldquo;tag directory&amp;rdquo; as a single &lt;code&gt;rel&lt;/code&gt; entry and ignores the one with a value of &amp;ldquo;directory&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;I could be wrong; this is just guesswork based on experiments. (I&amp;rsquo;ve searched for further background from Technorati on how this really works but had no luck. Although I signed up for their developer&amp;rsquo;s program and sent various e-mails, I still have no idea how to subscribe to their mailing list.)&lt;/p&gt;
&lt;p&gt;A nice bonus is that the bobtest4 example sort of works from an &lt;a href=&#34;http://www.w3.org/2001/sw/BestPractices/HTML/2005-rdfa-syntax&#34;&gt;RDF/A&lt;/a&gt; perspective, too, although a predicate of &amp;ldquo;tag&amp;rdquo; is awfully vague. Even a name like &amp;ldquo;ttag&amp;rdquo; would give a better clue as to what kind of tag we&amp;rsquo;re talking about: a Technorati one.&lt;/p&gt;
&lt;p&gt;In this and future postings, I&amp;rsquo;m grouping such metadata inside of a &lt;code&gt;div&lt;/code&gt; element with a &lt;code&gt;class&lt;/code&gt; value of &amp;ldquo;technoratiTags&amp;rdquo;. Does anyone have ideas for a better way to incorporate empty &lt;code&gt;a&lt;/code&gt; elements with Technorati metadata into XHTML, particularly if it can take advantage of other metadata standards work?&lt;/p&gt;
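&lt;p&gt;As a rough sketch of what I mean (the tag names here are just placeholders, not the actual tags on this entry), such a grouping looks something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;div class=&amp;quot;technoratiTags&amp;quot;&amp;gt;
  &amp;lt;a href=&amp;quot;http://technorati.com/tag/metadata&amp;quot; rel=&amp;quot;tag&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
  &amp;lt;a href=&amp;quot;http://technorati.com/tag/RDF&amp;quot; rel=&amp;quot;tag&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the &lt;code&gt;a&lt;/code&gt; elements are empty, a browser renders nothing, but a crawler that reads &lt;code&gt;rel=&amp;quot;tag&amp;quot;&lt;/code&gt; still sees both tags, and the wrapping &lt;code&gt;div&lt;/code&gt; gives a stylesheet or script one place to find them.&lt;/p&gt;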
&lt;p&gt;(More writing on the use of Technorati tags: &lt;a href=&#34;http://www.oreillynet.com/pub/wlg/6247&#34;&gt;Big-time app uses the a/@rel attribute, boosting &amp;ldquo;folksonomy&amp;rdquo; development&lt;/a&gt; and &lt;a href=&#34;http://www.oreillynet.com/pub/wlg/6516&#34;&gt;Folksonomy tags for indirect linking&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://technorati.com/blogs/technorati%20tags&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/technorati%20tags&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/metadata&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/metadata&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;4-comments&#34;&gt;4 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://sevenroot.org/dlc/&#34; title=&#34;http://sevenroot.org/dlc/&#34;&gt;Darren Chamberlain&lt;/a&gt; on &lt;a href=&#34;#comment-51&#34;&gt;December 27, 2005 4:32 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Doesn&amp;rsquo;t Technorati also use atom:category or dc:subject elements for determining a story&amp;rsquo;s tags? I think this is what you really want, rather than putting empty anchor tags on your pages:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If your blog software supports categories and RSS/Atom feeds (like Movable Type, WordPress, TypePad, Blogware, Radio), just use the included category system and make sure you are publishing RSS/Atom feeds and your categories will be read as tag.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Also, and this is the best part, in my opinion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You do not have to link to Technorati. You can link to any web page that ends in a tag - even your own site!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So a regular link to Wikipedia, or your own categories, or whatever, can have rel=&amp;ldquo;tag&amp;rdquo; on it, so you don&amp;rsquo;t need to artificially link.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-52&#34;&gt;December 27, 2005 6:00 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;MovableType puts a block of RDF metadata in the entry, but comments it out. I do see dc:subject=&amp;ldquo;metadata&amp;rdquo; in there for this entry, but nothing showing up at &lt;a href=&#34;http://www.technorati.com/blogs/metadata?sort=recent&#34;&gt;http://www.technorati.com/blogs/metadata?sort=recent&lt;/a&gt;, which is where I&amp;rsquo;d expect to find it, so I don&amp;rsquo;t think that Technorati is parsing anything inside the comments. I do see something at &lt;a href=&#34;http://technorati.com/tag/metadata&#34;&gt;http://technorati.com/tag/metadata&lt;/a&gt;, but I think that comes from the kinds of tags I describe in the posting.&lt;/p&gt;
&lt;p&gt;I did see the part about any URL working as long as you had rel=&amp;ldquo;tag&amp;rdquo; in their documentation, but if that worked then my zamfir10 test at &lt;a href=&#34;http://www.snee.com/sneetard/2005/12/december_13_test_2.html&#34;&gt;http://www.snee.com/sneetard/2005/12/december_13_test_2.html&lt;/a&gt; should have worked. The use of atom:category should work, but MovableType only lets me pick one category and (I think) converts that to atom:category in the Atom feed, and I&amp;rsquo;d like to assign multiple bits of metadata to a single posting.&lt;/p&gt;
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://epeus.blogspot.com&#34; title=&#34;http://epeus.blogspot.com&#34;&gt;Kevin Marks&lt;/a&gt; on &lt;a href=&#34;#comment-61&#34;&gt;January 3, 2006 3:38 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob, have a closer read of the &lt;a href=&#34;http://microformats.org/wiki/rel-directory&#34;&gt;http://microformats.org/wiki/rel-directory&lt;/a&gt; spec page.&lt;br /&gt;
Using rel=&amp;ldquo;directory&amp;rdquo; asserts that the linked page is a directory listing containing an entry for the current page.&lt;br /&gt;
However, the Technorati blog directory is a tagged one, and so you should assert that it is a tag directory for your page, if you want us to include it.&lt;br /&gt;
We do combine the meanings - rel=&amp;ldquo;tag&amp;rdquo; says &amp;lsquo;this is a tag&amp;rsquo;; rel=&amp;ldquo;directory&amp;rdquo; says this is a directory for the page, i.e. the blog as a whole (indicated as the homepage). What we assume from a bare rel=&amp;ldquo;tag&amp;rdquo; is that the context is the blog post. The scope of rel=&amp;ldquo;tag&amp;rdquo; is kept unspecified deliberately, as it is used in multiple contexts (see xFolk and hReview, for example, as well as blog posts). Adding rel=&amp;ldquo;directory&amp;rdquo; resolves this ambiguity.&lt;/p&gt;
&lt;p&gt;Hope that helps.&lt;/p&gt;
&lt;p&gt;PS Putting empty anchor tags is bad practice - if you care enough about the tagspace to use it you should link it, and your links may confuse your readers (especially those using screen-reading software) and search engines that are not aware of these microformats.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-64&#34;&gt;January 3, 2006 4:46 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks. I must admit that my example of tagging a post that doesn&amp;rsquo;t use the word &amp;ldquo;metadata&amp;rdquo; with the keyword &amp;ldquo;metadata&amp;rdquo; is an edge case. Tagging existing content words as keywords shouldn&amp;rsquo;t be that challenging, and I&amp;rsquo;ll try to get into that habit.&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Semantic web apparently moving along</title>
      <link>https://www.bobdc.com/blog/semantic-web-apparently-moving/</link>
      <pubDate>Thu, 22 Dec 2005 11:12:55 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/semantic-web-apparently-moving/</guid>
      
      
      <description><div>There are various ways to measure it, but Danny Ayers shows that where it counts, it&#39;s making progress.</div><div>&lt;p&gt;A few months ago, Danny Ayers said that he was looking for work. Now, &lt;a href=&#34;http://dannyayers.com/archives/2005/12/22/work/&#34;&gt;he writes&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I seem to have a fairly constant flow of work offers, the majority being frustratingly interesting Semantic Web-related contracts. So if anyone’s actually looking for this kind of thing let me know and I can pass names along.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can count up your sourceforge projects, or the amount of &lt;a href=&#34;http://www.rdfdata.org&#34;&gt;RDF data&lt;/a&gt; or school projects or &lt;a href=&#34;http://www.technorati.com/search/%22semantic+web%22&#34;&gt;Technorati mentions&lt;/a&gt; or whatever you like, but if someone who specializes in semantic web work is finding too much of it out there, that&amp;rsquo;s a real indicator of progress.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://technorati.com/blogs/%22semantic+web%22&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/%22semantic+web%22&#34;&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>A news reader wish, granted</title>
      <link>https://www.bobdc.com/blog/a-news-reader-wish-granted/</link>
      <pubDate>Sat, 17 Dec 2005 11:33:31 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/a-news-reader-wish-granted/</guid>
      
      
      <description><div>It turns out that Bloglines lets you choose, by feed, whether you want the complete content, the summary, or just the title displayed.</div><div>&lt;p&gt;&lt;em&gt;June 2018 update: when Bloglines went down, I used Google Reader for a few years. When that went away, I used FeedReader for a few years, but having been recently a bit frustrated with their interface, I now use &lt;a href=&#34;http://www.inoreader.com&#34;&gt;inoreader&lt;/a&gt;. Long live RSS!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;https://www.bobdc.com/blog/short-descriptions-or-full-ent-1&#34;&gt;recently wished&lt;/a&gt; that news readers would let you pick whether you want to see the complete entries or just the summaries in a feed that provides both. I just found out that my newsreader of choice, &lt;a href=&#34;http://www.bloglines.com/&#34;&gt;Bloglines&lt;/a&gt;, does this and more: you can pick whether to see the complete entry, the summary, or just the title, and you can make different choices for different feeds. For Atom 1.0 feeds, it uses the &lt;code&gt;summary&lt;/code&gt; element for the summary and the &lt;code&gt;content&lt;/code&gt; element for the complete entry, as it should.&lt;/p&gt;
&lt;p&gt;The ability to view just the title is an extra bonus, because it lets you scan a lot of entries even more quickly, assuming that the titles are written by professionals who are interested in describing the entries well, which is true for serious news sites like the Guardian or the New York Times. If the title only piques your interest a little, so that you&amp;rsquo;d like to see the summary without opening up the complete story in a new tab or window, you can just click the plus sign to see the summary:&lt;/p&gt;
&lt;img id=&#34;i3&#34; src=&#34;https://www.bobdc.com/img/main/bloglinessum.jpg&#34; alt=&#34;bloglines screenshot&#34;/&gt;
&lt;p&gt;It&amp;rsquo;s yet another reason to use Bloglines, and yet another reason to use Atom 1.0.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://technorati.com/tag/bobtest4&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/tag/bobtest5&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/tag/bobtest6&#34;&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/blogging-about-blogging">blogging about blogging</category>
      
    </item>
    
    <item>
      <title>Scripting the addition of XML files to the eXist XQuery database</title>
      <link>https://www.bobdc.com/blog/scripting-the-addition-of-xml/</link>
      <pubDate>Mon, 12 Dec 2005 17:23:48 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/scripting-the-addition-of-xml/</guid>
      
      
      <description><div>This wasn&#39;t documented very well, so once I got it to work I thought I&#39;d post it.</div><div>&lt;p&gt;Saxon is great for getting to know XQuery syntax (see &lt;a href=&#34;http://www.xml.com/pub/a/2005/03/02/xquery.html&#34;&gt;part one&lt;/a&gt; and &lt;a href=&#34;http://www.xml.com/pub/a/2005/03/23/xquery-2.html&#34;&gt;part two&lt;/a&gt; of my &amp;ldquo;Getting Started with XQuery&amp;rdquo; articles in XML.com for more on this), but it reads all of the data to query into memory, and much of the point of XQuery is to work with large, indexed, disk-based collections of XML that won&amp;rsquo;t fit into memory. I&amp;rsquo;ve started playing with the open-source &lt;a href=&#34;http://exist.sourceforge.net/&#34;&gt;eXist&lt;/a&gt; XML database for this.&lt;/p&gt;
&lt;p&gt;After starting up the eXist server, you can start up the interactive client and load files from there, but if the client has any problems loading the files, it doesn&amp;rsquo;t show any error messages that I could find—all I knew was that the file I tried to load wasn&amp;rsquo;t showing up in the client&amp;rsquo;s list of loaded files. If you want to load a lot of files, you don&amp;rsquo;t want an interactive client, anyway; you want to create a script that does it for you. Apparently, the documentation and sample perl/python/java scripts that come with eXist are a bit behind the development of the system itself, so they don&amp;rsquo;t always work. I finally found a simple way to load files using an eXist extension to XQuery, demonstrated by the code below.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(: Load the files temp2a.xml, temp2b.xml, temp2c.xml
   from c:\temp into the eXist database. :)

xquery version &amp;quot;1.0&amp;quot;;

declare namespace xmldb=&amp;quot;http://exist-db.org/xquery/xmldb&amp;quot;;

&amp;lt;html&amp;gt;&amp;lt;body&amp;gt;
{
  (: We&#39;ll load each file into the coll1 collection as the administrator. :)
  let $collection := xmldb:collection(&amp;quot;xmldb:exist:///db/coll1&amp;quot;, &amp;quot;admin&amp;quot;, &amp;quot;&amp;quot;)
  for $dataFilename in (&amp;quot;temp2a&amp;quot;,&amp;quot;temp2b&amp;quot;,&amp;quot;temp2c&amp;quot;)
      let $name := $dataFilename
      let $URI := xs:anyURI(concat(&amp;quot;file:///c:/temp/&amp;quot;,$name,&amp;quot;.xml&amp;quot;))
      let $retCode := xmldb:store($collection, $name, $URI)
      return &amp;lt;p&amp;gt;{$retCode}&amp;lt;/p&amp;gt;
 }
&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With eXist stored in c:\bin\eXist on a Windows machine and its server up and running, storing the XQuery script above as C:\bin\eXist\webapp\xquery\loadfiles.xq and then sending a browser to http://localhost:8080/exist/xquery/loadfiles.xq ran the query, loaded the files, and displayed the return codes in the browser.&lt;/p&gt;
&lt;p&gt;After getting this to work with simple dummy files, I found what was wrong with the file I was originally having problems with: &amp;ldquo;The document is too complex/irregularily structured to be mapped into eXist&amp;rsquo;s numbering scheme.&amp;rdquo; As a dayjob-related file, I can&amp;rsquo;t describe it in much detail, but this reaction to it didn&amp;rsquo;t surprise me. Still, I have plenty of ideas for eXist apps to build around less complex XML.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://technorati.com/blogs/exist&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/XML&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/XQuery&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.ruminate.co.uk&#34; title=&#34;http://www.ruminate.co.uk&#34;&gt;Jim Fuller&lt;/a&gt; on &lt;a href=&#34;#comment-30&#34;&gt;December 14, 2005 8:56 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I have been using eXist in anger for the past year&amp;hellip; a few bits of advice:&lt;/p&gt;
&lt;p&gt;* You must use the latest build&amp;hellip; too much volatility with older snapshots.&lt;/p&gt;
&lt;p&gt;* I would suggest using the very useful Ant tasks for doing mundane stuff like loading and exporting data to the database.&lt;/p&gt;
&lt;p&gt;* As for complex XML type errors&amp;hellip; I haven&amp;rsquo;t encountered this! The usual issues with getting an XML document into eXist are: not well-formed, DTD not registered in /exist/WEB-INF/catalog, or super large docs (break them up into collections/docs if possible).&lt;/p&gt;
&lt;p&gt;* The REST interface is fine with the current perl scripts&amp;hellip; I use them constantly.&lt;/p&gt;
&lt;p&gt;* Authentication and doc ownership is a little hit and miss with eXist at the moment across the various interfaces (WebDAV, servlet, REST, XML-RPC).&lt;/p&gt;
&lt;p&gt;gl, Jim Fuller&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/xml">XML</category>
      
    </item>
    
    <item>
      <title>Using (or not using) Adobe&#39;s XMP metadata format</title>
      <link>https://www.bobdc.com/blog/using-or-not-using-adobes-xmp/</link>
      <pubDate>Fri, 09 Dec 2005 18:21:29 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/using-or-not-using-adobes-xmp/</guid>
      
      
      <description><div>If Adobe is really interested in promoting XMP properly, they could learn a lot about developer relations from Yahoo and Amazon.</div><div>&lt;p&gt;Adobe is pushing for &lt;a href=&#34;http://www.adobe.com/products/xmp/main.html&#34;&gt;XMP&lt;/a&gt; to become the metadata format for the OASIS &lt;a href=&#34;http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office&#34;&gt;OpenDocument&lt;/a&gt; format, and Leigh Dodds just posted some &lt;a href=&#34;http://www.ldodds.com/blog/archives/000261.html&#34;&gt;notes on his review of XMP&lt;/a&gt;. He covers the XMP-RDF relationship issues better than I could. He also refers to my &lt;a href=&#34;http://www.xml.com/pub/a/2004/09/22/xmp.html&#34;&gt;article&lt;/a&gt; on XMP published in XML.com over a year ago, in which I wrote this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For now, the use of XMP means either depending on commercial vendor tools or being comfortable with C++ so that you can use Adobe&amp;rsquo;s SDK, but this is changing. Activity in the XMP User-to-User forum shows that open source Java tools are on the way, which will make it much easier to incorporate the use of XMP into production workflows—for example, to extract the metadata from a batch of images and then load that data into a database.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;d like to revise that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The use of XMP means either depending on commercial vendor tools or being comfortable with C++ so that you can use Adobe&amp;rsquo;s SDK.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As Leigh wrote, &amp;ldquo;The ability to embed metadata in arbitrary binary document formats is a huge benefit&amp;rdquo; of XMP. I&amp;rsquo;ve always thought that XMP had great potential to be a bridge between the often overly academic world of RDF and the world of commercial media production. It&amp;rsquo;s too bad that while Adobe is very interested in corporate tool support of XMP, they&amp;rsquo;re not interested in building the kind of grass-roots tool support that would make XMP much more useful and therefore popular.&lt;/p&gt;
&lt;p&gt;When Adobe representative Alan Lillich recently &lt;a href=&#34;http://lists.oasis-open.org/archives/office/200512/msg00009.html&#34;&gt;made the case&lt;/a&gt; to the OpenDocument TC for XMP to become the metadata format, he included a section &amp;ldquo;About the Adobe XMP SDK&amp;rdquo; that neglected to mention that it&amp;rsquo;s limited to use by C++ programmers. When he writes that the &amp;ldquo;revamped API&amp;hellip; [is] much easier to use&amp;rdquo; I assume that he means that some class libraries have been refactored and the documentation has been improved. (I just downloaded the current SDK again to check, and it&amp;rsquo;s mostly cpp, hpp, and html files.)&lt;/p&gt;
&lt;p&gt;If Adobe looked at the effort that Amazon and Yahoo have put into encouraging application development by part-time programmers using popular scripting languages, they might get some ideas about spreading the popularity of the XMP specification. If Adobe had a wrapper for a popular scripting language around the C++ based binaries two years ago (note that XMP is &lt;a href=&#34;http://www.adobe.com/aboutadobe/pressroom/pressreleases/200109/20010924xmp.html&#34;&gt;over four years&lt;/a&gt; old), by now flickr users would be clamoring for flickr to pull XMP metadata from their pictures and post it on each picture&amp;rsquo;s page, as many flickr users did for EXIF metadata. I know that if Adobe offered a python or perl binding, I would have written some apps to insert and pull XMP metadata by now. I did some C++ coding in school, but setting up a C++ environment (free or otherwise) and getting up to speed with it would take more time than I have available. It would take less time, if Adobe offered a Ruby or PHP binding to the API, for me to finally settle down and learn either of those languages.&lt;/p&gt;
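&lt;p&gt;To show how low the bar is: below is a minimal, hypothetical Python sketch (my own toy, nothing from Adobe&amp;rsquo;s SDK) that pulls an XMP packet out of a file&amp;rsquo;s bytes by scanning for the xpacket processing instructions that delimit it.&lt;/p&gt;

```python
# Toy XMP extractor: an XMP packet is delimited by "xpacket" processing
# instructions embedded in the host file's bytes, so a brute-force scan
# finds it without any format-specific parsing. (A real tool would honor
# each file format's wrapper rules; this only shows how small the job is.)
def extract_xmp(data: bytes):
    start_marker = b"\x3c?xpacket begin"   # "\x3c" is the left angle bracket
    end_marker = b"\x3c?xpacket end"
    start = data.find(start_marker)
    if start == -1:
        return None
    end = data.find(end_marker, start)
    if end == -1:
        return None
    end = data.find(b"?>", end)            # include the closing "?>"
    if end == -1:
        return None
    return data[start:end + 2]
```

&lt;p&gt;A scripting-language binding from Adobe could make the insertion side just as approachable.&lt;/p&gt;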
&lt;p&gt;I&amp;rsquo;d like to see the OpenDocument TC push for a stronger commitment from Adobe to support grass-roots XMP app development before the TC adds metadata structures to the OpenDocument formats that are, for now, mostly geared to use by tools from Adobe. (I&amp;rsquo;ve seen lists from Adobe of companies that claim some support for XMP, but a URL like &lt;a href=&#34;https://www.kodak.com&#34;&gt;www.kodak.com&lt;/a&gt; doesn&amp;rsquo;t tell me much about what that company&amp;rsquo;s support is.) Without easier ways to use XMP metadata, I don&amp;rsquo;t see much payoff to the TC for making it part of the OpenDocument format.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://technorati.com/blogs/Adobe&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/XMP&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/RDF&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/metadata&#34;&gt;&lt;/a&gt; &lt;a href=&#34;http://technorati.com/blogs/OpenDocument&#34;&gt;&lt;/a&gt;&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/metadata">metadata</category>
      
    </item>
    
    <item>
      <title>Short descriptions or full entries in the feed: your choice</title>
      <link>https://www.bobdc.com/blog/short-descriptions-or-full-ent-1/</link>
      <pubDate>Fri, 09 Dec 2005 12:42:00 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/short-descriptions-or-full-ent-1/</guid>
      
      
      <description><div>If this is too terse for you, you now have an alternative.</div><div>&lt;p&gt;The debate over whether feeds should provide complete entries or brief descriptions of them is getting religious in its fervor. People on both sides are so damn sure that they&amp;rsquo;re right.&lt;/p&gt;
&lt;p&gt;Personally, I prefer short descriptions. When using Bloglines to catch up on feeds while eating lunch, I like to scan as many entries as possible, and paging down and paging down through someone&amp;rsquo;s C# code in the Planet XML feed really slows this process down. Still, I can see reasons to want the entire entry in the feed—it saves you some clicks and provides more possibilities for new apps that build on that data. I also hated the dependence of earlier feed formats on CDATA sections to make this possible, so the option to put a &lt;code&gt;type=&amp;quot;xhtml&amp;quot;&lt;/code&gt; attribute setting on the &lt;code&gt;content&lt;/code&gt; element is probably my single favorite thing about Atom 1.0.&lt;/p&gt;
&lt;p&gt;It would be nice if all Atom feeds had both the summary and the content, and news readers let you pick which one you want displayed. Meanwhile, I decided to provide two feeds for people to choose from. I was surprised at how few weblogs my web searches turned up that offer this choice. And meanwhile, Norm Walsh and Sean McGrath&amp;rsquo;s weblog feeds will continue to provide my model for pithy entry descriptions.&lt;/p&gt;
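&lt;p&gt;For reference, here&amp;rsquo;s a trimmed sketch of an Atom 1.0 entry that carries both (element content invented for illustration). The &lt;code&gt;type=&amp;quot;xhtml&amp;quot;&lt;/code&gt; content wraps real markup in an XHTML &lt;code&gt;div&lt;/code&gt;, so no CDATA section is needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;entry&amp;gt;
  &amp;lt;title&amp;gt;Short descriptions or full entries in the feed&amp;lt;/title&amp;gt;
  &amp;lt;summary type=&amp;quot;text&amp;quot;&amp;gt;If this is too terse for you, you now have an alternative.&amp;lt;/summary&amp;gt;
  &amp;lt;content type=&amp;quot;xhtml&amp;quot;&amp;gt;
    &amp;lt;div xmlns=&amp;quot;http://www.w3.org/1999/xhtml&amp;quot;&amp;gt;
      &amp;lt;p&amp;gt;The debate over whether feeds should provide complete
      entries or brief descriptions of them is getting religious
      in its fervor.&amp;lt;/p&amp;gt;
    &amp;lt;/div&amp;gt;
  &amp;lt;/content&amp;gt;
&amp;lt;/entry&amp;gt;
&lt;/code&gt;&lt;/pre&gt;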
&lt;h2 id=&#34;1-comments&#34;&gt;1 Comments&lt;/h2&gt;
&lt;p&gt;By Martin on &lt;a href=&#34;#comment-28&#34;&gt;December 11, 2005 5:18 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks!&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/blogging-about-blogging">blogging about blogging</category>
      
    </item>
    
    <item>
      <title>25 years of database history (starting in 1955)</title>
      <link>https://www.bobdc.com/blog/25-years-of-database-history-s-1/</link>
      <pubDate>Tue, 06 Dec 2005 17:19:01 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/25-years-of-database-history-s-1/</guid>
      
      
      <description><div>A 1981 article in IBM&#39;s Journal of Research and Development gave me a much better perspective on how database systems got where they are.</div><div>&lt;p&gt;A 1981 article in IBM&amp;rsquo;s Journal of Research and Development gave me a much better perspective on how database systems got where they are. The abstract of W.C. McGee&amp;rsquo;s article &lt;a href=&#34;http://domino.research.ibm.com/tchjr/journalindex.nsf/0/18c2b2dadee8a44985256bfa0067f4d8?OpenDocument&#34;&gt;Data Base Technology&lt;/a&gt; tells us that &amp;ldquo;The evolution of data base technology over the past twenty-five years is surveyed, and major IBM contributions to this technology are identified and briefly noted.&amp;rdquo; It put a lot of disjointed facts that I knew in perspective, showing how one thing led to another. All italicizing in indented block quotations below is his.&lt;/p&gt;
&lt;p&gt;Around 1964, the term &amp;ldquo;data base&amp;rdquo; was &amp;ldquo;coined by workers in military information systems to denote collections of data shared by end-users of time sharing computer systems.&amp;rdquo; In earlier days, each application had its own &amp;ldquo;master files&amp;rdquo; of data, so the concept of a data collection that could be shared by multiple applications was a new idea in efficiency.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The data structure classes of early systems were derived from punched card technology, and thus tended to be quite simple. A typical class was composed of &lt;em&gt;files&lt;/em&gt; of &lt;em&gt;records&lt;/em&gt; of a single type, with the record type being defined by an ordered set of fixed-length &lt;em&gt;fields&lt;/em&gt;. Because of their regularity, such files are now referred to as &lt;em&gt;flat files&lt;/em&gt;&amp;hellip; Files were typically implemented on sequential storage media, such as magnetic tape.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Representation of one-to-many relationships was an early challenge.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The processing required to reflect such associations was not unlike punched card processing, involving many separate sorting and merging steps.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When punched cards were the only practical form of memory, the kinds of RAM-based interim data structures such as arrays and lists that we now create on the way to a final result all had to be done as separate piles of cards. While magnetic tape was an obvious step forward, separate runs for each sort operation and extraction must have still been pretty tedious.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Early structuring methods had the additional problem of being hardware-oriented. As a result, the languages used to operate on structures were similarly oriented.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He goes on to describe the evolution of the key &amp;ldquo;data structure classes&amp;rdquo; of databases: hierarchic (what we now call &amp;ldquo;hierarchical&amp;rdquo;), network, relational, and semantic. If you picture a database using each of these models as a collection of tables (or flat files), the great advantage of the relational model was the ability to create run-time connections between the tables—a JOIN. For hierarchic and network databases, the keys that represented links between tables had to be specified when you defined the tables. The advantage of the network model over the earlier hierarchic model was that the pattern of permanent joins did not need to fit into a tree structure. While I once worked at a &lt;a href=&#34;http://www.informationbuilders.com/&#34;&gt;company that made a multi-platform hierarchic database&lt;/a&gt;, I never realized that hierarchic databases were around before computers had hard disks, when everything was done using tapes and punch cards. IBM began designing its IMS hierarchic database in 1966 for the Apollo space program, and it&amp;rsquo;s &lt;a href=&#34;http://www-306.ibm.com/software/data/ims/&#34;&gt;still around today&lt;/a&gt;.&lt;/p&gt;
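&lt;p&gt;That run-time join is easy to show with a modern example; here&amp;rsquo;s a minimal sketch (table and column names invented) using SQLite from Python:&lt;/p&gt;

```python
# Minimal sketch of the relational model's big win: the emp/dept
# association is asserted at query time with a JOIN, not wired into
# the storage structure the way links were in hierarchic and network
# databases.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dept(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp(id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'Research'), (2, 'Sales');
    INSERT INTO emp VALUES (10, 'Codd', 1), (11, 'Bachman', 2);
""")
# The run-time connection between the two tables is made right here:
rows = con.execute("""
    SELECT emp.name, dept.name FROM emp
    JOIN dept ON emp.dept_id = dept.id
    ORDER BY emp.id
""").fetchall()
print(rows)
```

&lt;p&gt;In a hierarchic or network database, that emp-to-dept path would have had to be declared when the tables were defined.&lt;/p&gt;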
&lt;p&gt;Hierarchic databases were bad at storing many-to-many relationships, and in the mid-sixties the network model was developed. The use of the first commercial hard disks in computers enabled this more flexible access to data. While I&amp;rsquo;ve heard of the IDMS product and GE&amp;rsquo;s IDS that McGee mentions, I&amp;rsquo;ve never heard of &amp;ldquo;the TOTAL DBMS of CINCOM, perhaps the most widely used DBMS in the world today.&amp;rdquo;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the mid-1960s, a number of investigators began to grow dissatisfied with the hardware orientation of then extant data structuring methods, and in particular with the manner in which pointers and similar devices for implementing entity associations were being exposed to the users.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mathematical approaches that applied set theory to data management used tables to represent sets of entities with attributes.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The key new concepts in the entity set method were the simplicity of the structures it provided and the use of entity identifiers (rather than pointers or hardware-dictated structures)&amp;hellip; In the late 1960s, [IBM&amp;rsquo;s] E.F. Codd noted that an entity set could be viewed as a mathematical relation on a set of domains D&lt;sub&gt;1&lt;/sub&gt;, D&lt;sub&gt;2&lt;/sub&gt;, &amp;hellip;, D&lt;sub&gt;&lt;em&gt;n&lt;/em&gt;&lt;/sub&gt;, where each domain corresponds to a different property of the entity set.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This led to the relational database model.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Aside from the mathematical relation parallel, Codd&amp;rsquo;s major contribution to data structures was the introduction of the notions of &lt;em&gt;normalization&lt;/em&gt; and &lt;em&gt;normal forms&lt;/em&gt;&amp;hellip; To avoid update anomalies, Codd recommended that all information be represented in third normal form. While this conclusion may seem obvious today [1980!], it should be remembered that at the time the recommendation was made, the relationship between data structures and information was not well understood. Codd&amp;rsquo;s work in effect paved the way for much of the work done on information modeling in the past ten years.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It also paved the way for a 1977 startup in Redwood Shores, California, called Software Development Laboratories, that could completely commit to the relational model, unlike IBM, who had many big customers using IMS and IDMS on IBM mainframes. When writing his paper three years later, McGee saw no reason to mention this little company, which would go on to become Oracle Corporation and play an obviously huge role in the use of relational databases.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Codd characterized his methodology as a &lt;em&gt;data model&lt;/em&gt; and thereby provided a concise term for an important but previously unarticulated data base concept, namely, the &lt;em&gt;combination&lt;/em&gt; of a class of data structures and the operations allowed on the structures of the class&amp;hellip; The term &amp;ldquo;model&amp;rdquo; has been applied retroactively to early data structuring methods, so that, for example, we now speak of &amp;ldquo;hierarchic models&amp;rdquo; and &amp;ldquo;network models&amp;rdquo; as well as the relational model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Many today consider the main choices of data models to be relational databases versus object-oriented models, with relational models having the performance edge because of the regular structure of the data. It&amp;rsquo;s ironic to read McGee describe the performance problems of early relational databases; then, as now, higher levels of abstraction required more cycles—I guess those &amp;ldquo;hardware-dictated structures&amp;rdquo; had a payoff after all!&lt;/p&gt;
&lt;p&gt;Speaking of object-oriented databases, note that McGee&amp;rsquo;s snapshot of the state of the art in 1980 names &amp;ldquo;semantic data structures&amp;rdquo; as the next step after relational databases. He describes Peter Chen&amp;rsquo;s Entity Relationship Model as an example of a semantic model. Academic papers and database (or rather, &amp;ldquo;data base&amp;rdquo;) textbooks of the time are full of talk of the value of this next higher level of abstraction. Some could argue that the object-oriented approach was either competition to or an outgrowth of this work; I don&amp;rsquo;t have the background to make a case for either side. For a bit more irony, it&amp;rsquo;s kind of funny in this day of &amp;ldquo;semantic web&amp;rdquo; advocacy to read the big promises made in the name of &amp;ldquo;semantic data base systems&amp;rdquo; back then.&lt;/p&gt;
&lt;p&gt;McGee&amp;rsquo;s paper covers the development of other important DBMS concepts, often at IBM, such as the concept of the transaction (1976), views and authorization (1975), and report generators. This last development is interesting enough that I&amp;rsquo;ll cover it in a separate essay.&lt;/p&gt;
&lt;p&gt;At the end, after an introduction to the basic problem of distributed databases, McGee&amp;rsquo;s conclusion tells us:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The solution of these problems promises to make the next twenty-five years of database technology as eventful and stimulating as the past twenty-five years have been.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wonder if he considers new database developments from 1980 to 2005 to have been as stimulating as those of the preceding twenty-five years. I&amp;rsquo;d be surprised if he did. While the role of computers in our lives has obviously leapt ahead in that period, the progress in database technology, outside of performance issues and progress on distributed databases, can&amp;rsquo;t compare to all the developments of those first twenty-five years. Advances that led to applications like Google are part of full-text search and information retrieval, a separate field with its own history going back to the early nineteen-sixties. I&amp;rsquo;ll write about that when I finish &lt;a href=&#34;http://www.amazon.com/gp/product/0262025388/&#34;&gt;something else that I&amp;rsquo;m reading&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.gentoo.ro&#34; title=&#34;http://www.gentoo.ro&#34;&gt;mudrii&lt;/a&gt; on &lt;a href=&#34;#comment-25&#34;&gt;December 7, 2005 12:12 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thank you for the article; it was very interesting to read and look into the past.&lt;br /&gt;
Thanks&lt;br /&gt;
Regards&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.SQLSummit.com&#34; title=&#34;http://www.SQLSummit.com&#34;&gt;Ken North&lt;/a&gt; on &lt;a href=&#34;#comment-1109&#34;&gt;August 3, 2007 12:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;W.C. McGee&amp;rsquo;s &lt;em&gt;IBM Systems Journal&lt;/em&gt; article provides an interesting history. There is another publication that provides quite a bit of detail. In March 1976, &lt;em&gt;ACM Computing Surveys&lt;/em&gt; published a special issue about &amp;ldquo;Data-Base Management Systems&amp;rdquo;. Several authors contributed six articles about the evolution of databases, about relational, CODASYL and hierarchical databases, and a comparison between relational and CODASYL (network) database technology. Don Chamberlin of IBM, co-inventor of SQL and XQuery, was the author of the article about relational DBMSs. The March 1976 articles are in the ACM digital archive at:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://portal.acm.org/ft_gateway.cfm?id=984386&amp;amp;type=pdf&#34;&gt;http://portal.acm.org/ft_gateway.cfm?id=984386&amp;amp;type=pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You mentioned &amp;ldquo;Mathematical approaches that applied set theory to data management&amp;rdquo; before going into an explanation of Codd&amp;rsquo;s relational theory. This presentation about technology trends includes information about computing history, including the origins of database and the relational model:&lt;br /&gt;
&lt;a href=&#34;http://www.webservicessummit.com/Trends/TechTrends1/ComputingTrends_part1.html&#34;&gt;Software and Database Technology Trends (slide presentation)&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
You&amp;rsquo;ll find it acknowledges the contribution of David L. Childs. Codd&amp;rsquo;s seminal paper about the relational model followed a Childs paper about set-theoretic data structures. In fact, Codd cited the Childs paper in his own paper.&lt;/p&gt;
&lt;p&gt;We discussed this history in comp.database.theory in 2004. That thread noted Childs&amp;rsquo; work wasn&amp;rsquo;t widely published at the time because it was government-funded research with a restricted audience.&lt;/p&gt;
&lt;p&gt;The relational model evolved over time. For example, by 1990, Chris Date was describing the relational model in three parts (&amp;ldquo;Introduction to Database Systems&amp;rdquo;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;relational data structure&lt;/li&gt;
&lt;li&gt;relational integrity rules&lt;/li&gt;
&lt;li&gt;relational algebra&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Childs&amp;rsquo; 1968 papers and Codd&amp;rsquo;s 1970 paper discussed structure (independent sets, no fixed structure, access by name instead of by pointers) and operations (union, restriction, etc.). Childs&amp;rsquo; papers included benchmark times for doing set operations on an IBM 7090. Codd&amp;rsquo;s 1970 paper introduced normal forms, and his subsequent papers introduced the integrity rules.&lt;/p&gt;
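&lt;p&gt;(To make those operations concrete: if a relation is treated as a plain set of tuples, union and restriction are only a few lines in any language with set types. A toy Python sketch, with made-up sample data, not from either paper:)&lt;/p&gt;

```python
# A relation modeled as a plain set of tuples, in the spirit of the
# set-theoretic structures and operations in the Childs and Codd papers.

def union(r1, r2):
    """Union of two relations over the same attributes."""
    return r1 | r2

def restriction(r, predicate):
    """Restriction (selection): keep only the tuples satisfying a predicate."""
    return {t for t in r if predicate(t)}

# Attributes: (name, instrument) -- made-up sample data.
players = {("Robert Johnson", "guitar"), ("Little Richard", "piano")}
singers = {("Little Richard", "piano"), ("Bessie Smith", "voice")}

everyone = union(players, singers)
guitarists = restriction(everyone, lambda t: t[1] == "guitar")
print(sorted(guitarists))  # -> [('Robert Johnson', 'guitar')]
```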
&lt;p&gt;What&amp;rsquo;s interesting is the University of Michigan connection. Codd, Bing Yao, and Michael Stonebraker were graduates. Some of the work done at the University of Michigan during that time (Childs&amp;rsquo; STDS, Ash and Sibley&amp;rsquo;s TRAMP relational memory) was for the CONCOMP project. It was funded by the US government and the research was available only to &amp;ldquo;qualified requesters&amp;rdquo;.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/technology-past">technology, past</category>
      
    </item>
    
    <item>
      <title>&#34;Turing&#39;s Cathedral&#34; and XSLT</title>
      <link>https://www.bobdc.com/blog/turings-cathedral-and-xslt/</link>
      <pubDate>Fri, 02 Dec 2005 14:55:19 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/turings-cathedral-and-xslt/</guid>
      
      
      <description><div>George Dyson may not know anything about XSLT, but his recent essay about Google, John von Neumann, and biological computation reminded me of the two leading approaches to XSLT development.</div><div>&lt;p&gt;In George Dyson&amp;rsquo;s recent Third Culture essay &lt;a href=&#34;http://www.edge.org/3rd_culture/dyson05/dyson05_index.html&#34;&gt;Turing&amp;rsquo;s Cathedral&lt;/a&gt;, one theme is the value of a shift in programming models toward something closer to biological &amp;ldquo;computation,&amp;rdquo; and Google&amp;rsquo;s potential role in this. The general idea is that instead of writing instructions to act on data at specific locations in memory, which is how computers have worked since John von Neumann first set up the concept of the stored program computer, code would be written to act on certain data when it comes along. Apparently, von Neumann himself was getting interested in the biological model before he died of cancer. Here&amp;rsquo;s a quote from Dyson&amp;rsquo;s essay:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Once a system of template-based-addressing is in place, the door is opened to code that can interact directly with other code, free at last from a rigid bureaucracy requiring that every bit be assigned an exact address. You can (and a few people already are) write instructions that say &amp;ldquo;Do THIS with THAT&amp;rdquo;&amp;ndash;without having to specify exactly Where or When. This revolution will start with simple, basic coded objects, on the level of nucleotides heading out on their own and bringing amino acids back to a collective nest. It is 1945 all over again.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&amp;rsquo;s easy to say &amp;ldquo;this isn&amp;rsquo;t new, event-driven/OO models, etc. etc.&amp;rdquo; What struck me, especially because Dyson&amp;rsquo;s essay uses the word &amp;ldquo;template&amp;rdquo; repeatedly, is how easily the two approaches he describes line up with the push and pull models of XSLT stylesheet development. Or conversely, what struck me is how well the push/pull distinction fit into the grand themes of a ponderous edge.org essay by a big name in the history of intellectual ideas. It makes the essay a fun read for XSLT geeks.&lt;/p&gt;
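&lt;p&gt;For readers who don&amp;rsquo;t know XSLT&amp;rsquo;s two styles, here&amp;rsquo;s a minimal sketch (with hypothetical doc, title, and para element names): a &amp;ldquo;push&amp;rdquo; rule acts on whatever nodes come its way, while a &amp;ldquo;pull&amp;rdquo; rule specifies exactly where to get its data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version=&amp;quot;1.0&amp;quot;
                xmlns:xsl=&amp;quot;http://www.w3.org/1999/XSL/Transform&amp;quot;&amp;gt;

  &amp;lt;!-- Push: a rule that fires for any para element the processor
       hands it, wherever that element comes from. --&amp;gt;
  &amp;lt;xsl:template match=&amp;quot;para&amp;quot;&amp;gt;
    &amp;lt;p&amp;gt;&amp;lt;xsl:apply-templates/&amp;gt;&amp;lt;/p&amp;gt;
  &amp;lt;/xsl:template&amp;gt;

  &amp;lt;!-- Pull: a rule that reaches into the source tree at known
       addresses and asks for exactly the nodes it wants. --&amp;gt;
  &amp;lt;xsl:template match=&amp;quot;/&amp;quot;&amp;gt;
    &amp;lt;h1&amp;gt;&amp;lt;xsl:value-of select=&amp;quot;/doc/title&amp;quot;/&amp;gt;&amp;lt;/h1&amp;gt;
    &amp;lt;xsl:apply-templates select=&amp;quot;/doc/para&amp;quot;/&amp;gt;
  &amp;lt;/xsl:template&amp;gt;

&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;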
&lt;h2 id=&#34;2-comments&#34;&gt;2 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://scottysengineeringlog.net&#34; title=&#34;http://scottysengineeringlog.net&#34;&gt;Scott Hudson&lt;/a&gt; on &lt;a href=&#34;#comment-13&#34;&gt;December 2, 2005 3:25 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Interesting! Since RDF is very much based on specific addressing, and Web 2.0 for that matter, what implications does the essay have? Seems to me that Topic Maps would be better suited under this scenario, since you are not required to have a specific address for topics&amp;hellip;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-14&#34;&gt;December 2, 2005 3:50 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Scott,&lt;/p&gt;
&lt;p&gt;RDF &lt;em&gt;data&lt;/em&gt; is (often) based on specific addressing, but much of the semantic web gospel is about building useful apps around potentially incomplete data&amp;ndash;i.e. whatever you can find, or, in terms of Dyson&amp;rsquo;s essay, whatever comes your way. It&amp;rsquo;s actually one of the things that made Web 1.0 so successful: apps that worked with a data set that didn&amp;rsquo;t necessarily have any normalization or referential integrity.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>No plans in place to upgrade Xalan Java to XSLT 2.0</title>
      <link>https://www.bobdc.com/blog/no-plans-in-place-to-upgrade-x/</link>
      <pubDate>Thu, 01 Dec 2005 19:00:08 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/no-plans-in-place-to-upgrade-x/</guid>
      
      
      <description><div>From one of the horse&#39;s mouths.</div><div>&lt;p&gt;On the Xalan Java user mailing list, Henry Zongaro of the IBM Xalan development team recently &lt;a href=&#34;http://mail-archives.apache.org/mod_mbox/xml-xalan-j-users/200511.mbox/%3cOF3E3464AA.2B1FCCC8-ON852570C9.004FA9AB-852570C9.0051ECD3@ca.ibm.com%3e&#34;&gt;replied&lt;/a&gt; to my request about plans for XSLT 2.0 support in Xalan Java: &amp;ldquo;the Xalan [Project Management Committee] hasn&amp;rsquo;t put in place any plans for XSLT 2.0 support.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;When I asked on &lt;a href=&#34;http://www.biglist.com/lists/xsl-list/archives/200511/msg00624.html&#34;&gt;xsl-list&lt;/a&gt; if anyone knew of current or forthcoming XSLT 2.0 implementations besides Saxon, the general response was that there is piecemeal support developing in various other XSLT processors, so this shouldn&amp;rsquo;t hold up XSLT 2.0&amp;rsquo;s progress toward becoming a W3C Recommendation. Still, I&amp;rsquo;d consider the three leading XSLT engines to be Saxon, Xalan, and libxslt, and if two of those three have no current plans to move beyond XSLT 1.0 then I have to worry about XSLT 2.0&amp;rsquo;s future.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;2005-12-28 update:&lt;/em&gt; I just found out that Xalan 2.2 is bundled with the Java 1.4 JRE, making Xalan pretty ubiquitous. This also explains why, after putting Xalan 2.7 jar files on an Ubuntu Linux machine at home and a Windows machine at work after a hard disk replacement, I couldn&amp;rsquo;t get either to work. The Sagehill documentation for using DocBook + XSLT (which is great all-around) has a &lt;a href=&#34;http://www.sagehill.net/docbookxsl/InstallingAProcessor.html#EndorsedXalan&#34;&gt;nice workaround&lt;/a&gt; for this.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://logopoeia.com/&#34; title=&#34;http://logopoeia.com/&#34;&gt;Michael(tm) Smith&lt;/a&gt; on &lt;a href=&#34;#comment-12&#34;&gt;December 2, 2005 12:47 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I have to worry also. In particular, I think that if the W3C has any concern at all about acceptance of XSLT 2.0 in the open-source community, it should not become a rec unless/until there is an implementation of it based on libxslt/libxml2 (if necessary, forked off from the current libxslt codebase, since the principal developer of libxslt has made it clear that he is personally not interested in implementing support for XSLT 2.0). Or if not libxslt, then in some other non-Java open-source library.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://netapps.muohio.edu/blogs/darcusb/darcusb/&#34; title=&#34;http://netapps.muohio.edu/blogs/darcusb/darcusb/&#34;&gt;Bruce D&amp;rsquo;Arcus&lt;/a&gt; on &lt;a href=&#34;#comment-15&#34;&gt;December 2, 2005 4:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recall a personal communication with David Tolpin, who argued that XSLT 2.0 was likely to be just a single implementation language: Saxon.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s unfortunate we&amp;rsquo;re not seeing evidence yet that he&amp;rsquo;s wrong; I find a lot of the XSLT 2.0 features really useful.&lt;/p&gt;
&lt;p&gt;I probably need to look into reimplementing my citepoc, my XSLT 2.0-based citation processor in some other language (or combination of them; maybe Ruby or Python + XSLT 1.0?).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://copia.ogbuji.net&#34; title=&#34;http://copia.ogbuji.net&#34;&gt;Uche&lt;/a&gt; on &lt;a href=&#34;#comment-16&#34;&gt;December 3, 2005 11:58 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Wow. I&amp;rsquo;m very skeptical of XSLT 2.0, but I&amp;rsquo;m surprised to hear that it&amp;rsquo;s becoming a Saxon-only thing. I know that Daniel V is even more skeptical than I am, so I don&amp;rsquo;t expect there will ever be a libxslt implementation. However, Michael, I think it&amp;rsquo;s grossly overstated to suggest that XSLT 2 should not proceed to REC without a libxslt impl. C&amp;rsquo;mon!&lt;/p&gt;
&lt;p&gt;It underscores the fact that we need to keep EXSLT going, with features borrowed from XSLT 2.0 but made available in 1.0.&lt;/p&gt;
&lt;p&gt;And Bruce, oh yeah. Python+XSLT 1.0 rocks the park. It&amp;rsquo;s the reason I don&amp;rsquo;t even have to care about what happens to XSLT 2. See:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://copia.ogbuji.net/blog/2005-10-20/Processing&#34;&gt;http://copia.ogbuji.net/blog/2005-10-20/Processing&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://w3future.com/weblog/&#34; title=&#34;http://w3future.com/weblog/&#34;&gt;Sjoerd Visscher&lt;/a&gt; on &lt;a href=&#34;#comment-21&#34;&gt;December 6, 2005 5:07 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think MSXML is a leading XSLT engine as well. And Microsoft doesn&amp;rsquo;t have plans to support XSLT 2.0 either.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-23&#34;&gt;December 6, 2005 7:45 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yeah, and while they&amp;rsquo;ve made noises about XQuery support, they&amp;rsquo;re so big on this XLinq thing (Google: &amp;ldquo;did you mean XLink?&amp;rdquo;) that I have to wonder about that as well.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://weblog.philringnalda.com/&#34; title=&#34;http://weblog.philringnalda.com/&#34;&gt;Phil Ringnalda&lt;/a&gt; on &lt;a href=&#34;#comment-53&#34;&gt;December 28, 2005 7:29 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The unclosed &amp;ldquo;update&amp;rdquo; paragraph that&amp;rsquo;s currently blowing up your Atom feed is why I&amp;rsquo;ve never felt very good about including inline XHTML in a feed without also serving it as application/xhtml+xml, to see when it breaks (or, of course, having an XML toolchain involved that wouldn&amp;rsquo;t let you save anything that&amp;rsquo;s not well-formed).&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-54&#34;&gt;December 28, 2005 9:12 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;1. Good point.&lt;/p&gt;
&lt;p&gt;2. Oops, sorry. I write new entries in Emacs with nxml to make sure that it&amp;rsquo;s valid XHTML, and got sloppy in throwing that single paragraph into a MoveableType data entry form, and now I know what goes wrong if I&amp;rsquo;m not careful.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/xslt">XSLT</category>
      
    </item>
    
    <item>
      <title>Getting started</title>
      <link>https://www.bobdc.com/blog/getting-started/</link>
      <pubDate>Wed, 30 Nov 2005 18:31:12 -0500</pubDate>
      
      <guid>https://www.bobdc.com/blog/getting-started/</guid>
      
      
      <description><div>I&#39;ve decided to start a more general-purpose weblog.</div><div>&lt;p&gt;I&amp;rsquo;ve decided, with some encouragement last summer from Eve Maler, to start a general-purpose weblog. I&amp;rsquo;ve done a lot of writing on technical topics in various media over the years, but postponed doing this until I felt more of a sense of purpose about it.&lt;/p&gt;
&lt;p&gt;My use of blogs so far has been more experimental:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.oreillynet.com/pub/au/1191&#34;&gt;Thinking About Linking&lt;/a&gt; on the O&amp;rsquo;Reilly Network focused on linking-related issues. I began it in April of 2003 when a now-defunct mailing list on hypertext fizzled, thinking that I would write my ideas about linking and then whoever wanted to read them could when I wrote them or at some point in the future. My plan was for the list of postings to become a resource for myself and other people doing research on the topic, and that hopefully I&amp;rsquo;d get some feedback on those ideas. I also wanted to see if greater focus in a weblog made it more valuable than one about just anything.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.snee.com/sneetard/&#34;&gt;sneetard&lt;/a&gt; is a blog by and for two people, so that my brother and I can point out funny things we&amp;rsquo;ve seen to each other. This topic formed the bulk of our e-mail before (much to my wife&amp;rsquo;s frustration when I told her that Peter and I e-mailed each other three or four times on a given day and she then asked &amp;ldquo;how is he?&amp;rdquo; and I had no reply), and the blog reduces the chance that one of us will point out something on WFMU&amp;rsquo;s &lt;a href=&#34;http://blog.wfmu.org/&#34;&gt;Beware the Blog&lt;/a&gt; or &lt;a href=&#34;http://www.boingboing.net&#34;&gt;boingboing&lt;/a&gt; that the other has already seen.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Xanadu is a private weblog for about 45 people involved in metadata and the publishing industry.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What all three of these have in common is that none is a blog in which one person writes about anything they want for the whole world to see, like most weblogs. Each has a fairly specific purpose, goofy or otherwise. bobdc.blog will be a lot closer to the one-person-soapbox model, but I will continue to refrain from discussing my new favorite CD or what I had for breakfast. I&amp;rsquo;ll be discussing new and old technology, and a weblog on my own domain, in which I can tweak the stylesheets and play with the settings to understand that technology better, is a more appropriate place to do that. As to old technology, I&amp;rsquo;ll paste and edit the last two paragraphs from my last O&amp;rsquo;Reilly weblog posting here:&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve learned from recent reading that most histories of computers focus on computers that specialized in the most advanced math possible for their time, which was as much of a niche application in 1900 and 1945 as it is now. Many key tasks that we use computers for today—particularly database tasks—had been carried out by automated, usually electrical machines since the nineteenth century, in a separate but parallel history to the Colossus-Mark I-ENIAC-EDVAC history of computers that you typically read about. Did you know that during World War I the U.S. Army could run automated queries against a database to find, for example, French-speaking soldiers with a chauffeur&amp;rsquo;s license? Lately, I&amp;rsquo;ve been fascinated by large-scale database applications that predate any database technology that geeks currently take seriously. A lot of people now consider any pre-relational technology to be prehistoric; that&amp;rsquo;s a pretty limited perspective.&lt;/p&gt;
&lt;p&gt;The history of computing applications, with or without the use of electrical stored-program computers, has a lot to teach us about the problems and innovations we&amp;rsquo;re working on now. I&amp;rsquo;m sure I&amp;rsquo;ll be spouting opinions on other developments as well, especially XML-related ones, which I&amp;rsquo;ve worked with and written about since the days when XML was a &lt;a href=&#34;http://xml.coverpages.org/sgml.html&#34;&gt;four-letter word&lt;/a&gt;. (For example: now that the W3C &amp;ldquo;binary XML&amp;rdquo; effort has been renamed &amp;ldquo;&lt;a href=&#34;http://www.w3.org/XML/EXI&#34;&gt;efficient XML interchange&lt;/a&gt;,&amp;rdquo; there&amp;rsquo;s no reason to argue with it anymore, because who can argue with greater efficiency? Right?) When I see interesting linking-related news, I&amp;rsquo;ll probably add new entries to the O&amp;rsquo;Reilly weblog, but my main weblog from now on will be this one. I hope it&amp;rsquo;s worth reading.&lt;/p&gt;
&lt;h2 id=&#34;7-comments&#34;&gt;7 Comments&lt;/h2&gt;
&lt;p&gt;By &lt;a href=&#34;http://plasmasturm.org/&#34; title=&#34;http://plasmasturm.org/&#34;&gt;Aristotle Pagaltzis&lt;/a&gt; on &lt;a href=&#34;#comment-17&#34;&gt;December 4, 2005 6:02 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hi Bob,&lt;/p&gt;
&lt;p&gt;two requests on technicalia:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Since it’s now /bobdc.blog/bobdcblog.atom rather than /bobdc.blog/atom.xml, could you please update your templates so the link tags in your page headers will point to the correct location? I picked “Atom” from there and got subscribed to the wrong feed, then had to manually fix it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The right way to handle existing subscribers, assuming you have enough control over the server to institute it, would be to send a “301 Moved Permanently” redirect in response to requests for /bobdc.blog/atom.xml, rather than putting a dummy feed there. Aggregators can then automatically update their subscription to the correct location.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Anyway, I look forward to reading you here as much as I did on O’Reilly’s site.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-18&#34;&gt;December 5, 2005 9:55 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Sorry, I&amp;rsquo;m still getting to know what Movable Type puts where in the various templates. I just fixed it in the templates that generate the Atom file, index.html, and the various archive html files, so please let me know if you find any references to atom.xml elsewhere.&lt;/p&gt;
&lt;p&gt;The 301 return code is a good idea, but I don&amp;rsquo;t have that level of control over the HTTP sent by my host provider&amp;rsquo;s server. I was happy to learn about .htaccess, which is how I get Atom files sent with a MIME type of atom+xml. In fact, that&amp;rsquo;s why I renamed the atom feed file to have an extension of atom.&lt;/p&gt;
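&lt;p&gt;The .htaccess directive that does this is a one-liner along these lines (a sketch of the idea, assuming the feed files end in .atom):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;AddType application/atom+xml .atom
&lt;/code&gt;&lt;/pre&gt;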
&lt;p&gt;thanks,&lt;/p&gt;
&lt;p&gt;Bob&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://plasmasturm.org/&#34; title=&#34;http://plasmasturm.org/&#34;&gt;Aristotle Pagaltzis&lt;/a&gt; on &lt;a href=&#34;#comment-20&#34;&gt;December 5, 2005 10:08 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;No problem, just pointing things out. :-) If you can create a .htaccess, you probably &lt;em&gt;can&lt;/em&gt; send a 301:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RedirectMatch 301 ^/bobdc.blog/atom\.xml$ /bobdc.blog/bobdcblog.atom
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That would go in the .htaccess in your document root.&lt;/p&gt;
&lt;p&gt;By Martin on &lt;a href=&#34;#comment-22&#34;&gt;December 6, 2005 5:22 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Do you have any plans for providing a full-content feed? I&amp;rsquo;d appreciate it.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob DuCharme&lt;/a&gt; on &lt;a href=&#34;#comment-24&#34;&gt;December 6, 2005 7:49 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As a matter of fact I was planning on doing that after I get the next post up, which is a pretty long one. Can you point me to any examples of sites that provide both a summary Atom feed and a full one? I want to make sure I get all the link elements in the secondary one right.&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;http://plasmasturm.org/&#34; title=&#34;http://plasmasturm.org/&#34;&gt;Aristotle Pagaltzis&lt;/a&gt; on &lt;a href=&#34;#comment-26&#34;&gt;December 7, 2005 9:08 AM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bob: I assume you mean autodiscovery links? In that case, you’d do something like&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;alternate&amp;quot; type=&amp;quot;application/atom+xml&amp;quot; title=&amp;quot;Atom&amp;quot; href=&amp;quot;http://www.snee.com/bobdc.blog/bobdcblog.atom&amp;quot; /&amp;gt;
&amp;lt;link rel=&amp;quot;alternate&amp;quot; type=&amp;quot;application/atom+xml&amp;quot; title=&amp;quot;Atom, full text&amp;quot; href=&amp;quot;http://www.snee.com/bobdc.blog/bobdcblog-full.atom&amp;quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By &lt;a href=&#34;http://www.snee.com/bob&#34; title=&#34;http://www.snee.com/bob&#34;&gt;Bob&lt;/a&gt; on &lt;a href=&#34;#comment-27&#34;&gt;December 8, 2005 12:54 PM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;No, I meant actual websites. &lt;a href=&#34;http://www.dhemery.com/cwd/&#34;&gt;http://www.dhemery.com/cwd/&lt;/a&gt; is about the only one I&amp;rsquo;ve seen, but I didn&amp;rsquo;t look very hard.&lt;/p&gt;
&lt;p&gt;I now have an Atom feed with full entries at &lt;a href=&#34;http://www.snee.com/bobdc.blog/bobdcblogfull.atom&#34;&gt;http://www.snee.com/bobdc.blog/bobdcblogfull.atom&lt;/a&gt; and will be publicizing it more in a blog entry.&lt;/p&gt;
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2005">2005</category>
      
      <category domain="https://www.bobdc.com//categories/blogging-about-blogging">blogging about blogging</category>
      
    </item>
    
    <item>
      <title>SPARQL queries of the Billboard Hot 100 PART 2</title>
      <link>https://www.bobdc.com/blog/hot100pt2/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://www.bobdc.com/blog/hot100pt2/</guid>
      
      
      <description><div>Current and historical data! REVISE THAT</div><div>&lt;p&gt;constructforwd.rq works, so if I can&amp;rsquo;t get the INSERT version to work just import from constructed triples.&lt;/p&gt;
&lt;p&gt;REMEMBER THE / AT THE END OF THE schema.org PREFIX DECLARATION&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Right now these are just notes that I pasted here so that they would get backed up&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;April blog entry should get up to the running of the Python script, loading those triples, and good queries below. But, this could have been done with an RDBMS. We&amp;rsquo;ll see how to make it a real knowledge graph next week.&lt;/p&gt;
&lt;p&gt;For that, constructforwd.rq works. Comment at top shows curl command that makes all the triples linking to wikidata. With those added to the repo, try a query that shows the youngest person to have a hit in a given week, and think of some other things. Who played the guitar, who played blues, who was influenced by Robert Johnson, who is in the Rock and Roll Hall of Fame (last few ideas all came from &lt;a href=&#34;https://www.wikidata.org/wiki/Q11036&#34;&gt;https://www.wikidata.org/wiki/Q11036&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;From the README:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Files in ~/git/billboard-hot-100-rdf/&lt;/p&gt;
&lt;p&gt;This works, so I can proceed, although I&amp;rsquo;d rather use annotation syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;

SELECT * WHERE {
  ?recording a schema:Recording ;
             dc:title &amp;quot;Cruel Summer&amp;quot; ;
             schema:byArtist &amp;quot;Taylor Swift&amp;quot; .
  &amp;lt;&amp;lt; ?recording h1:charted ?chartDate &amp;gt;&amp;gt; h1:position ?chartPosition .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&#34;calculating-the-values-we-didnt-convert-from-json&#34;&gt;Calculating the values we didn&amp;rsquo;t convert from JSON&lt;/h1&gt;
&lt;h2 id=&#34;chart-position-last-week&#34;&gt;Chart position last week&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

SELECT ?dateLastWeek ?positionLastWeek WHERE {
  {
    # 1. Find the date of the latest chart appearance.
    SELECT ?recording (MAX(?chartDate) AS ?latestChartDate) WHERE {
      ?recording a schema:Recording ;
                 dc:title &amp;quot;Snooze&amp;quot; ;
                 schema:byArtist/rdfs:label &amp;quot;SZA&amp;quot;@en .
      ?recording h1:charted ?chartDate .
    }
    GROUP BY ?recording
  }
  # 2. Find the week before the latest chart appearance
  #    and the position from that week.
  BIND (?latestChartDate - &amp;quot;P7D&amp;quot;^^xsd:duration AS ?dateLastWeek)
  &amp;lt;&amp;lt; ?recording h1:charted ?dateLastWeek &amp;gt;&amp;gt; h1:position ?positionLastWeek .
}
&lt;/code&gt;&lt;/pre&gt;
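&lt;p&gt;The date arithmetic in that query, subtracting a seven-day &amp;quot;P7D&amp;quot;^^xsd:duration to get the previous week&amp;rsquo;s chart date, is easy to sanity-check outside of SPARQL. A rough Python equivalent (the sample date is hypothetical):&lt;/p&gt;

```python
from datetime import date, timedelta

# Rough equivalent of subtracting the "P7D"^^xsd:duration value:
# step back seven days from the latest chart date to get last week's chart.
latest_chart_date = date.fromisoformat("2023-10-21")  # hypothetical value
date_last_week = latest_chart_date - timedelta(days=7)
print(date_last_week.isoformat())  # -> 2023-10-14
```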
&lt;h2 id=&#34;who-had-hits-in-the-most-decades&#34;&gt;Who had hits in the most decades?&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;

SELECT ?artist (COUNT(DISTINCT ?decade) AS ?decades) WHERE {
  ?recording a schema:Recording ;
             schema:byArtist ?artist ;
             h1:charted ?chartDate .
  BIND (SUBSTR(str(?chartDate),1,3) AS ?decade)
}
GROUP BY ?artist
ORDER BY DESC(?decades)
&lt;/code&gt;&lt;/pre&gt;
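&lt;p&gt;The SUBSTR in that query leans on the fact that the first three characters of an ISO 8601 date string identify its decade. A quick Python sketch of the same counting logic (the sample rows are hypothetical):&lt;/p&gt;

```python
from collections import defaultdict

# Same idea as BIND(SUBSTR(str(?chartDate),1,3) AS ?decade): the first
# three characters of an ISO date ("195", "196", ...) name a decade.
chart_rows = [  # (artist, chartDate) -- hypothetical sample rows
    ("Little Richard", "1956-03-03"),
    ("Little Richard", "1962-11-10"),
    ("Little Richard", "1970-05-16"),
    ("SZA", "2023-01-07"),
]

decades_per_artist = defaultdict(set)
for artist, chart_date in chart_rows:
    decades_per_artist[artist].add(chart_date[:3])

counts = {artist: len(decades) for artist, decades in decades_per_artist.items()}
print(counts)  # -> {'Little Richard': 3, 'SZA': 1}
```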
&lt;h2 id=&#34;what-were-the-decades-in-which-little-richard-had-hits&#34;&gt;What were the decades in which Little Richard had hits?&lt;/h2&gt;
&lt;p&gt;After I straighten out the SERVICE call part: revise everything up here to make a URI for the artist from the string and give schema:byArtist a range of that.&lt;/p&gt;
&lt;p&gt;All this could be done with a relational database, so for a  &amp;ldquo;knowledge graph&amp;rdquo; angle see if there is a way to add links to each artist&amp;rsquo;s Wikidata page.&lt;/p&gt;
&lt;h1 id=&#34;passing-artist-name-to-wikidata-to-get-artist-url&#34;&gt;Passing artist name to Wikidata to get artist URL&lt;/h1&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt;
PREFIX wdt: &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;

SELECT ?artistURI WHERE {
	BIND (&amp;#34;The Beatles&amp;#34;@en AS ?artistName)

  SERVICE &amp;lt;https://query.wikidata.org/sparql&amp;gt; {
    
      # Check if the artist is a musician (a human with a value for instrument; singers and 
      # rappers listed with &amp;#34;instrument&amp;#34; of &amp;#34;voice&amp;#34;) or a musical group. 
      { 
        ?artistURI rdfs:label ?artistName;
                   wdt:P31  wd:Q5 ;        # instance of human
                   wdt:P1303 ?instrument . 
      }
      UNION
      {
        ?artistURI rdfs:label ?artistName ;
                   wdt:P31 wd:Q215380 . # instance of musical group
      }
  } # end of SERVICE
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;json2rdf.py&lt;/strong&gt;&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;#!/usr/bin/env python3

# EXPLAIN MORE ABOUT IT HERE

# Run as:
# ./json2rdf.py ../all.json

import json
import sys
import urllib.parse

if len(sys.argv) &amp;lt; 2:
    print(&amp;quot;Enter an input filename as an argument.&amp;quot;)
    exit()

inputFile = sys.argv[1]

jsonBlock = &amp;quot;&amp;quot;

with open(inputFile) as fp:
    for line in fp:
        jsonBlock += line

jsonData = json.loads(jsonBlock)

print(&#39;@prefix h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt; .&#39;)
print(&#39;@prefix schema: &amp;lt;http://schema.org/&amp;gt; .&#39;)
print(&#39;@prefix dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; .&#39;)
print(&#39;@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .&#39;)
print()

for week in jsonData:

    chartDate = week[&amp;quot;date&amp;quot;]

    for recording in week[&amp;quot;data&amp;quot;]:
        artist = recording[&amp;quot;artist&amp;quot;]
        artist = artist.replace(&#39;&amp;quot;&#39;, &#39;\\&amp;quot;&#39;)
        song = recording[&amp;quot;song&amp;quot;]
        song = song.replace(&#39;&amp;quot;&#39;, &#39;\\&amp;quot;&#39;)
        # ID of song is artist + song because two different songs can have
        # the same title, e.g. Taylor Swift&#39;s and Bananarama&#39;s &amp;quot;Cruel Summer&amp;quot;.
        artistSong = artist + song
        # Lose characters that screw up the URI. 
        for c in &#39; &amp;amp;/.\&amp;quot;\&#39;&#39;:
            artistSong = artistSong.replace(c, &#39;&#39;)
        print(&amp;quot;h1:&amp;quot; + urllib.parse.quote(artistSong) + &amp;quot; a schema:Recording;&amp;quot;)
        print(&#39;     schema:byArtist &#39; + &#39;&amp;quot;&#39; + artist + &#39;&amp;quot;@en;&#39;)
        print(&#39;     dc:title &#39; + &#39;&amp;quot;&#39; + song + &#39;&amp;quot;;&#39;)
        print(&#39;     h1:charted &#39; + &#39;&amp;quot;&#39; + chartDate + &#39;&amp;quot;^^xsd:date {| &#39;)
        print(&#39;        h1:position &#39; + str(recording[&amp;quot;this_week&amp;quot;]))
        print(&#39;|}.&#39;)
&lt;/code&gt;&lt;/pre&gt;
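<p>As a sanity check on the conversion, here is the script&rsquo;s ID-cleanup logic isolated as a function (the artist/song values are made-up examples):</p>

```python
import urllib.parse

def recording_id(artist, song):
    # Mirror of the script's ID logic: concatenate artist and song,
    # drop characters that break URIs, then percent-encode the rest.
    artist_song = artist + song
    for c in ' &/.\"\'':
        artist_song = artist_song.replace(c, '')
    return urllib.parse.quote(artist_song)

print(recording_id("Taylor Swift", "Cruel Summer"))  # TaylorSwiftCruelSummer
print(recording_id("Hall & Oates", "Maneater"))      # HallOatesManeater
```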
&lt;h1 id=&#34;adding-in-links-to-wikidata&#34;&gt;Adding in links to Wikidata&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://chartdata.org/faq/&#34;&gt;https://chartdata.org/faq/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Not as Python, but as an INSERT update query with a SERVICE call to Wikidata. First do a CONSTRUCT and a COUNT to see what % of the Hot 100 this works for. With a URI like &lt;a href=&#34;http://www.wikidata.org/entity/Q194220&#34;&gt;http://www.wikidata.org/entity/Q194220&lt;/a&gt; representing the artist, what triple do I add? &lt;a href=&#34;https://schema.org/sameAs&#34;&gt;https://schema.org/sameAs&lt;/a&gt; mentions Wikidata entries as possibilities. Make it clear that unlike owl:sameAs, this is a link to follow and not for inferencing. But why not do that? Admit that there will be mistakes.&lt;/p&gt;
&lt;p&gt;The following works but only for 1000 artists. Maybe there is some GraphDB limit I can reset. See the two commented lines near the top of it.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX wd: &amp;lt;http://www.wikidata.org/entity/&amp;gt;
PREFIX wdt: &amp;lt;http://www.wikidata.org/prop/direct/&amp;gt;

PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
# Can I do anything with the query so that I don&amp;#39;t need the DISTINCT keyword?
# There are 10,773 artist names. Can I do this for all of them? Reset some GraphDB limit? 
SELECT DISTINCT ?artistName ?artistURI WHERE {
	 #BIND (&amp;#34;Ricky Nelson&amp;#34;@en AS ?artistName)
    ?s schema:byArtist ?artistName .

  SERVICE &amp;lt;https://query.wikidata.org/sparql&amp;gt; {
    
      # Check if the artist is a musician (a human with a value for instrument; singers and 
      # rappers are listed with &amp;#34;instrument&amp;#34; of &amp;#34;voice&amp;#34;) or a musical group. 
      { 
        ?artistURI rdfs:label ?artistName;
                   wdt:P31  wd:Q5 ;        # instance of human
                   wdt:P1303 ?instrument . 
      }
      UNION
      {
        ?artistURI rdfs:label ?artistName ;
                   wdt:P31 wd:Q215380 . # instance of musical group
      }
  } # end of SERVICE
}
    
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;more readme notes&lt;/li&gt;
&lt;/ul&gt;
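<p>One workaround worth trying for the 1,000-result cap (assuming it is a result-set limit rather than something deeper in GraphDB): page through the artist names with LIMIT/OFFSET in an inner subselect and run the federated query once per page. A sketch of the query builder; the page size is an assumption:</p>

```python
def build_batch_query(offset, limit=1000):
    # Build one page of the federated artist-lookup query; run it
    # repeatedly, bumping offset by limit, until no rows come back.
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT DISTINCT ?artistName ?artistURI WHERE {{
  {{ SELECT DISTINCT ?artistName WHERE {{ ?s schema:byArtist ?artistName }}
     ORDER BY ?artistName LIMIT {limit} OFFSET {offset} }}
  SERVICE <https://query.wikidata.org/sparql> {{
    {{ ?artistURI rdfs:label ?artistName ;
                  wdt:P31 wd:Q5 ;
                  wdt:P1303 ?instrument . }}
    UNION
    {{ ?artistURI rdfs:label ?artistName ;
                  wdt:P31 wd:Q215380 . }}
  }}
}}
"""

print(build_batch_query(2000))
```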
&lt;p&gt;To point to a Wikidata resource, I used the LoC&amp;rsquo;s &lt;a href=&#34;http://www.loc.gov/mads/rdf/v1#hasCloseExternalAuthority&#34;&gt;http://www.loc.gov/mads/rdf/v1#hasCloseExternalAuthority&lt;/a&gt; (see &lt;a href=&#34;https://id.loc.gov/ontologies/madsrdf/v1.html&#34;&gt;https://id.loc.gov/ontologies/madsrdf/v1.html&lt;/a&gt; and &lt;a href=&#34;https://id.loc.gov/authorities/names/n81127048.html&#34;&gt;https://id.loc.gov/authorities/names/n81127048.html&lt;/a&gt; because they use it for the same thing.&lt;/p&gt;
&lt;p&gt;Turn &lt;a href=&#34;https://github.com/mhollingshead/billboard-hot-100&#34;&gt;https://github.com/mhollingshead/billboard-hot-100&lt;/a&gt; into RDF and query it with SPARQL. Maybe turn it into JSON-LD as an experiment in using that. Have a cron job on snee pull it down and put an RDF version on bobdc.com where I can pull it down. Or make it a local cron job that just pulls from there and loads the RDF into a repo.&lt;/p&gt;
&lt;p&gt;Model: &amp;ldquo;Cruel Summer&amp;rdquo; charted 2024-02-17; RDF-star about that triple has all the other data. But &amp;ldquo;Cruel Summer&amp;rdquo; isn&amp;rsquo;t a good ID because there has been another hit with the same title: &lt;a href=&#34;https://en.wikipedia.org/wiki/Cruel_Summer_(Bananarama_song)&#34;&gt;https://en.wikipedia.org/wiki/Cruel_Summer_(Bananarama_song)&lt;/a&gt;. So for ID http://whatever/Taylor-Swift-Cruel-Summer ? rdf:type of &lt;a href=&#34;https://schema.org/MusicRecording&#34;&gt;https://schema.org/MusicRecording&lt;/a&gt;. byArtist is a property. Actually, the ID would be better as an MD5() of &amp;ldquo;Taylor SwiftCruel Summer&amp;rdquo;; the 32 characters it returns are the shortest of the checksum functions.&lt;/p&gt;
&lt;p&gt;If I leave out last_week, peak_position, and weeks_on_chart, I should be able to calculate those, so try just storing the :charted value of the date and the this_week value.&lt;/p&gt;
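<p>The MD5 idea, sketched in Python (hashlib is in the standard library; hexdigest() returns 32 hex characters):</p>

```python
import hashlib

def recording_id(artist, song):
    # Hash the artist+song string so the ID is short, URI-safe, and
    # still distinguishes same-titled songs by different artists.
    return hashlib.md5((artist + song).encode("utf-8")).hexdigest()

print(recording_id("Taylor Swift", "Cruel Summer"))  # 32 hex characters
```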
&lt;p&gt;GraphDB imported all.ttl in 16 seconds.&lt;/p&gt;
&lt;p&gt;This works, so I can proceed, although I&amp;rsquo;d rather use annotation syntax:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;

SELECT * WHERE {
  ?recording a schema:Recording ;
             dc:title &amp;quot;Cruel Summer&amp;quot; ;
             schema:byArtist &amp;quot;Taylor Swift&amp;quot;@en .

  &amp;lt;&amp;lt; ?recording h1:charted ?chartDate &amp;gt;&amp;gt; h1:position ?chartPosition .
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&#34;calculating-the-values-we-didnt-convert-from-json-1&#34;&gt;Calculating the values we didn&amp;rsquo;t convert from JSON&lt;/h1&gt;
&lt;h2 id=&#34;weeks-on-chart&#34;&gt;Weeks on chart&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;

SELECT (COUNT(?chartPosition) AS ?weeksOnChart) WHERE {
  ?recording a schema:Recording ;
             dc:title &amp;quot;Cruel Summer&amp;quot; ;
             schema:byArtist &amp;quot;Taylor Swift&amp;quot;@en .

  &amp;lt;&amp;lt; ?recording h1:charted ?chartDate &amp;gt;&amp;gt; h1:position ?chartPosition .
}
&lt;/code&gt;&lt;/pre&gt;
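<p>The derived values are easy to check offline: given one recording&rsquo;s (date, position) pairs pulled from the RDF-star annotations, weeks-on-chart is just the count and peak position the minimum (the sample numbers are made up):</p>

```python
# Hypothetical chart history for one recording: (chart date, position).
history = [
    ("2023-07-22", 18),
    ("2023-07-29", 9),
    ("2023-08-05", 2),
    ("2023-08-12", 5),
]

weeks_on_chart = len(history)
peak_position = min(position for _, position in history)

print(weeks_on_chart, peak_position)  # 4 2
```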
&lt;h2 id=&#34;highest-chart-position&#34;&gt;Highest chart position&lt;/h2&gt;
&lt;p&gt;Change the SELECT line of the previous query to:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SELECT (MIN(?chartPosition) AS ?highestPosition) WHERE {
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;chart-position-last-week-1&#34;&gt;Chart position last week&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PREFIX h1: &amp;lt;http://rdfdata.org/hot100#&amp;gt;
PREFIX schema: &amp;lt;http://schema.org/&amp;gt;
PREFIX dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;

SELECT ?dateLastWeek ?positionLastWeek WHERE {

  # 2. Find the week before the latest chart
  # appearance and the position from that week.
  BIND (?latestDate - &amp;quot;P7D&amp;quot;^^xsd:duration AS ?dateLastWeek)
  &amp;lt;&amp;lt; ?recording h1:charted ?dateLastWeek &amp;gt;&amp;gt; h1:position ?positionLastWeek .
  {
    # 1. Find the date of the latest chart appearance.
    SELECT ?recording (MAX(?chartDate) AS ?latestDate) WHERE {
      ?recording a schema:Recording ;
                 dc:title &amp;quot;Snooze&amp;quot; ;
                 schema:byArtist &amp;quot;SZA&amp;quot;@en .
      ?recording h1:charted ?chartDate .
    }
    GROUP BY ?recording
  }
}
&lt;/code&gt;&lt;/pre&gt;
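<p>The &amp;quot;P7D&amp;quot;^^xsd:duration subtraction has a direct Python analogue with datetime.timedelta, handy for checking what date the query should find:</p>

```python
from datetime import date, timedelta

def previous_chart_week(latest):
    # The Hot 100 is weekly, so last week's chart is exactly 7 days
    # earlier, matching the SPARQL "P7D"^^xsd:duration subtraction.
    return latest - timedelta(days=7)

print(previous_chart_week(date(2024, 2, 17)))  # 2024-02-10
```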
&lt;p&gt;Checking on their Wikidata pages &amp;ndash; this works for Keith:
curl &lt;a href=&#34;https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=DESCRIBE%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ189599%3E&#34;&gt;https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=DESCRIBE%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ189599%3E&lt;/a&gt;
But that asks with his Wikidata ID. I want to ask with the name whether there is a page for that person.&lt;/p&gt;
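<p>That curl example is just a percent-encoded DESCRIBE query in the query parameter; building such a URL from any SPARQL string (including a name-based lookup query) is a one-liner with urllib:</p>

```python
import urllib.parse

def wikidata_query_url(sparql):
    # Percent-encode a SPARQL query for Wikidata's endpoint, as in the
    # curl example; a name-based lookup query would go in the same slot.
    endpoint = "https://query.wikidata.org/sparql"
    return endpoint + "?" + urllib.parse.urlencode({"query": sparql})

url = wikidata_query_url("DESCRIBE <http://www.wikidata.org/entity/Q189599>")
print(url)
```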
</div></description>
      
      <author>Bob DuCharme</author>
      <category domain="https://www.bobdc.com//categories/2024">2024</category>
      
      <category domain="https://www.bobdc.com//categories/sparql">SPARQL</category>
      
    </item>
    
  </channel>
</rss>