Queries to explore a dataset

Even a schemaless one.

I recently worked on a project where we had a huge amount of RDF and no clue what was in it beyond what we saw by looking at random triples. I developed a few SPARQL queries to give us a better idea of the dataset’s content and structure, and they are generic enough that I thought they could be useful to other people.

I’ve written about other exploratory queries before. In Exploring a SPARQL Endpoint I wrote about queries that look for common vocabularies in use at a particular endpoint, and how getting a few clues led me to additional related queries. That blog post also mentioned the “Exploring the Data” section of my book Learning SPARQL, which has other generally useful queries.

You can see those listed in the book’s table of contents; they often assume that some sort of schema or ontology is in use. A great thing about SPARQL and RDF, though, is that with no knowledge of a schema or any other clues about a dataset’s contents, simple queries can still let you explore that dataset to see what’s there. Today’s exploratory queries were not included among those that I described above.

Example output for each query uses the Beatles Musicians dataset that I described in SPARQL queries of Beatles recording sessions.

How many triples does this dataset have in all?

SELECT (COUNT(*) AS ?tripleCount) WHERE {
   ?s ?p ?o
}

Definitely a classic, hall-of-fame query. Here is the result for the Beatles musician data after running the query with the Jena arq command-line query engine:

---------------
| tripleCount |
===============
| 4089        |
---------------
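A natural companion to the raw triple count is a count of the distinct terms in each position, which hints at how many resources and predicates those triples are spread across. The following is a minimal sketch; the variable names ?subjectCount, ?predicateCount, and ?objectCount are just illustrative choices, and on a very large dataset the DISTINCT counts may be slow.

```sparql
# Sketch: count the distinct subjects, predicates, and objects in the dataset.
SELECT (COUNT(DISTINCT ?s) AS ?subjectCount)
       (COUNT(DISTINCT ?p) AS ?predicateCount)
       (COUNT(DISTINCT ?o) AS ?objectCount)
WHERE {
   ?s ?p ?o
}
```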

Show all the types being used

Never mind whether any types were declared; which types are actually used? List them, but don’t repeat any.

SELECT DISTINCT ?type WHERE {
   ?s a ?type
}

The result with the Beatles musician data:

----------------------------------------------------
| type                                             |
====================================================
| <http://learningsparql.com/ns/schema/Song>       |
| <http://learningsparql.com/ns/schema/Musician>   |
| <http://learningsparql.com/ns/schema/Instrument> |
----------------------------------------------------
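The type list only covers subjects that have an rdf:type triple. To check whether a dataset also has subjects with no type at all, a query along these lines could help. This is a sketch: the ?anyType variable name and the LIMIT of 20 are arbitrary choices for a first look, not something the queries above depend on.

```sparql
# Sketch: sample some subjects that have no rdf:type triple.
SELECT DISTINCT ?s WHERE {
   ?s ?p ?o .
   FILTER NOT EXISTS { ?s a ?anyType }
}
LIMIT 20
```

An empty result suggests that every subject in the dataset is typed.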

Count instances per type

Of the types that the previous query found being used, how many instances of each are there? This is useful when you are prioritizing what you’re going to do with the data.

SELECT ?type (COUNT(?s) AS ?instanceCount)
WHERE {
   ?s a ?type .
}
GROUP BY ?type

The result:

--------------------------------------------------------------------
| type                                             | instanceCount |
====================================================================
| <http://learningsparql.com/ns/schema/Instrument> | 180           |
| <http://learningsparql.com/ns/schema/Song>       | 293           |
| <http://learningsparql.com/ns/schema/Musician>   | 238           |
--------------------------------------------------------------------
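When the goal is prioritization and there are many more types than the three here, it may help to sort the counts so that the most heavily used types come first. One possible variation, sketched below, adds an ORDER BY clause; it also counts DISTINCT subjects, which only matters if the data repeats a type triple across multiple graphs or sources.

```sparql
# Sketch: instance counts per type, largest first.
SELECT ?type (COUNT(DISTINCT ?s) AS ?instanceCount)
WHERE {
   ?s a ?type .
}
GROUP BY ?type
ORDER BY DESC(?instanceCount)
```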

Count the properties that each type uses

Of the types that were found above, how many different properties does each use?

SELECT DISTINCT ?type (COUNT(DISTINCT ?p) AS ?c)
WHERE {
   ?s a ?type . 
   ?s ?p ?o . 
}
GROUP BY ?type

Number of properties used in the Beatles data, by type:

----------------------------------------------------------
| type                                             | c   |
==========================================================
| <http://learningsparql.com/ns/schema/Instrument> | 2   |
| <http://learningsparql.com/ns/schema/Song>       | 182 |
| <http://learningsparql.com/ns/schema/Musician>   | 2   |
----------------------------------------------------------

The next query will show us why the Song class uses so many properties.

List properties per type

What are these properties that each type uses? This is also useful for prioritization. Note the similarities with and differences from the previous query.

SELECT DISTINCT ?type ?property
WHERE {
   ?s a ?type .
   ?s ?property ?o .
}
ORDER BY ?type ?property

The following is an excerpt from the middle of this query’s result, with <http://learningsparql.com/ns/schema/Song> reduced to s:Song to make it all fit better here. This sample shows that all the different instruments, with all their different spellings, were properties of each song. (Read more about how that worked in my SPARQL queries of Beatles recording sessions blog post.)

| s:Song | <http://learningsparql.com/ns/instrument/guiro>
| s:Song | <http://learningsparql.com/ns/instrument/guitar>
| s:Song | <http://learningsparql.com/ns/instrument/handbell>
| s:Song | <http://learningsparql.com/ns/instrument/handclaps>
| s:Song | <http://learningsparql.com/ns/instrument/harmonica>
| s:Song | <http://learningsparql.com/ns/instrument/harmonium>
| s:Song | <http://learningsparql.com/ns/instrument/harmonyvocals> 
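With this many properties on one type, a plain list says nothing about which ones are common and which are one-offs such as an alternate spelling. A possible refinement, sketched here, counts how many distinct subjects of each type use each property; ?subjectCount is an illustrative name.

```sparql
# Sketch: for each type and property, count the subjects that use that property.
SELECT ?type ?property (COUNT(DISTINCT ?s) AS ?subjectCount)
WHERE {
   ?s a ?type .
   ?s ?property ?o .
}
GROUP BY ?type ?property
ORDER BY ?type DESC(?subjectCount)
```

Properties near the bottom of each type’s list are likely candidates for cleanup or for a lower priority.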

Have a query create a schema for this schemaless data

Consider that:

  • The dataset has no schema but we found types being used
  • We found properties associated with these types
  • Schemas are themselves datasets of triples
  • SPARQL lets you create triples

This all adds up to the ability to create a schema where there isn’t any. In fact, we can do it with a slight variation on the last query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

CONSTRUCT {
   ?type a rdfs:Class .
   ?property a rdf:Property .
}
WHERE {
  ?s a ?type .
  ?s ?property ?o .
}

Note how the WHERE clause of this query is identical to the one from the preceding SELECT query. Here is an excerpt of what it created with the Beatles session data:

s:Instrument  rdf:type  rdfs:Class .
s:Song  rdf:type  rdfs:Class .
s:Musician  rdf:type  rdfs:Class .
i:recorder  rdf:type  rdf:Property .
i:celesta  rdf:type  rdf:Property .
i:tabla  rdf:type  rdf:Property .
i:tenorsaxophone  rdf:type  rdf:Property .
rdfs:label  rdf:type  rdf:Property .
i:harmonica  rdf:type  rdf:Property .

We could go a little further by having the schema use the rdfs:domain and rdfs:range properties to associate the declared properties with the classes that the query found them with:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

CONSTRUCT {
  ?type a rdfs:Class .
  ?property a rdf:Property .
  ?property rdfs:domain ?type .
  ?property rdfs:range ?otype . 
}
WHERE {
  ?s a ?type  .
  ?s ?property ?o .
  OPTIONAL { ?o a ?otype }
}

Along with the schema triples you see above, this new version adds triples like these:

i:banjo  rdf:type    rdf:Property ;
        rdfs:domain  s:Song ;
        rdfs:range   s:Musician .

It also gives the rdfs:label property rdfs:domain values of s:Instrument, s:Musician, and s:Song, which isn’t quite right; as the RDFS spec tells us, “[t]he rdfs:domain of rdfs:label is rdfs:Resource”. The spec also tells us that “the resources denoted by subjects of triples with predicate P are instances of all the classes stated by the rdfs:domain properties”, which in the case of my example means that every instance with an rdfs:label property is an instrument, a musician, and a song.

We clearly don’t want to say that, but if you are creating a schema for a dataset that lacks one, CONSTRUCT queries like this can give you a big head start. Just run one or the other with the dataset and then edit the schema that it creates as you see fit.
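One way to reduce that hand editing is to have the query assert an rdfs:domain only for properties that appear with exactly one type. The following is a sketch of that idea, not the only way to do it. It uses a subquery with HAVING to keep properties whose subjects share a single distinct type, and SAMPLE to pick out that one type; the variable names ?t and ?type are illustrative.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Sketch: declare a domain only when a property is used with a single type.
CONSTRUCT {
  ?property rdfs:domain ?type .
}
WHERE {
  {
    SELECT ?property (SAMPLE(?t) AS ?type)
    WHERE {
      ?s a ?t .
      ?s ?property ?o .
    }
    GROUP BY ?property
    HAVING (COUNT(DISTINCT ?t) = 1)
  }
}
```

Because rdfs:label shows up on instruments, musicians, and songs, it fails the HAVING test and gets no domain triple, while a property used only on songs would still get s:Song as its domain. The result is still a guess based on the data at hand, so it deserves a review before anyone relies on it.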
