Querying a set of named RDF graphs without naming the graphs

A big step toward using named graphs to track provenance.

I’d like to thank everyone who added comments to my last post, Some questions about RDF named graphs. Lee Feigenbaum wrote an entire blog post addressing the issues I raised, and it looks like his Open Anzo triplestore (which I’ll write up in its own post soon) has some nice support for versioning, access control, and replication.

It all worked fine in Sesame and Virtuoso.

Jeni Tennison’s comment was a bit embarassing, because it showed that the answer to my key question was right in the SPARQL specification. I have read the entire spec, but didn’t understand the point of named graphs at the time, so that part didn’t sink in the way it should have.

To review my third question, which built on the first two:

If we’re going to use named graphs to track provenance, then it would make sense to assign each batch of data added to my triplestore to its own graph. Let’s say that after a while I have thousands of graphs, and I want to write a SPARQL query whose scope is 432 of those graphs. Do I need 432 “FROM NAMED” clauses in my query? (Let’s assume that I plan to query those same 432 multiple times.)

I want to put each batch in its own graph so that I can store metadata for each batch. I also want to write a query that retrieves triples from a set of graphs, and when new graphs are added to the set, I don’t want to have to rewrite the query. Based on the example that Jeni pointed me to, I now know how to do this, and I assembled a working demo. It’s a pretty low-level demo; in my next posting I’ll describe one or two more real-world scenarios of applying these ideas, because I’d like some opinions on whether the architecture I have in mind makes sense.

The SPARQL spec explains why I don’t need multiple FROM NAMED clauses to issue a single query against multiple graphs: “the GRAPH keyword is used to match patterns against named graphs. GRAPH can provide an IRI to select one graph or use a variable which will range over the IRI of all the named graphs in the query’s RDF dataset”. So, if I use a variable that ranges over 432 named graphs, I just need a pattern to identify those 432 graphs—ideally, a pattern that still works the following week if I need it to range over 433 graphs, then 434 graphs, and so forth.

For my demo, I created three named graphs and a query that retrieves data from two by using a GRAPH pattern instead of explicitly naming them. Each graph assigns a Dublin Core title value to a book whose identifier is based on its ISBN, and the first two graphs identify themselves as subgraphs of http://www.snee.com/ng/mygraph.rdf.

My first graph:

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdfg="http://www.w3.org/2004/03/trix/rdfg-1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdfg:subGraphOf rdf:resource="http://www.snee.com/ng/mygraph.rdf"/>
</rdf:Description>

<dc:title>XSLT Quickly</dc:title>
</rdf:Description>

</rdf:RDF>


(Thanks also to Jeni for pointing me to the http://www.w3.org/2004/03/trix/rdfg-1/ vocabulary for describing graph relationships.) Second graph:

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdfg="http://www.w3.org/2004/03/trix/rdfg-1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdfg:subGraphOf rdf:resource="http://www.snee.com/ng/mygraph.rdf"/>
</rdf:Description>

<dc:title>XML: The Annotated Specification</dc:title>
</rdf:Description>

</rdf:RDF>


The third graph, whose data shouldn’t show up in the query results, because it’s not a subgraph of http://www.snee.com/ng/mygraph.rdf:

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdfg="http://www.w3.org/2004/03/trix/rdfg-1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<dc:title>SGML CD</dc:title>
</rdf:Description>

</rdf:RDF>


The following test query retrieved the titles from all three graphs, because it has no qualifications about which graphs to retrieve from:

PREFIX dc:<http://purl.org/dc/elements/1.1/>

select ?title WHERE {?s dc:title ?title}


This next query, however, only wants dc:title values from graphs that are subgraphs of http://www.snee.com/ng/mygraph.rdf:

PREFIX dc:<http://purl.org/dc/elements/1.1/>
PREFIX rdfg:<http://www.w3.org/2004/03/trix/rdfg-1/>

select ?title
WHERE { ?g rdfg:subGraphOf <http://www.snee.com/ng/mygraph.rdf>
GRAPH ?g {?s dc:title ?title}
}


It all worked fine in Sesame and Virtuoso. (One note: when you load a graph into Sesame using its workbench interface, you can specify a URL for the graph’s name, so I chose names of the form http://www.snee.com/ng/myredgraph.rdf shown in the sample data files above. When loading the graphs into Virtuoso, its default behavior is to assign graph name URLs based on the URL of the WebDav folder used to load it—see my earlier posting on Getting Started with Virtuoso for more on this—so for the data I loaded into Virtuoso, I used URLs that followed the form http://local.virt/DAV/home/joeuser/rdf_sink/myredgraph.rdf for the rdfg:subGraphOf triples.) Update: it works in OpenAnzo as well. When I first tried that, I didn’t know about the -A command line option when querying; see also Lee’s comment below. More on OpenAnzo in an upcoming post.

Now I know that the SPARQL standard and multiple open source implementations support the querying of a set of named graphs without requiring me to list them all, so the use of a large amount of graphs doesn’t sound so unwieldy. This is an important application building block, and next I’ll describe some things that sound sensible to build.

Bob, you said:

“""
The SPARQL spec explains why I don’t need multiple FROM NAMED clauses to issue a single query against multiple graphs: “the GRAPH keyword is used to match patterns against named graphs. GRAPH can provide an IRI to select one graph or use a variable which will range over the IRI of all the named graphs in the query’s RDF dataset”. So, if I use a variable that ranges over 432 named graphs, I just need a pattern to identify those 432 graphs—ideally, a pattern that still works the following week if I need it to range over 433 graphs, then 434 graphs, and so forth.
"”"

I’m not sure this is quite correct. From the SPARQL specification’s point of view, you (or your SPARQL engine) do indeed need to specify the graphs that comprise the RDF dataset against which you are querying.

What makes this tractable is that some stores will, by default, make the default graph the RDF-merge (union, basically) of all of the graphs in the store and also add all graphs in the store as named graphs in the dataset.

Other stores (e.g. Open Anzo) provide a “magic” URI to stand for “all graphs”, or introduce the concept of named datasets, computed datasets, etc. to address this challenge.

Lee

I thought you might also be interested in the work done in CORESE for graph and paths handling :
http://www-sop.inria.fr/edelweiss/software/corese/v2_4_1/manual/next.php
in particular the nested graph and recursive querying mechanism.

on the RDF side these extensions build on a previous member submission:
http://www.w3.org/Submission/rdfsource/

Cheers,

\

In response to Lee’s comment:
“I’m not sure this is quite correct. From the SPARQL specification’s point of view, you (or your SPARQL engine) do indeed need to specify the graphs that comprise the RDF dataset against which you are querying.”

No, that’s not actually true. I agree that this *seems* to be what the spec is implying, but it’s not true.

If you use a variable in a GRAPH statement, then the variable will range over the dataset names - all graph names in scope. FROM NAMED does create a scope, but if it is not used in the query then it’s all the graph names.

I made this mistake as well, and was trying to limit my SPARQL implementation to not allow unbound variables in the GRAPH position, but Andy Seaborne set me straight. (Andy is the editor of the SPARQL spec document, and the implementor of Jena of SPARQL for Jena).

Paul, I beg to differ.

The SPARQL specification has no concept of “all the graph names”. In the absence of any explicitly defined dataset, the query is run against a dataset that is chosen by your implementation (your SPARQL engine).

For *some* implementations, this means that the query is run against all the graphs that the engine knows about. For other implementations, this means that the query is run against an empty dataset. For still others, an engine may be hardwired to query specific graphs in the absence of an explicitly given dataset.

This is a very common misunderstanding about the SPARQL specification. I’m guessing that Andy was either speaking specifically about what Joseki/ARQ do or that there was a miscommunication there.

best,
Lee

Hi Lee,

My perspective on this comes from an email conversation with Andy, partly on a mailing list, and partly in private. The context was when I was implementing SPARQL on Mulgara (Andy is on the Mulgara mailing list).

I’m sure Andy won’t mind if I repeat part of one of our private emails here. I had been referring to an unbound variable named “x” which referred to the graph:

?x ranges over the dataset names - all graph names in scope.

FROM NAMED may have created a scope with certain names, but FROM NAMED is not necessary in a query anyway.

Just query with

SELECT ?g { GRAPH ?g { ?s ?p ?o } }

All names of graphs.

In a later (public) email, he replies to a comment from me:

> past, and wasn’t satisfied that it was allowed. Also, when I wrote my
> email this morning, then I looked it up again (so I didn’t look like
> an idiot - as I am wont to do) and again, I wasn’t satisfied that it
> was allowed.

Paul,

Examples can’t be exhaustive! We tried to put in as much as we could but not every single case can be an example.

There are tests in the test suite: graph/graph-02 to -09 or thereabouts

Section 12 gives the definition of the evaluation of a graph pattern.

He also gave a final followup:

> I take it that 12.5 is the area of most relevance here? Specifically,
> the definition of:
> eval(D(G), Graph(var,P)) = …..
> ???

Yes - that’s it: the example in section 8.3.4 is:

1 PREFIX dc:
2 PREFIX foaf:
3
4 SELECT ?name ?mbox ?date
5 WHERE
6 { ?g dc:publisher ?name ;
7 dc:date ?date .
8 GRAPH ?g
9 { ?person foaf:name ?name ;
10 foaf:mbox ?mbox .
11 }
12 }

which is the algebra expression:

1 (base
2 (prefix ((dc: )
3 (foaf: ))
4 (project (?name ?mbox ?date)
5 (join
6 (BGP
7 (triple ?g dc:publisher ?name)
8 (triple ?g dc:date ?date)
9 )
10 (graph ?g
11 (BGP
12 (triple ?person foaf:name ?name)
13 (triple ?person foaf:mbox ?mbox)
14 ))
15 ))))

And being a applicative-order evaluation (graph ?g … ) is evaluated as in 12.5 " Evaluation of a Graph Pattern " then participates in the join. That ?g is unconstrained at the point of evaluating the (graph …).

D, the dataset, can come from the protocol, the query or the execution environment (in that order of priority).

It’s entirely possibly that I’m misinterpreting what Andy is saying here, but my reading of it is that an unbound variable using in a GRAPH expression will evaluate to all known graphs. Also, I’m not saying that Andy is the definitive source for this information, but he is certainly clearer on it than I am. :-)

In SPARQL, a query executes over a dataset. GRAPH accesses the names in the dataset. If it’s “GRAPH ?g” and then ?g ranges over all names in the dataset. This is in “Definition: Evaluation of a Graph Pattern” (third case). ?g may be constrained in other parts of the query.

Where the dataset comes from is either from a description or the dataset is decided by the query service.

There are two ways to describe the dataset - in the query with FROM/FROM NAMED, or the protocol with default-graph-uri/named-graph-uri. If the dataset is described, the the dataset must be what is described and not more (or worse, different).

If the dataset is not described whatever the service provides, it can be set up is a variety of ways - the case of having the default graph being the manifest and other metadata for the named graph sis quite an interesting set up for the provenance situation.

Bob’s query

PREFIX dc:
PREFIX rdfg:

select ?title

WHERE { ?g rdfg:subGraphOf

GRAPH ?g {?s dc:title ?title}

}


will work on any SPARQL system that can be set up with the dataset having all the named graphs in it and the default graph being the manifest (and other details) of named graphs.

A query service can reject any query it chooses not to answer. Maybe it only processes queries with a description, maybe it only processes queries without a description. The latter case is important because it is the case of providing access to a published dataset over the web. Here, just providing query over that one dataset is what the service does. It might even execute a query if the description matches but there is no obligation for it to do that. After all, the dataset may not be describable if it’s some large relational database fronted by a SPARQL query service.

A graph can appear more than once in the dataset under different names or once as the default graph and also with a name. The default graph or any named graph may be some calculated form of other graphs such as the RDF merge. What matters is that at the time the query is executed, there is a dataset, and that the dataset has a default graph and zero or more named graphs.

Aside: the test assume the default if not mentioned is the empty graph. That’s just convenience for the tests and compatible with the fact that if there is a description of the dataset and the default graph is not otherwise mentioned (FROM, default-graph-uri) then it is empty. The test provide another way to describe the dataset for the purposes of the tests.

I was quoted as saying:

"?x ranges over the dataset names - all graph names in scope." and the “all” here is all names in the dataset because a query is executed over a dataset. See “Definition: Evaluation of a Graph Pattern”.

When Paul says: “my reading of it is that an unbound variable using in a GRAPH expression will evaluate to all known graphs." this is true where “all known” means all known in the dataset. Not all graphs on the web.

LeeF said: “From the SPARQL specification’s point of view, you (or your SPARQL engine) do indeed need to specify the graphs that comprise the RDF dataset against which you are querying." is also true but the tricky part is “or your SPARQL engine”. It seems to me that the issue is about what is the dataset. While one could have a dataset which is “all graphs on the web, by name” or some such dataset, no implementation could realise that but still we have a dataset specified. The requirement is to have a dataset when the query executes and to have a set of names that can be iterated over for the evaluation of GRAPH with a variable that is not constrained in any way could not be met.

Some systems have a notion of the graphs that they have in their storage. The dataset description is interpreted as meaning “pick graphs out of the collection of stored graphs”. Nothing wrong with that model, but it’s not required by the specs.

I do think it would be wrong to be overly prescriptive, especially about the default graph. Some systems foce that to be teh RDf merge of the named graphs, and that would preclude Bob’s use case of the default graph having the manifest information.

Finally, we have a new working group running. All that matters is the text in the documents, and not the intent of the text, what implementations do nor what authors thought they meant when writing the text. The workign group comments list is the place to send suggestions for improving the text. Hint, hint.

Thanks Andy! My use of the default graph for manifest information was just off the top of my head when coming up with the example. It sounds like a specific named graph for these triples would be a good idea.

Bob