Avoiding accidental cross products in SPARQL queries

Because one can sneak into your query when you didn't want it.

Check the variables in your triple patterns that are connecting up sets of triples with other sets. They may not be doing a good job of it.

Have you ever written a SPARQL query that returned a suspiciously large number of results, especially with too many combinations of values? You may have accidentally requested a cross product. I have spent too much time debugging queries where this turned out to be the problem, so I wanted to talk about avoiding it.

Let's look at a simple example. The following RDF triples show the names of three people and the departments where they work:

@prefix d:  <http://learningsparql.com/ns/data#> .

d:emp1 d:name "jane" .
d:emp2 d:name "joe" .
d:emp3 d:name "pat" .

d:emp1 d:dept "shipping" .
d:emp2 d:dept "receiving" .
d:emp3 d:dept "accounting" .

The following SPARQL query attempts to list each person and their department:

PREFIX d:  <http://learningsparql.com/ns/data#> 

SELECT ?name ?dept WHERE {
  ?employee d:name ?name .
  ?emp d:dept ?dept .
}

The result of this query somehow shows that all the employees work in all the departments:

-------------------------
| name   | dept         |
=========================
| "pat"  | "shipping"   |
| "pat"  | "receiving"  |
| "pat"  | "accounting" |
| "jane" | "shipping"   |
| "jane" | "receiving"  |
| "jane" | "accounting" |
| "joe"  | "shipping"   |
| "joe"  | "receiving"  |
| "joe"  | "accounting" |
-------------------------

Why? Experienced SPARQL users probably already saw the problem: the query's first triple pattern says “find any triples where the predicate is d:name and store the subject in ?employee and the object in ?name”. The second triple pattern should ask for the department of any employee that we found in the first triple pattern (?employee). Instead, it's just asking for all triples with d:dept as the predicate and binding the subject and object to the ?emp and ?dept variables, which have nothing to do with the first triple pattern. If the second triple pattern had used the variable name ?employee instead of ?emp, the query would have asked for resources that matched both triple patterns, and would have given this result:

-------------------------
| name   | dept         |
=========================
| "pat"  | "accounting" |
| "jane" | "shipping"   |
| "joe"  | "receiving"  |
-------------------------
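
For reference, here is the corrected query described above, with ?employee used in both triple patterns so that they join on the same subject:

```sparql
PREFIX d:  <http://learningsparql.com/ns/data#>

SELECT ?name ?dept WHERE {
  # Both patterns share ?employee, so each name is paired
  # only with the department of the same resource.
  ?employee d:name ?name .
  ?employee d:dept ?dept .
}
```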

I got three times as many results as I wanted because I created the new variable name ?emp when I should have reused the existing one, ?employee. Avoiding this kind of variable name sloppiness is why some programming languages force you to declare variables. It's also why others that don't, such as JavaScript and Perl, offer an optional "use strict" setting that enforces this extra bit of housekeeping.

When the Franz AllegroGraph triplestore sees a cross product, it can issue a query warning called warn-bgp-cross-product, which I'll bet has saved their developers a lot of wasted time. The documentation for this warning has a nice summary of what causes cross products: “there are patterns in the query that have disjoint sets of variables which will cause the SPARQL engine to find all possible matches between the sets which can lead to very large solution sets”. (Some PDF class notes for a Colorado State University database class show how this works with relational databases.)
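
The arithmetic behind this is easy to check on the sample data: with no shared variable, the three d:name matches and the three d:dept matches combine into 3 × 3 = 9 solutions. As a minimal sketch, this version of the toy query above counts the rows instead of listing them (the ?rows variable name is just for illustration):

```sparql
PREFIX d:  <http://learningsparql.com/ns/data#>

# Count the solutions produced by the two unconnected patterns.
SELECT (COUNT(*) AS ?rows) WHERE {
  ?employee d:name ?name .
  ?emp d:dept ?dept .        # ?emp is never joined to ?employee
}
```

Against the six triples shown earlier this should return 9. With ?employee in both patterns, the count drops to 3. A row count that looks like the product of two smaller counts is a good hint that two parts of a query are not connected.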

In my example cross product above, note that the offending variable names are not mentioned in the SELECT clause and therefore do not appear in the results. I have found that this can add plenty to the time it takes to identify a cross product as the source of a problem, because these mismatched variables are like cogs that are not meshing together correctly deep inside a machine where you can't see them very well. This is especially true in a larger, more complex query; my query above is a small toy example to make the problem as clear as possible.
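
One way to bring those hidden variables into view while debugging is to temporarily replace the SELECT list with SELECT *, which returns every variable bound in the query. Here is a sketch against the sample data above:

```sparql
PREFIX d:  <http://learningsparql.com/ns/data#>

# SELECT * exposes the join variables that the original
# SELECT ?name ?dept left out of the results.
SELECT * WHERE {
  ?employee d:name ?name .
  ?emp d:dept ?dept .
}
```

The output now includes ?employee and ?emp columns alongside ?name and ?dept. Rows where the two hold different values (d:emp1 next to d:emp2, for example) show that the two triple patterns are not actually joined.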

One larger, more complex query where this happened was the second SPARQL query in my Document analysis with machine learning blog entry last month. Not only did it cost me extra hours of work; the results were so bloated that arq was running out of memory, so I started doing the query in Blazegraph instead. When I noticed the same cosine similarity figure coming up with dozens of recipe pairings, this was the first warning that I had a cross product problem, just like with the repetitive patterns of all employees working in all departments above. I had no problem running the query with arq once I found the mismatched variable names and straightened out the cross product problem.

So, if you see such repetition and get suspicious, check the variables in your triple patterns that are connecting up sets of triples with other sets. They may not be doing a good job of it.