SPARQL in a Jupyter Notebook
For real this time.
A few years ago I wrote a blog post titled SPARQL in a Jupyter (a.k.a. IPython) notebook: With just a bit of Python to frame it all. It described how Jupyter notebooks, which have become increasingly popular in the data science world, are an excellent way to share executable code and the results and documentation of that code. Not only do these notebooks make it easy to package all of this in a very presentable way; they also make it easy for your reader to tweak the code in a local copy of your notebook, run the new version, and see the result. This is an especially effective way to help someone understand how a given block of code works.
When these notebooks were first invented they were known as IPython (“Interactive Python”) notebooks. At the time, all the executable code was Python, but since then the renaming to “Jupyter” has been accompanied by support for more and more languages—even multiple languages in the same notebook. It wasn’t supporting SPARQL yet when I wrote the post described above, but my “just a bit of Python to frame it all” automated the handoff of SPARQL queries to the rdflib Python library so that ideally even someone who didn’t know Python could enter SPARQL queries into a notebook and see the results as part of the notebook.
The wait for the real thing is over. Paulo Villegas has released a SPARQL kernel for Jupyter notebooks that lets us run queries natively, and I have been having some fun with it. The project’s sparql-kernel git repository has good documentation in its readme file. There’s no need to clone the project; the following three commands installed the sparqlkernel files locally for me, installed those into my copy of Jupyter, and then started up Jupyter.
pip install sparqlkernel
jupyter sparqlkernel install --user
jupyter notebook
At this point I was looking at Jupyter in my browser, and when I clicked the “New” button to create a new notebook I saw SPARQL as a choice right under “Python 3”.
While the SPARQL processing in my earlier post about Jupyter was handled by rdflib, this SPARQL kernel functions more as a very nice interface to a SPARQL endpoint that you specify. Or endpoints that you specify—as we’ll see, it’s very easy to switch between endpoints in one notebook. You specify the endpoint to talk to using a Jupyter “magic” command, which is a special command that begins with a percent sign.
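As a minimal sketch of what such a notebook cell looks like (the magic names come from the sparql-kernel readme; the endpoint URL and the query itself are just illustrative examples, not queries from my notebook):

```sparql
%endpoint https://query.wikidata.org/sparql
%format JSON

# An arbitrary example query: pull back five triples from the endpoint
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5
```

Run the cell and the results appear right below it, formatted as a table; change `%endpoint` in a later cell and subsequent queries go to the new endpoint.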
Once I was set up, I created a new notebook titled Jupyter and SPARQL and Dort or Dordrecht, where you can see the steps I took to retrieve triples about a famous J.M.W. Turner painting from two different endpoints. (Another great thing about Jupyter: GitHub understands the notebook format well enough to host notebooks so that they look the same as they do in a browser pointed at a local Jupyter server. Sometimes when I follow the link to my new notebook, after a minute it tells me "Sorry, something went wrong" and displays a "reload" button; after clicking that button it usually works pretty quickly.) You can see the results of my queries right in the notebook, and if you download it and install Jupyter and sparql-kernel you can modify the queries and rerun them yourself. (For the notebook's last query you'd need a triplestore such as Fuseki running locally at localhost:3030. It doesn't even need any data in it; as you'll see in my new notebook, I used Fuseki to execute a federated query across the other two endpoints.)
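To give a sense of the shape of that last step, here is a sketch of a federated query cell. The endpoint URLs, prefixes, and triple patterns are placeholders for illustration, not the actual query from my notebook; the point is that the Fuseki dataset itself can be empty, since the SERVICE clauses fetch everything from the remote endpoints:

```sparql
%endpoint http://localhost:3030/ds/query

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>

# Fuseki joins the results of the two remote SERVICE blocks;
# the local dataset contributes no triples of its own.
SELECT ?label ?abstract WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    ?painting rdfs:label ?label .
  }
  SERVICE <https://dbpedia.org/sparql> {
    ?painting dbo:abstract ?abstract .
  }
}
LIMIT 10
```

(In practice you would constrain `?painting` to a specific resource before joining across endpoints, as I do in the notebook; an unconstrained pattern like this one would ask the remote endpoints for far too much.)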
While creating my new notebook, I sometimes caught myself about to put a new query into its own file and send it off to the endpoint with curl just to make sure it worked properly, because that's such a reflex reaction for me. For trying out queries and iteratively tuning them, though, running them right in the notebook is much easier than editing a text file and sending it to the endpoint with a shell command, because I can see the query and its results (or errors) in the same glance. Despite being a diehard Emacs guy, I'm pretty confident that this will be my new routine from now on. When I develop multiple related queries in parallel, although I love Emacs' sparql-mode (which also connects to an endpoint and shows your result right with your query), I still have to keep track of which query is in which buffer. In a Jupyter notebook, I can put nicely formatted text blocks before and after each query to describe what each query is supposed to do and to annotate my progress with each one.
I don’t want to write things here that are redundant with my new notebook about the Turner painting or with my earlier blog entry about Jupyter, so I encourage you to read the latter if you’d like to learn more about why Jupyter notebooks are so great and the former if you want to see the new powers that sparql-kernel adds to Jupyter for SPARQL users. I know I’m going to be a much more regular user of this nice tool.
(Note: just yesterday I learned that Jupyter’s competitor Apache Zeppelin also has a SPARQL plugin, so that is something to check out as well.)