SPARQL and Amazon Web Services' Neptune database

Promising news for large-scale RDF development.

December 31, 2017

Amazon recently announced Neptune as an AWS service. As its home page describes it,

Amazon Neptune is a fast, scalable graph database service. Neptune efficiently stores and navigates highly connected data. Its query processing engine is optimized for leading graph query languages, Apache TinkerPop™ Gremlin and the W3C’s RDF SPARQL. Neptune provides high performance through the open and standard APIs of these graph frameworks. And, Neptune is fully managed, so you no longer need to worry about database management tasks such as hardware provisioning, software patching, setup, configuration, or backups.

Apart from the practical aspects of the scalable yet convenient use of RDF and SPARQL that Neptune will enable, it’s exciting to see such a high-profile acknowledgment of SPARQL as a serious development tool. Many organizations already knew this, but judging from the reaction to the Neptune announcement on Twitter, many more people are finally understanding this.


Rumors have been flying that the Blazegraph triplestore may play some role in Amazon’s new graph store. As Stardog CEO Kendall Clark wrote on ycombinator recently, “Amazon acquired the domains, etc. Many former Blazegraph engineers are now Amazon Neptune engineers according to LinkedIn, etc. It was rumored widely in the graph db world fwiw.” Yahoo Knowledge Graph science and data lead Nicolas Torzec responded to Kendall’s comment with a link showing that Amazon now owns the Blazegraph trademark. (Blazegraph’s website hasn’t shown much activity in a while, with the latest post on their Press page being from May of last year.)

May of last year was also when I wrote Trying out Blazegraph about my positive experiences with this graph store, and after the recent announcement I tweeted that if Blazegraph was part of Neptune, it would be very cool if that included Blazegraph’s inferencing. Pavel Klinov replied by pointing out a Neptune announcement video where they explicitly say that inferencing is not supported.

This hour-long “AWS re:Invent 2017: NEW LAUNCH! Deep dive on Amazon Neptune” video included some other interesting points. Because Neptune supports property graphs via TinkerPop as well as SPARQL, early in the video the speaker provides some background on property graphs versus RDF. He devotes a good portion of his presentation to talking through an SQL query for people who are unfamiliar with graph databases and then covering comparable SPARQL and TinkerPop Gremlin queries.

The plug from Thomson Reuters early in the video was nice to see, coming from a large well-known organization that has been taking SPARQL seriously for a while. Later in the video, one slide’s use of Thomson Reuters’ PermID vocabulary with the geonames vocabulary in the same triple was especially nice to see. While the extent of RDF’s usage continues to be a pleasant surprise for me, I’m also surprised by how many people only use it for the simplicity of the triples data model; they’re missing the data integration power of the ability to mix and match the wide variety of existing vocabularies (and hence data sources) with their own data.
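To make the mix-and-match idea concrete, here is a minimal Python sketch that treats triples as plain tuples. I didn't copy anything from the slide: the namespace IRIs, the isIncorporatedIn property, and the organization and place identifiers are all assumptions I picked for illustration. The point is only that a triple whose predicate comes from one vocabulary and whose object comes from another lets you hop from one data source into the other.

```python
# Assumed namespace IRIs and identifiers, chosen for illustration only;
# they are not taken from the slide in the video.
PERMID_ORG = "http://permid.org/ontology/organization/"
GN = "http://www.geonames.org/ontology#"

ORG = "https://permid.org/1-0000000000"       # hypothetical organization ID
PLACE = "http://sws.geonames.org/6252001/"    # assumed GeoNames place IRI

triples = [
    # Predicate from a PermID-style vocabulary, object from GeoNames.
    (ORG, PERMID_ORG + "isIncorporatedIn", PLACE),
    # A triple that would come from the GeoNames side of the data.
    (PLACE, GN + "name", "United States"),
]

def incorporated_in_names(data, org):
    """Follow the shared place IRI from the organization to its place names."""
    places = {o for s, p, o in data
              if s == org and p == PERMID_ORG + "isIncorporatedIn"}
    return [o for s, p, o in data if s in places and p == GN + "name"]

print(incorporated_in_names(triples, ORG))
```

Because both sources use the same IRI for the place, no mapping table is needed to connect them, which is the integration benefit that a triples-only view of RDF misses.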

The video’s second speaker talks more about Neptune’s enterprise features such as fast failover, encryption at rest and in transit, and backup and restore, which are all great things to see in a cloud-based triplestore. Neptune offers a lot of room; as this speaker mentions, “Storage volumes are not required to be statically allocated; they actually grow automatically up to a maximum size of 64 terabytes.” The ability to restore a dataset to its state from a previous point in time also sounds very useful.

Once the speakers started taking questions, it looked to me like there were more questions about RDF and SPARQL than there were about TinkerPop and Gremlin. The former included the question about inferencing, which got a response (as Pavel had pointed out to me) of “we do not have in-database inference currently… we are very interested in use cases for inferencing.” They also said that Neptune’s underlying graph engine was custom-built by Amazon as a graph system, which left me more curious about the potential role of Blazegraph in the released version of Neptune. (Maybe “by Amazon” includes former Blazegraph engineers.)

Some more interesting facts from the question and answer session:

Timeouts of SPARQL endpoints can be configured.
They have tested it with pretty close to a hundred billion triples. (Who remembers the Billion Triples Challenge?)
Neptune supports release 1.1 of SPARQL Query and Update, and the endpoint supports 1.1 of the SPARQL Protocol.
It supports named graphs, which will be particularly handy for managing multiple datasets when dealing with data at that scale.
While the preview configuration of Neptune does not allow federated SPARQL queries for security reasons, they “do see a lot of use cases for SPARQL federation.”
While Neptune currently doesn’t support “schema concepts or constraints in the graph schema,” it “is something that [they] have on their roadmap.” The Amazon rep first responded to this question by asking if the questioner was talking about something like SHACL; although they do not currently support this, just hearing him mention SHACL showed me that this great new standard is gaining some mindshare out there.
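Since the endpoint speaks the standard SPARQL 1.1 Protocol and supports named graphs, talking to it should look like talking to any other SPARQL endpoint. Here is a rough Python sketch, using only the standard library, of what a named graph query sent as a URL-encoded POST might look like. The endpoint URL and graph IRI are made-up placeholders, and the code only builds the request without sending it, because I haven't tried this against Neptune yet.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL; a real cluster would have its own address.
ENDPOINT = "https://neptune.example.com/sparql"

def graph_query(graph_iri, limit=10):
    """Return a SPARQL 1.1 query that reads triples from one named graph."""
    return (
        "SELECT ?s ?p ?o\n"
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}\n"
        f"LIMIT {limit}"
    )

def build_request(endpoint, query):
    """Build, but do not send, a SPARQL Protocol query via URL-encoded POST."""
    body = urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")

# Placeholder graph IRI, assumed for this example.
query = graph_query("http://example.org/graphs/customers")
req = build_request(ENDPOINT, query)
print(req.get_method(), req.full_url)
print(query)
```

Passing the request to urllib.request.urlopen would send it. Keeping each dataset in its own named graph means that the same function can target any one of them by changing only the graph IRI.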

I’m looking forward to playing with SPARQL on AWS Neptune and will certainly be reporting back about my experiences here.
