Selling RDF technology to Big Data

A clue: what we're selling is just that—RDF technology.

I think I’ve figured it out. (This is a follow-up to my previous post “SPARQL and Big Data (and NoSQL): How to pursue the common ground?”) Here’s how to sell the Semantic Web and Linked Data visions to the Big Data folk: don’t. Sell them on RDF technology.

The process of selling a set of technologies usually means selling a vision, getting people psyched about that vision, and then telling them about the technology that implements that vision. For RDF technology (by which I mean RDF, SPARQL, and optionally, RDFS and OWL), the vision for many years was the Semantic Web. Some people in that community eventually decided that an easier vision to sell was Linked Data. (Linked Data may not always include RDF technology—when Tim Berners-Lee added “(RDF*, SPARQL)” to his list of Linked Data principles, it became the filioque controversy of the Linked Data community—but the boundaries of this or other sets of technologies I’m discussing are not the issue here. The point is, it’s very common to use the Linked Data vision to sell people on the value of using URIs, triples, and SPARQL together.)

Big Data is itself a vision. Note how it’s spelled in initial caps, like “Semantic Web” and “Linked Data,” and features prominently in sales pitches from large and small system vendors. The 166-page IBM educational/marketing PDF “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data” (available here with registration) is mostly about the Big Data vision: the issues, the common use cases that can now be handled, and in general, the possibilities. Instead of trying to sell Big Data people on one or two of our overlapping visions, we should be showing them the connections between our technology and the vision that they’re already sold on.

Hadoop and NoSQL are currently the technologies being used to implement this vision. Hadoop is a software framework for certain kinds of distributed applications; the MapReduce programming model at its core is also implemented by several of the NoSQL database managers. “NoSQL” is a blanket term for a family of database management technologies that were developed independently of each other, with no particular standards or organization coordinating them and little in common beyond not being SQL, so a new addition to the family is not going to look like some odd appendage to a seamless whole. In the book Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, which I just read, the example database managers are PostgreSQL, Riak, HBase, MongoDB, CouchDB, Neo4J, and Redis. (While some people and organizations do work RDF into their NoSQL discussions, it’s not mentioned anywhere in this book.)
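
For readers who haven’t seen MapReduce in action, here is a tiny pure-Python sketch of the idea that Hadoop popularized, using the canonical word-count example with made-up input. Nothing in it is Hadoop-specific; it just shows the map, shuffle, and reduce phases that these frameworks run in parallel across many machines.

```python
from collections import defaultdict

# Made-up input "records"; in Hadoop these would be split across many nodes.
documents = ["big data meets rdf", "rdf meets big data again"]

# Map phase: emit (key, value) pairs from each input record independently.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values into a final result.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'big': 2, 'data': 2, 'meets': 2, 'rdf': 2, 'again': 1}
```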

Besides PostgreSQL, the other database managers covered by the book are all considered to be NoSQL systems. Each has techniques for addressing the Big Data vision, which Edd Dumbill, IBM, and many others summarize with the three Vs: Volume, Velocity, and Variety. For reasons described in my last blog entry, RDF technology is excellent for addressing the Variety issues, and reading “Seven Databases” has further convinced me of this. Velocity, and to some extent Volume (more on this one below), are issues for a platform to address rather than for a set of standards, so for those you need to talk to an RDF-related platform vendor such as TopQuadrant. (We’d be happy to discuss your requirements.)
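
To make the Variety point a bit more concrete, here is a minimal sketch with invented data and URIs, not tied to any particular triple store: records with different sets of attributes all reduce to the same three-part subject/predicate/object structure, so taking on a new kind of data never means altering a table.

```python
# Hypothetical data: two product "records" with different fields, expressed as
# triples. The extra "ex:voltage" fact needs no schema change anywhere.
data = [
    ("ex:prod1", "ex:name", "Widget"),
    ("ex:prod1", "ex:price", 9.95),
    ("ex:prod2", "ex:name", "Gadget"),
    ("ex:prod2", "ex:voltage", "120V"),   # a field prod1 doesn't have
]

# Queries simply skip facts a subject lacks instead of hitting NULL columns.
names = {s: o for s, p, o in data if p == "ex:name"}
print(names)   # {'ex:prod1': 'Widget', 'ex:prod2': 'Gadget'}
```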

It’s a cliché in engineering-related sales that you have to focus on customer requirements. It’s also a sales cliché that talking about technical details will bore the people who write the checks. IBM and other such companies are putting big money into marketing Big Data solutions because they’ve found suit-wearing, check-writing managers who feel that their requirements line up with the promises of these solutions. Instead of telling these people about the Semantic Web or Linked Data visions, we should show them how we have (standardized!) technology that fulfills the vision that’s apparently captured their imaginations.

Reading the “Seven Databases” book, I realized that the CAP theorem, although based on some technical issues, is also part of the Big Data vision. If I understand it correctly, the basic idea is this: database administrators have always wanted Consistency, Availability, and Partition tolerance in their databases, but a distributed database can only do two of these well at a time. By deciding that you can work around subpar performance on one of them in exchange for great performance on the other two, new possibilities emerge that wouldn’t have occurred to earlier generations of database administrators who strained to optimize all three. For example, if you give up the requirement that the data on every node be consistent with the data on every other node at all times (while including steps to make it eventually consistent), you gain increased availability (as long as one server is running, the database will return something) and partition tolerance (loss of communication between nodes won’t bring the system down).
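
Here is a toy Python sketch of that particular tradeoff (all names invented, nothing resembling production code): each replica keeps accepting writes and answering reads on its own, so the system stays available even when replicas can’t talk to each other, and a later synchronization step makes them eventually consistent.

```python
class Replica:
    """A toy replica that favors availability over immediate consistency."""

    def __init__(self, name):
        self.name = name
        self.data = {}   # key -> (value, version)

    def write(self, key, value):
        # Succeeds locally even if every other replica is unreachable.
        version = self.data.get(key, (None, 0))[1] + 1
        self.data[key] = (value, version)

    def read(self, key):
        # Always answers from local state: available, but possibly stale.
        return self.data.get(key, (None, 0))[0]

    def sync_with(self, other):
        # Eventual consistency: exchange entries, keep the higher version.
        for key in set(self.data) | set(other.data):
            a = self.data.get(key, (None, 0))
            b = other.data.get(key, (None, 0))
            winner = a if a[1] >= b[1] else b
            self.data[key] = other.data[key] = winner


node_a, node_b = Replica("a"), Replica("b")
node_a.write("status", "open")     # written on node a while node b is cut off
print(node_b.read("status"))       # None: stale, but the read still succeeds
node_a.sync_with(node_b)           # the partition heals; replicas converge
print(node_b.read("status"))       # "open"
```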

One point I didn’t make in my last posting is that the ease with which you can distribute and aggregate RDF triples in any combination gives you a lot more flexibility in how you implement your own two-out-of-three CAP theorem tradeoff. This should make it easier to store triples using one of the distributed NoSQL platforms; I don’t know of any definitive steps in this direction yet, but as I said before, Google searches show bits of work here and there.
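
To illustrate that ease of distribution and aggregation, here is a minimal sketch with plain tuples and invented URIs, not any particular triple store: because each triple is a self-contained fact, you can shard a set of triples across nodes however you like, and recombining the shards is just a set union, with no schema to reconcile and no joins to rebuild.

```python
# Hypothetical triples, sharded across "nodes" by a hash of the subject.
triples = {
    ("ex:emp1", "ex:name",  "Alice"),
    ("ex:emp1", "ex:dept",  "ex:sales"),
    ("ex:emp2", "ex:name",  "Bob"),
    ("ex:emp2", "ex:phone", "555-0100"),
}

NODE_COUNT = 2
shards = [set() for _ in range(NODE_COUNT)]
for s, p, o in triples:
    shards[hash(s) % NODE_COUNT].add((s, p, o))   # distribute by subject

# Aggregation is just a union; nothing about the split needs to be undone.
merged = set().union(*shards)
assert merged == triples

# A simple pattern match over the recombined data: every subject's name.
print([(s, o) for s, p, o in merged if p == "ex:name"])
```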

This potential good fit of triples to the new possibilities opened up by the CAP theorem hadn’t occurred to me when I wrote my last blog entry, but by further study of the vision associated with the hot new data processing goals, I found another connection between that vision and the technology that we RDF types are offering. Which is just what we should be doing: identifying connections between what our technology can do and what these customers need. Especially customers who are pumped up about the latest big technology vision.