Knowledge Graphs!

Semantic Linked Knowledge Web Data Graphs?

Google Knowledge Graph with RDF triples

For several years I thought of “knowledge graphs” as the buzzphrase that had partially replaced “Linked Data”, which was the buzzphrase that had partially replaced “Semantic Web”. In a 2012 blog entry I explained how Hadoop and the new-at-the-time NoSQL databases had convinced me that even if a technology has a funny name, selling it based on the problems it solves makes more sense and ages better than selling a buzzphrase vision first and then, if that goes well, describing the technology that enables the vision. (In another blog entry I described how the second edition of my book Learning SPARQL had “55% more pages! 23% fewer mentions of the semantic web!”) In other words, I’ve grown more suspicious of buzzphrase visions over the years.

The hot new thing

I also knew that RDF-related vendors have been talking about knowledge graph capabilities for several years, but these same vendors were also talking about Semantic Web and Linked Data capabilities before that, so I thought that they were just rebranding with the new buzzphrase as a marketing strategy. Recently, though, I realized how far the excitement about knowledge graphs had spread independently of that community. My first surprise was Ben Lorica’s interview with Mayank Kejriwal about knowledge graphs on his “Data Exchange” podcast. Then, when I tweeted about it, Paco Nathan recommended that I join the Knowledge Graph Conference Slack group. I’d been aware of Lorica’s and Nathan’s work for years but had given up on RDF-like technology making much of a blip on their radar.

I joined the Slack group and found old friends and some new ones there. When I asked the group about a good definition of “knowledge graph” I was a bit inundated, especially with pointers to vendor explanations. Because the term is a bandwagon buzzphrase for our time, many vendors are shouting “That currently hot thing? Yeah! That’s what we do!” (even the SEO sharks have smelled blood in this water), so I was less interested in the vendor perspectives on a good definition. In that Slack thread, Tomas Deely pointed me to the paper simply titled “Knowledge Graphs” (pdf) written by @juansequeda, Antoine Zimmermann (@MonsieurAZ), and 14 other people whose names were less familiar to me. As Juan explained to me, the paper came out of a Schloss Dagstuhl conference in 2018.

A serious, informative review of knowledge graph technology

It was nice to see the formal discipline of this paper—for example, its description of “the distinction between nodes/edges and entities/relations”—when compared with all the me-too vendor definitions of knowledge graphs floating around. After 5 or 6 years of my looking at knowledge graphs through RDF-colored glasses this paper gave me a broader perspective. Its introduction tells us:

The goal of this tutorial paper is to motivate and give a comprehensive introduction to knowledge graphs: to describe their foundational data models and how they can be queried; to discuss representations relating to schema, identity, and context; to discuss deductive and inductive ways to make knowledge explicit; to present a variety of techniques that can be used for the creation and enrichment of graph-structured data; to describe how the quality of knowledge graphs can be discerned and how they can be refined; to discuss standards and best practices by which knowledge graphs can be published; and to provide an overview of existing knowledge graphs found in practice.

That’s a lot of material to cover, so the paper is long. After I read the 78-page main body of the 132-page paper, Appendix A on page 108 (after 30 pages of 583 footnote references) in particular gave me the perspective that I was looking for on both the long-term and recent histories of knowledge graphs as well as the relative roles of RDF and non-RDF technologies along the way. I very strongly recommend that people interested in knowledge graphs start with this five-page appendix. (Note: I read and took notes about the paper a few weeks ago and just noticed that all of my numbers earlier in this paragraph were off. I then saw the “11 Dec 2020” date stamp on the first page and realized that it has been revised since I read it. I’ve tried to update my numbers and quotes here to reflect the latest version.)

The short version of the history of knowledge graphs begins in 2012 (the year I blogged about reducing my usage of the term “semantic web”!) when an engineering SVP at Google published Introducing the Knowledge Graph: things, not strings. Sections 2 and 3 of the Schloss Dagstuhl Knowledge Graph paper’s Appendix A are titled “‘Knowledge Graphs’: Pre 2012” and “‘Knowledge Graphs’: 2012 Onwards” because this Google article was such a key event in knowledge graph history. Appendix A gives perspective on which research, both before and after 2012, is relevant to what is now considered knowledge graph technology, and why. Section 3 also sorts out various classes of “knowledge graph” definitions, providing good context for the paper’s own definition, given both in its introduction and in its summary at the end: “a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities”. We can all think of followup questions for that definition, but for 28 words it’s pretty good.

Section 3 of the appendix includes a lower-level supplement to this definition that addresses the important question of what makes a graph data structure a knowledge graph: “We refer to a knowledge graph as a data graph potentially enhanced with representations of schema, identity, context, ontologies and/or rules”.

Reading through RDF-colored glasses

All of these potential enhancements are covered in some detail in the main body of the paper. Although that coverage often goes for pages without mentioning any RDF-related technology, when I see (through my admittedly RDF-colored glasses) discussions of schema, identity, and ontologies around a tourism example of named node-edge-node triples, I of course see lots of RDF. Here’s an example:

Graphs allow maintainers to postpone the definition of a schema, allowing the data – and its scope – to evolve in a more flexible manner than typically possible in a relational setting, particularly for capturing incomplete knowledge. Unlike (other) NoSQL models, specialised graph query languages support not only standard relational operators (joins, unions, projections, etc.), but also navigational operators for recursively finding entities connected through arbitrary-length paths.

There’s no mention of RDF or SPARQL there, but it certainly lists many of their key capabilities. (I’ll be discussing the wonderful possibilities of partial RDF schemas, as a compromise between no schemas and fully detailed ones, in an upcoming blog entry.) It goes on:

Standard knowledge representation formalisms – such as ontologies and rules – can be employed to define and reason about the semantics of the terms used to label and describe the nodes and edges in the graph.

I was struck by how that passage and many other parts of the paper discussed capabilities that OWL has provided since RDF’s early years. OWL has a smaller profile in the RDF world than it used to. People who once thought that OWL might help to define data structures that could help maintain data quality have turned to SHACL, because that’s not really what OWL was for, and these users’ confusion over all the different OWL profiles didn’t help. So it was interesting to see how much the Schloss Dagstuhl Knowledge Graph paper discussed ontologies, Description Logics (the DL in OWL-DL), T-Boxes, A-Boxes, Individuals, and especially entailment and related topics where OWL can contribute plenty.
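To make the two quoted capabilities a little more concrete, here is a minimal sketch in plain Python with no RDF library involved. Everything in it is my own assumption for illustration: the tourism data is made up, and the names TRIPLES, reachable, and infer_types are hypothetical rather than anything from the paper. The reachable function imitates the kind of arbitrary-length path navigation that a SPARQL property path with a + operator provides, and infer_types applies just one RDFS-style rule (an instance of a class is also an instance of its superclasses), which is a tiny fraction of what OWL entailment covers.

```python
from collections import deque

# Hypothetical tourism data as (subject, predicate, object) triples.
TRIPLES = {
    ("Santiago", "flight", "Arica"),
    ("Arica", "flight", "Lima"),
    ("Arica", "bus", "Putre"),
    ("Santiago", "type", "Capital"),
    ("Capital", "subClassOf", "City"),
    ("City", "subClassOf", "Place"),
}


def reachable(triples, start, predicate):
    """Return the nodes reachable from `start` by following one or more
    edges labeled `predicate` (a breadth-first traversal)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in triples:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


def infer_types(triples):
    """Return the triples plus every (x, type, C2) entailed by
    (x, type, C1) and a chain of subClassOf edges from C1 to C2."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == "type":
            for superclass in reachable(triples, o, "subClassOf"):
                inferred.add((s, "type", superclass))
    return inferred


if __name__ == "__main__":
    print(sorted(reachable(TRIPLES, "Santiago", "flight")))
    print(("Santiago", "type", "Place") in infer_types(TRIPLES))
```

The same traversal serves both purposes here, which is part of the appeal of the triple model: the schema-level subClassOf edges live in the same graph as the data, so following them needs no separate machinery.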

Graph and non-graph technologies

Something else that made me think of knowledge graphs as a vague buzzphrase was the way people used the term to reference technologies that are quite separate from the use of graph data structures: relational databases, text indexing, named entity recognition and other areas that fall under the currently overlapping umbrellas of machine learning and artificial intelligence.

Some of those do have direct applications to graphs, such as the use of embeddings with triples, and the Schloss Dagstuhl paper covers these technologies. It has good reasons for this, describing them as techniques by which knowledge graphs can be “enriched from diverse sources of legacy data that may range from plain text to structured formats (and anything in between).” The paper’s conclusion sums up the relationship among these technologies nicely:

Research on knowledge graphs can become a confluence of techniques arising from different areas with the common objective of maximising the knowledge – and thus value – that can be distilled from diverse sources at large scale using a graph-based data abstraction. Pursuing this objective will benefit from expertise on graph databases, knowledge representation, logic, machine learning, graph algorithms and theory, ontology engineering, data quality, natural language processing, information extraction, privacy and security, and more besides.

(A side note on embeddings, RDF and knowledge graphs: The RDF2Vec algorithm used to do embeddings with RDF has been around since 2016, and you can find many discussions about it since then that refer to RDF graph embeddings. More recent discussions of it, though, have titles like How to Create Representations of Entities in a Knowledge Graph using pyRDF2Vec. It’s another example of an RDF thing that’s been around for a while now being described as a knowledge graph thing because of the current cachet of the term.)
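For readers wondering what doing embeddings with triples involves: as I understand it, RDF2Vec first extracts walks through the graph and then treats each walk as a “sentence” for a word2vec-style model. The following is only a simplified sketch of that first walk-extraction step under my own assumptions. The data is made up, random_walks is a hypothetical name, none of this is the pyRDF2Vec API, and the embedding training step is left out entirely.

```python
import random

# Hypothetical (subject, predicate, object) triples.
TRIPLES = {
    ("Santiago", "flight", "Arica"),
    ("Santiago", "capitalOf", "Chile"),
    ("Arica", "flight", "Lima"),
    ("Arica", "locatedIn", "Chile"),
    ("Lima", "capitalOf", "Peru"),
}


def random_walks(triples, start, depth, n_walks, seed=0):
    """Return `n_walks` random walks from `start`, each following at most
    `depth` edges. A walk alternates nodes and edge labels:
    [node, predicate, node, predicate, node, ...]."""
    rng = random.Random(seed)  # seeded so that runs are repeatable

    # Index outgoing edges by subject; sorting keeps the order stable.
    outgoing = {}
    for s, p, o in sorted(triples):
        outgoing.setdefault(s, []).append((p, o))

    walks = []
    for _ in range(n_walks):
        walk, node = [start], start
        for _ in range(depth):
            edges = outgoing.get(node)
            if not edges:  # dead end: no outgoing edges from this node
                break
            predicate, node = rng.choice(edges)
            walk += [predicate, node]
        walks.append(walk)
    return walks


if __name__ == "__main__":
    for walk in random_walks(TRIPLES, "Santiago", depth=2, n_walks=3):
        print(" -> ".join(walk))
```

Each walk could then be fed to an embedding model as a sequence of tokens, so that entities appearing in similar graph contexts end up with similar vectors.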

Knowledge graphs and RDF

Section 10.2 of the Schloss Dagstuhl Knowledge Graph paper, “Enterprise Knowledge Graphs”, includes footnoted mentions of over a dozen brand-name companies that have discussed their knowledge graph initiatives. I did a quick skim of all the referenced works to check for references to RDF technology and found that Thomson Reuters has RDF in that mix, which didn’t surprise me. The big surprise for me was Pinterest not only using RDF but using OWL. Put a pin in that!

Discussions of the work at the other companies didn’t mention RDF, but most were fairly high-level discussions, so I’m guessing that some of them use RDF and some don’t. (While the referenced AstraZeneca article didn’t mention RDF, I know that as a customer of AllegroGraph, Ontotext, and TopQuadrant, the last of which I did training for at AstraZeneca, they have been RDF fans for a while, especially because of the data integration possibilities.)

Knowledge graphs are not as synonymous with RDF as those of us with the aforementioned glasses might like to think. In fact, the knowledge graph world currently looks bigger than that, and it’s easy enough to picture companies like Google, Facebook, and eBay defining their own data structures and schema languages to build graphs with no reference to the relevant W3C standards. I don’t think this is necessarily a bad thing; it looks like a pretty big tent.

The vision thing

Earlier I described how I became suspicious of selling RDF technology by building a buzzphrase vision around it and starting the marketing pitch with that. The pleasant surprise in my study of the knowledge graph world is that it was built around currently important ideas and needs, not any specific technology, and RDF-related technology turns out to provide excellent, standardized, widely implemented open source and commercial support for the implementation of knowledge graphs. In other words, this newer vision came along fairly independently of RDF and happens to be a great fit for it, so I’ll just try to be grateful. I’ll still be a bit self-conscious, as if I were jumping on the bandwagon, when I insert the phrase “knowledge graph” into a technology discussion that I would have written anyway even if knowledge graphs weren’t so hot. But it’s a pretty nice bandwagon, and RDF people have a lot that we can contribute to it.