When I wrote about my first deep dive into Knowledge Graphs, I mentioned that although the term was around well before 2012, the idea of a Knowledge Graph was blessed as an official Google thing that year when one of their engineering SVPs published the article Introducing the Knowledge Graph: things, not strings. This blessing gave some focus to many members of the graph database community because they could say that what they had been doing was similar to, if not the same as, what Google was doing.

I still didn’t think of the Google Knowledge Graph as a specific thing, but as more of a marketing term describing a set of technologies, like IBM’s Watson. I have changed my mind: in Pascal Hitzler’s A Review of the Semantic Web Field in the Communications of the ACM I learned that there is an actual, RESTful Google Knowledge Graph Search API, and I’ve been having some fun pulling Turtle RDF triples out of it.

That Google page demonstrates what you can put in a URL to request JSON-LD data from their Knowledge Graph. Their first example sends a search for “Taylor Swift”; below I have used that example with curl and piped the output through the Jena riot command line utility (not to be confused with DJ Jenna Riot, who I just learned about in a web search) so that I could get Turtle triples of the result. I won’t even bother showing the JSON-LD version here because I can get the Turtle version with this single command:

curl \
  "https://kgsearch.googleapis.com/v1/entities:search?query=taylor+swift&key=API_KEY&limit=1&indent=True" \
  | riot --syntax=JSONLD --output=turtle

Two notes about this command line:

  • I substituted my own API key for “API_KEY” above. You can get your own at API Key by filling out a few forms.

  • When you feed RDF to riot, it can usually guess the serialization from the end of the input filename, but when piping data to its stdin like I do above, you need the --syntax parameter to tell it what flavor of RDF you are feeding it.

That command gave me 14 triples, including these:

<http://g.co/kg/m/0dl567>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <http://schema.org/Thing> ;
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <http://schema.org/Person> ;
        goog:detailedDescription  _:b2 ;
        <http://schema.org/description>  "American singer" ;
        <http://schema.org/name>  "Taylor Swift" ;
        <http://schema.org/url>   "http://www.taylorswift.com/" .

_:b2    <http://schema.org/articleBody>  "Taylor Alison Swift is an American 
            singer-songwriter. Her narrative songwriting, which often takes 
            inspiration from her personal life, has received widespread 
            critical praise and media coverage.\n" ;
        <http://schema.org/url>  "https://en.wikipedia.org/wiki/Taylor_Swift" .

The Wikipedia page for the now-defunct Freebase database tells us that “On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to be a replacement to the Freebase API”, so I’ve been missing out on this for a while. The Taylor Swift data above includes an interesting bit of the Freebase legacy: the local name of the URI used to represent her as a resource in the Google Knowledge Graph is m/0dl567, which we can see on her Wikidata page was the identifier that Freebase used for her. For people, places, and things that were not represented in Freebase at the time that Freebase shut down in 2016 (for example, Lil Nas X, whose Wikipedia page shows no Freebase identifier and says that he has been active since 2018) I assume that some Google algorithm just generates new identifiers in their Knowledge Graph.
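To make the relationship between the two identifiers concrete, here is a minimal Python sketch that strips the http://g.co/kg prefix from a Knowledge Graph URI to recover the Freebase-style identifier. The function name freebase_mid is my own invention for illustration, and I am assuming that every entity URI the API returns uses that prefix, as the ones shown above do.

```python
# Prefix shared by the entity URIs in the Turtle output above.
KG_PREFIX = "http://g.co/kg"


def freebase_mid(kg_uri):
    """Return the identifier embedded in a Knowledge Graph URI.

    For "http://g.co/kg/m/0dl567" this is "/m/0dl567". Returns None
    when the URI does not start with the expected prefix.
    """
    if not kg_uri.startswith(KG_PREFIX + "/"):
        return None
    # Keep the leading slash so the result has the "/m/..." shape.
    return kg_uri[len(KG_PREFIX):]


print(freebase_mid("http://g.co/kg/m/0dl567"))  # prints /m/0dl567
```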

More query API options

You can pick apart the URL with the Taylor Swift query and then reassemble it with new pieces using the Google Knowledge Graph API Reference. For instance, that query has a limit value of 1, but the API reference tells us that this can be up to 500, with a default value of 20. The reference page also includes a form you can fill out with sample API call parameters to learn about them more interactively than you would by revising a curl command over and over.
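If you would rather do that picking apart and reassembling in a script, the following Python sketch shows one way to do it with the standard library. The helper names build_kg_search_url and pick_apart are hypothetical, and the parameter names are simply the ones that appear in the curl command above; nothing here contacts Google.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the curl command shown earlier.
KG_SEARCH_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_search_url(query, api_key, **options):
    """Assemble a search URL from a query string, a key, and extra options."""
    params = {"query": query, "key": api_key}
    params.update(options)  # e.g. limit=1, indent=True
    return KG_SEARCH_ENDPOINT + "?" + urlencode(params)


def pick_apart(url):
    """Return the URL's query-string parameters as a plain dict of strings."""
    parsed = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in parsed.items()}


url = build_kg_search_url("taylor swift", "API_KEY", limit=1, indent=True)
print(url)
print(pick_apart(url))
```

Changing the limit or adding another parameter then only means changing the keyword arguments, not editing a long quoted string.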

A more interesting option for the query URL is types, which lets you limit your search to entities of one or more specified schema.org types. For example, a query that uses parameters of query=charles+schwab&types=Corporation returns information about the company with that name, but query=charles+schwab&types=Person returns information about its founder. (Because types is plural you can also specify a comma-delimited list as that parameter’s value.)

With no limit parameter in the URL, the query about Charles Schwab the person actually returned eight people: Charles R. Schwab, the founder of the financial services firm; Pennsylvania steel magnate Charles M. Schwab; Émile Martin Charles Schwabe, a Swiss Symbolist painter and printmaker, and five other people.

This brings me to a few triples returned by my command line above that I didn’t show in the Taylor Swift example. Because the request sends a query to Google, just like a search entered at www.google.com, the server actually returns a list of search results. Here is the beginning of the Turtle version of the search result for the person Charles Schwab:

_:b0    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <http://schema.org/ItemList> ;
        <http://schema.org/itemListElement>  _:b1 ;
        <http://schema.org/itemListElement>  _:b2 ;
        <http://schema.org/itemListElement>  _:b3 ;
        <http://schema.org/itemListElement>  _:b4 ;
        <http://schema.org/itemListElement>  _:b5 ;
        <http://schema.org/itemListElement>  _:b6 ;
        <http://schema.org/itemListElement>  _:b7 ;
        <http://schema.org/itemListElement>  _:b8 .

_:b1    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> goog:EntitySearchResult ;
        goog:resultScore            1.105882568359375E3 ;
        <http://schema.org/result>  <http://g.co/kg/m/028lhc> .

<http://g.co/kg/m/028lhc>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> ;
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Thing> ;
        goog:detailedDescription  _:b9 ;
        <http://schema.org/description>  "American magnate" ;
        <http://schema.org/name>  "Charles M. Schwab" .

The first instance in the data is an item list. This points at instances of goog:EntitySearchResult; the first of these has the blank node _:b1 as its identifier. This search result points to information about the steel magnate, which identifies him with his Freebase ID, and it also has a search result score.

The API documentation tells us that the result score is “an indicator of how well the entity matched the request constraints”. I imagine that this is not simply a score of string similarity but also takes into account the popularity of each search result. Otherwise, I don’t know how the result score would be 12 for financial services firm founder Charles R. Schwab, about 1,106 for steel magnate Charles M. Schwab, and 6 for Swiss symbolist Émile Martin Charles Schwabe.
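If you want to compare scores without converting to RDF first, the item list structure is easy to walk in the raw JSON as well. The Python sketch below uses a hand-written, heavily trimmed sample in the shape of the API's response, not actual API output: the score for Charles M. Schwab is the one shown in the Turtle above, and the other two are the rounded values I mentioned. The function name ranked_results is hypothetical.

```python
import json

# Hand-written sample in the shape of the API's JSON-LD response.
# Only the fields needed here are included; two scores are rounded.
SAMPLE_RESPONSE = json.loads("""
{
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "EntitySearchResult",
     "result": {"name": "Charles R. Schwab"},
     "resultScore": 12},
    {"@type": "EntitySearchResult",
     "result": {"@id": "kg:/m/028lhc", "name": "Charles M. Schwab"},
     "resultScore": 1105.882568359375},
    {"@type": "EntitySearchResult",
     "result": {"name": "Émile Martin Charles Schwabe"},
     "resultScore": 6}
  ]
}
""")


def ranked_results(response):
    """Return (score, name) pairs from an item list, highest score first."""
    pairs = [
        (element["resultScore"], element["result"].get("name", "(no name)"))
        for element in response.get("itemListElement", [])
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


for score, name in ranked_results(SAMPLE_RESPONSE):
    print(score, name)
```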

Linking that data

The Google Knowledge Graph API doesn’t return a large amount of data for each entity, but when you have the Freebase ID, you can use it to retrieve additional data about that entity from Wikidata. The following simple little Wikidata query (try it here) uses the Freebase ID that we saw above for steel magnate Charles Schwab to pull down 140 triples about him from Wikidata:

CONSTRUCT {?s ?p ?o } WHERE {
   ?s wdtn:P646 <http://g.co/kg/m/028lhc> ;
      ?p ?o .
  }
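To run the same lookup for other entities, you could generate the query from whatever URI the Knowledge Graph API returned. Here is a minimal Python sketch that builds the query text and a request URL for the public Wikidata endpoint at https://query.wikidata.org/sparql. The function names are my own, the sketch relies on that endpoint predefining the wdtn: prefix the way the query above does, and it only builds strings; it does not send the request.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def wikidata_construct_query(kg_uri):
    """Build the CONSTRUCT query above for a given Knowledge Graph URI."""
    if any(ch in kg_uri for ch in '<>" \n'):
        raise ValueError("not a plain URI: %r" % kg_uri)
    return (
        "CONSTRUCT { ?s ?p ?o } WHERE {\n"
        f"  ?s wdtn:P646 <{kg_uri}> ;\n"
        "     ?p ?o .\n"
        "}"
    )


def wikidata_request_url(kg_uri):
    """Return a GET URL that carries the query in its 'query' parameter."""
    query = wikidata_construct_query(kg_uri)
    return WDQS_ENDPOINT + "?" + urlencode({"query": query})


print(wikidata_construct_query("http://g.co/kg/m/028lhc"))
print(wikidata_request_url("http://g.co/kg/m/028lhc"))
```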

Exploring for more data

The Google Knowledge Graph API includes a boolean prefix parameter that “[e]nables prefix (initial substring) match against names and aliases of entities”. The following asks for all entities of type MusicGroup whose name begins with “bea”:

curl \
  "https://kgsearch.googleapis.com/v1/entities:search?prefix=true&query=bea&types=MusicGroup&limit=500&key=API_KEY" \
  | riot --syntax=JSONLD > beagroups.ttl

The 481 results included the Beatles, Beach Boys and Beastie Boys, as I expected.

I was wondering if a sorted list of result scores would reveal any pattern, and then I realized, duh, I can write a SPARQL query to do that; it’s why I pulled the data as triples! (I could execute a query against the JSON-LD, but I prefer to work with Turtle because it’s easier to read.)

PREFIX s: <http://schema.org/>

SELECT ?resultScore ?bandName WHERE {
  ?result      <http://schema.googleapis.com/resultScore> ?resultScore ;
               s:result ?musicGroup .
  ?musicGroup  s:name ?bandName . 
}
ORDER BY DESC(?resultScore)

Here are the first few results when running this query against the RDF of “bea” music groups that the curl command above pulled down:

---------------------------------------------------------------------------------------------
| resultScore         | bandName                                                            |
=============================================================================================
| 2.518089111328125E3 | "The Beatles"                                                       |
| 3.5488818359375E2   | "Beastie Boys"                                                      |
| 1.969714050292969E2 | "Beak"                                                              |
| 1.761080169677734E2 | "Beatrice"                                                          |
| 1.361932220458984E2 | "Brooklyn Bounce"                                                   |
| 1.338223876953125E2 | "Battle Beast"                                                      |
| 1.335271911621094E2 | "Beatsteaks"                                                        |
| 1.331909942626953E2 | "Beartooth"                                                         |
| 1.256562881469727E2 | "Beady Belle"                                                       |
| 1.212170104980469E2 | "Beatfreakz"                                                        |
| 1.101853561401367E2 | "The Trammps"                                                       |

(Yes, the Trammps, of Disco Inferno fame.) The Beach Boys ranked at 111, well below many groups I’ve never heard of that, like the Trammps, didn’t even have “bea” anywhere in their name: Vansire? The Parlotones? Turbotronic?
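A quick way to spot these surprises is to filter the returned names for the ones that lack the prefix string entirely. The Python sketch below does that for the eleven names in the table above; the function name names_lacking is hypothetical. Since the prefix parameter is documented as matching aliases as well as names, my guess is that these entities matched on an alias, but the sketch only checks the name and cannot confirm that.

```python
def names_lacking(substring, names):
    """Return the names that do not contain substring, ignoring case."""
    needle = substring.casefold()
    return [name for name in names if needle not in name.casefold()]


# The band names from the query results shown above, in rank order.
TOP_BAND_NAMES = [
    "The Beatles", "Beastie Boys", "Beak", "Beatrice", "Brooklyn Bounce",
    "Battle Beast", "Beatsteaks", "Beartooth", "Beady Belle", "Beatfreakz",
    "The Trammps",
]

print(names_lacking("bea", TOP_BAND_NAMES))
# prints ['Brooklyn Bounce', 'The Trammps']
```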

The ability to pull typed data directly from Google’s Knowledge Graph is pretty great, especially since we can link much of that data to other good data sources. I had considered titling this blog entry “Piping data to stdin of Jena’s riot utility” (talk about your clickbait!) but as you can see decided to go with the Knowledge Graph angle—not because this term is a popular way to talk about graph databases in general, but because we’re pulling data from the graph that Google itself is calling a Knowledge Graph.

Still, this ability to feed data to riot via stdin is pretty nice, and it handles a key handoff in this trick the old-fashioned UNIX way. Assembled like this, these pieces make it easier to bring Google Knowledge Graph data into the wide range of RDF-based tools that are out there, where you can query it with SPARQL and link it to sources like Wikidata.