Normalizing company names (and more) with SPARQL and Wikidata

As a service!

March 26, 2023

Several years ago I wrote Normalizing company names with SPARQL and DBpedia to describe how SPARQL queries to DBpedia let you take advantage of the Wikipedia logic that redirects the URL http://en.wikipedia.org/wiki/Big_Blue to https://en.wikipedia.org/wiki/IBM and http://en.wikipedia.org/wiki/Bobby_Kennedy to http://en.wikipedia.org/wiki/Robert_F._Kennedy. This lets SPARQL queries normalize names—a useful task to perform for data cleanup.

This time I did it with Wikidata. As with the DBpedia version, I did it using a SERVICE call to Wikidata so that a query that is running somewhere besides Wikidata can take advantage of this. This time I also showed how to make it work for countries as well as companies. Minor changes should make it work for most other classes.

Because my new version uses alternative names instead of redirect data, the name of the company that it returns is the name on the Wikipedia page, not the official name of the company. So, for example, instead of “Eastman Kodak Company” it will return “Kodak” and instead of “Apple, Inc.” it will return “Apple”. Still, the normalization down to a single name is useful.

Here is the query I entered into the Wikidata Query Service to send to its endpoint. You can also see and execute the query with this query link.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?wdName                   # Find the Wikidata name
WHERE {
  # A few test cases to choose from
  BIND("Big Blue"@en AS ?englishName)    # Search based on a nickname.
  #BIND("IBM"@en AS ?englishName)         # Search based on Wikipedia name.
  #BIND("Apple Inc."@en AS ?englishName)  # Search based on official name.

  BIND (wd:Q4830453 as ?entityClass)      # Business is the entity class...
  ?entity wdt:P31 ?entityClass .          # that we want to search.
  
  # Union of two sets of triples: entities that have the input name as
  # an alternative name and those that have it as their official name. 
  {       
    ?entity skos:altLabel ?englishName ; 
            rdfs:label ?officialName .
    FILTER ( lang(?officialName) = "en" )
  }
  UNION
  { ?entity rdfs:label ?englishName . }
  
  # Get the official name if it was bound, otherwise the
  # the English name if part 2 of the UNION found it. 
  BIND(STR(COALESCE(?officialName,?englishName)) AS ?wdName)
}

I won’t discuss the query much here because it’s heavily commented. You can also read about the query logic in my earlier article on doing this with DBpedia (especially the use of the under-appreciated COALESCE function). I will say that, although Wikidata sometimes uses its own vocabulary to express ideas that could have used basic parts of the standard (for example, using wdt:P31 instead of rdf:type to show class membership), it’s nice to see the RDFS and SKOS vocabularies used as part of the data I was retrieving.

As a SERVICE…

You can run the query below with any SPARQL processor that has access to the Internet so that it can make the SERVICE call. I ran it locally with arq in a query that demonstrates batch processing of names to normalize.

This query is more flexible than the one above, letting you disambiguate terms from different classes—in my example, both company and country names. (I tried it with people as well, but there are just too many famous people who share their name with someone else such as the various Michael Jordans and Dave Thomases.) You just need to include the class URI in the input.

Here is the sample input, listing some entities by their Wikidata names and some by alternative names that they are known for.

@prefix s:  <http://learningsparql.com/ns/sample/> .
@prefix wd: <http://www.wikidata.org/entity/>

s:company1 a wd:Q4830453;
           s:name "Kodak" .

s:company2 a wd:Q4830453;
           s:name "Big Blue" .

s:company3 a wd:Q4830453;
           s:name "Coca Cola" .

s:country1 a wd:Q6256;
        s:name "The UK" .
        
s:country2 a wd:Q6256;
        s:name "Nigeria" .
        
s:country3 a wd:Q6256;
        s:name "U.S." .

The query is a CONSTRUCT query, so it will return triples: the original data, the Wikipedia name, and the Wikipedia qname so that something further down the processing pipeline can retrieve more data from Wikipedia about the entity.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX s:    <http://learningsparql.com/ns/sample/>

CONSTRUCT {
  ?entity rdfs:label ?wdName ;
          s:name ?name ; 
          s:wikidataURI ?wikidataEntity . 
}
WHERE {
  ?entity a ?entityClass;
          s:name ?name . 
  
  BIND(STRLANG(?name,"en") AS ?englishName)
  
  SERVICE <https://query.wikidata.org/sparql> 
  # Look for something with that name and entity class. 
  {
    ?wikidataEntity wdt:P31 ?entityClass . 
    {       
      ?wikidataEntity skos:altLabel ?englishName ; 
         rdfs:label ?officialName .
      FILTER ( lang(?officialName) = "en" )
    }
    UNION
    { ?wikidataEntity rdfs:label ?englishName .
      FILTER ( lang(?englishName) = "en" )
}
  }
  BIND(STR(COALESCE(?officialName,?englishName)) AS ?wdName)
}

The SERVICE clause passes off some of the logic to happen on Wikidata, and the rest executes locally with my copy of arq. The rest of the syntax is very close to the more heavily-commented example above.

Here are the results I get:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix s:    <http://learningsparql.com/ns/sample/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix wd:   <http://www.wikidata.org/entity/> .
@prefix wdt:  <http://www.wikidata.org/prop/direct/> .

s:company1  rdfs:label  "Kodak" ;
        s:name         "Kodak" ;
        s:wikidataURI  wd:Q486269 .

s:country2  rdfs:label  "Nigeria" ;
        s:name         "Nigeria" ;
        s:wikidataURI  wd:Q1033 .

s:company2  rdfs:label  "IBM" ;
        s:name         "Big Blue" ;
        s:wikidataURI  wd:Q37156 .

s:country3  rdfs:label  "United States of America" ;
        s:name         "U.S." ;
        s:wikidataURI  wd:Q30 .

s:country1  rdfs:label  "United Kingdom" ;
        s:name         "The UK" ;
        s:wikidataURI  wd:Q145 .

s:company3  rdfs:label  "The Coca-Cola Company" , "Coca-Cola Hellenic" ;
        s:name         "Coca Cola" ;
        s:wikidataURI  wd:Q3295867 , wd:Q1104910 .

The fact that the “Coca Cola” entry returns two companies shows that this may not completely normalize a given name. We can automate the identification of which of the output entities have more than one Wikidata name and therefore need some review with this query on the output of the CONSTRUCT query above:

PREFIX  s:  <http://learningsparql.com/ns/sample/>

SELECT ?name (COUNT(?uri) AS ?uriCount)
WHERE {
  ?s s:name ?name ;
     s:wikidataURI ?uri 
}
GROUP BY ?name
HAVING (?uriCount > 1)

Or, you could try to find some logic related to how Wikidata models companies as a way to pick just one of the companies—or countries, because this issue comes up with them as well. The nice thing is that it the query can work with different classes of entities, so it provides a foundation to build on.

Comments? Reply to my tweet (or even better, my Mastodon message) announcing this blog entry.

Selecting all the triples from all the graphs

Editing schemas, ontologies, and SKOS taxonomies with VocBench

SPARQLing anything

Querying for audio on Wikidata

Use SPARQL to query for movies, then watch them

SPARQL queries of the Billboard Hot 100

Visualizing RDF

Using regular expressions to manipulate data in a SPARQL query

Appreciating the SPARQL property path slash character more

Triples about existing triples

blog

home