Entity recognition from within a SPARQL query

Using my new employer's excellent free product.

I recently announced that I have joined Ontotext as a full-time Senior Tech Writer. I have admired their free GraphDB triplestore for a long time (for example, I wrote about how well it supports the GeoSPARQL geospatial extension in October of 2020) and I am now learning about all the great capabilities of their commercial products, such as the scalability of GraphDB Enterprise.

As always, though, in this blog I will focus on free RDF-related software, so this month I will write about a cool feature of GraphDB Free that I learned about just last week: its use of the spaCy library to let you do text analysis and entity recognition from within a SPARQL query.

The Text Mining Plugin page of the GraphDB documentation describes the text mining services that the plugin supports: spaCy, GATE Cloud, and Ontotext’s Tag API. The spaCy section of that page shows the two lines necessary to create and then run a spaCy server with Docker, and then it shows a SPARQL INSERT DATA command that establishes a connection from GraphDB to that server. Once that’s done, you’re ready to run queries that tell spaCy to analyze content that you pass to it.

The next section of that page, Find spaCy entities through GraphDB, shows a query that passes a paragraph of text about Dyson vacuum cleaners to spaCy and returns several columns of information about how spaCy annotated it to indicate the entities that it found. Beneath that on the Text Mining Plugin page you can see the results: it identifies “Dyson Ltd.” as an organization, James Dyson as a person, Singapore as a geopolitical entity, and more. (While that documentation shows six of the returned rows, I got twelve when I ran it.)

That query was a SELECT query. I wanted to run a CONSTRUCT query that would create new triples about some of the identified things. If it recognized people, places, and organizations, I wanted it to create triples declaring them to be instances of schema.org classes. Revising the SELECT query mentioned above, I ended up with this:

# getting triples from endpoint with this query: 
# curl -H "Accept: text/turtle" --data-urlencode \
# "query@spacytest.rq" http://bob-inspiron:7200/repositories/my_repo

PREFIX txtm:      <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
PREFIX s:         <https://schema.org/>
PREFIX rdfs:      <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc:        <http://purl.org/dc/elements/1.1/>

# Don't forget to start the spaCy server and run the INSERT query
# that establishes the connection to it before running this query. 

CONSTRUCT {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText .

        ?entityID a ?soClassname ; 
                  rdfs:label ?annotationText . 
        # The annotation has a related resource: this
        # new resource being declared. 
        ?annotation dc:relation ?entityID .
}
WHERE {
  # Pass the text to the spaCy connection for analysis.
  ?searchDocument a txtm-inst:localSpacy;
     txtm:text '''Dyson Ltd. plans to hire 450 people globally, with
     more than half the recruits in its headquarters in Singapore.
     The company best known for its vacuum cleaners and hand dryers will
     add 250 engineers in the city-state. This comes short before the founder
     James Dyson announced he is moving back to the UK after moving residency
     to Singapore. Dyson, a prominent Brexit supporter who is worth US$29
     billion, faced criticism from British lawmakers for relocating his
     company''' .

    # Read back the annotations that spaCy created.
    GRAPH txtm-inst:localSpacy {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText ;
                    txtm:annotationKey ?annotationKey;
                    txtm:annotationType ?annotationType .
    }

    # Map spaCy annotation types to schema.org classes. Annotations
    # of any other type are dropped from the result.
    VALUES (?annotationType ?soClassname) {
      ("ORG"    s:Organization) 
      ("GPE"    s:AdministrativeArea)
      ("PERSON" s:Person)
    }

    # Create a URI to use as the subject of each newly
    # recognized entity being declared as a schema.org class. 
    BIND(UUID() AS ?entityID)
}

The WHERE clause grabs the information generated by spaCy, just as the WHERE clause of the original SELECT query in the GraphDB documentation does. It also uses SPARQL’s VALUES clause to map spaCy annotation types to schema.org classes. (With more input text, I’m sure spaCy would recognize more types of entities, so you could easily extend this VALUES list to accommodate those.) Then, instead of a SELECT clause, I have a CONSTRUCT clause that creates triples saying that the recognized entities are instances of the appropriate classes.
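
To show how an extended VALUES list behaves without needing a spaCy connection, here is a minimal sketch that should run on any SPARQL 1.1 engine. The first VALUES block is a hypothetical stand-in for what spaCy returns, not actual plugin output, and the extra mappings for LOC, PRODUCT, and EVENT are my own assumptions about reasonable schema.org targets for those spaCy labels.

```sparql
PREFIX s:    <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  ?entityID a ?soClassname ;
            rdfs:label ?annotationText .
}
WHERE {
  # Hypothetical stand-ins for the annotations that spaCy would return.
  VALUES (?annotationText ?annotationType) {
    ("Dyson Ltd."  "ORG")
    ("Singapore"   "GPE")
    ("James Dyson" "PERSON")
    ("450"         "CARDINAL")
  }

  # The mapping table. The last three rows are assumed extensions.
  VALUES (?annotationType ?soClassname) {
    ("ORG"     s:Organization)
    ("GPE"     s:AdministrativeArea)
    ("PERSON"  s:Person)
    ("LOC"     s:Place)
    ("PRODUCT" s:Product)
    ("EVENT"   s:Event)
  }

  # Mint a fresh URI for each annotation that survived the join.
  BIND(UUID() AS ?entityID)
}
```

Because the two VALUES blocks join on ?annotationType, the "450" row has no matching class and produces no triples. The mapping table therefore acts as a filter as well as a lookup.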

This is only a beginning. For example, spaCy recognizes Singapore as a geopolitical entity in two different places, but it doesn’t know that the two identified entities are the same thing, so my query creates a separate s:AdministrativeArea instance for each. There are tools that could be used further down the pipeline to straighten this out and maybe connect it to http://www.wikidata.org/entity/Q334, the Wikidata identifier for Singapore; because this CONSTRUCT query creates triples instead of a table of results, it will be much easier to pass the result of its work down a pipeline to other tools that can do further enhancements.
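
One simple way to collapse such duplicates inside the query itself, before any downstream tool gets involved, might be to derive each URI from the annotation text instead of calling UUID(). The following is only a sketch under stated assumptions: the http://example.org/entity/ namespace is a placeholder, and the VALUES rows again stand in for real spaCy output.

```sparql
PREFIX s:    <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  ?entityID a ?soClassname ;
            rdfs:label ?annotationText .
}
WHERE {
  # Hypothetical stand-ins: Singapore is recognized twice.
  VALUES (?annotationText ?soClassname) {
    ("Singapore"   s:AdministrativeArea)
    ("Singapore"   s:AdministrativeArea)
    ("James Dyson" s:Person)
  }

  # Build the URI from the label so that identical labels
  # yield the same subject. The namespace is a placeholder.
  BIND(IRI(CONCAT("http://example.org/entity/",
                  ENCODE_FOR_URI(?annotationText))) AS ?entityID)
}
```

Since a CONSTRUCT result is a set of triples, the two Singapore rows produce the same two triples and collapse into a single s:AdministrativeArea instance. The trade-off is that any two different things that happen to share a label would be merged as well, so this is a crude first pass and not a substitute for real entity linking.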

Comments? Reply to my tweet (or even better, my Mastodon message) announcing this blog entry.