Let's stop saying 'semantic web'

Like a startup pivot, the technology turned out to be great for things other than a new kind of 'web'.

“Semantic web technology” refers to technology designed to create something that never got created. That’s OK. Lots of great things were and continue to be created.

The (World-Wide) Web

Tim Berners-Lee created the original World Wide Web in 1990 by assembling five things that he had developed:

  1. A simple protocol for different systems to request and deliver resources (usually, text files) over a network: HTTP

  2. A syntax to uniquely identify such resources: URLs

  3. A markup language to represent simple yet structured documents in these delivered files, in which an important part of the structure was the ability to represent hypertext links to other such documents: HTML

  4. A program that responded to HTTP requests by sending the requested file or delivering an appropriate HTTP status code if unable to deliver it: a web server

  5. A program that could request a document from one of these servers and, if it was an HTML document, render it on a screen with headings, bulleted and numbered lists, hypertext links, and other document components displayed appropriately for humans to read it: a web client

Eventually, servers loaded with link-filled documents, along with clients to request and read them, accumulated into the World Wide Web as we know it.

The “Semantic” web

The next idea was to build on these things (mostly 1, 2, and 4) to create a web of machine-readable data that would complement the human-readable web of documents. The new web would use a simple yet flexible standardized data model (RDF) with universal identifiers, modeled on the URLs from component 2 above, to make machine-readable data, in addition to documents, available to any program that could use HTTP across the same network used for the World Wide Web. That program might be a browser, but could also be a simple Perl or Python script.
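
The RDF data model is simple enough to sketch in a few lines of Python: a graph is just a set of (subject, predicate, object) triples whose terms are identified by URIs. Everything below, the example.org URIs and the data, is invented for illustration; real RDF libraries add typed literals, blank nodes, and parsers for the standard serializations.

```python
# A minimal sketch of the RDF data model: a graph is a set of
# (subject, predicate, object) triples. The URIs and data here are
# invented; real RDF tooling adds datatypes, blank nodes, and
# serialization formats such as Turtle and JSON-LD.
EX = "http://example.org/"

graph = {
    (EX + "book1", EX + "title", "An Invented Title"),   # literal object
    (EX + "book1", EX + "author", EX + "author3"),       # URI object links resources
}

# Any program that can fetch a file over HTTP and pull triples out of
# it can join this web of data -- no browser required.
for s, p, o in sorted(graph):
    print(s, p, o)
```

Because the subject and object can both be URIs, triples in one dataset can point at resources described in another, which is what was supposed to make this a "web".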

As a bonus, this new web would use recent knowledge representation advances to store bits of meaning about the terms in any given data model. For example, if the model says that the property locatedIn is transitive and a certain piece of inventory is located in room A and room A is located in building B, an automated system could infer that this piece of inventory is in building B, and you could get more out of your data.
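
The locatedIn example can be sketched in plain Python. The URIs are invented, and a naive fixed-point loop stands in for a real reasoner, which would read the transitivity declaration from an OWL model rather than hard-coding it:

```python
# A naive sketch of transitive-property inference: because locatedIn is
# declared transitive, repeatedly add (a locatedIn c) whenever
# (a locatedIn b) and (b locatedIn c) are present, until nothing new
# appears. URIs are invented for illustration.
EX = "http://example.org/"
LOCATED_IN = EX + "locatedIn"

triples = {
    (EX + "item42", LOCATED_IN, EX + "roomA"),
    (EX + "roomA", LOCATED_IN, EX + "buildingB"),
}

changed = True
while changed:
    changed = False
    for s1, p1, o1 in list(triples):
        for s2, p2, o2 in list(triples):
            if p1 == p2 == LOCATED_IN and o1 == s2:
                inferred = (s1, LOCATED_IN, o2)
                if inferred not in triples:
                    triples.add(inferred)
                    changed = True

# The system never stored "item42 is in buildingB", but it can infer it.
print((EX + "item42", LOCATED_IN, EX + "buildingB") in triples)  # True
```

The payoff is that a query for everything located in building B finds the inventory item even though no one ever asserted that fact directly.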

A model could also say that Employee is a subclass of Person. This provides some semantics for these terms, but it was a data modeling capability that most people already took for granted from other systems. The ability to say that creationDate was a subproperty of modifiedDate was new for most people with an object-oriented background, but I have not seen this capability get a lot of use. (I’m sure there are plenty of usage examples out there, but they don’t cross my path much.)
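
The subproperty idea can be sketched the same way: every triple that uses the subproperty entails the corresponding triple using the superproperty. The property names follow the example above, but the URIs and the document data are invented:

```python
# A sketch of rdfs:subPropertyOf inference: each triple using the
# subproperty entails the same triple with the superproperty. The
# property names match the text's example; URIs and data are invented.
EX = "http://example.org/"
SUBPROP = {EX + "creationDate": EX + "modifiedDate"}  # sub -> super

triples = {(EX + "doc1", EX + "creationDate", "2024-01-15")}

inferred = {(s, SUBPROP[p], o) for s, p, o in triples if p in SUBPROP}
triples |= inferred

# Creating a document counts as modifying it, so a query on
# modifiedDate now finds doc1 without anyone asserting that triple.
print((EX + "doc1", EX + "modifiedDate", "2024-01-15") in triples)  # True
```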

This ability to store bits of meaning, or semantics, led people to start calling this potential network of data files the “semantic” web. Over time, not many people and systems took advantage of this aspect of the new web, so it turned out that this was not a great name for it. There were other issues as well.

A web of machine-readable data?

As with most Internet resources, there are two approaches to making RDF data available on the web: as static files or as dynamically generated data.

There are plenty of static RDF files out there, but not with connections between them that would form any kind of web the way that HTML files link to each other. In the early days of RDF technology, people talked about Friend of a Friend files as a step toward replacing commercial social networking sites with our own decentralized RDF-based version. In this version, our FOAF file’s triples would describe the personal data that we want to make available and also link to the FOAF files of our friends. My FOAF file is still on the hosting service that I use, and so is Tim Berners-Lee’s; as they show, we’re both friends of Norm Walsh, making Tim a friend of my friend. In my book Learning SPARQL, example ex166.rq uses Berners-Lee’s FOAF file to demonstrate how the FROM keyword can retrieve data from remote resources.
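
The friend-of-a-friend linking described above amounts to a traversal over foaf:knows triples (that property URI is real; the toy data below stands in for actual FOAF files, whose objects would be URIs identifying other people):

```python
# A toy stand-in for the decentralized FOAF idea: each person's file
# contributes foaf:knows triples, and following them finds friends of
# friends. The foaf:knows URI is the real FOAF vocabulary term; the
# people here are simplified labels rather than the real URIs.
KNOWS = "http://xmlns.com/foaf/0.1/knows"

triples = {
    ("me", KNOWS, "NormWalsh"),
    ("TimBernersLee", KNOWS, "NormWalsh"),
}

my_friends = {o for s, p, o in triples if s == "me" and p == KNOWS}

# Anyone who knows one of my friends is a friend of a friend.
friend_of_friend = {
    s for s, p, o in triples
    if p == KNOWS and o in my_friends and s != "me"
}
print(friend_of_friend)  # {'TimBernersLee'}
```

With real FOAF files, each person's triples live on their own server, and a SPARQL engine can pull several files together with FROM clauses to run exactly this kind of query across the combined graph.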

But, as I wrote fifteen years ago, “actual FOAF files have been used for little more than demos”, and I don’t know of any other sets of RDF files available on the public web that form any kind of useful network. (I should mention that Berners-Lee’s Solid project is doing interesting work helping people and organizations to share data in ways that let them control their privacy, all with RDF underneath.)

The best way to share RDF data dynamically is by making it available over a SPARQL endpoint. Wikidata and DBpedia are two amazing examples of SPARQL endpoints that have given so much to so many people in so many disciplines. For many people, their whole motivation for learning SPARQL was to gain programmatic access to the data that these two endpoints provide.
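
A SPARQL endpoint is just an HTTP service: the query travels as a parameter and the results come back in a standard format, which is why a short script is all the programmatic access requires. This sketch only builds the request URL and makes no network call; the Wikidata endpoint URL is real, but treat the rest as an illustration rather than a complete client:

```python
from urllib.parse import urlencode

# A SPARQL endpoint is plain HTTP: the query string goes in a "query"
# parameter. This sketch only constructs the request URL -- no network
# call is made. The Wikidata endpoint URL is real.
ENDPOINT = "https://query.wikidata.org/sparql"
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 3"

url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
print(url)

# Fetching this URL with urllib.request.urlopen() or curl would return
# the query results, here requested as SPARQL JSON.
```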

But, despite the growth of the Linked Data Cloud over the years, outside of the life sciences world there don’t seem to be many endpoints available besides Wikidata and DBpedia anymore. Sure, there are some, but classic ones from brand-name organizations like http://data.nytimes.com/ and http://nasataxonomy.jpl.nasa.gov/fordevelopers/ no longer work. After clicking many random nodes on the Linked Data Cloud diagram to find the relevant endpoints, I saw the “!” failure icon on nearly every SPARQL endpoint’s entry.

Of course, organizations like the U.S. National Weather Service that want to dynamically generate machine-readable data these days generate JSON, and rarely JSON-LD. There are no persistent URIs or references to other persistent URIs. Each one is basically an API to a silo, but as public APIs they are still performing a valuable service. They’re just not part of any kind of “semantic web”.
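
The difference is easy to see in miniature: a plain JSON payload uses keys that mean whatever that one API says they mean, while JSON-LD’s @context maps the same keys to persistent URIs so that any consumer can interpret the payload as RDF. The field names and vocabulary URIs below are invented for illustration:

```python
import json

# A plain JSON API response: the keys are local to this one API, so
# the data is effectively a silo.
plain = {"temp": 21.5, "station": "KNYC"}

# The JSON-LD version of the same payload: @context maps each key to a
# persistent URI, letting any consumer interpret it as RDF triples.
# The vocabulary URIs here are invented.
linked = {
    "@context": {
        "temp": "http://example.org/weather/temperature",
        "station": "http://example.org/weather/stationId",
    },
    "temp": 21.5,
    "station": "KNYC",
}

print(json.dumps(linked, indent=2))
```

The data itself is identical; only the JSON-LD version carries the mapping that would let it link to anything outside its own silo.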

There are still some nice SKOS and data model RDF datasets available, like those at the Library of Congress and AGROVOC; these are good to push because they encourage data interoperability. And, schema.org has provided an excellent, RDF-based data model that (because of the ease with which RDF data models can be extended) has provided many data projects out there with a nice dose of interoperability.

The pivot

Does this mean that the technology was a failure? Not at all. I see it as a very successful pivot—a re-application of one or more of an organization’s technologies to a new domain to address new use cases, like Twitter and Slack did in their early days. Instead of creating a web of interconnected public RDF datasets that are spread around the world, organizations have created their own internal webs that aren’t necessarily “semantic” but make it easier to share and evolve collections of datasets within those organizations. The data may be stored locally or using a cloud provider like AWS. A wide variety of commercial and open source triplestores can store this data, and there are tools that put SPARQL endpoint front ends on relational database managers so that legacy relational data can contribute to these internal webs.

How about the term “Linked Data”? There isn’t much public RDF data to link to outside of the data behind the endpoints mentioned above. But the globally unique nature of URIs makes it easy for a resource’s triples within one dataset to reference a resource in any other accessible dataset, which may be public or may be stored on another server behind the same firewall as the referencing document. I think that “Linked Data” is a perfectly good term to describe the modern version of this, but it is starting to feel a bit old.

Twelve years ago I described how the second edition of my book “Learning SPARQL” had “55% more pages! 23% fewer mentions of the semantic web!” This was a bit of a joke, but even then I could see the pattern of RDF technologies (the term that I was already using instead of “semantic web”) doing quite well without being focused on a semantic version of the World Wide Web.

After I was nearly done drafting this blog entry, I watched a YouTube interview with Andreas Blumauer, the Senior Vice President of Growth at my employer Graphwise (and the former CEO and co-founder of PoolParty, which merged with Ontotext to form Graphwise). At one point he says “It shouldn’t be called ‘semantic web’ anymore but ‘semantic enterprise standards’ because we see a lot of that adoption of RDF technologies, SKOS, SPARQL – all that is now really all around in enterprises which want to implement an enterprise knowledge graph. So that’s pretty much the same technology under the hood and enterprise knowledge graphs have adopted the semantic web standards.” It was nice to see him confirm these ideas that I had been drafting notes about. The public knowledge graph (as we would now call it) that was part of the dream of the semantic web didn’t really happen, but a large, growing number of enterprises are seeing the benefits of having their own enterprise knowledge graphs. (Also, he and I have never discussed this, so it was nice to see him use my preferred term of “RDF technologies” there.)

Semantics sneaking in there after all?

By taking text analysis technology that both PoolParty and Ontotext brought to the merger and combining it with LLMs, the “meaning” of words is now playing a growing role in the resulting applications as we use knowledge graphs to address tasks like hallucination minimization. This is a different kind of semantics from the “semantic web” kind, as I explained in “Semantic web semantics vs. vector embedding machine learning semantics” nine years ago, and it’s the kind that provides the foundation for the large language model usage that is apparently what “AI” means these days. As with other aspects of the pivot, it’s not really what was planned, but it has worked out well, making all kinds of great applications possible without much usage of the OWL classes and properties that were originally supposed to provide the semantic web’s semantics.


Comments? Reply to my Mastodon or Bluesky posts announcing this blog entry.