Earlier this week I wrote about my frustration with metadata as data about data that may never exist—data that ontology designers merely wish that someone else would create around their fabulous ontologies. Lately I’ve become interested in the more difficult but useful idea of designing ontologies around existing data in order to get more value from that data. In theory, RDF/OWL descriptions of separate, related data collections make it easier to use those collections together; how does this work in practice?
An interesting aspect of looking into new job opportunities last spring was researching what’s really going on in the XML and metadata world(s). Consultancies are getting more RDF/OWL work than I realized, especially in work for the U.S. government. I have a theory about this that I welcome people to debunk or, hopefully, back up with examples: one finding of the September 11th commission was that too much data was hidden in disparate silos in different government agencies. Sharing that data was difficult enough that the right pieces were never assembled to prevent the wrong people from entering the country (or getting their boarding passes, or whatever), even though they could have been. People interested in database integration technology could find “Homeland Security” funding, and it worked the other way as well: people with this problem to solve heard about a W3C standard technology that could help them and went looking into it.
Government and academic work
XML 2004 attendees will remember Department of Homeland Security Metadata Program Manager Michael Daconta’s keynote speech, which clearly demonstrated this interest. The Government Symposium on Information Sharing and Homeland Security recently held their fourth annual symposium. SICoP, or The Semantic Interoperability Community of Practice (“a Special Interest Group within the Knowledge Management Working Group sponsored by the Best Practices Committee of the Chief Information Officers Council, in partnership with the Federal XML Working Group”) seems to be doing interesting work in this area, although the full title on their home page (“Semantic Interoperability (XML Web Services) Community of Practice”) conflates two areas that most of us would consider to be completely separate—XML Web Services may play a role in Semantic Interoperability implementations, but they’re very different concepts. Maybe they wanted to justify the use of the domain web-services.gov.
Web searches on database integration combined with RDF/OWL, DAML, or RDFS turned up some interesting work in academia. The paper “RDF/RDFS-based Relational Database Integration” (PDF) by four researchers at China’s Zhejiang University looks promising; it describes the rewriting of SPARQL queries into sets of SQL queries and an application of their work at the China Academy of Traditional Chinese Medicine. “Knowledge Integration to Overcome Ontological Heterogeneity: Challenges from Financial Information Systems” (PDF) by the Sloan School of Management’s Aykut Firat, Stuart Madnick, and Benjamin Grosof discusses the use of RuleML and OWL’s predecessor, DAML+OIL, in MIT’s COIN (COntext INterchange) project, which seems pretty defunct as of today (dead home page, Google cache of it).
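To make the core idea of that first paper concrete, here’s a minimal sketch of rewriting a single SPARQL triple pattern into SQL. The mapping table, schema, and property names below are all invented for illustration; the real work of course handles far more (joins across patterns, filters, and so on):

```python
# A hypothetical mapping from an RDF property to a (table, key column,
# value column) triple in some relational schema. In a real system this
# mapping would come from an ontology-to-schema description.
PROPERTY_MAP = {
    "med:herbName":   ("herbs", "herb_id", "name"),
    "med:indication": ("herbs", "herb_id", "indication"),
}

def rewrite_triple_pattern(subject_var, predicate, object_var):
    """Rewrite one SPARQL triple pattern (?s <p> ?o) into a SQL query."""
    table, key_col, val_col = PROPERTY_MAP[predicate]
    return (f"SELECT {key_col} AS {subject_var}, {val_col} AS {object_var} "
            f"FROM {table}")

sql = rewrite_triple_pattern("s", "med:herbName", "name")
print(sql)  # SELECT herb_id AS s, name AS name FROM herbs
```

Multiple triple patterns sharing a variable would become a SQL join on the corresponding key columns, which is where the interesting engineering lives.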
Database integration has been an important area of computer science research for decades, and finding more papers on this topic that mention RDF/OWL or its predecessors shouldn’t be difficult. The government work that I’ve found looks like mostly top-down, big-picture design work. This is important work to do, but I’d like to find smaller scale working examples of these ideas somewhere: for example, two or more separate collections of non-RDF data, an RDF/OWL ontology describing some of the data in those collections, and something that uses that ontology to use the separate collections together. (Using separate collections of RDF together is a piece of cake—that’s part of the point of RDF.)
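Here’s a toy version of the kind of small-scale example I mean: two non-RDF record collections (standing in for, say, a CSV file and a database table), a hand-written mapping playing the role of the ontology, and one query over the merged triples. Every name here is invented:

```python
# Two separate, non-RDF collections held by different "agencies".
agency_a = [{"person_id": 1, "surname": "Chan"}]
agency_b = [{"pid": 1, "watch_status": "flagged"}]

# The "ontology": map each source's field names onto shared predicates
# and a shared subject URI scheme.
MAPPINGS = [
    (agency_a, "person_id", {"surname": "ex:lastName"}),
    (agency_b, "pid",       {"watch_status": "ex:status"}),
]

def to_triples():
    """Lift both collections into one set of (subject, predicate, object) triples."""
    triples = set()
    for records, id_field, predicates in MAPPINGS:
        for rec in records:
            subj = f"ex:person/{rec[id_field]}"
            for field, pred in predicates.items():
                triples.add((subj, pred, rec[field]))
    return triples

triples = to_triples()
# A query that needs both sources: who is flagged, and what's their surname?
flagged = {s for s, p, o in triples if p == "ex:status" and o == "flagged"}
names = [o for s, p, o in triples if s in flagged and p == "ex:lastName"]
print(names)  # ['Chan']
```

Once both sources describe the same subject URIs with shared predicates, the cross-collection query is trivial; all the real difficulty is in getting the mappings right.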
I’d love to see two different relational databases that are used together with the help of RDF/OWL, but I’m not holding my breath, because database vendors have been solving this problem without RDF/OWL for years. If the problem held the interest of the Zhejiang University researchers, though, I’m not giving up hope. How about RDF/OWL to integrate two or more collections of XML? Some XML and some relational data? XML, relational data, and some industry-specific notation systems or proprietary formats?
If you know of anything, please let me know in comments here or via private email (see the bottom of my home page for the address). A description of something going on behind a firewall somewhere is better than nothing, but I’d prefer to hear about projects that anyone can reproduce on their own using free software.
I reckon things have got quite a bit further than you suggest. You might want to check out activity around D2RQ and SquirrelRDF, a couple more SPARQL/RDBMS bridges. Damien Steer has used the latter on HP’s internal LDAP directories, very cool.
A smallish-team commercial contract job I’ve been on for a while now involves the merging of medical patient data from different sources (two completely different SQL DBs for starters). Mapping to a common RDF model seems to work pretty well for the integration problem, SPARQL seems adequate for the necessary querying facilities required.
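A rough sketch of that pattern, with two invented patient schemas in separate SQLite databases mapped into one shared triple model (all table, column, and predicate names are made up for illustration):

```python
import sqlite3

# Two databases with completely different patient schemas.
db1 = sqlite3.connect(":memory:")
db1.execute("CREATE TABLE patients (id INTEGER, family_name TEXT)")
db1.execute("INSERT INTO patients VALUES (7, 'Okafor')")

db2 = sqlite3.connect(":memory:")
db2.execute("CREATE TABLE pt_rec (mrn INTEGER, dob TEXT)")
db2.execute("INSERT INTO pt_rec VALUES (7, '1961-04-02')")

# Map each schema into the common RDF-style model.
triples = []
for (pid, name) in db1.execute("SELECT id, family_name FROM patients"):
    triples.append((f"pat:{pid}", "ex:familyName", name))
for (mrn, dob) in db2.execute("SELECT mrn, dob FROM pt_rec"):
    triples.append((f"pat:{mrn}", "ex:birthDate", dob))

# Both sources now describe the same subject, so a query over the
# merged graph sees one integrated patient record:
record = {p: o for s, p, o in triples if s == "pat:7"}
print(record)  # {'ex:familyName': 'Okafor', 'ex:birthDate': '1961-04-02'}
```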
Cool, thanks Danny!
By Damian on August 24, 2006 4:03 PM
You might also want to look at DARQ, which does federated SPARQL queries. Bastian has been using this with SquirrelRDF over LDAP (IIRC) and other data sources, some of which are RDB-backed.
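In spirit, a federated engine sends each sub-pattern to the endpoint that can answer it, then joins the variable bindings locally. Here’s a deliberately naive sketch of that join step, with plain dicts standing in for the remote endpoints (DARQ itself talks SPARQL to real services, of course):

```python
# Invented result sets from two "endpoints": an LDAP directory and a
# relational source, each answering its own sub-pattern.
ldap_endpoint = [{"person": "uid=anna", "mail": "anna@example.org"}]
rdb_endpoint  = [{"person": "uid=anna", "dept": "Research"}]

def join(left, right):
    """Natural join of two lists of variable bindings on shared variables."""
    out = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[v] == r[v] for v in shared):
                merged = dict(l)
                merged.update(r)
                out.append(merged)
    return out

results = join(ldap_endpoint, rdb_endpoint)
print(results)
# [{'person': 'uid=anna', 'mail': 'anna@example.org', 'dept': 'Research'}]
```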
Regarding “database vendors have been solving this problem without RDF/OWL for years”:
That is true (I know, I’m doing it on a daily basis), but that doesn’t mean that new and better ways and methods will never show up.
I believe RDF/OWL provides a path to those new and better solutions, but it’s arguably not as solid and mature as (relational or otherwise) databases…
Tell me more about this path…
Bob, you might be interested in D2R Server, which builds on the previously mentioned D2RQ. I’ve toyed around with integrating multiple D2R Server instances using Bastian’s DARQ and it works, at least in principle, although there are still scalability and performance issues all over the place.
A couple of other related projects are linked on the ESW wiki.
Just to chime in, Clark & Parsia is pretty active in the US federal gov’t space doing exactly what you outline, Bob. Which should come as no surprise to you, given yr new employer. :>
(Also, as to yr theory about the 9/11 stuff prompting integration: that’s pretty plausible; but inside of NASA, where we’ve done most of our RDF integration work, they’ve been worrying about their special data problems, including the silo issue, for a good long while. And the OMB’s DRM memos certainly have added additional impetus.)
In particular we’ve been pretty active with NASA doing database integration using RDF (though not OWL yet): the BIANCA project integrates several data sources to manage NASA networked assets (networks, servers, applications, and dependencies between these); and our POPS project is integrating disparate data sources (6 right now with probably dozens more in the next 18 months) using a virtual RDF federation model and a (novel) display client, JSpace, which is, essentially, a visual RDF query builder (for RDQL and iTQL, SPARQL to come later).
So, yeah, this stuff is happening, but it’s all rather under the radar in some sense.
As for OWL, I suspect we won’t really get into using it for db integration till we have some R&D funds to polish that approach first. There’s some good work done by some of the Europeans in using OWL to align database schemas as represented in UML. But that work isn’t trivial and, so far, the RDF approach has been sufficient.
We are also using RDF/OWL ontologies for database integration, also at government agencies (FAA, NASA). We have recently tooled up TopBraid Composer to support proxy mappings; see Holger’s blog entry at http://composing-the-semantic-web.blogspot.com/2006/08/update-automated-database-import-into.html for more information.
In one approach we are using datalog with OWL. A second approach uses SPARQL with OWL and D2RQ.
We’ve not been doing db integration directly, but rather indirectly. We’ve been integrating several sources of information by wrapping them in very simple web services (SOAP-based and REST-based). Each source provided data directly in RDF based on a one-to-one OWL mapping to its underlying data model (which in most cases was a db model). Then, by using cross-ontologies or interpretation ontologies (that is, by connecting similar concepts from different ontologies to some “generic” vocabulary), we were able to make queries.
Those queries were based on the “generic” vocabulary, and we used Pellet and a specific algorithm to generate the proper target query that each source could understand. The point was to transform the source query to RDQL and then translate it into the target queries. Of course, only a limited subset of query syntax was supported, and we focused on generating RDF queries (query by example) and source-specific RDQL queries.
Once data in RDF was served from the different sources (again, RDF based on each source’s own OWL), we populated the Pellet ABox and the classification did its job. From there, we could refine the query further (either based on the generic query or on some more source-specific vocabulary).
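A much-simplified version of that interpretation-ontology step: a query phrased in the generic vocabulary is rewritten into each source’s own terms before being sent off. The term mappings below are invented stand-ins for the cross-ontology assertions a reasoner like Pellet would actually supply:

```python
# Hypothetical "interpretation entries": generic terms mapped to each
# source's private vocabulary.
INTERPRETATIONS = {
    "sourceA": {"generic:Patient": "a:Subject", "generic:name": "a:fullName"},
    "sourceB": {"generic:Patient": "b:Person",  "generic:name": "b:label"},
}

def rewrite(query_terms, source):
    """Translate generic-vocabulary query terms into one source's terms."""
    mapping = INTERPRETATIONS[source]
    return [mapping.get(t, t) for t in query_terms]

generic_query = ["generic:Patient", "generic:name"]
print(rewrite(generic_query, "sourceA"))  # ['a:Subject', 'a:fullName']
print(rewrite(generic_query, "sourceB"))  # ['b:Person', 'b:label']
```

Adding a new source then means adding one more mapping entry rather than re-programming the query side, which is the payoff the commenter describes below.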
There are a couple of initiatives in our organisation aimed at encouraging the use of RDF and OWL, so that any system or data provider will be able to wrap up some of its services and provide information directly in RDF.
Our point is not to do db integration directly but to give system and data-provider developers a path to publishing their information in RDF based on their own sort of “private” ontology. The trick then is to make the connections among those disparate models. But at least you don’t have to re-program again when you’ve got yet another source in town: just create your OWL model and provide RDF, make the proper interpretation entries, let the inference engine know about them, and there you have another data source to integrate with the rest.
It’s not the most advanced proposal, but we’re trying to make the case for interoperability.