I’ve been having some fun with Apache Tika lately. According to its homepage, the “Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.” It has managed to pull some sort of metadata out of just about any file I’ve pointed it at; the list of formats it can handle includes PDF, JPEG, MP3, ePUB, Flash video files, Microsoft Office files, OpenOffice files, and more. What’s especially cool to me is that it can extract RDF.
Tika can run as a server or as a GUI window that you drag files to, but I’ve mostly played with the command line version. Running it with an argument of --help lists the available output options:
java -jar tika-app-1.0.jar --help
The -y option tells it to output XMP data, which comes out as RDF/XML. I’ve written here about XMP several times before. It’s basically an Adobe spec for media metadata expressed as RDF/XML. Because it’s RDF/XML, any semantic web tool should be able to read it. The bad news is that, by explicitly targeting XMP, this Tika output only includes metadata defined in the relevant Adobe namespaces. Specifying the -j switch instead tells Tika to give you JSON output, and you get a lot more metadata. It would be nice if Tika included an -r switch to output all the metadata it can find (the same set that it outputs when you request JSON output) as RDF/XML or Turtle. They’ve obviously already done the hard parts.
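Until something like that exists, the JSON output looks easy enough to convert yourself. Here is a minimal sketch in Python, assuming the -j output is a flat JSON object whose values are strings or lists of strings. The example.org namespace, the subject URI, and the sample values are all made up for illustration; none of them come from Tika. It emits N-Triples, which any Turtle parser can also read.

```python
import json
from urllib.parse import quote

# Hypothetical namespace for the generated predicates; Tika does not define it.
PROP_NS = "http://example.org/tika/prop/"


def escape_literal(value):
    """Escape a string so it can sit inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def metadata_to_ntriples(subject_uri, metadata):
    """Turn a flat metadata dict into a list of N-Triples lines."""
    lines = []
    for key, value in sorted(metadata.items()):
        # Percent-encode the key so that names containing colons or
        # spaces still produce a legal predicate URI.
        predicate = PROP_NS + quote(key, safe="")
        values = value if isinstance(value, list) else [value]
        for v in values:
            lines.append('<%s> <%s> "%s" .'
                         % (subject_uri, predicate, escape_literal(str(v))))
    return lines


# Stand-in for what you might capture from "java -jar tika-app-1.0.jar -j";
# the keys and values here are illustrative only.
sample_json = '{"Content-Type": "image/jpeg", "tiff:ImageWidth": "640"}'

for line in metadata_to_ntriples("http://example.org/files/photo.jpg",
                                 json.loads(sample_json)):
    print(line)
```

Treating every value as a plain string literal throws away datatypes, but it keeps the sketch short, and typing the values could be layered on later.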
Why is Tika’s ability to output media metadata in RDF so interesting to me, especially if it could someday output all the same properties in RDF that it can now output in JSON? Because different media have different metadata properties (for example, an MP3 file has different metadata from a JPEG file) and one of the greatest strengths of the RDF data model is the way it lets you accumulate property-value pairs for resources without knowing which properties you’re going to gather in advance. So, let’s say I wanted to create an application around a single set of metadata that describes a particular collection of images, music files, and related documentation. Tika plus a few selections from a wide variety of standards-compliant semantic web software, such as TopQuadrant’s TopBraid platform and many other tools, would make this almost trivial. Of course, some extra RDFS modeling around the stored properties would add more, but Tika, a triplestore, and very little else would give you enough to be off and running with a very powerful application.
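To make the point about accumulating property-value pairs concrete, here is a toy sketch in Python. It is not a real triplestore, and the class name, file names, and property values are hypothetical. It only shows that metadata from an MP3 and a JPEG can go into the same pile of triples with no schema declared up front, and that a shared property can still be queried across both.

```python
class TinyStore:
    """A toy in-memory set of (resource, property, value) triples."""

    def __init__(self):
        self.triples = set()

    def add_metadata(self, resource, metadata):
        # No schema check: whatever properties a file has are simply stored.
        for prop, value in metadata.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                self.triples.add((resource, prop, v))

    def match(self, s=None, p=None, o=None):
        """Return the sorted triples that match every non-None argument."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def properties(self):
        """Return every property name seen so far, sorted."""
        return sorted({p for _, p, _ in self.triples})


store = TinyStore()

# Illustrative metadata only; real extracted values would differ.
store.add_metadata("song.mp3", {"dc:creator": "Some Band",
                                "xmpDM:album": "Some Album"})
store.add_metadata("photo.jpg", {"dc:creator": "Some Photographer",
                                 "tiff:ImageWidth": "640"})

# The two files share one property and each has one of its own.
print(store.properties())

# A single query spans both media types.
for triple in store.match(p="dc:creator"):
    print(triple)
```

A real triplestore with SPARQL does the same job with standard URIs and far better querying, which is why so little else is needed on top of it.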