I recently asked if anyone knew of applications that pull `meta[@name and @content]` metadata out of HTML `head` elements, and I got a few interesting answers. To extract such data, writing a short XSLT stylesheet that reads the output of John Cowan’s TagSoup would be easy, but lately I’ve been thinking: with a slight change to those `meta` elements, they’d be RDFa, which can store more versatile metadata that is easier to get out (see Getting Those Triples).
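As a rough illustration of that kind of extraction, here is a minimal Python sketch that uses the standard library's `html.parser` rather than the XSLT-plus-TagSoup route described above. The class and function names are made up for this example, and the only rule it applies is the one in the XPath expression: a `meta` element inside `head` must have both a `name` and a `content` attribute.

```python
from html.parser import HTMLParser


class HeadMetaExtractor(HTMLParser):
    """Collect (name, content) pairs from meta elements inside head."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            # Mirror meta[@name and @content]: both must be present.
            if name is not None and content is not None:
                self.pairs.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_head_meta(html_text):
    """Return the name/content pairs found in the head of html_text."""
    parser = HeadMetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.pairs
```

Because `html.parser` tolerates sloppy markup, this should cope with pages that are not well-formed XML, though it makes no attempt to repair them the way TagSoup does.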
For those interested in seeing more RDFa `meta` elements in their HTML `head` elements, the difficult work has already been done. Many HTML generation routines out there have code to find the value that goes with a certain name (typically, a Dublin Core property name) and then insert the `meta` element with the name/value pair. Minimal changes to this code can make it output RDFa instead. For example, the DITA Open Toolkit is an open-source package that converts base or specialized DITA content to HTML, XHTML, JavaHelp, RTF, troff, PDF, and more formats. The HTML generation part includes a get-meta.xsl stylesheet that inserts the `meta` elements, and I’ve created a revised version called get-meta-rdfa.xsl that inserts RDFa `meta` elements instead. If you point the DITA Open Toolkit jar file at a stylesheet that just has an `xsl:import` instruction pointing at get-meta-rdfa.xsl, you’ll get all of the Toolkit’s default HTML generation with the RDFa `meta` elements substituted for the default ones. For example, instead of this,
<meta name="DC.Title" content="My Topic" />
you get this:
<meta property="dc:title" content="My Topic" />
It also adds namespace declarations for Dublin Core, Dublin Core basic terms, and PRISM, because those were the most appropriate vocabularies for the terms being added. I didn’t see any opportunities to add triples that would have URLs as the objects, which would look more like this:
<link rel="dc:identifier" href="http://www.snee.com/bobdc.blog/2007/08/who_uses_metadata_from_html_he.html"/>
You can see an example of the HTML created by the Toolkit with this module here (the look is pretty minimal; while you’re customizing the HTML generation code, you might want to point it at a CSS stylesheet as well) and the RDF triples extracted from that by triplr.org here.
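To give a feel for what a triple extractor does with markup like this, here is a deliberately simplified Python sketch. It is not how triplr.org or any real RDFa parser works: it assumes the subject of every triple is the page's own URL, it only looks at `meta` elements with `property` and `content` and `link` elements with `rel` and `href`, and it resolves prefixes only from `xmlns:` attributes it has already seen. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadRdfaSketch(HTMLParser):
    """Very rough extraction of triples from RDFa-style meta/link elements."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url  # assumed subject of every triple
        self.namespaces = {}      # prefix -> namespace URI
        self.triples = []

    def expand(self, curie):
        """Turn prefix:local into a full URI, or None if the prefix is unknown."""
        if not curie:
            return None
        prefix, sep, local = curie.partition(":")
        if sep and prefix in self.namespaces:
            return self.namespaces[prefix] + local
        return None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Remember namespace declarations such as xmlns:dc="...".
        for key, value in attributes.items():
            if key.startswith("xmlns:") and value:
                self.namespaces[key[len("xmlns:"):]] = value
        if tag == "meta":
            predicate = self.expand(attributes.get("property"))
            content = attributes.get("content")
            if predicate and content is not None:
                # Literal object
                self.triples.append((self.page_url, predicate, content))
        elif tag == "link":
            predicate = self.expand(attributes.get("rel"))
            href = attributes.get("href")
            if predicate and href:
                # URL object
                self.triples.append((self.page_url, predicate, href))


def head_triples(html_text, page_url):
    """Return (subject, predicate, object) tuples found in html_text."""
    parser = HeadRdfaSketch(page_url)
    parser.feed(html_text)
    parser.close()
    return parser.triples
```

Ordinary links such as `rel="stylesheet"` are skipped because their `rel` value has no declared prefix.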
If you’re a fan of RDFa, find some HTML generation code out there and write or revise a module to have it add some RDFa metadata. As I said, the code to do the difficult part (actually identifying the name/value pairs) has probably already been written, so you just need to revise that code to output the slightly different syntax and add a `meta` element wrapper for the revised `meta` elements to store the subject of the triples and the namespace declarations. (I did this for this weblog’s Movable Type templates several months ago.)
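The syntax change itself is small. The following Python sketch shows one way a generator that already has the name/value pairs might emit the RDFa form. The mapping rule (lowercase the prefix, swap the dot for a colon, lowercase the first letter of the local name) is an assumption based only on the `DC.Title` to `dc:title` example above, not a description of what get-meta-rdfa.xsl does, and both function names are invented for illustration.

```python
from xml.sax.saxutils import quoteattr


def to_rdfa_property(name):
    """Map a name such as 'DC.Title' to a prefixed name such as 'dc:title'.

    Names without a dot are returned unchanged (an assumption; a real
    generator would need its own rule for those).
    """
    prefix, sep, local = name.partition(".")
    if not sep:
        return name
    return prefix.lower() + ":" + local[:1].lower() + local[1:]


def rdfa_meta(name, content):
    """Render one RDFa meta element, escaping the attribute values."""
    return "<meta property=%s content=%s />" % (
        quoteattr(to_rdfa_property(name)),
        quoteattr(content),
    )


print(rdfa_meta("DC.Title", "My Topic"))
# <meta property="dc:title" content="My Topic" />
```

Using `quoteattr` keeps characters such as `&` escaped in the output, which matters if an XML-based RDFa parser is going to read the page.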
This is very cool stuff, Bob! I also took from your example the RDFa I would need to insert in my Blogger template. I was hoping the RDFa metadata could coexist with the Dublin Core recommendation for expressing metadata, but the Triplr.org app didn’t seem to like it:
Parsing http://shudson310.blogspot.com/index.html content with 'rdfa' parser failed with errors:
line 1: XML parser error: EntityRef: expecting ';'
line 1: XML parser error: EntityRef: expecting ';'
Any ideas what the issue is here?
Your index.html file isn’t well-formed. The permalink versions of my postings have a DOCTYPE declaration and fail validation because of the RDFa meta elements, but they’re still well-formed, so triplr is OK with them.
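A quick way to test well-formedness before handing a page to an RDFa parser is to run it through any XML parser. Below is a minimal Python sketch using the standard library's `xml.etree.ElementTree`; the function name is made up for this example. An "EntityRef: expecting ';'" message usually points to an ampersand that does not start a complete entity reference, such as an unescaped `&` in a URL's query string, although that is only a guess about the page in question. Python's parser words its errors differently, but it rejects the same input.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(markup):
    """Return None if markup parses as XML, otherwise the parser's message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        return str(error)
    return None


# An escaped ampersand in an attribute value is fine...
print(well_formedness_error('<a href="page?a=1&amp;b=2"/>'))
# ...but a bare one makes the document ill-formed.
print(well_formedness_error('<a href="page?a=1&b=2"/>'))
```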