TagSoup 1.0 released

A milestone for a very useful open source XML utility.

John Cowan recently announced the availability of release 1.0 of TagSoup, his Open Source Java tool that parses even the ugliest HTML and lets you treat it like well-formed XML.

This single jar file does a lot for a 50K file. Although it started off purely as a library with an API, it eventually acquired a command-line interface. Enter a command like the following (on a single line, without the line break) at your operating system prompt to create an XHTML version of the input:

java -jar tagsoup-1.0.jar 
  http://home.ccil.org/~cowan/XML/tagsoup/extreme.html

The input in the example above is John’s file of particularly “evil” HTML; you can pass a local filename as the argument as well. The wide selection of optional command-line parameters includes various options for dealing with “bogons,” or unknown elements: you can tell TagSoup to leave them alone (other than ensuring that they’re well-formed), delete them, or render them empty with their contents moved outside of their tag boundaries.
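The library side works much the same way from Java code, because TagSoup presents itself as a SAX parser. The following is only a minimal sketch, not John's own sample code: it assumes the parser class is org.ccil.cowan.tagsoup.Parser and that the TagSoup jar is on the classpath, and the class and method names (TagSoupSketch, toXml, newTagSoupReader) are made up for illustration. The parser is loaded by name so that the sketch still compiles without the jar; ordinary code would simply create the parser directly.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class TagSoupSketch {

    // Serialize whatever SAX events the given reader produces for the markup,
    // using the JDK's identity transform.
    public static String toXml(XMLReader reader, String markup) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(
            new SAXSource(reader, new InputSource(new StringReader(markup))),
            new StreamResult(out));
        return out.toString();
    }

    // Assumption: TagSoup's SAX parser class is org.ccil.cowan.tagsoup.Parser
    // and it has a public no-argument constructor.
    public static XMLReader newTagSoupReader() throws Exception {
        return (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
            .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Deliberately sloppy input: unclosed and badly nested elements.
        String messy = "<p>Unclosed paragraph<br><b>bold <i>and italic</b> text";
        try {
            System.out.println(toXml(newTagSoupReader(), messy));
        } catch (ClassNotFoundException e) {
            System.err.println("Put the TagSoup jar on the classpath to run this sketch.");
        }
    }
}
```

Because toXml accepts any XMLReader, the same code runs with a strict XML parser as well; the difference is that a strict parser would reject the sloppy input that TagSoup is designed to accept.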

Dave Raggett’s HTML Tidy program has been justly popular for cleaning up messy HTML, but it can be a bit picky. The TagSoup motto is “Keep On Truckin’,” and it will forge ahead to do its best with whatever you give it. (Try a View Source of the extreme.html file mentioned above to see the kind of HTML that it valiantly navigates.) The TagSoup home page further describes its differences from HTML Tidy.

A companion to TagSoup is TSaxon, a repackaging of version 6.5.3 of Michael Kay’s Saxon XSLT 1.0 implementation that includes TagSoup. (I would have posted about TagSoup 1.0 earlier, but John was straightening out some jar packaging problems with TSaxon.) Point TSaxon at an XSLT stylesheet and some ugly HTML, and the TagSoup parser will clean up the HTML before passing it along to Saxon to have the stylesheet applied. For example, the following (on a single line, without the line breaks), when using the TSaxon version of saxon.jar, adds id attributes to block elements of the cleaned-up version of extreme.html:

java -jar saxon.jar -H 
  http://home.ccil.org/~cowan/XML/tagsoup/extreme.html 
  http://www.snee.com/xml/xslt/addids.xsl
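The same parse-then-transform pipeline can be sketched in Java with the standard JAXP API. This is not how TSaxon itself is wired together, just an illustration of the idea under some assumptions: the parser class name org.ccil.cowan.tagsoup.Parser, the class and method names, and the inline stylesheet are all stand-ins of my own. In particular, the stylesheet below is a hypothetical simplification that adds id attributes only to p elements; it is not a copy of addids.xsl. It matches on local-name() because TagSoup reports elements in the XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SoupStylesheetSketch {

    // Hypothetical stand-in stylesheet: copy everything, and give each
    // p element that lacks an id attribute a generated one.
    static final String ADD_IDS =
        "<xsl:stylesheet version='1.0'"
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "  <xsl:template match='@*|node()'>"
      + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "  </xsl:template>"
      + "  <xsl:template match=\"*[local-name()='p'][not(@id)]\">"
      + "    <xsl:copy>"
      + "      <xsl:attribute name='id'><xsl:value-of select='generate-id()'/></xsl:attribute>"
      + "      <xsl:apply-templates select='@*|node()'/>"
      + "    </xsl:copy>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    // Parse the markup with the given reader and apply the stylesheet
    // to the resulting SAX events.
    public static String transform(XMLReader reader, String stylesheet, String markup)
            throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        t.transform(
            new SAXSource(reader, new InputSource(new StringReader(markup))),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        try {
            // Assumption: TagSoup's parser class name; needs the jar on the classpath.
            XMLReader soup = (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
                .getDeclaredConstructor().newInstance();
            System.out.println(transform(soup, ADD_IDS, "<p>first<p>second"));
        } catch (ClassNotFoundException e) {
            System.err.println("Put the TagSoup jar on the classpath to run this sketch.");
        }
    }
}
```

This version uses whatever XSLT processor the JDK provides rather than Saxon 6.5.3, so the generated id values will differ from what TSaxon produces.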

John has worked on this for four years, and was careful not to push it along any faster than it deserved to go (I wanted to write about TSaxon in my XML.com column on XSLT, but he insisted that it wasn’t ready yet), so reaching 1.0 is really a milestone for TagSoup. It’s quite a gift to people who do screen scraping or any other kind of beating messy HTML content into shape.

1 Comment

By Max Völkel on July 6, 2006 9:12 AM

You should also have a look at CyberNeko. That library, by Andy Clark, is well maintained and performs well.