Word 2003's awful XML for index elements

My "XML version of their RTF" joke has become too real to be funny anymore.

May 31, 2007

I’ve mostly watched the OpenOffice vs. Office Open XML debates as a spectator, but I have dealt directly with OpenOffice XML with some nice results. I dabbled with Word’s XML a bit and found at least one nice surprise, but I hadn’t waded in too deeply until recently, and now that I have, I’m pretty disappointed. Basic paragraph markup is pretty messy, and the markup of index terms is awful.

The w:p paragraph elements are split fairly arbitrarily into w:r elements. A Microsoft Overview of WordprocessingML tells us that w:r stores “A contiguous set of WordprocessingML components with a consistent set of properties”. That’s all it tells us. Next to this definition is a link to a special page for the r element that tells us nothing more about it but does tell us “For more information on this element, please refer to the VML Reference, located online in the Microsoft Developer Network (MSDN) Library”. It tells us this eleven times. Seriously. The w:r elements are broken up, arbitrarily as far as I can tell, into w:t elements, which are defined as “a piece of text”. (Not like all those other XML elements!) I have to wonder what the famous six thousand pages of documentation for this format actually say.
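To make that nesting concrete, here is a minimal Python sketch that walks a w:p element and joins the contents of its w:t descendants, ignoring the w:r boundaries. The namespace URI bound to the w: prefix and the sample paragraph are assumptions for illustration; neither comes from the documentation quoted above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the w: prefix in Word 2003 XML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def paragraph_text(p):
    """Join the text of every w:t descendant of a w:p element."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


# Hypothetical paragraph, split into runs the way Word might split it.
sample = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>Beagles are </w:t></w:r>
  <w:r><w:t>friendly</w:t></w:r>
  <w:r><w:t> dogs.</w:t></w:r>
</w:p>"""

print(paragraph_text(ET.fromstring(sample)))
```

Only w:t content is collected, so anything stored in other children of a w:r (field instructions, for example) is skipped.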

The w:r and w:t elements are annoying, but it’s not a lot of coding to just ignore them and concatenate their contents together. However, I don’t even want to try to write code that processes the XML version of index terms from a Word file. Here’s a sample, showing what happens when I inserted the index term “dogs” with a secondary term “beagles”:

<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText> XE "</w:instrText></w:r>
<w:proofErr w:type="spellStart"/>
<w:r wsp:rsidRPr="00DD6A97">
  <w:instrText>dogs:beagles</w:instrText>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r><w:instrText>"</w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>

All of those w:r elements are siblings of all the other w:r elements in the same paragraph, so the only indication that the markup above is supposed to function as a single unit is the fact that one w:r element’s w:fldChar child (which the documentation says “Represents a field-delimiting character”) has a w:fldCharType of “begin” and another has a value of “end”. Since a test in a separate Word document shows that Word recognizes the words “dogs” and “beagles” as spelled properly but doesn’t recognize the string “dogs:beagles”, I’m guessing that the two w:proofErr elements are there because after Word put my primary and secondary index terms together with a colon delimiter, it didn’t recognize what it saw as a properly spelled word and marked the string as a misspelled one.

Looking at the original Word file, I suppose that the field-delimiting characters in question are curly braces, and the value of the first w:instrText element (which the doc says “Represents field instruction content”), ' XE "', tells us that it’s an indexing field. (Of course the double quote isn’t part of that; it goes with the one after the second w:proofErr element!)
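A rough Python sketch of what reassembling such a field could look like: scan a paragraph in document order, collect every w:instrText between a "begin" and an "end" w:fldChar, and split the quoted XE argument on the colon. The namespace URIs and the wrapping w:p element are assumptions (the sample above does not show them), and the function name index_entries is made up for this example. It does not handle nested fields or any switches that might follow the quoted term.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the w: and wsp: prefixes in Word 2003 XML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WSP = "http://schemas.microsoft.com/office/word/2003/wordml/sp2"


def index_entries(paragraph):
    """Yield the list of term levels for each XE field in a w:p element."""
    inside, parts = False, []
    for el in paragraph.iter():  # document order
        if el.tag == f"{{{W}}}fldChar":
            kind = el.get(f"{{{W}}}fldCharType")
            if kind == "begin":
                inside, parts = True, []
            elif kind == "end" and inside:
                inside = False
                instruction = "".join(parts).strip()
                if instruction.startswith("XE"):
                    # Drop the XE keyword and the quotes, then split on ":".
                    yield instruction[2:].strip().strip('"').split(":")
        elif inside and el.tag == f"{{{W}}}instrText":
            parts.append(el.text or "")


# The sample from this post, wrapped in a w:p so that it parses on its own.
sample = f"""<w:p xmlns:w="{W}" xmlns:wsp="{WSP}">
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText> XE "</w:instrText></w:r>
<w:proofErr w:type="spellStart"/>
<w:r wsp:rsidRPr="00DD6A97">
  <w:instrText>dogs:beagles</w:instrText>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r><w:instrText>"</w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""

print(list(index_entries(ET.fromstring(sample))))
```

The w:proofErr elements are simply skipped, since only w:fldChar and w:instrText are inspected.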

Has anyone written anything to parse through this mess? Some OpenOffice coders have written something to parse the original Word doc file, and they represent the same index tag with this single empty element:

<text:alphabetical-index-mark text:string-value="collies" text:key1="dogs"/>

The primary and secondary terms are stored as separately addressable values in less than one-fourth the text that Word used for its markup, we don’t need to guess where the markup showing the index terms starts and ends, and as an added bonus, the single element containing this has the word “index” in its name. (And of course, OpenOffice didn’t create a misspelled “word” and then identify it as misspelled.)
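For comparison, reading the two values from the OpenOffice element takes only a couple of attribute lookups. This Python sketch assumes the text: prefix is bound to the OpenDocument text namespace, which the snippet above does not show.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the text: prefix (OpenDocument text namespace).
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

mark = ET.fromstring(
    f'<text:alphabetical-index-mark xmlns:text="{TEXT}" '
    'text:string-value="collies" text:key1="dogs"/>'
)

primary = mark.get(f"{{{TEXT}}}key1")            # first-level key
secondary = mark.get(f"{{{TEXT}}}string-value")  # the entry itself
print(primary, secondary)
```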

In the Word version, the idea of curly brace field delimiters around the index markup brought up the specter of a ghost, and soon the ghost was hovering in front of me, moaning and rattling chains. To try to learn more about “XE” as a field instruction, I did a Google search for ‘word xml index xe “field instruction”’ (after several other fruitless searches) and the first of the only three hits was the file Word2007RTFSpec9.doc at http://download.microsoft.com: the current spec for the original nemesis of Word interoperability, “Rich” Text “Format”. (This spec didn’t help me much.) I’d often joked that WordML was just an XML version of RTF; now I recognize that it really is, at least for indexing markup. I’ll try to look at the good side: at least if you forget a single delimiter with WordML, loading the bad document won’t cause a freeze-up that requires a hard reboot of your machine, as it often did if you forgot a single curly brace in RTF generated by a script you were working on.

If that sequence of w:r elements is Microsoft’s idea of a sensible standard for indexing markup, then they really don’t care about creating usable XML. Does anyone know of code out there that’s parsed Microsoft’s indexing XML to do something productive with it? My experiments all used Word 2003; has it been improved for the Word 2007 XML?

3 Comments

By Bryan on June 1, 2007 4:34 AM

I’ve been arguing this stuff for the past year. The main thing is that the construction of OpenXML is in such a way that surprisingly enough it has managed to produce a format that does not work well with any of the common stack of XML processing technologies: DOM, XSL-T, SAX are all hampered in one way or another by the design decisions of the format.

By Laurens Holst on June 2, 2007 1:03 PM

Isn’t putting human-readable language in attributes trouble? Think subscripts, superscripts, etc. That’s why having separate elements for them seems better to me.

By Bob DuCharme on June 2, 2007 4:01 PM

The question of storing a given piece of information in an element or an attribute is as old as SGML; see http://xml.silmaril.ie/developers/attributes/ for a summary. The convention in document-oriented XML (and an XML representation of a Word file is pretty document oriented!) is to put document content as PCDATA in elements and processing metadata in attribute values.

Regardless, if you spread information about a single construct across multiple elements, there should be a container element to show that they all go together, making it easier for processes to know when they’ve reached the beginning and end of such a construct. That’s much of the point of XML: start-tags and corresponding end-tags to show where things begin and end.
