Converting Word documents to DITA

Via OpenOffice and DocBook.

November 20, 2009

I recently had to convert a few Microsoft Word documents to DITA XML and thought it would be worth sharing my notes on the steps I took. To summarize, I opened each Word document with OpenOffice 3.1, saved it as a DocBook XML document, and then converted that to DITA with the XSLT stylesheet from a DITA plugin that I found. Images were a little more trouble, but at least I was able to eventually automate that part as well, dispelling my worries that I’d have to add all the image references to the DITA files by hand.

Word to DocBook

When you open a Word file with OpenOffice and do a Save As DocBook, it assumes that the document uses default Word styles, because that’s how OpenOffice knows what’s what in the document’s structure. The conversion does an impressive job of adding wrappers in the appropriate places considering that it’s using an XSLT 1.0 stylesheet. This kind of stylesheet would be much easier to write with XSLT 2.0, but requiring XSLT 2.0 would reduce the choice of XSLT processors that you can use. It doesn’t matter much from the user’s perspective, because it’s all under the covers anyway. The key thing is the convenience of creating the DocBook version from OpenOffice with a simple Save As.

On the downside, some nested bulleted lists in the original content did not show up in the DocBook version. I found this after converting the eventual DITA version of one of these documents to a PDF file with the DITA Open Toolkit and skimming through the original Word file and the new PDF to do a block-by-block comparison. (I strongly recommend this QA step if you’re doing this conversion with important content.) Many bulleted lists got converted to numbered lists as well, although I’m not sure if this was the fault of the Word to DocBook conversion or of a later stage described below. Another small issue is that when the original had more than one space character in a row, all but one got converted to hard (non-breaking) spaces to maintain the spacing in XML. I just deleted all the hard spaces from the DITA version with a global replace, but you may want to keep them, depending on how the documents use them.

Typical Word users add space between paragraphs by inserting an extra carriage return instead of adjusting the styles included with the document, so your output from this conversion step might have a lot of empty para elements. You can delete these with a simple XSLT stylesheet or even a global replace in a text editor.
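If you would rather script this cleanup than do it by hand, here is a minimal Python sketch of the idea. It is not the XSLT stylesheet mentioned above; it uses the standard library's ElementTree instead, and it assumes the DocBook file has no namespace (as with DocBook 4.1.2) and that a para counts as empty when it has no child elements and only whitespace or hard spaces as text. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # the "hard space" character


def is_empty_para(elem):
    """True for a para with no child elements and only whitespace text."""
    text = (elem.text or "").replace(NBSP, " ")
    return elem.tag == "para" and len(elem) == 0 and not text.strip()


def drop_empty_paras(root):
    """Remove empty para elements in place and return how many were removed."""
    removed = 0
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if is_empty_para(child):
                # Keep any text that followed the removed element.
                if child.tail:
                    if prev is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        prev.tail = (prev.tail or "") + child.tail
                parent.remove(child)
                removed += 1
            else:
                prev = child
    return removed


if __name__ == "__main__":
    sample = (
        "<article><para>First.</para><para/>"
        "<para>  </para><para>Second.</para></article>"
    )
    root = ET.fromstring(sample)
    print(drop_empty_paras(root), "empty para elements removed")
    print(ET.tostring(root, encoding="unicode"))
```

A para that holds only an image or other child element is deliberately left alone, since it is not really empty.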

Adding the images

One annoying detail was that the DocBook files created by OpenOffice lack references to the images. When you save a Word file as an OpenOffice native odt (that is, zip) file, you can see that the content.xml file in there has simple, straightforward references to image files that are also in the zip file. The references look like this:

<draw:frame draw:style-name="fr1" draw:name="graphics63" 
  text:anchor-type="as-char" svg:width="6.8972in" svg:height="2.6264in" 
  draw:z-index="49"><draw:image 
  xlink:href="Pictures/10000000000003430000013EC16739CA.png"
  xlink:type="simple" xlink:show="embed" 
  xlink:actuate="onLoad"/></draw:frame>

(I had created the original images in the Word file by pasting them from somewhere else, so the conversion of each to a standalone png file was a nice bonus.) OpenOffice’s Save as DocBook feature doesn’t save these image references; the DocBook 4.1.2 version of the above that it creates looks like this:

<inlinegraphic fileref="embedded:graphics63" 
    width="6.8972inch" depth="2.6264inch"/>

(Note that DocBook 5 deprecates the inlinegraphic element.) After no luck tinkering with the sofftodocbookheadings.xsl stylesheet included with OpenOffice to create the DocBook file, I replaced its contents with an identity transformation to see what it was using as input. It turned out that it wasn’t using the original content.xml file mentioned above but some intermediary file that had replaced the xlink:href value above with a child element that stored the actual content of the image, like this:

<draw:image draw:style-name="fr1" draw:name="graphics63"
            text:anchor-type="as-char" svg:width="6.8972inch"
            svg:height="2.6264inch" draw:z-index="49">
  <office:binary-data>iVBORw0KGgoAAAANSUhEUgAAA0MAAAE+CAIAAADAgVy 
   <!-- lots more data here--></office:binary-data>
</draw:image>

At least the draw:name value of the draw:image element’s parent draw:frame element gets preserved in the DocBook output as the value of the fileref attribute, so instead of digging into OpenOffice’s architecture to see what was preparing the input for sofftodocbookheadings.xsl and trying to fix that, I wrote a getImageNameData.xsl stylesheet to pull the {draw:name, xlink:href} pairings from the original content.xml file. Then, I wrote an addImageRefs.xsl stylesheet to look up the image filenames in the getImageNameData.xsl output and insert them into a new copy of the DocBook file.
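To make the two-step lookup concrete, here is a rough Python sketch of the same idea. It is not the contents of getImageNameData.xsl or addImageRefs.xsl; the function names are hypothetical stand-ins, and it assumes the standard OpenDocument drawing and XLink namespace URIs and the "embedded:" prefix shown in the fileref value above.

```python
import xml.etree.ElementTree as ET
import zipfile

# Standard namespace URIs used in an OpenDocument content.xml file.
DRAW = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK = "http://www.w3.org/1999/xlink"
EMBEDDED = "embedded:"  # prefix seen in the DocBook fileref values


def read_content_xml(odt_path):
    """Return the raw content.xml bytes from an odt (zip) file."""
    with zipfile.ZipFile(odt_path) as odt:
        return odt.read("content.xml")


def image_names(content_xml):
    """Map each draw:frame's draw:name to its draw:image xlink:href."""
    root = ET.fromstring(content_xml)
    pairs = {}
    for frame in root.iter("{%s}frame" % DRAW):
        name = frame.get("{%s}name" % DRAW)
        image = frame.find("{%s}image" % DRAW)
        if name and image is not None:
            href = image.get("{%s}href" % XLINK)
            if href:
                pairs[name] = href
    return pairs


def add_image_refs(docbook_root, pairs):
    """Replace fileref="embedded:NAME" with the image path, in place.

    Returns the number of references that were rewritten; names with
    no match in pairs are left untouched.
    """
    fixed = 0
    for elem in docbook_root.iter():
        ref = elem.get("fileref", "")
        if ref.startswith(EMBEDDED):
            name = ref[len(EMBEDDED):]
            if name in pairs:
                elem.set("fileref", pairs[name])
                fixed += 1
    return fixed
```

In use, you would pass the result of read_content_xml() for the odt file to image_names(), parse the DocBook file with ET.parse(), and call add_image_refs() on its root before writing a new copy.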

DocBook to DITA

Erik Hennum describes a docbook2dita plugin for the DITA Open Toolkit in this posting on a DocBook list. My first attempt to use it from within the DITA Open Toolkit resulted in the errors discussed in a DITA group thread that ends with this posting from Mark Peters, who came up with a very simple solution: instead of running the conversion as a plugin, just call the XSLT stylesheet included with the plugin directly and tell it where your input is and where the output should go. The basic form of the command line that he shows worked for me.
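As an illustration of what calling a stylesheet directly can look like, here is a small Python sketch that builds and runs a Saxon 9 command line. This is not the command from that posting: the choice of Saxon, the jar name, and all file paths are assumptions, and you would substitute the actual location of the stylesheet inside the plugin.

```python
import subprocess


def saxon_command(source, stylesheet, output, saxon_jar="saxon9.jar"):
    """Build a Saxon 9 command line: -s: input, -xsl: stylesheet, -o: output."""
    return [
        "java", "-jar", saxon_jar,
        "-s:" + source,
        "-xsl:" + stylesheet,
        "-o:" + output,
    ]


def convert(source, stylesheet, output, saxon_jar="saxon9.jar"):
    """Run the transformation; raises CalledProcessError if Saxon fails."""
    subprocess.run(
        saxon_command(source, stylesheet, output, saxon_jar), check=True
    )


if __name__ == "__main__":
    # Placeholder paths: point these at your own files.
    print(" ".join(saxon_command("mydoc.xml", "plugin-stylesheet.xsl", "mydoc.dita")))
```

Keeping the command construction separate from the call that runs it makes it easy to print the command and check it before converting anything.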

Testing it

The first test to pass was whether the result was valid against a DITA DTD, and that went fine. The second test was the big one: whether the HTML and PDF created from the document by the DITA Open Toolkit looked right. In general they did, except for the issues described above, which showed that a block-by-block comparison of each PDF with the original Word file is worth the trouble. If I had to do a large number of these conversions I’d dig deeper into the nested bulleted list and bulleted/numbered list issues in the hopes of reducing the need for this final manual step.

So far, though, the automation steps that I found or put together are definitely saving me tons of potential manual work. I only had to do this to a few documents, so I didn’t mind executing each step one at a time, but if you want to use OpenOffice to convert a large number of documents, I wrote something for XML.com called Moving to OpenOffice: Batch Converting Legacy Documents a few years ago that should help.

3 Comments

By David Kelly on November 20, 2009 12:47 PM

Bob,

While we haven’t documented it, Scriptorium did a web presentation not long ago on a similar conversion process using the same tools you use. I have put together an Ant script that controls the processing chain from beginning to end, and we also added a bursting script that takes the large DITA file output and creates individual topic files with a ditamap to hold them all together.

Some processing we do in Ant includes fixing Unicode by wrapping &#x and ; around the 4-digit code from the \unnnn instances in Word. Also, I wrote an XSL script that fixes autonumbering in the OO XML document before it gets converted to DocBook. It uses an identity transform with an exception that looks for this:

text:list[contains(@text:style-name,'Outline')]

In the transform, it keeps the descendant text:h tags and discards the text:list tags. Autonumbered sections in Word cause the DocBook-to-DITA script not to pick up the headings, so no topics are output.

I have used this process for 500-page Word documents, and it appears to be reliable, for the most part. Large tables slow it down considerably. Occasionally we run into Word styles that cause problems in the OO-to-DocBook conversion, so you are right, the results must be checked carefully. But as a conversion method, it sure beats cut and paste.

Glad to see that great minds think alike!

By Bob DuCharme on November 20, 2009 12:52 PM

Thanks David! I wrote out my notes to help others who may try something like this in the future, and your comments will definitely be a further help to them.

By Jeroen Baten on February 12, 2010 3:52 AM

David, would you be kind enough to post the ant script itself? It would help me greatly in starting the toolchain!
