Generating MODS XML from RDF with Go templates

Using a built-in Go(lang) feature to drive an RDF application.


I had heard that Go (also known as “golang”) was an increasingly popular newish programming language before I migrated my blog from being generated by handmade XSLT scripts on snee.com to using the Hugo platform to generate it on bobdc.com. Hugo is written in Go, which was invented at Google (get it?) by three people, two of whom had contributed to the development of C, Unix, and important related technology at Bell Labs. Go provides an excellent basis for a website generation system because, although it prides itself on a fairly minimal core feature set, its standard library includes templating of output. As I wrote when I described the website migration, I never had to learn the programming language to get the website up and running, but I tweaked many Hugo templates to customize the website’s appearance.

This made me wonder whether Go and its templates would be a good way to generate content from RDF. Short answer: yes. After learning some Go I wrote a program that reads in triples, loads them into an appropriate data structure, and then hands that off to a template for output. Once I’d written the program, most of my work consisted of building up the template text file with little need to go back and tweak and recompile the Go code.

My demo project was to convert journal publishing RDF metadata into MODS XML. The Metadata Object Description Schema standard is hosted at the Library of Congress and is very popular for library metadata. It has its own RDF vocabulary, but a MODS to RDF Working Group decided to “consider a range of widely-adopted RDF namespaces, rather than pursuing a straight XML-to-RDF approach using the MODS RDF Ontology or proposing a new formal ontology”. This quote comes from their “MODS to RDF Mapping Recommendations” (pdf), which describes how to use Dublin Core, Library of Congress, schema.org, Europeana, and other RDF vocabularies to express MODS metadata.

This idea appealed to me because I see great potential in modeling relationships between the rich metadata standards of the publishing and library worlds in order to help people take better advantage of combinations of these standards. Using RDFS (or more high-powered modeling tools such as OWL or SHACL, but maybe just RDFS), a single system can more easily support multiple standards because it knows that if another system expects a BIBFRAME title but the host system stores book titles as Dublin Core titles, RDFS triples defining these as equivalent can help to automate the delivery of whatever the destination system wants.

In the publishing world, metadata delivery is more likely to be in XML because the content itself is often in XML. (Many people forget: that’s why XML was invented.) So, after RDF-based tools reap the benefits of the modeling described above, eventual delivery often needs to be in XML.

Shortly after I started this I described my plan for using Go templates to my daughter and she told me about the Jinja Python template library. Using that would have made this all much easier for me, because I already know Python and a nice RDF library for it, but I wanted to try it with Go specifically because templating is a standard part of the language as opposed to a community add-on library. (For embedding RDF values in templated XML, slicker proprietary alternatives to my novice Go coding are also available from MarkLogic and TopQuadrant.)

The goal

For converting semi-realistic publishing RDF to MODS XML I took this sample journal XML from the MODS website and then wrote out an RDF version of that metadata using the MODS to RDF Mapping Recommendations mentioned above. I copied all of these triples with a new subject and slight changes to their objects so that they could play the role of metadata about a dummy second document; this let me test whether my program could output MODS data for multiple documents. Finally, I used xmllint to validate the result against the MODS XML Schema to ensure that the result was valid MODS XML.

Writing and running the Go code

Many resources for learning Go are available. This tour was fine for me. I postponed reading How to Write Go Code because it looked to be more about large-scale systems, but I should have read it earlier on to better understand how to import the RDF library (or, in Go terminology, “package”) that I used.

Somewhere in the middle of this project I started reading the book The Go Programming Language, which was co-authored by Brian Kernighan, another Bell Labs alum with plenty of impressive Unix-related accomplishments to his credit, including co-authoring the seminal book “The C Programming Language” with Dennis Ritchie. (I had to look up that book’s title just now because everyone has referred to it as “The K&R” since it was published in 1978.) I’d been considering a return trip to the K&R recently but don’t need to now because the Go book is more or less the modern version of that book for a modern version of C. The book’s website includes the complete tutorial chapter, and I highly recommend it. I am tempted to put this wonderful line from the tutorial on my blog’s template so that it shows up underneath all of my blog posts: “In the interests of keeping code samples to a reasonable size, our early examples are intentionally somewhat cavalier about error handling”.

Having written the original SQL page for the Learn X in Y minutes site, I should have thought to look at its Go page sooner. It’s a concise, handy resource that gives you a broad tour of the language quickly.

Go has clear roots in C. It’s easier, though, with no pointer arithmetic or malloc memory management to worry about. I was surprised at how often I wanted to make it do something I hadn’t done with it before and got it to work by the second try.

The standard Go packages include one for creating text templates and one for HTML templates. The HTML one includes some extra bits to protect against code injection and does not pass along <!-- HTML and XML comments --> in the template to the output. I didn’t notice any other differences and found the HTML one to be fine for generating MODS XML.
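The difference between the two packages can be seen by giving both the same template text. The following is a minimal sketch, not part of the original project: the names nameTmpl, renderText, and renderHTML and the sample value are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	htmltemplate "html/template"
	texttemplate "text/template"
)

// nameTmpl is hypothetical template text handed to both packages.
const nameTmpl = `<!-- publisher --><name>{{.}}</name>`

// renderText executes nameTmpl with text/template, which copies the
// template text and the data through without escaping.
func renderText(value string) (string, error) {
	tmpl, err := texttemplate.New("name").Parse(nameTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = tmpl.Execute(&buf, value)
	return buf.String(), err
}

// renderHTML executes nameTmpl with html/template, which drops the
// comment from the template and escapes markup characters in the data.
func renderHTML(value string) (string, error) {
	tmpl, err := htmltemplate.New("name").Parse(nameTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = tmpl.Execute(&buf, value)
	return buf.String(), err
}

func main() {
	text, err := renderText("Smith & Sons")
	if err != nil {
		fmt.Println("text/template error:", err)
		return
	}
	html, err := renderHTML("Smith & Sons")
	if err != nil {
		fmt.Println("html/template error:", err)
		return
	}
	fmt.Println("text/template:", text)
	fmt.Println("html/template:", html)
}
```

The text/template output keeps the comment and the bare ampersand, while the html/template output omits the comment and writes the ampersand as &amp;amp;. The escaping is what keeps literal values with markup characters from breaking the well-formedness of generated XML.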

While Go packages are available to ease the querying of SPARQL endpoints, there is currently no Go equivalent of Python’s RDFLib, which has its own SPARQL engine. I used the knakk/rdf Go package to read the triples out of the disk files that provide my program’s input. (As a bonus, this package can read several different RDF serializations.) My program was really just a variation on the sample program that came with that package, so there are parts of my Go program that I don’t completely understand, but hey, it works. This package does have godoc documentation available.

This Yury Pitsishin blog post was a good way to get started with templates. You can define the templates within the Go source code but will more typically put them in a separate document. This offers the benefit of letting you tune the output without recompiling the conversion code. The Go code’s template definition uses the Template.ParseFiles method to specify the external file to use as a template, and then in the code the defined template’s Execute method passes along a data structure that has been populated in the program to use with the output template.
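The ParseFiles-then-Execute pattern described above can be sketched as follows. This is an illustration under stated assumptions rather than the code from the project: the helper names renderFile and writeDemoTemplate, the file name genre.tmpl, and the one-line template are all hypothetical, and the template file is written to a temporary directory only so that the example is self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"os"
	"path/filepath"
)

// renderFile parses the template stored at path and executes it
// against data, returning the generated text.
func renderFile(path string, data interface{}) (string, error) {
	tmpl, err := template.ParseFiles(path)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// writeDemoTemplate creates a throwaway template file (hypothetical
// content) and returns its path.
func writeDemoTemplate() (string, error) {
	dir, err := os.MkdirTemp("", "tmpl-demo")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "genre.tmpl")
	err = os.WriteFile(path, []byte("<genre>{{.Genre}}</genre>\n"), 0o644)
	return path, err
}

func main() {
	path, err := writeDemoTemplate()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	out, err := renderFile(path, map[string]string{"Genre": "journal"})
	os.RemoveAll(filepath.Dir(path)) // remove the temporary directory
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Because the template lives in its own file, editing that file changes the output on the next run without rebuilding the binary.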

(Because this blog entry is getting long and my current Go skills are not something to show off, I’m not including my sample code here. You can find the Go code, template, sample input, and sample output on github.)

The data structure that my program passes to my template is a map of maps that I called docsMetadata because it stores the metadata for a set of documents. A Go map is like a Python dictionary, letting you store and retrieve things using a key. The docsMetadata keys are subject URIs—three different subject URIs used in a given docsMetadata instance would be specifying metadata for three different documents—and the things stored with them are maps whose keys are predicate URIs. Those keys give access to simple arrays (well, actually, “slices”, which are Go’s dynamic version of arrays) so that I can store more than one value for a given subject-predicate combination such as the following two triples from my sample input:

<https://example.org/objects/1> dce:subject "College librarians--Recruiting" .
<https://example.org/objects/1> dce:subject "College librarians--United States" .
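A map of maps of slices like the one described above could be declared and filled along these lines. This is a sketch only: the add method is a hypothetical helper, and it assumes that the dce: prefix expands to the Dublin Core elements namespace and that full URIs are used as map keys, neither of which is confirmed here.

```go
package main

import "fmt"

// docsMetadata maps subject URI -> predicate URI -> object values.
type docsMetadata map[string]map[string][]string

// add appends one object value under the given subject and predicate,
// creating the inner map the first time a subject is seen.
func (d docsMetadata) add(subj, pred, obj string) {
	if d[subj] == nil {
		d[subj] = make(map[string][]string)
	}
	d[subj][pred] = append(d[subj][pred], obj)
}

func main() {
	// Assumed expansion of the dce:subject prefixed name.
	const dceSubject = "http://purl.org/dc/elements/1.1/subject"
	const doc = "https://example.org/objects/1"

	docs := docsMetadata{}
	docs.add(doc, dceSubject, "College librarians--Recruiting")
	docs.add(doc, dceSubject, "College librarians--United States")

	// Both values are kept in the slice for this subject and predicate.
	for _, topic := range docs[doc][dceSubject] {
		fmt.Println(topic)
	}
}
```

Appending to a slice that does not exist yet works because append accepts a nil slice, so only the inner map needs explicit creation.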

Outside of the dce:subject values my demo rarely uses any slice entries beyond the first one. The MODS schema does allow multiple values for many of its metadata properties, so this made a simple way to store more than one publisher address, media type, or other property if necessary. Also, the knakk RDF package does let you check whether a triple’s object is a URI, a literal value, or a blank node, so I could have done some fancier RDF processing of those. As a proof of concept demo I thought it best to just treat them all as strings for now.

Developing the MODS XML template

You can put pretty much any text you want in a Go template. Pairs of double curly braces enclose codes that give instructions to the compiled Go program about what to do with the data being passed in; often, this means inserting a particular component of that data structure. If the program passes an Employee data structure that has a Name field, then when the program sees “<p>Hello, {{.Name}}</p>” in the template it will replace the curly brace expression with the value of the Name field. If you edit that part of the template file to say “<p>Hello, {{.Name}} at {{.Address}}</p>” you can then run the program and see the new version of the output with no need to recompile the program.

Go’s templating language includes special keywords for tasks like conditional formatting. For example, if you don’t have Address values for all of the employees, you could format the phrase above to only include " at " and the address value if there actually was an address value: “<p>Hello, {{.Name}}{{if .Address}} at {{.Address}}{{end}}</p>”.
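The Employee example can be run as a small program. In this sketch the Employee struct follows the description above, while the greet helper and the sample values are hypothetical; the template is embedded in the source only to keep the example self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// Employee has the two fields that the template refers to.
type Employee struct {
	Name    string
	Address string
}

// greeting includes " at " and the address only when Address is non-empty.
const greeting = `<p>Hello, {{.Name}}{{if .Address}} at {{.Address}}{{end}}</p>`

// greet executes the greeting template for a single employee.
func greet(e Employee) (string, error) {
	tmpl, err := template.New("greeting").Parse(greeting)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, e); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	staff := []Employee{
		{Name: "Ann", Address: "12 Main St"},
		{Name: "Raj"}, // no address on file
	}
	for _, e := range staff {
		line, err := greet(e)
		if err != nil {
			fmt.Println("template error:", err)
			return
		}
		fmt.Println(line)
	}
}
```

An empty string counts as false in an if action, so the second employee gets the greeting without the trailing " at ".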

You can see more of the special template codes at Golang Templates Cheatsheet. An important one for my MODS project was range, which lets you iterate over multiple values for a given property. I used it for the dce:subject values mentioned above and also to enclose nearly the whole template so that I could output metadata about multiple journal documents. Because I was passing a map of maps to the template, referencing just the right bits was not as simple as pulling a Name value out of an Employee data structure, but it wasn’t too bad.
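One way to reference values in a map of maps from a template is to combine range with the built-in index function. The fragment below is a sketch, not the project's actual template: the element layout is a simplified, non-schema-complete bit of MODS, renderMODS is a hypothetical helper, and the predicate URI assumes that dce: expands to the Dublin Core elements namespace.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// dceSubject is the assumed expansion of the dce:subject prefixed name.
const dceSubject = "http://purl.org/dc/elements/1.1/subject"

// modsTmpl loops over every subject URI, then over that document's
// dce:subject values, which index pulls out of the inner map.
const modsTmpl = `{{range $subject, $props := .}}<mods>
  <identifier type="uri">{{$subject}}</identifier>
{{range index $props "http://purl.org/dc/elements/1.1/subject"}}  <subject><topic>{{.}}</topic></subject>
{{end}}</mods>
{{end}}`

// renderMODS executes modsTmpl against a subject -> predicate -> values map.
func renderMODS(docs map[string]map[string][]string) (string, error) {
	tmpl, err := template.New("mods").Parse(modsTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, docs); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	docs := map[string]map[string][]string{
		"https://example.org/objects/1": {
			dceSubject: {
				"College librarians--Recruiting",
				"College librarians--United States",
			},
		},
	}
	out, err := renderMODS(docs)
	if err != nil {
		fmt.Println("template error:", err)
		return
	}
	fmt.Print(out)
}
```

When a template ranges over a map with string keys, it visits the keys in sorted order, so the documents come out in a stable sequence from run to run. If a document has no value for the predicate, index yields an empty slice and the inner range writes nothing.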

One downside to working with these templates is the cryptic error messages caused by template problems. A missing curly brace could lead to an error message of “panic: runtime error: invalid memory address or nil pointer dereference” with no template line number or other helpful information. Instead of celebrating Brian Kernighan’s cavalier approach to error handling in examples I should probably dig further into Go’s facilities for that.
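A likely cause of that panic is a discarded parse error: when parsing fails, the parse call returns a nil template along with an error, and calling Execute on the nil template then crashes. Checking the error instead surfaces a message that names the template and the line. The following sketch assumes that this is what happened; parseChecked is a hypothetical helper and the broken template is deliberately contrived.

```go
package main

import (
	"fmt"
	"html/template"
)

// parseChecked parses template text and reports the parse error, if
// any, instead of discarding it.
func parseChecked(name, text string) (*template.Template, error) {
	tmpl, err := template.New(name).Parse(text)
	if err != nil {
		return nil, fmt.Errorf("parsing template %q: %w", name, err)
	}
	return tmpl, nil
}

func main() {
	// The closing delimiter is missing one curly brace on purpose.
	_, err := parseChecked("greeting", "<p>Hello, {{.Name}</p>")
	if err != nil {
		fmt.Println("template problem:", err)
		return
	}
	fmt.Println("template parsed cleanly")
}
```

The same check applies to the error returned by ParseFiles and to the one returned by Execute.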

Running it

The knakk sample program uses an -in command-line parameter to indicate the input file, so mine does too:

rdf2modsxml -in modsjournals2.ttl > modsjournals2.xml

The output had a lot of extra blank lines, which ultimately don’t matter in XML, but I sometimes ran the program like this to remove them:

rdf2modsxml -in modsjournals2.ttl | awk 'NF > 0' > modsjournals2.xml

(Raise your hand if you know what the “k” in “awk” stands for.)

Was it valid MODS XML? As I mentioned above, I used xmllint (which seems to be part of most standard Linux distributions now and can be downloaded for Windows or MacOS) to validate the result against the MODS XML Schema.

xmllint --schema mods-3-4.xsd modsjournals2.xml --noout

The --noout parameter tells xmllint not to show any of the content and to just indicate whether the XML document conforms to the schema or not. The output of my rdf2modsxml program did conform.

Go and RDF and publishing and library metadata

If the development of a useful new tool requires the writing of code that imports libraries and then needs to be compiled to a binary version, that can be asking a bit much of people who are not full-time software developers. If that code is fairly simple (with a package to do the most difficult part already available) and the main work of using the tool consists of just editing a separate text file, then I think that the use of Go templates for RDF application development offers some real promise. With Hugo as a model, this could obviously be done to use RDF data in applications destined for browsers; I was especially happy to see that it works to generate XML that conforms to an important standard unrelated to HTML.

I’m also going to start being braver about messing around with the Hugo templates used to generate this blog!