Linking linked data to U.S. law

Automating conversion of citations into URLs.

At a recent W3C Government Linked Data Working Group meeting, I started thinking more about the role in linked data of laws that are published online. To summarize, you don’t want to publish the laws themselves as triples, because they’re a bad fit for the triples data model, but as online resources relevant to a lot of issues, they make an excellent set of resources to point to, although you may not always get the granularity you want.

I’m discussing U.S. Federal law here, but similar principles should apply both in individual states and in other countries. The main sets of laws here are legislation, code, regulations, and court decisions. (“Code” refers to laws passed by legislation, arranged by topic; for example, laws passed about taxes are gathered into the Internal Revenue Code.) If you really want to learn about the various forms of legal material and their relationship, I highly recommend the book Finding the Law, which I found indispensable when I worked at LexisNexis.

Most law consists of narrative sentences arranged as paragraphs, often with metadata assigned to certain blocks of it. It’s such a good fit for XML that legal publishers were among the first users of XML’s predecessor, SGML. (Their use of XML and SGML accounts for a large chunk of my career, and I know that some old XML friends like Sean McGrath and Dale Waldt continue to make great contributions in this area.) So, while you wouldn’t get much benefit from splitting these sentences and paragraphs into subjects, predicates, and objects and publishing them as triples, plenty of government data references laws and related materials, and it’s more helpful if that data can reference them with URLs that lead to the actual laws. To add these URLs with any kind of scalability, you need two things: the common format for citing a document (or, if possible, a point within a document), and an online source of those legal documents whose URLs can be built from that citation format with a regular expression or some other automated tool.

When creating links to any specific bits of U.S. law, the most valuable book is The Bluebook: A Uniform System of Citation. As the subtitle implies, the book describes the normalized way to refer to legal documents and their components. Once you know these, a regular expression can often turn them into a URL that leads a browser right to the part you want. For example, while people often refer to the Supreme Court case outlawing school segregation as “Brown v. Board of Education”, its official citation is “347 U.S. 483”, which means “the case beginning on page 483 of volume 347 of the official publication of U.S. Supreme Court decisions”.

While there are several sites hosting Supreme Court decisions out there, notably Cornell Law School’s Legal Information Institute, the site whose URLs are easiest to construct from a proper Supreme Court citation is justia.com, where the URL for Brown v. Board of Education is http://supreme.justia.com/us/347/483/case.html. (See also my favorite case, Campbell, aka Skyywalker, et al. v. Acuff-Rose Music, Inc. at http://supreme.justia.com/us/510/569/case.html. Make sure to listen to the relevant work on YouTube while you review it.) If you’re really interested in linked data and U.S. Supreme Court cases, DBpedia has lots of great metadata for many important cases, as I wrote about in Court decision metadata and DBpedia.
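
As a minimal sketch of the idea, here is how a few lines of Python could turn a “volume U.S. page” citation into a justia.com URL. It assumes that the URL layout of the two examples above holds for other volumes and pages, which those two examples alone don’t prove. The function name and the pattern are my own illustration, not anything that Justia publishes.

```python
import re

# Matches citations such as "347 U.S. 483": volume, reporter abbreviation, first page.
# The optional space inside "U. S." is an assumption about variation in real text.
US_REPORTS = re.compile(r"\b(\d{1,3})\s+U\.\s?S\.\s+(\d{1,4})\b")

def justia_url(citation):
    """Return a justia.com URL for a U.S. Reports citation, or None if none is found."""
    match = US_REPORTS.search(citation)
    if match is None:
        return None
    volume, page = match.groups()
    # URL layout inferred from the two example URLs in the text.
    return f"http://supreme.justia.com/us/{volume}/{page}/case.html"

print(justia_url("347 U.S. 483"))
print(justia_url("Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994)"))
```

Returning None for a string with no recognizable citation, instead of guessing, lets the caller decide what to do with text that only names the case.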

To create a URL for other U.S. court systems, you’ll have to look up the proper way to cite them in a resource like the Bluebook and then look for versions of that court’s cases online with URLs that reflect the citation in a manner that lets you automate the creation of the URL. This is a theme for linking to any kind of law on the web, and you can be sure that developers at the Legal Information Institute, LexisNexis, WestLaw, and other legal publishers have put plenty of time into developing regular expressions to make this happen so that they can turn plain text citations into hypertext links. (It would be great if the LII made their regular expressions public. LexisNexis and WestLaw never would, although they’re more interested in keeping such proprietary work away from each other than from us.)
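
Turning citations in running text into links is mostly a matter of handing each regular expression match to a function that builds the link. Here is a minimal sketch for Supreme Court citations only, assuming the justia.com URL layout from the two examples above. The linkify name and the deliberately strict pattern are mine; real text would need a more forgiving pattern, and the input would need HTML escaping before it went anywhere near a web page.

```python
import re

# Strict form only: "347 U.S. 483" with single spaces and no spacing variants.
US_REPORTS = re.compile(r"\b(\d{1,3}) U\.S\. (\d{1,4})\b")

def linkify(text):
    """Wrap each "volume U.S. page" citation found in text in an HTML link."""
    def to_anchor(match):
        volume, page = match.groups()
        url = f"http://supreme.justia.com/us/{volume}/{page}/case.html"
        # match.group(0) is the citation exactly as it appeared in the text.
        return f'<a href="{url}">{match.group(0)}</a>'
    return US_REPORTS.sub(to_anchor, text)

print(linkify("Compare 347 U.S. 483 with 510 U.S. 569."))
```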

Legislation can be more complicated, but two excellent resources make it remarkably simple. The first is the Library of Congress’s THOMAS system, which lets you create persistent URLs for legislation using the handle system (see also its inventor’s web page on it), something I hadn’t heard of before the Government Linked Data meeting. The Law Librarian Blog has a nice entry showing examples of how to use it. The second is LegisLink, which looks simpler to me. A Legal Information Institute blog entry has a good explanation of it, and LegisLink provides an excellent form to construct the URLs. Both even let you construct links to a specific section of a piece of legislation.

Granularity is an even bigger issue when linking to code and regulations, which are often broken down into numbered and lettered pieces of pieces of pieces. Ever since I worked at the grandly named Research Institute of America (a publisher of hyperlinked U.S. tax law and related information), it’s always irked me to see people refer to a pension plan as a 401K, because as subsection k of section 401 of the U.S. Tax Code (title 26 of the U.S. Code), it’s more properly written 401(k), or, to use its full name, 26 USC 401(k). The Government Printing Office lets you link directly to section 401, if not subsection k, with the URL http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+26USC401, and the LII lets you link to it with http://www.law.cornell.edu/uscode/26/usc_sec_26_00000401----000-.html.
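
The same trick can be sketched for the U.S. Code. The Python below assumes that the GPO URL above generalizes to other titles and sections by swapping in the two numbers, which a single example can’t confirm, and it only handles plain numeric section numbers. The function name and the pattern are hypothetical. Because the GPO URL stops at the section level, the function hands back the subdivisions it had to drop instead of silently discarding them.

```python
import re

# Title, "USC" (with or without periods), optional section sign, section number,
# then any run of parenthesized subdivisions such as "(k)" or "(a)(1)".
USC = re.compile(
    r"\b(\d+)\s+U\.?S\.?C\.?\s+(?:§\s*)?(\d+)((?:\([0-9A-Za-z]+\))*)"
)

def gpo_usc_url(citation):
    """Return (url, dropped_subdivisions) for a U.S. Code citation, or None."""
    match = USC.search(citation)
    if match is None:
        return None
    title, section, subdivisions = match.groups()
    # URL layout inferred from the single GPO example in the text.
    url = (
        "http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi"
        f"?dbname=browse_usc&docid=Cite:+{title}USC{section}"
    )
    return url, subdivisions

print(gpo_usc_url("26 USC 401(k)"))
print(gpo_usc_url("26 U.S.C. § 401(k)"))
```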

That’s the US Code, which arranges the laws by topic. Regulations are arranged by topic in the CFR, or Code of Federal Regulations. For example, the legal definition of bourbon is in title 27 of the CFR (Alcohol, Tobacco Products and Firearms), Part 5 (Labeling and Advertising of Distilled Spirits), section 22 (The standards of identity), subsection b (Class 2; whisky) subsubsubsection (1)(i). The full citation would be 27 CFR 5.22(b)(1)(i), but I know of no way to link to anything more specific than 27 CFR 5.22: http://edocket.access.gpo.gov/cfr_2010/aprqtr/27cfr5.22.htm. (Bookmark that on your phone’s browser and then bet a Maker’s Mark with the next barroom loudmouth that you hear insisting that bourbon must legally be made in Bourbon County, Kentucky. He’s wrong. It can be made anywhere in the United States.)
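
A CFR citation can be taken apart the same way. In this sketch, the “cfr_2010/aprqtr” part of the URL is copied from the one example above and exposed as a parameter, since I assume that other titles and years need a different value; I haven’t worked out the rule for choosing it. The function name and the pattern are my own. The subdivisions come back as a list, so you can at least tell the reader where to look once the page opens.

```python
import re

# Title, "CFR" (with or without periods), optional section sign, part.section,
# then any run of parenthesized subdivisions such as "(b)(1)(i)".
CFR = re.compile(
    r"\b(\d+)\s+C\.?F\.?R\.?\s+(?:§\s*)?(\d+)\.(\d+)((?:\([0-9A-Za-z]+\))*)"
)

def gpo_cfr_url(citation, edition="cfr_2010/aprqtr"):
    """Return (url, subdivisions) for a CFR citation, or None if none is found.

    edition is the year-and-quarter path segment taken from the one example
    URL in the text; treating it as a free parameter is an assumption.
    """
    match = CFR.search(citation)
    if match is None:
        return None
    title, part, section, tail = match.groups()
    url = f"http://edocket.access.gpo.gov/{edition}/{title}cfr{part}.{section}.htm"
    # Split "(b)(1)(i)" into ["b", "1", "i"]: the levels the URL cannot express.
    subdivisions = re.findall(r"\(([0-9A-Za-z]+)\)", tail)
    return url, subdivisions

print(gpo_cfr_url("27 CFR 5.22(b)(1)(i)"))
```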

As you can see, there’s some work involved in creating URLs for links to laws, but research for this blog entry led me to new resources like LegisLink that I hadn’t heard of before, so I encourage you to let me know if there’s anything important that I’m missing.

It was also interesting to see that the LII is involved in efforts to create an international standard for legal document URIs proposed by some Italian legal researchers. (This is particularly interesting when you consider that Italian legal researchers basically invented the concept of linking 900 years ago.)

A comment from Frank Bennett of Nagoya University’s Faculty of Law:

These are indeed important developments. The systematic linking of case law and statutory data promises to have a large and positive impact on our access to legal resources. The only point I would take issue with is the reliance on Bluebook citation forms as the rosetta stone for identifying resources. Parsing cites out of plain text is a necessary kludge, given the general absence of meaningful structured metadata from online legal resources (thank you Lexis, thank you WestLaw), but it should be recognized as a kludge.

To get a lively set of service layers running on top of legal data, the metadata contained in or relevant to a particular case, statutory provision or regulatory provision needs to be readily accessible to calling applications. While it is true that string parsing machinery can be written to a good standard, assuming perfectly regular citation forms and uniform document formats, neither of those constraints applies in the wild. The Bluebook shares the field in North America with the ALWD and the McGill Guide. To make matters worse, the Bluebook specifies citation forms for some foreign legal resources that vary significantly from the native citation forms of the target jurisdictions. Document formats vary as well, so getting an accurate string parse may require special-purpose serialization of the document before applying a string parser to the text – which may be hundreds of pages in length. Although certainly better than nothing, string parsing is a fragile strategy that would be very cumbersome to standardize and does not scale well.

Matching rendered cites to URLs is an important prospect, but we won’t see significant progress at the application level until the intervening step of producing true structured metadata – and embedding it in our online resources – is covered.

A comment from Augusto Herrmann:

I just read your interesting article entitled “Linking linked data to U.S. law”. I’d like to point you to a quite successful government project that uses URNs for Brazilian legislation. The portal where you can search for legislation is at http://www.lexml.gov.br and information about the project can be found at http://projeto.lexml.gov.br . There you can find the document “Parte 2: LEXML URN”, which describes the rules for constructing official URNs for legislation and court decisions (it’s in Portuguese, though). The project started circa 2004 and closely followed in the footsteps of the Italian Norme in Rete project. If you aren’t yet familiar with it, it’s worth a look (see also akomantoso.org and metalex.eu).

(Note on comments: after turning off comments on this blog for a few days because of comment spam, turning them back on seems to have no effect. If you send me an email about what I’ve written at snee.com (bob), I’ll add it and any response here.)