Digitization and its discontents

How sloppy is OK for Google scans?

[Image: a bad Google scan]

Every now and then someone finds a page image from Google Book Search that shows the scanner operator’s hand over the page or something else that should be equally embarrassing. If finding such images were a game, Bill Trippe would have the high score. Dale Dougherty recently pointed out the hilarious example shown here. (It took me a while to realize that the little flashes of color are the rubber fingertips that editorial workers often use to help them turn pages faster.)

One commenter wrote that “after this, their QA department must be in trouble”. To paraphrase my reply: unless you’re a piecemeal nonprofit (which Google certainly isn’t), part of the project planning for any large-scale digitization effort is deciding what level of quality you’re going to achieve and then putting the QA infrastructure in place to attain it. 100% accuracy isn’t possible, but 99.95% and 99.995% are. Of course, 99.995% means a more expensive QA infrastructure, and your budget and schedule are two important inputs when deciding on an accuracy figure to aim for. At my place of employment I have co-workers who have this measurement down to a science; contact me if you’re interested in hearing more.

Google’s QA department is only in trouble if they’re not meeting their accuracy goal. I’d love to see what this figure is, but I’m not holding my breath waiting for them to reveal it. I’m guessing that it’s lower than 99.95%.
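
To put those accuracy figures in concrete terms, here’s a back-of-the-envelope sketch in Python; the seven-million-page project size is a number I’ve made up purely for illustration:

    # Back-of-the-envelope arithmetic: what different page-level accuracy
    # targets mean for a large scanning project. The page count below is
    # invented for illustration only.
    TOTAL_PAGES = 7_000_000

    for accuracy in (0.9995, 0.99995):
        expected_bad_pages = TOTAL_PAGES * (1 - accuracy)
        print(f"{accuracy:.3%} accuracy -> about {expected_bad_pages:,.0f} "
              f"defective pages out of {TOTAL_PAGES:,}")

The gap between the two targets is an order of magnitude in defective pages, and closing that gap is where the extra QA expense goes.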

The same day that Dale posted this I learned from Robin Cover’s XML Daily Newslink that the National Library of the Netherlands had published a report titled The Current State-of-art in Newspaper Digitization: A Market Perspective in D-Lib magazine, an online magazine focused on digital library research. Newspaper digitization presents all the difficulties of book digitization and a few more:

  • The pages are bigger, so you need larger (and more expensive) scanners to make page images.

  • A single page can have many articles, and many of these—especially those that begin on the front page—often end on a different page. The D-Lib article calls these “so-called ‘continuation’ articles”, and newspaper people call the continuations “jumps”.

    In addition to the extra work of assembling the pieces so that the digital versions of these articles are coherent wholes, this presents some new metadata problems: if a keyword search finds a phrase on page 31, but that phrase is in the jump for an article that begins on page 12, will you retain this information when you assemble the pieces? And if so, how will you store it and present it to the user? (See the sketch after this list.)

  • Knowing that most people will throw out their newspapers within days, newspaper publishers save money by printing them on cheap paper. The high acid content of newsprint means that it turns brown and crumbly, and becomes harder to OCR, much faster than paper used in other kinds of publishing.
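
On the jump-metadata question above, here’s a minimal sketch of one way such relationships might be recorded; the data model, names, and sample article are my own invention, not anything taken from the D-Lib article:

    # A minimal sketch of jump metadata: recording which pages hold an
    # article's continuations so that a full-text hit on a jump page can
    # be traced back to the article it belongs to. (Hypothetical
    # structure, not from the D-Lib article.)
    from dataclasses import dataclass, field

    @dataclass
    class Article:
        article_id: str
        headline: str
        start_page: int
        jump_pages: list = field(default_factory=list)  # continuation pages, in order

    def describe_hit(article, hit_page):
        """Tell the user where a full-text hit falls within an article."""
        if hit_page == article.start_page:
            return f"Hit on page {hit_page}, where '{article.headline}' begins."
        if hit_page in article.jump_pages:
            return (f"Hit on page {hit_page}, in the continuation ('jump') of "
                    f"'{article.headline}', which begins on page {article.start_page}.")
        return f"Page {hit_page} is not part of '{article.headline}'."

    # The scenario from the post: a phrase found on page 31 that belongs
    # to an article beginning on page 12.
    story = Article("1908-06-01-a07", "City Council Approves Subway Plan",
                    start_page=12, jump_pages=[31])
    print(describe_hit(story, 31))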

The D-Lib article covers these and more general digitization issues very well. I recommend it to anyone interested in what goes into such a project, whether it’s the Google Book Search project or one you’re considering yourself.

1 Comment

By John Cowan on January 24, 2008 11:27 AM

I don’t have inside knowledge about this, but on the evidence of one case, Google Books will yank a scanned book if there are fingers visible even if they don’t interfere with the comprehension of the text. However, they do seem to depend mostly on complaints by others (a reasonable attitude IMHO – Google *is* behaving like a nonprofit in its book-scanning activities).