Reclaiming my picture metadata from flickr

Surprise: by converting multiple sources of data to triples and then running a SPARQL query.


We should give flickr some credit for providing an API that lets us download the metadata we’ve entered about our pictures (for example, titles, descriptions, and membership in custom sets such as XML Summer School 2011 or Artsier Stuff), but that metadata all refers to pictures on flickr’s servers. What if I want to use blurb.com to print a hardcopy album of one of these sets? Do I have to download that set’s pictures from flickr, even though I already have them on a hard disk, just because I don’t know which files on my hard disk correspond to the pictures in that set on the flickr server?

As it turns out, no. The general question is this: how do I connect metadata that I’ve entered on flickr.com with the files on my local hard disk? Assuming that I never took two different pictures in the same millisecond, I can use the date-time stamp stored inside of each JPEG image file as a unique ID (or, in more OWLish terms, as an inverse functional property, although I didn’t actually use owl:InverseFunctionalProperty anywhere and just let SPARQL do the work), so here’s what I did:

  1. I used the flickr API to download the metadata about all the pictures that I have stored there, including set membership. This data was all in XML, so I then used some XSLT to convert that to Turtle RDF.

  2. I used Apache Tika (an open source toolkit I’ve written about here before) to pull out metadata about all the pictures on my hard disk as JSON. (I could have asked Tika for XMP, which would give me RDF, but asking for JSON gets you more data.) I then used some JavaScript to convert this JSON to Turtle RDF. For the file \My Pictures\2012-01-12\IMG_5907.jpg, I created an IMG_5907.jpg.ttl file where the subject of all the triples is the URI http://www.snee.com/bob/pics/id/2012-01-12/IMG_5907.jpg.

  3. I loaded all this RDF into a triplestore and then ran the query shown below, which (in this case) showed me the URIs for the image files on my hard disk that corresponded to each picture stored in my flickr “Artsier Stuff” set:

PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bf:   <http://snee.com/ns/flickr#>
PREFIX exif: <http://www.w3.org/2003/12/exif/ns#>
SELECT * WHERE {
  ?ps a bf:Photoset ;
      dc:title "Artsier Stuff" ;
      rdfs:member ?memberPic .
  ?memberPic dc:title ?picTitle ;
             bf:dateTaken ?flickrDate .
  OPTIONAL { ?diskPic exif:dateTimeOriginal ?flickrDate . }
}
ORDER BY ?flickrDate

The query finds the bf:dateTaken value of each picture from that set, then looks for a local disk file with that same date-time stamp. I put that last bit in an OPTIONAL pattern because I wasn’t sure whether it would successfully find local versions of all the files, and wanted to see which ones it had trouble with. As it turned out, it didn’t have trouble with any of them, which was great to see.
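
For readers who think more readily in a general-purpose language, the matching that the query performs can be approximated in a few lines of Python. This is only a sketch of the idea, not the code behind this post: the record layout, the field names (date_taken, date_time_original), the example.org URI, and the sample values are all made up for illustration, and it assumes that both sources write the date-time stamp in exactly the same string form.

```python
# Hypothetical sample records; titles, stamps, and URIs are illustrative only.
flickr_pics = [
    {"title": "Stairwell", "date_taken": "2011-10-02T09:15:30"},
    {"title": "Bridge at dusk", "date_taken": "2011-09-20T18:41:07"},
]
disk_pics = [
    {"uri": "http://example.org/pics/id/2011-09-20/IMG_0001.jpg",
     "date_time_original": "2011-09-20T18:41:07"},
]


def match_by_timestamp(flickr_pics, disk_pics):
    """Pair each flickr record with the disk file sharing its date-time stamp.

    Returns (title, stamp, disk_uri_or_None) tuples ordered by stamp.
    """
    # Index disk files by stamp. This assumes no two pictures share a stamp;
    # if they did, only the last one indexed would be kept.
    uri_by_stamp = {d["date_time_original"]: d["uri"] for d in disk_pics}

    rows = []
    for pic in sorted(flickr_pics, key=lambda p: p["date_taken"]):
        stamp = pic["date_taken"]
        # dict.get returns None when nothing matches, much like an unbound
        # ?diskPic when the OPTIONAL pattern finds no disk file.
        rows.append((pic["title"], stamp, uri_by_stamp.get(stamp)))
    return rows


for title, stamp, uri in match_by_timestamp(flickr_pics, disk_pics):
    print(title, stamp, uri or "(no local match)")
```

The dictionary lookup plays the role of the OPTIONAL pattern: a picture with no local match still shows up in the output, with None where the disk URI would be, so the troublesome ones are easy to spot.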

Finding those URIs was handy for gathering up local copies of pictures from a given set. Other queries could retrieve the title, description, and other data associated with any set of flickr pictures and show the disk files that they went with.
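
Going from those URIs back to files on disk is mostly string manipulation. The following Python sketch assumes the URI scheme shown in step 2, where everything after http://www.snee.com/bob/pics/id/ mirrors the path below the pictures folder. The function names and the pictures_root and dest parameters are hypothetical, chosen only for this illustration.

```python
import shutil
from pathlib import Path
from typing import Iterable, List

# Prefix taken from the example URI in step 2; the rest of the URI is assumed
# to mirror the file's path below the pictures folder.
URI_BASE = "http://www.snee.com/bob/pics/id/"


def local_path_for(uri: str, pictures_root: Path) -> Path:
    """Map a picture URI to the corresponding file under pictures_root."""
    if not uri.startswith(URI_BASE):
        raise ValueError(f"unexpected URI: {uri}")
    relative = uri[len(URI_BASE):]  # e.g. "2012-01-12/IMG_5907.jpg"
    return pictures_root.joinpath(*relative.split("/"))


def gather_local_copies(uris: Iterable[str], pictures_root: Path,
                        dest: Path) -> List[Path]:
    """Copy the files behind the given URIs into dest; skip missing files.

    Files are copied flat into dest, so two pictures with the same file name
    in different date folders would overwrite each other.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for uri in uris:
        src = local_path_for(uri, pictures_root)
        if src.is_file():
            # copy2 keeps file timestamps and returns the new file's path.
            copied.append(Path(shutil.copy2(src, dest)))
    return copied


example = local_path_for(URI_BASE + "2012-01-12/IMG_5907.jpg", Path("My Pictures"))
print(example)
```

Feeding gather_local_copies the ?diskPic values from the query results would collect one set's pictures into a single folder, ready to upload to a printing service.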

The whole thing was a nice example of how triples and SPARQL can make quick-and-dirty data integration easy even when the data in question isn’t necessarily stored as triples. As an added bonus, the metadata remains meaningful even if I stop paying my subscription fee to flickr and lose access to metadata for all but 200 pictures, which is what happens when you scale back to a free flickr account.