By Parker Phinney
This spring, I worked as a Dartmouth Digital Studies Fellow on the Usable Images project and the MetadataGames project.
The Problem
===========
Several United States federal agencies maintain repositories of public domain images. These repositories hold thousands of images with practical uses. For example, a health sciences professor creating a textbook to share for free online is greatly helped by the public domain (free of copyright restrictions) images available at no cost on the Centers for Disease Control and Prevention’s website, which range from photographs of disease symptoms to microscope slides and beyond.
However, these various US government public domain image repository websites are isolated silos. There is no easy way to search across all of them, and many users will not think to visit these websites when they are looking for public domain images. The images are a valuable but under-utilized resource.
The Project
===========
The aim of Usable Images is to make images from US federal government agencies easier to find and use for various purposes. In particular, this includes:
** Cross-posting the images to places where people commonly look for public domain images
I’m currently working with the Wikipedia community to ensure that Wikimedia Commons contains an up-to-date mirror (updated nightly) of the images from as many federal agency websites as possible.
** Providing an up-to-date mirror with semantic markup
Images.freeculture.org provides an up-to-date mirror of the images and metadata. The mirror uses semantic markup (RDFa) to express the metadata in a way that makes search engines more likely to return the images in the right cases, including when a user searches for images by license.
** Making the image metadata accessible through a machine interface
My mirror website makes the data available on demand through a simple JSON API, as well as a SQL dump of the whole dataset.
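To give a sense of what using the API looks like, here is a minimal client sketch. The endpoint path and field names below are hypothetical, not the site’s actual routes; check images.freeculture.org for the real ones.

```python
# A minimal sketch of a client for the mirror's JSON API.
# The endpoint path and field names are hypothetical, not the
# site's actual routes; see images.freeculture.org for those.
import json
import urllib.error
import urllib.request

API_URL = "http://images.freeculture.org/api/images/{id}"  # hypothetical route

def fetch_metadata(image_id):
    """Return the metadata dict for one image, or None on an HTTP error."""
    try:
        with urllib.request.urlopen(API_URL.format(id=image_id)) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError:
        return None

print(fetch_metadata(1234))
```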
I’m also working with the Metadata Games project, which makes it easy for the crowd to add metadata tags to images by playing games. My focus will be on making it easy to add an image set to Metadata Games and to dynamically retrieve the resulting tags in a machine-readable format.
The Python Library
==================
Additionally, the Usable Images project builds and maintains the usable_image_scraper Python library, which provides easy access to the functionality above. The goal of the library is to require the fewest possible lines of code to add a new federal agency repository website, providing a machine interface to the metadata, cross-posting, mirror website hosting (including semantic markup and a simple API), and extensive tests right out of the box.
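To give a flavor of that goal (and this is purely illustrative, not the library’s actual API), a per-repository scraper might amount to little more than a URL template and a parsing function:

```python
# Illustrative sketch only: NOT the real usable_image_scraper API.
# It shows the shape a per-repository scraper might take: a URL
# template for detail pages plus a function that parses one page.

class RepoScraper(object):
    slug = None        # short name for the repository, e.g. "fws"
    detail_url = None  # template for one image's detail page

    def parse_detail_page(self, html):
        """Extract a metadata dict from one detail page's HTML."""
        raise NotImplementedError

class ExampleScraper(RepoScraper):
    slug = "example"
    detail_url = "http://images.example.gov/detail?id={id}"  # hypothetical

    def parse_detail_page(self, html):
        # Real code would parse the page with lxml or BeautifulSoup.
        return {"title": "placeholder", "license": "public domain"}
```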
The usable_image_scraper library is licensed under the GPLv3 or later. Contributions are greatly appreciated! If you have some Python web scraping experience, adding an image repository website can be done in an afternoon, especially if you use the tests I’ve written to guide your development. Email me for more info on getting involved!
Additionally, as part of my Digital Studies fellowship at Dartmouth College, I made several contributions to the Metadata Games project, including adding a simple API that allows image tags generated by Metadata Games to be easily added to other websites (such as an image search engine or reference page). I’m currently in the process of uploading my images to the Usable Images project website. Once that’s done, augmenting the image metadata maintained by the Usable Images project with crowd-generated tags from Metadata Games will be trivial.
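When that time comes, the augmentation step really is just a merge. Here is a sketch of what I mean; `get_tags` stands in for a call to the Metadata Games tag API, and the names are mine, not the API’s.

```python
# Sketch: folding crowd-generated tags into a mirror's metadata record.
# get_tags() stands in for a call to the Metadata Games tag API.

def get_tags(image_id):
    raise NotImplementedError  # would hit the Metadata Games API

def augment(record):
    """Merge crowd tags into a metadata dict, avoiding duplicates."""
    existing = set(record.get("tags", []))
    crowd = set(get_tags(record["id"]))
    record["tags"] = sorted(existing | crowd)
    return record
```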
Technical Difficulties and Lessons Learned
==========================================
The largest difficulties encountered in this project stem from the fact that my data collection relied on web scraping.
One unexpected difficulty was the Fish and Wildlife Service’s extensive list of metadata fields. By default, fields with no data were not displayed, and fields either did or did not contain data seemingly at random (for example, some images have a “language” field set to English, and some don’t have that field at all). I thought I would have to write a separate scraper just to compile the full list of metadata fields to build into my database, but luckily an email request to the Fish and Wildlife Service was quickly answered with the information I needed.
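With the full field list in hand, the fix is simple: treat each scraped record as a sparse dict and normalize it against the canonical column list. A minimal sketch (the field names here are examples, not the FWS’s actual fields):

```python
# Sketch: normalizing sparse scraped metadata against a known field list.
# FIELDS would come from the full list the FWS provided; these are examples.
FIELDS = ["id", "title", "description", "language", "photographer"]

def normalize(record):
    """Return a row with every known column, using None for absent fields."""
    return {field: record.get(field) for field in FIELDS}

# An image that happens to lack the "language" field still yields a
# complete, consistently shaped row:
print(normalize({"id": 42, "title": "Bald eagle"}))
```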
In general, a difficulty with using web scraping to extract a dataset from a website is that it’s easy to write a scraper that is more complicated and/or fragile than necessary. For example, our first iteration of the scraper for the Centers for Disease Control and Prevention’s Public Health Image Library used a complicated process to fake form requests and step through multiple pages in order to impersonate a user and avoid session errors. However, we later learned that the CDC’s PHIL in fact has permalinks with no session requirements. These were not documented anywhere on the website, but some googling turned them up after we had already built our over-complicated scraper.
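In code, the difference is dramatic: the permalink version of the scraper collapses to one stateless GET per image. The URL pattern below is made up for illustration, not PHIL’s actual permalink format.

```python
# Sketch: scraping via permalinks instead of faking a session.
# The URL pattern is illustrative; PHIL's real permalink format differs.
import urllib.request

PERMALINK = "http://phil.example.gov/details?pid={id}"  # hypothetical

def fetch_detail_page(image_id):
    """One stateless GET per image: no cookies, no form replay."""
    with urllib.request.urlopen(PERMALINK.format(id=image_id)) as resp:
        return resp.read()
```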
I’ve developed a mental checklist of things to look for when designing a scraper: RSS feeds (even undocumented ones; Google can be used to dig them up), permalinks (same story), META tags, and of course existing scrapers written by members of the Wikipedia community and others.
Another thing that I learned, cliche as it is, was to ask for help. Some of the organizations and websites that I reached out to never got back to me, but the Fish and Wildlife Service was great about pointing me in the right direction with their website. For example, one feature each of my scrapers needs is a way to determine the highest identifying number (ID) in the image database. This is checked against the highest ID in my local mirror in order to figure out which new IDs need to be scraped, if any. For the CDC and FEMA repositories, one way to accomplish this is to issue a blank search and grab the first result. However, the FWS doesn’t return results in decreasing order of ID. I planned a complicated solution where the scraper would make a reasonable guess at the highest ID and use binary search to zero in on the correct answer (a fragile technique, because large swaths of IDs are missing from the FWS image repository). However, I was delighted to learn over email that a developer at the FWS was building a prototype RSS feed that I could easily use instead.
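For the curious, that binary-search fallback would have looked roughly like the sketch below, where `id_exists` stands in for an HTTP probe of a detail page. The gaps in the ID space are exactly what make it fragile.

```python
# Sketch: probing a remote repository for its highest image ID.
# id_exists() stands in for an HTTP request (e.g. checking whether
# a detail page returns 200). Because swaths of IDs are missing,
# id_exists() can be False below the true maximum, so this search
# can undershoot: that is the fragility mentioned above.

def id_exists(image_id):
    raise NotImplementedError  # would issue an HTTP request

def highest_id(initial_guess=1000):
    lo, hi = 0, initial_guess
    # Double until we overshoot: find an ID that doesn't exist.
    while id_exists(hi):
        lo, hi = hi, hi * 2
    # Invariant: ID lo exists (or is 0), ID hi does not.
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if id_exists(mid):
            lo = mid
        else:
            hi = mid
    return lo
```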
This example of collaboration is an important reminder that web scraping isn’t necessarily subversive. One great thing about building a scraper to extract a dataset is that the providers of the dataset don’t need to devote resources to developing an API. As long as servers aren’t crushed by scrapers, everybody wins, especially when data providers are enthusiastic about helping scrapers spread and augment their data.
A final difficulty is that cross-posting often requires certainty. For example, I’ve long talked about uploading a zip of my dataset to archive.org. However, small bits of self-consciousness keep stopping me. What if I discover another error in one of my scrapers and learn that a large chunk of my metadata is poorly formed? I want to upload to archive.org, but I just want to finish this /one last feature/ first, I keep telling myself. A similar difficulty arises with cross-posting to Wikimedia Commons: if I discover that part of my dataset has a flaw, or that I have some great additional data to add, it’s bad form to recklessly re-run my upload script, because it may overwrite updates made by members of the community. Flickr Commons is (I /think/, because of their API) an example of a website I could cross-post to in a way that would let me easily run a script later to update the image metadata. However, they haven’t responded to my emails.
Credit
======
The Metadata Games project is a research project by Professor Mary Flanagan at the Tiltfactor Lab, Dartmouth College. The Usable Images project is a continuation of the “Release Our Data” project, started by Seth Woodworth and me in winter 2010. I’ve had indispensable advice from academics and library scientists, Free Culture community members, Wikipedia community members, and beyond. In particular, thanks to Professor Flanagan for advising my Digital Studies Fellowship, Sergey Bratus for advising an independent study on this project last summer, and Asheesh Laroia of OpenHatch, Seth Woodworth of OneVille, Aaron Swartz of Demand Progress, Mike Linksvayer of Creative Commons, Jeremy Baron and Maarten Dammers of the Wikipedia community, and others for guidance and help at various points along the way. I’m not done yet!