The Bigger Picture: Visual Archives and the Smithsonian
- How photos from the Smithsonian’s Archives of American Gardens help preserve the memory of gardens (such as the Middlegate Japanese Gardens pictured above) that are now gone.
- The Museum of the Future has a great roundup of videos and blogs about museums, technology, and media.
- An update on earthquake damage at the Smithsonian, along with remarks on the earthquake from Smithsonian Secretary (and earthquake expert) Wayne Clough.
- And an update on the Smithsonian’s Haiti Cultural Recovery Project, in which Smithsonian experts are helping to restore Haitian artwork, artifacts, documents, media and architecture that were damaged in the earthquake there.
- A very interesting post from the American Social History Project Blog looking at how photographs were distorted in their translation to engravings used to illustrate 19th-century newspapers, demonstrating “the discrepancies between photographs and their adaptations into mass-produced formats.” (In this case, the translation of a Civil War photo to a newspaper illustration poignantly expresses racial biases of the time period.)
- How an important letter written by President Lincoln after the Battle of Antietam, and then stolen from the War Department records, was recently returned to the National Archives:
"Missing Lincoln Documents returned to National Archives," The letter and Lincoln's endorsement had apparently been removed from Edwards' Commission Branch file at some unknown time in the past, perhaps when the records were still in the custody of the War Department. Bill Panagopulos of Alexander Auctions, Inc., when informed the documents were part of a file at the National Archives, agreed to return them. Courtesy of the National Archives YouTube Channel.
This post is an update to Lynda Schmitz Fuhrig's post “Archiving the Smithsonian’s Presence on the Internet” from September 2, 2010.
The Smithsonian Institution has had a presence on the Internet for more than sixteen years. It’s come a long way since then. Documenting the Smithsonian’s various websites falls under the purview of the Smithsonian Institution Archives...but how do we do it?
As a web preservation intern at the Archives this summer, I’ve helped to develop the workflow for preserving Smithsonian-affiliated web content. Our goal is to take an annual “snapshot” of all Smithsonian public websites to be kept in the Archives.
Why do we preserve websites?
Institutional websites are important to preserve because they are:
- records of institutional activity;
- publications exposed in the public sphere; and
- artifacts of historical and heritage value.
(Adapted from PoWR: Preservation of Web Resources Handbook, JISC, 2008.)
How do we preserve websites?
While each unit or office within the Smithsonian maintains and backs up the web content they create, the best way for the Archives to get a comprehensive snapshot of all the websites as they appear online is to use a web crawler. Crawlers, or spiders, are programs that browse the Internet by following trails of links, typically to index or save the content they encounter.
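To make the idea of "following trails of links" concrete, here is a minimal sketch of a breadth-first crawler in Python. It is not how Heritrix works internally; the three-page site, the `example.org` URLs, and the `crawl` and `LinkExtractor` names are all hypothetical, and the pages live in a dictionary so the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl. `fetch` maps a URL to HTML text, or None if missing."""
    seen = {start_url}
    queue = deque([start_url])
    saved = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to save or follow
        saved[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return saved


# Hypothetical three-page site held in memory (an assumption for illustration).
SITE = {
    "http://example.org/": '<a href="/about.html">About</a> <a href="news.html">News</a>',
    "http://example.org/about.html": '<a href="/">Home</a>',
    "http://example.org/news.html": '<a href="/about.html">About</a> <a href="/missing.html">Gone</a>',
}

captured = crawl("http://example.org/", SITE.get)
print(sorted(captured))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, which they almost always do.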
We use Heritrix, open-source crawling software developed by the Internet Archive, to conduct focused captures of individual websites according to our specifications and schedule. Heritrix bundles all the web content it crawls into .WARC files, an archival file format.
We need special software to view the content of the WARC files and perform a quality control check to make sure everything looks right and nothing is missing. We’re using the Wayback application, also developed by the Internet Archive. The local application looks and acts just like the Wayback Machine online. Once we’re satisfied with the captured website, we accession the WARC files and they’re officially part of the Archives’ holdings.
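As a rough illustration of the "nothing is missing" part of that check, the sketch below reads the headers of uncompressed WARC records and compares the captured URLs against a list of pages we expected to find. This is not the Wayback application or our actual workflow. The sample records, the `example.org` URLs, and the helper names are hypothetical, and the sample omits headers that real WARC records carry (such as a record ID and date). Real crawl output is also often gzip-compressed, which this sketch does not handle.

```python
def build_record(warc_type, uri, payload):
    """Build one simplified, uncompressed WARC record (sample data only)."""
    headers = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLF pairs after its content block.
    return headers + payload + b"\r\n\r\n"


def iter_records(data):
    """Yield (header_fields, content_block) for each record in `data`."""
    pos = 0
    while pos < len(data):
        header_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        fields = {}
        for line in lines[1:]:  # lines[0] is the "WARC/1.0" version line
            name, _, value = line.partition(":")
            fields[name.strip()] = value.strip()
        length = int(fields["Content-Length"])
        body_start = header_end + 4
        yield fields, data[body_start:body_start + length]
        pos = body_start + length + 4  # skip the record's trailing CRLF CRLF


# Hypothetical capture containing two of the three pages we expected.
sample = (
    build_record("response", "http://example.org/", b"<html>home</html>")
    + build_record("response", "http://example.org/about.html", b"<html>about</html>")
)

captured_urls = {
    fields["WARC-Target-URI"]
    for fields, _ in iter_records(sample)
    if fields.get("WARC-Type") == "response"
}
expected_urls = {
    "http://example.org/",
    "http://example.org/about.html",
    "http://example.org/news.html",
}
missing = sorted(expected_urls - captured_urls)
print("Missing from capture:", missing)
```

A check like this only confirms that a URL was recorded. Whether the page "looks right" still takes a person viewing it in a replay tool.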
Future researchers will also have to use Wayback or other WARC-reading software to view preserved web collections. They might be interested in the content of web-published news releases, the structure of the Smithsonian’s extensive online image collections, or what was deemed worthy of a blog post (!).
The road to web preservation is not without a few bumps. A few issues we’ve encountered are:
- Estimating the size of a site. Seemingly small, innocuous websites can actually contain many thousands of documents. One of the largest single crawls so far was the website of the National Museum of Natural History’s Botany department, which took 49 hours and 57 minutes to capture 78,922 files. To budget our time, we need to estimate how big a website is, and we use specific software tools like link validation programs to do that.
- Deciding what external content to capture. How do you tell a web crawler that you want it to follow a link in a blog post to a useful article elsewhere on the Smithsonian website, but not to follow a link to a spam site in the comments? For blogs, we configure Heritrix to accept embedded off-domain content, like photos from Flickr, but not to scrape linked off-domain sites. For non-blog Smithsonian sites, we don’t capture any off-domain content at all. In both instances, we can also specify any URL patterns that are acceptable.
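The scoping rules described above can be written down as a small decision function. The sketch below is only a plain-Python restatement of those rules, not Heritrix configuration. The `in_scope` name, its parameters, and the `example.*` hostnames are hypothetical, and it assumes the crawler can tell whether a URL was embedded in a page (like an image) or merely linked from it.

```python
import re
from urllib.parse import urlparse


def in_scope(url, site_host, is_blog, embedded, allowed_patterns=()):
    """Decide whether a discovered URL should be captured.

    url              -- the URL the crawler found
    site_host        -- hostname of the site being crawled
    is_blog          -- True if the site is a blog
    embedded         -- True if the URL is embedded content (e.g. an image),
                        False if it is an ordinary link
    allowed_patterns -- regular expressions for URLs that are always acceptable
    """
    host = urlparse(url).hostname or ""

    # On-domain content is always captured.
    if host == site_host or host.endswith("." + site_host):
        return True

    # Explicitly accepted URL patterns apply to blogs and non-blogs alike.
    if any(re.search(pattern, url) for pattern in allowed_patterns):
        return True

    # Off-domain: blogs accept embedded content only; other sites accept none.
    return is_blog and embedded


blog = "blog.example.org"
print(in_scope("http://photos.example.com/img.jpg", blog, True, embedded=True))
print(in_scope("http://spam.example.net/", blog, True, embedded=False))
```

An accepted pattern is one way to handle the case of a blog post linking to a useful article elsewhere on the institution's site: the article's URL prefix goes on the list, while links in the comments to unrelated sites stay out of scope.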
We’re still learning how best to use these tools to fit the needs of the Archives, and in the past two months, we’ve made a lot of progress:
- 114 crawls performed
- 541 hours of crawling
- 684,264 pieces of content captured (includes HTML pages, JPEG images, MP3 audio, etc.)
That means that so far, we’ve reached about two-thirds of this year’s snapshot goal.
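For a sense of scale, the figures in this post can be turned into a few averages. The short Python sketch below uses only numbers quoted here. The final estimate of remaining crawl time is an assumption: it supposes that progress toward the goal is proportional to hours of crawling, which may not hold if the goal is counted in websites rather than hours or files.

```python
# Totals reported for the past two months.
crawls = 114
crawl_hours = 541
items = 684_264

items_per_crawl = items / crawls
items_per_hour = items / crawl_hours
hours_per_crawl = crawl_hours / crawls

# The Botany department crawl: 49 hours 57 minutes for 78,922 files.
botany_hours = 49 + 57 / 60
botany_rate = 78_922 / botany_hours

# ASSUMPTION: two-thirds of the goal corresponds to two-thirds of the hours.
fraction_done = 2 / 3
estimated_total_hours = crawl_hours / fraction_done
remaining_hours = estimated_total_hours - crawl_hours

print(f"Average items per crawl: {items_per_crawl:,.0f}")
print(f"Average items per hour:  {items_per_hour:,.0f}")
print(f"Average hours per crawl: {hours_per_crawl:.1f}")
print(f"Botany crawl rate:       {botany_rate:,.0f} files per hour")
print(f"Estimated hours left:    {remaining_hours:.1f} (under the assumption above)")
```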
If you happen to follow the Smithsonian’s Flickr Commons stream very closely, you may have noticed that two new sets of photos were uploaded last week: a set from the Pacific Ocean Biological Survey Program, as well as a set of Field Book Lantern Slides.
While the name may sound dry, the biological survey photos, as you can see above, are full of strikingly beautiful gems—abstract patterns of frigates fluttering across the horizon off the coast of the Phoenix Islands, and elegantly curved bird profiles. The photos document a biological survey of plants and animals of the Pacific completed by Smithsonian employees during the 1960s and 70s.
And the Field Book Lantern Slides above are a series of image slides used by researchers to present their work to colleagues and the general public. They include some especially colorful slides documenting the Smithsonian-Roosevelt African Expedition of 1909 (and the “specimens” it collected), as well as an incredible series of early 20th-century slides showing the preparation and installation of dinosaur and mammal specimens from the Smithsonian’s Division of Vertebrate Paleontology.
Both sets of photos come from our collections at the Archives and are part of the Field Book Project, a joint venture of the National Museum of Natural History and us, the Smithsonian Institution Archives, to create one online location for scholars and others to search for field books and other field research materials. Summer interns for the Field Book Project curated both sets and write in detail about their content on the Field Book Blog. Read more in their post, “On Land and at Sea: Two Intern Flickr Sets on The Commons.” You can follow the progress of the project on the Field Book Blog.