While interning at the Smithsonian Institution Archives this summer, I have had the opportunity to participate in archiving and preserving the Smithsonian's web presence. As a current Master's student specializing in Archives and Records Management and Preservation of Information, I could not resist the prospect of learning about an area of archiving completely foreign to me: web archiving. This endeavor has been both educational and challenging, and I would like to share some of the issues with web archiving that I have encountered.
Previous blog posts explained how websites are preserved in much greater detail than I will here, but some of that information is worth recounting. We use an open-source program called Heritrix to "crawl" websites, which means that we capture how a site looks and functions on the day of the crawl. After the capture is complete, we view the files in the Wayback Machine to make sure that A) we have captured an accurate visual representation of the actual "live" website and B) we can navigate the crawled site in the same way that we would the live site. This all seems easy enough, right? I thought so until I began viewing some of the captured websites in the Wayback Machine. Heritrix and the Wayback Machine capture and display most of our sites wonderfully, but as I began to crawl and assess more sites, I noticed patterns of missing content on websites that contained a lot of interactive features and complex design layouts.
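To give a rough sense of what checking a capture for missing content involves, here is a minimal Python sketch. It is not the tool or workflow we actually use; it only assumes that you have the HTML of a live page and a list of the URLs that a crawl captured. The names `ResourceCollector`, `referenced_urls`, and `missing_from_capture` are made up for this illustration, and the `example.org` addresses in the usage note are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collects the URLs that one HTML page links to or embeds."""

    # Hypothetical, deliberately short map of tag -> attribute holding a URL.
    URL_ATTRS = {
        "a": "href",
        "link": "href",
        "img": "src",
        "script": "src",
        "iframe": "src",
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative paths against the page's own address.
                self.urls.add(urljoin(self.base_url, value))


def referenced_urls(html, base_url):
    """Return the set of absolute URLs referenced in the page's HTML."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.urls


def missing_from_capture(live_html, base_url, captured_urls):
    """Return the referenced URLs that do not appear in the capture list."""
    return sorted(referenced_urls(live_html, base_url) - set(captured_urls))
```

For example, calling `missing_from_capture(page_html, "http://example.org/m/", captured)` returns a sorted list of the addresses that the page refers to but the capture lacks. A check like this only sees URLs written directly in the HTML; anything a script loads after the page opens will not show up, so a visual review of the captured site is still needed.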
An example of a problematic site is the Smithsonian's Mobile website:
- Saving the Smithsonian’s Web, The Bigger Picture Blog, Smithsonian Institution Archives
- Archiving the Smithsonian’s Presence on the Internet, The Bigger Picture Blog, Smithsonian Institution Archives