The Bigger Picture: Visual Archives and the Smithsonian
Connecting the Dots: Issues with Preserving Complex Websites
While interning at the Smithsonian Institution Archives this summer, I have had the opportunity to participate in archiving and preserving the Smithsonian's web presence. As a current Master's student specializing in Archives and Record Management and Preservation of Information, I could not resist the prospect of learning an area of archiving completely foreign to me: web archiving. This endeavor has been both educational and challenging, and I would like to share some of the issues with web archiving that I have encountered.
In previous blog posts, we explained how websites are preserved in much greater detail than I will here, but some of that information bears repeating. We use an open-source program called Heritrix to "crawl" websites, which means that we capture how a site looks and functions on the day that we perform the crawl. After the capture is complete, we view the files in the Wayback Machine to make sure that A) we have captured an accurate visual representation of the actual "live" website and B) we can navigate the crawled site in the same way that we would the live site. This all seems easy enough, right? I thought so until I began viewing some of the captured websites in the Wayback Machine. Heritrix and the Wayback Machine capture and display most of our sites wonderfully, but as I crawled and assessed more sites, I noticed patterns of missing content on websites that contained a lot of interactive features and complex design layouts.
I began researching the problematic sites by looking at their source code, and it became clear that sites containing a lot of JavaScript and Adobe Flash content do not capture properly. Flash can be problematic during playback, but JavaScript is even more frustrating because sometimes the content captures and other times it does not. I began to wonder if this was merely a configuration issue in Heritrix or possibly the Wayback Machine, but apparently this problem is plaguing web archivists everywhere.
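For anyone who wants to repeat this kind of source-code check without reading every page by hand, here is a minimal sketch in Python. It is not part of our workflow, only an illustration: it counts script tags and Flash embeds in a page's HTML as a rough hint that the page may not play back well. The class name, the function name, and the sample markup are all made up for this example, and the Flash test (a `.swf` file name or the Flash MIME type on an `object` or `embed` tag) is a simplification that will miss some pages.

```python
from html.parser import HTMLParser


class InteractiveContentCounter(HTMLParser):
    """Counts tags that often signal JavaScript or Flash content."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.flash_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        elif tag in ("object", "embed"):
            # Simplified heuristic: look for the Flash MIME type or a
            # .swf file name in any attribute value of the tag itself.
            values = " ".join(v for _, v in attrs if v).lower()
            if "shockwave-flash" in values or ".swf" in values:
                self.flash_tags += 1


def count_interactive_content(html_text):
    """Return (number of script tags, number of Flash tags)."""
    counter = InteractiveContentCounter()
    counter.feed(html_text)
    return counter.script_tags, counter.flash_tags


# Made-up sample page, for illustration only.
sample = """
<html><head>
<script src="app.js"></script>
<script>var x = 1;</script>
</head><body>
<embed src="intro.swf" type="application/x-shockwave-flash">
<p>Hello</p>
</body></html>
"""

print(count_interactive_content(sample))  # (2, 1)
```

A high count does not prove that a capture will fail, and a low one does not guarantee success, but it can help decide which captures to inspect first.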
An example of a problematic site is the Smithsonian's Mobile website:

The Smithsonian's Mobile website contains a large amount of JavaScript in the source code:

This is the result of the JavaScript-heavy site crawled and viewed in the Wayback Machine:

After spending some time reading articles and blogs about Heritrix, it seems that there is no specific way to capture this content. The Heritrix configuration settings allow us to capture minimal JavaScript and Flash content, and for some sites this is enough. For others, however, a significant amount of content is missing, and as of now, there is nothing we can do about it. The most interesting (and frustrating) part of this dilemma is that if you check the Heritrix content logs, all of the JavaScript and Flash files are there, but they will not link up to produce working content. In other words, all the "dots" are there, but they can't be connected.
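To show what checking the logs for those "dots" might look like, here is a rough Python sketch that tallies JavaScript and Flash fetches in a crawl log. It rests on assumptions: that each log line is whitespace-separated, with the fetch status in the second column, the URI in the fourth, and the MIME type in the seventh. The sample lines are invented for illustration and are not from a real Smithsonian crawl, and the function name is hypothetical. Check the column positions against your own Heritrix version before relying on them.

```python
from collections import Counter

# Assumed 0-based column positions in a whitespace-separated log line.
STATUS_FIELD = 1
MIME_FIELD = 6

# Substrings that mark a MIME type as JavaScript or Flash.
INTERACTIVE_MIME_HINTS = ("javascript", "shockwave-flash")


def summarize_interactive_fetches(log_lines):
    """Count fetched JavaScript/Flash resources by (MIME type, status)."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) <= MIME_FIELD:
            continue  # skip blank or malformed lines
        mime = fields[MIME_FIELD].lower()
        if any(hint in mime for hint in INTERACTIVE_MIME_HINTS):
            counts[(mime, fields[STATUS_FIELD])] += 1
    return counts


# Invented sample lines, for illustration only.
sample_log = [
    "2013-05-01T12:00:00.000Z 200 5120 http://example.org/app.js E "
    "http://example.org/ application/javascript #001 - - - -",
    "2013-05-01T12:00:01.000Z 200 90210 http://example.org/intro.swf E "
    "http://example.org/ application/x-shockwave-flash #002 - - - -",
    "2013-05-01T12:00:02.000Z 200 2048 http://example.org/ - "
    "- text/html #003 - - - -",
]

for (mime, status), n in sorted(summarize_interactive_fetches(sample_log).items()):
    print(mime, status, n)
```

A tally like this only confirms that the files were fetched. It says nothing about whether they will work together during playback, which is exactly the gap described above.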
The online exhibition for Roberto Clemente is an example of a site with a small amount of JavaScript and Flash; this is how it appears on the Internet:
Here is the same site as it appears in the Wayback Machine. The crawl was successful because the site has minimal JavaScript and Flash content:

Is all hope lost for these websites? Luckily, no! Web developers and web archivists acknowledge that this is a significant problem and are currently working toward developing programs that are specifically designed to capture JavaScript and Flash content. We also hope some software and hardware upgrades will help, too. It's an exciting time to be a budding web archivist and I can't wait to see how technological advancements such as these enhance current web archiving practices.
Related Resources
- Saving the Smithsonian’s Web, The Bigger Picture Blog, Smithsonian Institution Archives
- Archiving the Smithsonian’s Presence on the Internet, The Bigger Picture Blog, Smithsonian Institution Archives
- Crawling JavaScript, Internet Archive
Comments (3)
Indeed, the fact that some sites were created several years ago is challenging. The Roberto Clemente website, for example, was published in 2005, when Flash was THE tool for bringing interactivity to the web.
The trick for web historians will be how to see and understand this kind of content in spite of the technology that's used to deliver it.
Robots exclusion standard: The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable.
http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
If a webpage you want to crawl excludes the Archive-It robot, you should try contacting the webmaster for the site, letting him or her know why you want to archive the site, and requesting that the Archive-It robot be allowed to crawl it.
The webmaster will need to know the name of the Archive-It robot (or crawler): It is archive.org_bot
The Heritrix web crawler is set to very polite settings for all Archive-It crawling and should not impact the site being crawled in any way.
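Before contacting a webmaster, it can help to confirm what the site's robots.txt actually says about the crawler. The following is a small sketch using Python's standard `urllib.robotparser` module. The two robots.txt texts and the example.org URL are made up for illustration, and `crawler_allowed` is a hypothetical helper name. Only the robot name, archive.org_bot, comes from the note above.

```python
from urllib.robotparser import RobotFileParser

ARCHIVE_BOT = "archive.org_bot"


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt that shuts the Archive-It robot out entirely.
blocking_rules = """\
User-agent: archive.org_bot
Disallow: /
"""

# Made-up robots.txt that lets the Archive-It robot in while keeping
# other robots out of one directory.
permissive_rules = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /private/
"""

page = "http://example.org/exhibit/"
print(crawler_allowed(blocking_rules, ARCHIVE_BOT, page))    # False
print(crawler_allowed(permissive_rules, ARCHIVE_BOT, page))  # True
```

The second set of rules is one example of what a webmaster could add to admit the robot. An empty Disallow line means nothing is off limits for that user agent.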
Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.