Connecting the Dots: Issues with Preserving Complex Websites

While interning at the Smithsonian Institution Archives this summer, I have had the opportunity to participate in archiving and preserving the Smithsonian's web presence. As a current Master's student specializing in Archives and Records Management and Preservation of Information, I could not resist the prospect of learning an area of archiving completely foreign to me: web archiving. This endeavor has been both educational and challenging, and I would like to share some of the issues with web archiving that I have encountered.

Previous blog posts describe how websites are preserved in much greater detail than I will here, but it is worth recounting the basics. We use an open-source program called Heritrix to "crawl" websites, which means that we capture how a site looks and functions on the day we perform the crawl. After the capture is complete, we view the files in the Wayback Machine to make sure that A) we have captured an accurate visual representation of the actual "live" website and B) we can navigate the crawled site in the same way that we would the live site. This all seems easy enough, right? I thought so until I began viewing some of the captured websites in the Wayback Machine. Heritrix and the Wayback Machine capture and display most of our sites wonderfully, but as I crawled and assessed more sites, I noticed a pattern: content was missing from websites with many interactive features and complex design layouts.

I began researching the problematic sites by looking at their source code, and it became clear that sites containing a lot of JavaScript and Adobe Flash content do not capture properly. Flash can be problematic during playback, but JavaScript is even more frustrating because sometimes the content captures and other times it does not. I began to wonder if this was merely a configuration issue in Heritrix or possibly the Wayback Machine, but apparently this problem is plaguing web archivists everywhere.
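Looking through source code by hand works for a few sites, but the same check can be roughed out in a script. The following is only a minimal sketch, not a tool used in the crawls described here: it counts `<script>` tags and Flash references in a page's HTML using Python's standard `html.parser` module. The function name, the sample markup, and the rule for spotting Flash (any `.swf` or `shockwave-flash` value on an `embed`, `object`, or `param` tag) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class InteractiveContentCounter(HTMLParser):
    """Counts <script> tags and tags that reference Flash content."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.flash_refs = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        elif tag in ("embed", "object", "param"):
            # Join every attribute value; attributes without a value are None.
            values = " ".join(value.lower() for _, value in attrs if value)
            if ".swf" in values or "shockwave-flash" in values:
                self.flash_refs += 1


def summarize_interactive_content(html):
    """Return counts of script tags and Flash references found in html."""
    counter = InteractiveContentCounter()
    counter.feed(html)
    counter.close()
    return {"script_tags": counter.script_tags, "flash_refs": counter.flash_refs}


# Hypothetical page source, made up for this example.
SAMPLE_HTML = (
    '<html><head><script src="a.js"></script><script>var x = 1;</script></head>'
    '<body><object type="application/x-shockwave-flash" data="m.swf">'
    '<param name="movie" value="m.swf"></object><p>hi</p></body></html>'
)

print(summarize_interactive_content(SAMPLE_HTML))
```

A high count does not prove that a site will capture badly; it only flags pages worth checking closely in the Wayback Machine after the crawl. Note that one Flash movie may be counted more than once when it is referenced by several nested tags.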

An example of a problematic site is the Smithsonian's Mobile website:

A screenshot of the Smithsonian's mobile website as it appears on the Internet.

The Smithsonian's Mobile website contains a large amount of JavaScript in the source code:

A screenshot of the Smithsonian mobile site's source code displaying a large amount of JavaScript co

This is how the JavaScript-heavy site looks after being crawled and viewed in the Wayback Machine:

A screenshot of the Smithsonian mobile website as it appears in the Wayback Machine.

After spending some time reading articles and blogs about Heritrix, I have not found any reliable way to capture this content. The Heritrix configuration settings allow us to capture minimal JavaScript and Flash content, and for some sites this is enough. For others, however, a significant amount of content is missing, and as of now there is nothing we can do about it. The most interesting (and frustrating) part of this dilemma is that if you check the Heritrix content logs, all of the JavaScript and Flash files are there, but they do not link up to produce working content. In other words, all the "dots" are there, but they can't be connected.
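To illustrate what "the dots are there" means in practice, here is a minimal sketch that lists the JavaScript and Flash files a crawl fetched successfully. It assumes the logs in question are Heritrix's `crawl.log` and that each line is whitespace-separated, with the fetch status in the second field, the URI in the fourth, and the MIME type in the seventh; check that layout against your own Heritrix version before relying on it. The sample log lines, the `example.org` URLs, and the function name are all made up for this example.

```python
from collections import defaultdict

# MIME types treated as JavaScript or Flash for this sketch.
SCRIPT_AND_FLASH_TYPES = {
    "application/javascript",
    "application/x-javascript",
    "text/javascript",
    "application/x-shockwave-flash",
}

# Hypothetical log lines written in the assumed crawl.log layout.
SAMPLE_LOG = [
    "2012-07-10T14:02:10.001Z 200 10342 http://example.org/ - - text/html #001 20120710140209900+80 sha1:AAAA - -",
    "2012-07-10T14:02:11.123Z 200 48211 http://example.org/js/app.js LE http://example.org/ application/javascript #003 20120710140211001+95 sha1:BBBB - -",
    "2012-07-10T14:02:12.456Z 200 90511 http://example.org/media/intro.swf LE http://example.org/ application/x-shockwave-flash #002 20120710140212300+120 sha1:CCCC - -",
    "2012-07-10T14:02:13.789Z 404 512 http://example.org/js/missing.js LE http://example.org/ text/javascript #004 20120710140213700+40 sha1:DDDD - -",
]


def captured_interactive_uris(log_lines):
    """Group successfully fetched JavaScript and Flash URIs by MIME type."""
    found = defaultdict(list)
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        status, uri = fields[1], fields[3]
        mime = fields[6].split(";")[0].lower()  # drop any charset parameter
        if status == "200" and mime in SCRIPT_AND_FLASH_TYPES:
            found[mime].append(uri)
    return dict(found)


for mime_type, uris in captured_interactive_uris(SAMPLE_LOG).items():
    print(mime_type, uris)
```

A list like this confirms only that the files were fetched. It says nothing about whether the Wayback Machine can reassemble them into a working page during playback, which is exactly the gap described above.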

The online exhibition for Roberto Clemente is an example of a site with a small amount of JavaScript and Flash; this is how it appears on the Internet:

A screenshot of the Roberto Clemente online exhibit as it appears on the internet, with the mouse ho

Here is the same site as it appears on the Wayback Machine. The crawl was successful due to minimal JavaScript and Flash content:

A screenshot of the Roberto Clemente online exhibit as it appears on the Wayback Machine, with the m

Is all hope lost for these websites? Luckily, no! Web developers and web archivists acknowledge that this is a significant problem and are working on programs designed specifically to capture JavaScript and Flash content. We also hope that some software and hardware upgrades will help. It's an exciting time to be a budding web archivist, and I can't wait to see how technological advancements such as these enhance current web archiving practices.

Related Resources

Saving the Smithsonian’s Web, The Bigger Picture Blog, Smithsonian Institution Archives
Archiving the Smithsonian’s Presence on the Internet, The Bigger Picture Blog, Smithsonian Institution Archives
Crawling JavaScript, Internet Archive

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.
