Smithsonian Institution Archives

The Bigger Picture: Visual Archives and the Smithsonian

Connecting the Dots: Issues with Preserving Complex Websites

by Anne Leon, Intern, Digital Services Division on July 17, 2012

While interning at the Smithsonian Institution Archives this summer, I have had the opportunity to participate in archiving and preserving the Smithsonian's web presence. As a current Master's student specializing in Archives and Records Management and Preservation of Information, I could not resist the prospect of learning an area of archiving completely foreign to me: web archiving. This endeavor has been both educational and challenging, and I would like to share some of the web archiving issues that I have encountered.

Previous blog posts describe how websites are preserved in much greater detail than I will here, but some of that information is worth recounting. We use an open-source program called Heritrix to "crawl" websites, which means that we capture how a site looks and functions on the day of the crawl. After the capture is complete, we view the files in the Wayback Machine to make sure that A) we have captured an accurate visual representation of the actual "live" website and B) we can navigate the crawled site in the same way that we would the live site. This all seems easy enough, right? I thought so until I began viewing some of the captured websites in the Wayback Machine. Heritrix and the Wayback Machine capture and display most of our sites wonderfully, but as I crawled and assessed more sites, I noticed patterns of missing content on websites that contained a lot of interactive features and complex design layouts.
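Check B, navigating the crawled site like the live one, can be roughly approximated in code. The sketch below is only an illustration and not the Archives' actual workflow: it collects every URL a live page references and reports the ones absent from a set of captured URLs. The names `LinkCollector` and `missing_from_capture`, the `example.org` addresses, and the idea of having the captured URLs available as a simple set are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from href and src attributes on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # href covers <a> and <link>; src covers <script>, <img> and <embed>.
        for name in ("href", "src"):
            value = attrs.get(name)
            if value:
                self.urls.add(urljoin(self.base_url, value))


def missing_from_capture(live_html, base_url, captured_urls):
    """Return the URLs the live page references that the crawl did not capture."""
    collector = LinkCollector(base_url)
    collector.feed(live_html)
    return sorted(collector.urls - set(captured_urls))


# Hypothetical live page and hypothetical list of captured URLs.
LIVE_PAGE = (
    '<html><head><script src="js/menu.js"></script></head>'
    '<body><a href="about.html">About</a><img src="logo.png"></body></html>'
)
CAPTURED = {"http://example.org/about.html", "http://example.org/logo.png"}

if __name__ == "__main__":
    for url in missing_from_capture(LIVE_PAGE, "http://example.org/", CAPTURED):
        print("not captured:", url)
```

A check like this only says whether a referenced file was fetched. It cannot say whether the page will render or behave correctly on playback, which is why viewing the capture by eye remains necessary.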

I began researching the problematic sites by looking at their source code, and it became clear that sites containing a lot of JavaScript and Adobe Flash content do not capture properly. Flash can be problematic during playback, but JavaScript is even more frustrating because the content is captured in some cases and not in others. I began to wonder if this was merely a configuration issue in Heritrix or possibly the Wayback Machine, but apparently this problem is plaguing web archivists everywhere.
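Looking through page source by hand works for a few sites, but the same inspection can be sketched as a short script. The example below is a minimal sketch, not a tool used at the Archives: it counts `<script>` tags and Flash embeds in a page so that likely problem sites can be flagged before a crawl. The class name `DynamicContentCounter`, the function `summarize`, and the sample markup are hypothetical, and treating a `.swf` URL or the `application/x-shockwave-flash` type as the sign of Flash is a simplifying assumption.

```python
from html.parser import HTMLParser


class DynamicContentCounter(HTMLParser):
    """Counts <script> tags and Flash embeds in a page's source."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.flash_embeds = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self.script_tags += 1
        elif tag in ("object", "embed"):
            # Assumption: Flash is referenced by a .swf URL or the Flash MIME type.
            source = (attrs.get("data") or attrs.get("src") or "").lower()
            mime = (attrs.get("type") or "").lower()
            if source.endswith(".swf") or mime == "application/x-shockwave-flash":
                self.flash_embeds += 1


def summarize(html):
    """Return how many script tags and Flash embeds the markup contains."""
    counter = DynamicContentCounter()
    counter.feed(html)
    return {"scripts": counter.script_tags, "flash": counter.flash_embeds}


# Hypothetical page source, for illustration only.
SAMPLE_PAGE = (
    '<html><head><script src="a.js"></script><script>var x = 1;</script></head>'
    '<body><embed src="intro.swf">'
    '<object type="application/x-shockwave-flash" data="movie.swf"></object>'
    '<object data="doc.pdf"></object></body></html>'
)

if __name__ == "__main__":
    print(summarize(SAMPLE_PAGE))
```

A raw tag count is a blunt measure. One large script that builds the whole page can cause more capture trouble than a dozen small ones, so the numbers are best read as a prompt for closer inspection.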

An example of a problematic site is the Smithsonian's Mobile website:

A screenshot of the Smithsonian’s mobile website as it appears on the Internet.

The Smithsonian's Mobile website contains a large amount of JavaScript in the source code:

A screenshot of the Smithsonian mobile site’s source code displaying a large amount of JavaScript content.

This is the result when the JavaScript-heavy site is crawled and viewed in the Wayback Machine:

A screenshot of the Smithsonian mobile website as it appears in the Wayback Machine.

After spending some time reading articles and blogs about Heritrix, I found that there does not seem to be any specific way to capture this content. The Heritrix configuration settings allow us to capture minimal JavaScript and Flash content, and for some sites this is enough. For others, however, a significant amount of content is missing, and as of now there is nothing we can do about it. The most interesting (and frustrating) part of this dilemma is that if you check the Heritrix content logs, all of the JavaScript and Flash files are there, but they do not link up to produce working content. In other words, all the "dots" are there, but they can't be connected.
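One way to confirm that the "dots" were fetched is to tally what the crawler logged by content type. The sketch below assumes the whitespace-separated layout of a Heritrix crawl log, in which the second field is the fetch status code and the seventh is the MIME type. That layout should be checked against the Heritrix version in use. The function name `tally_mime_types` and every log line in `SAMPLE_LOG` are made up for illustration.

```python
from collections import Counter


def tally_mime_types(log_lines):
    """Count successfully fetched URIs per MIME type in crawl-log lines.

    Assumes whitespace-separated fields with the status code in field 2
    and the MIME type in field 7 (an assumption about the log layout).
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        status, mime = fields[1], fields[6]
        if status == "200":
            counts[mime] += 1
    return counts


# Hypothetical log lines, not taken from a real crawl.
SAMPLE_LOG = [
    "2012-07-10T14:02:11.123Z 200 5120 http://example.org/ - - text/html #001 20120710140211000+12 sha1:AAAA - -",
    "2012-07-10T14:02:12.456Z 200 20480 http://example.org/js/app.js E http://example.org/ application/javascript #002 20120710140212000+9 sha1:BBBB - -",
    "2012-07-10T14:02:13.789Z 200 90112 http://example.org/intro.swf E http://example.org/ application/x-shockwave-flash #003 20120710140213000+30 sha1:CCCC - -",
    "2012-07-10T14:02:14.012Z 404 312 http://example.org/js/old.js E http://example.org/ application/javascript #004 20120710140214000+5 sha1:DDDD - -",
]

if __name__ == "__main__":
    for mime, count in sorted(tally_mime_types(SAMPLE_LOG).items()):
        print(mime, count)
```

A tally like this shows only that the files were downloaded. It says nothing about whether the scripts can find one another again on playback, which is the gap described above.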

The online exhibition for Roberto Clemente is an example of a site with a small amount of JavaScript and Flash; this is how it appears on the Internet:

A screenshot of the Roberto Clemente online exhibit as it appears on the Internet, with the mouse hovering over “Roberto Clemente’s Story,” and a screenshot of the page’s source code displaying a small amount of JavaScript.

Here is the same site as it appears in the Wayback Machine. The crawl was successful because the site contains only minimal JavaScript and Flash content:

A screenshot of the Roberto Clemente online exhibit as it appears in the Wayback Machine, with the mouse hovering over “Roberto Clemente’s Story.”

Is all hope lost for these websites? Luckily, no! Web developers and web archivists acknowledge that this is a significant problem and are working to develop programs specifically designed to capture JavaScript and Flash content. We also hope that some software and hardware upgrades will help. It's an exciting time to be a budding web archivist, and I can't wait to see how technological advancements such as these enhance current web archiving practices.

Related Resources

  • Saving the Smithsonian’s Web, The Bigger Picture Blog, Smithsonian Institution Archives
  • Archiving the Smithsonian’s Presence on the Internet, The Bigger Picture Blog, Smithsonian Institution Archives
  • Crawling JavaScript, Internet Archive
Categories: Behind the Scenes
Tags: Web/Tech, Archive, Digitization, Behind the Scenes

Comments (3)

Heather

Indeed, the fact that some sites were created several years ago is challenging. The Roberto Clemente website, for example, was published in 2005, when Flash was THE tool for bringing interactivity to the web.

The trick for web historians will be how to see and understand this kind of content in spite of the technology that's used to deliver it.

Heather July 17, 2012 at 11:52 am
Hadi Mallah

Robots exclusion standard: The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable.

http://en.wikipedia.org/wiki/Robots_Exclusion_Standard

If a webpage you want to crawl excludes the Archive-It robot, you should try contacting the webmaster for the site, letting him or her know why you want to archive their site, and request that they allow the Archive-It robot to crawl their site.

The webmaster will need to know the name of the Archive-It robot (or crawler): It is archive.org_bot

The Heritrix web crawler is set to very polite settings for all Archive-It crawling and should not impact the site being crawled in any way.

Hadi Mallah July 18, 2012 at 8:49 am
Lynda Schmitz Fuhrig

Hadi,
You are right about the settings of the Heritrix web crawler honoring robots.txt. Since we are crawling Smithsonian public websites, we have Heritrix set to ignore robots.txt in those instances so we get more complete results.

Lynda

Lynda Schmitz Fuhrig July 20, 2012 at 4:40 pm


Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.
