Smithsonian Institution Archives

The Bigger Picture: Visual Archives and the Smithsonian

Saving the Smithsonian’s Web

by Robin C. Davis, Intern on August 25, 2011

This post is an update to Lynda Schmitz Fuhrig's post “Archiving the Smithsonian’s Presence on the Internet” from September 2, 2010.

The evolution of the websites of the National Museum of Natural History, National Portrait Gallery, National Air and Space Museum, and Hirshhorn Museum. From left to right: how they looked in 1998/2000, 2003, and 2011. Credit: Smithsonian Institution Archives and the Internet Archive.

The Smithsonian Institution has had a presence on the Internet for more than sixteen years. It’s come a long way since then. Documenting the Smithsonian’s various websites falls under the purview of the Smithsonian Institution Archives...but how do we do it?

As a web preservation intern at the Archives this summer, I’ve helped to develop the workflow for preserving Smithsonian-affiliated web content. Our goal is to take an annual “snapshot” of all Smithsonian public websites to be kept in the Archives.

Why do we preserve websites?

Institutional websites are important to preserve because they are:

  • records of institutional activity;
  • publications exposed in the public sphere; and
  • artifacts of historical and heritage value.

(Adapted from PoWR: Preservation of Web Resources Handbook, JISC, 2008.)

How do we preserve websites?

While each unit or office within the Smithsonian maintains and backs up the web content they create, the best way for the Archives to get a comprehensive snapshot of all the websites as they appear online is to use a web crawler. Crawlers, or spiders, are programs that browse the Internet by following trails of links, typically to index or save the content they encounter.
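To make the idea of "following trails of links" concrete, here is a minimal sketch of a breadth-first crawler in Python. It is not how Heritrix works internally. The three-page site, the `example.org` URLs, and the `crawl` and `LinkExtractor` names are all hypothetical, and the pages are held in memory so that the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    seen = {seed}
    queue = deque([seed])
    captured = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # broken link: nothing to save or parse
        captured.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return captured


# Hypothetical three-page site held in memory, standing in for the live web.
PAGES = {
    "http://example.org/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "http://example.org/about": '<a href="/">Home</a>',
    "http://example.org/blog": '<a href="/about#staff">Staff</a> <a href="/missing">Gone</a>',
}

visited = crawl("http://example.org/", PAGES.get)
print(visited)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the home and about pages do here.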

We use Heritrix, open-source crawling software developed by the Internet Archive, to conduct focused captures of individual websites according to our specifications and schedule. Heritrix bundles all the web content it crawls into .WARC files, an archival file format.

A screenshot of Heritrix’s progress crawling the Smithsonian Marine Station website.

We need special software to view the content of the WARC files and perform a quality control check to make sure everything looks right and nothing is missing. We’re using the Wayback application, also developed by the Internet Archive. The local application looks and acts just like the Wayback Machine online. Once we’re satisfied with the captured website, we accession the WARC files and they’re officially part of the Archives’ holdings.

A screenshot of the American Art Museum’s Eyelevel Blog as reviewed in Wayback.
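For a sense of what WARC-reading software has to do, the sketch below walks the records in a WARC file and lists the URLs that were captured, which is one simple kind of completeness check. It is an illustration only, not the Wayback application's code. It assumes an uncompressed WARC stream (real crawl output is often gzip-compressed), and the function names and the sample records are made up for the example.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def captured_uris(stream):
    """Return the target URI of every 'response' record, in file order."""
    return [
        headers["WARC-Target-URI"]
        for headers, _payload in iter_warc_records(stream)
        if headers.get("WARC-Type") == "response"
    ]


def make_record(warc_type, body, uri=None):
    """Build one tiny WARC record for the demo (hypothetical sample data)."""
    lines = ["WARC/1.0", f"WARC-Type: {warc_type}"]
    if uri:
        lines.append(f"WARC-Target-URI: {uri}")
    lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    return head + body + b"\r\n\r\n"


sample = (
    make_record("warcinfo", b"software: demo\r\n")
    + make_record("response", b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>", "http://example.org/")
    + make_record("response", b"HTTP/1.1 200 OK\r\n\r\nPNGDATA", "http://example.org/logo.png")
)

uris = captured_uris(io.BytesIO(sample))
print(uris)
```

A list like this can be compared against the list of pages you expected to capture, so that gaps show up before the files are accessioned.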

Future researchers will also have to use Wayback or other WARC-reading software to view preserved web collections. They might be interested in the content of web-published news releases, the structure of the Smithsonian’s extensive online image collections, or what was deemed worthy of a blog post (!).

Issues encountered

The road to web preservation is not without a few bumps. A few issues we’ve encountered are:

  • Estimating the size of a site. Seemingly small, innocuous websites can actually contain many thousands of documents. One of the largest single crawls so far was the website of the National Museum of Natural History’s Botany department, which took 49 hours and 57 minutes to capture 78,922 files. To budget our time, we need to estimate how big a website is, and we use specific software tools like link validation programs to do that.
  • Deciding what external content to capture. How do you tell a web crawler that you want it to follow a link in a blog post to a useful article elsewhere on the Smithsonian website, but not to follow a link to a spam site in the comments? For blogs, we configure Heritrix to accept embedded off-domain content, like photos from Flickr, but not to scrape linked off-domain sites. For non-blog Smithsonian sites, we don’t capture any off-domain content at all. In both instances, we can also specify any URL patterns that are acceptable.
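The scoping rules in the second item can be written out as a small decision function. The Python below is only a sketch of the logic as described; it is not Heritrix configuration, and the `in_scope` function, its parameters, and the `example.*` hosts are all hypothetical. It assumes the crawler can tell whether a URL was found as an embed (such as an image source) or as an ordinary link.

```python
import re
from urllib.parse import urlparse


def on_domain(url, site_domain):
    """True if the URL's host is the site's domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == site_domain or host.endswith("." + site_domain)


def in_scope(url, site_domain, is_blog, is_embed, accept_patterns=()):
    """Decide whether a discovered URL should be captured.

    is_embed: the URL was referenced as embedded content (e.g. an image),
              as opposed to a link a reader would click.
    accept_patterns: extra regular expressions for URLs that are always acceptable.
    """
    if on_domain(url, site_domain):
        return True  # content on the site being crawled is always captured
    if any(re.search(pattern, url) for pattern in accept_patterns):
        return True  # explicitly allowed off-domain URL pattern
    # Off-domain content: only blogs accept it, and only when it is embedded.
    return is_blog and is_embed


blog = "blog.example.org"
print(in_scope("http://photos.example.net/a.jpg", blog, is_blog=True, is_embed=True))
print(in_scope("http://spam.example.com/", blog, is_blog=True, is_embed=False))
```

In this sketch the embedded photo is accepted and the spam link from the comments is rejected, while a pattern such as `^https?://www\.example\.org/` could be passed in `accept_patterns` to let the crawler follow links to a useful article elsewhere on the parent site.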

We’re still learning how best to use these tools to fit the needs of the Archives, and in the past two months, we’ve made a lot of progress:

114 crawls performed

541 hours of crawling

684,264 pieces of content captured (includes HTML pages, JPEG images, MP3 audio, etc.)

That means that so far, we’ve reached about two-thirds of this year’s snapshot goal.
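The figures above also give a rough basis for budgeting crawl time. The short Python calculation below uses only numbers quoted in this post; the `estimate_hours` helper and the 10,000-file example site are hypothetical, and it assumes that a new site would crawl at about the same per-file rate as the Botany department site, which will not hold for every site.

```python
# Totals reported for the past two months.
CRAWLS = 114
HOURS = 541
ITEMS = 684_264

items_per_hour = ITEMS / HOURS     # average throughput across all crawls
hours_per_crawl = HOURS / CRAWLS   # average duration of one crawl

# The Botany department crawl: 49 hours 57 minutes for 78,922 files.
botany_seconds = (49 * 60 + 57) * 60
botany_files = 78_922
seconds_per_file = botany_seconds / botany_files


def estimate_hours(file_count, secs_per_file=seconds_per_file):
    """Estimate crawl time in hours from a file count (e.g. from a link checker)."""
    return file_count * secs_per_file / 3600


print(f"{items_per_hour:.0f} items per hour on average")
print(f"{hours_per_crawl:.1f} hours per crawl on average")
print(f"{seconds_per_file:.2f} seconds per file for the Botany crawl")
print(f"{estimate_hours(10_000):.1f} hours estimated for a 10,000-file site")
```

On these numbers the average works out to roughly 1,265 items per hour, and the Botany crawl to about 2.28 seconds per file.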

 

Categories: What Gets Saved
Tags: Web/Tech, Archive, Digitization, Behind the Scenes

Comments (9)

Chris Fellows

Wow, that’s a lot of archiving. Interesting article. I didn’t realize that historical archiving exists for the web too! Sounds like some interesting open source. How about Apache Nutch and Lucene?

Chris Fellows August 26, 2011 at 12:06 pm
Lynda Schmitz Fuhrig

Hi Chris,
Thanks. Web archiving is fascinating, especially when you start to follow the evolution of websites during the past two decades.

See http://archive-access.sourceforge.net/projects/nutch/ for information about NutchWax for indexing.

Lynda
http://siarchives.si.edu/

Lynda Schmitz Fuhrig September 13, 2011 at 1:36 pm
John Hunter

This is important work, and I am glad you are working on it. It is frustrating how poor the ability to find electronic information is. We have the computing power and tools to do a better job. The Internet Archive is a very nice (but limited) resource. I hope you succeed in doing this well. It will be a huge advantage to us in the future.

John Hunter August 27, 2011 at 8:04 am
Lynda Schmitz Fuhrig

John,
Thank you for your note. Yes, we believe this is important to do. Other institutions such as the Library of Congress, Harvard, and the California Digital Library are also invested in web archiving. See http://www.netpreserve.org/about/archiveList.php for more information.

Lynda
http://siarchives.si.edu/

Lynda Schmitz Fuhrig September 13, 2011 at 1:43 pm
stephen

I can understand why you want to archive the Smithsonian websites, but I don’t understand why you are using a crawler. Given that you control the web server and database back end, why not back up directly from there instead of crawling?

stephen August 27, 2011 at 2:16 pm
Lynda Schmitz Fuhrig

Hi Stephen,
Good question.

Actually the Smithsonian uses more than one web content management system for its various websites. The Heritrix crawler gives us at the Smithsonian Institution Archives the flexibility and control that we do not have when retrieving from web servers.

-- We determine when the crawl will be done.
-- The output of the crawl (the WARCs) is a complete archival package.
-- The crawler also creates log/audit files that we do not get from copying files out from web servers.

Lynda
http://siarchives.si.edu/

Lynda Schmitz Fuhrig September 13, 2011 at 2:15 pm
Ally Lennon

As time goes on and the internet continues to change ever more rapidly and fundamentally, these sites will start to ‘resemble’ a lot of the old photos in your collection.
Good job on keeping them alive!

Ally Lennon August 30, 2011 at 12:06 pm
Lynda Schmitz Fuhrig

Ally,
That is an interesting way of looking at it. Thanks for the comment.

Lynda
http://siarchives.si.edu/

Lynda Schmitz Fuhrig September 13, 2011 at 2:08 pm
Raymond

It’s good news that the Smithsonian Institution has had a presence on the Internet for more than sixteen years. It’s come a long way since then. Documenting the Smithsonian’s various websites falls under the purview of the Smithsonian Institution Archives. I have never seen a blog better than this one, and I enjoyed this post. Thank you for sharing it with us. Thanks again and good luck!
Samrx

Raymond July 9, 2012 at 7:55 am
