The Bigger Picture: Visual Archives and the Smithsonian
Saving the Smithsonian’s Web
This post is an update to Lynda Schmitz Fuhrig's post “Archiving the Smithsonian’s Presence on the Internet” from September 2, 2010.
The Smithsonian Institution has had a presence on the Internet for more than sixteen years. It’s come a long way since then. Documenting the Smithsonian’s various websites falls under the purview of the Smithsonian Institution Archives...but how do we do it?
As a web preservation intern at the Archives this summer, I’ve helped to develop the workflow for preserving Smithsonian-affiliated web content. Our goal is to take an annual “snapshot” of all Smithsonian public websites to be kept in the Archives.
Why do we preserve websites?
Institutional websites are important to preserve because they are:
- records of institutional activity;
- publications exposed in the public sphere; and
- artifacts of historical and heritage value.
(Adapted from PoWR: Preservation of Web Resources Handbook, JISC, 2008.)
How do we preserve websites?
While each unit or office within the Smithsonian maintains and backs up the web content they create, the best way for the Archives to get a comprehensive snapshot of all the websites as they appear online is to use a web crawler. Crawlers, or spiders, are programs that browse the Internet by following trails of links, typically to index or save the content they encounter.
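To make "following trails of links" concrete, here is a minimal sketch of the idea in Python. It is not how Heritrix is implemented. The `PAGES` dictionary, the `example.org` URLs, and the `crawl` function are all made up for illustration, and the "website" lives in memory so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory website: URL -> HTML. A real crawler fetches over HTTP.
PAGES = {
    "http://example.org/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://example.org/about": '<a href="/">Home</a>',
    "http://example.org/news": '<a href="/news/1">Story</a> <a href="/about">About</a>',
    "http://example.org/news/1": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, pages):
    """Breadth-first crawl: visit the seed, then every page reachable by links."""
    seen = {seed}
    queue = deque([seed])
    captured = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # nothing to capture at this URL
        captured.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return captured


captured = crawl("http://example.org/", PAGES)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the home and "about" pages do here.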
We use Heritrix, open-source crawling software developed by the Internet Archive, to conduct focused captures of individual websites according to our specifications and schedule. Heritrix bundles all the web content it crawls into WARC files, an archival file format.
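For readers curious about what sits inside a WARC file, the sketch below assembles one simplified record by hand, assuming the WARC 1.0 layout: a version line, named header fields, a blank line, the captured bytes, and two trailing line breaks. The function name `build_warc_record` and the `example.org` URL are our own inventions for illustration. Files written by a real crawler contain more record types and more header fields than this.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Build one simplified WARC 1.0 'resource' record (illustrative only)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {timestamp}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates header from content.
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record is followed by two CRLFs.
    return header + payload + b"\r\n\r\n"


record = build_warc_record("http://example.org/", b"<html>hello</html>", "text/html")
```

Because each record carries its own URL, date, and length, many captures can be concatenated into a single file and still be read back one at a time.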
We need special software to view the content of the WARC files and perform a quality control check to make sure everything looks right and nothing is missing. We’re using the Wayback application, also developed by the Internet Archive. The local application looks and acts just like the Wayback Machine online. Once we’re satisfied with the captured website, we accession the WARC files and they’re officially part of the Archives’ holdings.
Future researchers will also have to use Wayback or other WARC-reading software to view preserved web collections. They might be interested in the content of web-published news releases, the structure of the Smithsonian’s extensive online image collections, or what was deemed worthy of a blog post (!).
Issues encountered
The road to web preservation is not without a few bumps. A few issues we’ve encountered are:
- Estimating the size of a site. Seemingly small, innocuous websites can actually contain many thousands of documents. One of the largest single crawls so far was the website of the National Museum of Natural History’s Botany department, which took 49 hours and 57 minutes to capture 78,922 files. To budget our time, we need to estimate how big a website is, and we use specific software tools like link validation programs to do that.
- Deciding what external content to capture. How do you tell a web crawler that you want it to follow a link in a blog post to a useful article elsewhere on the Smithsonian website, but not to follow a link to a spam site in the comments? For blogs, we configure Heritrix to accept embedded off-domain content, like photos from Flickr, but not to scrape linked off-domain sites. For non-blog Smithsonian sites, we don’t capture any off-domain content at all. In both instances, we can also specify any URL patterns that are acceptable.
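The external-content rules above can be written out as a small decision function. This is plain Python, not Heritrix configuration, and it only mirrors the policy as described in this post. The function `in_scope`, its parameters, and the `example.org` and `example.com` addresses are hypothetical.

```python
import re
from urllib.parse import urlparse


def in_scope(url, site_domain, is_blog, is_embedded, extra_patterns=()):
    """Decide whether a discovered URL should be captured.

    is_embedded: True when the URL is content displayed inside a page
    (such as a photo), False when it is a link a reader would click.
    extra_patterns: regular expressions for URLs we explicitly accept.
    """
    host = urlparse(url).hostname or ""
    on_domain = host == site_domain or host.endswith("." + site_domain)
    if on_domain:
        return True  # the site's own content is always captured
    if any(re.search(pattern, url) for pattern in extra_patterns):
        return True  # explicitly accepted URL pattern
    # Off-domain content: only blogs accept it, and only when embedded.
    return is_blog and is_embedded


BLOG = "blog.example.org"
ARTICLES = [r"^https?://www\.example\.org/articles/"]

own_post = in_scope("http://blog.example.org/post/1", BLOG, True, False)
embedded_photo = in_scope("http://photos.example.com/pic.jpg", BLOG, True, True)
spam_link = in_scope("http://spam.example.com/buy", BLOG, True, False)
useful_article = in_scope("http://www.example.org/articles/7", BLOG, True, False, ARTICLES)
museum_embed = in_scope("http://photos.example.com/pic.jpg", "museum.example.org", False, True)
```

Here the embedded photo and the accepted article are captured, while the spam link in the comments and the off-domain embed on a non-blog site are not.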
We’re still learning how best to use these tools to fit the needs of the Archives, and in the past two months, we’ve made a lot of progress:
114 crawls performed
541 hours of crawling
684,264 pieces of content captured (includes HTML pages, JPEG images, MP3 audio, etc.)
That means that so far, we’ve reached about two-thirds of this year’s snapshot goal.
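For anyone who wants to check the arithmetic, the figures in this post work out as follows. The projection at the end is a rough extrapolation: it assumes "about two-thirds" means exactly two-thirds, and that the remaining third of the goal will resemble what we have crawled so far in size and speed. Neither assumption is guaranteed to hold.

```python
# Botany department crawl: 49 hours 57 minutes for 78,922 files.
botany_seconds = 49 * 3600 + 57 * 60
botany_files = 78_922
botany_seconds_per_file = botany_seconds / botany_files

# Totals so far: 541 hours of crawling, 684,264 pieces of content.
total_hours = 541
total_items = 684_264
items_per_hour = total_items / total_hours

# Assumption: exactly two-thirds done, so the full snapshot is 3/2 of the totals.
projected_items = total_items * 3 / 2
projected_hours = total_hours * 3 / 2

print(f"Botany crawl: {botany_seconds_per_file:.2f} seconds per file")
print(f"Overall rate: {items_per_hour:.0f} items per hour")
print(f"Projected snapshot: {projected_items:,.0f} items in {projected_hours:.1f} hours")
```

On these assumptions, the full snapshot would come to roughly one million pieces of content and a little over 800 hours of crawling.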
Comments (9)
Wow, that’s a lot of archiving. Interesting article. I didn’t realize that historical archiving exists for the web too! Sounds like some interesting open source. How about Apache Nutch and Lucene?
Hi Chris,
Thanks. Web archiving is fascinating, especially when you start to follow the evolution of websites during the past two decades.
See http://archive-access.sourceforge.net/projects/nutch/ for information about NutchWax for indexing.
This is important work, I am glad you are working on it. It is frustrating how poor the ability to find electronic information is. We have the computing power and tools to do a better job. The internet archive is a very nice (but limited) resource. I hope you succeed in doing this well. It will be a huge advantage to us in the future.
John,
Thank you for your note. Yes, we believe this is important to do. Other institutions such as the Library of Congress, Harvard, and the California Digital Library are also invested in web archiving. See http://www.netpreserve.org/about/archiveList.php for more information.
I can understand why you want to archive the Smithsonian websites, but I don’t understand why you are using a crawler? Given that you control the webserver and database back-end, why not directly back-up from there instead of crawling?
Hi Stephen,
Good question.
Actually, the Smithsonian uses more than one web content management system for its various websites. The Heritrix crawler gives us at the Smithsonian Institution Archives the flexibility and control that we do not have when retrieving from web servers.
-- We determine when the crawl will be done.
-- The output of the crawl (the WARCs) is a complete archival package.
-- The crawler also creates log/audit files that we do not get from copying files out from web servers.
As time goes on and the internet continues to change ever more rapidly and fundamentally, these sites will start to ‘resemble’ a lot of the old photos in your collection.
Good job on keeping them alive!
It’s good news that the Smithsonian Institution has had a presence on the Internet for more than sixteen years. It’s come a long way since then. Documenting the Smithsonian’s various websites falls under the purview of the Smithsonian Institution Archives. I have never seen a blog better than this one. I enjoyed this post. Thank you for sharing it with us. Thanks again and good luck!
Samrx
Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.


