Web Archiving Update

Next week, I'll be participating in a session called "The Web of Sites: Creating Effective Web Archiving Appraisal and Collection Development Policies" at the Society of American Archivists annual meeting. For those of you who won't be there, I thought this might be a good time to provide an update on the Smithsonian Institution Archives' web archiving activities.

In a blog post earlier this year, I announced that the Archives had begun using a subscription service, Archive-It, to preserve the Smithsonian's web presence.  We had previously been using our own installation of the Heritrix software, also used by Archive-It, to crawl and store websites locally.  With the move to Archive-It, we hoped to more efficiently crawl websites, as well as to provide better access to the crawled websites.

Yes, we archive this blog, too. A screenshot of the post referenced above, crawled June 16, 2013.

How are we doing? Of the 365 websites and blogs currently maintained by the Smithsonian, the Archives has crawled 120, or approximately one-third, in the 10 months since we began using Archive-It in October 2012. This is far from our goal of crawling each of our websites every year, but it is still a significant improvement: in the two and a half years during which we used our local installation of Heritrix, we crawled only about one-half of our websites. We are optimistic that our numbers will continue to improve. With more experience, we expect to become more efficient at setting up, troubleshooting, and reviewing crawls.

A screenshot of the website for the "Piano 300: Celebrating Three Centuries of People and Pianos" ex

As for providing better access to crawled websites, 107 crawled websites and blogs (89 percent of those crawled using Archive-It) are now available online. More will become available as the review process is completed. Earlier crawls performed in-house are available only on a few local computers, so we're very happy with this improvement.

Please check out our progress!

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.