Web Archiving Update, October 2014

Last week, we celebrated two years of using Archive-It to document the Smithsonian Institution's web presence. Previously, we had been using an in-house software and hardware installation to crawl websites and had cobbled together various less-than-ideal methods for capturing social media. Our hope was that a subscription to Archive-It would allow us to capture our web presence more efficiently and to provide better access to our crawled web content.

So how are we doing?

The Smithsonian currently has a total of 349 distinct websites and blogs. In the last year, we've crawled 170 of them, or approximately 49% of the total. Altogether, we've crawled 327 websites and blogs, about 94% of the total, since we began using Archive-It two years ago. In addition, a significant number have been crawled more than once. Of those that have yet to be crawled, the majority have underlying code that makes them nearly impossible to crawl using the technology currently available to us.

By this point, we had hoped to be crawling our websites and blogs annually. Although we haven't reached that goal, we've certainly improved: in the 2½ years before we began using Archive-It, we crawled approximately one-half of our websites, whereas in less than two years with Archive-It we have crawled nearly all of our websites and blogs. And there's the added bonus that most of our crawled content from the last two years is available online via our Smithsonian Institution Websites Collection on Archive-It.

We continue to take steps to improve our efficiency. One of our next steps will be to evaluate the websites we've already crawled to determine which ones do not need to be crawled again because they are no longer being updated. An example might be an online exhibition that was launched in its final format and was never intended to be modified. The fewer websites that need to be crawled, the more frequently we'll be able to capture those that do.

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.