Next week, I'll be participating in a session called "The Web of Sites: Creating Effective Web Archiving Appraisal and Collection Development Policies" at the Society of American Archivists annual meeting. For those of you who won't be there, I thought it might be a good time to provide an update on the Smithsonian Institution Archives' web archiving activities.
In a blog post earlier this year, I announced that the Archives had begun using a subscription service, Archive-It, to preserve the Smithsonian's web presence. We had previously been using our own installation of the Heritrix software, also used by Archive-It, to crawl and store websites locally. With the move to Archive-It, we hoped to more efficiently crawl websites, as well as to provide better access to the crawled websites.
How are we doing? Of the 365 websites and blogs currently maintained by the Smithsonian, the Archives has crawled 120, or approximately one-third, in the 10 months since we began using Archive-It in October 2012. This is far from our goal of crawling each of our websites every year, but it is still a significant improvement: in the two and a half years during which we used our local installation of Heritrix, we crawled only about one-half of our websites. We are optimistic that our numbers will continue to improve. With more experience, we expect to become more efficient at setting up, troubleshooting, and reviewing crawls.
As for providing better access to crawled websites, 107 crawled websites and blogs (89 percent of those crawled using Archive-It) are now available online. More will become available as the review process is completed. By contrast, earlier crawls performed in-house are available only via a few local computers. We're very happy with this improvement.
Please check out our progress!
- Smithsonian Now Using Archive-It to Crawl Websites, The Bigger Picture blog, Smithsonian Institution Archives
- Smithsonian Institution Collections on Archive-It, Smithsonian Institution Archives
- Website finding aids, Smithsonian Institution Archives