Smithsonian Now Using Archive-It to Crawl Websites

In September 2012, the Smithsonian Institution Archives began using Archive-It, a service of the Internet Archive, to crawl its almost 250 websites.  Archive-It is "a web archiving service to harvest and preserve digital collections" that is used by more than 200 organizations.

A screenshot of the Smithsonian Institution homepage, crawled October 9, 2012.  This was the first w

This is exciting news for us as archivists.  While Archive-It uses the same software for crawling and viewing websites as we had been using for the past three years, we have been plagued with hardware issues and have not been able to keep our software up-to-date.  We now have access to software updates as soon as they are available.  The processes of setting up a crawl and reviewing it afterwards are also more user-friendly with Archive-It.  In addition, we now have the benefit of support from both the Archive-It staff and the larger Archive-It user community for those times when we just cannot figure out why a crawl is not working.

The switch to Archive-It should be exciting for non-archivists too.  Crawls made with Archive-It should match or exceed the quality of the crawls we have been running on our own.  That means they have the potential to be more complete and to contain fewer errors.  Most crawled websites will be available from the Archive-It website within 24 hours to researchers and the general public alike.  They can be viewed at any time from any computer without contacting the Archives or logging in.

To view these websites, go directly to the Smithsonian Institution's page at Archive-It or type "Smithsonian Institution" into the Explore Collecting Organizations search box on the Archive-It homepage.  There are two collections listed.  "Smithsonian Institution Special" includes portions of Smithsonian websites that are crawled at a specific time to capture content related to a special event such as the Presidential Inauguration.  "Smithsonian Websites" includes regularly scheduled crawls of most or all of a website.

A screenshot of the Smithsonian Institution collections page on Archive-It.  

Once a website is chosen, a timeline will open, allowing a user to see the dates on which the website was crawled and to choose which date to view.  Because we have only recently begun using Archive-It, there will generally only be one date available.  The website will then load with a banner proclaiming the exact date and time the page was captured.  Users can then click through the links as if using a live website.  If a link was not included as part of the crawl, Archive-It will attempt to find a version of the page elsewhere within the Smithsonian’s collections or within the Internet Archive and will display the nearest date.  A link to the same page on the live web is also provided.
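For readers curious how a "nearest date" lookup of this kind might work, here is a minimal sketch in Python.  It is an illustration only, not Archive-It's actual implementation: the function name `nearest_capture` and the capture dates are hypothetical, chosen just to show the idea of picking the capture closest in time to the one requested.

```python
from datetime import datetime


def nearest_capture(requested, capture_times):
    """Return the capture time closest to the requested time.

    Returns None when there are no captures to choose from.
    """
    if not capture_times:
        return None
    # Pick the capture with the smallest absolute time difference.
    return min(capture_times, key=lambda t: abs(t - requested))


# Hypothetical capture dates, for illustration only.
captures = [
    datetime(2012, 9, 18, 14, 0),
    datetime(2012, 10, 9, 16, 30),
    datetime(2012, 11, 2, 9, 15),
]

chosen = nearest_capture(datetime(2012, 10, 12), captures)
print(chosen)
```

In this sketch, a request for a page as of October 12 would be answered with the October 9 capture, since that is the closest of the three hypothetical dates.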

A screenshot of the timeline showing the dates on which Smithsonian Institution.

As of this writing, the Smithsonian Institution has 41 crawled websites on Archive-It.  The Internet Archive also crawls Smithsonian websites from time to time and makes those crawls available via its Wayback Machine, which is a separate search engine from Archive-It.  In addition, the Smithsonian Institution Archives has more than 260 captured websites and blogs, many illustrating the same sites over time, that are not available in Archive-It (see a partial list of finding aids for these collections).  We are currently working towards making both the Archive-It and pre-Archive-It web captures available via our finding aids.

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.