The Bigger Picture: Visual Archives and the Smithsonian
Smithsonian Now Using Archive-It to Crawl Websites
In September 2012, the Smithsonian Institution Archives began using Archive-It, a service of the Internet Archive, to crawl its almost 250 websites. Archive-It is "a web archiving service to harvest and preserve digital collections" that is used by more than 200 organizations.
This is exciting news for us as archivists. While Archive-It uses the same software for crawling and viewing websites as we had been using for the past three years, we have been plagued with hardware issues and have not been able to keep our software up-to-date. We now have access to software updates as soon as they are available. The processes of setting up a crawl and reviewing it afterwards are also more user-friendly with Archive-It. In addition, we now have the benefit of support from both the Archive-It staff and the larger Archive-It user community for those times when we just cannot figure out why a crawl is not working.
The switch to Archive-It should be exciting for non-archivists too. Our Archive-It crawls should be of the same quality as, or higher quality than, the crawls we have been running on our own. That means they have the potential to be more complete and to contain fewer errors. Most crawled websites will be available from the Archive-It website within 24 hours, for researchers and the general public alike. They can be viewed at any time from any computer without contacting the Archives or requiring any login.
To view these websites, go directly to the Smithsonian Institution's page at Archive-It or type "Smithsonian Institution" into the Explore Collecting Organizations search box on the Archive-It homepage. There are two collections listed. "Smithsonian Institution Special" includes portions of Smithsonian websites that are crawled at a specific time to capture content related to a special event such as the Presidential Inauguration. "Smithsonian Websites" includes regularly scheduled crawls of most or all of a website.
Once a website is chosen, a timeline will open, allowing a user to see the dates on which the website was crawled and to choose which date to view. Because we have only recently begun using Archive-It, there will generally only be one date available. The website will then load with a banner proclaiming the exact date and time the page was captured. Users can then click through the links as if using a live website. If a link was not included as part of the crawl, Archive-It will attempt to find a version of the page elsewhere within the Smithsonian’s collections or within the Internet Archive and will display the nearest date. A link to the same page on the live web is also provided.
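For readers curious about what "display the nearest date" means in practice, here is a minimal sketch of the idea, not Archive-It's actual code. The function name `nearest_capture` and the sample capture dates are hypothetical and chosen only for illustration; the sketch simply picks, from a list of capture times, the one closest to the requested date.

```python
from datetime import datetime

# Hypothetical helper for illustration only; this is not how Archive-It
# is known to implement its lookup.
def nearest_capture(requested, captures):
    """Return the capture time closest to the requested time, or None."""
    if not captures:
        return None
    # Compare absolute time differences and keep the smallest one.
    return min(captures, key=lambda c: abs(c - requested))

# Made-up capture dates for a single page.
captures = [
    datetime(2012, 9, 14, 10, 30),
    datetime(2012, 12, 3, 8, 0),
    datetime(2013, 1, 21, 17, 45),
]

requested = datetime(2013, 1, 10, 12, 0)
print(nearest_capture(requested, captures))  # 2013-01-21 17:45:00
```

With only one capture in the list, as is currently the case for most of our sites, that single capture is always the one returned.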
As of this writing, the Smithsonian Institution has 41 crawled websites on Archive-It. The Internet Archive also crawls Smithsonian websites from time to time and makes those crawls available via its Wayback Machine, which is a separate search engine from Archive-It. In addition, the Smithsonian Institution Archives has more than 260 captured websites and blogs, many illustrating the same sites over time, that are not available in Archive-It (see a partial list of finding aids for these collections). We are currently working towards making both the Archive-It and pre-Archive-It web captures available via our finding aids.
Related Resources
- Connecting the Dots: Issues with Preserving Complex Websites, The Bigger Picture blog, Smithsonian Institution Archives
Related Collections
- Smithsonian Institution Collections on Archive-It, Smithsonian Institution Archives
- Website finding aids, Smithsonian Institution Archives
Comments (5)
Wow! It's great to hear that the SIA is using Archive-It to crawl its sites! It sounds like this transition has really facilitated the process. Congrats!
That's great news. The search functionality will help us all delve deeper into the archives. Thanks to all involved!
That seems like a great transition. Even though it's the same software for crawling and archiving the websites, it is a big improvement if the hardware doesn't have any issues. I really like the Wayback Machine functionality because it would be interesting to be able to see the sites throughout their history.
This is great news, means we can search for older articles in the archive. Archive-It is a great feature!
Looks great! I'm glad to see Archive-it makes the web preservation workflow easy — much more user-friendly than using Heritrix :) Your dedication to web archiving is inspirational.
Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.