The Bigger Picture: Visual Archives and the Smithsonian

Smithsonian Now Using Archive-It to Crawl Websites

by Jennifer Wright on February 26, 2013

In September 2012, the Smithsonian Institution Archives began using Archive-It, a service of the Internet Archive, to crawl the Smithsonian's nearly 250 websites. Archive-It is "a web archiving service to harvest and preserve digital collections" that is used by more than 200 organizations.

A screenshot of the Smithsonian Institution homepage, crawled October 9, 2012.  This was the first website to be crawled by the Smithsonian Institution Archives using Archive-It.

This is exciting news for us as archivists. Although Archive-It uses the same software for crawling and viewing websites that we had been using for the past three years, we had been plagued with hardware issues and had not been able to keep our software up to date. We now have access to software updates as soon as they are available. Setting up a crawl and reviewing it afterwards are also more user-friendly with Archive-It. In addition, we now have the benefit of support from both the Archive-It staff and the larger Archive-It user community for those times when we just cannot figure out why a crawl is not working.

The switch to Archive-It should be exciting for non-archivists too. Our website crawls using Archive-It should be of the same quality as, or higher quality than, the crawls we have been running on our own. That means they have the potential to be more complete and to contain fewer errors. Most crawled websites will be available on the Archive-It website within 24 hours, for researchers and the general public alike. They can be viewed at any time from any computer without contacting the Archives and without a login.

To view these websites, go directly to the Smithsonian Institution's page at Archive-It or type "Smithsonian Institution" into the Explore Collecting Organizations search box on the Archive-It homepage.  There are two collections listed.  "Smithsonian Institution Special" includes portions of Smithsonian websites that are crawled at a specific time to capture content related to a special event such as the Presidential Inauguration.  "Smithsonian Websites" includes regularly scheduled crawls of most or all of a website.

A screenshot of the Smithsonian Institution collections page on Archive-It. 

Once a website is chosen, a timeline will open, allowing a user to see the dates on which the website was crawled and to choose which date to view. Because we have only recently begun using Archive-It, there will generally be only one date available. The website will then load with a banner showing the exact date and time the page was captured. Users can then click through the links as if using a live website. If a linked page was not included as part of the crawl, Archive-It will attempt to find a version of the page elsewhere within the Smithsonian’s collections or within the Internet Archive and will display the capture nearest in date. A link to the same page on the live web is also provided.

A screenshot of the timeline showing the dates on which a Smithsonian Institution website was crawled.
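
For readers curious about what "nearest date" means in practice, the following is a minimal sketch of how a nearest-capture lookup could work. It is an illustration only and not Archive-It's actual implementation: the function name `nearest_capture` and the capture timestamps are hypothetical (only the October 9, 2012 date comes from this post, and its time of day is made up).

```python
from datetime import datetime


def nearest_capture(requested, captures):
    """Return the capture timestamp closest to the requested timestamp.

    Hypothetical helper for illustration; not part of any Archive-It API.
    """
    if not captures:
        raise ValueError("no captures available")
    # Pick the capture with the smallest absolute time difference.
    return min(captures, key=lambda c: abs(c - requested))


# Hypothetical capture times for a single page.
captures = [
    datetime(2012, 10, 9, 14, 0),
    datetime(2013, 1, 21, 12, 0),
    datetime(2013, 2, 20, 9, 30),
]

# A user asks for the page as of January 25, 2013.
print(nearest_capture(datetime(2013, 1, 25), captures))
```

In this sketch a request for January 25, 2013 resolves to the January 21 capture, since it is closer in time than either of the other two.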

As of this writing, the Smithsonian Institution has 41 crawled websites on Archive-It. The Internet Archive also crawls Smithsonian websites from time to time and makes those crawls available via its Wayback Machine, which is a separate search engine from Archive-It. In addition, the Smithsonian Institution Archives has more than 260 captured websites and blogs, many illustrating the same sites over time, that are not available in Archive-It (see a partial list of finding aids for these collections). We are currently working towards making both the Archive-It and pre-Archive-It web captures available via our finding aids.

Related Resources

  • Connecting the Dots: Issues with Preserving Complex Websites, The Bigger Picture blog, Smithsonian Institution Archives

Related Collections

  • Smithsonian Institution Collections on Archive-It, Smithsonian Institution Archives
  • Website finding aids, Smithsonian Institution Archives
Categories: What Gets Saved
Tags: Web/Tech, Archive

Comments (5)

Anne Leon

Wow! It's great to hear that the SIA is using Archive-It to crawl its sites! It sounds like this transition has really facilitated the process. Congrats!

Anne Leon February 26, 2013 at 9:24 am
Patrick Feinstein

That's great news. The search functionality will help us all delve deeper into the archives. Thanks to all involved!

Patrick Feinstein February 28, 2013 at 6:14 am
Brosius

That seems like a great transition. Even though it's the same software for crawling and archiving the websites, it is a big improvement if the hardware doesn't have any issues. I really like the Wayback Machine functionality because it would be interesting to be able to see the sites throughout their history.

Brosius March 4, 2013 at 5:35 am
Ciaran

This is great news, means we can search for older articles in the archive. Archive-It is a great feature!

Ciaran March 4, 2013 at 6:58 am
Robin Davis

Looks great! I'm glad to see Archive-it makes the web preservation workflow easy — much more user-friendly than using Heritrix :) Your dedication to web archiving is inspirational.

Robin Davis March 5, 2013 at 11:12 am

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.