Manuscripts, field books, images, and documents are not the only resources for historical information. Researchers, scholars, and journalists are also making use of website and social media archives. There are now more than twenty years of website archives in a variety of collections, including the Smithsonian Institution Archives, the University of North Texas CyberCemetery, and the Internet Archive.

Two screenshots of websites of the Freer Sackler Gallery. The top is set in a white background and a

Archived web collections can be used in a variety of ways: finding a site or article that is no longer on the Internet, finding a web page that has changed its design, extracting/downloading data for analysis, or detecting trends. Penn State recently awarded grant money for projects using Twitter data. Some research areas include fake news, disaster relief, and health issues.
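As a rough illustration of the trend-detection use, the sketch below counts how many captures fall in each year. It assumes you already have a list of capture timestamps in the 14-digit YYYYMMDDhhmmss form that the Internet Archive's Wayback Machine uses to label captures. The function names are hypothetical and not part of any archive's tooling.

```python
from collections import Counter
from datetime import datetime


def parse_capture_timestamp(ts):
    """Convert a 14-digit capture timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def captures_per_year(timestamps):
    """Count captures per calendar year, returned in chronological order."""
    counts = Counter(parse_capture_timestamp(ts).year for ts in timestamps)
    return dict(sorted(counts.items()))
```

Calling `captures_per_year` on a list of timestamps returns a dictionary that maps each year to its number of captures. A sudden rise or drop in captures can be a starting point for a closer look, though it reflects crawl schedules as much as changes on the site itself.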

Screenshot of a home page with a basic menu on the left side and a low-res image of a reddish-brown b

The best-known web archive resource is the Wayback Machine from the Internet Archive. Launched in 2001, it now has more than 351 billion web pages archived, with content dating back to 1996. Entering a URL such as https://www.si.edu displays the thousands of dates on which the page was crawled (also known as captured) by the Internet Archive’s software. The Wayback Machine also offers a feature called Save Page Now, which anyone can use to archive (capture) a particular web page. This works as long as the page allows web crawlers and doesn’t require a password.
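For readers who would rather look up a capture from a script than from the search box, here is a minimal sketch that uses the Wayback Machine's public availability lookup at https://archive.org/wayback/available. The helper names are hypothetical, and the response layout the parser expects (an `archived_snapshots` object with a `closest` entry) is an assumption based on how that lookup is commonly described, so check a live response before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(url, timestamp=None):
    """Build the lookup URL; timestamp is optional, e.g. '20060101'."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_capture(response):
    """Return (timestamp, replay_url) from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(url, timestamp=None):
    """Query the service over the network (requires internet access)."""
    query = build_availability_query(url, timestamp)
    with urlopen(query, timeout=30) as resp:
        return closest_capture(json.load(resp))
```

Calling `lookup("https://www.si.edu", "20060101")` asks for the capture closest to January 1, 2006, and returns its timestamp and playback address, or `None` if no capture is reported. Keeping the URL building and response parsing in separate functions lets you test them without a network connection.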

Image of a finding aid with images next to cartoon floppy disks next to certain website descriptions

There are a few ways you can find the websites the Smithsonian Institution Archives has in its collections.

Through the Smithsonian Institution Archives website, you can search the finding aids for “website records” using the collections search. A link at the bottom of each finding aid goes to the capture in Archive-It, a service the Smithsonian Institution Archives uses to crawl its web presence (Archive-It is owned by the Internet Archive).

Screenshot of archived websites, all including the name "Doug Aitken." The shot includes four websit

You can also go to the Archive-It site and search for the Smithsonian in the Collections box. Here, you can navigate through the various websites, including Birds of DC, Earth Optimism, and the Arts and Industries Building. The site also has a search feature that queries the text within pages or a URL itself. A search for the artist “Doug Aitken,” for instance, returns more than 2,000 hits across Smithsonian crawls. Archive-It has also archived web collections of various schools, universities, archives, libraries, and government agencies in the United States and sixteen other countries.

And then there is the Wayback Machine from the Internet Archive itself, which you can use to search for Smithsonian content. It’s helpful to have the URL, but a keyword can work in some instances as well. Links between pages are not always available when you use the Wayback Machine.

Smithsonian web collections older than 2012 are not online, but access can be provided by contacting the Archives’ reference staff. One thing to keep in mind is the fluid nature of the web: a web crawl is a snapshot in time. It’s possible that not everything was captured, or that a capture will not play back the way the live site did.

A man in a patterned, button-up shirt and jeans stands in front of a computer screen and other large

As a side note, I toured the Internet Archive in San Francisco in February while attending a conference on art libraries and web archives collections. Besides archiving the web, the Internet Archive is a free digital library whose other projects scan books and texts, digitize 78 rpm records, and share software, video, and audio collections.

Related Resources

What We Do: Web and Social Media Archiving, Smithsonian Institution Archives
The Archives Unleashed Project
International Internet Preservation Consortium
List of Web archiving initiatives, Wikipedia
Using Web Archives in Research, by Janne Nielsen, published by NetLab, 2016

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.
