Manuscripts, field books, images, and documents are not the only resources for historical information. Researchers, scholars, and journalists are also making use of website and social media archives. There are now more than twenty years of website archives in a variety of collections, including the Smithsonian Institution Archives, University of North Texas CyberCemetery, and Internet Archive.
Archived web collections can be used in a variety of ways: finding a site or article that is no longer on the Internet, finding a web page that has changed its design, extracting/downloading data for analysis, or detecting trends. Penn State recently awarded grant money for projects using Twitter data. Some research areas include fake news, disaster relief, and health issues.
The best-known resource for web archives is the Wayback Machine from the Internet Archive. Launched in 2001, it now has more than 351 billion web pages archived and content dating back to 1996. Entering a URL such as https://www.si.edu will display thousands of dates on which the page was crawled (also known as captured) by the Wayback Machine's software. The Wayback Machine also offers Save Page Now, a feature anyone can use to capture a particular web page. This will work as long as that page allows web crawlers and doesn’t have a password requirement.
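For readers who want to look up captures programmatically rather than through the search box, here is a minimal Python sketch. It assumes the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`) and its usual replay and Save Page Now URL patterns; the function names are hypothetical, chosen only for illustration, and the endpoint's behavior may change over time.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoints and URL patterns of the Wayback Machine.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
REPLAY_PREFIX = "https://web.archive.org/web"
SAVE_PREFIX = "https://web.archive.org/save"


def availability_query(url, timestamp=None):
    """Build the request URL that asks for the capture closest to a date."""
    params = {"url": url}
    if timestamp:
        # Timestamps use the form YYYYMMDDhhmmss; a prefix such as YYYYMMDD also works.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_capture(response_text):
    """Return (timestamp, replay_url) from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def replay_url(url, timestamp):
    """Build a link that replays the capture nearest to the given timestamp."""
    return f"{REPLAY_PREFIX}/{timestamp}/{url}"


def save_page_now_url(url):
    """Build the link that requests a new capture through Save Page Now."""
    return f"{SAVE_PREFIX}/{url}"


def fetch_closest(url, timestamp=None):
    """Query the live service (requires network access)."""
    with urlopen(availability_query(url, timestamp)) as response:
        return closest_capture(response.read().decode("utf-8"))
```

Calling `fetch_closest("https://www.si.edu", "20060101")` would ask for the capture nearest to January 1, 2006, and return `None` if the page was never crawled. The parsing step is kept separate from the network call so it can be tried on a saved response.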
There are a few ways you can find the websites the Smithsonian Institution Archives has in its collections.
- Through the Smithsonian Institution Archives website you can search the finding aids for “website records” from the collections search. There is a link at the bottom of the finding aid that goes to the capture through Archive-It, a service the Smithsonian Institution Archives uses to crawl its web presence (Archive-It is owned by the Internet Archive).
- You can also go to the Archive-It site and search for the Smithsonian in the Collections box. Here, you can navigate through the various websites including Birds of DC, Earth Optimism, and the Arts and Industries Building. It also has a search feature that queries text within pages or a URL itself. A search for the artist “Doug Aitken,” for instance, returns more than 2,000 hits across Smithsonian crawls. Archive-It has also archived web collections of various schools, universities, archives, libraries, and government agencies in the United States and sixteen other countries.
- And then there is the Wayback Machine from the Internet Archive itself to search for Smithsonian content. It’s helpful to have the URL, but a keyword can work in some instances as well. Links between pages are not always available when you use the Wayback Machine.
Any web collections older than 2012 from the Smithsonian are not online, but access can be provided by contacting the Archives’ reference staff. One thing to keep in mind is the fluid nature of the web. A web crawl is a snapshot in time. It’s possible that not everything was captured, or that a page will not play back as it did on the live site.
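Because each crawl is only a snapshot, it can be useful to see how often a site was actually captured in a given period. The sketch below is one rough way to do that, assuming the Wayback Machine's CDX server endpoint (`https://web.archive.org/cdx/search/cdx`) with JSON output, where the first row of the response names the fields. The function names are hypothetical, and this covers only the Internet Archive's public captures, not the Smithsonian's offline collections.

```python
import json
from collections import Counter
from urllib.parse import urlencode

# Assumed public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query(url, year_from=None, year_to=None):
    """Build a CDX request listing captures of a URL, optionally within a year range."""
    params = {
        "url": url,
        "output": "json",
        # Ask only for the fields this sketch uses.
        "fl": "timestamp,original,statuscode",
    }
    if year_from is not None:
        params["from"] = str(year_from)
    if year_to is not None:
        params["to"] = str(year_to)
    return CDX_ENDPOINT + "?" + urlencode(params)


def captures_per_year(response_text):
    """Count captures per calendar year from a CDX JSON response."""
    if not response_text.strip():
        return {}
    rows = json.loads(response_text)
    if not rows:
        return {}
    header, records = rows[0], rows[1:]
    ts_index = header.index("timestamp")
    # Timestamps look like YYYYMMDDhhmmss, so the first four characters are the year.
    counts = Counter(record[ts_index][:4] for record in records)
    return dict(sorted(counts.items()))
```

A year with few or no captures does not mean the site was unchanged; it may simply not have been crawled, which is one reason to treat any trend drawn from capture counts with caution.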
As a side note, I had a tour of the Internet Archive in San Francisco in February while attending a conference on art libraries and web archives collections. Besides archiving the web, the Internet Archive is a free digital library with other projects that function to scan books and texts, digitize 78 rpm records, and share software, video, and audio collections.