Web and Social Media Archiving

[Image: table of fifteen Smithsonian homepage screenshots]

The Smithsonian embraced the potential of the Internet early on. The National Museum of Natural History was using Gopher technology in 1993, and that same year the Smithsonian Astrophysical Observatory launched its Telescope Data Center website, one of the first 250 websites on the Internet and one that is still active today. The Institution launched its homepage in 1995 to much fanfare, and it now has hundreds of public websites representing its nineteen museums, nine research centers, zoo, archives, libraries, and cultural heritage programs.

The Smithsonian Institution Archives (SIA) began archiving websites because they serve as the public face of the Smithsonian and document the history of the Institution. Websites and social media accounts contain information often not available anywhere else. These records also demonstrate the evolution of web design and the rise of social media in the early 21st century.

SIA began researching options for website archiving and preservation in the early 2000s, and its website and social media collections have since been curated and captured in a variety of ways:

  • Transferring website files in their original formats (HTML, images, stylesheets, etc.) from the Office of the Chief Information Officer, usually from a content management system. For a short period, HTML files were preserved as XHTML files, and in some cases web records are still transferred to SIA when a website is shut down.
  • Crawling websites (a crawler is software that traverses the web automatically to capture content; see the sketch after this list)
    • HTTrack (files captured in original format)
    • Heritrix (ARC and WARC files)
    • Archive-It (WARC files)
    • Webrecorder (WARC files)
  • Capturing social media accounts including Facebook, Twitter, and YouTube
    • Heritrix (ARC and WARC files)
    • PDF and screenshot captures
    • TAGS (Twitter Archiving Google Sheet) and other online platforms, some no longer available, that capture tweets in spreadsheet form
    • Archive-It (WARC files)
    • Webrecorder (WARC files)
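
As a minimal sketch of what crawl-to-WARC capture looks like under the hood, the snippet below uses the open-source warcio Python library to record a single page fetch into a WARC file. The target URL and output filename are illustrative, and this simplified stand-in is not the configuration SIA uses with Heritrix, Archive-It, or Webrecorder.

    # Minimal single-page capture to a WARC file (pip install warcio requests).
    # URL and filename are illustrative examples only.
    from warcio.capture_http import capture_http
    import requests  # warcio requires requests to be imported after capture_http

    # Every HTTP request made inside this block is written to the named
    # WARC file as paired request/response records with metadata headers.
    with capture_http('si-homepage.warc.gz'):
        requests.get('https://www.si.edu/')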

SIA’s current procedures use the Archive-It service and Webrecorder. Both tools produce archival container files known as WARCs (Web ARChive files), which package the data gathered during a crawl together with metadata headers recording the crawl date and the URL that was captured. The WARC format is an international standard (ISO 28500) and requires replay software, such as the Wayback Machine, to be viewed. Both SIA and Archive-It hold copies of these WARC files, and crawls done with Archive-It are usually publicly accessible within 24 hours unless restricted.
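
To make the container structure concrete, the following sketch (again assuming the open-source warcio library and an illustrative local filename) reads a WARC file back and prints the URL and crawl date recorded in each response record's metadata headers:

    from warcio.archiveiterator import ArchiveIterator

    # Walk the records in a WARC file; each response record carries headers
    # such as WARC-Target-URI (the page) and WARC-Date (the crawl time).
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'),
                      record.rec_headers.get_header('WARC-Date'))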

Web crawling produces a snapshot of a site at a particular point in time and should be considered a best-effort capture: not everything can be collected if, for example, a webmaster has blocked the crawler or a server is down. Complex sites with dynamic content pose a further challenge, as they may not play back correctly in a viewer.
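
One common cause of incomplete captures, a site blocking crawlers via its robots.txt file, can be checked ahead of time. The sketch below uses Python's standard-library robotparser; the user-agent name is a hypothetical example:

    from urllib import robotparser

    # Ask whether a given crawler user-agent may fetch a page under the
    # site's robots.txt rules. 'ExampleArchiveBot' is a hypothetical name.
    rp = robotparser.RobotFileParser()
    rp.set_url('https://www.si.edu/robots.txt')
    rp.read()  # download and parse robots.txt
    print(rp.can_fetch('ExampleArchiveBot', 'https://www.si.edu/'))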
