Archiving the Smithsonian’s presence on the Internet

Have you been to the Smithsonian Institution lately? If you are reading The Bigger Picture, then the answer is yes. The Smithsonian Institution is not only its nineteen museums, nine research centers, and zoo; hundreds of public websites and social media sites (blogs, Facebook, Twitter, MySpace, etc.) are the Smithsonian, too. While there were 30 million physical visits to the Smithsonian’s facilities in 2009, there were also 115 million unique visits to our various websites, which allow the Smithsonian to connect with people all over the world who are unable to visit in person.

The SI Archives is responsible for preserving many of these websites, since they contain valuable information that documents the history of the Institution. Some Smithsonian websites and social media sites (blogs like this one, for example) contain content not duplicated anywhere else. Preservation work involves getting a copy of the site and trying to ensure it will be viewable years from now (at this time, archived sites are not available online from us).

The Smithsonian Tropical Research Institute 1996 homepage, SIA Accession 05-032.

We have been capturing Smithsonian websites (copies of the content and files) since the late 1990s through various methods. We now conduct our captures, otherwise known as crawls, with Heritrix, an open-source tool available to all. A crawler visits one or more web pages and retrieves the associated content. Various parameters can be set within the software to ensure that only specific items and links are captured. The results are saved in WARC (Web ARChive), a file format known as an archival container. (The WARC format became an international standard in 2009.) The crawls we are conducting give us a snapshot of what a particular Smithsonian website looked like at a specific point in time. It is not feasible for us to capture every update to a website, nor is it necessary.
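For readers curious about what "only specific items/links are captured" means in practice, here is a minimal sketch in Python of the general idea of crawl scoping. It is not Heritrix (which is a Java application with its own configuration) and it is not how our crawls are set up. The function names, the sample page, and the host name www.example.edu are all made up for illustration. The sketch reads the links out of a page and keeps only the ones that point to an approved host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into full URLs.
                self.links.append(urljoin(self.base_url, value))


def in_scope(url, allowed_hosts):
    """Return True if the URL is http(s) and its host is on the approved list."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts


def links_to_capture(html, base_url, allowed_hosts):
    """Return the in-scope links found in a page, in order, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    seen = set()
    result = []
    for link in collector.links:
        if in_scope(link, allowed_hosts) and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Hypothetical page content and host name, for illustration only.
SAMPLE_PAGE = (
    '<a href="/about">About</a>'
    '<a href="http://other.example.org/x">Elsewhere</a>'
    '<a href="mailto:someone@example.edu">Email</a>'
    '<a href="/about">About again</a>'
)

if __name__ == "__main__":
    print(links_to_capture(SAMPLE_PAGE, "http://www.example.edu/", {"www.example.edu"}))
```

A real crawler would go on to fetch each in-scope link, repeat the process on the pages it finds, and write what it retrieves into a container file such as WARC. The sketch stops at the scoping decision.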
The Smithsonian Tropical Research Institute 2010 homepage, SIA Accession 10-125.

Heritrix is the same crawler used by the Internet Archive, a nonprofit founded in 1996 with the goal of creating a global digital library of books, movies, music, and websites. If you have ever tried to find a website from years ago, you may have used the Internet Archive’s Wayback Machine (enter a URL to see a website as it appeared on various dates in the past; for example, see the Smithsonian homepage from 2000). Other organizations, archives, and universities also have their own web archive collections. The Library of Congress’ web archive Minerva features certain topics such as 9/11, U.S. elections, and international conflicts.

While our web archiving is limited to our own Smithsonian sites, we are also interested in capturing some of the content that has proliferated on various social media sites, such as Facebook, Twitter, and blogs, including this one. The Library of Congress and Twitter announced earlier this year that Twitter would donate all public tweets to the Library of Congress. Nevertheless, we still need our own archive of content and believe it is important to document that the Smithsonian was using these various social media sites.

As with all digital content, there are challenges in capturing material:

  • Web pages are constantly changing with content updates and deletions.
  • The web archiving tools are not perfect and can be quite technical. Capturing video or other rich media (flashing logos, audio, etc.) does not always work. This means crawls of websites can be incomplete.
  • New social media tools are being launched rapidly. We need to stay aware of the latest and greatest Web 2.0 tools and other technologies Smithsonian employees use.

The Smithsonian Institution launched its Home Page on the World Wide Web on May 8, 1995.

The Smithsonian’s presence on the Web has come a long way from its main homepage launch in 1995. Marc Pachter*, who was Smithsonian counselor for electronic communications at the time, wrote about the launch of the Smithsonian homepage in Cultural Resource Management magazine in 1995. He wrote, “No one yet understands the full potential of this medium. Within our first 13 weeks we registered over 4 million ‘hits’ on the homepage …” Yet, he aptly added that Smithsonian audiences want to see more of what we have and to interact with us more. This certainly has proven true.

*Pachter was the National Portrait Gallery's Director from 2000-2007, and worked at the Smithsonian for 33 years.

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.