
Another Year of Collecting Online History

Changing museum hours and new websites meant another busy year for Smithsonian website archiving.

Another year has passed since we last wrote about website collecting during the COVID-19 pandemic. Over the last 12 months, we have continued to document the Smithsonian Institution’s activities during this challenging time, and 2021 again demonstrated the importance of archival web collections.

Website and social media archives are treasure troves of information that sometimes is not available elsewhere, or exists only in a physical location that is difficult to access or travel to, especially during the past few years. For instance, Saving Ukrainian Cultural Heritage Online (SUCHO) is a global volunteer effort capturing cultural heritage websites and other online materials from Ukraine before they possibly disappear during Russia’s invasion of the country. Cultural heritage in both its physical and digital forms is important to save.

Screenshot of the Vaccines and Us archived website, showing a banner noting it was archived on August 5, 2021.

New websites and projects launched at the Smithsonian reflect the current state of society: Vaccines and Us (April 2021), Our Shared Future: Reckoning with Our Racial Past (summer 2021), and the yearlong kickoff marking the Smithsonian’s 175th anniversary (August 2021). Websites also change appearance over time, and it is key to document as much of these sites as possible for the historical record. The websites of the Anacostia Community Museum and the Architectural History & Historic Preservation Division are just two that were recently redesigned, and both have been archived in their previous and new versions.

The Architectural History & Historic Preservation Division archived website before and after its redesign, captured November 8, 2021, and December 3, 2021.


Like everyone else across the globe, the Smithsonian closely followed the pandemic’s toll, which meant altered hours and temporary closures at the museums in response to surges in cases, for the health and safety of staff and visitors. This led to smaller, more focused collecting runs, known as crawls, on a more frequent schedule to capture that information as well. Instead of crawling an entire museum website, the target was the main homepage and/or the visit page noting the revised visitor hours.

The Archive-It website shows crawled Smithsonian websites and associated metadata.


We also were able to update some of the metadata entries that go along with the collected websites on the Archive-It website. This additional information should help researchers and staff with their searches. The metadata follows the Dublin Core standard and includes the creator (office or museum) of the website, the date of the first crawl, and an identifier, which is the collection number(s) assigned to the crawled site (or URL) by the Archives. For instance, a researcher looking for the collected websites created by the National Postal Museum can select that name under the “Creator” listing on the left to view its seven websites. This searchable metadata is just another way for someone to discover these collections.
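To illustrate the structure described above, here is a minimal sketch of a Dublin Core record for one crawled website, built with Python's standard XML library. The creator, date, and identifier values are hypothetical examples, not actual Archives records.

```python
import xml.etree.ElementTree as ET

# Dublin Core element-set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Build one illustrative record with the three fields described above.
record = ET.Element("record")
for field, value in [
    ("creator", "National Postal Museum"),   # office or museum
    ("date", "2021-03-15"),                  # hypothetical date of first crawl
    ("identifier", "Accession 21-000"),      # hypothetical collection number
]:
    elem = ET.SubElement(record, f"{{{DC_NS}}}{field}")
    elem.text = value

print(ET.tostring(record, encoding="unicode"))
```

Running this prints the record as XML with `dc:`-prefixed elements, the form in which Dublin Core metadata is commonly exchanged.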

Screenshot of the Netlytic website showing tweets from Smithsonian Secretary Lonnie Bunch's Twitter account.

We also worked with other web capturing tools to supplement what is archived with Archive-It and make sure we are getting as much as we can. Conifer, Webrecorder applications such as Browsertrix, and Netlytic have become a standard part of our web archiving software toolbox. Webrecorder can capture some portions of internal websites that the Archive-It tool cannot access, and Netlytic collects tweets with specific Smithsonian hashtags like #Smithsonian175 or #BecauseofHerStory. We also continue to monitor other web archiving initiatives as they are developed.
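Hashtag-based collection of the kind Netlytic performs can be sketched very simply: keep only the posts whose text contains one of the tracked hashtags. The posts below are made up for illustration; this is not Netlytic's actual implementation.

```python
# Hashtags the (hypothetical) collection tracks.
TRACKED = {"#Smithsonian175", "#BecauseofHerStory"}

def matches(post_text: str) -> bool:
    """Return True if the post mentions any tracked hashtag (case-insensitive)."""
    text = post_text.lower()
    return any(tag.lower() in text for tag in TRACKED)

# Illustrative posts, not real collected data.
posts = [
    "Celebrating 175 years of the Smithsonian! #Smithsonian175",
    "A lovely day at the museum.",
]
collected = [p for p in posts if matches(p)]
print(len(collected))  # 1
```

A real collector would pull posts from a platform API on a schedule, but the filtering step is essentially this membership test.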

From March 2021 to March 2022, we crawled 800 URLs, known as seeds (some were repeats over time with the changing museum hours). This tally does not include content captured with other applications. The Archive-It process involves doing a test crawl of the site, reviewing the test, saving it or running another, and then reviewing the saved crawl and doing additional crawls, known as patches, to retrieve items or pages that were not captured during the first round. Scoping, or tweaking the crawl's rules, can help exclude unwanted content. Without a test crawl, a crawl could capture too much data not pertinent to the website, such as unrelated YouTube videos collected as the crawler worked through the URL seed.
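The scoping idea above can be illustrated with a simple sketch: given a seed's host and a list of excluded URL patterns, decide whether a discovered URL should be captured. The rules and URLs here are hypothetical; real Archive-It scoping is configured in its interface and is considerably more elaborate.

```python
from urllib.parse import urlparse

def in_scope(url: str, seed_host: str, excludes: list[str]) -> bool:
    """Toy scoping rule: stay on the seed's host and skip excluded patterns."""
    parsed = urlparse(url)
    if parsed.netloc != seed_host:  # leave other sites (e.g. YouTube) uncrawled
        return False
    return not any(pattern in url for pattern in excludes)

# Hypothetical example: exclude a calendar section found during a test crawl.
excludes = ["/calendar/", "youtube.com"]
print(in_scope("https://example.si.edu/visit", "example.si.edu", excludes))         # True
print(in_scope("https://www.youtube.com/watch?v=abc", "example.si.edu", excludes))  # False
```

Reviewing a test crawl is what reveals which patterns belong in the exclude list before the production crawl runs.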

While we are preparing for our return to the physical office soon, our web collecting efforts will continue to be part of our mission to document and share Smithsonian history. 

Note: The Smithsonian does not endorse any software applications. 


Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.