This is a crawl or capture done of the Smithsonian Institution Traveling Exhibition’s website in January 2018. It serves as a snapshot of the website at that moment.

How Much is in That Terabyte?

2018 was a busy year for born-digital collections at the Archives

Recently, the Digital Services team at the Archives met with a visiting professional from China who works with digital objects at her museum. She was amazed at how much we do in the Archives, from sharing our history to conserving items to assisting researchers. The breadth and scope of our work involve every museum, research center, and office at the Smithsonian.

Fiscal year 2018, which ended September 30, was a very busy one for the Digital Services team when it comes to collections that contain born-digital materials (files created in digital form rather than digitized or scanned). While it is simple to picture rows of record storage boxes, it is harder to visualize the storage capacity of a gigabyte, terabyte, or even a petabyte. According to a blog post from the University of Oregon, a terabyte is about 250 movies, each with a two-hour running time, or 17,000 hours of music. One petabyte is 1,024 terabytes.
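For readers who like to see the arithmetic, the equivalences above can be sketched in a few lines of Python. This is only an illustration: the function name `describe` and the constant names are made up for this example, the movie and music figures are the rough estimates quoted above, and the gigabyte conversion assumes the same binary convention (1,024 per step) as the petabyte figure.

```python
# Binary storage units, matching "one petabyte is 1,024 terabytes".
# Assumption: the same factor of 1,024 applies between gigabytes and terabytes.
GB_PER_TB = 1024
TB_PER_PB = 1024

# Rough equivalences quoted from the University of Oregon blog post.
MOVIES_PER_TB = 250          # two-hour movies
MUSIC_HOURS_PER_TB = 17_000  # hours of music


def describe(terabytes: float) -> dict:
    """Translate a size in terabytes into more familiar terms."""
    return {
        "gigabytes": terabytes * GB_PER_TB,
        "petabytes": terabytes / TB_PER_PB,
        "two_hour_movies": terabytes * MOVIES_PER_TB,
        "hours_of_music": terabytes * MUSIC_HOURS_PER_TB,
    }


if __name__ == "__main__":
    for label, value in describe(1).items():
        print(f"{label}: {value:,}")
```

Calling `describe(1024)` gives the same picture for a full petabyte, which by these estimates would hold roughly 256,000 two-hour movies.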

The Smithsonian Archives collected more than 1 TB of born-digital data in 2018, but it was not just video and audio. These records also included images, drawings, email collections, and web and social media content from across the Smithsonian.

It is important for us to track what and how much we collect. This information is useful for reporting to Smithsonian administration, projecting storage needs, and determining preservation workflows and priorities.

Even though we are well into the 21st century, some of the files we received this year date back to the 1980s. Accession number SIA 18-100, the National Museum of Natural History (U.S.), Division of Mammals Correspondence, 1934-1939, 1956-2017 collection, contains some files from 1986. One is a text document with an extension of .Z-M rather than the common .txt seen today. File format recognition tools like JHOVE and DROID help us identify formats that are not always obvious.

Three men in mariachi outfits hold instruments and walk.

Here is a breakdown for Fiscal Year 2018:

  • Number of collections that contain born-digital materials – 147
  • Size – More than 1 TB (about 1,100 GB) of original source digital material. This is before any preservation work is done.
  • Number of files – 219,987
  • Number of formats – More than 60. This includes more than 80,000 JPG files. Other formats include PDFs, Microsoft Word documents, Flash files, and a few Lotus 1-2-3 files from 1994.
  • Largest file – 6 GB video file from a Smithsonian symposium in 2014.
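A breakdown like the one above can be approximated for any folder of files with a short script. The following is a minimal sketch, not the Archives' actual workflow: the function name `summarize` is hypothetical, and it groups files by extension, which is only a rough stand-in for real format identification. As the .Z-M example shows, extensions can mislead, which is why dedicated tools such as JHOVE and DROID are used for that step.

```python
from collections import Counter
from pathlib import Path


def summarize(root: str) -> dict:
    """Walk a directory tree and tally basic statistics about its files."""
    formats = Counter()
    total_bytes = 0
    file_count = 0
    largest_size = -1
    largest_file = None

    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue  # skip directories and other non-file entries
        size = path.stat().st_size
        file_count += 1
        total_bytes += size
        # Assumption: the extension approximates the format. It often does not.
        formats[path.suffix.lower() or "(none)"] += 1
        if size > largest_size:
            largest_size = size
            largest_file = path

    return {
        "file_count": file_count,
        "total_bytes": total_bytes,
        "formats": dict(formats),
        "largest_file": largest_file,
    }


if __name__ == "__main__":
    report = summarize(".")  # summarize the current folder as a demonstration
    print(f"Number of files: {report['file_count']:,}")
    print(f"Size: {report['total_bytes'] / 1024**3:.2f} GB")
    print(f"Number of extensions: {len(report['formats'])}")
    print(f"Largest file: {report['largest_file']}")
```

Running the script against a copy of the files, rather than the originals, keeps the source material untouched.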


Screenshot of a Smithsonian website. The page is titled "American Democracy: A Great Leap of Faith."

More than ninety of the collections were captures of Smithsonian websites or social media accounts. Many of the web collections can be found at the Archive-It website.

Fiscal Year 2019, which began October 1, is shaping up to be another busy year. So far, we have thirty-one collections with born-digital materials.


Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.