How Much is in That Terabyte?

Recently the Digital Services team at the Archives met with a visiting professional from China who works with digital objects at her museum. She was amazed at how much we do in the Archives, from sharing our history to conserving items to assisting researchers. The breadth and scope of our work involve every museum, research center, and office at the Smithsonian.

Fiscal year 2018, which ended September 30, was a very busy one for the Digital Services team when it comes to collections that contain born-digital materials (files created digitally rather than digitized or scanned). While it is easy to picture rows of record storage boxes, it is harder to visualize the storage capacity of a gigabyte, terabyte, or even a petabyte. According to a blog post from the University of Oregon, a terabyte holds about 250 two-hour movies or 17,000 hours of music. One petabyte is 1,024 terabytes.
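To make these units a little more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes binary units throughout (1 TB = 1,024 GB, matching the petabyte figure above) and takes the University of Oregon estimate of 250 movies per terabyte at face value. The function names are illustrative only.

```python
# Binary units, matching the "1 petabyte = 1,024 terabytes" figure above.
GB_PER_TB = 1024
TB_PER_PB = 1024

# Estimate quoted above: roughly 250 two-hour movies per terabyte.
MOVIES_PER_TB = 250


def gb_per_movie() -> float:
    """Approximate size of one two-hour movie, in gigabytes."""
    return GB_PER_TB / MOVIES_PER_TB


def movies_per_petabyte() -> int:
    """Approximate number of two-hour movies that fit in one petabyte."""
    return MOVIES_PER_TB * TB_PER_PB


print(f"One movie is roughly {gb_per_movie():.1f} GB")
print(f"One petabyte holds roughly {movies_per_petabyte():,} movies")
```

By this estimate, a single movie comes to about 4 GB, and a petabyte would hold on the order of a quarter of a million of them.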

The Smithsonian Archives collected more than 1 TB of born-digital data in 2018, but it was not just video and audio. These records also included images, drawings, email collections, and web and social media content from across the Smithsonian.

It is important for us to track what and how much we collect. This information is useful for reporting to Smithsonian administration, projecting storage needs, and determining preservation workflows and priorities.

Even though we are well into the 21st century, some of the files we received this year date back to the 1980s. Accession number SIA 18-100, the National Museum of Natural History (U.S.), Division of Mammals Correspondence, 1934-1939, 1956-2017 collection, contains some files from 1986. One is a text document with an extension of .Z-M rather than the common .txt seen today. File format recognition tools like JHOVE and DROID help us identify formats that are not obvious from the file extension alone.
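The general idea behind identifying a file by its contents rather than its extension can be sketched in a few lines of Python. This is a simplified illustration, not how JHOVE or DROID are implemented; the signature table below is a tiny hand-picked sample, and the function names are hypothetical.

```python
# A tiny, hand-picked table of well-known file signatures ("magic numbers").
# Real tools rely on far larger, curated signature registries.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container (also used by .docx and .xlsx)"),
]


def identify(header: bytes) -> str:
    """Guess a format from the first bytes of a file, ignoring its extension."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

A plain text file, such as the 1986 document mentioned above, has no signature of this kind, so a sketch like this would report it as unknown. That is one reason dedicated tools combine several identification methods.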

Three men in mariachi outfits hold instruments and walk.

Here is a breakdown for Fiscal Year 2018:

Number of collections that contain born-digital materials – 147
Size – More than 1 TB (about 1,100 GB) of original source digital material. This is before any preservation work is done.
Number of files – 219,987
Number of formats – More than 60. This includes more than 80,000 JPG files. Other formats include PDFs, Microsoft Word documents, Flash files, and a few Lotus 1-2-3 files from 1994.
Largest file – 6 GB video file from a Smithsonian symposium in 2014.
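A quick calculation helps put these totals in perspective. The Python sketch below uses only the figures from the breakdown above and assumes binary units (1 GB = 1,024 MB). Because the JPG count is given as "more than 80,000," the share it computes is a lower bound.

```python
# Figures taken from the fiscal year 2018 breakdown above.
TOTAL_GB = 1100
TOTAL_FILES = 219_987
JPG_FILES = 80_000  # "more than 80,000", so this is a lower bound

MB_PER_GB = 1024  # assumption: binary units


def average_file_size_mb(total_gb: float, file_count: int) -> float:
    """Mean file size in megabytes."""
    return total_gb * MB_PER_GB / file_count


def share(part: int, whole: int) -> float:
    """Fraction of the whole represented by part."""
    return part / whole


print(f"Average file size: about {average_file_size_mb(TOTAL_GB, TOTAL_FILES):.1f} MB")
print(f"JPG share of all files: at least {share(JPG_FILES, TOTAL_FILES):.0%}")
```

The average works out to roughly 5 MB per file, which shows how far a single 6 GB video sits from the typical file in these collections.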

Screenshot of a Smithsonian website. The page is titled "American Democracy: A Great Leap of Faith."

More than ninety of the collections were captures of Smithsonian websites or social media accounts. Many of the web collections can be found at the Archive-It website.

Fiscal Year 2019, which began October 1, is shaping up to be another busy year. So far we have thirty-one collections with born-digital materials.

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.
