The Bigger Picture: Visual Archives and the Smithsonian
Digital Video Preservation: Identifying Containers and Codecs
In addition to a rich collection of analog moving image material currently being digitized, the Smithsonian Institution Archives (SIA) accessions large quantities of born-digital video from various hard drives, CDs, DVDs, and websites across the Institution. Just as digitization preserves moving image content before it degrades on an analog carrier, digital material must be retrieved from optical media and other storage devices before hardware failure or degradation renders the content unplayable. The digital video files selected for archiving, like other electronic records at the Archives, are then incorporated into digital preservation workflows to ensure that the files will remain playable for future generations.
Ensuring the longevity of digital video at the Smithsonian begins with inventorying a collection of video files and capturing the technical information (codec, resolution, frame rate, etc.) for each file so its format can be identified. This process provides the means for assessing what may be at risk of obsolescence and for determining what needs to be prioritized for preservation. Finally, because thousands of video files have already been accessioned, and that number could grow exponentially in the coming years, understanding what’s in the Archives’ collections is key to developing priorities, better management practices, and preservation strategies for digital video.
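As a rough illustration of what capturing this technical information can look like, the sketch below reads a JSON report in the shape that the ffprobe command-line tool produces with `-print_format json -show_format -show_streams`, and pulls out the container, codecs, resolution, and frame rate. The sample report and the function name `summarize_report` are invented for this example; it is not a description of the tools or workflow actually used at the Archives.

```python
import json
from fractions import Fraction

# Hypothetical report for an imaginary file, shaped like ffprobe's JSON output.
SAMPLE_REPORT = """
{
  "streams": [
    {"codec_type": "video", "codec_name": "mpeg2video",
     "width": 720, "height": 480, "r_frame_rate": "30000/1001"},
    {"codec_type": "audio", "codec_name": "mp2"}
  ],
  "format": {"filename": "example.mpg", "format_name": "mpeg"}
}
"""


def summarize_report(report_json):
    """Reduce one JSON report to the fields an inventory row needs."""
    report = json.loads(report_json)
    summary = {
        "container": report.get("format", {}).get("format_name"),
        "video_codec": None,
        "audio_codec": None,
        "resolution": None,
        "frame_rate": None,
    }
    for stream in report.get("streams", []):
        kind = stream.get("codec_type")
        if kind == "video" and summary["video_codec"] is None:
            summary["video_codec"] = stream.get("codec_name")
            if "width" in stream and "height" in stream:
                summary["resolution"] = f'{stream["width"]}x{stream["height"]}'
            # Frame rates are reported as a fraction string such as "30000/1001".
            rate = stream.get("r_frame_rate")
            if rate and not rate.endswith("/0"):
                summary["frame_rate"] = round(float(Fraction(rate)), 2)
        elif kind == "audio" and summary["audio_codec"] is None:
            summary["audio_codec"] = stream.get("codec_name")
    return summary


if __name__ == "__main__":
    print(summarize_report(SAMPLE_REPORT))
```

Only the first video stream and the first audio stream are recorded here, which is a simplification; a file with several streams of each kind would need a row per stream.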
Digital video files wrap video and audio streams in a container, or wrapper, that is typically identified by the file’s extension, which is important for archivists to keep track of. Because uncompressed video is so large, streams are often compressed to more manageable sizes via a compressor/decompressor program called a codec. Media player applications like Windows Media, RealPlayer, QuickTime, and VLC will detect a codec type and access a program to decode the video and audio streams for playback. Some codec types are lossless, meaning the compression is mathematically reversible and no data is lost in the compression process. Other compression techniques are lossy: they discard some data, but they are an effective means of providing high-quality access copies. However, because data loss weighs heavily on an archivist’s conscience, any digital video format used for preservation will utilize lossless compression or no compression at all.
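The difference between lossless and lossy compression can be shown with a toy example. The sketch below uses Python's general-purpose `zlib` compressor as a stand-in for a lossless codec, and a crude rounding step as a stand-in for a lossy one. Real video codecs are far more sophisticated than this; the function names and the sample "pixel" data are invented purely for illustration.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress; the output is bit-for-bit identical to the input."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(data: bytes, step: int = 16) -> bytes:
    """Toy lossy scheme: round each byte down to a multiple of `step` before
    compressing. The discarded detail cannot be recovered afterwards."""
    quantized = bytes((b // step) * step for b in data)
    return zlib.decompress(zlib.compress(quantized))


# Hypothetical stand-in for a strip of 8-bit pixel values.
pixels = bytes(range(256)) * 4

if __name__ == "__main__":
    print("lossless copy identical:", lossless_round_trip(pixels) == pixels)
    print("lossy copy identical:   ", lossy_round_trip(pixels) == pixels)
```

The lossy copy has the same length as the original and looks broadly similar, but it no longer matches it exactly, which is why a format like this would be acceptable for an access copy and not for a preservation master.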
This video from Smithsonian Institution Archives Accession 11-014 is an excerpt of a larger video file found on a CD-ROM created in 2000 by the National Museum of Natural History. Before being converted for playback on the web, the video was in an MPG file container and was compressed with the MPEG-2 codec, which is playable in most media players. Initially, it was converted to MOV but would not play in YouTube. This version is now in WebM for YouTube playback.
For my internship at the Archives, I inventoried a variety of video files, taking note of each file’s container and codec types. The inventory yielded almost ten thousand video files with over twenty different container types, fifty video codec types, and twenty audio codec types, all of which were tested for playback in the Windows Media, RealPlayer, QuickTime (Mac and Windows), and VLC media players. As it turned out, some 20 percent of those video files would not play in any of those four media player applications. Surprisingly, some of the relatively modern codec types were more susceptible to playback issues than the older, more obscure codecs, which appear to have more established support in consumer media players. As you might imagine, a file that can only play in one media player is at greater risk of format and software obsolescence.
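An inventory like this lends itself to simple tallies. The sketch below, built on a handful of invented records, counts files by container or codec, works out what share of files play in no player at all, and lists the files that play in one player or fewer. The record layout, file names, and function names are assumptions made for the example and do not reflect the actual inventory.

```python
from collections import Counter

# Invented records; "plays_in" holds the players in which playback succeeded.
inventory = [
    {"file": "clip_a.mpg", "container": "MPG", "video_codec": "MPEG-2",
     "plays_in": {"Windows Media", "QuickTime", "VLC"}},
    {"file": "clip_b.rm", "container": "RM", "video_codec": "RealVideo",
     "plays_in": {"RealPlayer"}},
    {"file": "clip_c.mov", "container": "MOV", "video_codec": "H.264",
     "plays_in": set()},
    {"file": "clip_d.mpg", "container": "MPG", "video_codec": "MPEG-2",
     "plays_in": {"Windows Media", "VLC"}},
]


def tally(records, field):
    """Count how many files share each value of `field`."""
    return Counter(record[field] for record in records)


def unplayable_share(records):
    """Fraction of files that played in none of the tested players."""
    if not records:
        return 0.0
    unplayable = sum(1 for record in records if not record["plays_in"])
    return unplayable / len(records)


def at_risk(records, max_players=1):
    """Files that play in `max_players` or fewer players."""
    return [r["file"] for r in records if len(r["plays_in"]) <= max_players]


if __name__ == "__main__":
    print(tally(inventory, "container"))
    print(unplayable_share(inventory))
    print(at_risk(inventory))
```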
This excerpted video, from Smithsonian Institution Archives Accession 05-173, was accessioned as a result of efforts to preserve the Smithsonian 150th anniversary website. The file was created in 1997. At first I could only get it to play both video and sound in RealPlayer, but I was eventually able to get it to play in YouTube.
I used various software applications to analyze and identify video and audio streams and their respective codecs, but each application had its own nomenclature for identifying a video file and its streams. Terms like “format,” “format name,” “format profile,” “compressor name,” “codec,” and “codec ID” were all used to identify a codec. Resources like Wikipedia and MultimediaWiki turned out to be helpful in keeping a consistent scheme for the identification of codecs and addressing discrepancies between identification tools.
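One way to keep a consistent scheme across tools is to fold their differing field names and codec labels into a single vocabulary. The sketch below is a minimal version of that idea: it treats any of the field names quoted above as "the codec field" and maps a few label variants onto one preferred name. The lookup table is a tiny illustrative sample, and the function names are invented for this example.

```python
# Field names that different tools may use for the same piece of information.
CODEC_KEYS = {
    "format", "format name", "format profile",
    "compressor name", "codec", "codec id",
}

# Illustrative sample of label variants mapped to one preferred name.
CANONICAL_NAMES = {
    "mpeg2video": "MPEG-2",
    "mpeg-2 video": "MPEG-2",
    "avc1": "H.264",
    "h264": "H.264",
}


def normalize_key(key):
    """Lowercase a field name and treat underscores as spaces."""
    return key.strip().lower().replace("_", " ")


def canonical_codec(record):
    """Return a consistent codec name for one tool's stream record.

    Labels missing from the lookup table are passed through unchanged
    so they can be reviewed by hand. Returns None if no codec field exists.
    """
    for key, value in record.items():
        if normalize_key(key) in CODEC_KEYS:
            label = str(value).strip()
            return CANONICAL_NAMES.get(label.lower(), label)
    return None


if __name__ == "__main__":
    print(canonical_codec({"Codec ID": "avc1"}))
    print(canonical_codec({"compressor_name": "MPEG2Video"}))
```

A real table would need care with terms like "format," which some tools apply to the container and not to the codec, so records for the file as a whole and for each stream are best normalized separately.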
Automating the capture of all this technical metadata is crucial in accessioning digital video into the Archives, especially as this data will serve as a key tool for managing and accessing these assets and for making preservation-related decisions throughout their lives in the Archives.