Even though our physical office in Washington, D.C., is closed to staff and visitors due to the COVID-19 pandemic, the Smithsonian Institution Archives staff is able to work remotely on some projects. For those of us who work with the born-digital collections, this means we are continuing to focus on web archiving, reference requests for some accessible materials, and cataloging and metadata projects, while making sure various systems can be securely accessed.
It is also a time to catch up on backlogs and projects that had been put on the back burner, while adapting to the everyday challenges of technology limitations away from the office.
One of those ongoing projects is prep work for sharing more of our born-digital materials online. Winter intern Julie Rockwell recently wrote about some workflow and access ideas for the Archives. We are exploring what could be some engaging and interesting materials to post online and how we would do it.
How does this work start, though? DArcInfo (Digital Archives Information System), our in-house database of born-digital collection items, helps with this review. With it, we can sort collections by year, restriction status, file type, and other parameters to narrow the field to the best candidates.
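As a rough illustration of that kind of narrowing, here is a minimal Python sketch. It does not reflect how DArcInfo is built or queried; the `Item` record, its field names, and the sample accession numbers are all hypothetical and exist only to show the idea of filtering by restriction status, format, and year.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record layout, not the actual DArcInfo schema.
    accession: str
    year: int
    restricted: bool
    file_format: str


def candidates(items, formats, start_year, end_year):
    """Return unrestricted items in the wanted formats and year range, oldest first."""
    picked = [
        item for item in items
        if not item.restricted
        and item.file_format in formats
        and start_year <= item.year <= end_year
    ]
    return sorted(picked, key=lambda item: (item.year, item.accession))


# Made-up sample data for illustration only.
items = [
    Item("A-001", 1996, False, "WordPerfect 5.1"),
    Item("A-002", 2005, False, "TIFF"),
    Item("A-003", 1998, True, "WordPerfect 5.1"),
    Item("A-004", 1995, False, "WordPerfect 5.1"),
]

shortlist = candidates(items, {"WordPerfect 5.1"}, 1990, 1999)
print([item.accession for item in shortlist])  # ['A-004', 'A-001']
```

The restricted item drops out first, so only material that could in principle be shared moves on to the closer review described next.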
A few considerations for sharing born-digital files online include:
- Format – is the file in a format that is accessible and accurately renderable? For instance, a WordPerfect file requires specific software for viewing since it is a proprietary format. At the Archives, the preservation format for WordPerfect files is PDF/A or PDF. A PDF also serves as the access copy for that WordPerfect file. On the other hand, another file might be a mystery because our tools cannot identify it at this time (it could be corrupt, or it could be too old or too rare for current format identification tools to recognize). Such a file receives bit-level preservation but has no access copy available, so it is not a candidate to be posted at this time.
- Context – is there enough information within the file (an embedded caption or an accurate file name) or in other details to describe it? For example, a CD can have incorrect labels or no labels at all, and the files may be named IMG001, IMG002, IMG003, etc. In some cases, viewing the associated finding aid and other items (paper or digital) in the collection can provide some clues. It can be fun, though, to post a mystery photo to see if the public can identify people, objects, or places.
- Privacy issues – even unrestricted collections can contain sensitive data that should not be public. It is possible that a few files out of thousands contain personally identifiable information (PII). A careful review that involves both software tools and human intervention is necessary. There also might be intellectual property rights issues in some cases.
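To give a sense of what the software side of a privacy review can look like, here is a minimal Python sketch that flags text containing a few common PII patterns. It is not the tool the Archives uses; the pattern set and the `flag_pii` name are assumptions made for illustration, and the patterns cover only U.S.-style Social Security numbers, phone numbers, and email addresses.

```python
import re

# Illustrative patterns only; a real review would need a much broader set.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def flag_pii(text):
    """Return the sorted names of the patterns that match anywhere in text."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(text)
    )


print(flag_pii("Contact jdoe@example.org or 202-555-0100"))  # ['email', 'phone']
print(flag_pii("Trophy Award script, 1996"))                 # []
```

A scan like this only surfaces candidates. Pattern matching misses PII in unexpected forms and flags harmless strings, which is why a person still needs to review the results.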
The example here is from the National Air and Space Museum’s Office of Special Events. This collection has both paper and digital records and documents the National Air and Space Museum’s Trophy Awards. The images were on CDs labeled “NASM Awards” and the word-processing files were on 3.5” floppies with labels referring to 1995 and 1996 Trophy Award scripts.
Viewing the files from the CDs as thumbnails in a file explorer, along with their .tif extensions, makes it clear that they are images. There are no captions to identify the event or the people in this set, though, which is possibly from 2005.
The people in the photograph above are the NASM employees who were recognized for their work, in addition to the separate Trophy Awards. Unfortunately, there are no names and no photographer credit on the CD, on the paper folder it was in, or embedded within the files. I do recognize Gen. J.R. “Jack” Dailey, the former NASM director, in some of them. However, we’ll need to rely on a review of the entire collection or related collections, or talk to NASM staff, to identify the others. Another potential setback is that the metadata for the image has a creation date of 2002, meaning it’s possible that the photographs are not from a 2005 event and the CD was mislabeled.
The other files from this accession require more digging. This is where format tools assist in the detective work. Using JHOVE and DROID in this instance, we identified the files as WordPerfect 5.1. As noted above, the preservation and access copies for this file type are PDF/A or PDF, since those formats can be viewed more easily in an online environment.
More examples are WordPerfect files from programs held by the Smithsonian African American Association. Again, it isn’t immediately clear at a glance what these files might be. Note the creative extensions (or the lack of one), which were commonplace in the 1980s and 1990s: .mem for memo, .98 for a file created in 1998, and .ins for instructions. It is unclear what .NTC implies. This is another reason file extensions aren’t always an accurate indicator of what a file might be. The MAILFLYR file is about a mentor group program that was set for January 30, 1998.
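Format identification tools get around unreliable extensions by looking at the bytes inside the file instead. The Python sketch below is a greatly simplified illustration of that idea, not a description of how DROID or JHOVE work internally. It checks only the first few bytes against a handful of well-known signatures; the `sniff` and `sniff_file` names and the short signature table are assumptions made for this example.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"\xffWPC", "WordPerfect"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"%PDF-", "PDF"),
]


def sniff(header: bytes) -> str:
    """Guess a format from the leading bytes, ignoring the file name entirely."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unidentified"


def sniff_file(path):
    """Read the first bytes of the file at path and pass them to sniff()."""
    with open(path, "rb") as handle:
        return sniff(handle.read(8))


print(sniff(b"\xffWPC\x00\x00\x00\x00"))  # WordPerfect
print(sniff(b"Dear colleagues,"))          # unidentified
```

With a check like this, a file named with a .mem or .98 extension, or with no extension at all, would still be reported as WordPerfect as long as its header carries that signature.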
Stay tuned as we continue to work to share more of our born-digital materials.