A Peek into an Electronic Records Archivist’s Toolbox

When it comes to electronic records, there is no magic button that makes them readable or usable on a computer. Electronic records archivists rely on all types of hardware, software, and operating systems. Many pieces of software, which together function as an archivist’s toolbox, can help files remain available or become usable again. Here is a short list of some open-source and/or freely available software we use at the Smithsonian Institution Archives. Keep in mind that tools are not perfect and should be used with caution, and don’t forget to keep backups of your files. Before we incorporate a piece of software into our processes at the Archives, we research it to make sure it comes from a reputable group and test it thoroughly on copies of sample sets. This post is not an endorsement by the Smithsonian Institution of any products listed.

The Collaborative Electronic Records Project (CERP) parser outputs XML preservation copies of email.

The CERP parser

From 2005–2008, the Smithsonian Institution Archives and the Rockefeller Archive Center conducted the grant-funded Collaborative Electronic Records Project. The two institutions researched the long-term preservation challenges of messages and attachments within email collections.

CERP also worked with another email project, the Preservation of Electronic Mail Collaboration Initiative (EMCAP), and the two groups co-developed an XML preservation schema for email accounts. Essentially, this work made it possible to take an email account in a proprietary format, such as PST from Microsoft Outlook, and create an XML preservation copy of the entire account, including its messages and attachments. XML was chosen as the preservation format because it is human-readable, open, and self-describing.

CERP developed a parsing tool, written in the Smalltalk programming language, that creates the XML preservation copy following the schema noted above. The copy records the sender, date, subject, message body, and other details of each message. The CERP parser was created so that small- to mid-sized organizations could download the software and use it with their own email accounts.
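To make the idea of an XML preservation copy concrete, here is a minimal sketch in Python. It is not the CERP parser and does not follow the CERP/EMCAP schema: the element names (Message, Sender, Date, Subject, Body) and the sample message are made up for illustration, and attachments and encoded bodies are not handled.

```python
from email import message_from_string
import xml.etree.ElementTree as ET


def message_to_xml(raw: str) -> ET.Element:
    """Convert one plain-text email message into a small XML element tree."""
    msg = message_from_string(raw)
    root = ET.Element("Message")  # hypothetical element names, not the real schema
    for header, tag in (("From", "Sender"), ("Date", "Date"), ("Subject", "Subject")):
        ET.SubElement(root, tag).text = msg.get(header, "")
    # Use the first text/plain part as the body; other parts are ignored here.
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload()
            break
    ET.SubElement(root, "Body").text = body
    return root


# A made-up sample message.
SAMPLE = (
    "From: archivist@example.org\n"
    "Date: Mon, 01 Jan 2007 09:00:00 -0500\n"
    "Subject: Budget\n"
    "\n"
    "Please see the attached figures.\n"
)

root = message_to_xml(SAMPLE)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the result is plain XML, it can be read in any text editor, which is the property that made XML attractive as a preservation format.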

A crawl of the Archives of American Art's website with Heritrix.

Heritrix

We have written about Heritrix previously on the blog. This tool crawls websites and packages the output in preservation containers known as WARCs (Web ARChive files). The Archives uses Heritrix to crawl the nearly one hundred public websites maintained by the Smithsonian’s various museums, research centers, and other offices.

Benefits of Heritrix include:

  • WARCs are an international archival standard
  • WARCs contain useful information such as date, record id, content type, and other data
  • WARCs are easier to manage than hundreds of thousands of separate documents, pages, and assets from a website that was downloaded or copied
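The information a WARC carries is stored as named header fields at the start of each record. The sketch below parses one such header block with Python's standard library. The sample record is invented, and real WARC files usually hold many records and are often compressed, so production work would normally rely on a dedicated WARC library rather than this simplified reader.

```python
def parse_warc_header(block: bytes) -> dict:
    """Parse the header block of a single WARC record into a dictionary."""
    # The header block ends at the first blank line (CRLF CRLF).
    header_text = block.split(b"\r\n\r\n", 1)[0].decode("utf-8")
    lines = header_text.split("\r\n")
    fields = {"version": lines[0]}  # first line is the version, e.g. "WARC/1.0"
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split at the first colon only
        fields[name.strip()] = value.strip()
    return fields


# An invented record header for illustration.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Date: 2013-05-01T12:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: http://www.example.org/\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)

fields = parse_warc_header(SAMPLE_RECORD)
for name, value in fields.items():
    print(f"{name}: {value}")
```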

BWF MetaEdit allows metadata entry with audio files.

BWF MetaEdit

Our collections at the Archives also include audio, covering everything from Smithsonian concerts to workshop planning files to oral histories. These files are preserved as WAVs (Waveform Audio File Format), which is considered an audio preservation format because it is uncompressed; works on Windows, Mac, and Linux; and is widely used. BWF, or Broadcast Wave Format, is a European adaptation of Microsoft’s WAV that contains embedded metadata, which makes it more desirable as a preservation option.

BWF MetaEdit, which was developed by the Federal Agencies Digitization Guidelines Initiative (FADGI), allows users to create Broadcast Wave files from WAVs. Metadata can be added through its graphical user interface (GUI) or command line to create a valid Broadcast Wave file. These metadata fields include organization name, description, origination time of the file, and the software used to create the original WAV.
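In a Broadcast Wave file, this metadata lives in an extra chunk named "bext" inside the ordinary WAV container. The following Python sketch is not part of BWF MetaEdit; it simply walks the chunks of a WAV and reads a few fixed-width text fields from the bext chunk if one is present. The demo file is built in memory with invented values, and the function and key names are our own.

```python
import io
import struct
from typing import BinaryIO, Optional


def _text(raw: bytes) -> str:
    """Decode a fixed-width, NUL-padded ASCII field."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def read_bext(stream: BinaryIO) -> Optional[dict]:
    """Return selected 'bext' fields from a WAV stream, or None if absent."""
    start = stream.read(12)
    if len(start) < 12:
        raise ValueError("file too short to be a WAV")
    riff, _size, wave = struct.unpack("<4sI4s", start)
    if riff != b"RIFF" or wave != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return None  # reached the end without finding a bext chunk
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"bext":
            data = stream.read(chunk_size)
            return {
                "description": _text(data[0:256]),
                "originator": _text(data[256:288]),
                "originator_reference": _text(data[288:320]),
                "origination_date": _text(data[320:330]),
                "origination_time": _text(data[330:338]),
            }
        # Skip this chunk; chunks are padded to an even number of bytes.
        stream.seek(chunk_size + (chunk_size & 1), io.SEEK_CUR)


def _demo_wav() -> bytes:
    """Build a tiny in-memory WAV with a 'fmt ' chunk and a 'bext' chunk."""
    bext = (
        b"Oral history interview".ljust(256, b"\x00")
        + b"Example Archives".ljust(32, b"\x00")
        + b"REF0001".ljust(32, b"\x00")
        + b"2013-05-01"
        + b"12:00:00"
    ).ljust(602, b"\x00")  # pad out the remaining fixed-size fields
    fmt = struct.pack("<HHIIHH", 1, 1, 48000, 96000, 2, 16)
    body = (
        b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"bext" + struct.pack("<I", len(bext)) + bext
    )
    return b"RIFF" + struct.pack("<I", len(body)) + body


info = read_bext(io.BytesIO(_demo_wav()))
print(info)
```

A plain WAV without embedded metadata would make read_bext return None, which is one quick way to tell the two kinds of file apart.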

JHOVE and DROID

JHOVE and DROID are both useful file format identification tools used by archives, libraries, and other organizations. JHOVE is a collaboration between JSTOR and the Harvard University Library while DROID was developed by The National Archives of the United Kingdom.

These tools can be used together in some cases to determine an unknown file format. For example, when we receive digital files from other Smithsonian offices, older files are sometimes missing the three-letter identifying extension at the end of the file name. Without this information, it’s difficult to know whether a file called “budget” is a WordPerfect file or a spreadsheet.
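One common way to identify a file without relying on its name is to look at the first few bytes, which often contain a recognizable signature. The sketch below illustrates that general idea with a handful of well-known signatures. It is a simplification and not how JHOVE or DROID are implemented; those tools use far richer signature sets and validation. The function names are our own.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF", "PDF document"),
    (b"PK\x03\x04", "ZIP container (also used by .docx, .xlsx, .odt)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (older Word, Excel, PowerPoint)"),
    (b"\xffWPC", "WordPerfect document"),
    (b"RIFF", "RIFF container (for example WAV or AVI)"),
]


def identify(first_bytes: bytes) -> str:
    """Return a format label based on the leading bytes, or 'unknown'."""
    for magic, label in SIGNATURES:
        if first_bytes.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the start of a file on disk and identify it."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))


# A file named "budget" with no extension could be checked like this:
print(identify(b"\xffWPC\x10\x00\x00\x00"))
print(identify(b"plain text with no signature"))
```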

The Archives also developed a Java-based script that automates analyses of digital files using both JHOVE and DROID. The script generates outputs and file lists that help an archivist determine possible issues, such as a file with the wrong identifying extension.

Note: This script uses older versions of JHOVE and DROID but newer versions are currently being tested at the Archives.
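As an illustration of the kind of check such a script can perform, here is a short Python sketch that compares each file's extension with the format a tool reported for it. It is not the Archives' Java-based script: the input format (pairs of file name and format label), the lookup table, and the function name are all assumptions made for this example.

```python
from pathlib import PurePath

# Hypothetical lookup table: format label -> extensions we would expect to see.
EXPECTED_EXTENSIONS = {
    "PDF document": {".pdf"},
    "WordPerfect document": {".wpd", ".wp", ".wp5"},
    "WAV audio": {".wav"},
}


def find_mismatches(results):
    """Flag files whose extension is missing or disagrees with the reported format.

    `results` is an iterable of (file name, format label) pairs, such as might
    be extracted from an identification tool's report.
    """
    problems = []
    for filename, fmt in results:
        extension = PurePath(filename).suffix.lower()
        expected = EXPECTED_EXTENSIONS.get(fmt)
        if expected is None:
            problems.append((filename, "format not in lookup table"))
        elif extension == "":
            problems.append((filename, "missing extension"))
        elif extension not in expected:
            problems.append((filename, f"extension {extension} unexpected for {fmt}"))
    return problems


# Invented identification results.
report = find_mismatches([
    ("report.PDF", "PDF document"),
    ("budget", "WordPerfect document"),
    ("memo.pdf", "WordPerfect document"),
])
for filename, issue in report:
    print(f"{filename}: {issue}")
```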

The Duke Data Accessioner tool assists with copying and analyzing digital files.

Duke Data Accessioner

Many electronic files come to the Archives on removable media (CDs, DVDs, and, yes, 3.5” diskettes), which requires that we transfer the content to our backed-up servers for preservation and access. The Duke Data Accessioner (DDA) from Duke University is software that assists us with the initial work of ingesting (copying) the files off the media. After we enter some information about the collection and media, the tool recreates the directory structure found on the media and copies the records into it. DDA also runs JHOVE and DROID (see above) against the files for analysis and creates an XML file of this output, along with some additional preservation metadata based on the PREMIS standard.
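The copying step can be sketched in a few lines of Python. This is not the Duke Data Accessioner and makes no claim about how it works internally: the function names are our own, and the checksum comparison is an extra step added here to show one way of confirming that each copy matches its original. The demo uses a temporary folder in place of real removable media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def accession(source: Path, destination: Path) -> dict:
    """Copy a directory tree and return {relative path: checksum}.

    The destination must not already exist. Every copied file is compared
    with its original by checksum.
    """
    shutil.copytree(source, destination)  # preserves the directory structure
    manifest = {}
    for original in sorted(source.rglob("*")):
        if original.is_file():
            relative = original.relative_to(source)
            checksum = sha256_of(original)
            if sha256_of(destination / relative) != checksum:
                raise IOError(f"copy of {relative} does not match the original")
            manifest[relative.as_posix()] = checksum
    return manifest


# Demo with a stand-in for a mounted disk.
with tempfile.TemporaryDirectory() as tmp:
    media = Path(tmp) / "disk"
    (media / "reports").mkdir(parents=True)
    (media / "reports" / "budget.txt").write_text("FY2007 figures\n")
    manifest = accession(media, Path(tmp) / "accession" / "disk")
    copied_ok = (Path(tmp) / "accession" / "disk" / "reports" / "budget.txt").is_file()

print(manifest)
```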

Other tips

If you are interested only in viewing a file and not opening it, try searching the Internet for “viewer” and “old files.” For more information on detecting file formats, search for “file identification.” Some online tools will attempt to detect what a mystery file might be.

Software that enables digital files to last for the long term with authenticity and integrity intact can be a lifesaver. Nevertheless, it is not a replacement for copies, backups, and migrations of important files to new software and hardware.

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.