Smithsonian Institution Archives

The Bigger Picture: Visual Archives and the Smithsonian

A Peek into an Electronic Records Archivist’s Toolbox

by Lynda Schmitz Fuhrig on November 29, 2011

When it comes to electronic records, there is no magic button that makes them readable or usable on a computer. Electronic records archivists rely on all types of hardware, software, and operating systems. Many pieces of software, which together function as an archivist’s toolbox, can help files remain available or become usable again. Here is a short list of some open-source and/or freely available software we use at the Smithsonian Institution Archives. Keep in mind that tools are not perfect and should be used with caution. Don’t forget to have backups of your files. Before we incorporate a piece of software into our processes at the Archives, we research it to make sure it comes from a reputable group, and we test it thoroughly on copies of sample sets. This post is not an endorsement by the Smithsonian Institution of any products listed.

The Collaborative Electronic Records Project (CERP) parser outputs XML preservation copies of email.

The CERP parser

From 2005 to 2008, the Smithsonian Institution Archives and the Rockefeller Archive Center conducted the grant-funded Collaborative Electronic Records Project. The two institutions researched the long-term preservation challenges of messages and attachments within email collections.

CERP was also able to work with another email project, the Preservation of Electronic Mail Collaboration Initiative (EMCAP), and the two groups co-developed an XML preservation schema for email accounts. Essentially, this work made it possible to take an email account in a proprietary format, such as PST from Microsoft Outlook, and create an XML preservation copy of the entire account, including messages and attachments. XML was chosen as the preservation format because it is human-readable, open, and self-describing.

CERP developed a parsing tool, written in the Smalltalk programming language, that creates the XML preservation copy. The copy follows the schema noted above, which includes sender, date, subject, message body, and other fields. The CERP parser was created so that small- to mid-sized organizations could download the software and use it with their own email accounts.
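To give a feel for what "an XML preservation copy" of a message means, here is a minimal sketch in Python. It is not the CERP parser (which is written in Smalltalk) and it does not follow the CERP/EMCAP schema: the element names `Message`, `From`, `Date`, `Subject`, and `Body` are assumptions chosen only for illustration, and attachments are ignored.

```python
from email import message_from_string
import xml.etree.ElementTree as ET


def message_to_xml(raw_message: str) -> str:
    """Convert one plain-text email message into a small XML record.

    The element names are hypothetical; they are NOT the CERP/EMCAP schema.
    """
    msg = message_from_string(raw_message)
    root = ET.Element("Message")
    for header in ("From", "Date", "Subject"):
        ET.SubElement(root, header).text = msg.get(header, "")
    # For a simple, non-multipart message the payload is the body text.
    body = "" if msg.is_multipart() else msg.get_payload()
    ET.SubElement(root, "Body").text = body
    return ET.tostring(root, encoding="unicode")


# A made-up sample message used only for this demonstration.
SAMPLE = (
    "From: archivist@example.org\n"
    "Date: Tue, 29 Nov 2011 10:00:00 -0500\n"
    "Subject: Budget files\n"
    "\n"
    "The diskettes arrived today.\n"
)

if __name__ == "__main__":
    print(message_to_xml(SAMPLE))
```

The result is plain text that can be read in any editor, which is the property that made XML attractive as a preservation format.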

A crawl of the Archives of American Art’s website with Heritrix.

Heritrix

We have written about Heritrix previously on the blog. This tool crawls websites and packages the output in preservation containers known as WARC (Web ARChive) files. The Archives uses Heritrix to crawl the nearly one hundred public websites maintained by the Smithsonian’s various museums, research centers, and other offices.

Benefits of Heritrix include:

  • WARCs are an international archival standard
  • WARCs contain useful information such as date, record id, content type, and other data
  • WARCs are easier to manage than hundreds of thousands of separate documents, pages, and assets from a website that was downloaded or copied
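As a rough illustration of the information a WARC record carries, the sketch below reads the header block of a single record into a dictionary. The sample header is made up for this example (the URI, date, and record ID are placeholders), and the function is a simplified reader, not part of Heritrix.

```python
def parse_warc_header(header_block: str) -> dict:
    """Split the header of one WARC record into a name -> value mapping."""
    lines = header_block.strip().splitlines()
    fields = {"version": lines[0]}  # first line is the version, e.g. WARC/1.0
    for line in lines[1:]:
        # Split at the first colon only; values such as URIs contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# Placeholder header for demonstration; not taken from a real crawl.
SAMPLE_HEADER = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Date: 2011-11-29T15:00:00Z\r\n"
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "WARC-Target-URI: http://example.org/\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "Content-Length: 0\r\n"
)

if __name__ == "__main__":
    for name, value in parse_warc_header(SAMPLE_HEADER).items():
        print(f"{name}: {value}")
```

Every record in the container carries this kind of header, so the date, record ID, and content type travel with the captured page itself.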

BWF MetaEdit allows metadata entry with audio files.

BWF MetaEdit

Our collections at the Archives also include audio, covering everything from Smithsonian concerts to workshop planning files to oral histories. These files are preserved as WAVs (Waveform Audio File Format). WAV is considered an audio preservation format because it is uncompressed; works in Windows, Mac, and Linux; and is widely used. BWF, or Broadcast Wave Format, is the European adaptation of Microsoft’s WAV and contains embedded metadata, which makes it more desirable as a preservation option.

BWF MetaEdit, which was developed by the Federal Agencies Digitization Guidelines Initiative (FADGI), allows users to create Broadcast Wave files from WAVs. Metadata can be added through its graphical user interface (GUI) or command line to create a valid Broadcast Wave file. These metadata fields include organization name, description, origination time of the file, and the software used to create the original WAV.
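One way to see the difference between a plain WAV and a Broadcast Wave file is to list the chunks inside it: a Broadcast Wave file carries its embedded metadata in a chunk named `bext`. The sketch below is not BWF MetaEdit and does not read or write the metadata fields; it only walks the chunk list. The demo file it builds is a tiny in-memory placeholder, and the function names are made up for this example.

```python
import io
import struct
from typing import BinaryIO, List


def list_riff_chunks(stream: BinaryIO) -> List[str]:
    """Return the chunk IDs found in a RIFF/WAVE stream, in order."""
    header = stream.read(12)
    if len(header) < 12:
        raise ValueError("stream is too short to be a WAV file")
    riff, _size, form = struct.unpack("<4sI4s", header)
    if riff != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    chunk_ids = []
    while True:
        chunk_header = stream.read(8)
        if len(chunk_header) < 8:
            break
        chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
        chunk_ids.append(chunk_id.decode("ascii", errors="replace"))
        # Chunk data is padded to an even number of bytes; skip over it.
        stream.seek(chunk_size + (chunk_size % 2), io.SEEK_CUR)
    return chunk_ids


def has_bext_chunk(stream: BinaryIO) -> bool:
    """True if the WAV stream contains a Broadcast Wave 'bext' chunk."""
    return "bext" in list_riff_chunks(stream)


def _chunk(chunk_id: bytes, data: bytes) -> bytes:
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack("<I", len(data)) + data + pad


def build_demo_wave(with_bext: bool) -> bytes:
    """Build a tiny placeholder WAV in memory (zero-filled, for demo only)."""
    body = b"WAVE"
    if with_bext:
        body += _chunk(b"bext", b"\x00" * 602)  # empty placeholder payload
    body += _chunk(b"fmt ", struct.pack("<HHIIHH", 1, 1, 44100, 88200, 2, 16))
    body += _chunk(b"data", b"\x00\x00" * 4)
    return b"RIFF" + struct.pack("<I", len(body)) + body


if __name__ == "__main__":
    print(list_riff_chunks(io.BytesIO(build_demo_wave(with_bext=True))))
```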

JHOVE and DROID

JHOVE and DROID are both useful file format identification tools used by archives, libraries, and other organizations. JHOVE is a collaboration between JSTOR and the Harvard University Library while DROID was developed by The National Archives of the United Kingdom.

These tools can be used together in some cases to determine an unknown file format. For example, when we receive digital files from other Smithsonian offices, older files are sometimes missing the three-letter identifying extension at the end of the file name. Without this information, it’s difficult to know whether a file called “budget” is a WordPerfect file or a spreadsheet.
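The basic idea behind identifying a file without its extension is to look at the first few bytes, which many formats use as a signature. The sketch below is a much-simplified illustration of that idea and not how JHOVE or DROID are implemented; the short signature table and the function names are assumptions for this example, and real tools rely on far larger signature registries.

```python
from typing import Optional

# A handful of well-known leading byte signatures. This table is a tiny
# illustrative sample, not the registry used by DROID or JHOVE.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xffWPC", "WordPerfect document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy Word or Excel)"),
    (b"PK\x03\x04", "ZIP container (e.g. OOXML or ODF)"),
]


def guess_format(first_bytes: bytes) -> Optional[str]:
    """Return a format label if the leading bytes match a known signature."""
    for magic, label in SIGNATURES:
        if first_bytes.startswith(magic):
            return label
    return None


def guess_format_of_file(path: str) -> Optional[str]:
    """Read the start of a file and guess its format from the signature."""
    with open(path, "rb") as handle:
        return guess_format(handle.read(16))


if __name__ == "__main__":
    # Leading bytes a WordPerfect file named "budget" might start with.
    print(guess_format(b"\xffWPC" + b"\x00" * 12))
```

A result of `None` only means the signature is not in this small table; it says nothing about whether the file is damaged.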

The Archives also developed a Java-based script that automates analyses of digital files using both JHOVE and DROID. The script generates outputs and file lists that help an archivist determine possible issues, such as a file with the wrong identifying extension.

Note: This script uses older versions of JHOVE and DROID but newer versions are currently being tested at the Archives.

The Duke Data Accessioner tool assists with copying and analyzing digital files.

Duke Data Accessioner

Many electronic files come to the Archives on removable media (CDs, DVDs, and, yes, 3.5” diskettes), which requires that we transfer the content to our backed-up servers for preservation and access. The Duke Data Accessioner (DDA) from Duke University is software that assists us with the initial work of ingesting (copying) the files off the media. After we enter some information about the collection and media, the tool recreates the directory structure found on the media and copies the records. DDA also runs JHOVE and DROID (see above) against the files for analysis and creates an XML file of this output with some additional preservation metadata known as PREMIS.
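The copying step can be pictured with a short sketch. This is not the Duke Data Accessioner and makes no claim about how it works internally: the function names are hypothetical, and the SHA-256 comparison after each copy is an assumption added here to show one common way of confirming that a copy matches its source. Empty directories are not recreated in this simplified version.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_media(source_root: Path, target_root: Path) -> Dict[str, str]:
    """Copy every file, keeping the directory structure, and verify each copy.

    Returns a manifest mapping relative paths to checksums.
    """
    manifest = {}
    for source in sorted(source_root.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(source_root)
        target = target_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also keeps file timestamps
        checksum = sha256_of(source)
        if sha256_of(target) != checksum:
            raise OSError(f"copy of {relative} does not match the original")
        manifest[relative.as_posix()] = checksum
    return manifest


def demo() -> Dict[str, str]:
    """Copy a made-up one-file 'diskette' inside a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        media, server = Path(tmp) / "media", Path(tmp) / "server"
        (media / "reports").mkdir(parents=True)
        (media / "reports" / "budget").write_bytes(b"\xffWPC demo")
        return copy_media(media, server)


if __name__ == "__main__":
    print(demo())
```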

Other tips

If you only need to view a file rather than open it in its original software, try searching the Internet for “viewer” and “old files.” For more information on detecting file formats, search for “file identification.” Some online tools will attempt to detect what a mystery file might be.

Software that enables digital files to last for the long term with authenticity and integrity intact can be a lifesaver. Nevertheless, it is not a replacement for copies, backups, and migrations of important files to new software and hardware.

Categories: Behind the Scenes, What Gets Saved
Tags: Web/Tech, Digitization, Behind the Scenes, Archives
Comments (2)

Matt

Interesting read. One thing the article didn't talk about that would be interesting to know is how big that database is. I can't imagine trying to DBA that thing!!

I've started a MySQL database tracking my fantasy and romance ebooks just to learn more about databases, and that thing was big, but your tables must be terabytes each.

Matt November 29, 2011 at 8:52 pm

Lynda Schmitz Fuhrig

Actually, we manage our collections with a few tools, which include a Content Management System, electronic notification forms, and tracking software. All of these are backed up.

Lynda Schmitz Fuhrig December 15, 2011 at 3:30 pm

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.
