The Collaborative Electronic Records Project
Email collections are more than just messages and can contain numerous attachments. As part of the Collaborative Electronic Records Project (CERP), potential format issues in email attachments needed to be identified in keeping with the best practices of SIA’s Electronic Records Program. During initial processing, the attachments were extracted (the originals remain with the source email), and file format identification applications were used to determine their formats and flag possible obsolescence issues.
JHOVE and DROID are both file format identification tools used by the archival community. JHOVE provides robust metadata for a small set of standards-based file formats, while DROID handles a much larger range of formats but produces comparatively limited metadata. JHOVE also requires significantly more technical skill to install. Running both programs in an assessment makes it possible to compare their results against each other.
SIA has developed a Java-based script that automates analysis of the attachments using both programs. The script generates: 1) a file log listing all the analyzed attachments; 2) a file list of the analyzed attachments and the possible types determined by DROID and JHOVE for each; 3) the outputs from the JHOVE modules and DROID; and 4) a warnings file. When DROID reports a possible file mismatch, the warnings file can contain DROID’s diagnosis along with JHOVE’s analysis of the file in question. All output files should be reviewed to get a thorough analysis.
A primary goal of developing this script was to reduce format analysis time by eliminating the need to manually run the attachments through DROID and through each JHOVE module separately. The warnings file serves only as a starting point: it logs results from both programs in a simple text document so that an archivist can quickly locate and review questionable files.
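The warnings behavior described above can be sketched roughly as follows. This is a minimal illustration, not the SIA script itself: the class WarningsSketch, the FileResult type, and the line layout are all hypothetical, and the sketch assumes the DROID and JHOVE results for each attachment have already been parsed into simple values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from the SIA script.
class WarningsSketch {

    // Results for one attachment, assumed to be already parsed from the tools' outputs.
    static final class FileResult {
        final String path;
        final boolean droidMismatch; // DROID flagged a possible file mismatch
        final String droidFormat;    // format name reported by DROID
        final String jhoveStatus;    // status string reported by JHOVE

        FileResult(String path, boolean droidMismatch, String droidFormat, String jhoveStatus) {
            this.path = path;
            this.droidMismatch = droidMismatch;
            this.droidFormat = droidFormat;
            this.jhoveStatus = jhoveStatus;
        }
    }

    // Emit one warning line per file that DROID flagged, pairing DROID's
    // diagnosis with JHOVE's analysis of the same file.
    static List<String> buildWarnings(List<FileResult> results) {
        List<String> warnings = new ArrayList<>();
        for (FileResult r : results) {
            if (r.droidMismatch) {
                warnings.add(r.path + " | DROID: " + r.droidFormat
                        + " | JHOVE: " + r.jhoveStatus);
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Illustrative values only; these are not real tool outputs.
        List<FileResult> results = Arrays.asList(
                new FileResult("report.pdf", false, "PDF 1.4", "Well-Formed and valid"),
                new FileResult("photo.doc", true, "JPEG", "Not well-formed"));
        for (String line : buildWarnings(results)) {
            System.out.println(line);
        }
    }
}
```

Because only files flagged by DROID produce a line in this sketch, a file that JHOVE alone considers problematic would not appear, which is one reason all output files should still be reviewed.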
We offer this script for use by other archival organizations. Please be aware that neither SIA nor the Rockefeller Archive Center can be held liable for any problems. Use of the script is at your own risk. We welcome your questions and input.
This script was written by SIA intern Jacob Bartel.
* The FileList is not properly identifying the location of the file to be analyzed when that file is in the same directory as the script.
* JHOVE indicated a PDF file was “not well-formed” within the JHOVE PDF output file but DROID did not indicate a problem and the file was not listed in the warnings file.
* JHOVE indicated a PDF file was “not valid” within the JHOVE PDF output file but DROID did not indicate a problem and the file was not listed in the warnings file. The error reported by JHOVE was “Invalid destination object.”
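Until these issues are addressed, one way to catch files that only JHOVE considers problematic is to scan the JHOVE output for its status and error lines. The sketch below is an assumption-laden illustration rather than part of the SIA script: the class JhoveStatusSketch and its methods are hypothetical, and it assumes the report is plain text containing lines that begin with "Status:" and "ErrorMessage:", which should be checked against the actual output files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the "Status:" and "ErrorMessage:" line prefixes are
// assumptions about the JHOVE text report and should be verified locally.
class JhoveStatusSketch {

    // Return the text after the first "Status:" line, or null if none is found.
    static String extractStatus(List<String> reportLines) {
        for (String line : reportLines) {
            String t = line.trim();
            if (t.startsWith("Status:")) {
                return t.substring("Status:".length()).trim();
            }
        }
        return null;
    }

    // Collect the text of every "ErrorMessage:" line.
    static List<String> extractErrors(List<String> reportLines) {
        List<String> errors = new ArrayList<>();
        for (String line : reportLines) {
            String t = line.trim();
            if (t.startsWith("ErrorMessage:")) {
                errors.add(t.substring("ErrorMessage:".length()).trim());
            }
        }
        return errors;
    }

    // Treat a missing status, or any status other than fully valid, as needing review.
    static boolean needsReview(String status) {
        return status == null || !status.equalsIgnoreCase("Well-Formed and valid");
    }

    public static void main(String[] args) {
        // Illustrative report fragment, not copied from a real run.
        List<String> report = Arrays.asList(
                "  Format: PDF",
                "  Status: Well-Formed, but not valid",
                "  ErrorMessage: Invalid destination object");
        String status = extractStatus(report);
        if (needsReview(status)) {
            System.out.println("Review needed: " + status + " " + extractErrors(report));
        }
    }
}
```

A check like this would flag a file on JHOVE's result alone, regardless of whether DROID reported a mismatch.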
DROID is licensed under BSD License for DROID (Digital Record Object Identification) v3.0. Copyright (c) 2008, The National Archives. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.