CSV log of email that was processed in DArcMail that shows sender, recipient, date, subject, MessageID, and hash. This is from the Enron email dataset.

Announcing the Latest Release of DArcMail

The Archives has released the latest version of its open-source email preservation software suite for anyone to use.

As an email preservation pioneer, the Archives is pleased to announce the latest release of its open-source software DArcMail (Digital Archiving for eMail), now available on GitHub. Most of the testing and use has been on Windows, but it should also run on Mac and Linux. Hosting the code on GitHub makes version control and sharing with others easier.

View of email containers that need special software to open.

Because email comes in a variety of formats, some of them proprietary, messages and their native attachments are at risk of becoming inaccessible over time. While email archiving can be important for legal reasons, researchers are also interested in digital correspondence as a way to learn about everything from business operations to social networks to email user practices in general. Libraries, archives, and cultural heritage organizations continue to research ways to make email messages and attachments available for study ten, twenty, or one hundred years from now.

DArcMail has two components: DArcMailXML creates the preservation output, and the other component lets users search an email account. Both require that the email collection (or single message) be in the MBOX format, so the original email format may need to be converted first. DArcMail preserves an email account in the EAXS XML preservation format for long-term accessibility, either through the command line or a user interface. The EAXS XML preservation schema was co-developed by the State Archives of North Carolina and the Smithsonian Institution Archives during previous email preservation projects. The search component lets a user find messages by keyword, date, and sender, and export a subset of emails into the MBOX format.
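For readers unfamiliar with MBOX, the following is a minimal sketch of the kind of filtering and export described above, written with Python's standard `mailbox` module. It is not DArcMail's code: the function name `filter_mbox` and its parameters are hypothetical, the keyword is matched only against the subject line, and `since` is assumed to be a timezone-naive `datetime`.

```python
import mailbox
from email.utils import parsedate_to_datetime


def filter_mbox(src_path, dest_path, keyword=None, sender=None, since=None):
    """Copy messages that match every given criterion into a new MBOX file.

    Hypothetical helper for illustration. Returns the number of messages
    written to dest_path.
    """
    src = mailbox.mbox(src_path, create=False)
    dest = mailbox.mbox(dest_path)  # created if it does not exist
    count = 0
    try:
        for msg in src:
            subject = str(msg.get("Subject", ""))
            from_hdr = str(msg.get("From", ""))
            if keyword and keyword.lower() not in subject.lower():
                continue
            if sender and sender.lower() not in from_hdr.lower():
                continue
            if since is not None:
                try:
                    sent = parsedate_to_datetime(str(msg.get("Date", "")))
                except (TypeError, ValueError):
                    continue  # skip messages with a missing or unparseable date
                # Compare as naive datetimes; the UTC offset is ignored here.
                if sent.replace(tzinfo=None) < since:
                    continue
            dest.add(msg)
            count += 1
        dest.flush()
    finally:
        src.close()
        dest.close()
    return count
```

Calling `filter_mbox("Inbox.mbox", "announcements.mbox", keyword="Announcement")` would, under these assumptions, write only the messages whose subject contains that word; the file names are placeholders.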

Software showing emails with the word Announcement found in a search.

The current version is built in Python 3 and uses SQLite, which is portable and can process emails faster than prior DArcMail versions, which used a MySQL database. Every processed email collection has its own SQLite database file, so it can be moved to another directory on a computer or shared with another user or computer running DArcMail.
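To illustrate why a single-file SQLite database is convenient for this kind of search, here is a small sketch using Python's built-in `sqlite3` module. The table name, columns, and the functions `build_index` and `search` are assumptions made for this example and do not reflect DArcMail's actual schema.

```python
import sqlite3


def build_index(db_path, messages):
    """Create a one-table index in a single SQLite file.

    `messages` is an iterable of (message_id, sender, date_iso, subject)
    tuples. The schema is hypothetical.
    """
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS message ("
            " message_id TEXT, sender TEXT, date_iso TEXT, subject TEXT)"
        )
        con.executemany("INSERT INTO message VALUES (?, ?, ?, ?)", messages)
    return con


def search(con, keyword=None, sender=None, start=None, end=None):
    """Return rows matching all given filters, ordered by date.

    Dates are ISO 8601 strings (YYYY-MM-DD), which sort correctly as text.
    """
    clauses, params = [], []
    if keyword:
        clauses.append("subject LIKE ?")
        params.append(f"%{keyword}%")
    if sender:
        clauses.append("sender LIKE ?")
        params.append(f"%{sender}%")
    if start:
        clauses.append("date_iso >= ?")
        params.append(start)
    if end:
        clauses.append("date_iso <= ?")
        params.append(end)
    sql = "SELECT message_id, sender, date_iso, subject FROM message"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY date_iso"
    return con.execute(sql, params).fetchall()
```

Because the whole index lives in the one file at `db_path`, copying that file is enough to move the searchable collection to another directory or machine.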

DArcMailXML creates the master XML preservation copy and does not require SQLite. The current version of DArcMail creates one XML file for each MBOX file in an email collection. For instance, if an account has an Inbox directory and a Sent Items directory, there should be an Inbox MBOX file and a Sent Items MBOX file for DArcMail to process. Previously there was one XML file for the entire account, but such files became unwieldy and hard to open and manage, since email accounts can easily exceed 50 GB with thousands of emails and attachments. The XML file retains the folder structure, header and other metadata, and the messages and their attachments.

Attachments from an account are encoded in the MBOX file. DArcMail now stores all attachments separately from the XML preservation file, and the email message records each attachment's file name, content type, and directory. Previously, attachments were stored externally only if their size exceeded 25 KB.
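As a rough illustration of what storing attachments separately involves, the sketch below decodes the attachments of one raw message with Python's standard `email` package and writes them to a directory, returning the file name, content type, and path of each. The function `extract_attachments` and its output naming scheme are hypothetical, and the sketch only looks at top-level attachments; it does not describe how DArcMail itself lays out files.

```python
from email import message_from_bytes, policy
from pathlib import Path


def extract_attachments(raw_message, out_dir):
    """Write each top-level attachment of a raw message (bytes) to out_dir.

    Returns a list of dicts recording file name, content type, and path.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = []
    for index, part in enumerate(msg.iter_attachments()):
        # Keep only the base name so a crafted file name cannot escape out_dir.
        name = Path(part.get_filename() or f"attachment-{index}").name
        target = out / f"{index}-{name}"
        # decode=True reverses the base64 or quoted-printable transfer encoding.
        target.write_bytes(part.get_payload(decode=True) or b"")
        records.append(
            {
                "file_name": name,
                "content_type": part.get_content_type(),
                "path": str(target),
            }
        )
    return records
```

The index prefix on each output file is one simple way to avoid collisions when two attachments in the same message share a name.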

View of email information that includes sender, recipient, date, subject, MessageID, and hash.

A CSV log is generated for every XML file, listing the subject line, sender, and recipient of each message. A hash (a specific string of numbers and letters) is also generated for every message and attachment so that authenticity can be verified. Both DArcMail components count duplicate emails across the account based on the MessageID, an identifier that every email carries in its header but that most email applications hide. They report that count in a log file, along with the number of emails processed and how long processing took.
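The logging and duplicate check described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not DArcMail's implementation: the function `write_log` and the CSV column names are hypothetical, and SHA-256 is chosen here only as an example, since the hash algorithm DArcMail uses is not specified in this post.

```python
import csv
import hashlib
import mailbox
from collections import Counter


def write_log(mbox_path, csv_path):
    """Write one CSV row per message and count duplicate Message-IDs.

    Returns (messages_processed, duplicate_count).
    """
    box = mailbox.mbox(mbox_path, create=False)
    seen = Counter()
    processed = 0
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["from", "to", "date", "subject", "message_id", "sha256"])
            for msg in box:
                # Hash the serialized message; SHA-256 is an assumption.
                digest = hashlib.sha256(msg.as_bytes()).hexdigest()
                message_id = str(msg.get("Message-ID", "")).strip()
                seen[message_id] += 1
                writer.writerow(
                    [
                        str(msg.get("From", "")),
                        str(msg.get("To", "")),
                        str(msg.get("Date", "")),
                        str(msg.get("Subject", "")),
                        message_id,
                        digest,
                    ]
                )
                processed += 1
    finally:
        box.close()
    # Every occurrence after the first of a non-empty Message-ID is a duplicate.
    duplicates = sum(n - 1 for mid, n in seen.items() if mid and n > 1)
    return processed, duplicates
```

Messages with no Message-ID header are logged but never counted as duplicates of one another in this sketch, which is a design choice rather than a rule from the source.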

An email message displayed with formatting.

This is an exciting time for email archiving and preservation, as more projects have started since we last visited this topic in early 2020. These include ePADD+, which is looking at adding preservation functionality to the ePADD tool from Stanford University, and Mailbag, which is exploring email preservation in many formats and extending the BagIt specification that is used to “bag” or package files. The Smithsonian Institution Archives also continues to participate in Email Archiving in PDF: From Initial Specification to Community of Practice, which is phase two of a previous project on email archiving with PDF.

A special thank you to Archives volunteer Carl Schaefer for his DArcMail development work.

Produced by the Smithsonian Institution Archives. For copyright questions, please see the Terms of Use.