Here at the Smithsonian Institution Archives, we take pride in preserving the Institution’s history, including its sizable web presence. While various offices at the Smithsonian create and back up the contents of their websites, the Archives also crawls each website using Heritrix, an open-source tool created by the Internet Archive, to capture content in an archival format. Our aim is to preserve the ABCs of digital objects: appearance, behavior, and content. We take care to tailor crawl configurations to each specific website to capture as much of its ABCs as possible while adhering to our collections policy. Sometimes, though, the structure of the site itself makes a perfect crawl difficult or impossible.
Based on our experience, and because the preservation process of a digital object starts at its creation, here are some suggestions for web developers that can help ensure that the websites they create and maintain will be easier to crawl, can remain accessible, and will be preserved.
1. Follow accessibility standards
Adhering to accessibility standards renders your site usable by everyone and accessible by more devices, including Heritrix and the Wayback Machine. Here are some useful resources:
- the W3C’s Web Accessibility Initiative (WAI)
- a Best Practices guide from the University of Illinois’s Center for Information Technology and Web Accessibility
- description of and standards for Section 508, a law that requires federal agencies’ electronic and information technology to be accessible to people with disabilities
2. Avoid proprietary formats for important content or provide alternate versions
There’s no assurance that proprietary formats used in web design will stick around in the long run. If the software manufacturer retires the product or closes, it will be much harder in the future for archives and libraries to display the digital object, since they’ll need to obtain a copy of software that might be old, rare, or difficult to implement. Instead, stick to open standards like HTML and CSS. If you decide to use Flash, offer a text-only version, too, and strive to provide equal content and experience.
3. Maintain stable URLs and redirect when necessary
Avoid linkrot! Linkrot is the tendency of links on the internet to point to resources that are no longer available. Carefully plan and implement a URL design scheme with a policy of persistence. In our test crawls, we’ve come across websites where as many as 40% of the links are broken. When updating a website, be sure to provide redirects for relocated documents. Your users will appreciate having continued access to the information. And in the same vein...
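To make the redirect advice concrete, here is a minimal Python sketch of a redirect map for relocated documents. The paths and the names `REDIRECTS`, `resolve`, and `flatten` are hypothetical, invented for illustration; a real site would normally express the same mapping in its web server or CMS configuration.

```python
# Hypothetical map of retired paths to their new locations (illustrative only).
REDIRECTS = {
    "/exhibits/2009/index.html": "/exhibitions/2009/",
    "/exhibitions/2009/": "/exhibitions/archive/2009/",
}


def resolve(path, redirects=REDIRECTS):
    """Follow the redirect map until reaching a path that is not redirected."""
    seen = set()
    while path in redirects:
        if path in seen:
            raise ValueError(f"redirect loop detected at {path}")
        seen.add(path)
        path = redirects[path]
    return path


def flatten(redirects):
    """Point every old path straight at its final destination.

    Collapsing chains means visitors and crawlers need only one redirect
    to reach a relocated document.
    """
    return {old: resolve(old, redirects) for old in redirects}


if __name__ == "__main__":
    for old, new in flatten(REDIRECTS).items():
        print(f"{old} -> {new}")
```

Checking the map for loops and collapsing chains whenever documents move again keeps old links working with a single redirect each.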
4. Design navigation carefully and include a sitemap
Our crawler is usually set to six “hops,” which means it will grab content up to six links away from a given seed URL. We won’t capture pages buried more than six levels deep. To help the crawler (and your readers) discover your entire website, provide a sitemap. For large collections of documents that may be listed across many pages, provide a “view all” link, too.
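One way to check your own site against a hop limit is to measure how many links separate each page from the seed. The sketch below is not Heritrix's implementation. It is a plain breadth-first search over a hypothetical link graph (a dictionary mapping each page to the pages it links to), and the names `hop_depths` and `beyond_limit` are assumptions made for this example.

```python
from collections import deque


def hop_depths(links, seed):
    """Return the minimum number of links from the seed to each reachable page."""
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def beyond_limit(links, seed, max_hops=6):
    """List pages that are unreachable or more than max_hops links from the seed."""
    depths = hop_depths(links, seed)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(p for p in all_pages if depths.get(p, max_hops + 1) > max_hops)


if __name__ == "__main__":
    # Hypothetical site where each page links only to the next one: /p0 -> /p1 -> ... -> /p8
    chain = {f"/p{i}": [f"/p{i + 1}"] for i in range(8)}
    print(beyond_limit(chain, "/p0"))

    # Adding a sitemap page linked from the seed brings every page within two hops.
    with_sitemap = dict(chain)
    with_sitemap["/p0"] = ["/p1", "/sitemap"]
    with_sitemap["/sitemap"] = [f"/p{i}" for i in range(9)]
    print(beyond_limit(with_sitemap, "/p0"))
```

In this toy graph the deepest pages fall outside the six-hop limit until the sitemap is added, which is the effect a sitemap or a “view all” link has for a crawler.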
5. Allow browsing of collections, not just searching
Sometimes archived websites contain a lot of good content, but it’s not accessible through the archival interface because the search function doesn’t work offline. If your website contains a searchable collection of documents or images, make sure it’s also browsable, e.g. by arranging images by genre. This way, a crawler can at least capture content by categories — and current users can wander through the collections without having to know what they’re looking for.
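As a rough illustration of making a searchable collection browsable, the Python sketch below groups records by genre and renders each group as a plain list of links that a crawler can follow. The record fields (`title`, `genre`, `url`) and the function names are hypothetical; an actual collection database will have its own schema and templates.

```python
from collections import defaultdict
from html import escape


def build_browse_index(records):
    """Group collection records by genre, with genres and titles in sorted order."""
    by_genre = defaultdict(list)
    for record in records:
        by_genre[record["genre"]].append(record)
    return {
        genre: sorted(items, key=lambda r: r["title"])
        for genre, items in sorted(by_genre.items())
    }


def render_genre_page(genre, items):
    """Render one genre as static HTML: a heading and a list of ordinary links."""
    lines = [f"<h1>{escape(genre)}</h1>", "<ul>"]
    for item in items:
        href = escape(item["url"], quote=True)
        lines.append(f'  <li><a href="{href}">{escape(item["title"])}</a></li>')
    lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample records.
    records = [
        {"title": "Harbor at dusk", "genre": "Landscapes", "url": "/images/102"},
        {"title": "Field notebook", "genre": "Manuscripts", "url": "/images/57"},
        {"title": "Canyon rim", "genre": "Landscapes", "url": "/images/88"},
    ]
    for genre, items in build_browse_index(records).items():
        print(render_genre_page(genre, items))
```

Because each item appears as an ordinary link on a static page, the content stays reachable in an archived copy even when the search function no longer works.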
Your users, present and future, will thank you for making your site more accessible and crawlable, and you’ll have the added bonus of being more discoverable to other crawlers like Google.
The ephemeral nature of the web is both a blessing and a curse. While it’s easy to produce and publish digital content, it’s just as easy to delete it or lose it. By designing with preservation in mind, you help web archivists safeguard your work for the future. It’s part of our cultural legacy!
Archiving the Smithsonian’s presence on the Internet, By Lynda Schmitz Fuhrig, Electronic Records Archivist