Digital Preservation
A Quantitative Evaluation of Dissemination-Time Preservation Metadata
J.A. Smith and M.L. Nelson. Proceedings of the 12th European Conference on Digital Libraries. September 2008.
One of many challenges facing web preservation efforts is the lack of metadata available for web resources. In prior work, we proposed a model that takes advantage of a site’s own web server to prepare its resources for preservation. When responding to a request from an archiving repository, the server applies a series of metadata utilities, such as Jhove and Exif, to the requested resource. The output from each utility is included in the HTTP response along with the resource itself. This paper addresses the question of feasibility: Is it in fact practical to use the site’s web server as a just-in-time metadata generator, or does the extra processing create an unacceptable deterioration in server responsiveness to quotidian events? Our tests indicate that (a) this approach can work effectively for both the crawler and the server; and that (b) utility selection is an important factor in overall performance.
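The feasibility question above can be illustrated with a minimal sketch that runs a set of metadata utilities over a resource at dissemination time and records how long each one takes. This is not the paper's test harness: the two utilities below are standard-library stand-ins (a SHA-256 checksum and a MIME-type guess) rather than Jhove or Exif, and the names `UTILITIES` and `describe` are hypothetical.

```python
import hashlib
import mimetypes
import time


def checksum_utility(name: str, data: bytes) -> str:
    """Stand-in metadata utility: SHA-256 digest of the resource bytes."""
    return hashlib.sha256(data).hexdigest()


def type_utility(name: str, data: bytes) -> str:
    """Stand-in metadata utility: MIME type guessed from the file name."""
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed or "application/octet-stream"


# Hypothetical registry; a real deployment would wrap external tools here.
UTILITIES = {"checksum": checksum_utility, "mime-type": type_utility}


def describe(name: str, data: bytes) -> dict:
    """Run every registered utility and record its output and elapsed time."""
    report = {}
    for label, utility in UTILITIES.items():
        start = time.perf_counter()
        output = utility(name, data)
        elapsed = time.perf_counter() - start
        report[label] = {"output": output, "seconds": elapsed}
    return report


if __name__ == "__main__":
    for label, entry in describe("photo.jpg", b"example bytes").items():
        print(f"{label}: {entry['output']} ({entry['seconds']:.6f}s)")
```

Timing each utility separately makes it possible to see which ones dominate the added response time, which is the kind of comparison that makes utility selection matter.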
Repository replication using SMTP and NNTP
M.L. Nelson, J.A. Smith, and M. Klein. Proceedings of the 2006 International Conference on Digital Government Research. May 2006.
We describe our progress on NSF ISS 0455997, "Shared Infrastructure Preservation Models". The focus of our efforts is to evaluate different preservation models based on Internet infrastructure that sites already have. Specifically, we investigate replicating the contents of a repository using the Simple Mail Transfer Protocol ("email") and the Network News Transfer Protocol ("news").
Repository Replication Using NNTP and SMTP
J.A. Smith, M. Klein, and M.L. Nelson. Proceedings of European Conference on Digital Libraries. Alicante, Spain. September 2006.
We present the results of a feasibility study using shared, existing, network-accessible infrastructure for repository replication. We utilize the SMTP and NNTP protocols to replicate both the metadata and the content of a digital library, using OAI-PMH to facilitate management of the archival process. We investigate how dissemination of repository contents can be piggybacked on top of existing email and Usenet traffic. Long-term persistence of the replicated repository may be achieved thanks to current policies and procedures which ensure that email messages and news posts are retrievable for evidentiary and other legal purposes for many years after the creation date. While the preservation issues of migration and emulation are not addressed with this approach, it does provide a simple method of refreshing content with unknown partners for smaller digital repositories that do not have the administrative resources for more sophisticated solutions.
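As a rough illustration of piggybacking repository content on email, the sketch below packages one record as a mail message with the metadata in the body and the content as an attachment, then unpacks it again. It uses only Python's standard `email` package; the addresses, the `X-Record-Identifier` header, and the function names are hypothetical, and actually sending the message (for example with `smtplib`) is left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def package_record(identifier: str, metadata_xml: str, content: bytes,
                   filename: str) -> EmailMessage:
    """Wrap one repository record as an email message (assumed layout)."""
    msg = EmailMessage()
    msg["From"] = "repository@example.org"      # placeholder address
    msg["To"] = "archive@example.net"           # placeholder address
    msg["Subject"] = f"Repository record {identifier}"
    msg["X-Record-Identifier"] = identifier     # hypothetical custom header
    msg.set_content(metadata_xml)               # metadata travels in the body
    msg.add_attachment(content, maintype="application",
                       subtype="octet-stream", filename=filename)
    return msg


def unpack_record(raw: bytes):
    """Recover the identifier, file name, and content from a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    attachment = next(msg.iter_attachments())
    return (str(msg["X-Record-Identifier"]),
            attachment.get_filename(),
            attachment.get_content())


if __name__ == "__main__":
    message = package_record("oai:example.org:1", "<record/>",
                             b"\x00\x01data", "item.bin")
    print(unpack_record(message.as_bytes()))
```

Keeping the identifier in a header lets a receiving site index records without parsing the body, though any real scheme would need to agree on that convention in advance.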
How Much Preservation Do I Get If I Do Absolutely Nothing?
M. Klein, F. McCown, J.A. Smith, and M.L. Nelson. Proceedings of Media Production 2006. Berlin, 2007.
To date, most of the focus regarding digital preservation has been on removing copies of the resources to be preserved from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies to new media) and migrating (conversion to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, due to the infrastructure costs (storage, networks, machines) and more importantly the human management costs, this approach is unsuitable for web scale preservation. The result is that difficult decisions need to be made as to what is saved and what is not saved. We provide an overview of two of our ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages. The common characteristic of the projects is they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. Both approaches are not intended to replace conventional archiving approaches, but rather they focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the future.
Reconstructing Websites for the Lazy Webmaster
F. McCown, J.A. Smith, M.L. Nelson, and J. Bollen. Technical Report. Old Dominion University. December 2005.
Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, "lazy" webmasters or concerned third parties may be able to recover some of their website from the Internet Archive. Other pages may also be salvaged from commercial search engine caches. We introduce the concept of "lazy preservation": digital preservation performed as a result of the normal operations of the Web infrastructure (search engines and caches). We present Warrick, a tool to automate the process of website reconstruction from the Internet Archive, Google, MSN and Yahoo. Using Warrick, we have reconstructed 24 websites of varying sizes and composition to demonstrate the feasibility and limitations of website reconstruction from the public Web infrastructure. To measure Warrick's window of opportunity, we have profiled the time required for new Web resources to enter and leave search engine caches.
Using The Web Infrastructure To Preserve Web Pages
M.L. Nelson, F. McCown, J.A. Smith, and M. Klein. International Journal on Digital Libraries. July, 2007.
To date, most of the focus regarding digital preservation has been on replicating copies of the resources to be preserved from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies to new media) and migrating (conversion to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, due to the infrastructure costs (storage, networks, machines) and more importantly the human management costs, this approach is unsuitable for web scale preservation. The result is that difficult decisions need to be made as to what is saved and what is not saved. We provide an overview of our ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages and examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving approaches, but rather they focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the future. We characterize the preservation approaches by the level of effort required by the web administrator: web sites are reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments (“shared infrastructure preservation”); and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL representations of web pages (“web server enhanced preservation”).
Integrating Preservation Functions into the Web Server
J.A. Smith. Dissertation (Ph. D., Computer Science). Old Dominion University. June, 2008.
Digital preservation of the World Wide Web poses unique challenges, different from the preservation issues facing professional Digital Libraries. The complete list of a website’s resources cannot be cited with confidence, and the HyperText Transfer Protocol (HTTP) provides a bare minimum of metadata with each resource transfer – HTTP is optimized for access today rather than tomorrow. In short, the Web suffers from a counting problem and a representation problem. Refreshing the bits, migrating from an obsolete file format to a newer format, and other classic digital preservation problems also affect the Web. As digital collections devise solutions to these problems, the Web will also benefit. But the core World Wide Web problems of Counting and Representation need a targeted solution.
As the host of web content, the web server is uniquely positioned to assist in the preservation of the resources it serves. It recognizes the resources it has, and knows what kind of resources they are. This dissertation presents research in which preservation functions have been integrated into the web server itself to produce archive-ready versions of the website’s resources. The proposed approach addresses the Counting Problem through the use of Sitemaps, created from a combination of crawling, Sitemap tools, and log analysis. The Representation Problem is addressed by a preservation-preparation module installed on the web server. The module enables each resource to be packaged together with the output from a variety of relevant metadata utilities, creating the aforementioned archive-ready version of the resource. The CRATE Model defines a simple XML structure for the creation and delivery of such resources.
A series of experiments which evaluated CRATE, Sitemaps, and extemporaneous metadata analysis of resources are presented, along with a technical review of the MODOAI web server module which acts as the preservation agent. The feasibility of this approach is demonstrated by a quantitative analysis of its use in a commercial web testing environment.
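To make the log-analysis route to a Sitemap concrete, here is a minimal sketch that collects the paths of successful GET requests from access-log lines and writes them as a Sitemap `urlset`. It covers only the log-analysis part of the approach described above and is not the dissertation's tooling: the Common Log Format input, the function names, and the example host are assumptions. The namespace is the one defined by the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def paths_from_log(log_lines):
    """Collect distinct paths of successful GET requests from access-log lines."""
    paths = set()
    for line in log_lines:
        try:
            parts = line.split('"')
            request = parts[1]                 # e.g. 'GET /index.html HTTP/1.1'
            status = parts[2].split()[0]       # status code follows the request
            method, path, _protocol = request.split()
        except (IndexError, ValueError):
            continue                           # skip lines that do not parse
        if method == "GET" and status == "200":
            paths.add(path)
    return sorted(paths)


def build_sitemap(base_url, paths):
    """Serialize the paths as a Sitemap document with one <url><loc> per path."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for path in paths:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
        loc.text = base_url.rstrip("/") + path
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    sample = [
        '127.0.0.1 - - [10/Oct/2007:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
        '127.0.0.1 - - [10/Oct/2007:13:55:40 -0700] "GET /missing.html HTTP/1.1" 404 209',
    ]
    print(build_sitemap("http://example.org/", paths_from_log(sample)))
```

A log-derived list only contains resources that someone has requested, which is one reason to combine it with crawling and Sitemap tools rather than rely on it alone.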
Creating Preservation-Ready Web Resources
J.A. Smith and M.L. Nelson. D-Lib Magazine. January/February 2008.
There are innumerable departmental, community, and personal web sites worthy of long-term preservation but proportionally fewer archivists available to properly prepare and process such sites. We propose a simple model for such everyday web sites which takes advantage of the web server itself to help prepare the site's resources for preservation. This is accomplished by having metadata utilities analyze the resource at the time of dissemination. The web server responds to the archiving repository crawler by sending both the resource and the just-in-time generated metadata as a straightforward XML-formatted response. We call this complex object (resource + metadata) a CRATE. In this paper we discuss modoai, the web server module we developed to support this approach, and we describe the process of harvesting preservation-ready resources using this technique.
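The idea of a single XML response carrying both the resource and its metadata can be sketched as follows. This is an illustration only, not the CRATE schema: the element names (`crate`, `resource`, `metadata`, `utility`), the Base64 encoding of the resource, and the function names are all assumptions made for the example.

```python
import base64
import xml.etree.ElementTree as ET


def build_crate(url: str, content: bytes, metadata: dict) -> str:
    """Package a resource and per-utility metadata output in one XML document."""
    crate = ET.Element("crate")  # hypothetical element names throughout
    resource = ET.SubElement(crate, "resource",
                             {"url": url, "encoding": "base64"})
    resource.text = base64.b64encode(content).decode("ascii")
    block = ET.SubElement(crate, "metadata")
    for utility, output in metadata.items():
        entry = ET.SubElement(block, "utility", {"name": utility})
        entry.text = output      # raw utility output, escaped by ElementTree
    return ET.tostring(crate, encoding="unicode")


def open_crate(xml_text: str):
    """Recover the resource bytes and the metadata mapping from the XML."""
    crate = ET.fromstring(xml_text)
    content = base64.b64decode(crate.find("resource").text)
    metadata = {e.get("name"): e.text for e in crate.find("metadata")}
    return content, metadata


if __name__ == "__main__":
    packaged = build_crate("http://example.org/logo.png", b"\x89PNG",
                           {"file-type": "image/png"})
    print(packaged)
    print(open_crate(packaged))
```

Encoding the resource as Base64 keeps binary content safe inside XML at the cost of a larger response; a harvester reverses the step when it unpacks the object.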