CRATE

Research into producing self-archiving web resources. This is my dissertation focus area. CRATE is a complex object model where a web server delivers both preservation metadata and the resource itself in response to a single GET request -- i.e., a self-describing resource.

Using OAI-PMH Resource Harvesting and MPEG-21 DIDL for Digital Preservation

Published: 
January 2007

J.A. Smith and M.L. Nelson. 2nd International Conference on Open Repositories. San Antonio, TX, USA. January, 2007.

Summary: 

We propose involving the web server in the preservation process through “mod_oai”, an Apache module that (1) addresses the counting problem by using OAI-PMH to unambiguously list all canonical URIs at a website and (2) addresses the representation problem by providing an archive-ready representation of the web resource with a complex object format (MPEG-21 Digital Item Declaration Language (DIDL)) that captures both metadata generated by the web server at dissemination time and by the repository post-crawl.


Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination

Published: 
June 2007

J.A. Smith and M.L. Nelson. Proceedings of JCDL 2007. Vancouver, BC, Canada. June 2007.

Summary: 

HTTP and MIME, while sufficient for contemporary web page access, do not provide enough forensic information to enable the long-term preservation of the resources they describe and transport. But what if the originating web server automatically provided preservation metadata encapsulated with the resource at time of dissemination? Perhaps the ingestion process could be streamlined, with additional forensic metadata available to future information archeologists. We have adapted an Apache web server implementation of OAI-PMH which can utilize third-party metadata analysis tools to provide a metadata-rich description of each resource. The resource and its forensic metadata are packaged together as a complex object, expressed in plain ASCII and XML. The result is a CRATE: a self-contained preservation-ready version of the resource, created at time of dissemination.
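
The packaging idea in this summary can be illustrated with a minimal Python sketch. It is not the CRATE schema: the element names (`crate`, `metadata`, `resource`), the `tool` attribute, and the helper functions are hypothetical, and Base64 is assumed here as one way to keep an arbitrary byte stream in plain ASCII inside XML.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Mapping


def build_crate(uri: str, resource: bytes, metadata: Mapping[str, str]) -> str:
    """Package a resource and raw utility output into one XML document.

    Element and attribute names are hypothetical, not the CRATE schema.
    """
    crate = ET.Element("crate", {"uri": uri})
    # Each utility's output is stored as-is, labeled only by the tool name.
    for tool, output in metadata.items():
        entry = ET.SubElement(crate, "metadata", {"tool": tool})
        entry.text = output
    # Base64 (an assumption) keeps arbitrary bytes representable in ASCII.
    body = ET.SubElement(crate, "resource", {"encoding": "base64"})
    body.text = base64.b64encode(resource).decode("ascii")
    return ET.tostring(crate, encoding="unicode")


def extract_resource(crate_xml: str) -> bytes:
    """Recover the original byte stream from a package built above."""
    root = ET.fromstring(crate_xml)
    # An empty resource serializes to an element with no text.
    return base64.b64decode(root.find("resource").text or "")
```

A round trip such as `extract_resource(build_crate(uri, data, {"file": "PNG image data"}))` should return `data` unchanged, which is the property a preservation package needs.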


A Quantitative Evaluation of Dissemination-Time Preservation Metadata

Published: 
September 2008

J.A. Smith and M.L. Nelson. Proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries (ECDL'08). Aarhus, Denmark. September, 2008.

Summary: 

One of many challenges facing web preservation efforts is the lack of metadata available for web resources. In prior work, we proposed a model that takes advantage of a site’s own web server to prepare its resources for preservation. When responding to a request from an archiving repository, the server applies a series of metadata utilities, such as Jhove and Exif, to the requested resource. The output from each utility is included in the HTTP response along with the resource itself. This paper addresses the question of feasibility: Is it in fact practical to use the site’s web server as a just-in-time metadata generator, or does the extra processing create an unacceptable deterioration in server responsiveness to quotidian events? Our tests indicate that (a) this approach can work effectively for both the crawler and the server; and that (b) utility selection is an important factor in overall performance.
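
As a rough illustration of the feasibility question, the sketch below applies a set of metadata utilities to a resource and records how long each one takes. The two utilities are simple stand-ins written for this example (a magic-number check and a SHA-1 digest), not Jhove or Exif, and the function names are hypothetical; the timings it produces say nothing about the results reported in the paper.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple

Utility = Callable[[bytes], str]


def sniff_type(data: bytes) -> str:
    """Stand-in for a format identification tool: checks a few magic numbers."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    return "application/octet-stream"


def sha1_digest(data: bytes) -> str:
    """Stand-in for a checksum utility."""
    return hashlib.sha1(data).hexdigest()


def run_utilities(data: bytes,
                  utilities: Dict[str, Utility]) -> List[Tuple[str, str, float]]:
    """Apply each utility to the resource; return (name, output, seconds)."""
    results = []
    for name, utility in utilities.items():
        start = time.perf_counter()
        output = utility(data)
        elapsed = time.perf_counter() - start
        results.append((name, output, elapsed))
    return results
```

Collecting the elapsed time per utility makes it possible to see which tools dominate the response time, which is the kind of comparison that utility selection depends on.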


Efficient, Automatic Web Harvesting

Published: 
November 2006

M.L. Nelson, J.A. Smith, I. Garcia del Campo, H. Van de Sompel and X. Liu. Proceedings of ACM WIDM 2006.

Summary: 

There are two problems associated with conventional web crawling techniques: a crawler cannot know if all resources at a non-trivial web site have been discovered and crawled (“the counting problem”) and the human-readable format of the resources is not always suitable for machine processing (“the representation problem”). We introduce an approach that solves these two problems by implementing support for both the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and MPEG-21 Digital Item Declaration Language (DIDL) into the web server itself. We present the Apache module “mod_oai”, which can be used to address the counting problem by listing all valid URIs at a web server and efficiently discovering updates and additions on subsequent crawls. Our experiments indicated comparable performance for initial crawls, and dramatic increases in update speed. mod_oai can also be used to address the representation problem by providing “preservation ready” versions of web resources aggregated with their respective forensic metadata in MPEG-21 DIDL format.
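
To make the counting-problem idea concrete, here is a small sketch of how a harvester might form OAI-PMH requests for a full listing and for an incremental update. The `ListIdentifiers` verb and the `metadataPrefix` and `from` arguments come from the OAI-PMH protocol; the base URL, the choice of the `oai_dc` prefix, and the function name are assumptions made for this example rather than details of mod_oai.

```python
from typing import Optional
from urllib.parse import urlencode


def list_identifiers_url(base_url: str,
                         metadata_prefix: str = "oai_dc",
                         from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListIdentifiers request URL.

    If from_date (UTC, YYYY-MM-DD) is given, the server is asked only for
    records created or changed on or after that date.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical base URL; a real deployment defines its own.
BASE = "http://www.example.org/modoai"
full_crawl = list_identifiers_url(BASE)
update_crawl = list_identifiers_url(BASE, from_date="2006-11-01")
```

The first request asks for every identifier the server knows about; the second restricts the response to changes since the previous harvest, which is what makes subsequent crawls cheaper than re-crawling the whole site.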


CRATE: A Simple Model for Self-Describing Web Resources

Published: 
June 2007

J.A. Smith and M.L. Nelson. Proceedings of the 7th International Web Archiving Workshop (IWAW'07). June 2007.

Summary: 

If not for the Internet Archive’s efforts to store periodic snapshots of the web, many sites would not have any preservation prospects at all. The barrier to entry is too high for everyday web sites, which may have skilled webmasters managing them, but which lack skilled archivists to preserve them. Digital preservation is not easy. One problem is the complexity of preservation models, which have specific metadata and structural requirements. Another problem is the time and effort it takes to properly prepare digital resources for preservation in the chosen model. In this paper, we propose a simple preservation model called a CRATE, a complex object consisting of undifferentiated metadata and the resource byte stream. We describe the CRATE complex object and compare it with other complex-object models. Our target is the everyday, personal, departmental, or community web site where a long-term preservation strategy does not yet exist.
