Eva Müller, Uwe Klosa, Peter Hansson, Stefan Andersson, Erik Siira: Using XML for Long-term Preservation

2. Long-term Preservation in the DiVA Project

Five Swedish universities cooperate within the DiVA project, which originated at Uppsala University. The participants are the universities of Stockholm, Södertörn, Umeå, Uppsala and Örebro. The main goals of the project are to create a searchable archive for long-term preservation and to disseminate the scientific work of the five universities.

Long-term preservation is only useful if several copies of the archive exist and every document is identified by a persistent and unique identifier. Several projects were therefore initiated in cooperation with the Royal Library in Stockholm. National Bibliographic Numbers (NBN, see the section on URN and NBN below) are made available to a URN resolution service<8> at the Royal Library. MARC21 records in MARC-XML are sent to the National Library (LIBRIS). These records contain the URN:NBN of the described document. The catalogue at the National Library and the archives at the Royal Library are likely to be well maintained and long-lived, so they are relatively safe places to deposit documents for long-term preservation.

Since the main purpose of the project initiated with the Royal Library is to create a copy of the local DiVA Archives<9>, check-summed file packages are sent there (see the description of the DiVA Archive below).

2.1 Uniform Resource Name (URN) and National Bibliographic Number (NBN)

A uniform resource name, or URN, is a unique and permanent identifier for electronic resources on the Internet. Unlike a uniform resource locator (URL), a URN does not change over time, and it cannot be reassigned to another resource even if the resource it identifies has ceased to exist. The National Library of Sweden assigns URNs in the Swedish national bibliographic number domain (URN:NBN:se) to organisations and the public in Sweden.

The DiVA archive has been assigned the sub-domains URN:NBN:se:X:diva, where X stands for an abbreviated form of the participant's name. The abbreviations are uu for Uppsala University, umu for Umeå University, oru for Örebro University, sh for Södertörn University and su for Stockholm University. To give every published document in the DiVA archive a unique URN:NBN identifier automatically, a serial number is appended to the sub-domain, e.g. URN:NBN:se:uu:diva-3100.
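The naming rule can be sketched in a few lines of Python. This is only an illustration: the function name make_urn and the validation rules are assumptions, and the lowercase spelling follows the resolver URL given in this paper rather than any stated requirement.

```python
# Participant abbreviations as listed in the text.
PARTICIPANTS = {
    "uu": "Uppsala University",
    "umu": "Umeå University",
    "oru": "Örebro University",
    "sh": "Södertörn University",
    "su": "Stockholm University",
}


def make_urn(participant: str, serial: int) -> str:
    """Build a DiVA identifier from a participant abbreviation and a serial number.

    Hypothetical helper: the actual DiVA implementation is not described here.
    """
    if participant not in PARTICIPANTS:
        raise ValueError(f"unknown participant abbreviation: {participant!r}")
    if serial < 1:
        raise ValueError("serial number must be positive")
    # Sub-domain urn:nbn:se:X:diva, followed by a hyphen and the serial number.
    return f"urn:nbn:se:{participant}:diva-{serial}"
```

For example, make_urn("uu", 3100) yields the identifier used as the example in this section.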

To resolve a URN:NBN, a resolution service has been developed and installed at the Royal Library in Stockholm. The example above can be resolved with the following URL: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-3100. The Royal Library has published guiding principles for the use of URN:NBN in Sweden at http://www.kb.se/urn/riktlinjer.htm<10>.

Should the DiVA archive at Uppsala University be closed, the respective URN:NBN will be resolved to the National Library Archive. In this way the long-term preservation copy is directly connected to the resolution service.
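As a small illustration, the resolver URL for an identifier can be composed as follows. The base address is the one quoted above; the function name resolver_url is an assumption, and colons are left unescaped only because the example URL in this paper shows them that way.

```python
from urllib.parse import quote

# Base address of the resolution service as quoted in the text.
RESOLVER_BASE = "http://urn.kb.se/resolve"


def resolver_url(urn: str) -> str:
    """Return the resolution URL for a URN:NBN (hypothetical helper)."""
    # Keep ':' unescaped so the result matches the form shown in the paper.
    return f"{RESOLVER_BASE}?urn={quote(urn, safe=':')}"
```

Whether the service accepts other spellings of the query string is not stated here, so the sketch reproduces the documented form only.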

2.2 The DiVA Archive

Today the DiVA Archive consists of metadata files conforming to the DiVA Document Format, full-text files in arbitrary formats (PDF is the most common) and checksum files. The files are stored in dedicated folders in an ordinary file system. The file names are controlled and built on the URN:NBN identifiers, which link metadata files to their respective full-text files. Our experience so far shows that this is a convenient way to store data.

The archive has been developed according to the OAIS (Open Archival Information System) framework and reference model<11>. The administrative metadata in the DiVA Document Format refer to the OAIS model.

The structure of the archive is hierarchical and can be easily mapped to a file system or a native XML database. The root directory is called $archive_home and must be specified in the environment the archive system runs on. There is a file called readme in the root folder that describes the archive. The readme file contains information about the archive's structure, file name conventions, description of concepts, checksum algorithms<12> and references to documents stored in the archive that contain important information about the archive.

Each document in the archive has been assigned a unique and permanent URN:NBN identifier. The identifiers' constituent parts, which are separated by colons or hyphens, build up the hierarchical structure of the archive. For example the document with the identifier urn:nbn:se:uu:diva-1144 gives rise to the folder $archive_home/urn/nbn/se/uu/diva/1144 in the archive. This folder is also called a document root folder.
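The mapping from identifier to folder can be sketched as below. It is a minimal illustration, assuming that the identifier is split on every colon and hyphen; the function name document_root is hypothetical, and the archive root is passed in as a parameter rather than read from the environment.

```python
import re
from pathlib import PurePosixPath


def document_root(urn: str, archive_home: str) -> PurePosixPath:
    """Map a URN:NBN identifier to its document root folder.

    Sketch only: the constituent parts, separated by colons or hyphens,
    become nested folder names below the archive root.
    """
    parts = re.split(r"[:-]", urn.lower())
    if any(part == "" for part in parts):
        raise ValueError(f"malformed identifier: {urn!r}")
    return PurePosixPath(archive_home, *parts)
```

With an archive root of /archive, the identifier urn:nbn:se:uu:diva-1144 maps to /archive/urn/nbn/se/uu/diva/1144.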

A document root folder contains a changeable metadata file and numbered folders, starting with 1, containing different manifestations of the same document. The changeable metadata file can be altered at any time, both before and after manifestations are published. The metadata file conforms to the DiVA Document Format (see the section describing that format).

The name of a manifestation folder refers to a number specified in the manifestation section of the metadata file. In addition to one or more full-text files, a manifestation folder contains a copy of the changeable metadata file located in the document root folder. The copy is made automatically when a manifestation of the document is published. Because this copy cannot be changed after the manifestation is published, it is called unchangeable metadata, as opposed to the changeable metadata in the document root folder. A manifestation folder can also include supplementary files, e.g. errata, and files (often images) that the full-text files link to.
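The relation between changeable and unchangeable metadata can be illustrated with a short sketch. The file name metadata.xml and the function publish_manifestation are assumptions made for this example; the paper does not give the real file names, and the actual system may enforce immutability differently.

```python
import shutil
from pathlib import Path

# Hypothetical file name; the real naming convention is not given in the text.
METADATA_NAME = "metadata.xml"


def publish_manifestation(document_root: Path, number: int) -> Path:
    """Copy the changeable metadata file into manifestation folder `number`.

    The copy stands for the unchangeable metadata: this sketch refuses to
    overwrite a copy that already exists.
    """
    source = document_root / METADATA_NAME
    target_dir = document_root / str(number)
    target = target_dir / METADATA_NAME
    if target.exists():
        raise FileExistsError(f"manifestation {number} is already published")
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target
```

Later edits to the metadata file in the document root folder leave the copy in the manifestation folder untouched.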

Every file in the archive gets a checksum. A checksum is a number, calculated from the contents of a file, which is used to determine if the contents of a file are correct (i.e. to check a file's integrity). In the DiVA Archive checksums are stored in separate files.
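A checksum calculation and integrity check of this kind might look as follows in Python. The sketch assumes that the stored value is a hexadecimal digest and that "SHA" means SHA-1; neither the layout of the checksum files nor the SHA variant is specified in this paper, and the function names are hypothetical.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha1", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file's contents.

    `algorithm` is any name accepted by hashlib.new, e.g. "md5" or "sha1".
    The file is read in chunks so large full-text files need not fit in memory.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected: str, algorithm: str = "sha1") -> bool:
    """Check a file against a stored hex digest (assumed storage format)."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

A mismatch indicates that the file's contents differ from those present when the checksum was recorded.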

The files stored in the archive are also described in the metadata file. During the project it became apparent that file name guidelines are likely to change over time. It is therefore essential that file properties, e.g. file formats, are stored in the metadata file together with the names of the files they describe. The properties described in the metadata file include file name, document type (full-text, supplement), file size and file format (abbreviation, version, identifiers to a file format registry, description).

Figure 1: Structure of the DiVA Archive

After a certain period<13> following publication, the document and its metadata are delivered as a package to the Royal Library for long-term preservation. This package contains only selected manifestations of the document, together with the DiVA Document Schema and all bibliographic and administrative metadata. Each package is named after its URN:NBN identifier followed by the number of the manifestation it contains. This guarantees distinguishable and unique package names, e.g. urn:nbn:se:uu:diva-1144-2.
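The package naming rule can be sketched as a pair of helper functions. Both names are hypothetical, and the sketch assumes that the manifestation number is always the part after the last hyphen, as in the example above.

```python
def package_name(urn: str, manifestation: int) -> str:
    """Name a delivery package: identifier, hyphen, manifestation number."""
    if manifestation < 1:
        raise ValueError("manifestation numbers start with 1")
    return f"{urn}-{manifestation}"


def split_package_name(name: str) -> tuple[str, int]:
    """Recover the identifier and manifestation number from a package name."""
    urn, _, number = name.rpartition("-")
    if not urn or not number.isdigit():
        raise ValueError(f"not a package name: {name!r}")
    return urn, int(number)
```

For example, package_name("urn:nbn:se:uu:diva-1144", 2) gives urn:nbn:se:uu:diva-1144-2, and split_package_name reverses the operation.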


Footnotes:

<8>

This link can be used for resolving a URN:NBN: http://urn.kb.se/resolve?urn=...

<9>

See the agreement: http://www.ub.uu.se/diverse/avtal.pdf (in Swedish)

<10>

There is also an RFC published at http://www.ietf.org/rfc/rfc3188.txt

<11>

See: http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html

<12>

The algorithms are SHA and MD5

<13>

6 months as a first assumption



© This publication and its compilation in form and content is copyrighted. Every realization which is not explicitly allowed by copyright law requires a written agreement. Especially, this holds for reprography and processing / storing by electronic systems.
