Eva Müller, Uwe Klosa, Peter Hansson, Stefan Andersson, Erik Siira: Using XML for Long-term Preservation
The long-term preservation of digital objects involves a variety of challenges. The DiVA project has addressed some of the technical problems concerning storage and storage media in order to guarantee the maintenance and security of the DiVA Archive. But the focus of the DiVA project has been on ensuring the future use and understanding of the digital objects in the archive. There is no guarantee that it will be possible to use and to understand these objects in the distant future, but there are ways to increase the likelihood of success.
This assumption was the starting point for the discussions about the design of the DiVA Archive and the DiVA Workflow. We tried to find a practical and convenient way to minimize the risk of data loss, especially in the context of migrating the entire document and the connected metadata to other formats and media. Another important condition was to find a practical solution that is applicable to large-scale production and that can be part of an automated workflow. As a part of this workflow we established a connection to the National Library Archive, so that both the metadata of the digital objects and the digital objects themselves could be exchanged.
XML was discussed early on within the DiVA developer team as a possible format for long-term preservation. XML is an open and established notation. XML documents are in a human-readable text format and internationalised character sets are supported. These characteristics facilitate data migration and the documents are likely to have longevity. Therefore the decision was made to use XML as a format for storing descriptive and administrative metadata, as well as for the complete content of the digital objects. Thus, the DiVA Document Format was created.
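The properties named above (plain text, human-readable, international character support) can be illustrated with a minimal Python sketch. The element names `record`, `author` and `title` are hypothetical and are not taken from the DiVA Document Format; the sketch only shows that non-ASCII text survives a round trip through UTF-8 encoded XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the actual DiVA Document Format is not reproduced here.
record = ET.Element("record")
ET.SubElement(record, "author").text = "Eva Müller"
ET.SubElement(record, "title").text = "Using XML for Long-term Preservation"

# Serialize to UTF-8 bytes, as a file in an archive might be stored.
xml_bytes = ET.tostring(record, encoding="utf-8")

# Parsing the bytes again recovers the text, including the non-ASCII character.
restored = ET.fromstring(xml_bytes)
author = restored.findtext("author")

# The stored form is still readable as plain text.
print(xml_bytes.decode("utf-8"))
```

Because the stored bytes are ordinary text, a file like this can be inspected without the software that produced it, which is the characteristic the argument for migration relies on.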
The DiVA Document Format was developed to be compatible with a number of commonly used metadata standards relevant to electronic documents. At present, however, we need to focus on the management of two other types of content: images and formulas. The integration of these into the DiVA Document Format is still under development. Formulas in abstracts are already stored as MathML, but the MathML must be created manually and inserted into the abstract<14>. For a production workflow it is absolutely necessary to have a tool that can construct these formulas automatically.
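To make the idea of a formula stored as MathML inside an abstract concrete, the following Python sketch parses a small example. The surrounding `abstract` and `para` elements and the formula itself are assumptions made for illustration; only the MathML namespace URI is the one defined by the W3C.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical abstract markup; only the <math> content follows a fixed standard.
abstract_xml = f"""
<abstract>
  <para>The energy is given by
    <math xmlns="{MATHML_NS}">
      <mi>E</mi><mo>=</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup>
    </math>.
  </para>
</abstract>
""".strip()

abstract = ET.fromstring(abstract_xml)

# Find every embedded formula by its namespace-qualified tag.
formulas = abstract.findall(f".//{{{MATHML_NS}}}math")

# Collect the identifiers (<mi>) used in the formulas, in document order.
identifiers = [mi.text for f in formulas for mi in f.iter(f"{{{MATHML_NS}}}mi")]
```

Since the formula is structured markup rather than an image, it stays searchable and can be migrated together with the rest of the abstract.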
The latest result of our development work is the transformation of MS Word documents to the DocBook format with the help of Open Office. Open Office is able to load MS Word documents, and it stores documents in an internal XML format. Our development team has created an XSL style sheet that transforms the Open Office XML into the DiVA Document Format. Open Office XML already stores formulas as MathML, so it may be possible in the near future to use Open Office to move formulas directly into the DiVA Document Format.
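The kind of mapping such a style sheet performs can be sketched as follows. This is not the project's XSL style sheet: it is a plain Python stand-in that uses the standard library instead of XSLT, the namespace URIs are assumed to be those of the Open Office 1.x file format, and the target elements `section`, `title` and `para` are hypothetical, DocBook-like names.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for Open Office 1.x documents.
OFFICE_NS = "http://openoffice.org/2000/office"
TEXT_NS = "http://openoffice.org/2000/text"

source_xml = f"""
<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <office:body>
    <text:h text:level="1">Introduction</text:h>
    <text:p>First paragraph.</text:p>
    <text:p>Second paragraph.</text:p>
  </office:body>
</office:document-content>
""".strip()


def convert(source: str) -> ET.Element:
    """Map headings and paragraphs to hypothetical DocBook-like elements."""
    root = ET.fromstring(source)
    body = root.find(f"{{{OFFICE_NS}}}body")
    section = ET.Element("section")
    for child in body:
        if child.tag == f"{{{TEXT_NS}}}h":
            ET.SubElement(section, "title").text = child.text
        elif child.tag == f"{{{TEXT_NS}}}p":
            ET.SubElement(section, "para").text = child.text
    return section


result = ET.tostring(convert(source_xml), encoding="unicode")
print(result)
```

A real style sheet would also have to handle nested formatting, lists, tables and embedded objects; the sketch covers only flat headings and paragraphs.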
With the DiVA Document Format and the DiVA Archive, the first fundamental steps in the construction of an archive for long-term preservation have been taken. The use of URN:NBN as a unique identifier and the exchange of metadata and archive files with the National Library Archive were the next important steps. Though the first implementation of the entire archiving workflow will not be completed until autumn of this year, electronic format will already be the primary mode of publication for some theses during the next semester. This decision to move so quickly, made by the faculty and the authors themselves, demonstrates their confidence in the DiVA Archive.
The outcome of the work we have presented here is not only the result of the work of our development team. It is also the result of discussions with other developers, librarians and researchers. We would especially like to thank our reference group<15> for their feedback and the support they continue to give us.
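As an illustration of how a URN:NBN identifier is structured, the following Python sketch splits one into its parts. The pattern is a loose reading of the general form `urn:nbn:<country code>[:<sub-namespace>...]-<string>`, and the identifier used in the example is hypothetical; neither is taken from the DiVA implementation.

```python
import re
from typing import Optional

# Loose pattern: two-letter country code, optional sub-namespaces, hyphen, local string.
URN_NBN = re.compile(
    r"^urn:nbn:(?P<country>[a-z]{2})(?P<subns>(?::[a-z0-9]+)*)-(?P<nbn>.+)$",
    re.IGNORECASE,
)


def parse_urn_nbn(urn: str) -> Optional[dict]:
    """Return the parts of a URN:NBN, or None if the string does not fit the pattern."""
    match = URN_NBN.match(urn.strip())
    if match is None:
        return None
    return {
        "country": match.group("country").lower(),
        "sub_namespaces": [p.lower() for p in match.group("subns").split(":") if p],
        "nbn": match.group("nbn"),
    }


# Hypothetical identifier, used only to exercise the parser.
parsed = parse_urn_nbn("urn:nbn:se:uu:diva-1234")
print(parsed)
```

A check of this kind could run before identifiers are exchanged with another archive, so that malformed values are caught early.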
Tim DiLauro gave us useful feedback on this paper.
Footnotes:
<14> With the help of the features in Open Office, http://www.openoffice.org
<15> http://publications.uu.se/epcentre/diverse/refgrupp.html
© This publication and its compilation in form and content is copyrighted. Every realization which is not explicitly allowed by copyright law requires a written agreement. Especially, this holds for reprography and processing / storing by electronic systems.
ETD Proceeding DTD. HTML version created: Tue May 20 15:50:59 2003