Eva Müller, Uwe Klosa, Peter Hansson, Stefan Andersson, Erik Siira: Using XML for Long-term Preservation

1. XML as Long-term Preservation Format

One of the objectives of the DiVA project is to explore the possibility of using XML as a format for long-term archiving.

There are several advantages to using XML-encoded documents for long-term archiving. XML is an open and established notation. XML documents are in a human-readable text format, and internationalised character sets are supported. These characteristics facilitate data migration and make it likely that the documents will remain usable for a long time. For these reasons XML seemed a good choice, but to ensure success, the practical use of XML in different parts of the system was evaluated before a decision about the design was made.

In the DiVA project XML is used for more than archiving. It is also used for communication between different processes within the system and for internal communication in the development team. Data can be validated against an XML schema, and the dynamic web interface is built on XML and XSLT.

1.1 XML Schema

XML Schema provides a means for defining the structure, content and semantics of XML documents. It is an XML-based alternative to the XML Document Type Definition (DTD). Because the primary reason for using XML was to support long-term archiving, the most popular DTDs and schemas for documents, namely DocBook and TEI, were evaluated. The evaluation found limitations regarding the metadata descriptions needed in the DiVA project.

Because of the need to combine administrative metadata, descriptive metadata and content, a new schema was developed that meets the needs of the DiVA project. This schema combines the DocBook schema (derived from the DocBook DTD) for the textual parts of the document with the bibliographic metadata and administrative metadata for long-term preservation.

XML Schema was chosen over XML DTD because it is written in XML and supports many data types, self-defined data types and different namespaces. The support for different data types offers several advantages. It is possible to describe permissible document content, to validate the correctness of data, to define restrictions on data (data facets), to define data formats (data patterns) and to convert between different data types. It is also easier to work with data coming from a database.
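As a minimal sketch of what data facets and data patterns look like in XML Schema, the fragment below restricts one built-in type with a facet and another with a pattern. The type and element names (pageCountType, issnType, pages, issn) and the upper bound of 9999 are assumptions made for illustration; they are not taken from the DiVA schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Facet: limit a built-in integer type to an upper bound (hypothetical value). -->
  <xs:simpleType name="pageCountType">
    <xs:restriction base="xs:positiveInteger">
      <xs:maxInclusive value="9999"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Pattern: four digits, a hyphen, three digits and a check character (digit or X). -->
  <xs:simpleType name="issnType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{4}-[0-9]{3}[0-9X]"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical elements that use the two restricted types. -->
  <xs:element name="pages" type="pageCountType"/>
  <xs:element name="issn" type="issnType"/>

</xs:schema>
```

A validating parser would then reject a <pages> value of 0 or an <issn> value without the hyphen, so such errors are caught before the data reaches the archive.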

During the development, it was noticed that XML Schema facilitated the communication between the developers by providing a simple mechanism for writing formal specifications of subsystem interfaces.

1.2 Comparison of DocBook and TEI

TEI<2> and DocBook<3> are two widely used recommendations for encoding textual material in electronic form. These two recommendations were compared to find which is more appropriate and convenient for representing full-text documents in the DiVA Archive.

A logical unit, i.e. a combination of XML elements and/or XML attributes that has a certain well-defined meaning, can be expressed differently in TEI and DocBook. A logical unit that consists of a single well-defined element in DocBook is often composed of both a general element and an attribute in the TEI representation. The attribute values are not defined in the TEI recommendation and therefore have to be defined locally. Without a prior agreement on these values, others are therefore unlikely to interpret a TEI-encoded document correctly.

Elements that define the structure of documents, e.g. headers, chapters, lists and tables, are more specifically defined in DocBook than in TEI. For the publication of documents such as PhD theses or scientific papers it is therefore more convenient to use DocBook, because the relevant structural elements are well defined. But if a text is to be marked up in detail both semantically and structurally, for example in order to create scholarly archives of diverse kinds of historical sources or for linguistic purposes, the more general TEI scheme would be a better choice.

The main purpose in the DiVA project is to store the structure of the contents of the documents and not to store the semantics. Therefore DocBook was chosen to mark up the content.

Element | TEI | DocBook

Heading 1 | <div1 type="chapter" n="1"><head n="1">Heading 1</head></div1> | <chapter id="1"><title>Heading 1</title></chapter>

Superscript | <hi rend="sup">text</hi> | <superscript>text</superscript>

Lists | <list type="..."></list> | <orderedlist numeration="...">...</orderedlist>

1.3 DiVA Document Format

Version 1.0 of the DiVA Document Format, which is defined by an XML Schema, consists of 99 elements<4>. Administrative elements are combined with descriptive elements so that a publication can be described in the same XML document file that contains its content. Many element names exist in both singular and plural form. The plural form is always used to name container elements, and a container element contains one or more elements in the corresponding singular form. For example, <creators> contains one or more <creator> elements, <titles> contains <title> elements, and so on. The container elements group elements that contain the same type of information. They can also group elements that contain closely related information<5>. This makes it easier for human readers to find information quickly in the document. Machines also benefit from the short distance between interrelated pieces of information.
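The plural/singular container convention can be sketched as follows. Only the element names <creators>, <creator>, <titles> and <title> come from the text; the enclosing <document> parent, the text content and the absence of child elements and namespaces are simplifying assumptions, so this is an illustration rather than a valid DiVA document.

```xml
<document>
  <!-- Container element (plural) grouping elements of the same kind (singular). -->
  <creators>
    <creator>First author (placeholder)</creator>
    <creator>Second author (placeholder)</creator>
  </creators>
  <!-- A container may hold a single element; the plural name is still used. -->
  <titles>
    <title>Title of the publication (placeholder)</title>
  </titles>
</document>
```

Because all creators sit under one container, a reader or a program looking for authors needs to inspect only the <creators> subtree.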

One of the advantages of XML Schema over DTD is that it has many built-in data types, such as numerical values and dates. Where applicable, the predefined XML Schema data types have been used in the DiVA Document Format. But there are cases where the built-in types are not an appropriate choice. An example is the type xs:date, which represents a date as defined in ISO 8601 with the lexical form CCYY-MM-DD. Since xs:date requires a year (CCYY), a month (MM) and a day (DD), it was necessary to define a different date format that allows the year to be specified even when the month and day are unknown.
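One way to express such a date format in XML Schema, shown here as a sketch and not as the definition actually used in the DiVA schema, is a union of the built-in types xs:date, xs:gYearMonth and xs:gYear. The names partialDateType and date are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- A value is valid if it matches any one of the three member types:
       a full date, a year and month, or a year alone. -->
  <xs:simpleType name="partialDateType">
    <xs:union memberTypes="xs:date xs:gYearMonth xs:gYear"/>
  </xs:simpleType>

  <!-- Hypothetical element using the union type. -->
  <xs:element name="date" type="partialDateType"/>

</xs:schema>
```

With this type, the values 2003, 2003-05 and 2003-05-20 would all be valid, while a value that lacks the year, such as 05-20, would be rejected.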

XML Schema supports user-defined complex types. A complex type describes a complex structure built from elements and other types. The most commonly used complex types in the DiVA Document Format are personType and organisationType (see appendix for both), which define a person and an organisation respectively.

All XML elements defined in the XML Schema have English-language names. The XML Schema defines both general and specific elements. General elements make it possible to introduce new concepts without changing the XML Schema, which promotes the scalability of concepts. Nevertheless, it is not always suitable to generalise well-defined concepts too much, as in the case of creators and contributors, where one option would have been to use a person or an organisation element with attributes. The documents should also be easy for humans to read, and therefore both general and specific elements have been defined.

The order of the elements was not a focus while developing the DiVA Document Format, with some exceptions. The <properties> element, where applicable, has to be the first child element, so that a property is near the parent element it describes. The <specifics> child elements have to be in ascending order of generality, i.e. the specific groups have to come before the more general groups of elements. In object-oriented programming languages, this facilitates the creation of raw XML from objects, because the transformation process can start by translating the specific information stored in the subclasses and end by translating the general data in the superclasses.

The DiVA Document Format defines one root element, which is called <documents>. This makes it possible to save more than one document in one XML file, which is needed in some applications in the DiVA project<6>. For archiving purposes, however, each document is stored in a separate file containing a <documents> element with exactly one <document> element.

The <properties> element (and its child <property>) is used in several constructs to give the parent element arbitrary attributes (not to be confused with XML attributes). Each property is defined independently of the others; if dependencies are crucial, another construct has to be used. An XML attribute is sometimes used on the <property> element to make clear what the property stands for. This construct has several advantages: new properties can easily be integrated into the schema without changing the main structure of the definition (scalability), every level in a hierarchical structure can be described in a plain fashion as properties, and when a property has an XML attribute it is easy for a machine to find the right section in the XML document.

The <identifiers> element is widely used in documents conforming to the DiVA Document Format. It gives the parent element one or more identifiers. Each <identifier> element consists of a (<properties>, <value>) pair: the properties describe the identifier, and the identifier itself is specified in the <value> element. The following identifiers are in use today: "local", which is an identifier that is only used within an organisation; "internal", which currently binds XML data to a relational database system; "ISSN" and "ISBN", which are used to identify series and publications; "URI" (uniform resource identifier), formatted as a web link (URL); "URN:NBN", a special national unique identifier; and language and country codes according to ISO 639 and ISO 3166, respectively.
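The (<properties>, <value>) pair might look roughly like the following sketch. The element names are those given in the text; the enclosing <document> parent, the attribute name "name" on <property>, the way the identifier kind is written as property content, and both identifier values are assumptions or made-up placeholders, not excerpts from a real DiVA record.

```xml
<document>
  <identifiers>
    <identifier>
      <properties>
        <!-- Hypothetical attribute "name" stating what the property stands for. -->
        <property name="type">URN:NBN</property>
      </properties>
      <!-- Made-up placeholder value, not a registered URN:NBN. -->
      <value>urn:nbn:se:uu:diva-0000</value>
    </identifier>
    <identifier>
      <properties>
        <property name="type">ISBN</property>
      </properties>
      <!-- Made-up placeholder value. -->
      <value>0-000-00000-0</value>
    </identifier>
  </identifiers>
</document>
```

Since every identifier has the same shape, a new kind of identifier can be added by introducing a new property value, without adding elements to the schema.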

The URN:NBN identifier is used to map electronic resources to URLs and as a primary key of the publications stored in DiVA.

<specifics> is a container element for child elements that are not generally applicable. It is convenient to put all elements that exist only for a certain publication type in one place where they can be found easily by both humans and machines.

<manifestations> is an element that contains the different manifestations or instances of the same publication<7>. Because a document can be stored in many different formats, both physical and electronic, each of them must be described individually. The formats can, for example, have different identifiers, be members of different series, and be published and distributed on certain dates by different organisations and/or persons. Today most of the doctoral theses stored in the DiVA system have two manifestations, a physical book and an electronic PDF file.

The manifestation element also contains metadata about migration from one format to another. If, for example, a PDF manifestation is migrated to a newer version of PDF, a new manifestation is created with information about the original manifestation, which also remains stored in the archive.
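A thesis with a printed book, an original PDF file and a migrated PDF file could be sketched as three manifestations. Only the element names <manifestations>, <manifestation>, <properties> and <property> appear in the text; placing <properties> inside <manifestation>, the attribute name "name", the property names "format" and "migratedFrom", and all values are hypothetical and serve only to illustrate the idea.

```xml
<manifestations>
  <!-- Manifestation 1: the printed book. -->
  <manifestation>
    <properties>
      <property name="format">print</property>
    </properties>
  </manifestation>
  <!-- Manifestation 2: the original PDF file, which stays in the archive. -->
  <manifestation>
    <properties>
      <property name="format">PDF (original version)</property>
    </properties>
  </manifestation>
  <!-- Manifestation 3: created by migrating manifestation 2 to a newer PDF version. -->
  <manifestation>
    <properties>
      <property name="format">PDF (newer version)</property>
      <!-- Hypothetical pointer back to the manifestation this one was derived from. -->
      <property name="migratedFrom">manifestation 2</property>
    </properties>
  </manifestation>
</manifestations>
```

Keeping the original and the migrated file as separate manifestations means that the migration history can be read from the document itself.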


Footnotes:

<1> See: http://urn.kb.se/resolve

<2> See: http://www.tei-c.org/

<3> See: http://www.docbook.org/

<4> See: http://publications.uu.se/schema/1.0/diva.xsd

<5> These can be elements that exist only for a specific type of document.

<6> Delivery of search results to the search interface on the website and the application used to maintain the archive.

<7> Some inspiration has been gathered from the work on Functional Requirements for Bibliographic Records (FRBR) by IFLA. See: http://www.ifla.org/VII/s13/frbr/frbr.htm



© This publication and its compilation in form and content is copyrighted. Every realization which is not explicitly allowed by copyright law requires a written agreement. Especially, this holds for reprography and processing / storing by electronic systems.

ETD Proceeding DTD
HTML - Version create: Tue May 20 15:50:59 2003