
2008-08-08, Conference publication, DOI: 10.18452/1252
Automatic Metadata Extraction from Museum Specimen Labels
dc.contributor.author: Heidorn, P. Bryan
dc.contributor.author: Wei, Qin
dc.contributor.editor: Greenberg, Jane
dc.date.accessioned: 2017-06-15T12:17:47Z
dc.date.available: 2017-06-15T12:17:47Z
dc.date.created: 2008-08-08
dc.date.issued: 2008-08-08
dc.identifier.uri: http://edoc.hu-berlin.de/18452/1904
dc.description.abstract: This paper describes the information properties of museum specimen labels and machine learning tools to automatically extract Darwin Core (DwC) and other metadata from these labels processed through Optical Character Recognition (OCR). The DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service so that users can reference digital images of specimens and receive back an extended Darwin Core XML representation of the content of the label. This automated extraction task is made more difficult by the high variability of museum label formats, OCR errors, and the open-class nature of some elements. In this paper we introduce our overall system architecture and variability-robust solutions, including the application of Hidden Markov and Naïve Bayes machine learning models, data cleaning, use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation with noisy text and weakly ordered elements. [eng]
dc.language.iso: eng
dc.publisher: Humboldt-Universität zu Berlin
dc.subject: automatic metadata extraction [eng]
dc.subject: machine learning [eng]
dc.subject: Hidden Markov Model [eng]
dc.subject: Naïve Bayes [eng]
dc.subject: Darwin Core [eng]
dc.title: Automatic Metadata Extraction from Museum Specimen Labels
dc.type: conferenceObject
dc.identifier.urn: urn:nbn:de:kobv:11-10092676
dc.identifier.doi: http://dx.doi.org/10.18452/1252
local.edoc.container-title: International Conference on Dublin Core and Metadata Applications
local.edoc.container-title: Metadata for Semantic and Social Applications 22 - 26 September 2008, Berlin
local.edoc.container-title: DC-2008
local.edoc.pages: 12
local.edoc.type-name: Conference publication
local.edoc.container-type: conference
local.edoc.container-type-name: Conference
local.edoc.container-event: International Conference on Dublin Core and Metadata Applications - Metadata for Semantic and Social Applications 22 - 26 September 2008, Berlin, DC-2008, 22.09.2008 - 26.09.2008, Humboldt-Universität zu Berlin, pp. 57-68
local.edoc.container-firstpage: 57
local.edoc.container-lastpage: 68
