edoc-Server
Open-access publication server of Humboldt-Universität
  • Conference proceedings (Tagungs- und Konferenzbände)
  • International Conference on Dublin Core and Metadata Applications DC-2008 (Humboldt-Universität zu Berlin, 22.09.2008 - 26.09.2008)
2008-08-08 | Conference publication | DOI: 10.18452/1252
Automatic Metadata Extraction from Museum Specimen Labels
Heidorn, P. Bryan
Wei, Qin
This paper describes the information properties of museum specimen labels and machine learning tools that automatically extract Darwin Core (DwC) and other metadata from these labels after they have been processed through Optical Character Recognition (OCR). The DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service so that users can reference digital images of specimens and receive back an extended Darwin Core XML representation of the content of the label. This automated extraction task is made more difficult by the high variability of museum label formats, OCR errors, and the open-class nature of some elements. In this paper we introduce our overall system architecture and variability-robust solutions, including the application of Hidden Markov and Naïve Bayes machine learning models, data cleaning, use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation with noisy text and weakly ordered elements.
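To make the Naïve Bayes part of the abstract concrete, the following is a minimal sketch of how tokens from an OCR'd label might be assigned to metadata elements. It is not the HERBIS Learning System: the class `NaiveBayesFieldTagger`, the helper `token_features`, the toy training pairs, and the choice of features are all assumptions made for illustration. The labels `genus`, `specificEpithet`, and `year` are Darwin Core term names; `collectorMarker` is a hypothetical stand-in for a field element identifier such as "Coll.".

```python
import math
from collections import Counter, defaultdict


def token_features(token):
    """Coarse, noise-tolerant features for one OCR token (illustrative choice)."""
    feats = ["lower=" + token.lower()]
    if token.isdigit():
        feats.append("all_digits")
    if token[:1].isupper():
        feats.append("init_cap")
    if token.endswith("."):
        feats.append("ends_period")
    return feats


class NaiveBayesFieldTagger:
    """Multinomial-style Naive Bayes with add-alpha smoothing (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()            # how often each label was seen
        self.feat_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()                       # all features seen in training

    def fit(self, examples):
        for token, label in examples:
            self.label_counts[label] += 1
            for feat in token_features(token):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, token):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, n in self.label_counts.items():
            # log prior plus smoothed log likelihood of each feature
            score = math.log(n / total)
            denom = sum(self.feat_counts[label].values()) + self.alpha * vocab_size
            for feat in token_features(token):
                score += math.log((self.feat_counts[label][feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training pairs, invented for this example (not data from the paper).
TRAIN = [
    ("Quercus", "genus"), ("Acer", "genus"), ("Carex", "genus"),
    ("alba", "specificEpithet"), ("rubrum", "specificEpithet"),
    ("stricta", "specificEpithet"),
    ("Coll.", "collectorMarker"), ("Leg.", "collectorMarker"),
    ("1897", "year"), ("1902", "year"), ("1923", "year"),
]

tagger = NaiveBayesFieldTagger().fit(TRAIN)
for tok in ["Salix", "nigra", "1911"]:
    print(tok, "->", tagger.predict(tok))
```

None of the three test tokens appears in the training pairs, so the predictions rest on the shape features (capitalization, digits) and the smoothed priors. Because each token is classified independently, this sketch ignores label order; a sequence model such as the Hidden Markov model named in the abstract would also take neighbouring elements into account.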
Files in this item
heidorn.pdf — Adobe PDF — 392.7 Kb
MD5: 672959d240b032dba31095b69f12cfa3
In Copyright
A service of University Library and Computer and Media Service
© Humboldt-Universität zu Berlin
 
Permanent URL
https://doi.org/10.18452/1252