ANNIS: a search tool for multi-layer annotated corpora
Philosophische Fakultät II
ANNIS (see Dipper & Götze 2005; Chiarcos et al. 2008) is a flexible web-based corpus architecture for search and visualization of multi-layer linguistic corpora. By multi-layer we mean that the same primary datum may be annotated independently with (i) annotations of different types (spans, DAGs with labelled edges and arbitrary pointing relations between terminals or non-terminals), and (ii) annotation structures that possibly overlap and/or conflict hierarchically. In this paper we present the different features of the architecture as well as actual use cases for corpus linguistic research on such diverse areas as information structure, learner language and discourse level phenomena. The supported search functionalities of ANNIS2 include exact and regular expression matching on word forms and annotations, as well as complex relations between individual elements, such as all forms of overlapping, contained or adjacent annotation spans, hierarchical dominance (children, ancestors, left- or rightmost child etc.) and more. Alternatively to the query language, data can be accessed using a graphical query builder. Query matches are visualized depending on annotation types: annotations referring to tokens (e.g. lemma, POS, morphology) are shown immediately in the match list. Spans (covering one or more tokens) are displayed in a grid view, trees/graphs in a tree/graph view, and pointing relations (such as anaphoric links) in a discourse view, with same-colour highlighting for coreferent elements. Full Unicode support is provided and a media player is embedded for rendering audio files linked to the data, allowing for a large variety of corpora. Corpus data is annotated with automatic tools (taggers, parsers etc.) or taskspecific expert tools for manual annotation, and then mapped onto the interchange format PAULA (Dipper 2005), where stand-off annotations refer to the same primary data. Importers exist for many formats, including EXMARaLDA (Schmidt 2004), TigerXML (Brants & Plaehn 2000), MMAX2 (Müller & Strube 2006), RSTTool (O’Donnell 2000), PALinkA (Orasan 2003) and Toolbox (Stuart et al. 2007). Data is compiled into a relational DB for optimal performance. Query matches and their features can also be exported in the ARFF format and processed with the data mining tool WEKA (Witten & Frank 2005), which offers implementations of clustering and classification algorithms. ANNIS2 compares favourably with search functionalities in the above tools as well as other corpus search engines (EXAKT, http://www.exmaralda.org/exakt.html, TIGERSearch, Lezius,2002, CWB, Christ 1994) and other frameworks/architectures (NITE, Carletta et al. 2003, GATE, Cunningham, 2002).
Dateien zu dieser Publikation