Refine
Year of publication
Document Type
- Conference Proceeding (18)
- Part of a Book (12)
- Article (1)
- Doctoral Thesis (1)
Keywords
- Annotation (32) (remove)
Publicationstate
- Veröffentlichungsversion (23)
- Postprint (4)
- Zweitveröffentlichung (3)
Reviewstate
Publisher
- Institut für Deutsche Sprache (6)
- European Language Resources Association (ELRA) (2)
- Springer (2)
- de Gruyter (2)
- ACL (1)
- Association for Computational Linguistics (1)
- Berlin-Brandenburgische Akademie der Wissenschaften (1)
- Bielefeld University (1)
- Extreme Markup Languages Conference (1)
- Gardez! Verlag (1)
Making CONCUR work
(2005)
The SGML feature CONCUR allowed for a document to be simultaneously marked up in multiple conflicting hierarchical tagsets but validated and interpreted in one tagset at a time. Alas, CONCUR was rarely implemented, and XML does not address the problem of conflicting hierarchies at all. The MuLaX document syntax is a non-XML syntax that enables multiply-encoded hierarchies by distinguishing different “layers” in the hierarchy by adding a layer ID as a prefix to the element names. The IDs tie all the elements in a single hierarchy together in an “annotation layer”. Extraction of a single annotation layer results in a well-formed XML document, and each annotation layer may be associated with an XML schema. The MuLaX processing model works on the nodes of one annotation layer at a time through Xpath-like navigation. CONCUR lives!
This paper deals with the problem of how to interrelate theory-specific treebanks and how to transform one treebank format to another. Currently, two approaches to achieve these goals can be differentiated. The first creates a mapping algorithm between treebank formats. Categories of a source format are transformed into a target format via a given set of general or language-specific mapping rules. The second relates treebanks via a transformation to a general model of linguistic categories, for example based on the EAGLES recommendations for syntactic annotations of corpora, or relying on the HPSG framework. This paper proposes a new methodology as a solution for these desiderata.
This paper describes a corpus of Japanese task-oriented dialogues, i.e. its data, annotations, analysis methodology and preliminary results for the modeling of co-referential phenomena. Current corpus based approaches to co-reference concentrate on textual data from English or other European languages. Hence, the emerging language-general models of co-reference miss input from dialogue data of non-European languages. We aim to fill this gap and contribute to a model of co-reference on various language-specific and language-general levels.
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation comprises the separation of conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for the representation of these annotations is a multi-rooted tree that again can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed as well as the advantages and drawbacks associated with the use of the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.
Research today is often performed in collaborated projects composed of project partners with different backgrounds and from different institutions and countries. Standards can be a crucial tool to help harmonizing these differences and to create sustainable resources. However, choosing a standard depends on having enough information to evaluate and compare different annotation and metadata formats. In this paper we present ongoing work on an interactive, collaborative website that collects information on standards in the field of linguistics as a means to guide interested researchers.
The paper discusses two topics: firstly an approach of using multiple layers of annotation is sketched out. Regarding the XML representation this approach is similar to standoff annotation. A second topic is the use of heterogeneous linguistic resources (e.g., XML annotated documents, taggers, lexical nets) as a source for semiautomatic multi-dimensional markup to resolve typical linguistic issues, dealing with anaphora resolution as a case study.