Refine
Document Type
- Conference Proceeding (3)
- Article (1)
Has Fulltext
- yes (4)
Is part of the Bibliography
- no (4)
Keywords
- Annotation (4) (remove)
Publicationstate
- Postprint (1)
- Veröffentlichungsversion (1)
Reviewstate
Publisher
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation comprises the separation of conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for the representation of these annotations is a multi-rooted tree that again can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed as well as the advantages and drawbacks associated with the use of the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.
This paper discusses work on the sustainability of linguistic resources as it was conducted in various projects, including the work of a three year project Sustainability of Linguistic Resources which finished in December 2008, a follow-up project, Sustainable linguistic data, and initiatives related to the work of the International Organization of Standardization (ISO) on developing standards for linguistic resources. The individual projects have been conducted at German collaborative research centres at the Universities of Potsdam, Hamburg and Tübingen, where the sustainability work was coordinated.
We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s specific research needs. SPLICR also provides an interface that enables users to query and to visualise corpora. The project in which the system is being developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) To process and to archive sustainably the resources so that they are still available to the research community in five, ten, or even 20 years time. (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annota-tions. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.