Twenty-two historical encyclopedias encoded in TEI: a new resource for the Digital Humanities
(2020)
This paper accompanies the corpus publication of EncycNet, a novel XML/TEI annotated corpus of 22 historical German encyclopedias from the early 18th to early 20th century. We describe the creation and annotation of the corpus, including the rationale for its development, suggested methodology for TEI annotation, possible use cases and future work. While many well-developed annotation standards for lexical resources exist, none can adequately model the encyclopedias at hand, and we therefore suggest how the TEI Lex-0 standard may be modified with additional guidelines for the annotation of historical encyclopedias. As the digitization and annotation of historical encyclopedias are settling on TEI as the de facto standard, our methodology may inform similar projects.
The paper discusses, from various angles, the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox's Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software that is licensed free of charge to academic institutions, e.g. Helmut Schmid's TreeTagger. The second part focuses on the linguistic interpretability of the corpus annotations and on more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that, unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory- and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher who uses the annotations blindly runs the risk of studying not the language itself but the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them, and discusses the opportunities that annotations offer for corpus-based and corpus-driven grammatical research within a scientifically sound methodology.
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora are annotated using markup languages that are based on the Annotation Graph framework, on the upcoming Linguistic Annotation Format ISO standard, or on tag sets defined by or derived from the TEI guidelines. A unified representation separates the conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for representing these annotations is a multi-rooted tree, which in turn can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed, as are the advantages and drawbacks of using the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.
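As an illustration only, the following minimal sketch shows how one token's annotations might be written as a TEI feature structure using the `<fs>`, `<f>` and `<symbol>` elements. The helper name `token_to_fs`, the `type` value and the feature names are assumptions made for this example and are not taken from the article.

```python
import xml.etree.ElementTree as ET


def token_to_fs(features):
    """Build a TEI-style <fs> element from a dict of feature names and values.

    Hypothetical helper: the feature inventory and the type value "token"
    are illustrative assumptions, not the article's actual mapping.
    """
    fs = ET.Element("fs", {"type": "token"})
    for name, value in features.items():
        # Each feature is an <f name="..."> holding a symbolic value.
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "symbol", {"value": value})
    return fs


fs = token_to_fs({"pos": "NN", "case": "nom"})
print(ET.tostring(fs, encoding="unicode"))
```

The same pattern could be applied once per annotation layer, so that each layer ends up in its own file of feature structures over the same text.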
This paper discusses work on the sustainability of linguistic resources as conducted in various projects: the three-year project Sustainability of Linguistic Resources, which finished in December 2008; a follow-up project, Sustainable linguistic data; and initiatives related to the work of the International Organization for Standardization (ISO) on developing standards for linguistic resources. The individual projects were conducted at German collaborative research centres at the Universities of Potsdam, Hamburg and Tübingen, where the sustainability work was coordinated.
We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one's specific research needs. SPLICR also provides an interface that enables users to query and visualise corpora. The project in which the system is being developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) to process and sustainably archive the resources so that they are still available to the research community in five, ten, or even 20 years' time; (b) to enable researchers to query the resources both on the level of their metadata and on the level of linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.
This volume addresses the state and development of research infrastructures for German linguistics and some neighbouring fields. A central aspect is the need to organize cooperation in research, both in the institutional sense and with regard to scholarly practice. This is done in consortia that serve as cooperation structures and bring together linguistics and language technology. Corpora, as a central research resource, and their exploitation through specialized, linguistically motivated information systems are of particular importance here. At the level of the data, annotation and modelling standards create the prerequisites for the sustainable usability of such resources.
The dissertation centres on the concept of information modelling, or more precisely "textual information modelling", building on a previously proposed distinction between a primary and a secondary level of information structuring. The primary level concerns the textual data themselves and their structuring, whereas the secondary level describes how the rule sets used for the primary levels can be related to alternative rule sets. Alongside this division into primary and secondary information structuring, the dissertation places the concept of multiple information structuring. This concept means that the primary level is multiplied as needed, while each of these levels refers to the same underlying data. This also has consequences for secondary information structuring. Information modelling is carried out with markup languages. The Standard Generalized Markup Language (SGML) provides a framework for this. Since its standardization in 1986, however, this formalism has not only been developed further: in 1998 the Extensible Markup Language (XML) was defined as a considerably simpler subset of SGML, and XML is now the centre of further developments in the field of markup languages. The approach developed for modelling linguistic information is based on XML, although the more extensive capabilities of SGML are also presented and discussed. XML cannot represent in a natural way information that cannot be structured in a single hierarchy (that is, as a mathematical tree). One solution to this problem is to distribute the structuring across several levels. This new solution is presented, discussed and modelled.
In this paper, we present the Multiple Annotation approach, which solves two problems: the problem of annotating overlapping structures, and the problem that arises when documents are to be annotated according to different, possibly heterogeneous tag sets. This approach has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. The files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) are described. These representations serve as a basis for several applications.
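The idea of separate annotation files linked only by their shared text can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the two layers (`lines_layer`, `sents_layer`), the element names `l` and `s`, and the helpers `layer_spans` and `crossing` are all hypothetical. Each layer is well-formed XML on its own; character offsets into the identical text make it possible to relate elements across layers and to detect structures that would overlap in a single file.

```python
import xml.etree.ElementTree as ET

# Two hypothetical annotation layers over the same primary text.
lines_layer = "<text><l>I saw the cat </l><l>sleep. It purred.</l></text>"
sents_layer = "<text><s>I saw the cat sleep.</s><s> It purred.</s></text>"


def layer_spans(xml_string):
    """Return the plain text and (tag, start, end) spans of all non-root elements."""
    root = ET.fromstring(xml_string)
    spans = []
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            walk(child)
            pos += len(child.tail or "")
        if elem is not root:
            spans.append((elem.tag, start, pos))

    walk(root)
    return "".join(root.itertext()), spans


def crossing(a, b):
    """True if two spans overlap without one containing the other."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


text_a, spans_a = layer_spans(lines_layer)
text_b, spans_b = layer_spans(sents_layer)

# The identical text is the implicit link between the layers.
assert text_a == text_b

conflicts = [(a, b) for a in spans_a for b in spans_b if crossing(a, b)]
print(conflicts)
```

Here the second line and the first sentence cross, which is why they could not be nested in one XML tree but cause no problem when kept in separate files.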