Corpus Linguistics
We present an approach to an aspect of managing complex access scenarios to large and heterogeneous corpora that involves handling user queries that, intentionally or due to the complexity of the queried resource, target texts or annotations outside of the given user’s permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages, by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.
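The idea of access-driven query rewriting can be illustrated with a minimal sketch. It is not KorAP's implementation: the `Query` class, the `restrict_to_permitted` function, and the corpus names are hypothetical, and the sketch only assumes that a query names the corpora it targets and that a user's permissions can be given as a set of corpus names. The search expression itself is left untouched; only the targeted corpora are narrowed, and each change is recorded so that it can be reported back to the user.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical query representation (not KorAP's actual data model)."""
    pattern: str                 # the user's search expression, never modified
    corpora: set[str]            # corpora the user asked to search
    rewrites: list[str] = field(default_factory=list)  # log of applied rewrites


def restrict_to_permitted(query: Query, permitted: set[str]) -> Query:
    """Return a new query limited to corpora the user is allowed to access.

    Corpora outside the user's permissions are dropped, and every removal
    is noted in the rewrite log instead of being applied silently.
    """
    allowed = query.corpora & permitted
    removed = sorted(query.corpora - permitted)
    notes = [f"removed inaccessible corpus: {name}" for name in removed]
    return Query(query.pattern, allowed, query.rewrites + notes)


# Example with made-up corpus names: one requested corpus is not permitted.
original = Query("[pos=NN]", {"open_news", "licensed_fiction"})
rewritten = restrict_to_permitted(original, {"open_news", "wiki"})
```

Returning a new object rather than modifying the original keeps the user's query available for comparison with the rewritten one.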
The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modern, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening new perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse extremely large amounts of primary data and annotations, while at the same time allowing an undistorted view of the primary un-annotated text, and thus fully satisfying expectations associated with a scientific tool. The project started in July 2011 and is funded until June 2014. The demo presentation in December will be the first version following a preliminary feature freeze, and will open the alpha testing phase of the project.
The present paper describes Corpus Query Lingua Franca (ISO CQLF), a specification designed at ISO Technical Committee 37 Subcommittee 4 “Language resource management” for the purpose of facilitating the comparison of properties of corpus query languages. We give an overview of the motivation for this endeavour and present its aims and its general architecture. CQLF is intended as a multi-part specification; here, we concentrate on the basic metamodel, which provides the frame into which the other parts fit.
The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox' Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software that is available free of charge to academic institutions under academic licenses, e.g. Helmut Schmid's TreeTagger. The second part focuses on the linguistic interpretability of the corpus annotations and more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory- and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher using the annotations blindly will run the risk of not actually studying the language but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology.
This paper introduces the recently started DRuKoLA project, which aims to provide mechanisms for flexibly drawing virtual comparable corpora from the German Reference Corpus DeReKo and the Reference Corpus of Contemporary Romanian Language CoRoLa, so that these virtual corpora can serve as an empirical basis for contrastive linguistic research.
Interoperability in an Infrastructure Enabling Multidisciplinary Research: The case of CLARIN
(2020)
CLARIN is a European Research Infrastructure providing access to language resources and technologies for researchers in the humanities and social sciences. It supports the use and study of language data in general and aims to increase the potential for comparative research of cultural and societal phenomena across the boundaries of languages and disciplines, all in line with the European agenda for Open Science. Data infrastructures such as CLARIN have recently embarked on the emerging frameworks for the federation of infrastructural services, such as the European Open Science Cloud and the integration of services resulting from multidisciplinary collaboration in federated services for the wider domain of the social sciences and humanities (SSH). In this paper we describe the interoperability requirements that arise through the existing ambitions and the emerging frameworks. The interoperability theme will be addressed at several levels, including organisation and ecosystem, design of workflow services, data curation, performance measurement and collaboration. For each level, some concrete outcomes are described.
KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.
Research Data Management in the Humanities: The Example of German Linguistics
(2013)
The core task of the Institut für Deutsche Sprache (IDS) is the study and documentation of the German language. To this end, the IDS collects and archives an extensive body of primary research data in the form of corpora of written and spoken language, as well as secondary data such as lexicographic resources. This contribution gives an overview of the data holdings of the IDS and its ongoing research collaborations in the field of long-term archiving. In this context, it presents the long-term archive of the IDS, which is currently under construction, together with its architecture, the underlying principles of data and metadata modelling, and the data-capture processes derived from them. The contribution closes with an outlook on the challenges and prospects of research data management from the perspective of German linguistics.