Korpuslinguistik (Corpus Linguistics)
The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, aims to develop a modern, state-of-the-art corpus-analysis platform capable of handling very large corpora and opening up perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse extremely large amounts of primary data and annotations, while at the same time allowing an undistorted view of the primary un-annotated text, thus fully satisfying the expectations associated with a scientific tool. The project started in July 2011 and is funded until June 2014. The demo presentation in December will be the first version following a preliminary feature freeze, and will open the alpha testing phase of the project.
We present an approach to an aspect of managing complex access scenarios to large and heterogeneous corpora that involves handling user queries that, intentionally or due to the complexity of the queried resource, target texts or annotations outside of the given user’s permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages, by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.
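The idea of rewriting a query because of access restrictions can be illustrated with a minimal sketch. It is not KorAP's implementation: the `Query` class, the `rewrite_for_access` function, and the corpus names are all hypothetical, and the query model is reduced to a search term plus a set of target corpora. The sketch only shows the general principle of narrowing a query to what the user may see and recording that the narrowing took place.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical, simplified query: a search term plus the corpora it targets."""
    term: str
    corpora: frozenset
    rewrites: list = field(default_factory=list)  # human-readable log of changes


def rewrite_for_access(query: Query, permitted: frozenset) -> Query:
    """Return a copy of the query restricted to corpora the user may access."""
    allowed = query.corpora & permitted
    removed = sorted(query.corpora - permitted)
    notes = list(query.rewrites)
    if removed:
        # Record the change so the user can see that the query was modified.
        notes.append("removed inaccessible corpora: " + ", ".join(removed))
    return Query(term=query.term, corpora=allowed, rewrites=notes)


# Assumed example data: one freely accessible and one licensed corpus.
original = Query(term="Baum", corpora=frozenset({"public-news", "licensed-fiction"}))
rewritten = rewrite_for_access(original, permitted=frozenset({"public-news"}))
print(sorted(rewritten.corpora))
print(rewritten.rewrites)
```

Returning a new object instead of modifying the original keeps the user's query intact, so the original and the rewritten form can be compared or reported side by side.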
KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communication. KorAP is open source, licensed under BSD-2, and available on GitHub.
In recent years, discourse linguistics has established itself as a linguistic subdiscipline that, in transtextual studies of linguistic patterns, reconstructs worlds of thought and imagination relevant to society as a whole. Digitalization has not only fundamentally changed our society and shaped new forms of communication and innovative cultural practices, but has also significantly influenced discourse-linguistic work. The establishment of discourse linguistics, as well as of discourse-oriented lexicography, was thus shaped by its close alignment with computer-assisted methods (Bubenhofer 2009, Teubert/Čermáková 2007, Halliday et al. 2004), which make large text collections accessible for discourse analysis. Because discourse-analytic research in the Foucauldian tradition is not interested in the individual attestation but works with contextual patterns and intertextual reference structures, corpus-based analysis offers a productive starting point for discourse studies. This applies in particular to discourse lexicography, in which dictionaries of cultural-historical discourses are compiled on a broad data basis.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus DeReKo in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP at the API and plugin levels. In addition, we report on advances in the EuReCo and ICC initiatives with the provision of comparable corpora, and on recent problems with license acquisition and our approaches to solving them, which use an indemnification clause and model licenses that include scientific exploitation.
Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian
(2019)
The paper presents the long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). The paper describes the decisions behind the kinds of texts collected, as well as the processing and annotation steps, highlighting the structure and importance of metadata to the corpus. The reader is also introduced to the three ways of exploring the rich linguistic data of the corpus. Besides querying the corpus, word embeddings extracted from it are useful for various natural language processing applications and for linguists, when user-friendly interfaces offer them the possibility to exploit the data.