Refine
Document Type
- Part of a Book (5)
- Conference Proceeding (2)
- Article (1)
Has Fulltext
- yes (8)
Keywords
- Korpusanalyseplattform (KorAP) (8) (remove)
Publicationstate
Reviewstate
- (Verlags)-Lektorat (5)
- Peer-Review (1)
Publisher
Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.
Die Korpusanalyseplattform KorAP ist von Grund auf sprachenunabhängig konzipiert. Dies gilt sowohl in Bezug auf die Lokalisierung der Benutzeroberfläche als auch hinsichtlich unterschiedlicher Anfragesprachen und der Unterstützung fremdsprachiger Korpora und ihren Annotationen. Diese Eigenschaften dienen im Rahmen der EuReCo Initiative aktuell besonders der Bereitstellung weiterer National- und Referenzkorpora neben DeReKo. EuReCo versucht, Kompetenzen beim Aufbau großer Korpora zu bündeln und durch die Verfügbarmachung vergleichbarer Korpora quantitative Sprachvergleichsforschung zu erleichtern. Hierzu bietet KorAP inzwischen, neben dem Zugang durch die Benutzeroberfläche, einen Web API Client an, der statistische Erhebungen, auch korpusübergreifend, vereinfacht.
Linguistic query systems are special purpose IR applications. We present a novel state-of-the-art approach for the efficient exploitation of very large linguistic corpora, combining the advantages of relational database management systems (RDBMS) with the functional MapReduce programming model. Our implementation uses the German DEREKO reference corpus with multi-layer
linguistic annotations and several types of text-specific metadata, but the proposed strategy is language-independent and adaptable to large-scale multilingual corpora.
KorAP, die neue Korpusanalyseplattform des IDS, die COSMAS II im Laufe der kommenden 2–3 Jahre ablösen wird, bietet gerade zur Erforschung grammatischer Variation einige besondere Funktionalitäten. Grundlegend ist beispielsweise, dass KorAP die Repräsentation und Abfrage beliebiger und beliebig vieler Annotationsschichten, zum Beispiel zu Konstituenz- und Dependenzrelationen, unterstutzt und damit die Suche nach speziellen grammatischen Phänomenen erleichtert oder erst möglich macht. Darüber hinaus unterstutzt KorAP die Konstruktion virtueller Korpora anhand von Metadatenvariablen und erleichtert damit kontrastive Untersuchungen. Der vorliegende Artikel erläutert die für die grammatische Variationsforschung relevanten KorAP-Funktionalitäten im Einzelnen und gibt einen Einblick in ihre Grundlagen.
KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DEREKO for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: An architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.
The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to tackle the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges that face it. Next, we outline our software implementation plan and describe development to-date.