Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.
The corpus analysis platform KorAP is designed from the ground up to be language-independent. This applies to the localization of the user interface as well as to the different query languages and the support for foreign-language corpora and their annotations. Within the EuReCo initiative, these properties currently serve in particular to make further national and reference corpora available alongside DeReKo. EuReCo seeks to pool expertise in building large corpora and to facilitate quantitative comparative language research by making comparable corpora available. To this end, KorAP now offers, in addition to access through the user interface, a Web API client that simplifies statistical surveys, including across corpora.
KorAP, the new corpus analysis platform of the IDS, which will replace COSMAS II over the next 2–3 years, offers several features that are particularly useful for research on grammatical variation. A fundamental one is that KorAP supports the representation and querying of arbitrary annotation layers, in any number, for example for constituency and dependency relations, which makes searching for specific grammatical phenomena easier or possible in the first place. In addition, KorAP supports the construction of virtual corpora based on metadata variables and thereby facilitates contrastive studies. This article explains in detail the KorAP features relevant to research on grammatical variation and gives an insight into their foundations.
Neues von KorAP (What's New in KorAP)
(2019)
The International Comparable Corpus (ICC) (Kirk/Čermáková 2017; Čermáková et al. 2021) is an open initiative which aims to improve the empirical basis for contrastive linguistics by compiling comparable corpora for many languages and making them as freely available as possible as well as providing tools with which they can easily be queried and analysed. In this contribution we present the first release of written language parts of the ICC which includes corpora for Chinese, Czech, English, German, Irish (partly), and Norwegian. Each of the released corpora contains 400k words distributed over 14 different text categories according to the ICC specifications. Our poster covers the design basics of the ICC, its TEI encoding, a demonstration of using the ICC via different query tools, and an outlook on future plans.
Similar to the European Reference Corpus EuReCo (Kupietz et al. 2020), ICC follows the approach of reusing existing linguistic resources wherever possible in order to cover as many languages as possible with realistic effort in as short a time as possible. In contrast to EuReCo, however, comparable corpus pairs are not defined dynamically in the usage phase, but the compositions of the corpora are fixed in the ICC design. The approaches are thus complementary in this respect. The design principles and composition of the ICC are based on those of the International Corpus of English (ICE) (Greenbaum (ed.) 1996), with the deviation that the ICC includes the additional text category blog post and excludes spoken legal texts (see Čermáková et al. 2021 for details). ICC’s fixed-design approach has the advantage that all single-language corpora in the ICC have the same composition with respect to the selected text types and that this guarantees that the selected broad spectrum of potential influencing variables for linguistic variation is always represented. The disadvantage, however, is that this can only be achieved for quite small corpora and that the generalisability of comparative findings based on the ICC corpora will often need to be checked on larger monolingual corpora or translation corpora (Čermáková/Ebeling/Oksefjell Ebeling forthcoming). Arguing that such issues with comparability and representativeness are inevitable, in one way or the other, and need to be dealt with, our poster will discuss and exemplify the text selections in more detail.
The corpus analysis platform KorAP is being developed at the Leibniz Institute for the German Language (IDS) as the successor to COSMAS II and provides comprehensive access to part of DeReKo (Kupietz et al. 2010). Although some functionality is still missing, KorAP is already ready for productive use. In the following, we use the study of social media corpora as an example to present some of its new possibilities and special features.
Making corpora accessible and usable for linguistic research is a huge challenge in view of (too) big data, legal issues and a rapidly evolving methodology. This affects not only the design of user-friendly graphical interfaces to corpus analysis tools, but also the availability of programming interfaces that support access to the functionality of these tools from various analysis and development environments. RKorAPClient is a new research tool in the form of an R package that interacts with the Web API of the corpus analysis platform KorAP, which provides access to large annotated corpora, including the German reference corpus DeReKo with 45 billion tokens. In addition to optionally authenticated KorAP API access, RKorAPClient provides further processing and visualization features to simplify common corpus analysis tasks. This paper introduces the basic functionality of RKorAPClient and exemplifies various analysis tasks based on DeReKo, which are bundled within the R package and can serve as a basic framework for advanced analysis and visualization approaches.
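To illustrate what access to a corpus platform through a Web API can look like from a scripting environment, the following is a minimal Python sketch. It is not RKorAPClient and not taken from the paper. The base URL, the `/search` path, the parameter names `q`, `ql` and `cq`, and the example virtual-corpus expression are assumptions made for illustration and should be checked against the KorAP API documentation. The helper names `build_search_url` and `instances_per_million` are hypothetical. The sketch only builds a request URL and normalizes a hit count; it sends no request.

```python
from urllib.parse import urlencode

# Assumed base URL of the KorAP Web API (illustrative, verify before use).
KORAP_API = "https://korap.ids-mannheim.de/api/v1.0"


def build_search_url(query, query_language="poliqarp", virtual_corpus=None,
                     base=KORAP_API):
    """Build a search URL for a corpus query (hypothetical helper).

    The parameter names q, ql and cq are assumptions about the API.
    """
    params = {"q": query, "ql": query_language}
    if virtual_corpus:
        # Restrict the search to a virtual corpus defined via metadata.
        params["cq"] = virtual_corpus
    return f"{base}/search?{urlencode(params)}"


def instances_per_million(hits, corpus_tokens):
    """Normalize an absolute hit count to instances per million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits * 1_000_000 / corpus_tokens


if __name__ == "__main__":
    # Example virtual-corpus expression; its syntax is an assumption.
    print(build_search_url("Hello", virtual_corpus="pubDate in 2019"))
    # Made-up counts, used only to show the normalization step.
    print(instances_per_million(hits=90, corpus_tokens=45_000_000))
```

Normalizing to instances per million tokens is a common way to make frequencies comparable across virtual corpora of different sizes, which is the kind of routine step a client library can take off the user's hands.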