Refine
Year of publication
Document Type
- Part of a Book (18)
- Conference Proceeding (9)
- Article (4)
- Working Paper (1)
Keywords
- Korpus <Linguistik> (29)
- Korpusanalyseplattform (KorAP) (5)
- Computerlinguistik (4)
- Deutsch (4)
- Forschungsdaten (4)
- Kontrastive Linguistik (4)
- Rumänisch (4)
- Sprachdaten (4)
- Benutzeroberfläche (3)
- Deutsches Referenzkorpus (DeReKo) (3)
Publicationstate
- Veröffentlichungsversion (17)
- Zweitveröffentlichung (10)
- Postprint (1)
Reviewstate
- Peer-Review (15)
- (Verlags)-Lektorat (10)
- Review-Status-unbekannt (1)
Publisher
- de Gruyter (6)
- Editura Academiei Române (3)
- European Language Resources Association (ELRA) (3)
- IDS-Verlag (3)
- Leibniz-Institut für Deutsche Sprache (2)
- CECL Papers 1 (1)
- De Gruyter (1)
- European Language Resources Association (1)
- European language resources association (ELRA) (1)
- Gesellschaft für Sprachtechnologie und Computerlinguistik (1)
We present an approach to an aspect of managing complex access scenarios to large and heterogeneous corpora that involves handling user queries that, intentionally or due to the complexity of the queried resource, target texts or annotations outside of the given user’s permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages, by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.
The KorAP project (“Korpusanalyseplattform der nächste Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut fUr Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modem, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening the perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse extremely large amounts of primary data and annotations, while at the same time allowing an undistorted view of the primary un-annotated text, and thus fully satisfying expectations associated with a scientific tool. The project started in July 2011 and is funded till June 2014. The demo presentation in December will be the first version following a preliminary feature freeze, and will open the alpha testing phase of the project.
It is well known that the distribution of lexical and grammatical patterns is size- and register-sensitive (Biber 1986, and later publications). This fact alone presents a challenge to many corpus-oriented linguistic studies focusing on a single language. When it comes to cross-linguistic studies using corpora, the challenge becomes even greater due to the lack of high-quality multilingual corpora (Kupietz et al. 2020; Kupietz/Trawiński 2022), which are comparable with respect to the size and the register. That was the motivation for the creation of the European Reference Corpus EuReCo, an initiative started in 2013 at the Leibniz Institute for the German Language (IDS) together with several European partners (Kupietz et al. 2020). EuReCo is an emerging federated corpus, with large virtual comparable corpora across various languages and with an infrastructure supporting contrastive research. The core of the infrastructure is KorAP (Diewald et al. 2016), a scalable open-source platform supporting the analysis and visualisation of properties of texts annotated by multiple and potentially conflicting information layers, and supporting several corpus query languages. Until recently, EuReCo consisted of three monolingual subparts: the German Reference Corpus DeReKo (Kupietz et al. 2018), the Reference Corpus of Contemporary Romanian Language (Barbu Mititelu/Tufiş/Irimia 2018), and the Hungarian National Corpus (Váradi 2002). The goal of the present submission is twofold. On the one hand, it reports about the new component of EuReCo: a sample of the National Corpus of Polish (Przepiórkowski et al. 2010). On the other hand, it presents the results of a new pilot study using the newly extended EuReCo. This pilot study investigates selected Polish collocations involving light verbs and their prepositional / nominal complements (Fig. 1) and extends the collocation analyses of German, Romanian and Hungarian (Fig. 2) discussed in Kupietz/Trawiński (2022).
The task-oriented and format-driven development of corpus query systems has led to the creation of numerous corpus query languages (QLs) that vary strongly in expressiveness and syntax. This is a severe impediment for the interoperability of corpus analysis systems, which lack a common protocol. In this paper, we present KoralQuery, a JSON-LD based general corpus query protocol, aiming to be independent of particular QLs, tasks and corpus formats. In addition to describing the system of types and operations that Koral- Query is built on, we exemplify the representation of corpus queries in the serialized format and illustrate use cases in the KorAP project.
The present paper examines a variety of ways in which the Corpus of Contemporary Romanian Language (CoRoLa) can be used. A multitude of examples intends to highlight a wide range of interrogation possibilities that CoRoLa opens for different types of users. The querying of CoRoLa displayed here is supported by the KorAP frontend, through the querying language Poliqarp. Interrogations address annotation layers, such as the lexical, morphological and, in the near future, the syntactical layer, as well as the metadata. Other issues discussed are how to build a virtual corpus, how to deal with errors, how to find expressions and how to identify expressions.
This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
The user interfaces for corpus analysis platforms must provide a high degree of accessibility for ordinary users and at the same time provide the possibility to answer complex research questions. In this paper, we present the design concepts behind the user interface of KorAP, a corpus analysis platform that has evolved into the main gateway to CoRoLa, the Reference Corpus of Contemporary Romanian Language. Based on established principles of user interface design, we show how KorAP addresses the challenge of providing a user-friendly interface for heterogeneous corpus data to a wide range of users with different research questions.
KoralQuery 0.3
(2015)
KoralQuery is a general corpus query protocol (i.e. independent of research tasks and corpus formats), serialized in JSON-LD [1]. KoralQuery focuses on simplicity of implementation rather than human readibility and writability. Support for a growing number of query languages is granted by the Koral serialization processor.
Die Korpusanalyseplattform KorAP ist von Grund auf sprachenunabhängig konzipiert. Dies gilt sowohl in Bezug auf die Lokalisierung der Benutzeroberfläche als auch hinsichtlich unterschiedlicher Anfragesprachen und der Unterstützung fremdsprachiger Korpora und ihren Annotationen. Diese Eigenschaften dienen im Rahmen der EuReCo Initiative aktuell besonders der Bereitstellung weiterer National- und Referenzkorpora neben DeReKo. EuReCo versucht, Kompetenzen beim Aufbau großer Korpora zu bündeln und durch die Verfügbarmachung vergleichbarer Korpora quantitative Sprachvergleichsforschung zu erleichtern. Hierzu bietet KorAP inzwischen, neben dem Zugang durch die Benutzeroberfläche, einen Web API Client an, der statistische Erhebungen, auch korpusübergreifend, vereinfacht.
Für die spezifischen Bedürfnisse der Schreibbeobachtung wurde das Orthografische Kernkorpus (OKK) als virtuelles Korpus in DeReKo entwickelt. Mit derzeit rund 14 Mrd. Token deckt es den Schriftsprachgebrauch in den deutschsprachigen Ländern im Zeitraum von 1995 bis in die Gegenwart ab. Der Zugriff über die Korpusanalyseplattform KorAP erlaubt nicht nur die Nutzung verschiedener Annotationen, sondern über die API-Schnittstellen auch die Einbindung in diverse Auswertungsumgebungen wie RStudio über den RKorAPClient und macht es so für zahlreiche Analyse- und Visualisierungsmöglichkeiten zugänglich.