It is well known that the distribution of lexical and grammatical patterns is size- and register-sensitive (Biber 1986, and later publications). This fact alone presents a challenge to many corpus-oriented linguistic studies focusing on a single language. In cross-linguistic corpus studies, the challenge becomes even greater due to the lack of high-quality multilingual corpora that are comparable with respect to size and register (Kupietz et al. 2020; Kupietz/Trawiński 2022). That was the motivation for the creation of the European Reference Corpus EuReCo, an initiative started in 2013 at the Leibniz Institute for the German Language (IDS) together with several European partners (Kupietz et al. 2020). EuReCo is an emerging federated corpus, with large virtual comparable corpora across various languages and with an infrastructure supporting contrastive research. The core of the infrastructure is KorAP (Diewald et al. 2016), a scalable open-source platform supporting the analysis and visualisation of properties of texts annotated by multiple and potentially conflicting information layers, and supporting several corpus query languages. Until recently, EuReCo consisted of three monolingual subparts: the German Reference Corpus DeReKo (Kupietz et al. 2018), the Reference Corpus of Contemporary Romanian Language (Barbu Mititelu/Tufiş/Irimia 2018), and the Hungarian National Corpus (Váradi 2002). The goal of the present submission is twofold. On the one hand, it reports on the new component of EuReCo: a sample of the National Corpus of Polish (Przepiórkowski et al. 2010). On the other hand, it presents the results of a new pilot study using the newly extended EuReCo. This pilot study investigates selected Polish collocations involving light verbs and their prepositional / nominal complements (Fig. 1) and extends the collocation analyses of German, Romanian and Hungarian (Fig. 2) discussed in Kupietz/Trawiński (2022).
Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.
The task-oriented and format-driven development of corpus query systems has led to the creation of numerous corpus query languages (QLs) that vary strongly in expressiveness and syntax. This is a severe impediment for the interoperability of corpus analysis systems, which lack a common protocol. In this paper, we present KoralQuery, a JSON-LD based general corpus query protocol, aiming to be independent of particular QLs, tasks and corpus formats. In addition to describing the system of types and operations that KoralQuery is built on, we exemplify the representation of corpus queries in the serialized format and illustrate use cases in the KorAP project.
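To make the idea of a QL-independent, JSON-LD based query representation more concrete, the following minimal Python sketch builds a query for two adjacent tokens and serializes it. The context URL, the `koral:` type names, and the field names (`wrap`, `layer`, `key`, `match`, `operation`, `operands`) are assumptions modelled on the style the abstract describes, and the helper functions `token`, `sequence` and `serialize` are hypothetical; the KoralQuery specification remains the authority on the actual vocabulary.

```python
import json

# ASSUMPTION: context URL and all "koral:" names below are illustrative,
# not taken from the abstract; verify against the KoralQuery specification.
CONTEXT = "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld"


def token(key, layer="orth"):
    """Hypothetical helper: one token whose annotation `layer` equals `key`."""
    return {
        "@type": "koral:token",
        "wrap": {
            "@type": "koral:term",
            "layer": layer,
            "key": key,
            "match": "match:eq",
        },
    }


def sequence(*operands):
    """Hypothetical helper: operands that must follow each other directly."""
    return {
        "@type": "koral:group",
        "operation": "operation:sequence",
        "operands": list(operands),
    }


def serialize(query):
    """Wrap a query object in a JSON-LD document and return it as a string."""
    document = {"@context": CONTEXT, "query": query}
    return json.dumps(document, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # The same structure could be produced by front ends for different QLs.
    print(serialize(sequence(token("der"), token("Baum"))))
```

Because the serialized form is plain JSON, any component that can parse JSON can consume the query without knowing which query language the user originally typed.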
KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DEREKO for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communication. KorAP is open source, licensed under BSD-2, and available on GitHub.
In recent years, discourse linguistics has established itself as a linguistic subdiscipline that, in transtextual studies of linguistic patterns, reconstructs worlds of thought and imagination relevant to society as a whole. Digitalisation has not only fundamentally changed our society and shaped new forms of communication and innovative cultural practices, but has also significantly influenced discourse-linguistic work. The establishment of discourse linguistics, and of discourse-oriented lexicography, was thus shaped by its close alignment with computer-assisted methods (Bubenhofer 2009, Teubert/Čermáková 2007, Halliday et al. 2004), which make large text collections accessible for discourse analyses. Since discourse-analytical research in the Foucauldian tradition is not interested in the individual attestation but works with contextual patterns and intertextual reference structures, corpus-based analysis offers a productive starting point for discourse studies. This applies in particular to discourse lexicography, in which dictionaries of cultural-historical discourses are compiled on a broad data basis.
In this paper, we present our experiences and decisions in dealing with challenges in developing, maintaining and operating online research software tools in the field of linguistics. In particular, we highlight reproducibility, dependability, and security as important aspects of quality management, taking into account the special circumstances in which research software is usually created.
This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
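As a rough illustration of automaton-based tokenization, the Python sketch below drives a small table-based deterministic finite state automaton over the input and emits the longest accepted span at each position. It is a toy under stated assumptions, not the algorithm or data structure of the paper: the character classes, the states (`start`, `word`, `number`, `punct`) and the function names `char_class` and `tokenize` are all invented for this example, and it ignores abbreviations, URLs, and other cases a real German tokenizer must handle.

```python
def char_class(ch):
    """Map a character to a coarse class (assumed classes, for illustration)."""
    if ch.isalpha():
        return "L"  # letter
    if ch.isdigit():
        return "D"  # digit
    if ch.isspace():
        return "S"  # whitespace
    return "P"      # anything else is treated as punctuation


# Transition table of the toy DFA: (state, character class) -> next state.
# A missing entry means "no transition", which ends the current token.
TRANSITIONS = {
    ("start", "L"): "word",
    ("start", "D"): "number",
    ("start", "P"): "punct",
    ("word", "L"): "word",
    ("word", "D"): "word",
    ("number", "D"): "number",
}
ACCEPTING = {"word", "number", "punct"}


def tokenize(text):
    """Split text into tokens by running the DFA with longest-match semantics."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1  # whitespace separates tokens and is never emitted
            continue
        state, j, last_accept = "start", i, i
        while j < n:
            nxt = TRANSITIONS.get((state, char_class(text[j])))
            if nxt is None:
                break
            state = nxt
            j += 1
            if state in ACCEPTING:
                last_accept = j  # remember the longest accepted prefix
        tokens.append(text[i:last_accept])
        i = last_accept
    return tokens


if __name__ == "__main__":
    print(tokenize("Der Baum, 42 Jahre alt."))
```

Each character is inspected once and each step is a single table lookup, which is the property that makes automaton-based tokenizers attractive for large corpora.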