OPUS 4 | Korpuslinguistik

Korpuslinguistik

17 search hits

1 to 10

Access control by query rewriting: the case of KorAP (2014)

Bański, Piotr ; Diewald, Nils ; Hanl, Michael ; Kupietz, Marc ; Witt, Andreas

We present an approach to an aspect of managing complex access scenarios to large and heterogeneous corpora that involves handling user queries that, intentionally or due to the complexity of the queried resource, target texts or annotations outside of the given user’s permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages, by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.

Building paths to corpus data. A multi-level least effort and maximum return approach (2022)

Kupietz, Marc ; Diewald, Nils ; Margaretha, Eliza

Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.

How to find a shining needle in the haystack. Querying CoRoLa: solutions and perspectives (2019)

Cristea, Dan ; Diewald, Nils ; Haja, Gabriela ; Mărănduc, Cătălina ; Barbu Mititelu, Verginica ; Onofrei, Mihaela

The present paper examines a variety of ways in which the Corpus of Contemporary Romanian Language (CoRoLa) can be used. A multitude of examples intends to highlight a wide range of interrogation possibilities that CoRoLa opens for different types of users. The querying of CoRoLa displayed here is supported by the KorAP frontend, through the querying language Poliqarp. Interrogations address annotation layers, such as the lexical, morphological and, in the near future, the syntactical layer, as well as the metadata. Other issues discussed are how to build a virtual corpus, how to deal with errors, how to find expressions and how to identify expressions.

How to get the computation near the data: improving data accessibility to, and reusability of analysis functions in corpus query platforms (2018)

Kupietz, Marc ; Diewald, Nils ; Fankhauser, Peter

The paper discusses use cases and proposals to increase the flexibility and reusability of components for analysis and further processing of analysis results in corpus query platforms by providing standardized interfaces to access data at multiple levels.

KorAP architecture – diving in the deep sea of corpus data (2016)

Diewald, Nils ; Hanl, Michael ; Margaretha, Eliza ; Bingel, Joachim ; Kupietz, Marc ; Bański, Piotr ; Witt, Andreas

KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.

KorAP: the new corpus analysis platform at IDS Mannheim (2013)

Bański, Piotr ; Bingel, Joachim ; Diewald, Nils ; Frick, Elena ; Hanl, Michael ; Kupietz, Marc ; Pȩzik, Piotr ; Schnober, Carsten ; Witt, Andreas

The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modern, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening the perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse extremely large amounts of primary data and annotations, while at the same time allowing an undistorted view of the primary un-annotated text, and thus fully satisfying expectations associated with a scientific tool. The project started in July 2011 and is funded till June 2014. The demo presentation in December will be the first version following a preliminary feature freeze, and will open the alpha testing phase of the project.

Krill: KorAP search and analysis engine (2016)

Diewald, Nils ; Margaretha, Eliza

Lessons learned in quality management for online research software tools in linguistics (2021)

Diewald, Nils ; Margaretha, Eliza ; Kupietz, Marc

In this paper, we present our experiences and decisions in dealing with challenges in developing, maintaining and operating online research software tools in the field of linguistics. In particular, we highlight reproducibility, dependability, and security as important aspects of quality management – taking into account the special circumstances in which research software is usually created.

Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian (2019)

Tufiș, Dan ; Barbu Mititelu, Verginica ; Irimia, Elena ; Păiș, Vasile ; Ion, Radu ; Diewald, Nils ; Mitrofan, Maria ; Onofrei, Mihaela

The paper presents the quite long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). The paper describes decisions behind the kinds of texts collected, as well as processing and annotation steps, highlighting the structure and importance of metadata to the corpus. The reader is also introduced to the three ways in which (s)he can plunge into the rich linguistic data of the corpus, waiting to be discovered. Besides querying the corpus, word embeddings extracted from it are useful to various natural language processing applications and for linguists, when user-friendly interfaces offer them the possibility to exploit the data.

Matrix and double-array representations for efficient finite state tokenization (2022)

Diewald, Nils

This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
