OPUS 4 | Search

17 search hits

Krill: KorAP search and analysis engine (2016)

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level (2022)

Diewald, Nils ; Kupietz, Marc ; Lüngen, Harald

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

What's New in EuReCo? Interoperability, Comparable Corpora, Licensing (2019)

Kupietz, Marc ; Margaretha, Eliza ; Diewald, Nils ; Lüngen, Harald ; Fankhauser, Peter

This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition, we report on advancements in the EuReCo and ICC initiatives with the provision of comparable corpora, and on recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.

How to get the computation near the data: improving data accessibility to, and reusability of analysis functions in corpus query platforms (2018)

Kupietz, Marc ; Diewald, Nils ; Fankhauser, Peter

The paper discusses use cases and proposals to increase the flexibility and reusability of components for analysis and further processing of analysis results in corpus query platforms by providing standardized interfaces to access data at multiple levels.

Texttechnologie an der Universität Bielefeld [Text Technology at Bielefeld University] (2006)

Witt, Andreas ; Diewald, Nils

In the mid-1990s, the Faculty of Linguistics and Literary Studies at Bielefeld University began to establish the field of text technology, both in research and education. Text technology is a new field of research on the border of computational linguistics and computational philology. This paper focuses on text technology in academic education. In 2002, text technology was introduced as a minor subject for B.A. programs. It is organized in modules: Module 1 introduces the characteristics of electronic texts and documents, typography, typesetting systems and hypertext. Module 2 introduces one or two programming languages relevant to the field of humanities computing. Markup languages and the principles of information structuring are the main topics of Module 3. The formal fundamentals of computer-based text processing, such as formal languages and their grammars and logic, are the subject of another module. The paper ends with a short description of other Bachelor's and Master's programs at Bielefeld University that contain text-technological topics.

KoralQuery - a General Corpus Query Protocol (2015)

Bingel, Joachim ; Diewald, Nils

The task-oriented and format-driven development of corpus query systems has led to the creation of numerous corpus query languages (QLs) that vary strongly in expressiveness and syntax. This is a severe impediment for the interoperability of corpus analysis systems, which lack a common protocol. In this paper, we present KoralQuery, a JSON-LD based general corpus query protocol, aiming to be independent of particular QLs, tasks and corpus formats. In addition to describing the system of types and operations that KoralQuery is built on, we exemplify the representation of corpus queries in the serialized format and illustrate use cases in the KorAP project.

KorAP architecture – diving in the deep sea of corpus data (2016)

Diewald, Nils ; Hanl, Michael ; Margaretha, Eliza ; Bingel, Joachim ; Kupietz, Marc ; Bański, Piotr ; Witt, Andreas

KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.

Aufbau einer Korpusinfrastruktur für die Beobachtung des Schreibgebrauchs [Building a corpus infrastructure for observing writing usage] (2016)

Fischer, Peter M. ; Diewald, Nils ; Kupietz, Marc ; Witt, Andreas

Building paths to corpus data. A multi-level least effort and maximum return approach (2022)

Kupietz, Marc ; Diewald, Nils ; Margaretha, Eliza

Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.

Matrix and double-array representations for efficient finite state tokenization (2022)

Diewald, Nils

This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
