OPUS 4 | Search

3 search hits

1 to 3

Sort by

Relevancy
Year
Year
Title
Title
Author
Author

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level (2022)

Diewald, Nils ; Kupietz, Marc ; Lüngen, Harald

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state of the art tokenization tools for German – including our own – with regard to theses criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Matrix and double-array representations for efficient finite state tokenization (2022)

Diewald, Nils

This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level (2022)

Diewald, Nils ; Kupietz, Marc ; Lüngen, Harald

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-ofthe-art tokenization tools for German – including our own – with regard to theses criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

1 to 3

Person(s)
Title
Subject
Abstract
Fulltext
Year(s)

Open Access

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Publicationstate

Reviewstate

Publisher

3 search hits