400 Language, Linguistics
Year of publication: 2022
Language: English
Has fulltext: yes
Is part of the bibliography: yes
Keyword: Endlicher Zustandsraum (finite state space)
Review state: Peer-Review
This paper presents an algorithm and an implementation for efficient tokenization of text in space-delimited languages, based on a deterministic finite state automaton. Two representations of the underlying data structure are presented, and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than the other tools while maintaining comparable quality.
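
To illustrate the general idea of automaton-based tokenization, the following is a minimal Python sketch of a table-driven deterministic finite state automaton that splits space-delimited text into tokens. It is not the algorithm or the data structure from the paper: the three character classes, the two states, and the names `char_class`, `TRANSITIONS`, and `tokenize` are assumptions chosen for illustration, and a realistic tokenizer for German would need many more states (abbreviations, numbers, URLs, and so on).

```python
from typing import List

# Character classes (a simplifying assumption for this sketch).
LETTER, SPACE, OTHER = 0, 1, 2

# States of the toy automaton.
OUTSIDE, IN_WORD = 0, 1

# TRANSITIONS[state][character class] -> next state.
# Both rows coincide in this toy example; a real automaton would differ per state.
TRANSITIONS = [
    [IN_WORD, OUTSIDE, OUTSIDE],  # from OUTSIDE
    [IN_WORD, OUTSIDE, OUTSIDE],  # from IN_WORD
]


def char_class(ch: str) -> int:
    """Map a character to one of the three hypothetical classes."""
    if ch.isalnum():
        return LETTER
    if ch.isspace():
        return SPACE
    return OTHER


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single-character punctuation tokens."""
    tokens: List[str] = []
    state = OUTSIDE
    start = 0
    for i, ch in enumerate(text):
        cls = char_class(ch)
        nxt = TRANSITIONS[state][cls]
        if state == IN_WORD and nxt != IN_WORD:
            tokens.append(text[start:i])  # a word just ended
        if state == OUTSIDE and nxt == IN_WORD:
            start = i  # a word just began
        if cls == OTHER:
            tokens.append(ch)  # punctuation is emitted as its own token
        state = nxt
    if state == IN_WORD:
        tokens.append(text[start:])  # flush the final word
    return tokens


print(tokenize("Der Hund bellt."))
```

Because the automaton is deterministic, each input character triggers exactly one table lookup, so the running time of this sketch grows linearly with the length of the text.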