400 Language, Linguistics
This paper presents an algorithm and an implementation for efficiently tokenizing texts in space-delimited languages, based on a deterministic finite-state automaton. Two representations of the underlying data structure are presented, and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than the other tools while maintaining comparable quality.
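To illustrate the general idea of automaton-based tokenization, here is a minimal sketch of a table-driven deterministic finite-state tokenizer. It is not the paper's implementation: the character classes, the state names, and the `TRANSITIONS` table are assumptions made up for this example, and a realistic tokenizer for German would need far more rules (abbreviations, dates, URLs, and so on).

```python
from typing import Dict, List, Tuple


def char_class(ch: str) -> str:
    """Map a character to a coarse class (hypothetical, for illustration)."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"


# Transition table of the DFA: (state, character class) -> next state.
# A missing entry means "no transition": the current token ends here.
# Every state other than "start" is accepting in this toy automaton.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("start", "letter"): "word",
    ("word", "letter"): "word",
    ("start", "digit"): "number",
    ("number", "digit"): "number",
    ("start", "other"): "symbol",  # a single punctuation mark is one token
}


def tokenize(text: str) -> List[str]:
    """Split text into tokens by running the DFA from each token start."""
    tokens: List[str] = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():  # whitespace only separates tokens
            i += 1
            continue
        state, j = "start", i
        # Follow transitions as long as one exists (longest match).
        while j < n:
            nxt = TRANSITIONS.get((state, char_class(text[j])))
            if nxt is None:
                break
            state, j = nxt, j + 1
        tokens.append(text[i:j])
        i = j
    return tokens
```

Each character is inspected a bounded number of times and every step is a single table lookup, which is the property that makes automaton-based tokenizers fast. For instance, `tokenize("Der Zug fährt um 12 Uhr.")` separates the final period from `Uhr` and keeps `12` as one token.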
Lexical resources are often represented in table form, e.g., in relational databases, or in specially marked-up texts, for example, in document-based XML models. This paper describes how lexical structures can be modeled as graphs, how this model can be used to exploit existing lexical resources, and how different types of lexical resources can be combined.
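As a rough illustration of what a graph model of lexical data might look like, the sketch below stores lexical items as nodes and relations as labeled edges, and combines two resources by taking the union of their edges. The class `LexicalGraph`, its methods, the edge labels, and the sample entries are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class LexicalGraph:
    """Directed graph with labeled edges (hypothetical illustration)."""

    def __init__(self) -> None:
        # node -> set of (edge label, target node)
        self.edges: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add(self, source: str, label: str, target: str) -> None:
        """Add one labeled edge; both end points become nodes."""
        self.edges[source].add((label, target))
        self.edges[target]  # defaultdict access registers the target node

    def neighbours(self, source: str, label: str) -> List[str]:
        """Return all targets reachable from source via edges with label."""
        return sorted(t for l, t in self.edges.get(source, ()) if l == label)

    def merge(self, other: "LexicalGraph") -> "LexicalGraph":
        """Combine two resources; identically named nodes are unified."""
        merged = LexicalGraph()
        for graph in (self, other):
            for source, outgoing in graph.edges.items():
                for label, target in outgoing:
                    merged.add(source, label, target)
        return merged


# Two invented resources of different types: a dictionary and a thesaurus.
dictionary = LexicalGraph()
dictionary.add("Haus", "part_of_speech", "noun")
dictionary.add("Haus", "translation", "house")

thesaurus = LexicalGraph()
thesaurus.add("Haus", "synonym", "Gebäude")

combined = dictionary.merge(thesaurus)
```

After merging, the node `Haus` carries the edges of both resources, so a single query such as `combined.neighbours("Haus", "synonym")` can draw on information that originally lived in separate sources. Unifying nodes by identical name is the simplest possible choice; real resources would usually need an explicit mapping between entries.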