Matrix and double-array representations for efficient finite state tokenization
- This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
Author: | Nils Diewald |
---|---|
URN: | urn:nbn:de:bsz:mh39-111091 |
URL: | http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.4.pdf |
ISBN: | 979-10-95546-83-2 |
Parent Title (English): | Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022 |
Publisher: | European Language Resources Association (ELRA) |
Place of publication: | Paris |
Editor: | Piotr Bański, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2022 |
Date of Publication (online): | 2022/07/01 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publication state: | Published version |
Review state: | Peer review |
Tag: | corpora; finite state; finite state tokenization; space-delimited languages; tokenization |
GND Keyword: | Algorithmus; Datenstruktur; Deutsch; Endlicher Zustandsraum; Korpus <Linguistik> |
First Page: | 20 |
Last Page: | 26 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Leibniz-Classification: | Language, Linguistics |
Linguistics-Classification: | Corpus linguistics |
Program areas: | S1: Corpus linguistics |
Licence (English): | Creative Commons - Attribution-NonCommercial 4.0 International |