Matrix and double-array representations for efficient finite state tokenization
- This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
| Author: | Nils Diewald |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-111091 |
| URL: | http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.4.pdf |
| ISBN: | 979-10-95546-83-2 |
| Parent Title (English): | Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022 |
| Publisher: | European Language Resources Association (ELRA) |
| Place of publication: | Paris |
| Editor: | Piotr Bański, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen |
| Document Type: | Conference Proceeding |
| Language: | English |
| Year of first Publication: | 2022 |
| Date of Publication (online): | 2022/07/01 |
| Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
| Publication state: | Published version |
| Review state: | Peer-reviewed |
| Tag: | corpora; finite state; finite state tokenization; space-delimited languages; tokenization |
| GND Keyword: | Algorithmus; Datenstruktur; Deutsch; Endlicher Zustandsraum; Korpus <Linguistik> |
| First Page: | 20 |
| Last Page: | 26 |
| DDC classes: | 400 Language / 400 Language, Linguistics |
| Open Access?: | yes |
| Leibniz-Classification: | Language, Linguistics |
| Linguistics-Classification: | Corpus linguistics |
| Program areas: | S1: Corpus linguistics |
| Licence (English): | Creative Commons - Attribution-NonCommercial 4.0 International |


