TY  - CHAP
U1  - Conference publication
A1  - Diewald, Nils
ED  - Bański, Piotr
ED  - Barbaresi, Adrien
ED  - Clematide, Simon
ED  - Kupietz, Marc
ED  - Lüngen, Harald
T1  - Matrix and double-array representations for efficient finite state tokenization
T2  - Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022
N2  - This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
KW  - tokenization
KW  - finite state
KW  - corpora
KW  - algorithm
KW  - finite state space
KW  - data structure
KW  - German
KW  - corpus
KW  - finite state tokenization
KW  - space-delimited languages
Y1  - 2022
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111091
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111091
UR  - http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.4.pdf
SN  - 979-10-95546-83-2
SB  - 979-10-95546-83-2
SP  - 20
EP  - 26
PB  - European Language Resources Association (ELRA)
CY  - Paris
ER  - 