Matrix and double-array representations for efficient finite state tokenization

This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
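
To make the matrix idea concrete, the following is a toy sketch of a table-driven deterministic finite state tokenizer. It is not the paper's implementation: the three character classes, the two states, and the names `char_class`, `TRANSITIONS`, and `tokenize` are assumptions chosen for illustration, and the double-array representation is not shown.

```python
# Toy illustration only: a tokenizer driven by a transition matrix.
# The character classes, states, and function names are hypothetical
# and are not taken from the paper.

# Character classes index the columns of the transition matrix.
WS, WORD, PUNCT = 0, 1, 2


def char_class(ch):
    """Map a character to one of three coarse classes."""
    if ch.isspace():
        return WS
    if ch.isalnum():
        return WORD
    return PUNCT


# States index the rows: OUT = outside a token, IN_WORD = inside a word.
OUT, IN_WORD = 0, 1

# TRANSITIONS[state][char_class] gives the next state.
TRANSITIONS = [
    # WS   WORD     PUNCT
    [OUT, IN_WORD, OUT],  # from OUT
    [OUT, IN_WORD, OUT],  # from IN_WORD
]


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    tokens = []
    state = OUT
    start = 0
    for i, ch in enumerate(text):
        cls = char_class(ch)
        nxt = TRANSITIONS[state][cls]
        if state == IN_WORD and nxt != IN_WORD:
            # Leaving a word: emit the span collected so far.
            tokens.append(text[start:i])
        if state == OUT and nxt == IN_WORD:
            # Entering a word: remember where it starts.
            start = i
        if cls == PUNCT:
            # Each punctuation character is its own token.
            tokens.append(ch)
        state = nxt
    if state == IN_WORD:
        # Flush a word that runs to the end of the input.
        tokens.append(text[start:])
    return tokens
```

Each input character costs one table lookup, which is the property that makes a matrix representation fast; the price is one row per state and one column per symbol class, which grows quickly for a realistic automaton.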

Metadata
Author:Nils Diewald
URN:urn:nbn:de:bsz:mh39-111091
URL:http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.4.pdf
ISBN:979-10-95546-83-2
Parent Title (English):Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022
Publisher:European Language Resources Association (ELRA)
Place of publication:Paris
Editor:Piotr Bański, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen
Document Type:Conference Proceeding
Language:English
Year of first Publication:2022
Date of Publication (online):2022/07/01
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publication state:Published version
Review state:Peer-reviewed
Tag:corpora; finite state; finite state tokenization; space-delimited languages; tokenization
GND Keyword:Algorithmus; Datenstruktur; Deutsch; Endlicher Zustandsraum; Korpus <Linguistik>
First Page:20
Last Page:26
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Program areas:S1: Corpus linguistics
Licence (English):Creative Commons - Attribution-NonCommercial 4.0 International