TY  - CHAP
U1  - Book chapter
A1  - Diewald, Nils
A1  - Kupietz, Marc
A1  - Lüngen, Harald
ED  - Klosa-Kückelhaus, Annette
ED  - Engelberg, Stefan
ED  - Möhrs, Christine
ED  - Storjohann, Petra
T1  - Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level
T2  - Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
N2  - When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.
KW  - Corpora
KW  - tokenization
KW  - software
KW  - Korpus <Linguistik>
Y1  - 2022
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111464
UR  - https://euralex2022.ids-mannheim.de/wp-content/uploads/2022/07/Proceedings_11.07.2022.pdf
SN  - 978-3-937241-87-6
SB  - 978-3-937241-87-6
U6  - https://doi.org/10.14618/ids-pub-11146
DO  - https://doi.org/10.14618/ids-pub-11146
SP  - 208
EP  - 221
PB  - IDS-Verlag
CY  - Mannheim
ER  -