Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

  • When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Metadata
Author:Nils Diewald, Marc Kupietz, Harald Lüngen
URN:urn:nbn:de:bsz:mh39-111464
URL:https://euralex2022.ids-mannheim.de/wp-content/uploads/2022/07/Proceedings_11.07.2022.pdf
DOI:https://doi.org/10.14618/ids-pub-11146
ISBN:978-3-937241-87-6
Parent Title (English):Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
Publisher:IDS-Verlag
Place of publication:Mannheim
Editor:Annette Klosa-Kückelhaus, Stefan Engelberg, Christine Möhrs, Petra Storjohann
Document Type:Part of a Book
Language:English
Year of first Publication:2022
Date of Publication (online):2022/07/20
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer review
Tag:Corpora; software; tokenization
GND Keyword:Korpus <Linguistik>
First Page:208
Last Page:221
DDC classes:400 Language / 420 English
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Linguistics-Classification:Lexicography
Program areas:S1: Corpus linguistics
Conferences, Workshops:Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
Licence (German):Creative Commons - CC BY-SA - Attribution - ShareAlike 4.0 International