Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

  • When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Metadata
Author:Nils Diewald, Marc Kupietz, Harald Lüngen
URN:urn:nbn:de:bsz:mh39-111245
URL:https://euralex2022.ids-mannheim.de/wp-content/uploads/2022/07/Book_of_Abstract_11.07.2022.pdf
ISBN:978-3-937241-86-9
Parent Title (English):Dictionaries and Society. Book of Abstracts of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
Publisher:IDS-Verlag
Place of publication:Mannheim
Editor:Annette Klosa-Kückelhaus, Stefan Engelberg, Christine Möhrs, Petra Storjohann
Document Type:Article
Language:English
Year of first Publication:2022
Date of Publication (online):2022/07/12
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:corpora; software; tokenization
GND Keyword:Automatic language analysis; Data; German; Corpus <Linguistics>; Software
First Page:76
Last Page:76
DDC classes:400 Language / 420 English
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Linguistics-Classification:Lexicography
Program areas:S1: Corpus Linguistics
Licence (German):Creative Commons - Attribution-ShareAlike 3.0 Germany