Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level
- When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.
Author: | Nils Diewald, Marc Kupietz, Harald Lüngen |
---|---|
URN: | urn:nbn:de:bsz:mh39-111245 |
URL: | https://euralex2022.ids-mannheim.de/wp-content/uploads/2022/07/Book_of_Abstract_11.07.2022.pdf |
ISBN: | 978-3-937241-86-9 |
Parent Title (English): | Dictionaries and Society. Book of Abstracts of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany |
Publisher: | IDS-Verlag |
Place of publication: | Mannheim |
Editor: | Annette Klosa-Kückelhaus, Stefan Engelberg, Christine Möhrs, Petra Storjohann |
Document Type: | Part of a Book |
Language: | English |
Year of first Publication: | 2022 |
Date of Publication (online): | 2022/07/12 |
Publication state: | Published version |
Review state: | Peer review |
Tag: | corpora; software; tokenization |
GND Keyword: | Automatische Sprachanalyse; Daten; Deutsch; Korpus <Linguistik>; Software |
First Page: | 76 |
Last Page: | 76 |
DDC classes: | 400 Language / 420 English |
Open Access?: | yes |
Leibniz-Classification: | Language, Linguistics |
Linguistics-Classification: | Corpus linguistics |
Linguistics-Classification: | Lexicography |
Program areas: | S1: Corpus linguistics |