Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus
In a recent article, Meylan and Griffiths (2021, henceforth M&G) focus on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011, henceforth PT&G), who argue that average information content is a better predictor of word length than word frequency. We applaud M&G for conducting a very important study that should be read by any researcher interested in working with large-scale corpora. The fact that M&G mostly failed to find clear evidence in favor of PT&G's main finding motivated us to test PT&G's idea on a subset of the largest archive of German language texts designed for linguistic research, the German Reference Corpus, which consists of ∼43 billion words. We find very little support for the primary data point reported by PT&G.
Author: | Alexander Koplenig, Marc Kupietz, Sascha Wolfer |
---|---|
URN: | urn:nbn:de:bsz:mh39-110893 |
DOI: | https://doi.org/10.1111/cogs.13090 |
ISSN: | 1551-6709 |
Parent Title (English): | Cognitive Science |
Publisher: | Wiley |
Place of publication: | Hoboken |
Document Type: | Article |
Language: | English |
Year of first Publication: | 2022 |
Date of Publication (online): | 2022/06/15 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication] |
Contributing Corporation: | Cognitive Science Society |
Publicationstate: | Secondary publication |
Publicationstate: | Postprint |
Reviewstate: | Peer-Review |
Tag: | Deutsches Referenzkorpus (DeReKo); N-gram modeling; compression; corpus linguistics; information theory; large-scale corpora; uniform information density |
GND Keyword: | Deutsch; Informationsgehalt; Informationstheorie; Korpus <Linguistik>; Vorhersagbarkeit; Worthäufigkeit; Wortlänge |
Volume: | 46 |
Issue: | 6 |
Page Number: | 10 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Leibniz-Classification: | Language, Linguistics |
Linguistics-Classification: | Corpus linguistics |
Program areas: | L3: Lexik empirisch und digital |
Program areas: | S1: Korpuslinguistik |
Licence (German): | Protected by copyright |