Corpus size strongly matters when analysing word frequency distributions
In a previous study, Aceves and Evans present a large-scale quantitative information-theoretic analysis of parallel corpus data in ~1,000 languages to show that there are apparently strong associations between the way languages encode information into words and patterns of communication, e.g. the configuration of semantic information. During the peer review process, one reviewer raised the question of the extent to which the presented results depend on different corpus sizes (see the Peer Review File). This is a very important question given that most, if not all, of the quantities associated with word frequency distributions vary systematically with corpus size. While Aceves and Evans claim that corpus size does not affect the results presented, I challenge this view by presenting reanalyses of the data that clearly suggest that it does.
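The dependence of word-frequency quantities on corpus size can be illustrated with a small simulation. The sketch below is not taken from the preprint or its supplemental materials: it draws synthetic tokens from an assumed Zipf-like distribution (the vocabulary size, exponent, seed, and the helper names `zipf_sample` and `type_count_and_entropy` are all illustrative choices) and reports the number of word types and the plug-in unigram entropy for nested samples of increasing size.

```python
import math
import random
from collections import Counter


def zipf_sample(n_tokens, vocab_size=50_000, exponent=1.0, seed=0):
    """Draw n_tokens word ids from an assumed Zipf-like distribution (synthetic data)."""
    rng = random.Random(seed)
    # Weight of the word at a given rank is proportional to 1 / rank**exponent.
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=n_tokens)


def type_count_and_entropy(tokens):
    """Return (number of distinct types, plug-in unigram entropy in bits)."""
    counts = Counter(tokens)
    n = len(tokens)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return len(counts), entropy


# Measure the same quantities on nested prefixes of one synthetic "corpus".
tokens = zipf_sample(100_000)
results = {n: type_count_and_entropy(tokens[:n]) for n in (1_000, 10_000, 100_000)}

for n, (types, entropy) in results.items():
    print(f"{n:>7} tokens: {types:>6} types, entropy {entropy:.2f} bits")
```

Both quantities grow with sample size even though the underlying distribution is the same throughout, which is why comparisons across corpora of different sizes typically need some form of size control, for example measuring every corpus at a common token count.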
Author: | Alexander Koplenig |
---|---|
URN: | urn:nbn:de:bsz:mh39-125399 |
DOI: | https://doi.org/10.31219/osf.io/p5nhd |
Publisher: | OSF Preprints, Center for Open Science |
Place of publication: | Charlottesville, VA |
Document Type: | Preprint |
Language: | English |
Year of first Publication: | 2024 |
Date of Publication (online): | 2024/02/29 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publication state: | Published version |
Tag: | corpus size; information density; word frequency distribution |
GND Keyword: | Häufigkeitsverteilung; Kommunikation; Korpus <Linguistik>; Quantitative Analyse; Worthäufigkeit |
Page Number: | 9 |
Note: | Supplemental materials |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Leibniz-Classification: | Language, Linguistics |
Linguistics-Classification: | Computational linguistics |
Linguistics-Classification: | Corpus linguistics |
Program areas: | Lexik |
Licence (English): | Creative Commons - Attribution 4.0 International |