A large quantitative analysis of written language challenges the idea that all languages are equally complex
One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large-scale quantitative cross-linguistic analysis of written language by training a language model on more than 6500 different documents as represented in 41 multilingual text collections consisting of ~3.5 billion words or ~9.0 billion characters and covering 2069 different languages that are spoken as a native language by more than 90% of the world's population. We statistically infer the entropy of each language model as an index of what we call average prediction complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. In addition, we show that speaker population size predicts entropy. We argue that both results constitute evidence against the equi-complexity hypothesis from an information-theoretic perspective.
Author: | Alexander Koplenig, Sascha Wolfer, Peter Meyer |
---|---|
URN: | urn:nbn:de:bsz:mh39-121243 |
DOI: | https://doi.org/10.1038/s41598-023-42327-3 |
ISSN: | 2045-2322 |
Parent Title (English): | Scientific Reports |
Publisher: | Springer Nature |
Place of publication: | Berlin |
Document Type: | Article |
Language: | English |
Year of first Publication: | 2023 |
Date of Publication (online): | 2023/09/19 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publication state: | Published version |
Review state: | Peer review |
Tag: | average prediction complexity; cross-linguistic analysis; equi-complexity hypothesis; language complexity; language model; quantitative analysis; written language |
GND Keyword: | Entropie; Informationstheorie; Komplexität; Kontrastive Linguistik; Korpus <Linguistik>; Sprachstatistik |
Volume: | 13 |
Article Number: | 15351 |
Page Number: | 16 |
Note: | Funded through the Springer DEAL agreement |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Leibniz-Classification: | Language, Linguistics |
Linguistics-Classification: | Quantitative linguistics |
Program areas: | L1: Lexikographie und Sprachdokumentation |
Program areas: | L3: Lexik empirisch und digital |
Licence (English): | Creative Commons - Attribution 4.0 International |