A large quantitative analysis of written language challenges the idea that all languages are equally complex

One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large-scale quantitative cross-linguistic analysis of written language by training a language model on more than 6500 different documents as represented in 41 multilingual text collections consisting of ~3.5 billion words or ~9.0 billion characters and covering 2069 different languages that are spoken as a native language by more than 90% of the world population. We statistically infer the entropy of each language model as an index of what we call average prediction complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. In addition, we show that speaker population size predicts entropy. We argue that both results constitute evidence against the equi-complexity hypothesis from an information-theoretic perspective.
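
As a rough illustration of what "entropy as an index of prediction complexity" can mean in practice, the sketch below estimates the average number of bits per character that a very simple model needs to predict a text. This is an assumption-laden stand-in, not the model used in the article: it uses a character n-gram model with add-one smoothing, and the function name `cross_entropy_bits_per_char` and its parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cross_entropy_bits_per_char(train: str, test: str, order: int = 2) -> float:
    """Average bits per character of `test` under a character n-gram model
    (add-one smoothing) estimated from `train`.

    Illustrative stand-in only; not the language model from the article.
    """
    if not test:
        raise ValueError("test text must not be empty")

    # Shared alphabet, so every test character has non-zero probability.
    vocab_size = len(set(train) | set(test))

    # Count which character follows each context of length `order`.
    context_counts = defaultdict(Counter)
    padded_train = " " * order + train
    for i in range(order, len(padded_train)):
        context_counts[padded_train[i - order:i]][padded_train[i]] += 1

    # Sum the surprisal (-log2 p) of every character in the test text.
    padded_test = " " * order + test
    total_bits = 0.0
    for i in range(order, len(padded_test)):
        counts = context_counts.get(padded_test[i - order:i], Counter())
        p = (counts[padded_test[i]] + 1) / (sum(counts.values()) + vocab_size)
        total_bits -= math.log2(p)

    return total_bits / len(test)


# A highly predictable text should need fewer bits per character
# than a text whose characters are all equally likely.
predictable = cross_entropy_bits_per_char("ab" * 50, "ab" * 10, order=2)
uniform = cross_entropy_bits_per_char("abcd" * 25, "abcd" * 5, order=0)
print(predictable < uniform)
```

Under this reading, a text that is harder to predict yields a higher value. Comparing such values across languages and corpora, as the abstract describes, would additionally require parallel or otherwise comparable texts and a far stronger model.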

Metadata
Author:Alexander Koplenig, Sascha Wolfer, Peter Meyer
URN:urn:nbn:de:bsz:mh39-121243
DOI:https://doi.org/10.1038/s41598-023-42327-3
ISSN:2045-2322
Parent Title (English):Scientific Reports
Publisher:Springer Nature
Place of publication:Berlin
Document Type:Article
Language:English
Year of first Publication:2023
Date of Publication (online):2023/09/19
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:average prediction complexity; cross-linguistic analysis; equi-complexity hypothesis; language complexity; language model; quantitative analysis; written language
GND Keyword:Entropie; Informationstheorie; Komplexität; Kontrastive Linguistik; Korpus <Linguistik>; Sprachstatistik
Volume:13
Article Number:15351
Page Number:16
Note:Funded through the Springer DEAL agreement
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Quantitative Linguistics
Program areas:L1: Lexikographie und Sprachdokumentation
Program areas:L3: Lexik empirisch und digital
Licence (English):Creative Commons - Attribution 4.0 International