
Human languages trade off complexity against efficiency

From a cross-linguistic perspective, language models are interesting because they can be used as idealised language learners that learn to produce and process language by being trained on a corpus of linguistic input. In this paper, we train different language models, from simple statistical models to advanced neural networks, on a database of 41 multilingual text collections comprising a wide variety of text types, which together include nearly 3 billion words across more than 6,500 documents in over 2,000 languages. We use the trained models to estimate entropy rates, a complexity measure derived from information theory. To compare entropy rates across both models and languages, we develop a quantitative approach that combines machine learning with semiparametric spatial filtering methods to account for both language- and document-specific characteristics, as well as phylogenetic and geographical language relationships. We first establish that entropy rate distributions are highly consistent across different language models, suggesting that the choice of model may have minimal impact on cross-linguistic investigations. On the basis of a much broader range of language models than in previous studies, we confirm results showing systematic differences in entropy rates, i.e. text complexity, across languages. These results challenge the long-held notion that all languages are equally complex. We then show that higher entropy rate tends to co-occur with shorter text length, and argue that this inverse relationship between complexity and length implies a compensatory mechanism whereby increased complexity is offset by increased efficiency. Finally, we introduce a multi-model multilevel inference approach to show that this complexity-efficiency trade-off is partly influenced by the social environment in which languages are used: languages spoken by larger communities tend to have higher entropy rates while using fewer symbols to encode messages.
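The central quantity in the abstract is the entropy rate, the average per-symbol information content that a trained language model assigns to text. As a purely illustrative sketch, and not the authors' models, corpora, or estimation pipeline, the Python snippet below estimates an upper bound on the entropy rate in bits per character by fitting a simple add-one-smoothed character bigram model and measuring its cross-entropy on held-out text; the bigram model and the toy strings are assumptions chosen only for demonstration.

import math
from collections import Counter

def bigram_cross_entropy(train_text: str, test_text: str) -> float:
    """Cross-entropy (bits per character) of a smoothed character bigram model.

    Under stationarity assumptions, this cross-entropy is an upper bound on the
    entropy rate of the process that generated the text.
    """
    vocab = set(train_text) | set(test_text)
    bigrams = Counter(zip(train_text, train_text[1:]))   # counts of (prev, cur) pairs
    contexts = Counter(train_text[:-1])                   # counts of preceding characters
    V = len(vocab)

    def prob(prev: str, cur: str) -> float:
        # Add-one (Laplace) smoothing so unseen bigrams receive non-zero probability.
        return (bigrams[(prev, cur)] + 1) / (contexts[prev] + V)

    log_prob = sum(math.log2(prob(p, c)) for p, c in zip(test_text, test_text[1:]))
    return -log_prob / (len(test_text) - 1)

if __name__ == "__main__":
    # Toy usage: lower values indicate more predictable (less complex) text under this model.
    train = "the cat sat on the mat and the cat sat again"
    test = "the cat sat on the mat"
    print(f"{bigram_cross_entropy(train, test):.3f} bits per character")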

Metadata
Author: Alexander Koplenig, Sascha Wolfer, Jan Oliver Rüdiger, Peter Meyer
URN: urn:nbn:de:bsz:mh39-130065
DOI: https://doi.org/10.1371/journal.pcsy.0000032
ISSN: 2837-8830
Parent Title (English): PLOS Complex Systems
Publisher: PLOS
Place of publication: San Francisco, CA
Document Type: Article
Language: English
Year of first Publication: 2025
Date of Publication (online): 2025/02/11
Publishing Institution: Leibniz-Institut für Deutsche Sprache (IDS)
Publication state: Published version
Review state: Peer review
Tag: entropy rate; language complexity; language models
GND Keyword: Computerlinguistik; Großes Sprachmodell; Informationstheorie; Kontrastive Linguistik; Korpus <Linguistik>; Maschinelles Lernen; Modell; Statistik
Volume: 2
Issue: 2
Article Number: e0000032
Page Number: 42
DDC classes: 400 Language / 400 Language, Linguistics
Open Access?: yes
Linguistics-Classification: Computational Linguistics
Program areas: Lexik
Licence (English): Creative Commons Attribution 4.0 International