
Human languages trade off complexity against efficiency

A central goal of linguistics is to understand the diverse ways in which human language can be organized (Gibson et al. 2019; Lupyan/Dale 2016). In our contribution, we present results of a large-scale cross-linguistic analysis of the statistical structure of written language (Koplenig/Wolfer/Meyer 2023) in which we approach this question from an information-theoretic perspective. To this end, we trained a language model on more than 6,500 different documents as represented in 41 parallel/multilingual text collections, so-called corpora, consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages that are spoken as a native language by more than 90% of the world population, or ~46% of all languages that have a standardized written representation. Figure 1 shows that our database covers a large variety of different text types, e.g. religious texts, legal texts, subtitles for various movies and talks, newspaper texts, web crawls, Wikipedia articles, or translated example sentences from a free collaborative online database. Furthermore, we use word frequency information from the Crúbadán project, which aims at creating text corpora for a large number of (especially under-resourced) languages (Scannell 2007). We statistically infer the entropy rate of each language model as an information-theoretic index of (un)predictability/complexity (Schürmann/Grassberger 1996; Takahira/Tanaka-Ishii/Dębowski 2016). Equipped with this database and information-theoretic estimation framework, we first evaluate the so-called ‘equi-complexity hypothesis’, the idea that all languages are equally complex (Sampson 2009). We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. This constitutes evidence against the equi-complexity hypothesis from an information-theoretic perspective. We then present, discuss, and evaluate evidence for a complexity-efficiency trade-off that unexpectedly emerged when we analysed our database: high-entropy languages tend to need fewer symbols to encode messages, and vice versa. Given that, from an information-theoretic point of view, message length quantifies efficiency – the shorter the encoded message, the higher the efficiency (Gibson et al. 2019) – this indicates that human languages trade off efficiency against complexity. More explicitly, a higher average amount of choice/uncertainty per produced/received symbol is compensated for by a shorter average message length. Finally, we present results that could point toward the idea that the absolute amount of information in parallel texts is invariant across different languages.
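To make the trade-off concrete: if h is the entropy rate (average bits of uncertainty per symbol) and L the length of the encoded message, the total information content is roughly h · L, and the observed trade-off means that languages with higher h tend to have smaller L. The following Python sketch illustrates this relationship, using a general-purpose compressor as a crude upper-bound estimator of the entropy rate. It is an illustration under simplifying assumptions only, not the statistical estimation framework of the study, and the two "parallel texts" are hypothetical placeholders.

import bz2

def entropy_rate_upper_bound(text: str) -> float:
    """Crude upper bound on the entropy rate in bits per character:
    compressed size divided by original length. Any general-purpose
    compressor yields an upper bound; better models give tighter ones."""
    data = text.encode("utf-8")
    return 8 * len(bz2.compress(data)) / len(data)

# Hypothetical stand-ins for the same message in two languages.
parallel_texts = {
    "lang_A": "the quick brown fox jumps over the lazy dog. " * 500,
    "lang_B": "der schnelle braune fuchs springt ueber den faulen hund. " * 500,
}

for lang, text in parallel_texts.items():
    h = entropy_rate_upper_bound(text)  # complexity: bits per character
    n = len(text)                       # message length: shorter = more efficient
    print(f"{lang}: h ~ {h:.3f} bits/char, length = {n} chars, "
          f"total ~ {h * n:,.0f} bits")

Under the invariance idea sketched above, the per-language products h · n for genuinely parallel texts should be much closer to each other than either h or n taken alone.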

Metadata
Author: Alexander Koplenig, Sascha Wolfer, Peter Meyer
URN: urn:nbn:de:bsz:mh39-122783
URL: https://iclc10.ids-mannheim.de
DOI: https://doi.org/10.14618/f8rt-m155
ISBN: 978-3-937241-96-8
Parent Title (English): 10th International Contrastive Linguistics Conference (ICLC-10), 18-21 July, 2023, Mannheim, Germany
Publisher: IDS-Verlag
Place of publication: Mannheim
Editor: Beata Trawiński, Marc Kupietz, Kristel Proost, Jörg Zinken
Document Type: Part of a Book
Language: English
Year of first Publication: 2023
Date of Publication (online): 2023/11/07
Publication state: Published version
Review state: Peer-reviewed
Tag: information theory; language complexity; language efficiency; quantitative linguistics; quantitative typology
GND Keyword: Sprachstatistik
First Page: 190
Last Page: 192
DDC classes: 400 Language / 400 Language, Linguistics
Open Access?: yes
Leibniz Classification: Language, Linguistics
Program areas: Lexik
Licence (German): Creative Commons - Attribution-ShareAlike 3.0 Germany