Human languages trade off complexity against efficiency
- A central goal of linguistics is to understand the diverse ways in which human languages can be organized (Gibson et al. 2019; Lupyan/Dale 2016). In our contribution, we present the results of a large-scale cross-linguistic analysis of the statistical structure of written language (Koplenig/Wolfer/Meyer 2023), approaching this question from an information-theoretic perspective. To this end, we trained a language model on more than 6,500 different documents as represented in 41 parallel/multilingual text collections, so-called corpora, consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages that are spoken as a native language by more than 90% of the world population, or ~46% of all languages that have a standardized written representation. Figure 1 shows that our database covers a large variety of text types, e.g. religious texts, legal texts, subtitles for various movies and talks, newspaper texts, web crawls, Wikipedia articles, and translated example sentences from a free collaborative online database. Furthermore, we use word frequency information from the Crúbadán project, which aims at creating text corpora for a large number of (especially under-resourced) languages (Scannell 2007). We statistically infer the entropy rate of each language model as an information-theoretic index of (un)predictability/complexity (Schürmann/Grassberger 1996; Takahira/Tanaka-Ishii/Dębowski 2016). 
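The abstract does not spell out the estimator itself, so as a rough, purely illustrative sketch (not the authors' method): any universal compressor yields an upper bound on a text's entropy rate, since the compressed size in bits divided by the text length approximates bits per character. Here the standard-library `zlib` stands in for the more sophisticated estimators cited above; the two example strings are invented.

```python
import random
import zlib


def entropy_rate_upper_bound(text: str) -> float:
    """Crude upper bound on the entropy rate in bits per character,
    obtained from the compressed size of the text."""
    data = text.encode("utf-8")
    return 8 * len(zlib.compress(data, level=9)) / len(text)


# A highly predictable text compresses to far fewer bits per character
# than an unpredictable one of comparable length.
random.seed(0)
repetitive = "ab" * 5000
scrambled = "".join(random.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(10000))

print(round(entropy_rate_upper_bound(repetitive), 3))
print(round(entropy_rate_upper_bound(scrambled), 3))
```

In this reading, a language whose model assigns higher per-character entropy is harder to predict, i.e. more complex in the information-theoretic sense used here.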
Equipped with this database and information-theoretic estimation framework, we first evaluate the so-called ‘equi-complexity hypothesis’, the idea that all languages are equally complex (Sampson 2009). We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. This constitutes evidence against the equi-complexity hypothesis from an information-theoretic perspective. We then present, discuss and evaluate evidence for a complexity-efficiency trade-off that unexpectedly emerged when we analysed our database: high-entropy languages tend to need fewer symbols to encode messages, and vice versa. Given that, from an information-theoretic point of view, the message length quantifies efficiency – the shorter the encoded message, the higher the efficiency (Gibson et al. 2019) – this indicates that human languages trade off efficiency against complexity. More explicitly, a higher average amount of choice/uncertainty per produced/received symbol is compensated by a shorter average message length. Finally, we present results that could point toward the idea that the absolute amount of information in parallel texts is invariant across different languages.
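The trade-off and the invariance idea can be sketched with a tiny arithmetic check. All numbers below are invented for illustration, not measured values from the study: if the total information of a parallel text, entropy rate h (bits per symbol) times message length L (symbols), is roughly constant across translations, then a higher h must be offset by a shorter L.

```python
# Hypothetical per-language values for one parallel text (illustration only):
# h = entropy rate in bits per symbol, L = message length in symbols.
languages = {
    "lang_A": {"h": 1.2, "L": 10000},
    "lang_B": {"h": 1.5, "L": 8000},
    "lang_C": {"h": 2.0, "L": 6000},
}

# Under the invariance idea, h * L is (roughly) the same in every language:
for name, v in languages.items():
    total_bits = v["h"] * v["L"]
    print(name, total_bits)  # → 12000.0 in each case
```

Sorting these toy languages by entropy rate reverses their order by message length, which is exactly the negative association the abstract describes.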
| Author: | Alexander Koplenig, Sascha Wolfer, Peter Meyer |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-122783 |
| URL: | https://iclc10.ids-mannheim.de |
| DOI: | https://doi.org/10.14618/f8rt-m155 |
| ISBN: | 978-3-937241-96-8 |
| Parent Title (English): | 10th International Contrastive Linguistics Conference (ICLC-10), 18-21 July, 2023, Mannheim, Germany |
| Publisher: | IDS-Verlag |
| Place of publication: | Mannheim |
| Editor: | Beata Trawiński, Marc Kupietz, Kristel Proost, Jörg Zinken |
| Document Type: | Part of a Book |
| Language: | English |
| Year of first Publication: | 2023 |
| Date of Publication (online): | 2023/11/07 |
| Publication state: | Published version |
| Reviewstate: | Peer-Review |
| Tag: | information theory; language complexity; language efficiency; quantitative linguistics; quantitative typology |
| GND Keyword: | Sprachstatistik |
| First Page: | 190 |
| Last Page: | 192 |
| DDC classes: | 400 Sprache / 400 Sprache, Linguistik |
| Open Access?: | yes |
| Leibniz-Classification: | Sprache, Linguistik |
| Program areas: | Lexik |
| Licence (German): | Creative Commons Attribution-ShareAlike 3.0 Germany |


