Quantitative Linguistics
cOWIDplus Analyse is a continuously updated resource addressing the question of whether, and how strongly, the vocabulary of selected German online press reports becomes systematically restricted during the coronavirus pandemic, and whether or when the vocabulary expands again after the crisis. In this article, the authors explain the research question behind the resource, the underlying data, the method, and the results obtained so far.
Information theory can be used to assess how efficiently a message is transmitted on the basis of different symbolic systems. In this paper, I estimate the information-theoretic efficiency of written language for parallel text data in more than 1000 different languages, both on the level of characters and on the level of words as information encoding units. The main results show that (i) the median efficiency is ∼29% on the character level and ∼45% on the word level, (ii) efficiency on the two levels is strongly correlated and (iii) efficiency tends to be higher for languages with more speakers.
The annual microcensus provides Germany’s most important official statistics. Unlike a census, it does not cover the whole population, but a representative 1% sample of it. In 2017, the German microcensus asked a question on the language of the population, i.e. ‘Which language is mainly spoken in your household?’ Unfortunately, the question, its design and its position within the microcensus questionnaire have several shortcomings. The main shortcoming is that the question cannot capture multilingual repertoires. To improve the microcensus language question, first and foremost the question (i.e. its wording, design, and answer options) should make it possible to count multilingual repertoires.
The coronavirus pandemic may be the largest crisis the world has had to face since World War II. It does not come as a surprise that it is also having an impact on language as our primary communication tool. In this short paper, we present three interconnected resources that are designed to capture and illustrate these effects on a subset of the German language: an RSS corpus of German-language newsfeeds (with freely available untruncated frequency lists), a continuously updated HTML page tracking the diversity of the vocabulary in the RSS corpus, and a Shiny web application that enables other researchers and the broader public to explore the corpus in terms of basic frequencies.
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
(2020)
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
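The sample-size dependence described above can be illustrated with a minimal sketch. It assumes one common definition of a generalized entropy of order α (the Tsallis or Havrda-Charvát form, H_α = (1 − Σ p_i^α) / (α − 1), with Shannon entropy as the limit α → 1); the abstract does not state the exact formula used, and the function names and toy token list are hypothetical.

```python
from collections import Counter
import math


def generalized_entropy(tokens, alpha):
    """Generalized entropy of order alpha for a list of word tokens.

    Assumption: Tsallis / Havrda-Charvat form,
    H_alpha = (1 - sum(p_i ** alpha)) / (alpha - 1).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 recovers Shannon entropy (in nats).
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)


def entropy_by_sample_size(tokens, alpha, sizes):
    """Entropy of the first n tokens for each n in sizes (hypothetical helper)."""
    return {n: generalized_entropy(tokens[:n], alpha) for n in sizes}


if __name__ == "__main__":
    # Toy token sequence, not real corpus data.
    toy = ["a", "b", "a", "c", "a", "b", "d", "a"]
    for n, h in entropy_by_sample_size(toy, alpha=2.0, sizes=[4, 8]).items():
        print(n, h)
```

Under this definition, small α gives more weight to rare word types and large α to frequent ones. Even in the toy example, the estimate changes when the same text is truncated to a shorter length, which is the kind of dependence on text length that the abstract identifies as a problem for comparing samples of different size.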
This paper argues that a lectometric approach may shed light on the distinction between destandardization and demotization, a pair of concepts that plays a key role in ongoing discussions about contemporary trends in standard languages. Instead of a binary distinction, the paper proposes three different types of destandardization, defined as quantitatively measurable changes in a stratigraphic language continuum. The three types are illustrated on the basis of a case study describing changes in the vocabulary of Dutch in the Netherlands and Flanders between 1990 and 2010.