Quantitative Linguistik
Refine
Year of publication
Document Type
- Article (21)
- Part of a Book (19)
- Other (3)
- Doctoral Thesis (1)
- Working Paper (1)
Keywords
- Sprachstatistik (16)
- Deutsch (14)
- Wortschatz (12)
- Korpus <Linguistik> (11)
- Lexikostatistik (10)
- COVID-19 (8)
- Online-Medien (8)
- Vielfalt (8)
- Sprachwandel (6)
- Statistik (5)
Publicationstate
- Veröffentlichungsversion (26)
- Zweitveröffentlichung (18)
- Postprint (9)
Reviewstate
Publisher
- Leibniz-Institut für Deutsche Sprache (IDS) (7)
- Wilhelm Fink (6)
- de Gruyter (4)
- De Gruyter (3)
- Institut für Deutsche Sprache (2)
- MDPI (2)
- Springer Nature (2)
- Benjamins (1)
- Buske (1)
- Erich Schmidt (1)
cOWIDplus Analyse ist eine kontinuierlich aktualisierte Ressource zu der Frage, ob und wie stark sich der Wortschatz ausgewählter deutscher Online-Pressemeldungen während der Corona-Pandemie systematisch einschränkt und ob bzw. wann sich das Vokabular nach der Krise wieder ausweitet. In diesem Artikel erläutern die Autor*innen die hinter der Ressource stehende Forschungsfrage, die zugrunde gelegten Daten, die Methode sowie die bisherigen Ergebnisse.
cOWIDplus Viewer
(2020)
cOWIDplus
(2020)
Die Corona-Krise hat Einfluss auf die Sprache in deutschsprachigen Online-Medien. Wir haben die Hypothese, dass sich die Vielfältigkeit des verwendeten Vokabulars einschränkt. Wir glauben zudem, dass sich die Diversität des Vokabulars nach "überstandener" Krise wieder auf ein "Prä-Pandemie-Niveau" einpendeln wird. Diese zweite Hypothese lässt sich erst im Laufe der Zeit überprüfen.
Seit 2017 wird im deutschen Mikrozensus eine Frage zur Sprache der Bevölkerung gestellt. Die letzte Spracherhebung in einem deutschen Zensus datiert aus dem Jahr 1939; entsprechend gibt es aktuell keine aussagekräftigen Sprachstatistiken in Deutschland. Die neue Sprachfrage des Mikrozensus weist jedoch erhebliche Mängel auf; offensichtlich wurde sie als Stellvertreterfrage zur Messung kultureller Integration konzipiert. Im vorliegenden Text werden die Fragen diskutiert und ihre ersten Ergebnisse analysiert. Daran anschließend werden andere Varianten von Sprachfragen dargestellt, dabei wird insbesondere auf die vorbildlichen Sprachfragen im kanadischen Zensus eingegangen. Abschließend wird die Sprachfrage der Deutschland-Erhebung 2018 des IDS inklusive ihrer Ergebnisse vorgestellt; die Deutschland-Erhebung 2018 stellt neben dem Mikrozensus bislang die einzige repräsentative Spracherhebung in Deutschland dar.
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
(2019)
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
We present a technique called event mapping that allows to project text representations into event lists, produce an event table, and derive quantitative conclusions to compare the text representations. The main application of the technique is the case where two classes of text representations have been collected in two different settings (e.g., as annotations in two different formal frameworks) and we can compare the two classes with respect to their systematic differences in the event table. We illustrate how the technique works by applying it to data collected in two experiments (one using annotations in Vladimir Propp’s framework, the other using natural language summaries).
Large-scale empirical evidence indicates a fascinating statistical relationship between the estimated number of language users and its linguistic and statistical structure. In this context, the linguistic niche hypothesis argues that this relationship reflects a negative selection against morphological paradigms that are hard to learn for adults, because languages with a large number of speakers are assumed to be typically spoken and learned by greater proportions of adults. In this paper, this conjecture is tested empirically for more than 2000 languages. The results question the idea of the impact of non-native speakers on the grammatical and statistical structure of languages, as it is demonstrated that the relative proportion of non-native speakers does not significantly correlate with either morphological or information-theoretic complexity. While it thus seems that large numbers of adult learners/speakers do not affect the (grammatical or statistical) structure of a language, the results suggest that there is indeed a relationship between the number of speakers and (especially) information-theoretic complexity, i.e. entropy rates. A potential explanation for the observed relationship is discussed.
We compare the use of überhaupt and sowieso in Dutch and German. We use the world-wide web as the main resource and pursue a zigzag strategy, trying to find usages going back and forth between dictionaries, intuitions and real data obtained through web search. To our surprise, the results more or less confirm the decision of Dutch dictionaries to consider überhaupt and sowieso synonymous. In German, we find no synonymy, but only a great overlap of usage conditions in declarative sentences.