Quantitative Linguistics
Information theory can be used to assess how efficiently a message is transmitted on the basis of different symbolic systems. In this paper, I estimate the information-theoretic efficiency of written language for parallel text data in more than 1000 different languages, both on the level of characters and on the level of words as information encoding units. The main results show that (i) the median efficiency is ∼29% on the character level and ∼45% on the word level, (ii) the efficiencies on the two levels are strongly correlated, and (iii) efficiency tends to be higher for languages with more speakers.
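To make the notion of efficiency on two levels of encoding units concrete, here is a minimal sketch. It assumes a deliberately simple definition that is not taken from the paper: efficiency as the ratio of the observed unigram entropy to the maximum entropy log2(V) for V distinct units. The function names and the sample sentence are hypothetical, and the paper's own estimator may well differ (its reported values suggest that it does).

```python
import math
from collections import Counter


def unigram_entropy(units):
    """Shannon entropy (bits per unit) of the empirical unigram distribution."""
    counts = Counter(units)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def efficiency(units):
    """Assumed definition: observed entropy divided by the maximum log2(V).

    Returns 0.0 when fewer than two distinct units occur, because the
    maximum entropy is then zero and the ratio is undefined.
    """
    vocab_size = len(set(units))
    if vocab_size < 2:
        return 0.0
    return unigram_entropy(units) / math.log2(vocab_size)


# Hypothetical sample text; characters and words serve as encoding units.
text = "the cat sat on the mat and the dog sat on the log"
char_eff = efficiency(list(text))
word_eff = efficiency(text.split())
print(f"character level: {char_eff:.2f}, word level: {word_eff:.2f}")
```

Under this simplified definition a uniform distribution over the units yields an efficiency of 1, and any skew in the frequencies lowers it.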
The coronavirus pandemic may be the largest crisis the world has had to face since World War II. It does not come as a surprise that it is also having an impact on language as our primary communication tool. In this short paper, we present three interconnected resources that are designed to capture and illustrate these effects on a subset of the German language: an RSS corpus of German-language newsfeeds (with freely available untruncated frequency lists), a continuously updated HTML page tracking the diversity of the vocabulary in the RSS corpus, and a Shiny web application that enables other researchers and the broader public to explore the corpus in terms of basic frequencies.
Since 2017, the German microcensus has included a question on the language of the population. The last language survey in a German census dates from 1939; accordingly, there are currently no meaningful language statistics in Germany. The new microcensus language question, however, has considerable shortcomings; it was evidently designed as a proxy question for measuring cultural integration. This paper discusses the questions and analyzes their first results. It then presents other variants of language questions, with particular attention to the exemplary language questions in the Canadian census. Finally, it presents the language question of the IDS Deutschland-Erhebung 2018 together with its results; alongside the microcensus, the Deutschland-Erhebung 2018 is so far the only representative language survey in Germany.
We present a technique called event mapping that makes it possible to project text representations into event lists, produce an event table, and derive quantitative conclusions for comparing the text representations. The main application of the technique is the case where two classes of text representations have been collected in two different settings (e.g., as annotations in two different formal frameworks), so that the two classes can be compared with respect to their systematic differences in the event table. We illustrate how the technique works by applying it to data collected in two experiments (one using annotations in Vladimir Propp’s framework, the other using natural language summaries).
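The pipeline described above (representations projected to event lists, then an event table, then a comparison of classes) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the projection step is assumed to be already done, the event labels and class names are invented, and the table is assumed to record, for each event, the share of representations in each class that mention it.

```python
def build_event_table(event_lists_by_class):
    """Build an event table from already-projected event lists.

    event_lists_by_class maps a class name to a list of event lists,
    one event list per text representation. The result maps each event
    to the proportion of representations in each class that contain it.
    """
    classes = list(event_lists_by_class)
    counts = {}
    for cls, event_lists in event_lists_by_class.items():
        for events in event_lists:
            # Count each event at most once per representation.
            for event in set(events):
                counts.setdefault(event, {}).setdefault(cls, 0)
                counts[event][cls] += 1
    table = {}
    for event, per_class in counts.items():
        table[event] = {}
        for cls in classes:
            n = len(event_lists_by_class[cls])
            table[event][cls] = per_class.get(cls, 0) / n if n else 0.0
    return table


# Hypothetical data: two classes of representations of the same story.
data = {
    "propp": [["villainy", "departure", "victory"], ["villainy", "victory"]],
    "summary": [["departure", "victory"], ["victory"]],
}
table = build_event_table(data)
for event in sorted(table):
    print(event, table[event])
```

In this toy data, "villainy" appears in every annotation of the first class and in none of the summaries, which is the kind of systematic difference the event table is meant to expose.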
We compare the use of überhaupt and sowieso in Dutch and German. We use the world-wide web as the main resource and pursue a zigzag strategy, trying to find usages going back and forth between dictionaries, intuitions and real data obtained through web search. To our surprise, the results more or less confirm the decision of Dutch dictionaries to consider überhaupt and sowieso synonymous. In German, we find no synonymy, but only a great overlap of usage conditions in declarative sentences.