Refine
Document Type
- Article (2)
Language
- English (2)
Has Fulltext
- yes (2)
Is part of the Bibliography
- yes (2)
Keywords
- Deutsch (2) (remove)
Publicationstate
- Postprint (2) (remove)
Reviewstate
- Peer-Review (2) (remove)
In a recent article, Meylan and Griffiths (Meylan & Griffiths, 2021, henceforth, M&G) focus their attention on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011, henceforth, PT&G) who argue that average information content is a better predictor of word length than word frequency. We applaud M&G who conducted a very important study that should be read by any researcher interested in working with large-scale corpora. The fact that M&G mostly failed to find clear evidence in favor of PT&G's main finding motivated us to test PT&G's idea on a subset of the largest archive of German language texts designed for linguistic research, the German Reference Corpus consisting of ∼43 billion words. We only find very little support for the primary data point reported by PT&G.
The coronavirus pandemic may be the largest crisis the world has had to face since World War II. It does not come as a surprise that it is also having an impact on language as our primary communication tool. In this short paper, we present three inter-connected resources that are designed to capture and illustrate these effects on a subset of the German language: An RSS corpus of German-language newsfeeds (with freely available untruncated frequency lists), a continuously updated HTML page tracking the diversity of the vocabulary in the RSS corpus and a Shiny web application that enables other researchers and the broader public to explore the corpus in terms of basic frequencies.