Corpus Linguistics
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset is based on a corpus of 43.2 billion tokens and is divided into 16 parts, one per corpus fold. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that in many use cases it is possible to work with a subset of the folds (e.g., to save computational resources). In a case study, we investigate how the vocabulary (as well as the number of hapax legomena) grows as an increasing number of folds is included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also provide Python, R, and Stata markdown scripts that show how to work with the resource.
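The vocabulary-growth case study described above can be illustrated with a minimal sketch. It assumes each fold is available as a mapping from word form to frequency; the toy folds, the function name `vocabulary_growth`, and the in-memory representation are hypothetical and do not reflect the actual DeReKoGram file format or the scripts distributed with the resource.

```python
from collections import Counter


def vocabulary_growth(folds):
    """Return (n_folds, vocabulary_size, hapax_count) after adding each fold.

    `folds` is an iterable of Counter-like mappings (word -> frequency).
    A hapax legomenon is a word whose cumulative frequency is exactly 1.
    """
    cumulative = Counter()
    growth = []
    for n_folds, fold in enumerate(folds, start=1):
        cumulative.update(fold)  # add this fold's frequencies to the running totals
        vocab_size = len(cumulative)
        hapaxes = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((n_folds, vocab_size, hapaxes))
    return growth


# Hypothetical toy data standing in for per-fold unigram frequency lists.
toy_folds = [
    Counter({"haus": 3, "baum": 1}),
    Counter({"haus": 2, "hund": 1}),
    Counter({"baum": 1, "katze": 1}),
]

growth = vocabulary_growth(toy_folds)
for n_folds, vocab_size, hapaxes in growth:
    print(n_folds, vocab_size, hapaxes)
```

Note that a word can stop being a hapax once a later fold contributes a second occurrence (here "baum"), which is why the hapax count has to be recomputed on the cumulative frequencies rather than summed per fold.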
Using the privative derivational suffix -los as a test case, we examine statements found in the literature on word formation on an empirical basis that is novel, at least in this field: a list of affective-emotional ratings of base nouns and their associated -los derivations. In addition to a frequency analysis based on the German Reference Corpus, we show that, in general, emotional polarity (so-called valence, positive vs. negative emotions) is reversed by suffixation with -los. This change is stronger for more polarized base nouns. The perceived intensity of emotion (so-called arousal) is generally lower for -los derivations than for base nouns. Finally, to capture the results theoretically, we propose a prototypical -los construction in the framework of Construction Morphology.
In a recent article, Meylan and Griffiths (2021; henceforth M&G) focus on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011; henceforth PT&G), who argue that average information content is a better predictor of word length than word frequency. We applaud M&G for conducting a very important study that should be read by any researcher interested in working with large-scale corpora. The fact that M&G mostly failed to find clear evidence in favor of PT&G's main finding motivated us to test PT&G's idea on a subset of the largest archive of German language texts designed for linguistic research, the German Reference Corpus, consisting of ∼43 billion words. We find very little support for the primary data point reported by PT&G.
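To make the quantity under discussion concrete, the following sketch estimates a word's average information content as its mean surprisal, -log2 P(word | context), over all of its occurrences, using a one-word preceding context. This is a simplified illustration under stated assumptions: the toy bigram counts and the function name are hypothetical, probabilities are plain relative frequencies without smoothing, and the studies cited above may use different context sizes and preprocessing.

```python
import math
from collections import defaultdict


def average_information_content(bigram_counts):
    """Estimate average information content per word from bigram counts.

    `bigram_counts` maps (context, word) -> frequency. For each word, the
    result is the occurrence-weighted mean of -log2 P(word | context),
    with P estimated as count(context, word) / count(context, *).
    """
    context_totals = defaultdict(int)
    word_totals = defaultdict(int)
    for (context, word), n in bigram_counts.items():
        context_totals[context] += n
        word_totals[word] += n

    surprisal_sums = defaultdict(float)
    for (context, word), n in bigram_counts.items():
        p = n / context_totals[context]  # unsmoothed conditional probability
        surprisal_sums[word] += -n * math.log2(p)

    return {w: surprisal_sums[w] / word_totals[w] for w in word_totals}


# Hypothetical toy counts; real input would come from corpus n-gram lists.
toy_bigrams = {
    ("das", "haus"): 2,
    ("das", "kind"): 2,
    ("ein", "haus"): 1,
    ("ein", "kind"): 3,
}

info = average_information_content(toy_bigrams)
for word in sorted(info):
    print(word, len(word), round(info[word], 3))
```

In a full analysis, these per-word values and the corresponding word frequencies would each be correlated with word length to compare the two predictors.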