Distributional models of word use constitute an indispensable tool in corpus-based lexicological research for discovering paradigmatic relations and syntagmatic patterns (Belica et al. 2010). Recently, word embeddings (Mikolov et al. 2013) have revived the field by making it possible to construct and analyze distributional models on very large corpora. This is accomplished by reducing the very high dimensionality of word co-occurrence contexts, which equals the size of the vocabulary, to a small number of dimensions, such as 100-200. However, word use and meaning can vary widely along dimensions such as domain, register, and time, and word embeddings tend to represent only the most prevalent meaning. In this paper we therefore construct domain-specific word embeddings to allow for systematic analysis of variation in word use. Moreover, we also demonstrate how to reconstruct domain-specific co-occurrence contexts from the dense word embeddings.
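To make the general approach concrete, the following is a minimal sketch (not the authors' actual pipeline) of training one embedding model per domain subcorpus with gensim's word2vec and comparing a word's paradigmatic neighborhood across domains. The file names, tokenization, domain labels, and the query word `Maus` are illustrative assumptions.

```python
# Minimal sketch: domain-specific word embeddings via gensim word2vec.
# File names, preprocessing, and the query word are assumptions for
# illustration; the abstract does not specify the authors' pipeline.
from gensim.models import Word2Vec

def read_sentences(path):
    """Yield whitespace-tokenized sentences, one per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                yield tokens

# Train a separate model per (hypothetical) domain subcorpus.
models = {
    domain: Word2Vec(
        sentences=list(read_sentences(f"{domain}.txt")),
        vector_size=100,   # dense dimensionality, in the 100-200 range above
        window=5,          # symmetric co-occurrence context
        min_count=5,
        workers=4,
    )
    for domain in ("news", "science")
}

# Nearest neighbors in the dense space approximate paradigmatically
# related words; differences across domains indicate variation in use.
for domain, model in models.items():
    print(domain, model.wv.most_similar("Maus", topn=10))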
In a recent article, Meylan and Griffiths (Meylan & Griffiths, 2021; henceforth M&G) focus their attention on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011; henceforth PT&G), who argue that average information content is a better predictor of word length than word frequency. We applaud M&G for conducting a very important study that should be read by any researcher interested in working with large-scale corpora. The fact that M&G mostly failed to find clear evidence in favor of PT&G's main finding motivated us to test PT&G's idea on a subset of the largest archive of German-language texts designed for linguistic research, the German Reference Corpus (DeReKo), consisting of ∼43 billion words. We find only very little support for the primary data point reported by PT&G.
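For readers unfamiliar with the comparison at issue, here is a hedged toy sketch of its logic: estimate each word's average information content under a simple bigram model (PT&G used larger n-gram models trained on far more data) and contrast its correlation with word length against that of log frequency. The toy corpus, add-one smoothing, and Spearman correlation are illustrative assumptions, not the procedure applied to DeReKo.

```python
# Toy sketch of the PT&G-style comparison: correlate word length with
# (a) negative log unigram frequency and (b) average information content
# under a bigram model. Corpus and smoothing are illustrative assumptions.
import math
from collections import Counter, defaultdict
from scipy.stats import spearmanr

corpus = "the cat sat on the mat the cat ate the mat".split()  # placeholder

# Unigram and bigram counts.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)
vocab = len(unigrams)

# Average information content of w: mean of -log2 P(w | previous word)
# over all occurrences of w, with add-one smoothing.
info = defaultdict(list)
for prev, w in zip(corpus, corpus[1:]):
    p = (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab)
    info[w].append(-math.log2(p))

words = [w for w in unigrams if w in info]
length = [len(w) for w in words]
neg_log_freq = [-math.log2(unigrams[w] / total) for w in words]
avg_info = [sum(info[w]) / len(info[w]) for w in words]

# PT&G's claim is that the second correlation exceeds the first.
print("length ~ -log frequency:   ", spearmanr(length, neg_log_freq))
print("length ~ avg information:  ", spearmanr(length, avg_info))
```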