Korpuslinguistik
Refine
Year of publication
Document Type
- Conference Proceeding (12)
- Article (1)
- Part of a Book (1)
Has Fulltext
- yes (14)
Keywords
- Korpus <Linguistik> (13)
- Deutsch (3)
- Deutsches Referenzkorpus (DeReKo) (3)
- Visualisierung (3)
- Language Variation (2)
- Langzeitarchivierung (2)
- Semasiologie (2)
- Sprachvariante (2)
- Sprachwandel (2)
- word embeddings (2)
Publicationstate
- Veröffentlichungsversion (7)
- Postprint (3)
Reviewstate
- Peer-Review (8)
- (Verlags)-Lektorat (2)
Publisher
- European Language Resources Association (ELRA) (4)
- European Language Resources Association (1)
- European language resources association (ELRA) (1)
- Jagiellonian University; Pedagogical University (1)
- Lancaster University (1)
- Leibniz-Institut für Deutsche Sprache (1)
- Leibniz-Institut für Deutsche Sprache (IDS) (1)
- Nisaba (1)
- University of Birmingham (1)
- Universität Zürich (1)
We present the use of count-based and predictive language models for exploring language use in the German Reference Corpus DeReKo. For collocation analysis along the syntagmatic axis we employ traditional association measures based on co-occurrence counts as well as predictive association measures derived from the output weights of skipgram word embeddings. For inspecting the semantic neighbourhood of words along the paradigmatic axis we visualize the high dimensional word embeddings in two dimensions using t-stochastic neighbourhood embeddings. Together, these visualizations provide a complementary, explorative approach to analysing very large corpora in addition to corpus querying. Moreover, we discuss count-based and predictive models w.r.t. scalability and maintainability in very large corpora.
We evaluate a graph-based dependency parser on DeReKo, a large corpus of contemporary German. The dependency parser is trained on the German dataset from the SPMRL 2014 Shared Task which contains text from the news domain, whereas DeReKo also covers other domains including fiction, science, and technology. To avoid the need for costly manual annotation of the corpus, we use the parser’s probability estimates for unlabeled and labeled attachment as main evaluation criterion. We show that these probability estimates are highly correlated with the actual attachment scores on a manually annotated test set. On this basis, we compare estimated parsing scores for the individual domains in DeReKo, and show that the scores decrease with increasing distance of a domain to the training corpus.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition we report about advancements in the EuReCo- and ICC-initiatives with the provision of comparable corpora, and about recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.
Distributional models of word use constitute an indispensable tool in corpus based lexicological research for discovering paradigmatic relations and syntagmatic patterns (Belica et al. 2010). Recently, word embeddings (Mikolov et al. 2013) have revived the field by allowing to construct and analyze distributional models on very large corpora. This is accomplished by reducing the very high dimensionality of word cooccurrence contexts, the size of the vocabulary, to few dimensions, such as 100-200. However, word use and meaning can vary widely along dimensions such as domain, register, and time, and word embeddings tend to represent only the most prevalent meaning. In this paper we thus construct domain specific word embeddings to allow for systematically analyzing variations in word use. Moreover, we also demonstrate how to reconstruct domain specific co-occurrence contexts from the dense word embeddings.