Refine
Year of publication
Document Type
- Conference Proceeding (5)
- Part of a Book (4)
- Article (3)
- Other (2)
Has Fulltext
- yes (14)
Keywords
- Deutsches Referenzkorpus (DeReKo) (14) (remove)
Publicationstate
- Veröffentlichungsversion (14) (remove)
Reviewstate
- (Verlags)-Lektorat (7)
- Peer-Review (5)
Publisher
- Institut für Deutsche Sprache (3)
- European Language Resources Association (ELRA) (2)
- Narr (2)
- de Gruyter (2)
- German Society for Computational Linguistics & Language Technology (GSCL) (1)
- Lancaster University (1)
- Leibniz-Institut für Deutsche Sprache (1)
- Leibniz-Institut für Deutsche Sprache (IDS) (1)
- Universitätsbibliothek Johann Christian Senckenberg (1)
Datensatz Schwache Maskulina
(2023)
Der Datensatz enthält eine Sammlung von 1.156 Substantiven (mit wenigen Ausnahmen Maskulina), die sich im Korpusgrammatik-Untersuchungskorpus (Bubenhofer et al. 2014), basierend auf dem Deutschen Referenzkorpus DeReKo (Kupietz et al. 2010, 2018), Release 2017-II, unmittelbar nach einem Beleg für die Akkusativ- oder Dativform des unbestimmten Artikels ( einen / einem ) mindestens einmal mit der “schwachen” Endung -(e)n belegen lassen (z.B. einen Aktivisten , einem Autoren ). Einzelheiten zur Datenerhebung in Weber & Hansen (2023).
Die durch die Covid-19-Pandemie bedingte Umstellung der Präsenzlehre auf digitale Lehr- und Lernformate stellte Lehrende und Studierende gleichermaßen vor eine Herausforderung. Innerhalb kürzester Zeit musste die Nutzung von Plattformen und digitalen Tools erlernt und getestet werden. Der Beitrag stellt exemplarisch Dienste und Werkzeuge von CLARIAH-DE vor und erläutert, wie die digitale Forschungsinfrastruktur Lehrende und Studierende auch im Rahmen der digitalen Lehre unterstützen kann.
Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.
We present the use of count-based and predictive language models for exploring language use in the German Reference Corpus DeReKo. For collocation analysis along the syntagmatic axis we employ traditional association measures based on co-occurrence counts as well as predictive association measures derived from the output weights of skipgram word embeddings. For inspecting the semantic neighbourhood of words along the paradigmatic axis we visualize the high dimensional word embeddings in two dimensions using t-stochastic neighbourhood embeddings. Together, these visualizations provide a complementary, explorative approach to analysing very large corpora in addition to corpus querying. Moreover, we discuss count-based and predictive models w.r.t. scalability and maintainability in very large corpora.
CMC Corpora in DeReKo
(2017)
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in DeReKo, the German Reference Corpus, namely Wikipedia (discussion) corpora, the Usenet news corpus, and the Dortmund Chat Corpus. The data and corpora have been converted to I5, the TEI customization to represent texts in DeReKo, and are researchable via the web-based IDS corpus research interfaces and in the case of Wikipedia and chat also downloadable from the IDS repository and download server, respectively.
This paper describes the efforts in the field of sustainability of the Institut für Deutsche Sprache (IDS) in Mannheim with respect to DEREKO (Deutsches Referenzkorpus) the Archive of General Reference Corpora of Contemporary Written German. With focus on re-usability and sustainability, we discuss its history and our future plans. We describe legal challenges related to the creation of a large and sustainable resource; sketch out the pipeline used to convert raw texts to the final corpus format and outline migration plans to TEI P5. Due to the fact, that the current version of the corpus management and query system is pushed towards its limits, we discuss the requirements for a new version which will be able to handle current and future DEREKO releases. Furthermore, we outline the institute’s plans in the field of digital preservation.
The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox' Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software for which academic licenses are available free of charge for academic institutions, e.g. Helmut Schmid's Tree Tagger. The second part focuses on the linguistic interpretability of the corpus annotations and more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher using the annotations blindly will run the risk of not actually studying the language but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology.
Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.