Refine
Year of publication
Document Type
- Part of a Book (19)
- Conference Proceeding (8)
- Article (3)
- Other (1)
Has Fulltext
- yes (31)
Is part of the Bibliography
- yes (31) (remove)
Keywords
- Korpus <Linguistik> (30)
- Deutsch (12)
- Annotation (8)
- Computerunterstützte Kommunikation (8)
- Deutsches Referenzkorpus (DeReKo) (6)
- CMC (4)
- TEI (4)
- Text Encoding Initiative (4)
- Textkorpus (4)
- computer-mediated communication (4)
Publicationstate
- Veröffentlichungsversion (22)
- Zweitveröffentlichung (6)
- Postprint (1)
Reviewstate
- Peer-Review (15)
- (Verlags)-Lektorat (7)
- Peer-review (3)
- Verlags-Lektorat (1)
Publisher
- de Gruyter (6)
- German Society for Computational Linguistics & Language Technology (GSCL) (2)
- IDS-Verlag (2)
- Institut für Deutsche Sprache (2)
- Leibniz-Institut für Deutsche Sprache (IDS) (2)
- University of Antwerp (2)
- Cergy-Pontoise University, France (1)
- De Gruyter (1)
- ELRA (1)
- European Language Resources Association (ELRA) (1)
Der Beitrag beschäftigt sich mit der Frage, wie und inwieweit korpusbasierte Ansätze zur Untersuchung und Bewertung von Sprachwandel beitragen können. Die Bewertung von Sprachwandel erscheint in dieser Hinsicht interessant, da sie erstens von größerem öffentlichen Interesse ist, zweitens nicht zu den Kernthemen der Sprachwissenschaft zählt und drittens sowohl die geisteswissenschaftlichen Aspekte der Sprachwissenschaft berührt als auch die empirischen, die eher für die so genannten harten Wissenschaften typisch sind. Letzteres trifft bei der Frage nach Sprachverfall (gutem vs. schlechtem Deutsch diachron) vermutlich unbestrittener zu als bei der Frage nach richtigem vs. falschem Deutsch, da zu ihrer Beantwortung offensichtlich einerseits empirische, messbare Kriterien herangezogen werden müssen, andererseits aber auch weitere Kriterien notwendig sind und es außerdem einer Entscheidung zur Einordnung und Gewichtung der verschiedenartigen Kriterien sowie einer Begründung dieser Entscheidung bedarf. Zur Annäherung an die Fragestellung werden zunächst gängige, leicht operationalisierbare Hypothesen zu Symptomen eines potenziellen Verfalls des Deutschen auf verschiedenen DeReKo-basierten Korpora überprüft und im Hinblick auf ihre Verallgemeinerbarkeit und Tragweite diskutiert. Im zweiten Teil werden weitere empirische Ansätze zur Untersuchung von Wandel, Variation und Dynamik skizziert, die zur Diskussion spezieller Aspekte von Sprachverfall beitragen könnten. Im Schlussteil werden die vorgestellten Ansätze in den Gesamtkontext einer sprachwissenschaftlichen Untersuchung von Sprachverfall gestellt und vor dem Hintergrund seines gesellschaftlichen Diskurses reflektiert.
This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future research interface. Due to the recent acquisition of new licenses, DeReKo has grown by a factor of four in the first half of 2014, mostly in the area of newspaper text, and presently contains over 24 billion word tokens. Other strata, like fictional texts, web corpora, in particular CMC texts, and spoken but conceptually written texts have also increased significantly. We report on the newly acquired corpora that led to the major increase, on the principles and strategies behind our corpus acquisition activities, and on our solutions for the emerging legal, organisational, and technical challenges.
Maximizing the potential of very large corpora: 50 years of big language data at IDS Mannheim
(2014)
Very large corpora have been built and used at the IDS since its foundation in 1964. They have been made available on the Internet since the beginning of the 90’s to currently over 30,000 researchers worldwide. The Institute provides the largest archive of written German (Deutsches Referenzkorpus, DeReKe) which has recently been extended to 24 billion words. DeReKe has been managed and analysed by engines known as COSMAS and afterwards COSMAS II, which is currently being replaced by a new, scalable analysis platform called KorAP. KorAP makes it possible to manage and analyse texts that are accompanied by multiple, potentially conflicting, grammatical and structural annotation layers, and is able to handle resources that are distributed across different, and possibly geographically distant, storage systems. The majority of texts in DeReKe are not licensed for free redistribution, hence, the COSMAS and KorAP systems offer technical solutions to facilitate research on very large corpora that are not available (and not suitable) for download. For the new KorAP system, it is also planned to provide sandboxed environments to support non-remote-API access “near the data” through which users can run their own analysis programs.
Machine learning methods offer a great potential to automatically investigate large amounts of data in the humanities. Our contribution to the workshop reports about ongoing work in the BMBF project KobRA (http://www.kobra.tu-dortmund.de) where we apply machine learning methods to the analysis of big corpora in language-focused research of computer-mediated communication (CMC). At the workshop, we will discuss first results from training a Support Vector Machine (SVM) for the classification of selected linguistic features in talk pages of the German Wikipedia corpus in DeReKo provided by the IDS Mannheim. We will investigate different representations of the data to integrate complex syntactic and semantic information for the SVM. The results shall foster both corpus-based research of CMC and the annotation of linguistic features in CMC corpora.
Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.
Valenz und Kookkurrenz
(2015)
CMC Corpora in DeReKo
(2017)
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in DeReKo, the German Reference Corpus, namely Wikipedia (discussion) corpora, the Usenet news corpus, and the Dortmund Chat Corpus. The data and corpora have been converted to I5, the TEI customization to represent texts in DeReKo, and are researchable via the web-based IDS corpus research interfaces and in the case of Wikipedia and chat also downloadable from the IDS repository and download server, respectively.
As a consequence of a recent curation project, the Dortmund Chat Corpus is available in CLARIN-D research infrastructures for download and querying. In a legal expertise it had been recommended that standard measures of anonymisation be applied to the corpus before its republication. This paper reports about the anonymisation campaign that was conducted for the corpus. Anonymisation has been realised as categorisation, and the taxonomy of anonymisation categories applied is introduced and the method of applying it to the TEI files is demonstrated. The results of the anonymisation campaign as well as issues of quality assessment are discussed. Finally, pseudonymisation as an alternative to categorisation as a method of the anonymisation of CMC data is discussed, as well as possibilities of an automatisation of the process.