Neues von KorAP
(2019)
Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.
This article shows how far linguistic playfulness can go in forming large groups of compounds around a stimulating model word, and how the same playful impulse produces a wealth of variants of notable literary or political formulations. To trace such wordplay across current newspapers in Germany, Austria, and Switzerland, a large digital retrieval system such as COSMAS II from the Institut für Deutsche Sprache in Mannheim is recommended.
Machine learning methods offer great potential for automatically investigating large amounts of data in the humanities. Our contribution to the workshop reports on ongoing work in the BMBF project KobRA (http://www.kobra.tu-dortmund.de), where we apply machine learning methods to the analysis of big corpora in language-focused research on computer-mediated communication (CMC). At the workshop, we will discuss first results from training a Support Vector Machine (SVM) to classify selected linguistic features in talk pages of the German Wikipedia corpus in DeReKo, provided by the IDS Mannheim. We will investigate different representations of the data in order to integrate complex syntactic and semantic information for the SVM. The results are intended to foster both corpus-based research on CMC and the annotation of linguistic features in CMC corpora.