Korpuslinguistik
Refine
Year of publication
Document Type
- Part of a Book (26)
- Conference Proceeding (14)
- Article (6)
Has Fulltext
- yes (46)
Keywords
- Korpus <Linguistik> (42)
- Deutsch (19)
- Annotation (11)
- Computerunterstützte Kommunikation (10)
- Deutsches Referenzkorpus (DeReKo) (9)
- Chatten <Kommunikation> (5)
- Computerlinguistik (5)
- CMC (4)
- Institut für Deutsche Sprache <Mannheim> (4)
- TEI (4)
Publicationstate
- Veröffentlichungsversion (34)
- Zweitveröffentlichung (7)
- Postprint (2)
Reviewstate
- Peer-Review (19)
- (Verlags)-Lektorat (18)
- Peer-review (3)
- Verlags-Lektorat (1)
Publisher
- Institut für Deutsche Sprache (6)
- de Gruyter (6)
- German Society for Computational Linguistics & Language Technology (GSCL) (2)
- IDS-Verlag (2)
- University of Antwerp (2)
- Academic Publishing Division of the Faculty of Arts of the University of Ljubljana (1)
- Austrian Centre for Digital Humanities, Austrian Academy of Sciences (1)
- CLARIN (1)
- Campus (1)
- Cergy-Pontoise University, France (1)
The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped like.
The paper presents best practices and results from projects dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC) from four different countries. Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of tested solutions which may serve as a starting point for a comprehensive discussion on how future standards for CMC corpora could (and should) be shaped like.
Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D
(2016)
The paper presents results from a curation project within CLARIN-D, in which an existing lMWord corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format.
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.
Since 2013 representatives of several French and German CMC corpus projects have developed three customizations of the TEI-P5 standard for text encoding in order to adapt the encoding schema and models provided by the TEI to the structural peculiarities of CMC discourse. Based on the three schema versions, a 4th version has been created which takes into account the experiences from encoding our corpora and which is specifically designed for the submission of a feature request to the TEI council. On our poster we would present the structure of this schema and its relations (commonalities and differences) to the previous schemas.
In this Paper, we describe a schema and models which have been developed for the representation of corpora of computer-mediated communicatin (CMC corpora) using the representation framework provided by the Text Encoding Initiative (TEI). We characterise CMC discourse as dialogic, sequentially organised interchange between humans and point out that many features of CMC are not adequately handled by current corpus encoding schemas and tools. We formulate desiderata for a representation of CMC in encoding schemes and argue why the TEI is a suitable framework for the encoding of CMC corpora. We propose a model of basic CMC units (utterances, posts, and nonverbal activities) and the macro- and micro-level structures of interactions in CMC environments. Based on these models, we introduce CMC-core, a TEI customisation for the encoding of CMC corpora, which defines CMC-specific encoding features on the four levels of elements, model classes, attribute classes, and modules of the TEI infrastructure. The description of our customisation is illustrated by encoding examples from corpora by researchers of the TEI SIG CMC, representing a variety of CMC genres, i.e. chat, wiki talk, twitter, blog, and Second Life interactions. The material described, i.e. schemata, encoding examples, and documentation, is available from the of the TEI CMC SIG Wiki and will accompany a feature request to the TEI council in late 2019.
Dieses Kapitel gibt einen Überblick über Korpora internetbasierter Kommunikation, die als digitale Ressourcen frei zur Verfügung stehen und für eigene linguistische Forschungsarbeiten genutzt werden können. In Abschnitt 1 erläutern wir korpuslinguistische Basiskonzepte, die für die Arbeit mit Korpora internetbasierter Kommunikation benötigt werden, und präzisieren die Sprachgebrauchsdomäne Internetbasierte Kommunikation, die den Gegenstand des hier beschriebenen Ressourcentyps bildet. Abschnitt 2 gibt einen Überblick zu existierenden Korpusressourcen für das Deutsche und stellt ausgewählte Korpora zu weiteren europäischen Sprachen vor. In Abschnitt 3 geben wir abschließend einen kurzen Einblick in aktuelle Forschungsfelder, die sich im Bereich der Korpuslinguistik und Sprachtechnologie in Bezug auf den Aufbau und die Aufbereitung von Korpora internetbasierter Kommunikation stellen.
Machine learning methods offer a great potential to automatically investigate large amounts of data in the humanities. Our contribution to the workshop reports about ongoing work in the BMBF project KobRA (http://www.kobra.tu-dortmund.de) where we apply machine learning methods to the analysis of big corpora in language-focused research of computer-mediated communication (CMC). At the workshop, we will discuss first results from training a Support Vector Machine (SVM) for the classification of selected linguistic features in talk pages of the German Wikipedia corpus in DeReKo provided by the IDS Mannheim. We will investigate different representations of the data to integrate complex syntactic and semantic information for the SVM. The results shall foster both corpus-based research of CMC and the annotation of linguistic features in CMC corpora.