Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D
(2016)
The paper presents results from a curation project within CLARIN-D, in which an existing 1M-word corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format.
We introduce our pipeline for integrating computer-mediated communication (CMC) and social media (SM) corpora into the CLARIN-D corpus infrastructure. The pipeline was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource that conforms to current technical and legal standards. We describe how the resource has been prepared and restructured in terms of TEI encoding, linguistic annotations, and anonymisation. The output is a CLARIN-conformant resource integrated into the CLARIN-D research infrastructure.
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for integrating an existing chat corpus into CLARIN-D, the research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper gives an overview of the resources and practices developed in the project, describes the value added to the resource by its integration, and, as an outlook, discusses to what extent these practices can be considered best practices that may be useful for the annotation and representation of other CMC and social media corpora.