Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN
- We introduce our pipeline to integrate corpora of computer-mediated communication (CMC) and social media (SM) into the CLARIN-D corpus infrastructure. The pipeline was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource conforming to current technical and legal standards. We describe how the resource has been prepared and restructured in terms of TEI encoding, linguistic annotations, and anonymisation. The output is a CLARIN-conformant resource integrated in the CLARIN-D research infrastructure.
Author: | Harald Lüngen, Michael Beißwenger, Eric Ehrhardt, Axel Herold, Angelika Storrer |
---|---|
URN: | urn:nbn:de:bsz:mh39-55743 |
URL: | https://www.linguistics.ruhr-uni-bochum.de/bla/ |
ISSN: | 2190-0949 |
Parent Title (English): | Proceedings of the 13th Conference on Natural Language Processing (KONVENS) |
Series (Serial Number): | Bochumer Linguistische Arbeitsberichte (16) |
Publisher: | Sprachwissenschaftliches Institut, Ruhr-Universität Bochum |
Place of publication: | Bochum |
Editor: | Stefanie Dipper, Friedrich Neubarth, Heike Zinsmeister |
Document Type: | Part of a Book |
Language: | English |
Year of first Publication: | 2016 |
Date of Publication (online): | 2016/11/16 |
Publication state: | Published version |
Review state: | Peer review |
Research resource: | http://hdl.handle.net/10932/00-03B0-14FA-A8D0-0F01-F |
GND Keyword: | Chatten <Kommunikation>; Deutsch; Korpus <Linguistik>; Text Encoding Initiative (TEI) |
First Page: | 156 |
Last Page: | 164 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
BDSL-Classification: | Textual studies |
Leibniz-Classification: | Language, Linguistics |
Linguistics-Classification: | Corpus linguistics |
Licence (German): | Urheberrechtlich geschützt |