The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Although many issues in building and annotating corpora of this type remain open, a range of accessible solutions already exists. These solutions have been tested in projects and may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.
Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D
(2016)
The paper presents results from a curation project within CLARIN-D, in which an existing 1M-word corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format.
This paper describes the lexical database tool LOLA (Linguistic-Oriented Lexical database Approach), which has been developed for the construction and maintenance of lexicons for the machine translation system LMT. First, the requirements such a tool should meet are discussed; then LMT, the lexical information it requires, and some issues concerning vocabulary acquisition are presented. Afterwards, the architecture and the components of the LOLA system are described, and it is shown how we tried to meet the requirements worked out earlier. Although LOLA was originally designed and implemented for the German-English LMT prototype, it aimed from the beginning at a representation of lexical data that can be reused for other LMT or MT prototypes or even other NLP applications. A special point of discussion will therefore be the adaptability of the tool and its components, as well as the reusability of the lexical data stored in the database for lexicon development for LMT or for other applications.
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration, and discusses, as an outlook, to what extent these practices can be considered best practices that may be useful for the annotation and representation of other CMC and social media corpora.