Refine
Document Type
- Part of a Book (11)
- Conference Proceeding (8)
- Article (2)
Language
- English (21) (remove)
Has Fulltext
- yes (21)
Is part of the Bibliography
- yes (21) (remove)
Keywords
- Korpus <Linguistik> (20)
- Annotation (8)
- Computerunterstützte Kommunikation (8)
- Deutsch (7)
- CMC (4)
- TEI (4)
- Text Encoding Initiative (4)
- Textkorpus (4)
- reply relations (4)
- Deutsches Referenzkorpus (DeReKo) (3)
Publicationstate
Reviewstate
- Peer-Review (14)
- Peer-review (3)
Publisher
- German Society for Computational Linguistics & Language Technology (GSCL) (2)
- IDS-Verlag (2)
- University of Antwerp (2)
- Cergy-Pontoise University, France (1)
- ELRA (1)
- European Language Resources Association (ELRA) (1)
- European language resources association (ELRA) (1)
- IDS-Verlag; Leibniz-Institut für Deutsche Sprache (IDS) (1)
- Institut für Deutsche Sprache (1)
- Leibniz-Institut für Deutsche Sprache (1)
As a consequence of a recent curation project, the Dortmund Chat Corpus is available in CLARIN-D research infrastructures for download and querying. In a legal expertise it had been recommended that standard measures of anonymisation be applied to the corpus before its republication. This paper reports about the anonymisation campaign that was conducted for the corpus. Anonymisation has been realised as categorisation, and the taxonomy of anonymisation categories applied is introduced and the method of applying it to the TEI files is demonstrated. The results of the anonymisation campaign as well as issues of quality assessment are discussed. Finally, pseudonymisation as an alternative to categorisation as a method of the anonymisation of CMC data is discussed, as well as possibilities of an automatisation of the process.
Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.
The paper presents best practices and results from projects dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC) from four different countries. Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of tested solutions which may serve as a starting point for a comprehensive discussion on how future standards for CMC corpora could (and should) be shaped like.
CMC Corpora in DeReKo
(2017)
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in DeReKo, the German Reference Corpus, namely Wikipedia (discussion) corpora, the Usenet news corpus, and the Dortmund Chat Corpus. The data and corpora have been converted to I5, the TEI customization to represent texts in DeReKo, and are researchable via the web-based IDS corpus research interfaces and in the case of Wikipedia and chat also downloadable from the IDS repository and download server, respectively.
Since 2013 representatives of several French and German CMC corpus projects have developed three customizations of the TEI-P5 standard for text encoding in order to adapt the encoding schema and models provided by the TEI to the structural peculiarities of CMC discourse. Based on the three schema versions, a 4th version has been created which takes into account the experiences from encoding our corpora and which is specifically designed for the submission of a feature request to the TEI council. On our poster we would present the structure of this schema and its relations (commonalities and differences) to the previous schemas.
In this Paper, we describe a schema and models which have been developed for the representation of corpora of computer-mediated communicatin (CMC corpora) using the representation framework provided by the Text Encoding Initiative (TEI). We characterise CMC discourse as dialogic, sequentially organised interchange between humans and point out that many features of CMC are not adequately handled by current corpus encoding schemas and tools. We formulate desiderata for a representation of CMC in encoding schemes and argue why the TEI is a suitable framework for the encoding of CMC corpora. We propose a model of basic CMC units (utterances, posts, and nonverbal activities) and the macro- and micro-level structures of interactions in CMC environments. Based on these models, we introduce CMC-core, a TEI customisation for the encoding of CMC corpora, which defines CMC-specific encoding features on the four levels of elements, model classes, attribute classes, and modules of the TEI infrastructure. The description of our customisation is illustrated by encoding examples from corpora by researchers of the TEI SIG CMC, representing a variety of CMC genres, i.e. chat, wiki talk, twitter, blog, and Second Life interactions. The material described, i.e. schemata, encoding examples, and documentation, is available from the of the TEI CMC SIG Wiki and will accompany a feature request to the TEI council in late 2019.
The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.
Maximizing the potential of very large corpora: 50 years of big language data at IDS Mannheim
(2014)
Very large corpora have been built and used at the IDS since its foundation in 1964. They have been made available on the Internet since the beginning of the 90’s to currently over 30,000 researchers worldwide. The Institute provides the largest archive of written German (Deutsches Referenzkorpus, DeReKe) which has recently been extended to 24 billion words. DeReKe has been managed and analysed by engines known as COSMAS and afterwards COSMAS II, which is currently being replaced by a new, scalable analysis platform called KorAP. KorAP makes it possible to manage and analyse texts that are accompanied by multiple, potentially conflicting, grammatical and structural annotation layers, and is able to handle resources that are distributed across different, and possibly geographically distant, storage systems. The majority of texts in DeReKe are not licensed for free redistribution, hence, the COSMAS and KorAP systems offer technical solutions to facilitate research on very large corpora that are not available (and not suitable) for download. For the new KorAP system, it is also planned to provide sandboxed environments to support non-remote-API access “near the data” through which users can run their own analysis programs.
Machine learning methods offer a great potential to automatically investigate large amounts of data in the humanities. Our contribution to the workshop reports about ongoing work in the BMBF project KobRA (http://www.kobra.tu-dortmund.de) where we apply machine learning methods to the analysis of big corpora in language-focused research of computer-mediated communication (CMC). At the workshop, we will discuss first results from training a Support Vector Machine (SVM) for the classification of selected linguistic features in talk pages of the German Wikipedia corpus in DeReKo provided by the IDS Mannheim. We will investigate different representations of the data to integrate complex syntactic and semantic information for the SVM. The results shall foster both corpus-based research of CMC and the annotation of linguistic features in CMC corpora.
In this paper, we discuss an efficient method of (semi-automatic) neologism detection for German and its application for the production of a dictionary of neologisms, focusing on the lexicographic process. By monitoring the language via editorial (print and online) media evaluation and interpreting the findings on the basis of lexicographic competence, many, but not all neologisms can be identified which qualify for inclusion in the Neologismenworterbuch (2006-today) at the Institute for the German Language in Mannheim (IDS). In addition, an automated corpus linguistic method offers neologism candidates based on a systematic analysis of large amounts of text to lexicographers. We explain the principles of the corpus linguistic compilation of a list of candidates and show how lexicographers work with the results, combining them with their own findings in order to continuously enlarge this specialized online dictionary of new words in German.
The International Comparable Corpus (ICC) (Kirk/Čermáková 2017; Čermáková et al. 2021) is an open initiative which aims to improve the empirical basis for contrastive linguistics by compiling comparable corpora for many languages and making them as freely available as possible as well as providing tools with which they can easily be queried and analysed. In this contribution we present the first release of written language parts of the ICC which includes corpora for Chinese, Czech, English, German, Irish (partly), and Norwegian. Each of the released corpora contains 400k words distributed over 14 different text categories according to the ICC specifications. Our poster covers the design basics of the ICC, its TEI encoding, a demonstration of using the ICC via different query tools, and an outlook on future plans.
Similar to the European Reference Corpus EuReCo (Kupietz et al. 2020), ICC follows the approach of reusing existing linguistic resources wherever possible in order to cover as many languages as possible with realistic effort in as short a time as possible. In contrast to EuReCo, however, comparable corpus pairs are not defined dynamically in the usage phase, but the compositions of the corpora are fixed in the ICC design. The approaches are thus complementary in this respect. The design principles and composition of the ICC are based on those of the International Corpus of English (ICE) (Greenbaum (ed.) 1996), with the deviation that the ICC includes the additional text category blog post and excludes spoken legal texts (see Čermáková et al. 2021 for details). ICC’s fixed-design approach has the advantage that all single-language corpora in the ICC have the same composition with respect to the selected text types and that this guarantees that the selected broad spectrum of potential influencing variables for linguistic variation is always represented. The disadvantage, however, is that this can only be achieved for quite small corpora and that the generalisability of comparative findings based on the ICC corpora will often need to be checked on larger monolingual corpora or translation corpora (Čermáková/Ebeling/Oksefjell Ebeling forthcoming). Arguing that such issues with comparability and representativeness are inevitable, in one way or the other, and need to be dealt with, our poster will discuss and exemplify the text selections in more detail.
This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future research interface. Due to the recent acquisition of new licenses, DeReKo has grown by a factor of four in the first half of 2014, mostly in the area of newspaper text, and presently contains over 24 billion word tokens. Other strata, like fictional texts, web corpora, in particular CMC texts, and spoken but conceptually written texts have also increased significantly. We report on the newly acquired corpora that led to the major increase, on the principles and strategies behind our corpus acquisition activities, and on our solutions for the emerging legal, organisational, and technical challenges.
This paper analyses reply relations in computer-mediated communication (CMC), which occur between post units in CMC interactions and which describe references between posts. We take a look at existing practices in the description and annotation of such relations in chat, wiki talk, and blog corpora. We distinguish technical reply structures, indentation structures, and interpretative reply relations, which include reply relations induced by linguistic markers. We sort out the different levels of description and annotation that are involved and propose a solution for their combined representation within the TEI annotation framework.
This paper analyses reply relations in computer-mediated communication (CMC), which occur between post units in CMC interactions and which describe references between posts. We take a look at existing practices in the description and annotation of such relations in chat, wiki talk, and blog corpora. We distinguish technical reply structures, indentation structures, and interpretative reply relations, which include reply relations induced by linguistic markers. We sort out the different levels of description and annotation that are involved and propose a solution for their combined representation within the TEI annotation framework.
This paper presents an extended annotation and analysis of interpretative reply relations focusing on a comparison of reply relation types and targets between conflictual pages and neutral pages of German Wikipedia (WP) talk pages. We briefly present the different categories identified for interpretative reply relations to analyze the relationship between WP postings as well as linguistic cues for each category. We investigate referencing strategies of WP authors in discussion page postings, illustrated by means of reply relation types and targets taking into account the degree of disagreement displayed on a WP talk page. We provide richly annotated data that can be used for further analyses such as the identification of interactional relations on higher levels, or for training tasks in machine learning algorithms.
This paper discusses current trends in DeReKo, the German Reference Corpus, concerning legal issues around the recent German copyright reform with positive implications for corpus building and corpus linguistics in general, recent corpus extensions in the genres of popular magazines, journals, historical texts, and web-based football reports. Besides, DeReKo is finally accessible via the new
corpus research platform KorAP, offering registered users several news features in comparison with its predecessor COSMAS II.
When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state of the art tokenization tools for German – including our own – with regard to theses criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.
When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-ofthe-art tokenization tools for German – including our own – with regard to theses criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.
This paper presents types and annotation layers of reply relations in computer- mediated communication (CMC). Reply relations hold between post units in CMC interactions and describe references from one given post to a previous post. We classify three types of reply relations in CMC interactions: first, technical replies, i. e. the possibility to reply directly to a previous post by clicking a ‘reply’ button; second, indentations, e. g. in wiki talk pages in which users insert their contributions in the existing talk page by indenting them and third, interpretative reply relations, i. e. the reply action is not realised formally but signalled by other structural or linguistics means such as address markers ‘@’, greetings, citations and/or Q-A structures. We take a look at existing practices in the description and representation of such relations in corpora and examples of chat, Wikipedia talk pages, Twitter and blogs. We then provide an annotation proposal that combines the different levels of description and representation of reply relations and which adheres to the schemas and practices for encoding CMC corpus documents within the TEI framework as defined by the TEI CMC SIG. It constitutes a prerequisite for correctly identifying higher levels of interactional relations such as dialogue acts or discussion trees.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition we report about advancements in the EuReCo- and ICC-initiatives with the provision of comparable corpora, and about recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.