(Best) Practices for Annotating and Representing CMC and Social Media Corpora in CLARIN-D (2016)

Beißwenger, Michael ; Ehrhardt, Eric ; Herold, Axel ; Lüngen, Harald ; Storrer, Angelika

The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.

A text-technological approach to automatic discourse analysis of complex texts (2006)

Hilbert, Mirco ; Lobin, Henning ; Bärenfänger, Maja ; Lüngen, Harald ; Puskás, Csilla

This paper describes the development of a relational discourse parsing architecture for text documents of a complex text type, namely scientific articles. To achieve this goal, several different linguistic knowledge sources and auxiliary analyses on different linguistic levels are necessary.

Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D (2016)

Beißwenger, Michael ; Ehrhardt, Eric ; Herold, Axel ; Lüngen, Harald ; Storrer, Angelika

The paper presents results from a curation project within CLARIN-D, in which an existing 1M word corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format.

Das Deutsche Referenzkorpus DEREKO im Jubiläumsjahr 2014 (2014)

Lüngen, Harald ; Kupietz, Marc

Das Dortmunder Chat-Korpus in CLARIN-D: Modellierung und Mehrwerte (2016)

Beißwenger, Michael ; Herold, Axel ; Lüngen, Harald ; Storrer, Angelika

Demonstration des SemDok-Textparsers (2008)

Hilbert, Mirco ; Lüngen, Harald ; Bärenfänger, Maja ; Lobin, Henning

In subproject CI “SemDok” of the DFG research group Texttechnologische Informationsmodellierung (Text-Technological Information Modelling), a text parser was developed for the discourse structures of scientific journal articles according to Rhetorical Structure Theory. The main conceptual and technical features of the chart parser and the resulting parametrisation options for parsing experiments are described. In addition, HPVtz, a tool for visualising parsing results (RST trees in an XML application) and for navigating within them, is presented.

Enhancing speech corpus resources with multiple lexical tag layers (2000)

Witt, Andreas ; Lüngen, Harald ; Gibbon, Dafydd

We describe a general two-stage procedure for re-using a custom corpus for spoken language system development, involving a transformation from character-based markup to XML and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).

GOLD and Discourse: Domain- and Community-Specific Extensions (2005)

Goecke, Daniela ; Lüngen, Harald ; Sasaki, Felix ; Witt, Andreas ; Farrar, Scott

Igel: Comparing document grammars using XQuery (2013)

Sperberg-McQueen, Christopher M. ; Schonefeld, Oliver ; Kupietz, Marc ; Lüngen, Harald ; Witt, Andreas

Igel is a small XQuery-based web application for examining a collection of document grammars; in particular, for comparing related document grammars to get a better overview of their differences and similarities. In its initial form, Igel reads only DTDs and provides only simple lists of constructs in them (elements, attributes, notations, parameter entities). Our continuing work is aimed at making Igel provide more sophisticated and useful information about document grammars and building the application into a useful tool for the analysis (and the maintenance!) of families of related document grammars.

Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN (2016)

Lüngen, Harald ; Beißwenger, Michael ; Ehrhardt, Eric ; Herold, Axel ; Storrer, Angelika

We introduce our pipeline to integrate CMC and SM corpora into the CLARIN-D corpus infrastructure. The pipeline was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource conforming to current technical and legal standards. We describe how the resource has been prepared and restructured in terms of TEI encoding, linguistic annotations, and anonymisation. The output is a CLARIN-conformant resource integrated in the CLARIN-D research infrastructure.

Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects (2016)

Beißwenger, Michael ; Chanier, Thierry ; Chiari, Isabella ; Erjavec, Tomaž ; Fišer, Darja ; Herold, Axel ; Ljubešić, Nikola ; Lüngen, Harald ; Poudat, Céline ; Stemle, Egon W. ; Storrer, Angelika ; Wigham, Ciara

The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.

Linguistische Annotationen für die Analyse von Gliederungsstrukturen wissenschaftlicher Texte (2012)

Lüngen, Harald ; Hebborn, Mariana

Methods for the semantic analysis of document markup (2003)

Bayerl, Petra Saskia ; Lüngen, Harald ; Goecke, Daniela ; Witt, Andreas ; Naber, Daniel

We present an approach for investigating what kind of semantic information is regularly associated with the structural markup of scientific articles. This approach addresses the need for an explicit formal description of the semantics of text-oriented XML documents. The domain of our investigation is a corpus of scientific articles in psychology and linguistics from English and German journals available online. For our analyses, we provide XML markup representing two semantic levels: the thematic level (i.e. topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles’ document structure, which is also represented in XML. Articles have been annotated with the appropriate information. Each of the three informational levels is modelled in a separate XML document, since in our domain the different description levels might conflict, making it impossible to model them within a single XML document. For comparing and mining the resulting multi-layered XML annotations of one article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses.

Multi-Dimensional Markup: N-way relations as a generalisation over possible relations between annotation layers (2008)

Lüngen, Harald ; Witt, Andreas

Sprachressourcen in der Lehre – Erfahrungen, Einsatzszenarien, Nutzerwünsche (2011)

Binder, Frank ; Lüngen, Harald ; Lobin, Henning

The FAIR Index of CMC Corpora (2020)

Frey, Jennifer-Carmen ; König, Alexander ; Stemle, Egon ; Falaise, Achille ; Fišer, Darja ; Lüngen, Harald

In this article, we examine the current situation of data dissemination and provision for CMC corpora. In doing so, we aim to provide a guiding grid for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data management, we evaluate the 20 European CMC corpora listed in the CLARIN CMC Resource family, identify successful strategies among the existing corpora and establish best practices for future projects. We give an overview of existing approaches to data referencing, dissemination and provision in European CMC corpora, and discuss the methods, formats and strategies used. Furthermore, we discuss the need for community standards and offer recommendations for best practices when creating a new CMC corpus.

The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls (2009)

Belica, Cyril ; Kupietz, Marc ; Witt, Andreas ; Lüngen, Harald

The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox' Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software for which licenses are available free of charge to academic institutions, e.g. Helmut Schmid's Tree Tagger. The second part focuses on the linguistic interpretability of the corpus annotations and more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that, unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory- and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher using the annotations blindly will run the risk of not actually studying the language but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology.

Unification of XML Documents with Concurrent Markup (2004)

Witt, Andreas ; Lüngen, Harald ; Sasaki, Felix ; Goecke, Daniela

Zur Erstellung und Interpretation der Zeitverlaufsgrafiken (2015)

Lüngen, Harald ; Keibel, Holger

Zur Erstellung und Interpretation der Zeitverlaufsgrafiken (2013)

Lüngen, Harald ; Keibel, Holger

Zur Erstellung und Interpretation der Zeitverlaufsgrafiken (2014)

Lüngen, Harald ; Keibel, Holger