Refine
Year of publication
- 2014 (25) (remove)
Document Type
- Conference Proceeding (12)
- Part of a Book (8)
- Article (4)
- Other (1)
Is part of the Bibliography
- no (25) (remove)
Keywords
- Korpus <Linguistik> (25) (remove)
Publicationstate
- Veröffentlichungsversion (15)
- Postprint (1)
Reviewstate
- (Verlags)-Lektorat (9)
- Peer-Review (4)
Publisher
This chapter focuses on the way in which co-present parties in meetings manage language choice and treat it as raising problems of participation - in the sense that participants can orient to the fact that a given language choice may increase or diminish participation for some or all co-present group members. Choosing one language rather than another is approached here as a members' problem (in an ethnomethodological sense), and as a decision the participants make themselves, in situ and within their courses of action, displaying the way in which they orient to its local consequences, and how they justify and legitimize it. In order to explore this link between language choice and participation systematically, in this chapter we focus on a particular and recurrent phenomenon, the announcement of a language change. Within the conversation analysis framework, we analyse these announcements by taking into account the sequential position in which they occur, their format, the way in which they are addressed to a sub-group or to the group as a whole, and the specific action they accomplish. We will also look at how the group receives the announcement, its effects on the participation framework, as well as the categorizations that ensue from it. This chapter therefore highlights the mutual configuration between language choice and participation framework. Our analyses are based on several video- and audio-recorded corpora of international work meetings. These video data call for reflection not only on the linguistic dimension of participation frameworks and language switches, but more broadly on their multimodal organization. This chapter shows that multimodal details are crucial if we aim to understand the relation between multilingualism and participation as occasioned, contingent and emergent dynamics.
Annotating Spoken Language
(2014)
Automatic Food Categorization from Large Unlabeled Corpora and Its Impact on Relation Extraction
(2014)
We present a weakly-supervised induction method to assign semantic information to food items. We consider two tasks of categorizations being food-type classification and the distinction of whether a food item is composite or not. The categorizations are induced by a graph-based algorithm applied on a large unlabeled domain-specific corpus. We show that the usage of a domain-specific corpus is vital. We do not only outperform a manually designed open-domain ontology but also prove the usefulness of these categorizations in relation extraction, outperforming state-of-the-art features that include syntactic information and Brown clustering.
The annotation of parts of speech (POS) in linguistically annotated corpora is a fundamental annotation layer which provides the basis for further syntactic analyses, and many NLP tools rely on POS information as input. However, most POS annotation schemes have been developed with written (newspaper) text in mind and thus do not carry over well to text from other domains and genres. Recent discussions have concentrated on the shortcomings of present POS annotation schemes with regard to their applicability to data from domains other than newspaper text.
Lexikonstatistik 2.0
(2014)
In der Mitte des 20. Jahrhunderts gab es diverse Versuche, die Klassifikation von Sprachen mit Hilfe von Wortlisten, die dem Grundvokabular der betreffenden Sprachen entnommen sind, zu automatisieren. Diese Methoden wurden und werden in der historischen Sprachwissenschaft gemeinhin kritisch diskutiert, da sich die erzielten Ergebnisse häufig als fehlerhaft erwiesen.
In den letzten Jahren erleben wir einen neuen Aufschwung lexikostatistischer und glottochronologischer Ansätze. Deren Erfolgsaussichten sind heute wesentlich besser als vor einem halben Jahrhundert, da uns jetzt große Mengen an sprachvergleichenden Daten in elektronischer Form zur Verfügung stehen und die Computerlinguistik und Bioinformatik mächtige Werkzeuge bereitstellt, diese Daten statistisch auszuwerten.
Im vorliegenden Artikel wird eine Fallstudie vorgestellt, die das Potenzial lexikostatistischer Methoden im 21. Jahrhundert illustriert.
Recent work on error detection has shown that the quality of manually annotated corpora can be substantially improved by applying consistency checks to the data and automatically identifying incorrectly labelled instances. These methods, however, can not be used for automatically annotated corpora where errors are systematic and cannot easily be identified by looking at the variance in the data. This paper targets the detection of POS errors in automatically annotated corpora, so-called silver standards, showing that by combining different measures sensitive to annotation quality we can identify a large part of the errors and obtain a substantial increase in accuracy.
We discovered several recurring errors in the current version of the Europarl Corpus originating both from the web site of the European Parliament and the corpus compilation based thereon. The most frequent error was incompletely extracted metadata leaving non-textual fragments within the textual parts of the corpus files. This is, on average, the case for every second speaker change. We not only cleaned the Europarl Corpus by correcting several kinds of errors, but also aligned the speakers’ contributions of all available languages and compiled every- thing into a new XML-structured corpus. This facilitates a more sophisticated selection of data, e.g. querying the corpus for speeches by speakers of a particular political group or in particular language combinations.
Designing a Bilingual Speech Corpus for French and German Language Learners: a Two-Step Process
(2014)
We present the design of a corpus of native and non-native speech for the language pair French-German, with a special emphasis on phonetic and prosodic aspects. To our knowledge there is no suitable corpus, in terms of size and coverage, currently available for the target language pair. To select the target L1-L2 interference phenomena we prepare a small preliminary corpus (corpus1), which is analyzed for coverage and cross-checked jointly by French and German experts. Based on this analysis, target phenomena on the phonetic and phonological level are selected on the basis of the expected degree of deviation from the native performance and the frequency of occurrence. 14 speakers performed both L2 (either French or German) and L1 material (either German or French). This allowed us to test, recordings duration, recordings material, the performance of our automatic aligner software. Then, we built corpus2 taking into account what we learned about corpus1. The aims are the same but we adapted speech material to avoid too long recording sessions. 100 speakers will be recorded. The corpus (corpus1 and corpus2) will be prepared as a searchable database, available for the scientific community after completion of the project.
Recent work suggests that concreteness and imageability play an important role in the meanings of figurative expressions. We investigate this idea in several ways. First, we try to define more precisely the context within which a figurative expression may occur, by parsing a corpus annotated for metaphor. Next, we add both concreteness and imageability as “features” to the parsed metaphor corpus, by marking up words in this corpus using a psycholinguistic database of scores for concreteness and imageability. Finally, we carry out detailed statistical analyses of the augmented version of the original metaphor corpus, cross-matching the features of concreteness and imageability with others in the corpus such as parts of speech and dependency relations, in order to investigate in detail the use of such features in predicting whether a given expression is metaphorical or not.
This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multi-ethnic urban areas in Germany. The first release of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we present our approach to using the tagger ensembles for identifying error patterns in the automatically tagged data.
We compare several different corpus- based and lexicon-based methods for the scalar ordering of adjectives. Among them, we examine for the first time a low- resource approach based on distinctive- collexeme analysis that just requires a small predefined set of adverbial modifiers. While previous work on adjective intensity mostly assumes one single scale for all adjectives, we group adjectives into different scales which is more faithful to human perception. We also apply the methods to both polar and non-polar adjectives, showing that not all methods are equally suitable for both types of adjectives.
Accurate opinion mining requires the exact identification of the source and target of an opinion. To evaluate diverse tools, the research community relies on the existence of a gold standard corpus covering this need. Since such a corpus is currently not available for German, the Interest Group on German Sentiment Analysis decided to create such a resource and make it available to the research community in the context of a shared task. In this paper, we describe the selection of textual sources, development of annotation guidelines, and first evaluation results in the creation of a gold standard corpus for the German language.
This contribution presents the newest version of our ’Wortverbindungsfelder’ (fields of multi-word expressions), an experimental lexicographic resource that focusses on aspects of MWEs that are rarely addressed in traditional descriptions: Contexts, patterns and interrelations. The MWE fields use data from a very large corpus of written German (over 6 billion word forms) and are created in a strictly corpus-based way. In addition to traditional lexicographic descriptions, they include quantitative corpus data which is structured in new ways in order to show the usage specifics. This way of looking at MWEs gives insight in the structure of language and is especially interesting for foreign language learners.
This paper introduces the Aix Map Task corpus, a corpus of audio and video recordings of task-oriented dialogues. It was modelled after the original HCRC Map Task corpus. Lexical material was designed for the analysis of speech and prosody, as described in Astésano et al. (2007). The design of the lexical material, the protocol and some basic quantitative features of the existing corpus are presented. The corpus was collected under two communicative conditions, one audio-only condition and one face-to-face condition. The recordings took place in a studio and a sound attenuated booth respectively, with head-set microphones (and in the face-to-face condition with two video cameras). The recordings have been segmented into Inter-Pausal-Units and transcribed using transcription conventions containing actual productions and canonical forms of what was said. It is made publicly available online.
Dieser Beitrag stellt das Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK) und die Datenbank für Gesprochenes Deutsch (DGD) als Instrumente gesprächsanalytischer Arbeit vor. Nach einer allgemeinen Einführung in FOLK und DGD im zweiten Abschnitt werden im dritten Abschnitt die methodischen Beziehungen zwischen Korpuslinguistik und Gesprächsforschung und die Herausforde-rungen, die sich bei der Begegnung dieser beiden Herangehensweisen an authenti-sches Sprachmaterial stellen, kurz skizziert. Der vierte Abschnitt illustriert dann ausgehend vom Beispiel der Formel ich sag mal, wie eine korpus- und datenbankgesteuerte Analyse zur Untersuchung von Gesprächsphänomenen beitragen kann.
In diesem Beitrag werden zentrale methodische Fragen der Erstellung mündlicher Sprachkorpora anhand des Mannheimer FOLK-Korpus diskutiert, teils im Hinblick auf gesprochensprachliche Korpora insgesamt, teil im Vergleich zum Leipziger GeWiss-Korpus. Bei FOLK steht keine bestimmte thematisch-institutionelle Domäne im Mittelpunkt des Korpusaufbaus, sondern das Ziel, ein ausgewogenes Korpus authentischer Gespräche unterschiedlicher Sprecher/innen in Alltag, Institutionen und Medien für eine Vielzahl von Forschungsfragen und Verwendungskontexten bereitzustellen. Der Artikel stellt das Vorgehen bei der Korpus-Akquise, die Anlage der Metadaten, den Workflow des Projekts sowie die Transkriptionskonventionen und die orthografische Normalisierung der Transkriptionen ausführlich vor und beschreibt Korpusaufbau und -stratifikation sowie die Einbindung von FOLK in die Datenbank für Gesprochenes Deutsch 2.0 des IDS.
Wikipedia is a valuable resource, useful as a lingustic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus - DeReKo). Our approach is a two-stage conversion combining parsing using the Sweble parser, and transformation using XSLT stylesheets. The conversion approach is able to successfully generate rich and valid corpora regardless of languages. We also introduce a method to segment user contributions in talk pages into postings.
This paper analyses paramedic emergency interaction as multimodal multiactivity. Based on a corpus of video-recordings of emergency drills performed by professional paramedics during advanced training, the focus is on paramedics’ participation in multiple joint projects which become simultaneously relevant. Simultaneity and fast succession of multiactivity does not only characterise work on the team level, but also the work profile of the individual paramedic. Participants have to coordinate their own participation in more than one joint project intrapersonally. In the data studied, three patterns of allocating multimodal resources stood out as routine ways of coordinating participation in two simultaneous projects intrapersonally:
1. Talk and hearing vs. manual action monitored by gaze,
2. Talk and hearing vs. gazing (and pointing),
3. Manual action vs. gaze (and talk and hearing).
Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
(2014)
We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.