Refine
Year of publication
- 2014 (50) (remove)
Document Type
- Conference Proceeding (24)
- Part of a Book (14)
- Article (7)
- Book (2)
- Other (1)
- Preprint (1)
- Working Paper (1)
Keywords
- Korpus <Linguistik> (50) (remove)
Publicationstate
- Veröffentlichungsversion (20)
- Postprint (1)
- Zweitveröffentlichung (1)
Reviewstate
- (Verlags)-Lektorat (15)
- Peer-Review (6)
Publisher
- European Language Resources Association (ELRA) (9)
- Institut für Deutsche Sprache (8)
- ELRA (2)
- European Language Resources Association (2)
- de Gruyter (2)
- ACL (1)
- Association for Computational Linguistics (1)
- Association for Computational Linguistics and Dublin City University (1)
- Benjamins (1)
- Bloomsbury (1)
Das Beispiel ist seit der Antike ein zentraler Gegenstand der abendländischen Diskussion. In dieser ersten umfassenden Monographie zur Linguistik des Beispiels wird deshalb eine interdisziplinäre Perspektive entfaltet, in der Ansätze aus Rhetorik, Philosophie, Pädagogik und Psychologie sowie linguistischen Ansätze zur Beispielforschung behandelt werden. Die sprachwissenschaftliche Beschäftigung mit Beispielen blieb bisher jedoch ein Randphänomen, obwohl Praktiken der Beispielverwendung in der Alltagskommunikation allgegenwärtig sind.
Orientiert an ›grounded theory‹, linguistischer Hermeneutik und Handlungssemantik wird hier ein Beispielbegriff erarbeitet, demzufolge das Beispielverwenden eine komplexe Form sprachlichen Handelns und eine fundamentale menschliche Denkbewegung darstellt, die darin besteht, einen Konnex zwischen Besonderem und Allgemeinem zu konstituieren. Hierauf basierend werden Beispiele anhand eines umfangreichen Korpus von Gesprächsdaten analysiert und kommunikative Muster, sprachliche Realisierungsformen sowie Funktionen des Beispielverwendens in der Interaktion herausgearbeitet.
This chapter focuses on the way in which co-present parties in meetings manage language choice and treat it as raising problems of participation - in the sense that participants can orient to the fact that a given language choice may increase or diminish participation for some or all co-present group members. Choosing one language rather than another is approached here as a members' problem (in an ethnomethodological sense), and as a decision the participants make themselves, in situ and within their courses of action, displaying the way in which they orient to its local consequences, and how they justify and legitimize it. In order to explore this link between language choice and participation systematically, in this chapter we focus on a particular and recurrent phenomenon, the announcement of a language change. Within the conversation analysis framework, we analyse these announcements by taking into account the sequential position in which they occur, their format, the way in which they are addressed to a sub-group or to the group as a whole, and the specific action they accomplish. We will also look at how the group receives the announcement, its effects on the participation framework, as well as the categorizations that ensue from it. This chapter therefore highlights the mutual configuration between language choice and participation framework. Our analyses are based on several video- and audio-recorded corpora of international work meetings. These video data call for reflection not only on the linguistic dimension of participation frameworks and language switches, but more broadly on their multimodal organization. This chapter shows that multimodal details are crucial if we aim to understand the relation between multilingualism and participation as occasioned, contingent and emergent dynamics.
Annotating Spoken Language
(2014)
Vernetzung statt Vereinheitlichung. Digitale Forschungsinfrastrukturen in den Geisteswissenschaften
(2014)
Die Entwicklung der digitalen Infrastruktur am Hamburger Zentrum für Sprachkorpora (HZSK) kann als Beispiel für die Evolution individueller technischer Einzellösungen hin zu fachspezifischen virtuellen Arbeits- und Forschungsumgebungen, die im Rahmen supranationaler Forschungsinfrastrukturen für die digitalen Geisteswissenschaften miteinander vernetzt sind, angesehen werden. Im Fokus steht im konkreten Fall des HZSK die Sicherung der langfristigen Zugänglichkeit von Forschungsdaten (multimedialen Daten gesprochener Sprache) durch die Entwicklung einer virtuellen Forschungsumgebung, die einerseits an die zentrenbasierte Forschungsinfrastruktur CLARIN-D angebunden ist und andererseits fachspezifische Benutzerschnittstellen schafft.
Automatic Food Categorization from Large Unlabeled Corpora and Its Impact on Relation Extraction
(2014)
We present a weakly-supervised induction method to assign semantic information to food items. We consider two tasks of categorizations being food-type classification and the distinction of whether a food item is composite or not. The categorizations are induced by a graph-based algorithm applied on a large unlabeled domain-specific corpus. We show that the usage of a domain-specific corpus is vital. We do not only outperform a manually designed open-domain ontology but also prove the usefulness of these categorizations in relation extraction, outperforming state-of-the-art features that include syntactic information and Brown clustering.
The annotation of parts of speech (POS) in linguistically annotated corpora is a fundamental annotation layer which provides the basis for further syntactic analyses, and many NLP tools rely on POS information as input. However, most POS annotation schemes have been developed with written (newspaper) text in mind and thus do not carry over well to text from other domains and genres. Recent discussions have concentrated on the shortcomings of present POS annotation schemes with regard to their applicability to data from domains other than newspaper text.
Lexikonstatistik 2.0
(2014)
In der Mitte des 20. Jahrhunderts gab es diverse Versuche, die Klassifikation von Sprachen mit Hilfe von Wortlisten, die dem Grundvokabular der betreffenden Sprachen entnommen sind, zu automatisieren. Diese Methoden wurden und werden in der historischen Sprachwissenschaft gemeinhin kritisch diskutiert, da sich die erzielten Ergebnisse häufig als fehlerhaft erwiesen.
In den letzten Jahren erleben wir einen neuen Aufschwung lexikostatistischer und glottochronologischer Ansätze. Deren Erfolgsaussichten sind heute wesentlich besser als vor einem halben Jahrhundert, da uns jetzt große Mengen an sprachvergleichenden Daten in elektronischer Form zur Verfügung stehen und die Computerlinguistik und Bioinformatik mächtige Werkzeuge bereitstellt, diese Daten statistisch auszuwerten.
Im vorliegenden Artikel wird eine Fallstudie vorgestellt, die das Potenzial lexikostatistischer Methoden im 21. Jahrhundert illustriert.
Recent work on error detection has shown that the quality of manually annotated corpora can be substantially improved by applying consistency checks to the data and automatically identifying incorrectly labelled instances. These methods, however, can not be used for automatically annotated corpora where errors are systematic and cannot easily be identified by looking at the variance in the data. This paper targets the detection of POS errors in automatically annotated corpora, so-called silver standards, showing that by combining different measures sensitive to annotation quality we can identify a large part of the errors and obtain a substantial increase in accuracy.
We discovered several recurring errors in the current version of the Europarl Corpus originating both from the web site of the European Parliament and the corpus compilation based thereon. The most frequent error was incompletely extracted metadata leaving non-textual fragments within the textual parts of the corpus files. This is, on average, the case for every second speaker change. We not only cleaned the Europarl Corpus by correcting several kinds of errors, but also aligned the speakers’ contributions of all available languages and compiled every- thing into a new XML-structured corpus. This facilitates a more sophisticated selection of data, e.g. querying the corpus for speeches by speakers of a particular political group or in particular language combinations.
Designing a Bilingual Speech Corpus for French and German Language Learners: a Two-Step Process
(2014)
We present the design of a corpus of native and non-native speech for the language pair French-German, with a special emphasis on phonetic and prosodic aspects. To our knowledge there is no suitable corpus, in terms of size and coverage, currently available for the target language pair. To select the target L1-L2 interference phenomena we prepare a small preliminary corpus (corpus1), which is analyzed for coverage and cross-checked jointly by French and German experts. Based on this analysis, target phenomena on the phonetic and phonological level are selected on the basis of the expected degree of deviation from the native performance and the frequency of occurrence. 14 speakers performed both L2 (either French or German) and L1 material (either German or French). This allowed us to test, recordings duration, recordings material, the performance of our automatic aligner software. Then, we built corpus2 taking into account what we learned about corpus1. The aims are the same but we adapted speech material to avoid too long recording sessions. 100 speakers will be recorded. The corpus (corpus1 and corpus2) will be prepared as a searchable database, available for the scientific community after completion of the project.