Conversation is usually considered to be grammatically simple, while academic writing is often claimed to be structurally complex, associated primarily with a greater use of dependent clauses. Our goal in the present paper is to challenge these stereotypes, based on the results of large-scale corpus investigations. We argue that both conversation and professional academic writing are grammatically complex but that their complexities are dramatically different. Surprisingly, the traditional view that complexity is realized through extensive clausal embedding leads to the conclusion that conversation is more complex than academic writing. In contrast, written academic discourse is actually much more ‘compressed’ than elaborated, and the complexities of academic writing are realized mostly as phrasal embedding rather than embedded clauses.
Since Lerner coined the notion of delayed completion in 1989, this recurrent social practice of continuing one’s speaking turn while disregarding an intermediate co-participant’s utterance has not been investigated with regard to embodied displays and actions. A sequential approach to videotaped mundane conversations in German will explain the occurrence and use of delayed completions. First, especially in multi-party and multi-activity settings, delayed completions can result from reduced monitoring and coordinating activities. Second, recipients can use intra-turn response slots for more extended responsive actions than the current speaker initially projected, leading to delayed completion sequences. Finally, delayed completions are used for blocking possibly misaligned co-participant actions. The investigation of visible action illustrates that delayed completions are a basic practice for retrospectively managing co-participant response slots.
The use of digital resources and tools across humanities disciplines is steadily increasing, giving rise to new research paradigms and associated methods that are commonly subsumed under the term digital humanities. Digital humanities does not constitute a new discipline in itself, but rather a new approach to humanities research that cuts across different existing humanities disciplines. While digital humanities extends well beyond language-based research, textual resources and spoken language materials play a central role in most humanities disciplines.
In this paper, we present an overview of freely available web applications providing online access to spoken language corpora. We explore and discuss various solutions with which the corpus providers and corpus platform developers address the needs of researchers who are working with spoken language. The paper aims to contribute to the long-overdue exchange and discussion of methods and best practices in the design of online access to spoken language corpora.
Older adults are often exposed to elderspeak, a specialized speech register linked with negative outcomes. However, previous research has mainly been conducted in nursing homes without considering multiple contextual conditions. Based on a novel contextually-driven framework, we examined elderspeak in an acute general versus geriatric German hospital setting. Individual-level information such as cognitive impairment (CI) and audio-recorded data from care interactions between 105 older patients (M = 83.2 years; 49% with severe CI) and 34 registered nurses (M = 38.9 years) were assessed. Psycholinguistic analyses were based on manual coding (k = .85 to k = .97) and computer-assisted procedures. First, diminutives (61%), collective pronouns (70%), and tag questions (97%) were detected. Second, patients’ functional impairment emerged as an important factor for elderspeak. Our study suggests that functional impairment may be a more salient trigger of stereotype activation than CI and that elderspeak deserves more attention in acute hospital settings.
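The agreement values reported for the manual coding can be illustrated with a small sketch. It assumes that the reported k values are Cohen's kappa for two coders, which the abstract does not state explicitly; the function name and the toy labels are hypothetical and not taken from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(coder_a: Sequence[Hashable], coder_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two coders who labelled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's label distribution.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented toy codings (e.g. "does this utterance contain a diminutive?").
coder_a = ["y", "y", "n", "n"]
coder_b = ["y", "y", "n", "y"]
print(cohens_kappa(coder_a, coder_b))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage when categories are unevenly distributed.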
Lean syntax: how argument structure is adapted to its interactive, material, and temporal ecology
(2020)
It has often been argued that argument structure in spoken discourse is less complex than in written discourse. This paper argues that lean argument structure, in particular, argument omission, gives evidence of how the production and understanding of linguistic structures is adapted to the interactive, material, and temporal ecology of talk-in-interaction. It is shown how lean argument structure builds on participants' ongoing bodily conduct, joint perceptual salience, joint attention, and their orientation to expectable next actions within a joint project. The phenomena discussed in this paper are verb-derived discourse markers and tags, analepsis in responsive actions, and ellipsis in first actions, such as requests and instructions. The study draws from transcripts and audio- and video-recordings of naturally occurring interaction in German from the Research and Teaching Corpus of Spoken German (FOLK).
This paper describes the development of a systematic approach to the creation, management and curation of linguistic resources, particularly spoken language corpora. It also presents first steps towards a framework for continuous quality control to be used within external research projects by non-technical users, and discusses various domain- and discipline-specific problems and individual solutions. The creation of spoken language corpora is not only a time-consuming and costly process, but the created resources often represent intangible cultural heritage, containing recordings of, for example, extinct languages or historical events. Since high quality resources are needed to enable re-use in as many future contexts as possible, researchers need to be provided with the necessary means for quality control. We believe that this includes methods and tools adapted to Humanities researchers as non-technical users, and that these methods and tools need to be developed to support existing tasks and goals of research projects.
This paper studies practices of indexing discrepant assumptions accomplished by turn-constructional units with ich dachte ('I thought') in German talk-in-interaction. Building on the analysis of 141 instances from the corpus FOLK, we identify three sequential environments in which ich dachte is used to index that an assumption which a speaker (has) held contrasts with some other, contextually salient assumption. We show that practices which have been studied for English I thought are also routinely used in German: ich dachte is a means to manage epistemic incongruencies and to contrast an incorrect with a correct assumption in narratives. In addition, ich dachte is also used to account for the speaker's own prior actions which may have looked problematic because they built on misunderstandings which the speaker only discovered later. Moreover, ich dachte-practices may also be used to create comic effects by reporting an earlier, absurd assumption. The practices are discussed with regard to their role in regaining common ground, in managing relationships, in maintaining the identity of a rational actor, and in terms of their exploitation for other conversational interests. Special attention is paid to how co-occurring linguistic features, and sequential and pragmatic factors, account for local interpretations of ich dachte.
The newest generation of speech technology has caused a huge increase in audio-visual data that is now enhanced with orthographic transcripts, as in automatic subtitling on online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option for making that data accessible. This paper tests the usability of a state-of-the-art ASR system on a historical (from the 1960s) but regionally balanced corpus of spoken German, and on a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR system, with higher recognition scores for the north of Germany and lower scores for the south. A detailed analysis of the narrow region data revealed, despite relatively high ASR confidence, some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. correcting transcripts or transcribing from scratch. Such geography-dependent analyses also have the potential to support targeted data selection for training and adaptation in ASR development and to increase the sensitivity towards varieties of pluricentric languages.
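A region-wise comparison of recognition quality of the kind described above could be sketched as follows. The abstract speaks only of "recognition scores"; this sketch assumes word error rate (WER) as the metric, and the function names, region labels and sample transcripts are invented for illustration and are not data from the study.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_region(samples: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Average WER per region; samples are (region, reference, hypothesis)."""
    per_region: Dict[str, List[float]] = defaultdict(list)
    for region, reference, hypothesis in samples:
        per_region[region].append(word_error_rate(reference, hypothesis))
    return {region: sum(v) / len(v) for region, v in per_region.items()}


# Invented toy samples, only to show the call pattern.
samples = [
    ("region_a", "das ist ein haus", "das ist ein haus"),
    ("region_b", "das ist ein haus", "des is a haus"),
]
print(mean_wer_by_region(samples))
```

Grouping the per-recording scores by region before averaging is what makes a geographic bias visible; a single corpus-wide average would hide it.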
We present a fine-grained NER annotation scheme with 30 labels and apply it to German data. Building on the OntoNotes 5.0 NER inventory, our scheme is adapted for a corpus of transcripts of biographic interviews by adding categories for AGE and LAN(guage) and also adding label classes for various numeric and temporal expressions. Applying the scheme to the spoken data as well as a collection of teaser tweets from newspaper sites, we can confirm its generality for both domains, also achieving good inter-annotator agreement. We also show empirically how our inventory relates to the well-established 4-category NER inventory by re-annotating a subset of the GermEval 2014 NER coarse-grained dataset with our fine label inventory. Finally, we use a BERT-based system to establish some baselines for NER tagging on our two new datasets. Global results in in-domain testing are quite high on the two datasets, near what was achieved for the coarse inventory on the CoNLL 2003 data. Cross-domain testing produces much lower results due to the severe domain differences.
This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. Segmenting spoken language into sentence-like units is a challenging task, due to disfluencies, ungrammatical or fragmented structures and the lack of punctuation. In addition, one of the main bottlenecks for many NLP applications for spoken language is the small size of the training data, as the transcription and annotation of spoken language is by far more time-consuming and labour-intensive than processing written language. We therefore investigate the benefits of data expansion and transfer learning and test different ML architectures for this task. Our results show that data expansion is not straightforward and even data from the same domain does not always improve results. They also highlight the importance of modelling, i.e. of finding the best architecture and data representation for the task at hand. For the detection of boundaries in spoken language transcripts, we achieve a substantial improvement when framing the boundary detection problem as a sentence pair classification task, as compared to a sequence tagging approach.
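The difference between the two problem framings mentioned above can be made concrete with a small data-preparation sketch. It is only an illustration under assumptions: the tag names, the context window size and the example utterance are hypothetical, and the paper's actual data representation may differ.

```python
from typing import List, Set, Tuple


def to_sequence_tags(tokens: List[str], boundaries: Set[int]) -> List[str]:
    """Sequence tagging view: one label per token.

    A token is tagged 'B' if a sentence boundary follows it, else 'O'.
    """
    return ["B" if i in boundaries else "O" for i in range(len(tokens))]


def to_pair_examples(
    tokens: List[str], boundaries: Set[int], window: int = 3
) -> List[Tuple[str, str, int]]:
    """Pair classification view: one example per gap between two tokens.

    Each example is (left context, right context, label), where label is 1
    if the gap is a sentence boundary and 0 otherwise.
    """
    examples = []
    for i in range(len(tokens) - 1):
        left = " ".join(tokens[max(0, i - window + 1): i + 1])
        right = " ".join(tokens[i + 1: i + 1 + window])
        examples.append((left, right, int(i in boundaries)))
    return examples


# Invented, unpunctuated toy transcript; boundaries follow tokens 1 and 3.
tokens = "ja genau das stimmt ich komme morgen".split()
boundaries = {1, 3}

print(to_sequence_tags(tokens, boundaries))
for example in to_pair_examples(tokens, boundaries):
    print(example)
```

In the pair view, each candidate gap becomes an independent two-segment input, which is the shape expected by classifiers trained on sentence pairs, whereas the tagging view requires one model pass to label the whole token sequence.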
This paper discusses the technological and methodological challenges in creating and sharing HAMATAC, the Hamburg Map Task Corpus. The first version of the corpus, consisting of 24 recordings with orthographic transcriptions and metadata, is publicly available. A second version featuring different types of linguistic annotation is in progress. I will describe how the various software tools and data formats of the EXMARaLDA system were used for transcription and multi-level annotation, to compile recordings and transcriptions into a corpus and manage metadata, to publish the corpus, and how they can be used for carrying out corpus queries (KWIC) and analyses. Some recurrent issues in corpus building and sharing and the interaction of technological and methodological aspects will be illustrated using HAMATAC.
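A keyword-in-context (KWIC) query of the kind mentioned above can be sketched in a few lines. This is a generic illustration in plain Python, not the EXMARaLDA query tooling or its API; the function name, the context width and the example utterance are hypothetical.

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, hit, right context) for each match of keyword.

    Matching is case-insensitive on whole tokens; width is the number of
    context tokens shown on each side.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented toy utterance in the style of a map task instruction.
tokens = "go left then go right past the bridge".split()
for left, hit, right in kwic(tokens, "go", width=2):
    print(f"{left:>12} | {hit} | {right}")
```

Aligning the hits in a fixed column, as the print statement does, is what makes recurring patterns to the left and right of the keyword easy to scan.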
We present web services implementing a workflow for transcripts of spoken language following TEI guidelines, in particular ISO 24624:2016 "Language resource management - Transcription of spoken language". The web services are available at our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.
This paper investigates emergent pseudo-coordination in spoken German. In a corpus-based study, seven verbs in the first conjunct are analyzed regarding the degree of semantic bleaching and the development of subjective or aspectual meaning components. Moreover, it is shown that each verb shows distinct tendencies for co-occurrences, especially with deictic adverbs in the first conjunct and with specific verbs and verb classes in the second conjunct. It is argued that pseudo-coordination is originally motivated by the need for ‘chunking’ in unplanned speech and that it is still prominently used in this function in German, in contrast to languages in which pseudo-coordination is grammaticalized further.
Response particles manage intersubjectivity. This conversation analytic study describes German eben (“exactly”). With eben, speaker A locally agrees with the immediately prior turn of B (the “confirmable”) and establishes a second indexical link: A relates B’s confirmable to a position A herself had already displayed (the “anchor”). Through claiming temporal priority, eben speakers treat a just-formulated position as self-evident and mark independence. Further evidence for the three-part structure “anchor-confirmable-eben” that eben sets in motion retrospectively comes from instances where eben speakers supply a missing/opaque anchor via a postpositioned display of independent access. Data are in German with English translation.
The lexicography of German
(2020)
This chapter discusses the main dictionaries of the German language as it is spoken and written in Germany, and also German as it is spoken and written in Austria, Switzerland, the eastern fringes of Belgium, and South Tyrol. It also briefly describes Pennsylvania German. Corpora and other language resources used in German dictionary-making are also presented. Finally, there is a discussion of some current issues in German lexicography, as well as future prospects.