P2: Mündliche Korpora
Refine
Document Type
- Conference Proceeding (20)
- Article (8)
- Part of a Book (7)
- Preprint (1)
Language
- English (36) (remove)
Has Fulltext
- yes (36)
Is part of the Bibliography
- yes (36)
Keywords
- Korpus <Linguistik> (20)
- Gesprochene Sprache (13)
- Forschungsdaten (10)
- Datenmanagement (9)
- Annotation (6)
- Deutsch (6)
- Audiovisuelles Material (5)
- Beleidigung (5)
- Automatische Spracherkennung (4)
- Beschimpfung (4)
Publicationstate
- Veröffentlichungsversion (22)
- Zweitveröffentlichung (12)
- Postprint (3)
Reviewstate
- Peer-Review (33)
- (Verlags)-Lektorat (3)
Publisher
- European Language Resources Association (8)
- Linköping University Electronic Press (5)
- Association for Computational Linguistics (4)
- CLARIN (3)
- Cambridge University Press (2)
- Digital Curation Centre (1)
- Edinburgh University Press (1)
- European Language Resources Association (ELRA) (1)
- Gesellschaft für Sprachtechnologie und Computerlinguistik (1)
- John Benjamins (1)
Speech islands are historically and developmentally unique and will inevitably disappear within the next decades. We urgently need to preserve their remains and exploit what is left in order to make research on language-in-contact and historical as well as current comparative language research possible.
The Archive for Spoken German (AGD) at the Institute for German Language collects, fosters and archives data from completed research projects and makes them available to the wider research community.
Besides large variation corpora and corpora of conversational speech, the archive already contains a range of collections of data on German speech minorities. The latter will be outlined in this chapter. Some speech island data is already made available through the personal service of the AGD, or the database of spoken German (DGD), e.g. data on Australian German, Unserdeutsch, or German in North America. Some corpora are still being prepared for publication, but still important to document for potentially interested research projects. We therefore also explain the current problems and efforts related to the curation of speech island data, from the digitization of recordings and the collection of metadata, to the integration of transcriptions, annotations and other ways of accessing and sharing data.
This paper examines multi-unit turns that allow speakers to retrospectively close the prior sequence while prospectively launching a new sequence, which Schegloff (1986) referred to as interlocking organization. Using English telephone conversations as data, we focus on how multi-unit turns are used for topic shifts, and show that interlocking organization operates in conjunction with other phonetic and lexical features, such as increased pitch and overt markers of disjunction (e.g., “listen”). In addition, speakers utilize an audible inbreath that is placed between the first and the second units as a central interactional resource to project further talk, thereby suppressing speaker transition and possibly highlighting the action delivered in the second unit as being distinctly new. We propose that interlocking multi-unit turns, when used to make topically disjunctive moves, promote progressivity by avoiding a possible lapse in turn transition
We present a collection of (currently) about 5.500 commands directed to voice-controlled virtual assistants (VAs) by sixteen initial users of a VA system in their homes. The collection comprises recordings captured by the VA itself and with a conditional voice recorder (CVR) selectively capturing recordings including the VA-directed commands plus some surrounding context. Next to a description of the collection, we present initial findings on the patterns of use of the VA systems during the first weeks after installation, including usage timing, the development of usage frequency, distributions of sentence structures across commands, and (the development of) command success rates. We discuss the advantages and disadvantages of the applied collection-specific recording approach and describe potential research questions that can be investigated in the future, based on the collection, as well as the merit of combining quantitative corpus linguistic approaches with qualitative in-depth analyses of single cases.
We address the task of distinguishing implicitly abusive sentences on identity groups (“Muslims contaminate our planet”) from other group-related negative polar sentences (“Muslims despise terrorism”). Implicitly abusive language are utterances not conveyed by abusive words (e.g. “bimbo” or “scum”). So far, the detection of such utterances could not be properly addressed since existing datasets displaying a high degree of implicit abuse are fairly biased. Following the recently-proposed strategy to solve implicit abuse by separately addressing its different subtypes, we present a new focused and less biased dataset that consists of the subtype of atomic negative sentences about identity groups. For that task, we model components that each address one facet of such implicit abuse, i.e. depiction as perpetrators, aspectual classification and non-conformist views. The approach generalizes across different identity groups and languages.
This paper presents a compositional annotation scheme to capture the clusivity properties of personal pronouns in context, that is their ability to construct and manage in-groups and out-groups by including/excluding the audience and/or non-speech act participants in reference to groups that also include the speaker. We apply and test our schema on pronoun instances in speeches taken from the German parliament. The speeches cover a time period from 2017-2021 and comprise manual annotations for 3,126 sentences. We achieve high inter-annotator agreement for our new schema, with a Cohen’s κ in the range of 89.7-93.2 and a percentage agreement of > 96%. Our exploratory analysis of in/exclusive pronoun use in the parliamentary setting provides some face validity for our new schema. Finally, we present baseline experiments for automatically predicting clusivity in political debates, with promising results for many referential constellations, yielding an overall 84.9% micro F1 for all pronouns.
This paper describes the TEI-based ISO standard 24624:2016 ‘Transcription of spoken language’ and other formats used within CLARIN for spoken language resources. It assesses the current state of support for the standard and the interoperability between these formats and with rele- vant tools and services. The main idea behind the paper is that a digital infrastructure providing language resources and services to researchers should also allow the combined use of resources and/or services from different contexts. This requires syntactic and semantic interoperability. We propose a solution based on the ISO/TEI format and describe the necessary steps for this format to work as an exchange format with basic semantic interoperability for spoken language resources across the CLARIN infrastructure and beyond.
While the role of intentions in the constitution of actions gives rise to complex and heavily controversial questions, it appears to be indisputable that action ascription in interaction mostly does without any overt ascription of intention. Yet, sometimes participants explicitly ascribe intentions to their interlocutors in order to make sense of their prior actions. The chapter examines intention ascriptions in response to a partner’s adjacent prior turn using the German modal verb construction willst du/wollen Sie (do you want). The analysis focuses on the aspect of the prior action the intention ascription addresses (action type, projected next action, motive etc.), the action the intention ascription performs itself, and the next action they make relevant from the prior speaker. It was found that intention ascriptions are used to clarify and intersubjectively ground the meaning of the prior turn, which seems otherwise underspecified, ambiguous or puzzling. Yet, they are also used to adumbrate criticism, e.g., that the prior turn projects a course of future actions which is considered to be inadequate, or to expose a concealed, problematic allegedly “real” meaning of the prior turn.
Action ascription is an emergent process of mutual displays of understanding. Usually, the kind of action that is ascribed to a prior turn by a next action remains implicit. Sometimes, however, actions are overtly ascribed, for example, when speakers expose the use of strategies. This happens particularly in conflictual interaction, such as public debates or mediation talks. In these interactional settings, one of the speakers’ goals is to discredit their opponents in front of other participants or an overhearing audience. This chapter investigates different types of overt strategy ascriptions in a public mediation: exposing the opponent’s use of rhetorical devices, exposing the opponent’s use of false premises, and exposing that an opponent is telling only a half-truth. This chapter shows how speakers use ascriptions of acting strategically as accusations to disclose their opponents’ intentions and ‘truths’ that the opponents allegedly conceal and that are detrimental to their position.
This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks—based on available literature—along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, which is a principle that has always been central to the spirit of UD.