The article deals with communicative failures of journalists in “YouTube” celebrity video interviews in the Ukrainian and German linguacultures from the point of view of social interaction and the theory of speech genres at all structural levels of the communicative genre construction, establishing common and distinctive features in both linguacultures. The analysis made it possible to conclude that behind a language (speech) failure there is a violation caused by a journalist, a respondent, or an external noise.
The article examines communicative failures in the speech genre of the video interview through the prism of Ukrainian national identity. It identifies the topics, types, and genre-specific linguistic features of the Ukrainian video interview as a model of dialogic speech. It establishes the specifics of communicative failures in this genre (in interviews with athletes, politicians, and cultural figures) with regard to the positions of the communicants, the structural levels of the genre under study, and the maxims of communication.
Die lexikografische Behandlung von Neologismen aus der Perspektive hispanophoner DaF-Lernender
(2019)
Taking several media-related communication verbs such as mailen or twittern as examples, the study examines whether the lexicographic information available on neologisms is adequate for foreign-language production. The investigation is carried out from the perspective of a Spanish-speaking learner of German as a foreign language (DaF). The analysis draws on neologism dictionaries and databases for German as well as common bilingual online dictionaries for the language pair Spanish–German. The results of the lexicographic investigation are compared, by way of example, with corpus-based data from a doctoral thesis. The findings show the need to optimize the lexicographic treatment of (verbal) neologisms in the Spanish–German context, with particular attention to foreign-language text production.
Lektürehinweis
(2019)
The handbook is a periodical, multilingual online publication. The volumes published so far have already been downloaded more than 8,500 times. For readers who prefer the haptic experience of reading, the publication is also available in print. Encyclopedic articles on selected concepts of language criticism are published successively; each deals with a key concept of language criticism that is of cultural significance from a European perspective. The aim is thus to present a conceptual history of European language criticism. On the one hand, the handbook offers a specific view of the individual language cultures; on the other, it examines them comparatively.
Einleitung
(2019)
The present volume, “Sprachinstitutionen und Sprachkritik” (language institutions and language criticism), is directly connected to the first three volumes of our handbook series and to the question of how the much-discussed and discursively constituted concept of the linguistic norming and standardization of a national language has developed across language cultures and how it is currently changing. These aspects reveal instructive links to the first handbook volume, “Sprachnormierung und Sprachkritik”, as well as to the second (“Standardisierung und Sprachkritik”) and the third (“Sprachpurismus und Sprachkritik”).
We report on a new project building a Natural Language Processing resource for Zulu by making use of resources already available. By semi-automatically combining tagging results with the results of morphological analysis, we expect to reduce the amount of manual work needed to generate a fine-grained gold standard corpus usable for training a tagger. From the tagged corpus, we plan to extract verb-argument pairs with the aim of compiling a verb valency lexicon for Zulu.
Mehrsprachigkeitsdiskurse im Bildungskontext in Lettland zwischen Populismus und Weltoffenheit
(2019)
Our paper discusses current debates on languages and multilingualism in the education system in Latvia. The theoretical background is formed by debates on multilingualism, language attitudes, and heteroglossic ideology. After a brief historical introduction to Latvia's linguistic ecosystem, we present examples from the current reform of school curricula that take up multilingual approaches. These reform efforts are, however, rejected in widespread discourses in Latvian society. Drawing on reactions from education policymakers and on journalistic texts, we show how influential traditional notions of language learning still are and how a modernization of Latvian-language teaching is linked to questions of national identity, in which arguments are at times even openly xenophobic. At the same time, it becomes clear how the discourse operates with myths and half-truths in the interest of “saving the Latvian language”. The third part of the paper presents two studies among teachers in Latvia that asked explicitly about attitudes and practices regarding code-switching, translanguaging, and similar phenomena. The responses reflect the division in society; however, the results also show that prospects for modern, multilingual language teaching in Latvia certainly exist.
This paper analyzes the linguistic landscape (LL) of the city of Bautzen / Budyšin in Germany, a town which is frequently considered the “capital” of the Slavonic minority of the Sorbs. It focuses on the societal role of Sorbian in relation to practices and ideologies of mainstream German society. The vast majority of signs in Bautzen / Budyšin are in German only. Sorbian is essentially restricted to explicitly Sorbian institutions and to local and regional administration. Interviews conducted in shops and on the streets reveal that paternalistic attitudes common to perceptions of language policies and minority languages in Germany dominate; practices maintain the common monolingual habitus in German society. Members of the majority population show little awareness of Sorbian issues, and Sorbian signage is seen as a generous gesture but considered essentially unnecessary. Only very recently has a reaction by the Sorbian community challenged these practices and attitudes.
This edited collection provides an overview of linguistic diversity, societal discourses and interaction between majorities and minorities in the Baltic States. It presents a wide range of methods and research paradigms including folk linguistics, discourse analysis, narrative analyses, code alternation, ethnographic observations, language learning motivation, languages in education and language acquisition. Grouped thematically, its chapters examine regional varieties and minority languages (Latgalian, Võro, urban dialects in Lithuania, Polish in Lithuania); the integration of the Russian language and its speakers; and the role of international languages like English in Baltic societies. The editors’ introductory and concluding chapters provide a comparative perspective that situates these issues within the particular history of the region and broader debates on language and nationalism at a time of both increased globalization and ethno-regionalism. This book will appeal in particular to students and scholars of multilingualism, sociolinguistics, language discourses and language policy, and provide a valuable resource for researchers focusing on Baltic States, Northern Europe and the post-Soviet world in the related fields of history, political science, sociology and anthropology.
This chapter investigates differences in language regards in Latvia and Estonia. Based on the results of a survey that had about 1000 respondents in each country, it analyses general views on languages and language-learning motivation, as well as specific regards of Estonian, Latvian, Russian, English, German and other languages. The results show that languages and language learning are generally important for the respondents; language-learning motivation is overwhelmingly instrumental. Besides the obvious value of the titular languages of each country, English and Russian are to differing degrees considered of importance for professional and leisure purposes, ahead of German, Finnish (in Estonia) and French, whereas other languages are of little relevance. In more emotionally related categories, differences are more salient. L1-speakers of Russian differ in their views from L1-speakers of Estonian and Latvian, indicating that the linguistic acculturation of society in Estonia tends to be more monodirectional towards Estonian, whereas in Latvia there are more bidirectional tendencies as both Latvian and Russian L1-speakers regard each other’s languages as at least moderately relevant.
Resistance and adaptation to newspeakerness in educational institutions: two tales from Estonia
(2019)
The term ‘new speaker’ has recently emerged as an attempt by sociolinguists not only to understand the different types of speaker profiles that can be found in contemporary societies, but also to grasp the underlying processes of becoming a legitimate speaker in a given society. In this article, we combine the results from two studies situated in two educational institutions in Estonia in order to find out about speakers’ language attitudes and experiences in connection to learning and using Estonian. We concentrate on members of the international community who have relatively recently arrived in the country. Our results indicate that these speakers fluctuate between two prototypical discourses, which we broadly dub ‘resistance’ and ‘adaptation’ to newspeakerness. Our study thereby adds to current debates on ‘new speaker’ and language policy issues by illustrating how tensions around language legitimacy are played out on the ground in a small nation state such as Estonia.
This paper focuses on so-called syntactic projection phenomena in the German language. This term from the German Gesprächsforschung refers to the fact that an utterance, or part of it, foreshadows another one. The paper aims to show how such projection phenomena are consciously exploited for rhetorical purposes, on the basis of excerpts from the Stuttgart 21 mediation talks. The linguistic analysis focuses on syntactic projection phenomena involving the causal adverbial connectives deshalb and deswegen.
This paper aims at investigating the usage of present subjunctive (Konjunktiv I), which is traditionally labelled as a feature of standard written language and therefore as typically occurring in communication genres based on it such as press texts and reporting, in everyday spoken German. Through an analysis of corpus data performed according to theory and method of Interactional Linguistics and encompassing private, institutional and public interactional domains, the paper will show how this particular verb form expresses different epistemic stances according to its syntactic embedment.
Preface
(2019)
This contribution deals with the computer-assisted transcription of Arabic-German conversational data for interaction-oriented studies. It first addresses key methodological challenges of conversation-analytic work: given current corpus technology, the use of Arabic script in a multilingual, bidirectional transcript does not permit an analytically adequate reconstruction of the reciprocity, linearity, and simultaneity of linguistic action. In addition, transcribing Arabic conversational data is made more difficult by the insufficient (conversation-analytic) attention paid to non-standard varieties and phenomena of spoken language. The second part of the contribution is therefore devoted to the solutions developed and tested so far: a rigorous, conversation-analytically grounded transcription system for spoken Arabic.
The paper deals with the process of computer-aided transcription regarding Arabic-German data material for interaction-based studies. First of all, it sheds light upon some major methodological challenges posed by the conversation-analytic approaches: due to current corpus technology, the reciprocity, linearity, and simultaneity of linguistic activities cannot be reconstructed in an analytically proper way when using the Arabic characters in multilingual and bidirectional transcripts. The difficulty of transcribing Arabic encounters is also compounded by the fact that Spoken Arabic as well as its varieties and phenomena have not been standardised enough (for conversation-analytic purposes). Therefore, the second part of this paper is dedicated to preliminary, self-developed solutions, namely a systematic method for transcribing Spoken Arabic.
Content
1 Predicting learner knowledge of individual words using machine learning
Drilon Avdiu, Vanessa Bui, Klára Ptacinová Klimcíková
2 Automatic Generation and Semantic Grading of Esperanto Sentences in a Teaching Context
Eckhard Bick
3 Toward automatic improvement of language produced by non-native language learners
Mathias Creutz, Eetu Sjöblom
4 Linguistic features and proficiency classification in L2 Spanish and L2 Portuguese
Iria del Río
5 Integrating large-scale web data and curated corpus data in a search engine supporting German literacy education
Sabrina Dittrich, Zarah Weiss, Hannes Schröter, Detmar Meurers
6 Formalism for a language agnostic language learning game and productive grid generation
Sylvain Hatier, Arnaud Bey, Mathieu Loiseau
7 Understanding Vocabulary Growth Through An Adaptive Language Learning System
Elma Kerz, Andreas Burgdorf, Daniel Wiechmann, Stefan Meeger, Yu Qiao, Christian Kohlschein, Tobias Meisen
8 Summarization Evaluation meets Short-Answer Grading
Margot Mieskes, Ulrike Padó
9 Experiments on Non-native Speech Assessment and its Consistency
Ziwei Zhou, Sowmya Vajjala, Seyed Vahid Mirnezami
10 The Impact of Spelling Correction and Task Context on Short Answer Assessment for Intelligent Tutoring Systems
Ramon Ziai, Florian Nuxoll, Kordula De Kuthy, Björn Rudzewitz, Detmar Meurers
You might not know what a “smombie” is, but you have certainly already met one today. In public streets and places, the so-called “smartphone zombies” regularly cross our paths. They walk slowly, in peculiar ways, their eyes and fingers focused on their smartphone displays. While some cities have already introduced specific walking lanes or ground-level traffic signs for smartphone users “on the go”, it is not only road safety that is at stake. Frequently hunching over our phones causes cervical pain, we are addicted to likes on social media, and the fear of missing out prevents us from switching off our phones. If asked whether mobile device use is possibly harmful to our bodies and minds, most people would spontaneously agree. Our social skills seem to have been diminishing constantly since smartphones became an everyday tool: we stick to them like glue while waiting for the bus, while walking, while eating, even while being with others. Will we turn into social zombies in the end?
Special Issue: Mobile Medienpraktiken im Spannungsfeld von Öffentlichkeit, Privatheit und Anonymität
(2019)
Speech planning is a sophisticated process. In dialog, it regularly starts in overlap with an incoming turn by a conversation partner. We show that planning spoken responses in overlap with incoming turns is associated with higher processing load than planning in silence. In a dialogic experiment, participants took turns with a confederate describing lists of objects. The confederate’s utterances (to which participants responded) were pre-recorded and varied in whether they ended in a verb or an object noun and whether this ending was predictable or not. We found that response planning in overlap with sentence-final verbs evokes larger task-evoked pupillary responses, while end predictability had no effect. This finding indicates that planning in overlap leads to higher processing load for next speakers in dialog and that next speakers do not proactively modulate the time course of their response planning based on their predictions of turn endings. The turn-taking system exerts pressure on the language processing system by pushing speakers to plan in overlap despite the ensuing increase in processing load.
This article attempts to discuss previous research on the linguistic history of National Socialism in light of its potential to inform current historiographical questions. In a first step, it outlines the history of research on the language of, or under, National Socialism from the early post-war period to the 1980s. What becomes clear above all is that important studies were produced in this phase, but that methodological and theoretical limitations left numerous blind spots. In a second step, it sets out the research interests and tools with which more recent research has focused on these blind spots, and what results it has produced. The emphasis is on those aspects that, in the author's view, need to be taken into account in a history of language and communication under National Socialism that has yet to be written. Specifically, this means broadening the perspective to include different communicative situations and heterogeneous constellations of actors, so that language use under the discursive conditions of National Socialism can be described adequately.
Previous research has shown that, in German oral discourse, okay can be used both as a response token (e.g., for agreeing with the previous turn or for claiming a certain degree of understanding) and as a discourse marker (e.g., for closing conversational topics or sequences and/or indicating transitions). This contribution focuses on the use of okay as a response token and how it is connected with the speakers’ interactional state of knowledge (their understanding, their assumptions etc.). The analysis is based on video-recorded everyday conversations in German and a sequential, micro-analytic approach (multimodal conversation analysis). The main function of conversational okay in the selected data set is related to indicating the acceptance of prior information. With okay, however, speakers claim acceptance of a piece of information that they cannot verify or check. The analysis contrasts sequences containing only okay with sequences in which change-of-state tokens such as ah and achso co-occur with okay. This illustrates that okay itself does not index prior information as new, and that it is not used for agreeing with or for confirming prior information. Instead it enables the speaker to adopt a kind of neutral, “non-agreeing” position towards a given piece of information.
In this chapter, we discuss steps toward extending CMDI’s semantic interoperability beyond the Social Sciences and Humanities: We stress the need for an initial data curation step, in part supported by a relation registry that helps impose some structure on CMDI vocabulary; we describe the use of authority file information and other controlled vocabulary to help connect CMDI-based metadata to existing Linked Data; we show how significant parts of CMDI-based metadata can be converted to bibliographic metadata standards and hence entered into library catalogs; and finally we describe first steps to convert CMDI-based metadata to RDF. The initial grassroots approach of CMDI (meaning that anybody can define metadata descriptors and components) mirrors the AAA slogan of the Semantic Web (“Anyone can say Anything about Any topic”). Ironically, this makes it hard to fully link CMDI-based metadata to other Semantic Web datasets. This paper discusses the challenges of this enterprise.
On the contributions to the special issue.
The contributions collected in this special issue address publicness, privacy, and anonymity in communicative action with mobile media, each from a different research question, within a different methodological approach, and with its own data set.
This paper investigates self-initiated uses of mobile phones (such as texting or making a call) in everyday video-recorded conversations among Czech speakers. Using ethnomethodological conversation analysis, it illustrates how participants publicly frame their own device use (for example, by announcements), and how co-present interlocutors respond to it. Previous studies have described how participants manage two concurrent communicative involvements, but have not provided detailed sequential descriptions of how device use can be negotiated and accounted for. This study shows that mobile device use in co-presence is not a priori problematic (or vice versa). Instead, participants frame their technology use in different ways according to various features of the social situation they treat as momentarily relevant. These features include the course of the conversation and how the device use relates to it, the overall participation framework and the opacity of the device use for co-present others.
One has to draw a distinction, I think. Being poetic is one thing, being literary another. This also applies to the corresponding difficult nouns, literariness (Literarizität) and poeticity (Poetizität). And then one can also reflect on the poetic function, a postulate that stems from the extension of Bühler's triad of functions in the context of Prague functionalism, which we owe to Roman Jakobson. That one has to draw distinctions holds above all, or even more so, in a modernity that is no longer (so) rule-governed, and for an answer grounded in linguistics.
We investigate whether prototypicality or prominence of semantic roles can account for role-related effects in sentence interpretation. We present two acceptability-rating experiments testing three different constructions: active, personal passive and DO-clefts involving the same type of transitive verbs that differ with respect to the agentive role features they select. Our results reveal that there is no cross-constructional advantage for prototypical roles (e.g., agents), hence disconfirming a central tenet of role prototypicality. Rather, acceptability clines depend on the construction under investigation, thereby highlighting different role features. This finding is in line with one core assumption of the prominence account stating that role features are flexibly highlighted depending on the discourse function of the respective construction.
Although the N400 was originally discovered in a paradigm designed to elicit a P300 (Kutas and Hillyard, 1980), its relationship with the P300 and how both overlapping event-related potentials (ERPs) determine behavioral profiles is still elusive. Here we conducted an ERP (N = 20) and a multiple-response speed-accuracy tradeoff (SAT) experiment (N = 16) on distinct participant samples using an antonym paradigm (The opposite of black is white/nice/yellow with acceptability judgment). We hypothesized that SAT profiles incorporate processes of task-related decision-making (P300) and stimulus-related expectation violation (N400). We replicated previous ERP results (Roehm et al., 2007): in the correct condition (white), the expected target elicits a P300, while both expectation violations engender an N400 [reduced for related (yellow) vs. unrelated targets (nice)]. Using multivariate Bayesian mixed-effects models, we modeled the P300 and N400 responses simultaneously and found that correlation between residuals and subject-level random effects of each response window was minimal, suggesting that the components are largely independent. For the SAT data, we found that antonyms and unrelated targets had a similar slope (rate of increase in accuracy over time) and an asymptote at ceiling, while related targets showed both a lower slope and a lower asymptote, reaching only approximately 80% accuracy. Using a GLMM-based approach (Davidson and Martin, 2013), we modeled these dynamics using response time and condition as predictors. Replacing the predictor for condition with the averaged P300 and N400 amplitudes from the ERP experiment, we achieved identical model performance. We then examined the piecewise contribution of the P300 and N400 amplitudes with partial effects (see Hohenstein and Kliegl, 2015). 
Unsurprisingly, the P300 amplitude was the strongest contributor to the SAT-curve in the antonym condition and the N400 was the strongest contributor in the unrelated condition. In brief, this is the first demonstration of how overlapping ERP responses in one sample of participants predict behavioral SAT profiles of another sample. The P300 and N400 reflect two independent but interacting processes and the competition between these processes is reflected differently in behavioral parameters of speed and accuracy.
Wird aus Sprache Gewalt?
(2019)
After the murder of Walter Lübcke, head of the Kassel regional government (Regierungspräsident), and further murderous attacks in the recent past, commentaries and statements repeatedly claimed that language had turned into violence here. On the one hand, this seems plausible given what we know about the perpetrators and their statements. But what does scholarship say? How should this assumed connection be assessed from the perspective of linguistics and communication studies?
In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a lossless standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first step of processing is the conversion of the data into a certain subset of TEI following the ISO standard ‘Transcription of spoken language’ with the help of an XSL transformation. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step is the conversion of the ISO/TEI files to a JSON format used by the “Tsakorpus” search platform. This step allows us to make the corpora available through a web-based search interface. In addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future.
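The second step described in the abstract above (ISO/TEI to JSON) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the sample fragment is a hypothetical, heavily simplified ISO/TEI-style transcript, the function name `tei_to_records` is invented, and the output field names do not reproduce the actual Tsakorpus JSON schema or the INEL converter.

```python
import json
import xml.etree.ElementTree as ET

# TEI namespace used for all element lookups.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified ISO/TEI-style fragment (assumption:
# real files carry timelines, morpheme tiers and further annotation).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <annotationBlock who="#SPK0" start="#T0" end="#T1">
        <u><seg><w>first</w><w>utterance</w></seg></u>
      </annotationBlock>
      <annotationBlock who="#SPK1" start="#T1" end="#T2">
        <u><seg><w>second</w><w>one</w></seg></u>
      </annotationBlock>
    </body>
  </text>
</TEI>"""


def tei_to_records(tei_xml):
    """Flatten each annotation block into a JSON-serialisable dict.

    The keys ("speaker", "start", "end", "text", "words") are
    illustrative only, not the Tsakorpus format.
    """
    root = ET.fromstring(tei_xml)
    records = []
    for block in root.iterfind(".//tei:annotationBlock", TEI_NS):
        # Collect the word tokens of the utterance in document order.
        words = [w.text or "" for w in block.iterfind(".//tei:w", TEI_NS)]
        records.append({
            "speaker": block.get("who", "").lstrip("#"),
            "start": block.get("start", "").lstrip("#"),
            "end": block.get("end", "").lstrip("#"),
            "text": " ".join(words),
            "words": words,
        })
    return records


if __name__ == "__main__":
    print(json.dumps(tei_to_records(SAMPLE_TEI), indent=2, ensure_ascii=False))
```

Keeping the intermediate ISO/TEI file as the single source and deriving the search format from it is what would let other ISO/TEI corpora reuse such a converter, as the abstract notes.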
In this paper, we present WebAnno-MM, an extension of the popular web-based annotation tool WebAnno, which is designed for the linguistic annotation of transcribed spoken data with time-aligned media files. Several new features have been implemented for our current use case: a novel teaching method based on pair-wise manual annotation of transcribed video data and systematic comparison of agreement between students. To enable the annotation of transcribed spoken language data, apart from addressing technical and data-model-related challenges, WebAnno-MM offers an additional view of the data: a (musical) score view for the inspection of parallel utterances, which is relevant for various methodological research questions regarding the analysis of interactions of spoken content.
After the murder of Walter Lübcke, head of the Kassel regional government (Regierungspräsident), and further murderous attacks in the recent past, commentaries and statements repeatedly claimed that language had turned into violence here. On the one hand, this seems plausible given what we know about the perpetrators and their statements. But what does scholarship say?
For German, there is no state institution, apart from the Rat für deutsche Rechtschreibung (Council for German Orthography), that has the political mandate to accompany, with legitimate authority, linguistic norming and ultimately the standardization of a national language. Against this background, the article describes the various non-state language institutions for German. In the German-speaking area, the language academies of neighbouring countries always served as a model, but what emerged were above all language societies and language associations that contributed to the cultivation of the language.
Handbuch europäische Sprachkritik online (HESO). Band 4: Sprachinstitutionen und Sprachkritik
(2019)
In the individual languages covered here, language institutions have promoted the cultivation, norming, and critical discussion of language to varying degrees. Among the best-known and, in a sense, most symbolically charged language institutions are undoubtedly language academies such as the Accademia della Crusca or the Académie française. Alongside language academies, language societies sometimes also stand for a debate that is not only norm-setting but in part also purist.
Transdisciplinary research is research not only on, but also for and, most of all, with practitioners. In the research framework of transdisciplinarity, scholars and practitioners collaborate throughout research projects with the aim of mutual learning. This paper shows the value transdisciplinarity can add to media linguistics. It does so by investigating the digital literacy shift in journalism: the change, in the last two decades, from the predominance of a writing mode that we have termed focused writing to a mode we have called writing-by-the-way. Large corpora of writing process data have been generated and analyzed with the multimethod approach of progression analysis in order to combine analytical depth with breadth. On the object level of doing writing in journalism, results show that the general trend towards writing-by-the-way opens up new niches for focused writing. On a meta level of doing research, findings explain under what conditions transdisciplinarity allows for deeper insights into the medialinguistic object of investigation.
Modern theoretical linguistics lives by the insight that the meanings of complex expressions derive from the meanings of their parts and the way these are composed. However, the currently dominating theories of the syntax-semantics interface hastily relegate important aspects of meaning which cannot readily be aligned with visible structure to empty projecting heads non-reductively (mainstream Generative Grammar) or to the syntactic construction holistically (Construction Grammar). This book develops an alternative, compositional analysis of the hidden aspectual-temporal, modal and comparative meanings of a range of productive constructions of which pseudoreflexive, excessive and directional complement constructions take center stage. Accordingly, a contradiction-inducing hence semantically problematic part of literally coded meaning is locally ignored and systematically realized “expatriately” with respect to parts of structure that achieve the indexical anchoring of propositional contents in terms of times, worlds and standards of comparison, thus yielding the observed hidden meanings.
We propose a Cross-lingual Encoder-Decoder model that simultaneously translates and generates sentences with Semantic Role Labeling annotations in a resource-poor target language. Unlike annotation projection techniques, our model does not need parallel data during inference time. Our approach can be applied in monolingual, multilingual and cross-lingual settings and is able to produce dependency-based and span-based SRL annotations. We benchmark the labeling performance of our model in different monolingual and multilingual settings using well-known SRL datasets. We then train our model in a cross-lingual setting to generate new SRL labeled data. Finally, we measure the effectiveness of our method by using the generated data to augment the training basis for resource-poor languages and perform manual evaluation to show that it produces high-quality sentences and assigns accurate semantic role annotations. Our proposed architecture offers a flexible method for leveraging SRL data in multiple languages.
Ein CERN der deutschen Sprache. Überlegungen zu einem Dokumentationszentrum der deutschen Sprache
(2019)
In an earlier publication it was claimed that there is no useful relationship between Swahili-English dictionary look-up frequencies and the occurrence frequencies for the same wordforms in Swahili-English corpora, at least not beyond the top few thousand wordforms. This result was challenged using data for German by a different team of researchers using an improved methodology. In the present article the original Swahili-English data is revisited, using ten years’ worth of it rather than just two, and using the improved methodology. We conclude that there is indeed a positive relationship. In addition, we show that online dictionary look-up behaviour is remarkably similar across languages, even when, as in our case, one is dealing with languages from very dissimilar language families. Furthermore, online dictionaries turn out to have minimum look-up success rates, below which they simply cannot go. These minima are language-sensitive and vary depending on the regularity of the searched-for entries, but are otherwise constant no matter the size of randomly sampled dictionaries. Corpus-informed sampling always improves on any random method. Lastly, from the point of view of the graphical user interface, we argue that the average user of an online bilingual dictionary is better served with a single search box, rather than separate search boxes for each dictionary side.
As a means of communication, gestures can serve conversational participants at several levels. As co-speech gestures, they can add information to the verbally expressed content and they can serve to manage turn-taking. In order to look closer at the interplay between these resources in face-to-face conversation, we annotated hand gestures, syntactic completion points and the related turn-organisation, and measured the timing of gesture strokes and their lexical/phrasal referent. In a case study on German, we observe the trend that speakers vary less in gesture-lexis on- and offsets when keeping the turn after syntactic completions than at speaker changes, backchannels or other locations of a conversation. This indicates that timing properties of non-verbal cues interact with verbal cues to manage turn-taking.
The Archive for Spoken German (Archiv für Gesprochenes Deutsch, AGD; Stift/Schmidt 2014) at the Leibniz-Institut für Deutsche Sprache is a research data centre for corpora of spoken German. Founded in 1932 as the Deutsches Spracharchiv (DSAv), it has built up a collection of nearly 100 variation, interview and conversation corpora through in-house projects, collaborations and the acquisition of data from completed research projects. These corpora document, among other things, dialectal language use, forms of oral communication, and the language use of particular types of speakers or on particular topics. Today this collection is almost completely digitized, and a large part of it is made available online to the scientific community for use in research and teaching via the Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD).
Vorwort
(2019)
Sprechen im Umbruch. Zeitzeugen erzählen und argumentieren rund um den Fall der Mauer im Wendekorpus
(2019)
The Lehnwortportal Deutsch (2012 seqq.) serves as an integrated online information system on German lexical borrowings into other languages, synthesizing an increasing number of lexicographical dictionaries and providing basic cross-resource search options. The paper discusses the far-reaching revision of the system’s conceptual, lexicographical and technological underpinnings currently under way, focussing on their relevance for multilingual loanword lexicography.
A Supervised learning approach for the extraction of opinion sources and targets from German text
(2019)
We present the first systematic supervised learning approach for the extraction of opinion sources and targets on German language data. A wide choice of different features is presented, particularly syntactic features and generalization features. We point out specific differences between opinion sources and targets. Moreover, we explain why implicit sources can be extracted even with fairly generic features. In order to ensure comparability our classifier is trained and tested on the dataset of the STEPS shared task.
We present a descriptive analysis of the two datasets from the shared task on Source, Subjective Expression and Target Extraction from Political Speeches (STEPS), the only existing German dataset for opinion role extraction of its size. Our analysis discusses the individual properties of the three components (subjective expressions, sources and targets) and their relations to each other. Our observations should help practitioners and researchers when building a system to extract opinion roles from German data.
We present the second edition of the GermEval Shared Task on the Identification of Offensive Language. This shared task deals with the classification of German tweets from Twitter. Two subtasks were continued from the first edition, namely a coarse-grained binary classification task and a fine-grained multi-class classification task. As a novel subtask, we introduce the classification of offensive tweets as explicit or implicit.
The shared task had 13 participating groups submitting 28 runs for the coarse-grained task, another 28 runs for the fine-grained task, and 17 runs for the implicit-explicit task.
We evaluate the results of the systems submitted to the shared task. The shared task homepage can be found at https://projects.fzai.h-da.de/iggsa/
The paper describes a multiply annotated corpus of German-language song lyrics as a data basis for interdisciplinary research scenarios. The resource enables empirically grounded analyses of linguistic phenomena, systemic-structural interrelations and tendencies in the lyrics of modern pop music. We present the design and annotations of the corpus, which is stratified into thematic and author-specific archives, as well as descriptive statistics using the Udo Lindenberg archive as an example.
Automatic division of spoken language transcripts into sentence-like units is a challenging problem, caused by disfluencies, ungrammatical structures and the lack of punctuation. We present experiments on dividing up German spoken dialogues where we investigate the impact of task setup and data representation, encoding of context information as well as different model architectures for this task.
This paper presents first steps towards metaphor detection in German poetry, in particular in expressionist poems. We create a dataset with adjective-noun pairs extracted from expressionist poems, manually annotated for metaphoricity. We discuss the annotation process and present models and experiments for metaphor detection where we investigate the impact of context and the domain dependence of the models.
In this paper, we present our work in progress to automatically identify free indirect representation (FI), a type of thought representation used in literary texts. With a deep learning approach using contextual string embeddings, we achieve F1 scores between 0.45 and 0.5 (sentence-based evaluation for the FI category) on two very different German corpora, a clear improvement on earlier attempts for this task. We show how consistently marked direct speech can help in this task. In our evaluation, we also consider human inter-annotator scores and thus address measures of certainty for this difficult phenomenon.
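To make the reported metric concrete, the following is a minimal sketch of a sentence-based F1 computation for a single category. It is an illustration only, not the authors' evaluation code: the function name `f1_for_category`, the label `"O"` for non-FI sentences and the toy label sequences are all assumptions made for this example.

```python
def f1_for_category(gold, predicted, positive="FI"):
    """Sentence-level F1 for one category (hypothetical helper).

    gold and predicted are equally long lists with one label per sentence.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No correct positive prediction: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): three FI sentences in gold, one of them found.
gold = ["FI", "O", "FI", "O", "FI", "O"]
pred = ["FI", "FI", "O", "O", "O", "O"]
score = f1_for_category(gold, pred)
```

In the toy data, one true positive, one false positive and two false negatives give a precision of 0.5 and a recall of one third, which shows how a category that is hard to detect pulls the F1 score down.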
This paper presents the prototype of a lexicographic resource for spoken German in interaction, which was conceived within the framework of the LeGeDe project (LeGeDe = Lexik des gesprochenen Deutsch). First of all, it summarizes the theoretical and methodological approaches that were used for the initial planning of the resource. The headword candidates were selected by analyzing corpus-based data. For this purpose, the data of two corpora (written and spoken German) were compared with quantitative methods. The information that was gathered on the selected headword candidates can be assigned to two different sections: meanings and functions in interaction.
Additionally, two studies on the expectations of future users towards the resource were carried out. The results of these two studies were also taken into account in the development of the prototype. Focusing on the presentation of the resource’s content, the paper shows both the different lexicographical information in selected dictionary entries, and the information offered by the provided hyperlinks and external texts. As a conclusion, it summarizes the most important innovative aspects that were specifically developed for the implementation of such a resource.
This contribution aims to describe privacy, publicness and anonymity as essential analytic dimensions for media linguistic research. The dimensions are not inherent in and predetermined by the technical features and forms of communication provided by mobile devices, but are used by the participants as an orientation grid for shaping their online and offline practices in and with mobile media. Considering both mobile device use in the public realm and the dissemination of increasingly private content in social media (which is said to lead to ‘blurred boundaries’ between the private and the public), the paper provides a brief overview of the main developments in mobile media research: Studies adopting various approaches – e.g. sociological-ethnographic, linguistic and media studies – illustrate how publicness, privacy and anonymity are actively shaped and brought about by mobile media users in face-to-face and remote social encounters. As this shows that publicness, privacy and anonymity are still relevant concepts for users, future media linguistics studies should focus on the dynamic multimodal practices by which they are contextualized and accomplished.
Narratives 2.0. A Multi-dimensional approach to semi-public storytelling in WhatsApp voice messages
(2019)
Based on a corpus of voice message narratives in German WhatsApp group chats, the present study contributes to research on social media storytelling in that it focusses on stories of personal experience which are embedded in a communication platform which favours a continuous dialogic exchange, narrated to well-defined non-anonymous publics and multimodal (comprised of visual and audible posting types). To capture the characteristics of this type of social media storytelling, the paper argues that Ochs and Capps’ (2001) dimensional model originally developed for conversational narratives (including the dimensions of tellability, tellership, embeddedness, linearity, moral stance) should be expanded by the dimensions of publicness, multimodality and sequencing. The prototype of storytelling in WhatsApp group chats is based on recent personal experiences; it is related by a single teller as an initial, sequentially non-embedded and linearly organised “big package” story (in a single voice message sometimes introduced by a text message containing an abstract); other group members routinely document their evaluative stances in rather conventionalised text message responses in the semi-public group space.
Since 2013, representatives of several French and German CMC corpus projects have developed three customizations of the TEI-P5 standard for text encoding in order to adapt the encoding schema and models provided by the TEI to the structural peculiarities of CMC discourse. Based on the three schema versions, a fourth version has been created which takes into account the experiences from encoding our corpora and which is specifically designed for the submission of a feature request to the TEI council. On our poster we present the structure of this schema and its relations (commonalities and differences) to the previous schemas.
Smooth turn-taking in conversation depends in part on speakers being able to communicate their intention to hold or cede the floor. Both prosodic and gestural cues have been shown to be used in this context. We investigate the interplay of pitch movements and hand gestures at locations at which speaker change becomes relevant, comparing their use in German and Swedish. We find that there are some shared functions of prosody and gesture with regard to turn-taking in the two languages, but that these shared functions appear to be mediated by the different phonological demands on pitch in the two languages.
Classical null hypothesis significance tests are not appropriate in corpus linguistics, because the randomness assumption underlying these testing procedures is not fulfilled. Nevertheless, there are numerous scenarios where it would be beneficial to have some kind of test in order to judge the relevance of a result (e.g. a difference between two corpora) by answering the question whether the attribute of interest is pronounced enough to warrant the conclusion that it is substantial and not due to chance. In this paper, I outline such a test.
Intergroup conflict im Sprachgebrauch rechtspopulistischer Gruppierungen am Beispiel von "Pegida"
(2019)
Populism divides societies: this is a view that is frequently heard and read. The most obvious form of division appears to be the group-related split between those who support populist movements and parties and those who, more or less firmly, do not. However, the rifts in society are not apparent only in this group conflict. It is just one line in a web of actual, or merely perceived and rhetorically constructed, fractures that populist groups highlight or possibly even create in the first place, and that find their way into public discourse.
Muskelversagen? Großartig! - Framing von Fachbegriffen aufgrund unterschiedlichen Weltwissens
(2019)
Most authors agree that modal particles - a class of function words widely considered characteristic of Modern German - cannot receive prosodic stress, though the reasons for this restriction have not yet been satisfactorily explained. This paper argues that unstressability follows from the general contribution of modal particles to compositional utterance meaning, which requires them to take scope over focus-background structures. Form and function of modal particle meanings are modelled and illustrated for five representative examples - the particles wohl, ja, eigentlich, eben and halt. It is argued that these as well as other particles, whenever they occur under prosodic stress, can preserve neither the meaning nor the syntactic behaviour of modal particles. All instances of stressed particles in German must therefore be categorized in other functional classes.
Text corpora come in many different shapes and sizes and carry heterogeneous annotations, depending on their purpose and design. The true benefit of corpora is rooted in their annotation and the method by which this data is encoded is an important factor in their interoperability. We have accumulated a large collection of multilingual and parallel corpora and encoded it in a unified format which is compatible with a broad range of NLP tools and corpus linguistic applications. In this paper, we present our corpus collection and describe a data model and the extensions to the popular CoNLL-U format that enable us to encode it.
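For readers unfamiliar with the format mentioned above, here is a minimal sketch of reading plain CoNLL-U: ten tab-separated columns per token line, `#` comment lines, and blank lines between sentences. It covers only the standard format, not the extensions the paper describes; the function name `parse_conllu` and the sample sentence are assumptions made for illustration.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):
            # Sentence-level comment or metadata line; skipped in this sketch.
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:
        sentences.append(tokens)
    return sentences


# Invented sample sentence, built with explicit tab separators.
sample = "\n".join([
    "# text = Das ist gut",
    "\t".join(["1", "Das", "der", "PRON", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "ist", "sein", "AUX", "_", "_", "3", "cop", "_", "_"]),
    "\t".join(["3", "gut", "gut", "ADJ", "_", "_", "0", "root", "_", "_"]),
    "",
])
sentences = parse_conllu(sample)
```

Keeping each token as a dictionary keyed by column name makes it straightforward to add further keys, which is one conceivable way to accommodate format extensions; how the authors' data model actually does this is not stated in the abstract.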
Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
Nearly all of the very large corpora of English are “static”, which allows a wide range of one-time, pre-processed data, such as collocates. The challenge comes with large “dynamic” corpora, which are updated regularly, and where preprocessing is much more difficult. This paper provides an overview of the NOW corpus (News on the Web), which is currently 8.2 billion words in size, and which grows by about 170 million words each month. We discuss the architecture of NOW, and provide many examples that show how data from NOW can (uniquely) be extracted to look at a wide range of ongoing changes in English.
As the Web ought to be considered as a series of sources rather than as a source in itself, a problem facing corpus construction resides in meta-information and categorization. In addition, we need focused data to shed light on particular subfields of the digital public sphere. Blogs are relevant to that end, especially if the resulting web texts can be extracted along with metadata and made available in coherent and clearly describable collections.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition, we report on advancements in the EuReCo and ICC initiatives with the provision of comparable corpora, and on recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.
Contents:
1. Johannes Graën, Tannon Kew, Anastassia Shaitarova and Martin Volk, "Modelling Large Parallel Corpora", pp. 1-8
2. Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary, "Asynchronous Pipelines for Processing Huge Corpora on Medium to Low Resource Infrastructures", pp. 9-16
3. Vladimír Benko, "Deduplication in Large Web Corpora", pp. 17-22
4. Mark Davies, "The best of both worlds: Multi-billion word “dynamic” corpora", pp. 23-28
5. Adrien Barbaresi, "On the need for domain-focused web corpora", pp. 29-32
6. Marc Kupietz, Eliza Margaretha, Nils Diewald, Harald Lüngen and Peter Fankhauser, "What's New in EuReCo? Interoperability, Comparable Corpora, Licensing", pp. 33-39
Forms of verbal violence, such as hate speech and cyberbullying, are currently issues of high societal relevance. In social discourse they are associated with a brutalization of social interaction. Against this background, it is necessary to integrate the topic into school lessons. This chapter outlines a teaching unit on verbal violence in the digital age. It has been developed together with students and can therefore be used in German classes, but it is also relevant for use in academic teaching.
When people protest in the streets today, digital media are always involved. Whether for mobilization or coordination, for preparation or follow-up: protests are shaped by the linguistic interactions and media practices of the participants. Mark Dang-Anh examines situated protest communication in digital media, focusing on interactions on the microblogging service Twitter. Drawing on two case studies of protests against right-wing marches, he analyses the multi-layered relations between language, media and the social practice of protesting.
How are words written in German and in English? Where are the commonalities, and where are the differences? These questions are addressed from a morphological-graphematic perspective. The focus is therefore not on relations between writing and sound form (traditionally often the focus of graphematics), but on correspondences between writing and morphology. This concerns, first, the structure of morphemes. What constraints can be formulated for the sequence of letters? What are minimal and what are prototypical stems and affixes? Second, it concerns questions of uniformity (how uniformly is a morpheme represented in writing?) and of unambiguity (how distinctly does a spelling refer to a morpheme?). Overall, it emerges that in English it is rather affixes that are reliably encoded (often unambiguously and uniformly), whereas in German stems are frequently encoded uniformly. These are two fundamentally different strategies for facilitating reading.
Not for a long time have language and language use been debated so intensively in public. Language may function here as the battlefield on which the current struggles of identity politics are fought out. Among other linguistic phenomena, metaphors in particular are intended to create a willingness to support certain political goals. Some metaphors, however, distort the facts in a harmful way. In a political dispute, reflecting on language can contribute to mutual understanding. In the media, too, the original metaphor should more often be dispensed with in favour of a more differentiated linguistic form.
We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.