Wird aus Sprache Gewalt?
(2019)
Following the murder of Kassel district president Walter Lübcke and other murderous attacks in the recent past, commentaries and statements have repeatedly claimed that language has turned into violence here. On the one hand, this seems plausible given what we know about the perpetrators and their utterances. But what does research say? How should this assumed connection be assessed from the perspective of linguistics and communication studies?
Wörterbuchartikel
(2019)
Automatic division of spoken language transcripts into sentence-like units is a challenging problem, caused by disfluencies, ungrammatical structures and the lack of punctuation. We present experiments on dividing up German spoken dialogues where we investigate the impact of task setup and data representation, encoding of context information as well as different model architectures for this task.
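One possible data representation for this task treats segmentation as token-level labeling. The sketch below is only an illustration under that assumption: the `B`/`I` label set, the function names and the toy dialogue are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

def to_boundary_labels(units: List[List[str]]) -> List[Tuple[str, str]]:
    """Flatten pre-segmented units into (token, label) pairs.

    'B' marks the first token of a sentence-like unit, 'I' every other
    token. This label set is an illustrative assumption.
    """
    pairs = []
    for unit in units:
        for i, token in enumerate(unit):
            pairs.append((token, "B" if i == 0 else "I"))
    return pairs

def from_boundary_labels(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild the units from a labeled token sequence."""
    units: List[List[str]] = []
    for token, label in pairs:
        if label == "B" or not units:
            units.append([token])      # start a new unit
        else:
            units[-1].append(token)    # continue the current unit
    return units

# Toy transcript without punctuation, already split into units (hypothetical).
dialogue = [["ja", "genau"], ["äh", "das", "weiß", "ich", "nicht"], ["okay"]]
pairs = to_boundary_labels(dialogue)
labels = [label for _, label in pairs]
```

A sequence model would then predict one label per token, and `from_boundary_labels` turns its predictions back into units.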
We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.
We examine the new task of detecting derogatory compounds (e.g. curry muncher). Derogatory compounds are much more difficult to detect than derogatory unigrams (e.g. idiot) since they are more sparsely represented in lexical resources previously found effective for this task (e.g. Wiktionary). We propose an unsupervised classification approach that incorporates linguistic properties of compounds. It mostly depends on a simple distributional representation. We compare our approach against previously established methods proposed for extracting derogatory unigrams.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition, we report on advancements in the EuReCo and ICC initiatives with the provision of comparable corpora, and on recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.
Distributional models of word use constitute an indispensable tool in corpus-based lexicological research for discovering paradigmatic relations and syntagmatic patterns (Belica et al. 2010). Recently, word embeddings (Mikolov et al. 2013) have revived the field by making it possible to construct and analyze distributional models on very large corpora. This is accomplished by reducing the very high dimensionality of word co-occurrence contexts, the size of the vocabulary, to a few dimensions, such as 100-200. However, word use and meaning can vary widely along dimensions such as domain, register, and time, and word embeddings tend to represent only the most prevalent meaning. In this paper we thus construct domain-specific word embeddings to allow for systematically analyzing variations in word use. Moreover, we also demonstrate how to reconstruct domain-specific co-occurrence contexts from the dense word embeddings.
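The idea of comparing a word's use across domains can be sketched with plain co-occurrence counts. Note that this is a deliberately simpler stand-in: it uses sparse count vectors rather than the dense embeddings the paper constructs, and the toy corpora, window size and function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List

def cooccurrence_vector(sentences: List[List[str]], target: str,
                        window: int = 2) -> Counter:
    """Count the words occurring within `window` tokens of `target`."""
    counts: Counter = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical toy corpora, one per domain.
corpora: Dict[str, List[List[str]]] = {
    "finance": [["the", "bank", "raised", "interest", "rates"],
                ["the", "bank", "approved", "the", "loan"]],
    "nature": [["the", "river", "bank", "was", "muddy"],
               ["we", "sat", "on", "the", "bank", "of", "the", "river"]],
}
vectors = {domain: cooccurrence_vector(sents, "bank")
           for domain, sents in corpora.items()}
similarity = cosine(vectors["finance"], vectors["nature"])
```

A similarity well below 1 between the two domain vectors signals that the target word keeps different company in each domain. In realistic settings, function words would usually be filtered or down-weighted before such a comparison.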
Since 2013, representatives of several French and German CMC corpus projects have developed three customizations of the TEI-P5 standard for text encoding in order to adapt the encoding schema and models provided by the TEI to the structural peculiarities of CMC discourse. Based on the three schema versions, a fourth version has been created which takes into account the experiences from encoding our corpora and which is specifically designed for the submission of a feature request to the TEI council. On our poster we present the structure of this schema and its relations (commonalities and differences) to the previous schemas.
This paper presents the prototype of a lexicographic resource for spoken German in interaction, which was conceived within the framework of the LeGeDe project (LeGeDe = Lexik des gesprochenen Deutsch). First of all, it summarizes the theoretical and methodological approaches that were used for the initial planning of the resource. The headword candidates were selected by analyzing corpus-based data. To this end, the data of two corpora (written and spoken German) were compared with quantitative methods. The information that was gathered on the selected headword candidates can be assigned to two different sections: meanings and functions in interaction.
Additionally, two studies on the expectations of future users towards the resource were carried out. The results of these two studies were also taken into account in the development of the prototype. Focusing on the presentation of the resource’s content, the paper shows both the different lexicographical information in selected dictionary entries, and the information offered by the provided hyperlinks and external texts. As a conclusion, it summarizes the most important innovative aspects that were specifically developed for the implementation of such a resource.
In this paper, we present WebAnno-MM, an extension of the popular web-based annotation tool WebAnno, which is designed for the linguistic annotation of transcribed spoken data with time-aligned media files. Several new features have been implemented for our current use case: a novel teaching method based on pair-wise manual annotation of transcribed video data and systematic comparison of agreement between students. To enable the annotation of transcribed spoken language data, WebAnno-MM addresses technical and data-model-related challenges and also offers an additional view of the data: a (musical) score view for the inspection of parallel utterances, which is relevant for various methodological research questions regarding the analysis of interactions of spoken content.
In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a lossless standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first step of processing is the conversion of the data into a certain subset of TEI following the ISO standard ‘Transcription of spoken language’ with the help of an XSL transformation. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step is the conversion of the ISO/TEI files to a JSON format used by the “Tsakorpus” search platform. This step allows us to make the corpora available through a web-based search interface. In addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future.
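The second step, from TEI-encoded utterances to JSON records, might look roughly like the sketch below. It is a simplified illustration only: the sample document uses a small subset of TEI elements (`u`, `w`), and the JSON field names (`speaker`, `tokens`, `text`) are hypothetical and do not reproduce the actual Tsakorpus input format or the project's converter.

```python
import json
import xml.etree.ElementTree as ET

# TEI namespace; the prefix "tei" is only used for the XPath-like lookups.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Minimal, hypothetical sample; real ISO/TEI transcripts are much richer.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#SPK0"><w>first</w><w>utterance</w></u>
    <u who="#SPK1"><w>reply</w></u>
  </body></text>
</TEI>"""

def tei_to_records(tei_xml: str) -> list:
    """Turn each <u> element into a flat dictionary (assumed layout)."""
    root = ET.fromstring(tei_xml)
    records = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        words = [w.text or "" for w in u.iterfind("tei:w", TEI_NS)]
        records.append({
            "speaker": u.get("who", "").lstrip("#"),
            "tokens": words,
            "text": " ".join(words),
        })
    return records

records = tei_to_records(SAMPLE_TEI)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

The first step (EXMARaLDA XML to ISO/TEI via XSL) is not shown, since the Python standard library has no XSLT processor.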
The Lehnwortportal Deutsch (2012 seqq.) serves as an integrated online information system on German lexical borrowings into other languages, synthesizing an increasing number of lexicographical dictionaries and providing basic cross-resource search options. The paper discusses the far-reaching revision of the system’s conceptual, lexicographical and technological underpinnings currently under way, focussing on their relevance for multilingual loanword lexicography.
A "polyglottal" speech synthesis - modifications for a replica of Kempelen's speaking machine
(2019)
Smooth turn-taking in conversation depends in part on speakers being able to communicate their intention to hold or cede the floor. Both prosodic and gestural cues have been shown to be used in this context. We investigate the interplay of pitch movements and hand gestures at locations at which speaker change becomes relevant, comparing their use in German and Swedish. We find that there are some shared functions of prosody and gesture with regard to turn-taking in the two languages, but that these shared functions appear to be mediated by the different phonological demands on pitch in the two languages.
This contribution presents the Redewiedergabe corpus (RW-Korpus), a historical corpus of fictional and non-fictional texts with detailed manual annotation of forms of speech representation (Redewiedergabe). The corpus is being created as part of an ongoing DFG project and is not yet complete; however, a beta release is planned for spring 2019 and will be made available to the research community. The final release is scheduled for spring 2020. The RW-Korpus is a novel resource for research on speech representation that has not previously been available for German at this level of detail, and it can serve both for quantitative linguistic and literary studies and as training material for machine learning.
This paper describes a rule-based approach to detecting direct speech without the help of any quotation markers. Fictional and non-fictional texts were used as datasets. Our evaluation shows that the results appear stable across different datasets in the fictional domain and are comparable to the results achieved in related work.
Graph-based approaches play an increasingly important role in digital lexicography. However, creating, managing and using graph-based lexicographic resources requires an access structure that is both powerful and easy to use and that supports searches for complex constellations in such graphs. Numerous query languages are available for today's graph databases, but using them demands considerable prior knowledge.
The poster presents a web-based, freely configurable query builder that makes it possible to formulate semantically highly complex search queries against a property graph database (compatible with the Tinkerpop standard). Queries are built by simply assembling hierarchically arranged query elements in a visual, interactive manner, and results are returned in real time. In doing so, the builder abstracts from the complexities of the low-level query language Gremlin that is used. The query builder is a central module of an open-source software system, currently under development, for managing and publishing graph-augmented lexicographic resources online.
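To illustrate how hierarchically arranged query elements could be mapped onto Gremlin, the following sketch compiles a nested dictionary into a traversal string. The dictionary layout, the vertex and edge labels (`Lemma`, `borrowedInto`) and the function names are hypothetical; only the Gremlin steps used (`V`, `hasLabel`, `has`, `where`, `out`) are standard. It is not the query builder described on the poster.

```python
def quote(value: str) -> str:
    """Render a Python string as a single-quoted Gremlin string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def compile_element(element: dict) -> str:
    """Compile one query element and, recursively, its child elements."""
    steps = [f"hasLabel({quote(element['label'])})"]
    # Property constraints on the current vertex.
    for key, value in sorted(element.get("has", {}).items()):
        steps.append(f"has({quote(key)}, {quote(value)})")
    # Each child becomes a filter on vertices reachable via an outgoing edge.
    for edge_label, child in element.get("out", []):
        steps.append(
            f"where(__.out({quote(edge_label)}).{compile_element(child)})"
        )
    return ".".join(steps)

def compile_query(root: dict) -> str:
    """Compile a whole query tree into a Gremlin traversal string."""
    return "g.V()." + compile_element(root)

# Hypothetical query: German lemmas that were borrowed into Polish.
query = {
    "label": "Lemma",
    "has": {"language": "de"},
    "out": [("borrowedInto", {"label": "Lemma", "has": {"language": "pl"}})],
}
gremlin = compile_query(query)
print(gremlin)
```

Generating a string keeps the sketch short; a production system would more likely pass user-supplied values as parameter bindings instead of interpolating them into the query text.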
This paper presents types and annotation layers of reply relations in computer-mediated communication (CMC). Reply relations hold between post units in CMC interactions and describe references from one given post to a previous post. We classify three types of reply relations in CMC interactions: first, technical replies, i.e. the possibility to reply directly to a previous post by clicking a ‘reply’ button; second, indentations, e.g. in wiki talk pages in which users insert their contributions in the existing talk page by indenting them; and third, interpretative reply relations, i.e. the reply action is not realised formally but signalled by other structural or linguistic means such as address markers ‘@’, greetings, citations and/or Q-A structures. We take a look at existing practices in the description and representation of such relations in corpora and examples of chat, Wikipedia talk pages, Twitter and blogs. We then provide an annotation proposal that combines the different levels of description and representation of reply relations and which adheres to the schemas and practices for encoding CMC corpus documents within the TEI framework as defined by the TEI CMC SIG. It constitutes a prerequisite for correctly identifying higher levels of interactional relations such as dialogue acts or discussion trees.
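How typed reply relations feed into a discussion tree can be sketched with a small data structure. The class, field names and example posts below are hypothetical and do not reproduce the TEI encoding proposed in the paper; only the three relation types come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# The three reply types named in the abstract.
REPLY_TYPES = {"technical", "indentation", "interpretative"}

@dataclass
class Post:
    post_id: str
    author: str
    reply_to: Optional[str] = None    # id of the post being replied to
    reply_type: Optional[str] = None  # one of REPLY_TYPES if reply_to is set

def build_tree(posts: List[Post]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return the root post ids and a parent-to-children mapping."""
    children: Dict[str, List[str]] = {p.post_id: [] for p in posts}
    roots: List[str] = []
    for p in posts:
        if p.reply_to is None:
            roots.append(p.post_id)
            continue
        if p.reply_type not in REPLY_TYPES:
            raise ValueError(f"unknown reply type: {p.reply_type!r}")
        if p.reply_to not in children:
            raise ValueError(f"reply to unknown post: {p.reply_to!r}")
        children[p.reply_to].append(p.post_id)
    return roots, children

# Hypothetical thread mixing the three reply types.
thread = [
    Post("p1", "anna"),
    Post("p2", "ben", reply_to="p1", reply_type="technical"),
    Post("p3", "cem", reply_to="p1", reply_type="indentation"),
    Post("p4", "anna", reply_to="p3", reply_type="interpretative"),
]
roots, children = build_tree(thread)
```

Keeping the relation type on each post lets later analyses treat formally marked and interpretative replies differently when building the tree.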
Classical null hypothesis significance tests are not appropriate in corpus linguistics, because the randomness assumption underlying these testing procedures is not fulfilled. Nevertheless, there are numerous scenarios where it would be beneficial to have some kind of test in order to judge the relevance of a result (e.g. a difference between two corpora) by answering the question whether the attribute of interest is pronounced enough to warrant the conclusion that it is substantial and not due to chance. In this paper, I outline such a test.
Lexicographic and lexical resources on German are developed at many different institutions. These include the Dudenverlag, which produces the most frequently consulted dictionaries of contemporary German with the printed dictionaries of the Duden series and with "Duden online", and the Union of the German Academies (Union deutscher Akademien), under whose umbrella numerous historical as well as synchronic dictionaries of German are compiled at various individual academies (e.g. the "Digitales Wörterbuch der deutschen Sprache", the "Wörterbuchnetz" and the planned information system of the new "Zentrum für digitale Lexikographie der deutschen Sprache"). At the Institut für Deutsche Sprache in Mannheim, too, scholarly vocabulary-related resources on German are developed and presented to the (specialist) public under the umbrella of OWID, the "Online-Wortschatz-Informationssystem Deutsch". Although in OWID we have concentrated on resources covering specialized areas of vocabulary, we reach users in a wide variety of countries around the world. We would like to take this opportunity to introduce ZGL readers to our resources in OWID and OWIDplus in more detail.
This paper discusses German neologisms in the so-called “new media” and presents a German corpus-based online dictionary of neologisms. Several neological morphemes and lexemes, as well as their meanings, are presented, showing that these new modes of communication are an important source of enrichment of the German lexicon.
On 7 and 8 September, the 4th section conference (Sektionentagung) of the Gesellschaft für Angewandte Linguistik (GAL) took place at the University of Basel. One of the 15 sections represented was Migration Linguistics, chaired by Peter Rosenberg (Frankfurt (Oder)) and Christoph Schroeder (Potsdam). The six talks in this section addressed ongoing projects on language work with refugees and their initial results, which are the subject of this report.
Since 2017, the German microcensus has included a question on the language of the population. The last language survey in a German census dates from 1939; accordingly, there are currently no meaningful language statistics for Germany. However, the new language question in the microcensus has considerable shortcomings; it was evidently designed as a proxy question for measuring cultural integration. This text discusses the questions and analyzes their first results. It then presents other variants of language questions, with particular attention to the exemplary language questions in the Canadian census. Finally, the language question of the IDS's Deutschland-Erhebung 2018 is presented together with its results; alongside the microcensus, the Deutschland-Erhebung 2018 is so far the only representative language survey in Germany.
In the first volume of Corpus Linguistics and Linguistic Theory, Gries (2005. Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1(2). doi:10.1515/cllt.2005.1.2.277. http://www.degruyter.com/view//cllt.2005.1.issue-2/cllt.2005.1.2.277/cllt.2005.1.2.277.xml: 285) asked whether corpus linguists should abandon null-hypothesis significance testing. In this paper, I want to revive this discussion by defending the argument that the assumptions that allow inferences about a given population – in this case about the studied languages – based on results observed in a sample – in this case a collection of naturally occurring language data – are not fulfilled. As a consequence, corpus linguists should indeed abandon null-hypothesis significance testing.
The recognizability of a stretch of conduct as social action depends on details of turn construction as well as the turn’s context. We examine details of turn construction as they enter into actions offering interpretations of prior talk. Such actions either initiate repair or formulate a conclusion from prior talk. We focus on how interpretation markers (das heißt [“that means”] vs. du meinst [“you mean”]) and interpretation formats (phrasal vs. clausal turn completions) each make their invariant contribution to specific interpreting practices. Interpretation marker and turn format go hand in hand, which leads to distinct patterns of interpreting practices: Das heißt+clause is especially apt for formulations, du meinst+phrase for repair. The results suggest that details of turn construction can systematically enter into the constitution of social action. Data are in German with English translation.
Combining data from different diachronic corpora poses particular methodological challenges, which are examined in the present studies. These include aligning metadata and their categorizations, the behavior of known phenomena across temporally overlapping corpora, and the formulation of comparable search queries. Drawing on six case studies of graphematic, lexical, morphological and syntactic phenomena in corpora of (Early) New High German, the studies work out the possibilities and problems of working diachronically across corpora.
Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian
(2019)
The paper presents the quite long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). The paper describes decisions behind the kinds of texts collected, as well as processing and annotation steps, highlighting the structure and importance of metadata to the corpus. The reader is also introduced to the three ways in which (s)he can plunge into the rich linguistic data of the corpus, waiting to be discovered. Besides querying the corpus, word embeddings extracted from it are useful to various natural language processing applications and for linguists, when user-friendly interfaces offer them the possibility to exploit the data.
Plädoyer für die Entwicklung einer digital-lexikografischen Kompetenz im Fremdsprachenunterricht
(2019)
The aim of this paper is to promote an explicit and active development of digital-lexicographical competence in foreign language teaching. The results of two online surveys conducted as part of the research project DICONALE-COMBIDIGILEX in connection with the teaching and learning process of German as a foreign language (= DaF) provide a comparative insight into the behaviour and attitude of both teachers and learners of DaF on the topic “Use of lexicographical resources in the process of DaF-acquisition”. The evaluation of the surveys shows that digital-lexicographical competences in the process of DaF acquisition must be promoted more intensively, since the existing lexicographic offer is not optimally used for teaching purposes by either teachers or learners. To this end, the following three main lexicographical competences will be examined from a methodological-didactic and application-oriented perspective: (i) adequate selection of the electronic resource regarding the communicative situation, (ii) development of disambiguation strategies for reception in L2 or translation from L2 and (iii) development of strategies for production and translation into L2. This research will ultimately lead to a debate on the use of the dictionary in the digital environment in DaF teaching and discuss its actual influence on the learning process.
This article investigates the use of überhaupt and sowieso in German and Dutch. These two words are frequently classified as particles, if only because of their pragmatic functions. The frequent use of particles is considered a specific trait common to German and Dutch, and the description of their semantics and pragmatics is notoriously difficult. It is unclear whether both particles have the same meaning in Dutch (where they are loanwords) and German, whether they can fulfil the same syntactic functions and to what extent the (semantic and pragmatic) functions of überhaupt and sowieso overlap. There has already been linguistic research on überhaupt and sowieso by Fisseni (2009) using the world-wide web and by Bruijnen and Sudhoff (2013) using the EUROPARL corpus. In the present study we critically evaluated the corpus study, integrating information on original utterance language and discussing the adequacy of this corpus. Moreover, we conducted an experimental survey collecting subjective-intuitive judgements in three dimensions, thus gathering more data on sparse and informal constructions.
By using these complementary methods, we obtain a more nuanced picture of the use of überhaupt and sowieso in both languages: On the one hand, the data show where the use of both words is more similar and on the other hand, differences between the languages can also be discerned.
Canadian heritage German across three generations: A diary-based study of language shift in action
(2019)
It is well known that migration has an effect on language use and language choice. If the language of origin is maintained after migration, it tends to change in the new contact setting. Often, migrants shift to the new majority language within a few generations. The current paper examines a diary corpus containing data from three generations of one German-Canadian family, ranging from 1867 to 1909, and covering the second to fourth generation after immigration. The paper analyzes changes that can be observed between the generations, with respect to the language system as well as to the individuals’ decisions on language choice. The data not only offer insight into the dynamics of acquiring a written register of a heritage language and the eventual shift to the majority language; they also allow us to identify different linguistic profiles of heritage speakers within one community. The paper discusses how these profiles can be linked to the individuals’ family backgrounds and how the combination of these backgrounds may have contributed to giving up the heritage language in favor of the majority language.
Linguistic relativists have traditionally asked 'how language influences thought', but conversation analysts and anthropological linguists have moved the focus from thought to social action. We argue that 'social action' should in this context not become simply a new dependent variable, because the formulation 'does language influence action' suggests that social action would already be meaningfully constituted prior to its local (verbal and multi-modal) accomplishment. We draw on work by the gestalt psychologist Karl Duncker to show that close attention to action-in-a-situation helps us ground empirical work on cross-cultural diversity in an appreciation of the invariances that make culture-specific elements of practice meaningful.