Refine
Year of publication
- 2019 (68) (remove)
Document Type
- Article (25)
- Conference Proceeding (21)
- Part of a Book (18)
- Book (2)
- Other (1)
- Part of Periodical (1)
Language
- English (68) (remove)
Is part of the Bibliography
- yes (68) (remove)
Keywords
- Deutsch (23)
- Korpus <Linguistik> (22)
- Automatische Sprachanalyse (8)
- Gesprochene Sprache (8)
- Annotation (5)
- Multimodalität (5)
- Rumänisch (5)
- Sprachstatistik (5)
- Computerlinguistik (4)
- Englisch (4)
Publicationstate
- Veröffentlichungsversion (31)
- Zweitveröffentlichung (28)
- Postprint (12)
Reviewstate
- Peer-Review (51)
- (Verlags)-Lektorat (8)
- (Verlags-)Lektorat (1)
- Peer review (1)
- Peer-review (1)
Publisher
- de Gruyter (8)
- German Society for Computational Linguistics & Language Technology und Friedrich-Alexander-Universität Erlangen-Nürnberg (6)
- Editura Academiei Române (5)
- Lexical Computing CZ s.r.o. (5)
- Elsevier (4)
- Routledge, Taylor & Francis Group (3)
- The Association for Computational Linguistics (3)
- De Gruyter Mouton (2)
- Leibniz-Institut für Deutsche Sprache (2)
- Zenodo (2)
A "polyglottal" speech synthesis - modifications for a replica of Kempelen's speaking machine
(2019)
This paper presents the prototype of a lexicographic resource for spoken German in interaction, which was conceived within the framework of the LeGeDe-project (LeGeDe=Lexik des gesprochenen Deutsch). First of all, it summarizes the theoretical and methodological approaches that were used for the initial planning of the resource. The headword candidates were selected by analyzing corpus-based data. Therefore, the data of two corpora (written and spoken German) were compared with quantitative methods. The information that was gathered on the selected headword candidates can be assigned to two different sections: meanings and functions in interaction.
Additionally, two studies on the expectations of future users towards the resource were carried out. The results of these two studies were also taken into account in the development of the prototype. Focusing on the presentation of the resource’s content, the paper shows both the different lexicographical information in selected dictionary entries, and the information offered by the provided hyperlinks and external texts. As a conclusion, it summarizes the most important innovative aspects that were specifically developed for the implementation of such a resource.
We present a descriptive analysis on the two datasets from the shared task on Source, Subjective Expression and Target Extraction from Political Speeches (STEPS), the only existing German dataset for opinion role extraction of its size. Our analysis discusses the individual properties of the three components, subjective expressions, sources and targets and their relations towards each other. Our observations should help practitioners and researchers when building a system to extract opinion roles from German data.
Classical null hypothesis significance tests are not appropriate in corpus linguistics, because the randomness assumption underlying these testing procedures is not fulfilled. Nevertheless, there are numerous scenarios where it would be beneficial to have some kind of test in order to judge the relevance of a result (e.g. a difference between two corpora) by answering the question whether the attribute of interest is pronounced enough to warrant the conclusion that it is substantial and not due to chance. In this paper, I outline such a test.
A Supervised learning approach for the extraction of opinion sources and targets from German text
(2019)
We present the first systematic supervised learning approach for the extraction of opinion sources and targets on German language data. A wide choice of different features is presented, particularly syntactic features and generalization features. We point out specific differences between opinion sources and targets. Moreover, we explain why implicit sources can be extracted even with fairly generic features. In order to ensure comparability our classifier is trained and tested on the dataset of the STEPS shared task.
The Lehnwortportal Deutsch (2012 seqq.) serves as an integrated online information system on German lexical borrowings into other languages, synthesizing an increasing number of lexicographical dictionaries and providing basic cross-resource search options. The paper discusses the far-reaching revision of the system’s conceptual, lexicographical and technological underpinnings currently under way, focussing on their relevance for multilingual loanword lexicography.
In the first volume of Corpus Linguistics and Linguistic Theory, Gries (2005. Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1(2). doi:10.1515/cllt.2005.1.2.277. http://www.degruyter.com/view//cllt.2005.1.issue-2/cllt.2005.1.2.277/cllt.2005.1.2.277.xml: 285) asked whether corpus linguists should abandon null-hypothesis significance testing. In this paper, I want to revive this discussion by defending the argument that the assumptions that allow inferences about a given population – in this case about the studied languages – based on results observed in a sample – in this case a collection of naturally occurring language data – are not fulfilled. As a consequence, corpus linguists should indeed abandon null-hypothesis significance testing.
Distributional models of word use constitute an indispensable tool in corpus based lexicological research for discovering paradigmatic relations and syntagmatic patterns (Belica et al. 2010). Recently, word embeddings (Mikolov et al. 2013) have revived the field by allowing to construct and analyze distributional models on very large corpora. This is accomplished by reducing the very high dimensionality of word cooccurrence contexts, the size of the vocabulary, to few dimensions, such as 100-200. However, word use and meaning can vary widely along dimensions such as domain, register, and time, and word embeddings tend to represent only the most prevalent meaning. In this paper we thus construct domain specific word embeddings to allow for systematically analyzing variations in word use. Moreover, we also demonstrate how to reconstruct domain specific co-occurrence contexts from the dense word embeddings.
This paper describes a rule-based approach to detect direct speech without the help of any quotation markers. As datasets fictional and non-fictional texts were used. Our evaluation shows that the results appear stable throughout different datasets in the fictional domain and are comparable to the results achieved in related work.
Canadian heritage German across three generations: A diary-based study of language shift in action
(2019)
It is well known that migration has an effect on language use and language choice. If the language of origin is maintained after migration, it tends to change in the new contact setting. Often, migrants shift to the new majority language within few generations. The current paper examines a diary corpus containing data from three generations of one German-Canadian family, ranging from 1867 to 1909, and covering the second to fourth generation after immigration. The paper analyzes changes that can be observed between the generations, with respect to the language system as well as to the individuals’ decision on language choice. The data not only offer insight into the dynamics of acquiring a written register of a heritage language, and the eventual shift to the majority language. They also allow us to identify different linguistic profiles of heritage speakers within one community. It is discussed how these profiles can be linked to the individuals’ family backgrounds and how the combination of these backgrounds may have contributed to giving up the heritage language in favor of the majority language.
We present web services implementing a workflow for transcripts of spoken language following TEI guidelines, in particular ISO 24624:2016 "Language resource management - Transcription of spoken language". The web services are available at our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.
Since 2013 representatives of several French and German CMC corpus projects have developed three customizations of the TEI-P5 standard for text encoding in order to adapt the encoding schema and models provided by the TEI to the structural peculiarities of CMC discourse. Based on the three schema versions, a 4th version has been created which takes into account the experiences from encoding our corpora and which is specifically designed for the submission of a feature request to the TEI council. On our poster we would present the structure of this schema and its relations (commonalities and differences) to the previous schemas.
Colonial studies
(2019)
The recognizability of a stretch of conduct as social action depends on details of turn construction as well as the turn’s context. We examine details of turn construction as they enter into actions offering interpretations of prior talk. Such actions either initiate repair or formulate a conclusion from prior talk. We focus on how interpretation markers (das heißt [“that means”] vs. du meinst [“you mean”]) and interpretation formats (phrasal vs. clausal turn completions) each make their invariant contribution to specific interpreting practices. Interpretation marker and turn format go hand in hand, which leads to distinct patterns of interpreting practices: Das heißt+clause is especially apt for formulations, du meinst+phrase for repair. The results suggest that details of turn construction can systematically enter into the constitution of social action. Data are in German with English translation.
In this paper, we present our work-inprogress to automatically identify free indirect representation (FI), a type of thought representation used in literary texts. With a deep learning approach using contextual string embeddings, we achieve f1 scores between 0.45 and 0.5 (sentence-based evaluation for the FI category) on two very different German corpora, a clear improvement on earlier attempts for this task. We show how consistently marked direct speech can help in this task. In our evaluation, we also consider human inter-annotator scores and thus address measures of certainty for this difficult phenomenon.
This article examines a recurrent format that speakers use for defining ordinary expressions or technical terms. Drawing on data from four different languages - Flemish, French, German, and Italian - it focuses on definitions in which a definiendum is first followed by a negative definitional component (‘definiendum is not X’), and then by a positive definitional component (‘definiendum is Y’). The analysis shows that by employing this format, speakers display sensitivity towards a potential meaning of the definiendum that recipients could have taken to be valid. By negating this meaning, speakers discard this possible, yet unintended understanding. The format serves three distinct interactional purposes: (a) it is used for argumentation, e.g. in discussions and political debates, (b) it works as a resource for imparting knowledge, e.g. in expert talk and instructions, and (c) it is employed, in ordinary conversation, for securing the addressee's correct understanding of a possibly problematic expression. The findings contribute to our understanding of how epistemic claims and displays relate to the turn-constructional and sequential organization of talk. They also show that the much quoted ‘problem of meaning’ is, first and foremost, a participant's problem.
We examine the new task of detecting derogatory compounds (e.g. curry muncher). Derogatory compounds are much more difficult to detect than derogatory unigrams (e.g. idiot) since they are more sparsely represented in lexical resources previously found effective for this task (e.g. Wiktionary). We propose an unsupervised classification approach that incorporates linguistic properties of compounds. It mostly depends on a simple distributional representation. We compare our approach against previously established methods proposed for extracting derogatory unigrams.
Automatic division of spoken language transcripts into sentence-like units is a challenging problem, caused by disfluencies, ungrammatical structures and the lack of punctuation. We present experiments on dividing up German spoken dialogues where we investigate the impact of task setup and data representation, encoding of context information as well as different model architectures for this task.
We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.
Question Answering Systems for retrieving information from Knowledge Graphs (KG) have become a major area of interest in recent years. Current systems search for words and entities but cannot search for grammatical phenomena. The purpose of this paper is to present our research on developing a QA System that answers natural language questions about German grammar.
Our goal is to build a KG which contains facts and rules about German grammar, and is also able to answer specific questions about a concrete grammatical issue. An overview of the current research in the topic of QA systems and ontology design is given and we show how we plan to construct the KG by integrating the data in the grammatical information system Grammis, hosted by the Leibniz-Institut für Deutsche Sprache (IDS). In this paper, we describe the construction of the initial KG, sketch our resulting graph, and demonstrate the effectiveness of such an approach. A grammar correction component will be part of a later stage. The paper concludes with the potential areas for future research.
Modern theoretical linguistics lives by the insight that the meanings of complex expressions derive from the meanings of their parts and the way these are composed. However, the currently dominating theories of the syntax-semantics interface hastily relegate important aspects of meaning which cannot readily be aligned with visible structure to empty projecting heads non-reductively (mainstream Generative Grammar) or to the syntactic construction holistically (Construction Grammar). This book develops an alternative, compositional analysis of the hidden aspectual-temporal, modal and comparative meanings of a range of productive constructions of which pseudorefl exive, excessive and directional complement constructions take center stage. Accordingly, a contradiction-inducing hence semantically problematic part of literally coded meaning is locally ignored and systematically realized „expatriately“ with respect to parts of structure that achieve the indexical anchoring of propositional contents in terms of times, worlds and standards of comparison, thus yielding the observed hidden meanings.
Linguistic relativists have traditionally asked 'how language influences thought', but conversation analysts and anthropological linguists have moved the focus from thought to social action. We argue that 'social action' should in this context not become simply a new dependent variable, because the formulation 'does language influence action' suggests that social action would already be meaningfully constituted prior to its local (verbal and multi-modal) accomplishment. We draw on work by the gestalt psychologist Karl Duncker to show that close attention to action-in-a-situation helps us ground empirical work on cross-cultural diversity in an appreciation of the invariances that make culture-specific elements of practice meaningful.
This paper investigates emergent pseudo-coordination in spoken German. In a corpus-based study, seven verbs in the first conjunct are analyzed regarding the degree of semantic bleaching and the development of subjective or aspectual meaning components. Moreover, it is shown that each verb shows distinct tendencies for co-ocurrences, especially with deictic adverbs in the first conjunct and with specific verbs and verb classes in the second conjunct. It is argued that pseudo-coordination is originally motivated by the need for ‘chunking’ in unplanned speech and that it is still prominently used in this function in German, in contrast to languages in which pseudo-coordination is grammaticalized further.
Starting from early approaches within Generative Grammar in the late 1960s, the article describes and discusses the development of different theoretical frameworks of lexical decomposition of verbs. It presents the major subsequent conceptions of lexical decompositions, namely, Dowty’s approach to lexical decomposition within Montague Semantics, Jackendoff’s Conceptual Semantics, the LCS decompositions emerging from the MIT Lexicon Project, Pustejovsky’s Event Structure Theory, Wierzbicka’s Natural Semantic Metalanguage, Wunderlich’s Lexical Decompositional Grammar, Hale and Kayser’s Lexical Relational Structures, and Distributed Morphology. For each of these approaches, (i) it sketches their origins and motivation, (ii) it describes the general structure of decompositions and their location within the theory, (iii) it explores their explanative value for major phenomena of verb semantics and syntax, (iv) and it briefly evaluates the impact of the theory. Referring to discussions in article 7 [Semantics: Foundations, History and Methods] (Engelberg) Lexical decomposition, a number of theoretical topics are taken up throughout the paper concerning the interpretation of decompositions, the basic inventory of decompositional predicates, the location of decompositions on the different levels of linguistic representation (syntactic, semantic, conceptual), and the role they play for the interfaces between these levels.
This paper discusses German neologisms in the so-called “new-media” and presents a German corpus-based online dictionary of neologisms. Several neological morphemes and lexemes, as well as their meaning will be presented, showing that these new modes of communication are an important source of enrichment of German lexicon.
Novel formats of construction-based description hold great potential for phenomena that fall through the cracks in traditional kinds of linguistic reference works. On the example of German verb argument structure constructions with a prepositional object, we demonstrate that a construction-based description of such phenomena is superior to existing lexicographic and grammaticographic treatments, but that it also poses a number of new problems. The most fundamental of these relates to the fact that construction-based analyses can be proposed on different levels of abstraction. We illustrate pertinent problems relating to the precise identification of constructional form and meaning and suggest a multi-layered descriptive format for web-based electronic reference constructica that can accommodate these challenges. Semantically, the proposed solution integrates both lumping and splitting perspectives on constructional grain size and permits users to flexibly zoom in and out on individual elements in the resource. Formally, it can capture variation in the number and marking of realised arguments as found in e.g. passives and transitivity alternations. Aspects of the theoretical controversy between Construction Grammar and Valency Theory are addressed where relevant, but our focus is on questions of description and the practical implementation of construction-based analyses in a suitable type of linguistic reference work.
Smooth turn-taking in conversation depends in part on speakers being able to communicate their intention to hold or cede the floor. Both prosodic and gestural cues have been shown to be used in this context. We investigate the interplay of pitch movements and hand gestures at locations at which speaker change becomes relevant, comparing their use in German and Swedish. We find that there are some shared functions of prosody and gesture with regard to turn-taking in the two languages, but that these shared functions appear to be mediated by the different phonological demands on pitch in the two languages.
The present paper examines a variety of ways in which the Corpus of Contemporary Romanian Language (CoRoLa) can be used. A multitude of examples intends to highlight a wide range of interrogation possibilities that CoRoLa opens for different types of users. The querying of CoRoLa displayed here is supported by the KorAP frontend, through the querying language Poliqarp. Interrogations address annotation layers, such as the lexical, morphological and, in the near future, the syntactical layer, as well as the metadata. Other issues discussed are how to build a virtual corpus, how to deal with errors, how to find expressions and how to identify expressions.
This paper asks whether and in which ways managing coordination tasks in traffic involve the accomplishment of intersubjectivity. Taking instances of coordinating passing an obstacle with oncoming traffic as the empirical case, four different practices were found.
1. Intersubjectivity can be presupposed by expecting others to stick to the traffic code and other mutually shared expectations.
2. Intersubjective solutions emerge step by step by mutual responsive-anticipatory adaptation of driving decisions.
3. Intersubjectivity can be accomplished by explicit interactive negotiation of passages.
4. Coordination problems can be solved without relying on intersubjectivity by unilateral, responsive-anticipatory adaptation to others’ behaviors.
Introduction
(2019)
Language attitudes matter; they influence people’s behaviour and decisions. Therefore, it is crucial to learn more about patterns in the way that languages are evaluated. One means of doing so is using a quantitative approach with data representative of a whole population, so that results mirror dispositions at a societal level. This kind of approach is adopted here, with a focus on the situation in Germany. The article consists of two parts. First, I will present some results of a new representative survey on language attitudes in Germany (the Germany Survey 2017). Second, I will show how language attitudes penetrate even seemingly objective data collection processes by examining the German Microcensus. In 2017, for the first time in eighty years, the German Microcensus included a question on language use ‘at home’. Unfortunately, however, the question was clearly tainted by language attitudes instead of being objective. As a result, the Microcensus significantly misrepresents the linguistic reality of different migrant languages spoken in Germany.
Large-scale empirical evidence indicates a fascinating statistical relationship between the estimated number of language users and its linguistic and statistical structure. In this context, the linguistic niche hypothesis argues that this relationship reflects a negative selection against morphological paradigms that are hard to learn for adults, because languages with a large number of speakers are assumed to be typically spoken and learned by greater proportions of adults. In this paper, this conjecture is tested empirically for more than 2000 languages. The results question the idea of the impact of non-native speakers on the grammatical and statistical structure of languages, as it is demonstrated that the relative proportion of non-native speakers does not significantly correlate with either morphological or information-theoretic complexity. While it thus seems that large numbers of adult learners/speakers do not affect the (grammatical or statistical) structure of a language, the results suggest that there is indeed a relationship between the number of speakers and (especially) information-theoretic complexity, i.e. entropy rates. A potential explanation for the observed relationship is discussed.
Theories of lexical decomposition assume that lexical meanings are complex. This complexity is expressed in structured meaning representations that usually consist of predicates, arguments, operators, and other elements of propositional and predicate logic. Lexical decomposition has been used to explain phenomena such as argument linking, selectional restrictions, lexical-semantic relations, scope ambiguities, and the inference behavior of lexical items. The article sketches the early theoretical development from noun-oriented semantic feature theories to verb-oriented complex decompositions. It also deals with a number of theoretical issues, including the controversy between decompositional and atomistic approaches to meaning, the search for semantic primitives, the function of decompositions as definitions, problems concerning the interpretability of decompositions, and the debate about the cognitive status of decompositions.
Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian
(2019)
The paper presents the quite long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). The paper describes decisions behind the kinds of texts collected, as well as processing and annotation steps, highlighting the structure and importance of metadata to the corpus. The reader is also introduced to the three ways in which (s)he can plunge into the rich linguistic data of the corpus, waiting to be discovered. Besides querying the corpus, word embeddings extracted from it are useful to various natural language processing applications and for linguists, when user-friendly interfaces offer them the possibility to exploit the data.
This paper presents first steps towards metaphor detection in German poetry, in particular in expressionist poems. We create a dataset with adjective-noun pairs extracted from expressionist poems, manually annotated for metaphoricity. We discuss the annotation process and present models and experiments for metaphor detection where we investigate the impact of context and the domain dependence of the models.
A large database is a desirable basis for multimodal analysis. The development of more elaborate methods, data banks, and tools for a stronger empirical grounding of multimodal analysis is a prevailing topic within multimodality. Prereq- uisite for this are corpora for multimodal data. Our contribution aims at developing a proposal for gathering and building multimodal corpora of audio-visual social media data, predominantly YouTube data.Our contribution has two parts: First we outline a participation framework which is able to represent the complexity of YouTube communication. To this end we ‘dissect’ the different communicative and multimodal layers YouTube consists of. Besides the Video performance YouTube also integrates comments, social media operators, commercials, and announcements for further YouTube Videos. The data consists of various media and modes and is interactively engaged in various discourses. Hence, it is rather difficult to decide what can be considered as a basic communicative unit (or a ‘turn’) and how it can be mapped. Another decision to be made is which elements are of higher priority than others, thus have to be integrated in an adequate transcription format. We illustrate our conceptual considerations on the example of so-called L e t’s Plays, which are supposed to present and comment Computer gaming processes.The second part is devoted to corpus building. Most previous studies either worked with ad hoc data samples or outlined data mining and data sampling strategies. Our main aim is to delineate in a systematic way and based on the conceptual outline in the first part necessary elements which should be part of a YouTube corpus. To this end we describe in a first Step which components (e.g., the Video itself, the comments, the metadata, etc.) should be captured. ln a second Step we outline why and which relations (e.g., screen appearances, hypertextual struc- tures, etc.) are worth to get part of the corpus. In sum, our contribution aims at outlining a proposal for gathering and systematizing multimodal data, specifically audio-visual social media data, in a corpus derived from a conceptual modeling of important communicative processes of the research object itself.
Naming and titling have been discussed in sociolinguistics as markers of status or solidarity. However, these functions have not been studied on a larger scale or for social media data. We collect a corpus of tweets mentioning presidents of six G20 countries by various naming forms. We show that naming variation relates to stance towards the president in a way that is suggestive of a framing effect mediated by respectfulness. This confirms sociolinguistic theory of naming and titling as markers of status.
We present an approach for automatic detection and correction of OCR-induced misspellings in historical texts. The main objective is the post-correction of the digitized Royal Society Corpus, a set of historical documents from 1665 to 1869. Due to the aged material the OCR procedure has made mistakes, thus leading to files corrupted by thousands of misspellings. This motivates a post processing step. The current correction technique is a pattern-based approach which due to its lack of generalization suffers from bad recall.
To generalize from the patterns we propose to use the noisy channel model. From the pattern based substitutions we train a corpus specific error model complemented with a language model. With an F1-Score of 0.61 the presented technique significantly outperforms the pattern based approach which has an F1-score of 0.28. Due to its more accurate error model it also outperforms other implementations of the noisy channel model.
How do people communicate in mobile settings of interaction? How does mobility affect the way we speak? How does mobility exert influence on the manner in which talk itself is consequential for how we move in space? Recently, questions of this sort have attracted increasing attention in the human and social sciences. This Special Issue contributes to the emerging body of studies on mobility and talk by inspecting an ordinary and ubiquitous phenomenon in which communication among mobile participants is paramount: participation in traffic. This editorial presents previous work on mobility in natural settings, as carried out by interactionally oriented researchers. It also shows how the investigation into traffic participation adds new perspectives to research on language and communication.
This investigation targets a syntactic phenomenon of German which is commonly referred to as the absentive construction. The absentive is considered a universal grammatical category denoting absence. Its syntax is characterised by the occurrence of an auxiliary or copula verb accompanied by a non‐finite VP containing a main verb. The expression of absence, predicated over the clausal subject, is assumed to be based on a constructional meaning. Reviewing a wide range of syntactic and interpretive properties of this structure in German, we will demonstrate that certain empirical claims about the construction are not well founded and that its seemingly idiosyncratic properties are indeed available for compositional analyses. We will propose a structural analysis of its core syntactic and interpretive properties: The predication expresses the localisation of the subject at the location of the event, denoted by the infinitival verb. The interpretation of absence, then, can be explained by an implicature.
The article starts by outlining the theoretical and conceptual foundations in the field of multimodal interaction analysis, which, based on its spatiallinguistic orientation, deals with the meaning of space for the constitution of social meaning. Conceptually, we refer to the ideas of architecture-forinteraction and social topography. Empirically, we look towards the entire range of visually perceptible physical expressions of the Communion participants. We also focus on the spatial prerequisites and the space-related knowledge of the visitors, which becomes evident in their situational behaviour. From our point of view, Communion is not only a ritual in worship but also a task of coordination and positioning. We analyse video excerpts of two Communions in Lutheran-Protestant worship. The central question is: How do the people who hand out the sacrament to the participants take part in the procedure themselves (self-supply)? The video excerpts are from Germany(Rimbach and Zotzenbach, South Hesse). We see self-supply as a situational reproduction of institutional structures and relevancies. Methodologically, we first analyse an example in detail, in which we elaborate constitutive aspects of self-supply and the associated implications in the sense of an arising communitisation of the faithful. The subsequent analysis is carried out from a comparative perspective with reference to the results already obtained. The analyses lead to two basic models. Firstly, we identified a two-phase model in which first the churchgoers and then separately the institution’s representatives celebrate Communion. Structurally linked to this model is the is the diverging presence of those who have already completed the ritual, divergence resulting in two ensembles with their respective interaction space. The churchgoers watch the pastor and his assistants perform the ritual themselves. Secondly, we were able to formulate an integrative model in which the pastor celebrates Communion as one of the community. This preserves cohesion among all churchgoers and there is no ritual display of the institution’s representatives as in the two-phase model. As for model-shaping factors, two aspects become particularly clear: The first are the opportunities which the architecture-forinteraction, i.e. the concrete space for the Communion, provides. The second is the number of participants who perform the ceremony under these spatial conditions. Both aspects have a direct impact on the organisation of Communion, the movement within the church space and, indirectly, on the structure and implications of self-supply.
We present the second edition of the GermEval Shared Task on the Identification of Offensive Language. This shared task deals with the classification of German tweets from Twitter. Two subtasks were continued from the first edition, namely a coarse-grained binary classification task and a fine-grained multi-class classification task. As a novel subtask, we introduce the classification of offensive tweets as explicit or implicit.
The shared task had 13 participating groups submitting 28 runs for the coarse-grained
task, another 28 runs for the fine-grained task, and 17 runs for the implicit-explicit
task.
We evaluate the results of the systems submitted to the shared task. The shared task homepage can be found at https://projects.fzai.h-da.de/iggsa/
Defining groups and affiliating the self and the other with specific social categories is an important part of constructing a colonial conceptualization of societies. Many written documents from the colonial period attest to this practice. The current paper focuses on missionaries’ ways of positioning themselves and others within the colonial context. The German speaking Rheinische Missionsgesellschaft (RMG, Rhenish Mission Society) established mission stations in the Astrolabe Bay area of New Guinea, an area that was under German domination between 1884 and 1914. The paper analyzes how RMG missionaries, by means of language, construct, define, and position different population groups, and it investigates what patterns emerge from these language practices.
Contents:
1. Johannes Graën, Tannon Kew, Anastassia Shaitarova and Martin Volk, "Modelling Large Parallel Corpora", S. 1-8
2. Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary, "Asynchronous Pipelines for Processing Huge Corpora on Medium to Low Resource Infrastructures", S. 9-16
3. Vladimír Benko, "Deduplication in Large Web Corpora", S. 17-22
4. Mark Davies, "The best of both worlds: Multi-billion word “dynamic” corpora", S. 23-28
5. Adrien Barbaresi, "On the need for domain-focused web corpora", S. 29-32
6. Marc Kupietz, Eliza Margaretha, Nils Diewald, Harald Lüngen and Peter Fankhauser, "What's New in EuReCo? Interoperability, Comparable Corpora, Licensing", S. 33-39
Agreement between the verb and its arguments as a predominant phenomenon in language has received major attention in the theoretical literature. One specific aspect under discussion concerns differences between number and person agreement, with the latter being the more restricted one (restricted by Baker’s 2008 SCOPA, by variants of the Person Licensing Condition of Béjar & Rezac 2003, or by multiple agreement see Schütze 2003; Ackema & Neeleman 2018). In this paper we address the restrictions on person agreement with a nominative noun phrase in a low position by investigating a relatively little-discussed configuration, namely specificational copular constructions in Dutch such as dat de inspiratie voor deze roman niet jij %bent / ??is. We provide data from both a production and a rating study comparing 3/2 person agreement and show that what initially looks like a “person effect” in Dutch turns out to be a pronoun effect.
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
(2019)
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
The DRuKoLA project
(2019)
DRuKoLA, the accompanying project in the making of the Corpus of Romanian Language, is a cooperation between German and Romanian computer scientists, corpus linguists and linguists, aiming at linking reference corpora of European languages under one corpus analysis tool able to manage big data. KorAP, the analysis tool developed at the Leibniz Institute for the German Language (Mannheim), is being tailored for the Romanian language in a first attempt to reunite reference corpora under the EuReCo initiative, detailed in this paper. The paper describes the necessary steps of harmonization within KorAP and the corpus of Romanian language and discusses, as one important goal of this project, criteria and ways to build virtual comparable corpora to be used for contrastive linguistic analyses.
The user interfaces for corpus analysis platforms must provide a high degree of accessibility for ordinary users and at the same time provide the possibility to answer complex research questions. In this paper, we present the design concepts behind the user interface of KorAP, a corpus analysis platform that has evolved into the main gateway to CoRoLa, the Reference Corpus of Contemporary Romanian Language. Based on established principles of user interface design, we show how KorAP addresses the challenge of providing a user-friendly interface for heterogeneous corpus data to a wide range of users with different research questions.
Based on conference reports and minutes, archive material and official documents, the article seeks to explore the way in which the promotion of women’s sports and of women in leadership positions became an important part of the sport policy of two major organizations involved in European sport cooperation: the Council of Europe and the European Sport Conference. During first and modest discussions in the 1960s and 1970s it constituted a rather paternalistic project. Also, it was based on the assumption of an essential difference between men and women concerning the need for participation in sport. This only changed since the beginning of the 1980s when women took the course in their own hands, challenged the underlying assumptions and created new networks of cooperation.
In an earlier publication it was claimed that there is no useful relationship between Swahili-English dictionary look-up frequencies and the occurrence frequencies for the same wordforms in Swahili-English corpora, at least not beyond the top few thousand wordforms. This result was challenged using data for German by a different team of researchers using an improved methodology. In the present article the original Swahili-English data is revisited, using ten years’ worth of it rather than just two, and using the improved methodology. We conclude that there is indeed a positive relationship. In addition, we show that online dictionary look-up behaviour is remarkably similar across languages, even when, as in our case, one is dealing with languages from very dissimilar language families. Furthermore, online dictionaries turn out to have minimum look-up success rates, below which they simply cannot go. These minima are language-sensitive and vary depending on the regularity of the searched-for entries, but are otherwise constant no matter the size of randomly sampled dictionaries. Corpus-informed sampling always improves on any random method. Lastly, from the point of view of the graphical user interface, we argue that the average user of an online bilingual dictionary is better served with a single search box, rather than separate search boxes for each dictionary side.
This paper discusses new perspectives for a usage-based paremiology from a corpus-linguistic point of view. Using the example of proverb patterns, it shows different degrees of fixedness and proverb quality in German-English contrast. An interesting insight is that proverb similarities and differences can also be described by restrictions of semi-abstract schemes.
Looking at gestures as a means for communication, they can serve conversational participants at several levels. As co-speech gestures, they can add information to the verbally expressed content and they can serve to manage turn-taking. In order to look closer at the interplay between these resources in face-to face conversation, we annotated hand gestures, syntactic completion points and the related turn-organisation, and measured the timing of gesture strokes and their lexical/phrasal referent. In a case study on German, we observe the trend that speakers vary less in gesturelexis on- and offsets when keeping the turn after syntactic completions than at speaker changes, backchannel or other locations of a conversation. This indicates that timing properties of non-verbal cues interact with verbal cues to manage turn-taking.
This paper presents types and annotation layers of reply relations in computer- mediated communication (CMC). Reply relations hold between post units in CMC interactions and describe references from one given post to a previous post. We classify three types of reply relations in CMC interactions: first, technical replies, i. e. the possibility to reply directly to a previous post by clicking a ‘reply’ button; second, indentations, e. g. in wiki talk pages in which users insert their contributions in the existing talk page by indenting them and third, interpretative reply relations, i. e. the reply action is not realised formally but signalled by other structural or linguistics means such as address markers ‘@’, greetings, citations and/or Q-A structures. We take a look at existing practices in the description and representation of such relations in corpora and examples of chat, Wikipedia talk pages, Twitter and blogs. We then provide an annotation proposal that combines the different levels of description and representation of reply relations and which adheres to the schemas and practices for encoding CMC corpus documents within the TEI framework as defined by the TEI CMC SIG. It constitutes a prerequisite for correctly identifying higher levels of interactional relations such as dialogue acts or discussion trees.
In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a loss-less standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first step of processing is the conversion of the data into a certain subset of TEI following the ISO standard ’Transcription of spoken language’ with the help of an XSL transformation. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step is the conversion of the ISO/TEI files to a JSON format used by the “Tsakorpus” search platform. This step allows us to make the corpora available through a web-based search interface. As an addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future.
In this paper, we investigate the temporal interpretation of propositional attitude complement clauses in four typologically unrelated languages: Washo (language isolate), Medumba (Niger-Congo), Hausa (Afro-Asiatic), and Samoan (Austronesian). Of these languages, Washo and Medumba are optional-tense languages, while Hausa and Samoan are tenseless. Just like in obligatory-tense languages, we observe variation among these languages when it comes to the availability of so-called simultaneous and backward-shifted readings of complement clauses. For our optional-tense languages, we argue that a Sequence of Tense parameter is active in these languages, just as in obligatory-tense languages. However, for completely tenseless clauses, we need something more. We argue that there is variation in the degree to which languages make recourse to res-movement, or a similar mechanism that manipulates LF structures to derive backward-shifted readings in tenseless complement clauses. We additionally appeal to cross-linguistic variation in the lexical semantics of perfective aspect to derive or block certain readings. The result is that the typological classification of a language as tensed, optionally tensed, or tenseless, does not alone determine the temporal interpretation possibilities for complement clauses. Rather, structural parameters of variation cross-cut these broad classes of languages to deliver the observed cross-linguistic picture.
In this paper, we present WebAnno-MM, an extension of the popular web-based annotation tool WebAnno, which is designed for the linguistic annotation of transcribed spoken data with time aligned media files. Several new features have been implemented for our current use case: a novel teaching method based on pair-wise manual annotation of transcribed video data and systematic comparison of agreement between students. To enable the annotation of transcribed spoken language data, apart from technical and data model related challenges, WebAnno-MM offers an additional view to data: a (musical) score view for the inspection of parallel utterances, which is relevant for various methodological research questions regarding the analysis of interactions of spoken content.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition we report about advancements in the EuReCo- and ICC-initiatives with the provision of comparable corpora, and about recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.