The present paper examines the rise and fall of Modern High German loanwords in English from 1600 until 2000, principally making use of the record of borrowing documented by the Oxford English Dictionary (OED) in its Third Edition (online version, in revision 2000-). Groups of loanwords are analysed by century, with reference to the changing social and cultural landscape characterising relationships between the relevant nations over this period. This is not a simple picture: each language grows over the period in different ways, and the speakers of English look to German at different times for different types of borrowing, as the political and intellectual balance alters.
In this paper, we discuss to what extent the German-based contact language Unserdeutsch (Rabaul Creole German, cf. Volker 1982) matches the category ‘creole language’ from both a socio-historical and structural perspective. As a point of reference, we will use typological criteria that are widely supposed to be typical for creole languages. It is shown that Unserdeutsch fits fairly well into the pattern of an ‘average creole’, as has been suggested by data in the Atlas of Pidgin and Creole Language Structures (Michaelis et al. 2013). This is despite a series of atypical conditions in its development that might lead us to expect a close structural proximity to the lexifier language, i.e. a relatively acrolectal creole. A possible explanation for this striking discrepancy can be found in the primary function of Unserdeutsch as a marker of identity as well as in the linguistic structure of its substrate language Tok Pisin.
This paper provides insights into the ongoing international research project Unserdeutsch (Rabaul Creole German): Documentation of a highly endangered creole language in Papua New Guinea, based at the University of Augsburg, Germany. It elaborates on the different stages of the project, ranging from fieldwork to corpus development, thereby outlining the methods and software background used for the intended purposes. In doing so, we also give some approaches to solving specific problems, which have arisen in the course of practical work until now.
Basic grammatical categories may carry social meanings irrespective of their semantic content. In a set of four studies, we demonstrate that verbs—a basic linguistic category present and distinguishable in most languages—are related to the perception of agency, a fundamental dimension of social perception. In an archival analysis of actual language use in Polish and German, we found that targets stereotypically associated with high agency (men and young people) are presented in the immediate neighborhood of a verb more often than non-agentic social targets (women and older people). Moreover, in three experiments using a pseudo-word paradigm, verbs (but not adjectives and nouns) were consistently associated with agency (but not with communion). These results provide consistent evidence that verbs, as grammatical vehicles of action, are linguistic markers of agency. In demonstrating meta-semantic effects of language, these studies corroborate the view of language as a social tool and an integral part of social perception.
This paper reports on current practice in a staged approach to introducing NLP principles and techniques to students of information science (IIM) and of international communication and translation (ICT) as part of their curricula. As most of these students are unfamiliar with computer science or, in the case of IIM students, linguistics, we see them as comparable with students of the humanities. We follow a blended learning strategy with lectures, online materials, tutorials, and screencasts. In the first two terms, we focus on linguistics and its formalisation; NLP tools and applications are then introduced from the third term on. The lectures are combined with tutorials and, since the summer term 2017, with a set of screencasts.
This chapter investigates policies which shape the role of the German language in contemporary Estonia. Whereas German for many centuries played an important role as the language of the economic and cultural elite in Estonia, it severely declined in importance throughout the twentieth century. Against this historical background, the paper provides an overview of the current functions of German and attitudes towards it, and it discusses how these functions and attitudes are influenced by policies of various actors from inside and outside Estonia. The paper argues that German continues to play a significant role: while German is no longer a lingua franca, it still enjoys a number of functions and prestige in clearly defined niches involving communication within German-speaking circles or between Estonians and Germans. The interplay of language policies of the Estonian and the German-speaking states as well as of semi-state and private institutions succeeds in maintaining German as an additional language in contemporary Estonia.
In this paper we present work on developing a computerized grammar for the Latin language. The paper demonstrates the principles and challenges of developing a grammar for a natural language in a modern grammar formalism. The grammar presented here provides a useful resource for natural language processing applications in different fields. It can easily be adapted for language learning and for use in language technology for Cultural Heritage, such as translation applications or support for post-correction of digitized documents.
We present a supervised machine learning AND system which tackles semantic similarity between publication titles by means of word embeddings. Word embeddings are integrated as external components, which keeps the model small and efficient, while allowing for easy extensibility and domain adaptation. Initial experiments show that word embeddings can improve the Recall and F score of the binary classification sub-task of AND. Results for the clustering sub-task are less clear, but also promising and overall show the feasibility of the approach.
Data sets of publication metadata with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements for future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.
In conversation, turn-taking is usually fluid, with next speakers taking their turn right after the end of the previous turn. Most, but not all, previous studies show that next speakers start to plan their turn early, if possible already during the incoming turn. The present study makes use of the list-completion paradigm (Barthel et al., 2016), analyzing speech onset latencies and eye-movements of participants in a task-oriented dialogue with a confederate. The measures are used to disentangle the contributions to the timing of turn-taking of early planning of content on the one hand and initiation of articulation as a reaction to the upcoming turn-end on the other hand. Participants named objects visible on their computer screen in response to utterances that did, or did not, contain lexical and prosodic cues to the end of the incoming turn. In the presence of an early lexical cue, participants showed earlier gaze shifts toward the target objects and responded faster than in its absence, whereas the presence of a late intonational cue only led to faster response times and did not affect the timing of participants' eye movements. The results show that with a combination of eye-movement and turn-transition time measures it is possible to tease apart the effects of early planning and response initiation on turn timing. They are consistent with models of turn-taking that assume that next speakers (a) start planning their response as soon as the incoming turn's message can be understood and (b) monitor the incoming turn for cues to turn-completion so as to initiate their response when turn-transition becomes relevant.
Telicity and agentivity are semantic factors that split intransitive verbs into (at least two) different classes. Clear-cut unergative verbs, which select the auxiliary HAVE, are assumed to be atelic and agent-selecting; unequivocally unaccusative verbs, which select the auxiliary BE, are analyzed as telic and patient-selecting. Thus, agentivity and telicity are assumed to be inversely correlated in split intransitivity. We will present semantic and experimental evidence from German and Mandarin Chinese that casts doubts on this widely held assumption. The focus of our experimental investigation lies on variation with respect to agentivity (specifically motion control, manipulated via animacy), telicity (tested via a locative vs. goal adverbial), and BE/HAVE-selection with semantically flexible intransitive verbs of motion. Our experimental methods are acceptability ratings for German and Chinese (Experiments 1 and 2) and event-related potential (ERP) measures for German (Experiment 3). Our findings contradict the above-mentioned assumption that agentivity and telicity are generally inversely correlated and suggest that for the verbs under study, agentivity and telicity harmonize with each other. Furthermore, the ERP measures reveal that the impact of the interaction under discussion is more pronounced on the verb lexeme than on the auxiliary. We also found differences between Chinese and German that relate to the influence of telicity on BE/HAVE-selection. They seem to confirm the claim in previous research that the weight of the telicity factor locomotion (or internal motion) is cross-linguistically variable.
We present an event-related potentials (ERP) study that addresses the question of how pieces of information pertaining to semantic roles and event structure interact with each other and with the verb’s meaning. Specifically, our study investigates German verb-final clauses with verbs of motion such as fliegen ‘fly’ and schweben ‘float, hover,’ which are indeterminate with respect to agentivity and event structure. Agentivity was tested by manipulating the animacy of the subject noun phrase and event structure by selecting a goal adverbial, which makes the event telic, or a locative adverbial, which leads to an atelic reading. On the clause-initial subject, inanimates evoked an N400 effect vis-à-vis animates. On the adverbial phrase in the atelic (locative) condition, inanimates showed an N400 in comparison to animates. The telic (goal) condition exhibited an amplitude similar to that of the inanimate-atelic condition. Finally, at the verbal lexeme, the inanimate condition elicited an N400 effect against the animate condition in the telic (goal) contexts. In the atelic (locative) condition, items with animates evoked an N400 effect compared to inanimates. The combined set of findings suggests that clause-initial animacy is not sufficient for agent identification in German, which seems to be completed only at the verbal lexeme in our experiment. Here non-agents (inanimates) changing their location in a goal-directed way and agents (animates) lacking this property are dispreferred, and this challenges the assumption that change of (locational) state is generally a defining characteristic of the patient role. Besides this main finding that sheds new light on role prototypicality, our data seem to indicate effects that, in our view, are related to complexity, i.e., minimality. Inanimate subjects or goal arguments increase processing costs since they have role or event structure restrictions that animate subjects or locative modifiers lack.
Syntactic theory has tended to vacillate between implausible methodological extremes. Some linguists hold that our theories are accountable solely for the corpus of attested utterances; others assume our subject matter is unobservable intuitive feelings about sentences. Both extremes should be rejected. The subject matter of syntax is neither past utterance production nor the functioning of inaccessible mental machinery; it is normative - a system of tacitly grasped constraints defining correctness of structure. There are interesting parallels between syntactic and moral systems, modulo the key difference that linguistic systems are diverse whereas morality is universal. The appropriate epistemology for justifying formulations of normative systems is familiar in philosophy: it is known as the method of reflective equilibrium.
The possibilities of re-use and archiving of spoken and written corpora are affected by personality rights (depending on legal tradition also called: the right of publicity), copyright law and data protection / privacy laws. These recommendations include information about legal aspects which should be considered while creating corpora to ensure the greatest archivability and re-usability possible in compliance with current laws.
The information compiled here is intended as a guideline for researchers who plan to create corpora or who are involved in the evaluation of such measures. This information is not exhaustive and is not to be considered legal advice. Researchers should consult institutional legal departments and management before making legally relevant decisions. That said, further legal expertise should be sought, if possible, as early as the project planning phase.
When appearance does not match accent: neural correlates of ethnicity-related expectancy violations
(2017)
Most research on ethnicity in neuroscience and social psychology has focused on visual cues. However, accents are central social markers of ethnicity and strongly influence evaluations of others. Here, we examine how varying auditory (vocal accent) and visual (facial appearance) information about others affects neural correlates of ethnicity-related expectancy violations. Participants listened to standard German and Turkish-accented speakers and were subsequently presented with faces whose ethnic appearance was either congruent or incongruent to these voices. We expected that incongruent targets (e.g. German accent/Turkish face) would be paralleled by a more negative N2 event-related brain potential (ERP) component. Results confirmed this, suggesting that incongruence was related to more effortful processing of both Turkish and German target faces. These targets were also subjectively judged as surprising. Additionally, varying lateralization of ERP responses for Turkish and German faces suggests that the underlying neural generators differ, potentially reflecting different emotional reactions to these targets. Behavioral responses showed an effect of violated expectations: German-accented Turkish-looking targets were evaluated as most competent of all targets. We suggest that bringing together neural and behavioral measures of expectancy violations, and using both visual and auditory information, yields a more complete picture of the processes underlying impression formation.
Most research on ethnicity has focused on visual cues. However, accents are strong social cues that can match or contradict visual cues. We examined understudied reactions to people for whom one cue suggests one ethnicity whereas the other cue contradicts it. In an experiment conducted in Germany, job candidates spoke with an accent either congruent or incongruent with their (German or Turkish) appearance. Based on ethnolinguistic identity theory, we predicted that accents would be strong cues for categorization and evaluation. Based on expectancy violations theory, we expected that incongruent targets would be evaluated more extremely than congruent targets. Both predictions were confirmed: accents strongly influenced perceptions, and Turkish-looking German-accented targets were perceived as most competent of all targets (and additionally most warm). The findings show that bringing together visual and auditory information yields a more complete picture of the processes underlying impression formation.
Forms of committed relationships, including formal marriage arrangements between men and women, exist in almost every culture (Bell, 1997). Yet, similarly to many other psychological constructs (Henrich et al., 2010), marital satisfaction and its correlates have been investigated almost exclusively in Western countries (e.g., Bradbury et al., 2000). Meanwhile, marital relationships are heavily guided by culturally determined norms, customs, and expectations (for review see Berscheid, 1995; Fiske et al., 1998). While we acknowledge the differences existing both between- and within-cultures, we measured marital satisfaction and several factors that might potentially correlate with it based on self-report data from individuals across 33 countries. The purpose of this paper is to introduce the raw data available for anybody interested in further examining any relations between them and other country-level scores obtained elsewhere. Below, we review the central variables that are likely to be related to marital satisfaction.
Language of Responsibility. The Influence of Linguistic Abstraction on Collective Moral Emotions
(2017)
Two experiments investigated the effects of linguistic abstractness on the experience of collective moral emotions. In Experiment 1 participants were presented with two scenarios about ingroup misbehavior, phrased using descriptive action verbs, interpretative action verbs, adjectives or nouns. The results show that participants experienced slightly more negative moral emotions with higher levels of linguistic abstractness. In Experiment 2 we also tested for the influence of national identification on the relationship between linguistic abstractness and emotional reactions. Additionally, we expanded the number of scenarios. Experiment 2 replicated the earlier pattern, but found larger differences between conditions. The strength of national identification did not moderate the observed effects. The results of this research are discussed within the context of the linguistic category model and psychology of collective moral emotions.
While good results have been achieved for named entity recognition (NER) in supervised settings, it remains a problem that for low resource languages and less studied domains little or no labelled data is available. As NER is a crucial preprocessing step for many natural language processing tasks, finding a way to overcome this deficit in data remains of great interest. We propose a distant supervision approach to NER that is both language and domain independent where we automatically generate labelled training data using gazetteers that we previously extracted from Wikipedia. We test our approach on English, German and Estonian data sets and contribute further by introducing several successful methods to reduce the noise in the generated training data. The tested models beat baseline systems and our results show that distant supervision can be a promising approach for NER when no labelled data is available. For the English model we also show that the distant supervision model is better at generalizing within the same domain of news texts by comparing it against a supervised model on a different test set.
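The gazetteer-based labelling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `label_with_gazetteer`, the greedy longest-match strategy, the BIO tag scheme and the toy gazetteer are all assumptions made for illustration, and the noise-reduction methods mentioned in the abstract are not covered.

```python
def label_with_gazetteer(tokens, gazetteer, max_len=3):
    """Assign BIO labels by greedy longest-match lookup in a gazetteer.

    `gazetteer` maps entity strings to entity types (hypothetical format).
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in gazetteer:
                tag = gazetteer[span]
                labels[i] = "B-" + tag
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + tag
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Toy gazetteer standing in for entries extracted from Wikipedia (assumption).
TOY_GAZETTEER = {"Angela Merkel": "PER", "Tallinn": "LOC", "New York": "LOC"}

tokens = "Angela Merkel visited Tallinn".split()
labels = label_with_gazetteer(tokens, TOY_GAZETTEER)
print(list(zip(tokens, labels)))
```

Labels produced this way are noisy (ambiguous names, missing entries), which is why the abstract stresses noise reduction before training.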
Unknown words are a challenge for any NLP task, including sentiment analysis. Here, we evaluate the extent to which sentiment polarity of complex words can be predicted based on their morphological make-up. We do this on German as it has very productive processes of derivation and compounding and many German hapax words, which are likely to bear sentiment, are morphologically complex. We present results of supervised classification experiments on new datasets with morphological parses and polarity annotations.
We present a major step towards the creation of the first high-coverage lexicon of polarity shifters. In this work, we bootstrap a lexicon of verbs by exploiting various linguistic features. Polarity shifters, such as ‘abandon’, are similar to negations (e.g. ‘not’) in that they move the polarity of a phrase towards its inverse, as in ‘abandon all hope’. While there exist lists of negation words, creating comprehensive lists of polarity shifters is far more challenging due to their sheer number. On a sample of manually annotated verbs we examine a variety of linguistic features for this task. Then we build a supervised classifier to increase coverage. We show that this approach drastically reduces the annotation effort while ensuring a high-precision lexicon. We also show that our acquired knowledge of verbal polarity shifters improves phrase-level sentiment analysis.
We use a convolutional neural network to perform authorship identification on a very homogeneous dataset of scientific publications. In order to investigate the effect of domain biases, we obscure words below a certain frequency threshold, retaining only their POS-tags. This procedure improves test performance due to better generalization on unseen data. Using our method, we are able to predict the authors of scientific publications in the same discipline at levels well above chance.
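The preprocessing idea in this abstract, replacing words below a frequency threshold with their POS tags, might look roughly like the following sketch. The function name `mask_rare_words`, the `min_count` parameter, the lowercasing and the toy tag set are assumptions; the abstract does not specify the actual threshold or tagger.

```python
from collections import Counter


def mask_rare_words(tagged_docs, min_count=2):
    """Replace words seen fewer than `min_count` times with their POS tag.

    `tagged_docs` is a list of documents, each a list of (word, pos) pairs.
    Counts are taken over the whole collection, ignoring case (assumption).
    """
    counts = Counter(word.lower() for doc in tagged_docs for word, _ in doc)
    return [
        [word if counts[word.lower()] >= min_count else pos for word, pos in doc]
        for doc in tagged_docs
    ]


docs = [
    [("the", "DET"), ("parser", "NOUN"), ("fails", "VERB")],
    [("the", "DET"), ("tagger", "NOUN"), ("fails", "VERB")],
]
masked = mask_rare_words(docs, min_count=2)
print(masked)
```

In this toy input the topic-bearing nouns are rare and become `NOUN`, while frequent function words and verbs survive, which is the intended effect of hiding domain vocabulary from the classifier.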
This paper presents a survey on hate speech detection. Given the steadily growing body of social media content, the amount of online hate speech is also increasing. Due to the massive scale of the web, methods that automatically detect hate speech are required. Our survey describes key areas that have been explored to automatically recognize these types of utterances using natural language processing. We also discuss limits of those approaches.
We propose a new type of subword embedding designed to provide more information about unknown compounds, a major source for OOV words in German. We present an extrinsic evaluation where we use the compound embeddings as input to a neural dependency parser and compare the results to the ones obtained with other types of embeddings. Our evaluation shows that adding compound embeddings yields a significant improvement of 2% LAS over using word embeddings when no POS information is available. When adding POS embeddings to the input, however, the effect levels out. This suggests that it is not the missing information about the semantics of the unknown words that causes problems for parsing German, but the lack of morphological information for unknown words. To augment our evaluation, we also test the new embeddings in a language modelling task that requires both syntactic and semantic information.
Universal Dependency (UD) annotations, despite their usefulness for cross-lingual tasks and semantic applications, are not optimised for statistical parsing. In the paper, we ask what exactly causes the decrease in parsing accuracy when training a parser on UD-style annotations and whether the effect is similarly strong for all languages. We conduct a series of experiments where we systematically modify individual annotation decisions taken in the UD scheme and show that this results in an increased accuracy for most, but not for all languages. We show that the encoding in the UD scheme, in particular the decision to encode content words as heads, causes an increase in dependency length for nearly all treebanks and an increase in arc direction entropy for many languages, and evaluate the effect this has on parsing accuracy.
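The two treebank measures named in this abstract can be sketched as follows. The definitions are simplified assumptions, not necessarily the ones used in the paper: dependency length is taken as the absolute distance between a token and its head, and arc direction entropy as the entropy of head direction conditioned on the dependency label, weighted by label frequency. Function names and the input format are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mean_dependency_length(heads):
    """Mean |dependent - head| over non-root arcs.

    `heads[i]` is the 1-based head index of token i+1; 0 marks the root.
    """
    lengths = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(lengths) / len(lengths) if lengths else 0.0


def arc_direction_entropy(arcs):
    """Entropy (in bits) of head direction, conditioned on the arc label.

    `arcs` is an iterable of (dependent_index, head_index, label) triples;
    root arcs (head_index == 0) are skipped.
    """
    by_label = defaultdict(Counter)
    for dep, head, label in arcs:
        if head == 0:
            continue
        by_label[label]["head-left" if head < dep else "head-right"] += 1
    total = sum(sum(c.values()) for c in by_label.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for directions in by_label.values():
        n = sum(directions.values())
        h = -sum((k / n) * math.log2(k / n) for k in directions.values())
        entropy += (n / total) * h
    return entropy


# Toy sentence: token 2 is the root, tokens 1 and 3 depend on it.
print(mean_dependency_length([2, 0, 2]))
print(arc_direction_entropy([(1, 2, "amod"), (3, 2, "amod")]))
```

A label whose dependents always sit on the same side of the head contributes zero entropy; a label split evenly between both sides contributes one bit.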
To improve grammatical function labelling for German, we augment the labelling component of a neural dependency parser with a decision history. We present different ways to encode the history, using different LSTM architectures, and show that our models yield significant improvements, resulting in a LAS for German that is close to the best result from the SPMRL 2014 shared task (without the reranker).
We introduce a method for error detection in automatically annotated text, aimed at supporting the creation of high-quality language resources at affordable cost. Our method combines an unsupervised generative model with human supervision from active learning. We test our approach on in-domain and out-of-domain data in two languages, in AL simulations and in a real world setting. For all settings, the results show that our method is able to detect annotation errors with high precision and high recall.
In the NLP literature, adapting a parser to new text with properties different from the training data is commonly referred to as domain adaptation. In practice, however, the differences between texts from different sources often reflect a mixture of domain and genre properties, and it is by no means clear what impact each of those has on statistical parsing. In this paper, we investigate how differences between articles in a newspaper corpus relate to the concepts of genre and domain and how they influence parsing performance of a transition-based dependency parser. We do this by applying various similarity measures for data point selection and testing their adequacy for creating genre-aware parsing models.
"What makes this so complicated?" On the value of disorienting dilemmas in language instruction
(2017)
This paper deals with the creation of the first morphological treebank for German by merging two pre-existing linguistic databases. The first of these is the linguistic database CELEX which is a standard resource for German morphology. We build on its refurbished and modernized version. The second resource is GermaNet, a lexical-semantic network which also provides partial markup for compounds. We describe the state of the art and the essential characteristics of both databases and our latest revisions. As the merging involves two data sources with distinct annotation schemes, the derivation of the morphological trees for the unified resource is not trivial. We discuss how we overcome problems with the data and format, in particular how we deal with overlaps and complementary scopes. The resulting database comprises about 100,000 trees whose format can be chosen according to the requirements of the application at hand. In our discussion, we show some future directions for morphological treebanks. The Perl script for the generation of the data from the sources will be made publicly available on our website.
This paper discusses changes of lexicographic traditions with respect to approaches to meaning descriptions towards more cognitive perspectives. I will uncover how cognitive aspects can be incorporated into meaning descriptions based on corpus-driven analysis. The new German online dictionary “Paronyme − Dynamisch im Kontrast” (Storjohann 2014; 2016) is concerned with easily confused words such as effektiv/effizient, sensibel/sensitiv. It is currently in the process of being developed and it aims at adopting a more conceptual and encyclopaedic approach to meaning by incorporating cognitive features. As a corpus-guided reference work it strives to adequately reflect ideas such as conceptual structure, categorisation and knowledge. Contrastive entries emphasise aspects of usage, compare conceptual categories and indicate the (metonymic) mapping of knowledge. Adaptable access to lexicographic details and variable search options offer different foci and perspectives on linguistic information, and authentic examples reflect prototypical structures. Some of the cognitive features are demonstrated with the help of examples. Firstly, I will outline how patterns of usage imply conceptual categories as central ideas instead of sufficiently logical criteria of semantic distinction. In this way, linguistic findings correlate better with how users conceptualise language. Secondly, it is pointed out how collocates are treated as family members and fillers in contexts. Thirdly, I will demonstrate how contextual structure and functions are included, summarising referential information. Details are drawn from corpus data; they are usage-based linguistic patterns illustrating conversational interaction and semantic negotiations in contemporary public discourse. Finally, I will outline consultation routines which activate different facets of structural knowledge, e.g. through changes of the ordering of information or through the visualisation of semantic networks.
In the lexicon of pidgin and creole languages we can see an important part of these languages’ history of origin and of language contact. The current paper deals with the lexical sources of Tok Pisin and, more specifically, with words of German origin found in this language. During the period of German colonial domination of New Guinea and a number of insular territories in the Pacific (ca. 1885–1915), German words entered the emerging Tok Pisin lexicon. Based on a broad range of lexical and lexicographic data from the early 20th century up until today, we investigate the actual or presumed German origin of a number of Tok Pisin words and trace different lexical processes of integration that are linked to various, often though not always colonially determined, contact settings and sociocultural interactions.
Wolfgang von Kempelen's book "The Mechanism of Human Speech" from 1791 is a famous milestone in the history of speech communication research. It has an enormous relevance for the phonetic sciences and it marks an important turning point for the development of (mechanical) speech synthesis. So far no English version of this work has been available, which has excluded many interested researchers. Access to the original versions in German and French is restricted for various reasons. For example, the blackletter script of the German version is troublesome for most of today's readers. We report here on a new edition of Kempelen's book which unites a more readable German version and its English translation. It will now also be in a searchable electronic format and has been enriched with many commentaries, which aid in the understanding of details of the late 18th century that are little known or unknown to many researchers today.
There are a number of recent replicas of Wolfgang von Kempelen's speaking machine. Although all of them are explicitly based on Kempelen's own description, hardly any of them are identical in construction and sound. In this paper we illustrate some of these differences and their causes for five replicas that we built ourselves.
This paper argues for using authentic data not only as an empirical basis for linguistic generalizations but also for exemplification purposes in monolingual and particularly in bi- and multilingual contrastive studies. It shows that parallel data extracted from the available parallel corpora can - after enrichment with semantic-functional information while maintaining the available contextual, register-related and linguistic information - serve as a perfect data source for multilingual exemplification. Moreover, the analysis of semantic-functionally equivalent parallel sequences allows the investigation and exemplification of similarities and differences in how different languages express similar meaning from both a semasiological and an onomasiological perspective.
This paper provides a formal semantic analysis of past interpretation in Medumba (Grassfields Bantu), a graded tense language. Based on original fieldwork, the study explores the empirical behavior and meaning contribution of graded past morphemes in Medumba and relates these to the account of the phenomenon proposed in Cable (Nat Lang Semant 21:219–276, 2013) for Gĩkũyũ. Investigation reveals that the behavior of Medumba gradedness markers differs from that of their Gĩkũyũ counterparts in meaningful ways and, more broadly, discourages an analysis as presuppositional eventuality or reference time modifiers. Instead, the Medumba markers are most appropriately analyzed as quantificational tenses. It also turns out that Medumba, though belonging to the typological class of graded tense languages, shows intriguing similarities to genuinely tenseless languages in allowing for temporally unmarked sentences and exploiting aspectual and pragmatic cues for reference time resolution. The more general cross-linguistic implication of the study is that the set of languages often subsumed under the label “graded tense” does not in fact form a natural class and that more case-by-case research is needed to refine this category.
In my talk, I present an empirical approach to detecting and describing proverbs as frozen sentences with specific functions in current language use. We have developed this approach in the EU project ‘SprichWort’ (based on the German Reference Corpus). The first chapter illustrates selected aspects of our complex, iterative procedure to validate proverb candidates. Based on our corpus-driven lexpan methodology of slot analysis I then discuss semantic restrictions of proverb patterns. Furthermore, I show different degrees of proverb quality ranging from genuine proverbs to non-proverb realizations of the same abstract pattern. On the one hand, the corpus validation reveals that proverbs are definitely perceived and used as relatively fixed entities and often as sentences. On the other hand, proverbs are not only interpreted as an interesting unique phenomenon but also as part of the whole lexicon, embedded in networks of different lexical items.
This paper explores the syntax of agreement in Insular Scandinavian in copular clauses with two potential goals for agreement. Data from three production experiments - one in Faroese and two in Icelandic - establish several new facts. First, in both languages agreement with the second nominal (DP2) is possible/preferred. Second, there is considerable variation (both within and between languages, and indeed speakers) in the patterns observed. Third, Icelandic shows a surprising pattern of “partial” agreement with DP2 - agreement in number but not person. We discuss the implications for current theorising about agreement, proposing that in these languages, at least, agreement is downwards, and that the available agreement options depend in part on the syntactic position of DP1 when agreement is established.
Language Change
(2017)
The present chapter outlines a research program for historical linguistics based on the idea that the object of the formal study of language change should be defined as grammar change, that is, a set of discrete differences between the target grammar and the grammar acquired by the learner (Hale 2007). This approach is shown to offer new answers to some classical problems of historical linguistics (Weinreich et al. 1968), concerning, specifically, the actuation of changes and the observation that the transition from one historical state to another proceeds gradually. It is argued that learners are highly sensitive to small fluctuations in the linguistic input they receive, making change inevitable, while the impression of gradualness is linked to independent factors (diffusion in a speech community, and grammar competition). Special attention is paid to grammaticalization phenomena, which offer insights into the nature of functional categories, the building blocks of clause structure.
The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations, metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.
As a consequence of a recent curation project, the Dortmund Chat Corpus is available in CLARIN-D research infrastructures for download and querying. A legal opinion had recommended that standard measures of anonymisation be applied to the corpus before its republication. This paper reports on the anonymisation campaign that was conducted for the corpus. Anonymisation has been realised as categorisation; the taxonomy of anonymisation categories applied is introduced, and the method of applying it to the TEI files is demonstrated. The results of the anonymisation campaign as well as issues of quality assessment are discussed. Finally, pseudonymisation is discussed as an alternative to categorisation for anonymising CMC data, as are possibilities for automating the process.
In this paper we present the results of an automatic classification of Russian texts into three levels of difficulty. Our aim is to build a study corpus of Russian in which an L2 student is able to select texts of a desired complexity. We build on a pilot study in which we classified Russian texts into two levels of difficulty. In the current paper, we apply the classification to an extended corpus of 577 labelled texts. The best-performing combination of features achieves an accuracy of 0.74 within at most one level difference.