Ancient Chinese poetry is constituted by structured language that deviates from ordinary language usage; its poetic genres impose unique combinatory constraints on linguistic elements. How does the constrained poetic structure facilitate speech segmentation when common linguistic and statistical cues are unreliable to listeners in poems? We generated artificial Jueju, which arguably has the most constrained structure in ancient Chinese poetry, and presented each poem twice as an isochronous sequence of syllables to native Mandarin speakers while conducting magnetoencephalography (MEG) recording. We found that listeners deployed their prior knowledge of Jueju to build the line structure and to establish the conceptual flow of Jueju. Unprecedentedly, we found a phase precession phenomenon indicating predictive processes of speech segmentation—the neural phase advanced faster after listeners acquired knowledge of incoming speech. The statistical co-occurrence of monosyllabic words in Jueju negatively correlated with speech segmentation, which provides an alternative perspective on how statistical cues facilitate speech segmentation. Our findings suggest that constrained poetic structures serve as a temporal map for listeners to group speech contents and to predict incoming speech signals. Listeners can parse speech streams by using not only grammatical and statistical cues but also their prior knowledge of the form of language.
The project Referenzkorpus Altdeutsch (‘Old German Reference Corpus’) aims to establish a deeply-annotated text corpus of all extant Old German texts. As the automated part-of-speech and morphological pre-annotation is amended by hand, a quality control system for the results seems a desirable objective. To this end, standardized inflectional forms, generated using the morphological information, are compared with the attested word forms. Their creation is described by way of example for the Old High German part of the corpus. As is shown, in a few cases, some features of the attested word forms are also required in order to determine as exactly as possible the shape of the inflected lemma form to be created.
The availability of electronic corpora of historical stages of languages has been welcomed as possibly attenuating the inherent problem of diachronic linguistics, i.e. that we only have access to what has chanced to come down to us - the problem which was memorably named by Labov (1992) as one of “Bad Data”. However, such corpora can only give us access to an increased amount of historical material and this can essentially still only be a partial and possibly distorted picture of the actual language at a particular period of history. Corpora can be improved by taking a more representative sample of extant texts if these are available (as they are in significant number for periods after the invention of printing). But, as examples from the recently compiled GerManC corpus of seventeenth and eighteenth century German show, the evidence from such corpora can still fail to yield definitive answers to our questions about earlier stages of a language. The data still require expert interpretation, and it is important to be realistic about what can legitimately be expected from an electronic historical corpus.
Multi-faceted alignment. Toward automatic detection of textual similarity in Gospel-derived texts
(2015)
Ancient Germanic Bible-derived texts stand in as test material for producing computational means for automatically determining where textual contamination and linguistic interference have influenced the translation process. This paper reports on the results of research efforts that produced a text corpus; a method for decomposing the texts involved into smaller, more directly comparable thematically-related chunks; a database of relationships between these chunks; and a user-interface allowing for searches based on various referential criteria. Finally, the state of the product at the end of the project is discussed, namely as it was handed over to another researcher who has extended it to automatically find semantic and syntactic similarities within comparable chunks.
In this paper we present some preliminary considerations concerning the possibility of automatically parsing an annotated corpus for N-N compounds. This should in principle be possible at least for relational and stereotype compounds, if the lemmatization of the corpus connects the lemmata with lexical entries as described in Höhle (1982). These lexical entries then supply the necessary information about the argument structure of a relational noun or about the stereotypical purpose associated with the noun’s referent which can be used to establish a relation between the first and the head constituent of the compound.
The relative order of dative and accusative objects in older German is less free than it is today. The reason for this could be that speakers of the direct predecessor of Old High German organized the referents according to the Thematic Hierarchy. If one applies a Case Hierarchy Nom>Acc>Dat to this, the order Nom - Dat - Acc falls out. It becomes apparent that the status of the Thematic Hierarchy is not a factor governing underlying word order, but a factor inducing scrambling. Arguments from binding theory, whose validity is discussed, indicate that the underlying order is ‘accusative before dative’.
Latvia
(2019)
This chapter deals with current issues in bilingual education in the framework of language and educational policies in Latvia, and also outlines similarities or common tendencies in the two other Baltic states, Estonia and Lithuania. As commonly understood in the 21st century, the term ‘bilingual education’ includes ‘multilingual education, as the umbrella term to cover a wide spectrum of practice and policy’ (García, 2009: 9).
Our paper discusses family language policies among multilingual families in Latvia with Russian as home language. The presentation is based on three case studies, i.e. interviews conducted with Russophones who have chosen to send their children to Latvian-medium pre-schools and schools. The main aim is to understand practices and regards among such families “from below,” i.e. which family-internal and family-external factors influenced the choice of Latvian-medium education and what impact this choice has on linguistic practices.
The paper shows that there have been critical events which both encouraged and discouraged the choice of Latvian-medium education. The wish to integrate into mainstream society has been met by obstacles both from ethnic Russians and Latvians. Yet, the three families consider their choices to be the right ones for the future development of their children in a multiethnic Latvia in which Latvian serves as the unifying language of society.
The present paper examines the rise and fall of Modern High German loanwords in English from 1600 until 2000, principally making use of the record of borrowing documented by the Oxford English Dictionary (OED) in its Third Edition (online version, in revision 2000-). Groups of loanwords are analysed by century, with reference to the changing social and cultural landscape characterising relationships between the relevant nations over this period. This is not a simple picture: each language grows over the period in different ways, and the speakers of English look to German at different times for different types of borrowing, as the political and intellectual balance alters.
This study explores the interdependence of qualitative and quantitative analysis in articulating empirically plausible and theoretically coherent generalizations about grammatical structure. I will show that the use of large electronic corpora is indispensable to the grammarian's work, serving as a rich source of semantic and contextual information, which turns out to be crucial in categorizing and explaining grammatical forms. These general concerns are illustrated by the patterns of use of Czech relative clauses (RC) with the non-declinable relativizer co, by taking a set of existing claims about these RCs and testing their accuracy on corpus material. The relevant analytic categories revolve around the referential type of the relativized noun, the interaction between relativization and deixis, and the semantic relationship between the relativized noun and the proposition expressed by the RC. The analysis demonstrates that some of the existing claims are fully invalid in the face of regularly attested semantic distinctions, while others are more or less on the right track but often not comprehensive or precise enough to capture the full richness of the facts.
Conversation is usually considered to be grammatically simple, while academic writing is often claimed to be structurally complex, associated primarily with a greater use of dependent clauses. Our goal in the present paper is to challenge these stereotypes, based on the results of large-scale corpus investigations. We argue that both conversation and professional academic writing are grammatically complex but that their complexities are dramatically different. Surprisingly, the traditional view that complexity is realized through extensive clausal embedding leads to the conclusion that conversation is more complex than academic writing. In contrast, written academic discourse is actually much more ‘compressed’ than elaborated, and the complexities of academic writing are realized mostly as phrasal embedding rather than embedded clauses.
This paper first argues that the distinction between Propositions and States-of-Affairs is significant for understanding a number of linguistic contrasts, including contrasts between nominalizations, complement clauses, readings of modal infinitives, raising constructions, illocutions and moods, relative clauses, and nouns. Subsequently, the paper outlines a cognitive linguistic model of the distinction, according to which Propositions and States-of-Affairs differ in terms of construal. Both prompt Langackerian “processes”, but only Propositions prompt a construal of these processes as referential. The paper argues that this model has a number of advantages over a traditional, denotational understanding of the distinction.
The present article proposes a syntactic and semantic analysis of assertive clauses that comprises their truth-conditional aspects and their speech act potential in communication. What is commonly called “illocutionary force” is differentiated into three structurally and functionally distinct layers: a judgement phrase, representing subjective epistemic and evidential attitudes; a commitment phrase, representing the social commitment related to assertions; and an act phrase, representing the relation to the common ground of the conversation. The article provides several pieces of evidence for this structure: from the interpretation and syntactic position of various classes of epistemic, evidential, affirmative and speech act-related operators, from clausal complements embedded by different types of predicates, from embedded root clauses, and from anaphora referring to different clausal projections. The syntactic assumptions are phrased within X-bar theory, and the semantic interpretation makes use of dynamic update of common ground, differentiating between informative and performative updates. The object language is German, with particular reference to verb final and verb second structure.
This article describes an English Zulu learners’ dictionary that is part of a larger set of information tools, namely an online Zulu course, an e-dictionary of possessives (which was implemented earlier) accompanied by training software offering translation tasks on several levels, and an ontology of morphemic items categorizing and describing all parts of speech of Zulu. The underlying lexicographic database contains the usual type of lexicographic data, such as translation equivalents and their respective morphosyntactic data, but its entries have been extended with data related to the lessons of the online course in order to enable the learner to link both tools autonomously. The ‘outer matter’ is integrated into the website in the form of several texts on additional web pages (how-to-use, typical outputs, grammar tables, information on morphosyntactic rules, etc.). The dictionary comprises a modular system, where each module fulfils one of the necessary functions.
In the context of a Nordic Conference on Bilingualism, it can be a rewarding task to look at issues such as language planning, policy and legislation from a perspective of the southern neighbours of the Nordic world. This paper therefore intends to point attention towards a case of societal multilingualism at the periphery of the Nordic world by dealing with recent developments in language policy and legislation with regard to the North Frisian speech community in the German Land of Schleswig-Holstein. As I will show, it is striking to what degree there are considerable differences in the discourse on minority protection and language legislation between the Nordic countries and a cultural area which may arguably be considered to be part of the Nordic fringe - and which itself occasionally takes Scandinavia as a reference point, e.g. in the recent adoption of a pan-Frisian flag modelled on the Nordic cross (Falkena 2006).
The main focus of the paper will be on the Frisian Act which was passed in the Parliament of Schleswig-Holstein in late 2004. It provides a certain legal basis for some political activities with regard to Frisian, but falls short of creating a true spirit of minority language protection and/or revitalisation. In contrast to the traditions of the German and Danish minorities along the German-Danish border and to minority protection in Northern Scandinavia (in particular to Sámi language rights), the approach chosen in the Frisian Act is extremely weak and has no connotation of long-term oriented language-planning, let alone a rights-based perspective.
The paper will then look at policy developments in the time since the Act was passed, e.g. in the Schleswig-Holstein election campaign in 2005, and at the latest perceptions of the Frisian language situation in the discourse on North Frisian policy in Schleswig-Holstein majority society. In the final part of the paper, I will discuss reasons for the differences in minority language policy discourse between Germany and the Nordic countries, and try to provide an outlook on how Frisian could benefit from its geographic proximity to the Nordic world.
This chapter explores the Linguistic Landscape of six medium-size towns in the Baltic States with regard to languages of tourism and to the role of English and Russian as linguae francae. A quantitative analysis of signs and of tourism web sites shows that, next to the state languages, English is the most dominant language. Yet, interviews reveal that underneath the surface, Russian still stands strong. Therefore, possible claims that English might take over the role of the main lingua franca in the Baltic States cannot be maintained. English has a strong position for attracting international tourists, but only alongside Russian which remains important both as a language of international communication and for local needs.
Basic grammatical categories may carry social meanings irrespective of their semantic content. In a set of four studies, we demonstrate that verbs—a basic linguistic category present and distinguishable in most languages—are related to the perception of agency, a fundamental dimension of social perception. In an archival analysis of actual language use in Polish and German, we found that targets stereotypically associated with high agency (men and young people) are presented in the immediate neighborhood of a verb more often than non-agentic social targets (women and older people). Moreover, in three experiments using a pseudo-word paradigm, verbs (but not adjectives and nouns) were consistently associated with agency (but not with communion). These results provide consistent evidence that verbs, as grammatical vehicles of action, are linguistic markers of agency. In demonstrating meta-semantic effects of language, these studies corroborate the view of language as a social tool and an integral part of social perception.
Nonnative accents are prevalent in our globalized world and constitute highly salient cues in social perception. Whereas previous literature has commonly assumed that they cue specific social group stereotypes, we propose that nonnative accents generally trigger spontaneous negatively biased associations (due to a general nonnative accent category and perceptual influences). Accordingly, Study 1 demonstrates negative biases with conceptual IATs, targeting the general concepts of accent versus native speech, on the dimensions affect, trust, and competence, but not on sociability. Study 2 attests to negative, largely enhanced biases on all dimensions with auditory IATs comprising matched native–nonnative speaker pairs for four accent types. Biases emerged irrespective of the accent types that differed in attractiveness, recognizability of origin, and origin-linked national associations. Study 3 replicates general IAT biases with an affect IAT and a conventional evaluative IAT. These findings corroborate our hypotheses and assist in understanding general negativity toward nonnative accents.
Communication of stereotypes in the classroom: biased language use of German and Turkish adolescents
(2014)
Little is known about the linguistic transmission and maintenance of mutual stereotypes in interethnic contexts. This field study, therefore, investigated the linguistic expectancy bias (LEB) and the linguistic intergroup bias (LIB) among German and Turkish adolescents (13 to 20 years) in the school context. The LEB refers to the general phenomenon of describing stereotypes more abstractly. The LIB is the tendency to use language abstraction for in-group protective reasons. Results revealed an unmoderated LEB, whereas the LIB only occurred when foreigners were in the numerical majority, the classroom composition was perceived as a learning disadvantage, or the interethnic conflict frequency was high. These findings provide first evidence for the use of both LEB and LIB in an interethnic classroom setting.
This chapter will present results of a linguistic landscape (LL) project in the regional centre of Rēzekne in the region of Latgale in Eastern Latvia. Latvia was de facto a part of the Soviet Union until 1991, and this has given it a highly multilingual society. In the essentially post-colonial situation since 1991, strict language policies have been in place, which aim to reverse the language shift from Russian, the dominant language of Soviet times, back to Latvian. Thus, the main interests of the research were how the complex pattern of multilingualism in Latvia is reflected in the LL; how people relate to current language legislation; and what motivations, attitudes and emotions inform their behaviour.
Sexual harassment severely impacts the educational system in the West African country Benin and the progress of women in this society that is characterized by great gender inequality. Knowledge of the belief systems rooted in the sociocultural context is crucial to the understanding of sexual harassment. However, no study has yet investigated how sexual harassment is related to fundamental beliefs in Benin or West African countries. We conducted a field study on 265 female and male students from several high schools in Benin to investigate the link between sexual harassment and measures of ambivalent sexism, gender identity, and rape myth acceptance. Almost half of the sample reported having experienced sexual harassment personally or among peers. Levels of sexism and rape myth acceptance were very high compared to other studies. These attitudes appeared to converge in a sexist belief system that was linked to personal experiences, the perceived probability of experiencing and fear of sexual harassment. Results suggest that sexual harassment is a societal problem and that interventions need to address fundamental attitudes held in societies low in gender equality.
Nonnative-accented speakers face prevalent discrimination. The assumption that people freely express negative sentiments toward nonnative speakers has also guided common research methods. However, recent studies did not consistently find downgrading, so that prejudice against nonnative accents might even be questioned at first sight. The present theoretical article will bridge these contradictory findings in three ways: (a) We illustrate that nonnative speakers with foreign accents frequently may not be downgraded in commonly used first-impression and employment scenario paradigms. It appears that relatively controlled responding may be influenced by norms and motivations to respond without prejudice, whereas negative biases emerge in spontaneous responding. (b) We present an integrative view based on knowledge on modern forms of prejudice to develop modern notions of accent-ism, which allow for predictions when accent biases are (not) likely to surface. (c) We conclude with implications for interventions and a tailored research agenda.
The present research unites two emergent trends in the area of language attitudes: (a) research on perceptions of nonnative speakers by nonnative listeners and (b) the search for general, basic mechanisms underlying the evaluation of nonnative accented speakers. In three experiments featuring an employment situation, German participants listened to a presentation given in English by a German speaker with a strong versus native-like accent (in Studies 1–3) versus a native speaker of English (in Study 1). They evaluated candidates with a strong accent worse than candidates with a native(-like) pronunciation—even to the degree that the quality of arguments was of no relevance (Study 1). Study 2 introduces an effective intervention to reduce these discriminatory tendencies. Across studies, affect and competence emerged as major mediators of hirability evaluations. Study 3 further revealed sequential indirect influences, which advance our understanding of previous inconsistent findings regarding disfluency and warmth perceptions.
The establishment of Scottish Parliament: What difference does it make for the Gaelic language?
(2004)
After the Labour government takeover in Westminster in 1997, followed by the referendum on establishing a Scottish Parliament, hopes for more support for the Gaelic language in Scotland were nourished. In the election campaign to the Scottish Parliament in 1999, all parties which were elected to Parliament had mentioned Gaelic, and all parties except the Conservatives had promised an increase in support for Gaelic (cf. Scottish parties’ election manifestoes, obtainable from the parties or via their web sites). Now that the new Scottish Executive, formed by Labour and the Liberal Democrats, has been in power for some time, it is interesting to see if these hopes have been fulfilled.
The two core questions of this paper will thus be:
1. What is the status of Scottish Gaelic after the devolution process?
2. What difference does the existence of the Scottish Parliament make for the status of Gaelic?
It is important to note that this paper refers to language status and Gaelic’s position from a mere language policy perspective. The results are mostly based on an analysis of Parliament documents, the method of investigation being strictly philological. Empirical research has not yet been undertaken. The reference time of my paper will be the first year of Scottish Parliament and the new executive. Even though this is an arbitrary time break, the first year is a symbolic point of time. As the first legislation period as a possibly more natural reference point is not over yet, this choice seems legitimate.
This chapter analyses the impact of political decentralization in a state on the position of ethnic and linguistic minorities, in particular with regard to the role of parliamentary assemblies in the political system. It relates a number of typical functions of parliaments to the specific needs of minorities and their languages. The most important of these functions are the representation of the minority and responsiveness to the minority’s needs. The chapter then discusses six examples from the European Union (and Norway) which prototypically represent different types of parliamentary decentralization: the ethnically defined Sameting in Norway and its importance for the Sámi population, the Scottish Parliament and its role for speakers of Scottish Gaelic, the German regional parliaments of the Länder of Schleswig-Holstein and Saxony and their impact on the Frisian and Sorbian minorities respectively, the autonomy of predominantly German-speaking South Tyrol within the Italian state, and finally the situation of the speakers of Latgalian in Latvia, where a decentralized parliament is missing. The chapter also makes suggestions on comparisons of these situations with minorities in Russia. It finally argues that political decentralization may indeed empower minorities to gain a greater voice in their states, even if much ultimately depends on individual factors in each situation and the attitudes by the majority population and the political center.
This paper seeks to apply the principles of the famous 3-Circle-Model devised for the description of the ecolinguistic position of English world-wide to the position of German around the world.
On the one hand, the 3-Circle-Model for English with its "Inner", "Outer" and "Extended/Expanding" Circles was invented by Kachru in the 1980s and has since then been adopted, refined and criticised by numerous authors. The situation of German world-wide, on the other hand, has only been scarcely discussed in the past 20 years. While the global extension of German is obviously by far weaker than that of English, there are also a number of noteworthy similarities in terms of historical spread and the current position of these two languages.
This paper therefore discusses the analogies of global English and German by establishing three circles for German: the Inner Circle for the core German-speaking area, i.e. Germany, Austria and Switzerland; the Outer Circle including a number of German minority areas (mostly in Europe), and finally the Extended Circle which may be denoted as "Crumbling" rather than "Expanding". The latter comprises traditional German diaspora communities in different parts of the world which either result from migration or reflect the previous functions of German as a language of culture and as a lingua franca in regions like Eastern Europe. The paper argues that there are some striking structural similarities, but also shows the limits of this comparison.
This chapter investigates policies which shape the role of the German language in contemporary Estonia. Whereas German played for many centuries an important role as the language of the economic and cultural elite in Estonia, it severely declined in importance throughout the twentieth century. Against this historical background, the paper provides an overview of the current functions of German and attitudes towards it and it discusses how these functions and attitudes are influenced by policies of various actors from inside and outside Estonia. The paper argues that German continues to play a significant role: while German is no longer a lingua franca, it still enjoys a number of functions and prestige in clearly defined niches involving communication within German-speaking circles or between Estonians and Germans. The interplay of language policies of the Estonian and the German-speaking states as well as by semi-state and private institutions succeeds in maintaining German as an additional language in contemporary Estonia.
This chapter introduces readers to the context and concept of this volume. It starts by providing an historical overview of languages and multilingualism in Lithuania, Estonia and Latvia, highlighting the 100th anniversary of statehood which the three Baltic states are celebrating in 2018. Then, the chapter briefly presents important strands of research on multilingualism in the region throughout the past decades; in particular, questions about language policies and the status of the national languages (Estonian, Latvian and Lithuanian) and Russian. It also touches on debates about languages in education and the roles of other languages such as the regional languages of Latgalian and Võro and the changing roles of international languages such as English and German. The chapter concludes by providing short summaries of the contributions to this book.
Studies on the Linguistic Landscapes (LLs) investigate frequencies, functions, and power relations between languages and their speakers in public space. Research on the LL thereby aims to understand how the production and perception of signs reflect and simultaneously shape realities. In this sense, the LL is one of the most dynamic places where processes of minoritization take place: the (in)visibility of minority languages and the functional and symbolic relationships to majority languages are in direct relationship with negotiations of minorities’ place in society. This chapter looks at minority languages in the LL from two major perspectives. Firstly, it discusses language policies, focussing on which policy categories and which domains of language use are of particular relevance for understanding minority languages in the LL. Then, it turns to issues of conflict, contestation, and exclusion by providing examples from a range of geographically and typologically prototypical case studies, including Israel, Canada, Belgium, the Basque Country, and Friesland.
This paper discusses how the regional language of Latgalian in Latvia has benefitted from societal discourse on the antagonism between speakers of Latvian and Russian in Latvia. Triggered by the 2012 referendum on Russian as a possible second state language of Latvia, Latvian politics (exemplified by politicians' statements since 2012 as well as by 2014 election manifestoes) as well as society at large (displayed by e.g. increased attention in the educational sector and the media) have started to devote considerably more attention to the region of Latgale, including its cultural and linguistic heritage. The paper thereby argues that speakers of Latgalian have gained a noteworthy increase in voice, even though the future of the variety is still considered to be uncertain.
Research on language politics, policy, and planning is of importance to contact linguistics, since political relations between groups of language users, the way in which the use of language(s) is organized, and how language issues are politicized fundamentally shape the political and social conditions under which language varieties are in contact. This chapter first provides a short sketch of how language policy, planning, and politics have so far been conceptualized. Major subfields will be discussed, and then relevant actors and factors in these processes will be introduced. At the end, these aspects will be discussed from a contact linguistic perspective and summarized in a graphic visualization.
Numerous academics and politicians have in recent years contributed to the description and analysis of language policy for the benefit of smaller languages. The present paper tries to add to these by taking up the notion of yet another aspect of politics and language, exemplified by two case studies. The political aspect is the decentralization of parliamentary power for the benefit of minority languages. The two case studies deal with the relationship between the Scottish Parliament and the Gaelic language on the one hand, and between the Norwegian Sami Parliament, the Sameting, and the Sami language on the other hand. The underlying idea is to consider whether parliamentary bodies may contribute to the empowerment of speakers of minority languages regarding the language of individual choice in as many instances as possible. This applies to any domain of language use, but in particular public bodies, education, and the media, at local, regional and national levels.
Lexical data API
(2022)
This API provides data from various dictionary resources of K Dictionaries across 50 languages. It is used by language service providers, app developers, and researchers, and returns data as JSON documents. A basic search result consists of an object containing partial lexical information on entries that match the search criteria, but further in-depth information is also available. Basic search parameters include the source resource, source language, and text (lemma), and the entries are returned as objects within the results array. It is possible to look for words with specific syntactic criteria, specifying the part of speech, grammatical number, gender and subcategorization, monosemous or polysemous entries. When searching by parameters, each entry result contains a unique entry ID, and each sense has its own unique sense ID. Using these IDs, it is possible to obtain more data – such as syntactic and semantic information, multiword expressions, examples of usage, translations, etc. – of a single entry or sense. The software demonstration includes a brief overview of the API with practical examples of its operation.
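The search-then-drill-down flow described in this abstract can be sketched as follows. This is a minimal illustration only: the base URL, the parameter names (`source`, `language`, `text`) and the JSON field names (`results`, `id`, `senses`) are assumptions made for the example and are not taken from the actual API documentation.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; the real API's URL and parameter names may differ.
BASE_URL = "https://api.example.org/search"


def build_search_url(source, language, text):
    """Build a search URL from the three basic parameters named in the abstract."""
    params = {"source": source, "language": language, "text": text}
    return f"{BASE_URL}?{urlencode(params)}"


def collect_ids(response_text):
    """Map each entry ID in a search response to the list of its sense IDs."""
    data = json.loads(response_text)
    return {
        entry["id"]: [sense["id"] for sense in entry.get("senses", [])]
        for entry in data["results"]
    }


# Hand-written sample response (assumed shape, not real API output).
SAMPLE_RESPONSE = (
    '{"results": [{"id": "EN_1", "headword": "bank", '
    '"senses": [{"id": "EN_1_S1"}, {"id": "EN_1_S2"}]}]}'
)

if __name__ == "__main__":
    print(build_search_url("global", "en", "bank"))
    print(collect_ids(SAMPLE_RESPONSE))
```

The collected entry and sense IDs would then be used in follow-up requests for the more detailed information on a single entry or sense.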
This article examines how the most frequent imperative forms of the verb to show in German (zeig mal) and Czech (ukaž) are deployed in object-centred sequences. Specifically, it focuses on smartphone-based showing activities as these were the main sequential environments of show imperatives in the datasets investigated. In both languages, the imperative form does not merely aim to elicit a responsive action from the smartphone holder (such as making the device available) but projects an individual course of action from the requester’s side in the form of an immediate visual inspection of the digital content. This inspection is carried out as part of a joint course of action, allowing the recipient to provide a more detailed response to a prior action. Therefore, this specific imperative form is proven to be cross-linguistically suited to technology-mediated inspection sequences.
Automatic summarization systems usually are trained and evaluated in a particular domain with fixed data sets. When such a system is to be applied to slightly different input, labor- and cost-intensive annotations have to be created to retrain the system. We deal with this problem by providing users with a GUI which allows them to correct automatically produced imperfect summaries. The corrected summary in turn is added to the pool of training data. The performance of the system is expected to improve as it adapts to the new domain.
In this paper we present work on developing a computerized grammar for the Latin language. It demonstrates the principles and challenges in developing a grammar for a natural language in a modern grammar formalism. The grammar presented here provides a useful resource for natural language processing applications in different fields. It can easily be adapted for language learning and for use in language technology for cultural heritage, such as translation applications or support for the post-correction of digitized documents.
Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations
(2009)
In this paper we show that the extraction of opinions from free-text reviews can improve the accuracy of movie recommendations. We present three approaches to extract movie aspects as opinion targets and use them as features for the collaborative filtering. Each of these approaches requires different amounts of manual interaction. We collected a data set of reviews with corresponding ordinal (star) ratings of several thousand movies to evaluate the different features for the collaborative filtering. We employ a state-of-the-art collaborative filtering engine for the recommendations during our evaluation and compare the performance with and without using the features representing user preferences mined from the free-text reviews provided by the users. The opinion mining based features perform significantly better than the baseline, which is based on star ratings and genre information only.
We present a supervised machine learning AND system which tackles semantic similarity between publication titles by means of word embeddings. Word embeddings are integrated as external components, which keeps the model small and efficient, while allowing for easy extensibility and domain adaptation. Initial experiments show that word embeddings can improve the Recall and F score of the binary classification sub-task of AND. Results for the clustering sub-task are less clear, but also promising and overall show the feasibility of the approach.
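One common way to compare publication titles with word embeddings is to average the word vectors of each title and take the cosine similarity of the two averages. The sketch below illustrates that idea only; the three-dimensional toy vectors are invented for the example, and whether the system described here uses averaging or cosine similarity is an assumption.

```python
import math

# Toy embeddings invented for illustration; a real system would load
# pretrained vectors as an external component.
TOY_EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "deep": [0.85, 0.15, 0.05],
    "learning": [0.7, 0.3, 0.1],
    "medieval": [0.0, 0.1, 0.9],
    "poetry": [0.1, 0.0, 0.95],
}


def title_vector(title):
    """Average the vectors of all known words in a title (None if no word is known)."""
    vectors = [TOY_EMBEDDINGS[w] for w in title.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def title_similarity(title_a, title_b):
    """Similarity of two titles; 0.0 when either title has no known words."""
    va, vb = title_vector(title_a), title_vector(title_b)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)
```

Such a score could serve as one feature of a binary classifier that decides whether two publications belong to the same author.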
The demo presents a minimalist, off-the-shelf AND tool which provides a fundamental AND operation, the comparison of two publications with ambiguous authors, as an easily accessible HTTP interface. The tool implements this operation using standard AND functionality, but puts particular emphasis on advanced methods from natural language processing (NLP) for comparing publication title semantics.
Lexical resources are often represented in table form, e. g., in relational databases, or represented in specially marked up texts, for example, in document based XML models. This paper describes how it is possible to model lexical structures as graphs and how this model can be used to exploit existing lexical resources and even how different types of lexical resources can be combined.
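The idea of modelling lexical structures as a graph and combining resources can be illustrated with a small sketch. The class name, the node naming scheme (`lemma:house`) and the relation labels are hypothetical choices for this example and do not reproduce the model defined in the paper.

```python
from collections import defaultdict


class LexiconGraph:
    """A toy lexicon graph: nodes are strings, edges carry a relation label."""

    def __init__(self):
        # Maps a source node to a set of (relation, target) pairs.
        self.edges = defaultdict(set)

    def add(self, source, relation, target):
        """Add one labelled edge from source to target."""
        self.edges[source].add((relation, target))

    def merge(self, other):
        """Combine another graph into this one; shared nodes are unified by name."""
        for source, links in other.edges.items():
            self.edges[source] |= links

    def lookup(self, source, relation):
        """Return the sorted targets reachable from source via the given relation."""
        return sorted(t for r, t in self.edges.get(source, ()) if r == relation)


if __name__ == "__main__":
    morphology = LexiconGraph()
    morphology.add("lemma:house", "pos", "noun")
    translations = LexiconGraph()
    translations.add("lemma:house", "translation", "Haus")
    morphology.merge(translations)
    print(morphology.lookup("lemma:house", "translation"))
```

Because both resources attach their information to the same lemma node, merging amounts to a union of edge sets, which is the property that makes a graph model attractive for combining different types of lexical resources.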
In this contribution we present some work of the R&D European project “LIRICS” and of the ISO/TC 37/SC 4 committee related to the topic of interoperability and re-use of language resources. We introduce some basic mechanisms of the standardization work in ISO and describe in more details the general approach on how to cope with the annotation of language data within ISO.
In this chapter, I will focus on the phenomenon of drop out, i.e., withdrawal from the turn due to overlapping talk, in order to reflect on the link between “unfinished” turns and participation framework. With the help of a sequential and multimodal analysis inspired by the conversation analytical approach, I will show that dropping out from a turn is strongly linked to the availability displayed by potential recipients of a turn-at-talk. Although conversation analysis has described in detail the systematics of overlapping talk, especially of its onset (Jefferson 1973, 1983, 1986) and its resolution (Schegloff 2000; Jefferson 2004), the phenomenon of withdrawal from a turn due to simultaneous talk has not been investigated in detail. While it seems to be difficult to describe this interactional practice by referring exclusively to syntactic features (incompleteness of the turn), I suggest looking at turn withdrawal from a multimodal perspective (e.g. Goodwin 1980, 1981; Mondada 2007a; Schmitt 2005), taking into account visible resources like gaze or gesture. The problem of continuing or stopping a turn-in-progress in overlapping talk can be closely linked to the participation framework (Goodwin and Goodwin 2004), as speakers do visibly take into account their recipient’s availability and coordinate their turn construction with the dynamic changes of the participation framework and the interactional space.
Drawing on naturalistic video and audio recordings of international meetings, and within the framework of conversation analysis, ethnomethodology and interactional linguistics, this chapter studies how multilingual resources are mobilized in social interactions among professionals, how available linguistic and embodied resources are identified and used by the participants, which solutions are locally elaborated by them when they are confronted with various languages spoken but not shared among them, and which definition of multilingualism they adopt for all practical purposes. Focusing on the multilingual solutions emically elaborated in international professional meetings, we show that the participants orient to a double principle: on the one hand, they orient to the progressivity of the interaction, adopting all the possible resources that enable them to go on within the current activity; on the other hand, they orient to the intersubjectivity of the interaction, treating, preventing and repairing possible troubles and problems of understanding. Specific multilingual solutions can be adopted to keep this difficult balance between progressivity and intersubjectivity; they vary according to the settings, the competences at hand, the linguistic and embodied resources locally defined by the participants as publicly available, the multilingual resources treated as totally or partially shared, as transparent or opaque, and as needing repair or not. The paper begins by sketching the analytical framework, including the methodology and the data collected; it then presents some general findings, before offering an analysis of various ways in which participants keep the balance between progressivity and intersubjectivity in different multilingual interactional contexts.
Since Lerner coined the notion of delayed completion in 1989, this recurrent social practice of continuing one’s speaking turn while disregarding an intermediate co-participant’s utterance has not been investigated with regard to embodied displays and actions. A sequential approach to videotaped mundane conversations in German will explain the occurrence and use of delayed completions. First, especially in multi-party and multi-activity settings, delayed completions can result from reduced monitoring and coordinating activities. Second, recipients can use intra-turn response slots for more extended responsive actions than the current speaker initially projected, leading to delayed completion sequences. Finally, delayed completions are used for blocking possibly misaligned co-participant actions. The investigation of visible action illustrates that delayed completions are a basic practice for retrospectively managing co-participant response slots.
The use of digital resources and tools across humanities disciplines is steadily increasing, giving rise to new research paradigms and associated methods that are commonly subsumed under the term digital humanities. Digital humanities does not constitute a new discipline in itself, but rather a new approach to humanities research that cuts across different existing humanities disciplines. While digital humanities extends well beyond language-based research, textual resources and spoken language materials play a central role in most humanities disciplines.
The ISOcat registry reloaded
(2012)
The linguistics community is building a metadata-based infrastructure for the description of its research data and tools. At its core is the ISOcat registry, a collaborative platform to hold a (to be standardized) set of data categories (i.e., field descriptors). Descriptors have definitions in natural language and few explicit interrelations. With the registry growing to many hundreds of entries, authored by many, it is becoming increasingly apparent that the rather informal definitions and their glossary-like design make it hard for users to grasp, exploit and manage the registry’s content. In this paper, we take a large subset of the ISOcat term set and reconstruct from it a tree structure following the footsteps of schema.org. Our ontological re-engineering yields a representation that gives users a hierarchical view of linguistic, metadata-related terminology. The new representation adds to the precision of all definitions by making explicit information which is only implicitly given in the ISOcat registry. It also helps to uncover and address potential inconsistencies in term definitions as well as gaps and redundancies in the overall ISOcat term set. The new representation can serve as a complement to the existing ISOcat model, providing additional support for authors and users in browsing, (re-)using, maintaining, and further extending the community’s terminological metadata repertoire.
The transfer of research data management from one institution to another infrastructural partner is anything but trivial, but can be required, for instance, when an institution faces reorganization or closure. In a case study, we describe the migration of all research data, identify the challenges we encountered, and discuss how we addressed them. The case study shows that moving research data management to another institution is a feasible, but potentially costly enterprise. Being able to demonstrate the feasibility of research data migration supports the stance of data archives that users can expect high levels of trust and reliability when it comes to data safety and sustainability.
The chapter on formats and models for lexicons deals with different available data formats of lexical resources. It elaborates on their structure and possible uses. Motivated by the restrictions in merging different lexical resources based on widespread formalisms and international standards, a formal lexicon model for lexical resources is developed which is related to graph structures in annotations. For lexicons this model is termed the Lexicon Graph. Within this model the concepts of lexicon entries and lexical structures frequently described in the literature are formally defined and examples are given. The article addresses the problem of ambiguity in those formal terms. An implementation based on XML and XML technology such as XQuery for the defined structures is given. The relation to international standards is included as well.
This chapter focuses on the contributions of German scholars to two of the three main research questions that have defined EU studies. Leaving aside the debate on the drivers of European integration, i.e. European integration theory, we will discuss the «governance turn» Fritz Scharpf, Beate Kohler-Koch, Arthur Benz, Ingeborg Tömmel and others promoted in studying EU institutions as well as the more policy-oriented approaches by Adrienne Héritier and again Fritz Scharpf and their students. We will then address the ever-growing literature on Europeanization on how EU policies, institutions and political processes have been affecting the domestic structures of member states, membership candidates, as well as neighborhood and third countries. In this context, German scholars also contributed to EU studies in what could be coined in methodological rather than substantial terms. Whereas Thomas König, Gerald Schneider, and others promoted the application of quantitative approaches, scientists like Bernhard Ebbinghaus and Markus Haverland dealt with general questions on research designs like case selection and causal inference. Finally, we will also discuss German contributions to diffusion research. The European Union as a most likely case for the diffusion of policies has attracted considerable attention by scholars dealing with the question of when and how policies spread across time and space. So it comes as no surprise that EU studies as well as diffusion research mutually benefitted from each other. In this regard, German scientists like Katharina Holzinger, Christoph Knill, Tanja Börzel, Thomas Plümper, Thomas Risse and others played a prominent role, too.
Although there is a growing interest of policy makers in higher education issues (especially on an international scale), there is still a lack of theoretically well-grounded comparative analyses of higher education policy. Even broadly discussed topics in higher education research like the potential convergence of European higher education systems in the course of the Bologna Process suffer from a thin empirical and comparative basis. This paper aims to deal with these problems by addressing theoretical questions concerning the domestic impact of the Bologna Process and the role national factors play in determining its effects on cross-national policy convergence. It develops a distinct theoretical approach for the systematic and comparative analysis of cross-national policy convergence. In doing so, it relies upon insights from related research areas — namely literature on Europeanization as well as studies dealing with cross-national policy convergence.
This contribution gives an overview of CoDII, the Collection of Distributionally Idiosyncratic Items. CoDII is an electronic collection of various subgroups of lexical items that are characterized by idiosyncratic distribution. This means that the distribution of these lexemes in text cannot be predicted from their syntactic category alone. The methods applied in developing CoDII reach across traditional disciplinary boundaries and include corpus linguistics, computational linguistics, phraseology, and theoretical linguistics. An important focus of our discussion is on showing how the data collected in CoDII, which are annotated and can be queried with search tools, among other means, can support linguistic theory building in verifying its empirical basis by providing carefully prepared data collections.
Syntactic theory has tended to vacillate between implausible methodological extremes. Some linguists hold that our theories are accountable solely for the corpus of attested utterances; others assume our subject matter is unobservable intuitive feelings about sentences. Both extremes should be rejected. The subject matter of syntax is neither past utterance production nor the functioning of inaccessible mental machinery; it is normative - a system of tacitly grasped constraints defining correctness of structure. There are interesting parallels between syntactic and moral systems, modulo the key difference that linguistic systems are diverse whereas morality is universal. The appropriate epistemology for justifying formulations of normative systems is familiar in philosophy: it is known as the method of reflective equilibrium.
This paper argues that a lectometric approach may shed light on the distinction between destandardization and demotization, a pair of concepts that plays a key role in ongoing discussions about contemporary trends in standard languages. Instead of a binary distinction, the paper proposes three different types of destandardization, defined as quantitatively measurable changes in a stratigraphic language continuum. The three types are illustrated on the basis of a case study describing changes in the vocabulary of Dutch in The Netherlands and Flanders between 1990 and 2010.
On valence-binding grammars
(1978)
The valence of a verb determines the number, and the syntactic class, of those expressions that must co-occur with it in a sentence. Definitions of "valence-term" and "valence-boundness" are provided whereby the precise conditions are formulated that a valence-binding grammar must satisfy. These conditions are exemplified in the framework of a simple categorial grammar, in which various reductions of the general notions can be carried out.
The paper discusses particular logical consistency conditions satisfied by German proposition-embedding predicates which determine the question type (external and internal whether-form as well as exhaustive and non-exhaustive wh-form), the correlate type (es- or da-correlate) as well as the impact of the correlate on the respective consistency condition. It will turn out that some consistency conditions also determine the embedding of verb second and subject-control.
This paper focuses on the interaction of interrogativity and information structure in Slavic polarity questions where the clitic li may indicate interrogativity as well as focusation. We will see how the semantic category sentence force as well as the pragmatically induced information structuring are anchored syntactically and represented semantically. Even though we will introduce two notions of li for methodological reasons, there is only one li in each language. Within the framework of Rizzi's (1997) theory of the split C-Domain, we will see that li only occurs in Force[0] in Russian and Serbian/Croatian indicating that li is some kind of complementizer. In Bulgarian and Macedonian, on the other hand, li is generated more or less 'independently' from Force[0], but forces the constituent it is adjoined to to move up to FocP. We will further show that Rizzi's theory also accounts for the compositional derivation of meaning of yes/no-interrogatives with information or identificational focus.
Departing from Rooth's focus interpretation theory the article discusses two types of (German) ellipsis phenomena: direct alternative and implicit alternative coordinative ellipsis. For the first type, which includes Stripping, Gapping, ATB, and RNR, it is characteristic that the semantic value of either conjunct instantiates the context variable of the respective focus operator in the other. For German Polarity ellipsis and Sluicing, which constitute the other type, it is characteristic that the semantic value, which instantiates the variable given by the focus operator in the second conjunct, must be derived from the semantic value of the first conjunct and that the second conjunct always hosts an alternative set inducing item which demands new information focus in the first conjunct.
Perhaps the biggest challenge in derivational morphology is to reconcile morphological idiosyncrasy with semantic regularity. How can it be explained that words with dead affixes and irregular allomorphy can nonetheless exhibit straightforward and stable semantic relations to their etymological bases (cf. strength ‘property of being strong’, obedience ‘act of obeying’, ‘property of being obedient’)? Theories based on the idea of capturing regularity in terms of synthetic rules for building up complex words out of morphemes along with rules for interpreting such structures in a compositional fashion have not made - and arguably cannot make - sense of this phenomenon. Taking the perspective of the learner in acquisition, I propose an alternative approach to meaning assignment based, not on syntagmatic relations among their constituent morphemes, but on paradigmatic relations between whole words. This approach not only explains the conditions under which meaning relations between words are expected to be stable but also accounts for another notorious mystery in derivational morphology, the frequent occurrence of total synonymy among affixes, as opposed to words.
The planning of a dictionary should consider both theoretical and empirical aspects, for both its macro- and microstructure; this is also true for Online Specialized Dictionaries of Linguistics. In particular, the microstructure should be standardized and structured so as to fit the primary and secondary functions of a dictionary. Unfortunately, empirical studies that investigate Online Specialized Dictionaries of Linguistics are rare, making it unclear which microstructural elements are obligatory and which are facultative. This article will present and comment upon the results of an investigation into a corpus of Online Specialized Dictionaries of Linguistics, focusing attention on these aspects and also the most important theoretical issues. An example taken from DIL, a German-Italian Online Dictionary of Linguistics, will end the article.
DIL is a bilingual (German-Italian) online dictionary of linguistics. It is still under construction and contains 240 lemmas belonging to the subfield of “German as a Foreign Language”, but other subfields are in preparation. DIL is an open dictionary; participation of experts from various subfields is welcome. The dictionary is intended for a user group with different levels of knowledge; it is therefore a multifunctional dictionary. An analysis of existing dictionaries, in either their online or written form, was essential in order to make important decisions for the macro- and microstructure of DIL; the results are discussed. Criteria for the selection of entries and an example of an entry conclude the article.
Are borrowed neologisms accepted more slowly into the German language than German words resulting from the application of word formation rules? This study addresses this question by focusing on two possible indicators for the acceptance of neologisms: a) frequency development of 239 German neologisms from the 1990s (loanwords as well as new words resulting from the application of word formation rules) in the German reference corpus DEREKO and b) frequency development in the use of pragmatic markers (‘flags’, namely quotation marks and phrases such as sogenannt ‘so-called’) with these words. In the second part of the article, a psycholinguistic approach to evaluating the (psychological) status of different neologisms and non-words in an experimentally controlled study and plans to carry out interviews in a field test to collect speakers’ opinions on the acceptance of the analysed neologisms are outlined. Finally, implications for the lexicographic treatment of both types of neologisms are discussed.
Construction-based language models assume that grammar is meaningful and learnable from experience. Focusing on five of the most elementary argument structure constructions of English, a large-scale corpus study of child-directed speech (CDS) investigates exactly which meanings/functions are associated with these patterns in CDS, and whether they are indeed specially indicated to children by their caretakers (as suggested by previous research, cf. Goldberg, Casenhiser and Sethuraman 2004). Collostructional analysis (Stefanowitsch and Gries 2003) is employed to uncover significantly attracted verb-construction combinations, and attracted pairs are classified semantically in order to systematise the attested usage patterns of the target constructions. The results indicate that the structure of the input may aid learners in making the right generalisations about constructional usage patterns, but such scaffolding is not strictly necessary for construction learning: not all argument structure constructions are coherently semanticised to the same extent (in the sense that they designate a single schematic event type of the kind envisioned in Goldberg’s [1995] ‘scene encoding hypothesis’), and they also differ in the extent to which individual semantic subtypes predominate in learners’ input.
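Attraction between a verb and a construction is often quantified with a one-sided Fisher exact test on a 2x2 frequency table. The sketch below shows that computation under the assumption that this statistic is the one applied; the function name and the example frequencies are invented for illustration and are not data from the study.

```python
from math import comb


def attraction_p_value(a, b, c, d):
    """One-sided Fisher exact p-value for a verb-construction pair.

    a: verb inside the construction      b: verb in other constructions
    c: other verbs in the construction   d: other verbs elsewhere
    Returns the probability of observing at least `a` co-occurrences
    if verb and construction were independent (hypergeometric tail).
    """
    n = a + b + c + d
    verb_total = a + b
    construction_total = a + c
    upper = min(verb_total, construction_total)
    tail = sum(
        comb(verb_total, k) * comb(n - verb_total, construction_total - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, construction_total)


if __name__ == "__main__":
    # Invented frequencies: the first pair co-occurs far more often than chance predicts.
    print(attraction_p_value(30, 10, 70, 890))
    print(attraction_p_value(4, 36, 96, 864))
```

A smaller p-value indicates stronger attraction; pairs below a chosen significance threshold would count as significantly attracted.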
This paper approaches the question of how language change can be observed and described empirically: it traces the language biographies of Americans of German descent from Wisconsin. These case studies, two of which are examined in more detail here, reveal quite different developments within a speaker's lifetime. The retention and gradual change in the language use of a Swiss German speaker contrasts with the almost complete loss of German language competence in a Low German speaker.
To reconstruct these processes of change in real time, the method of re-recording is presented: the renewed recording of speakers who were already recorded once in earlier sound-recording campaigns in Wisconsin (here: 1968 and 2001). First results of the underlying linguistic analyses are presented in tables.
Psychological research has neglected people whose accent does not match their appearance. Most research on person perception has focused on appearance, overlooking accents that are equally important social cues. If accents were studied, it was often done in isolation (i.e., detached from appearance). We examine how varying accent and appearance information about people affects evaluations. We show that evaluations of expectancy-violating people shift in the direction of the added information. When a job candidate looked foreign, but later spoke with a native accent, his evaluations rose and he was evaluated best of all candidates (Experiment 1a). However, the sequence in which information was presented mattered: When heard first and then seen, his evaluations dropped (Experiment 1b). Findings demonstrate the importance of studying the combination and sequence of different types of information in impression formation. They also allow predicting reactions to ethnically mixed people, who are increasingly present in modern societies.
The present thesis investigates the syntagmatic relations of certain Finnish emotion verbs that are formed by the derivational suffix -ua/-yä (e.g. suuttua ‘get angry’, pelästyä ‘get frightened’). Prototypically, the suffix expresses reflexivity, but in the case of the “inchoative” emotion verbs, it indicates a change of state on behalf of the experiencer, from a non-emotional state to an emotional state.
Sentiment Analysis is the task of extracting and classifying opinionated content in natural language texts. Common subtasks are the distinction between opinionated and factual texts, the classification of polarity in opinionated texts, and the extraction of the participating entities of an opinion(-event), i.e. the source from which an opinion emanates and the target towards which it is directed. With the emerging Web 2.0, which describes the shift towards a highly user-interactive communication medium, the amount of subjective content on the World Wide Web is steadily increasing. Thus, there is a growing need for automatically processing this type of content, which is provided by sentiment analysis. Both natural language processing, which is the task of providing computational methods for the analysis and representation of natural language, and machine learning, which is the task of building task-specific classification models on the basis of empirical data, may be instrumental in mastering the challenges of the automatic sentiment analysis of written text. Many problems in sentiment analysis have been proposed to be solved with machine learning methods exclusively using a fairly low-level feature design, such as bag of words, containing little linguistic information. In this thesis, we examine the effectiveness of linguistic features in various subtasks of sentiment analysis. We therefore draw heavily on the insights gained by natural language processing. Linguistic features can be applied in various classification methods, be it in rule-based classification, where the linguistic features are directly encoded as a classifier, in supervised machine learning, where these features complement basic low-level features, or in bootstrapping methods, where these features form a rule-based classifier generating a labeled training set from which a supervised classifier can be trained.
In this thesis, we will in particular focus on scenarios where the combination of linguistic features and machine learning methods is effective. We will look at common text classification tasks, both coarse-grained and fine-grained, and extraction tasks.
While good results have been achieved for named entity recognition (NER) in supervised settings, it remains a problem that for low resource languages and less studied domains little or no labelled data is available. As NER is a crucial preprocessing step for many natural language processing tasks, finding a way to overcome this deficit in data remains of great interest. We propose a distant supervision approach to NER that is both language and domain independent where we automatically generate labelled training data using gazetteers that we previously extracted from Wikipedia. We test our approach on English, German and Estonian data sets and contribute further by introducing several successful methods to reduce the noise in the generated training data. The tested models beat baseline systems and our results show that distant supervision can be a promising approach for NER when no labelled data is available. For the English model we also show that the distant supervision model is better at generalizing within the same domain of news texts by comparing it against a supervised model on a different test set.
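The core step of distant supervision for NER, generating labelled training data by matching text against gazetteers, can be sketched as follows. This is an illustrative sketch only: the BIO tagging scheme, the greedy longest-match strategy, the `max_len` limit and the tiny gazetteer are assumptions for the example, and the noise-reduction methods mentioned in the abstract are not shown.

```python
def label_tokens(tokens, gazetteer, max_len=3):
    """Assign BIO labels to tokens by greedy longest match against a gazetteer.

    gazetteer maps an entity string (tokens joined by single spaces)
    to an entity type such as "PER" or "LOC".
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                entity_type = gazetteer[phrase]
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


if __name__ == "__main__":
    # Invented mini-gazetteer; the paper extracts its gazetteers from Wikipedia.
    gazetteer = {"Angela Merkel": "PER", "Tallinn": "LOC"}
    print(label_tokens("Angela Merkel visited Tallinn".split(), gazetteer))
```

Sentences labelled this way can then be used to train a standard sequence tagger, with the caveat that unmatched entities and ambiguous gazetteer entries introduce noise.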
Opinion holder extraction is one of the most important tasks in sentiment analysis. We will briefly outline the importance of predicates for this task and categorize them according to part of speech and according to which semantic role they select for the opinion holder. For many languages there do not exist semantic resources from which such predicates can be easily extracted. Therefore, we present alternative corpus-based methods to gain such predicates automatically, including the usage of prototypical opinion holders, i.e. common nouns, denoting for example experts or analysts, which describe particular groups of people whose profession or occupation is to form and express opinions towards specific items.
In natural language processing, question answering systems have gained considerably in importance over the last decade. Above all, robust tools such as statistical syntactic parsers and named entity recognizers have made it possible to obtain linguistically structured information from unannotated text corpora. In addition, the Text REtrieval Conference (TREC) annually defines benchmarks for general, domain-independent question answering scenarios. As a rule, question answering systems only work well if they implement robust methods for the different question types that occur in a question set. One characteristic question type is the so-called event question. Although events have been the subject of intensive research in theoretical linguistics, especially in sentence semantics, since the middle of the last century, they have so far remained largely unexplored with respect to question answering systems. This diploma thesis therefore addresses this problem. One aim of this work is a characterization of event structure in question answering systems, to be developed by taking into account theoretical linguistics as well as an analysis of the TREC 2005 question set. The other aim is to design and implement an event-based answer extraction method that builds on the results of this analysis. Information from various linguistic levels is to be integrated, in a data-driven manner, into a uniform model. Special linguistic resources, such as WordNet and subcategorization lexicons, will play a central role. Furthermore, an event structure will be presented that allows events to be matched regardless of whether they are evoked by full verbs or by nominalizations.
With the implementation of an event-based answer extraction module, the thesis ultimately also aims to answer the question of whether explicit event modelling can improve the performance of a question answering system.
Just like most varieties of West Germanic, virtually all varieties of German use a construction in which a cognate of the English verb 'do' (standard German 'tun') functions as an auxiliary and selects another verb in the bare infinitive, a construction known as 'do'-periphrasis or 'do'-support. The present paper provides an Optimality Theoretic (OT) analysis of this phenomenon. It builds on a previous analysis by Bader and Schmid (An OT-analysis of 'do'-support in Modern German, 2006) but (i) extends it from root clauses to subordinate clauses and (ii) aims to capture all of the major distributional patterns found across (mostly non-standard) varieties of German. In so doing, the data are used as a testing ground for different models of German clause structure. At first sight, the occurrence of 'do' in subordinate clauses, as found in many varieties, appears to support the standard CP-IP-VP analysis of German. In actual fact, however, the full range of data turn out to challenge, rather than support, this model. Instead, I propose an analysis within the IP-less model by Haider (Deutsche Syntax - generativ. Vorstudien zur Theorie einer projektiven Grammatik, Narr, Tübingen, 1993 et seq.). In sum, the 'do'-support data will be shown to have implications not only for the analysis of clause structure but also for the OT constraints commonly assumed to govern the distribution of 'do', for the theory of non-projecting words (Toivonen in Non-projecting words, Kluwer, Dordrecht, 2003) as well as research on grammaticalization.
In this paper, I present some aspects of a youth group’s construction of a communicative style and show how the group’s stylistic repertoire changes over the course of their growing into adulthood. My paper is based on an ethnographic case study of a group of Turkish girls, the ‘Powergirls’, who grew up in a typical Turkish migrant neighborhood in the inner city of Mannheim, Germany. The aim of the case study was, on the basis of biographical interviews with group members and long-term observation of group interactions, to reconstruct the formation of an ethnically defined ‘ghetto’-clique and its style of communication and to describe the group’s development into educated, modern, German-Turkish young women. In this process, a change in the group’s stylistic repertoire could be observed.
We present a technique called event mapping that allows us to project text representations into event lists, produce an event table, and derive quantitative conclusions for comparing the text representations. The main application of the technique is the case where two classes of text representations have been collected in two different settings (e.g., as annotations in two different formal frameworks) and the two classes are to be compared with respect to their systematic differences in the event table. We illustrate how the technique works by applying it to data collected in two experiments (one using annotations in Vladimir Propp’s framework, the other using natural language summaries).
We explore the feasibility of contextual healthiness classification of food items. We present a detailed analysis of the linguistic phenomena that need to be taken into consideration for this task based on a specially annotated corpus extracted from web forum entries. For automatic classification, we compare a supervised classifier and rule-based classification. Beyond linguistically motivated features that include sentiment information we also consider the prior healthiness of food items.
One problem of data-driven answer extraction in open-domain factoid question answering is that the class distribution of labeled training data is fairly imbalanced. In an ordinary training set, there are far more incorrect answers than correct answers. The class-imbalance is, thus, inherent to the classification task. It has a deteriorating effect on the performance of classifiers trained by standard machine learning algorithms. They usually have a heavy bias towards the majority class, i.e. the class which occurs most often in the training set. In this paper, we propose a method to tackle class imbalance by applying some form of cost-sensitive learning which is preferable to sampling. We present a simple but effective way of estimating the misclassification costs on the basis of class distribution. This approach offers three benefits. Firstly, it maintains the distribution of the classes of the labeled training data. Secondly, this form of meta-learning can be applied to a wide range of common learning algorithms. Thirdly, this approach can be easily implemented with the help of state-of-the-art machine learning software.
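A minimal sketch of how misclassification costs might be derived from the class distribution is given below. The abstract does not state the estimation formula, so the inverse-frequency ratio used here is an assumption, and the function name `estimate_costs` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def estimate_costs(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Estimate a per-class misclassification cost from class frequencies.

    Assumed heuristic: the cost of misclassifying an instance of a class is
    the ratio of the majority-class count to that class's count, so rare
    classes receive proportionally higher costs.
    """
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority / n for cls, n in counts.items()}


# Toy training set: nine incorrect answer candidates for each correct one.
train_labels = ["incorrect"] * 9 + ["correct"] * 1
print(estimate_costs(train_labels))
```

The training data itself is left unchanged, in contrast to sampling. Costs of this kind can typically be handed to a learner as class weights, for instance through the `class_weight` parameter that many scikit-learn classifiers accept; which software the authors used is not stated in the abstract.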
In recent years, text classification in sentiment analysis has mostly focused on two types of classification, the distinction between objective and subjective text, i.e. subjectivity detection, and the distinction between positive and negative subjective text, i.e. polarity classification. So far, there has been little work examining the distinction between definite polar subjectivity and indefinite polar subjectivity. While the former are utterances which can be categorized as either positive or negative, the latter cannot be categorized as either of these two categories. This paper presents a small set of domain independent features to detect indefinite polar sentences. The features reflect the linguistic structure underlying these types of utterances. We give evidence for the effectiveness of these features by incorporating them into an unsupervised rule-based classifier for sentence-level analysis and compare its performance with supervised machine learning classifiers, i.e. Support Vector Machines (SVMs) and Nearest Neighbor Classifier (kNN). The data used for the experiments are web-reviews collected from three different domains.
Traditionally, research on language change has been a post-mortem activity, focused on isolated changes that are complete and often only documented in written texts. In the 1960s the field was advanced considerably by Labovian sociolinguistics and the investigation of “change in progress” adduced through patterns of community-internal linguistic variation correlated with external facts about speakers such as age and class (see Labov 1994 for an overview). However, despite the many benefits of such work on “dynamic synchrony,” we still know relatively little about how language change unfolds over the lifetimes of individual speakers, that is, in real time (cf. Bailey et al. 1991). The logistical challenges of such research are, of course, considerable. Whereas it is straightforward for psycholinguists to observe language development in children over the course of a few years, documenting changes in the verbal behavior of individuals over several decades is by contrast much less feasible. Nevertheless, present theoretical models of language change could be considerably improved by the results of real-time studies.
In this paper, we investigate the role of predicates in opinion holder extraction. We will examine the shape of these predicates, investigate what relationship they bear towards opinion holders, determine what resources are potentially useful for acquiring them, and point out limitations of an opinion holder extraction system based on these predicates. For this study, we will carry out an evaluation on a corpus annotated with opinion holders. Our insights are, in particular, important for situations in which no labelled training data are available and only rule-based methods can be applied.
In this paper, we compare three different generalization methods for in-domain and cross-domain opinion holder extraction: simple unsupervised word clustering, an induction method inspired by distant supervision, and the usage of lexical resources. The generalization methods are incorporated into diverse classifiers. We show that generalization causes significant improvements and that the size of the improvement depends on the type of classifier and on how much training and test data differ from each other. We also address the less common case of opinion holders being realized in patient position and suggest approaches, including a novel (linguistically-informed) extraction method, for detecting those opinion holders without labeled training data, as standard datasets contain too few instances of this type.
We present the German Sentiment Analysis Shared Task (GESTALT) which consists of two main tasks: Source, Subjective Expression and Target Extraction from Political Speeches (STEPS) and Subjective Phrase and Aspect Extraction from Product Reviews (StAR). Both tasks focused on fine-grained sentiment analysis, extracting aspects and targets with their associated subjective expressions in the German language. STEPS focused on political discussions from a corpus of speeches in the Swiss parliament. StAR fostered the analysis of product reviews as they are available from the website Amazon.de. Each shared task led to one participating submission, providing baselines for future editions of this task and highlighting specific challenges. The shared task homepage can be found at https://sites.google.com/site/iggsasharedtask/.
We present an experimental approach to determining natural dimensions of story comparison. The results show that untrained test subjects generally do not privilege structural information. When asked to justify sameness ratings, they may refer to content, but when asked to state differences, they mostly refer to style, concrete events, details and motifs. We conclude that adequate formal models of narratives must represent such non-structural data.
From Proof Texts to Logic. Discourse Representation Structures for Proof Texts in Mathematics
(2009)
We present an extension to Discourse Representation Theory that can be used to analyze mathematical texts written in the commonly used semi-formal language of mathematics (or at least a subset of it). Moreover, we describe an algorithm that can be used to check the resulting Proof Representation Structures for their logical validity and adequacy as a proof.
Repairs for Reasoning
(2013)
We describe and experimentally investigate phenomena of modal enrichment, that is, phenomena in which a recipient non-literally interprets an utterance by creating and applying a modal operator. We give competing explanations for these phenomena - namely an explanation according to which modal enrichment is a repair procedure for making the utterance match a script of information processing vs. an explanation according to which modal enrichment is triggered by rhetorical structure.
Accentuation, Uncertainty and Exhaustivity - Towards a Model of Pragmatic Focus Interpretation
(2010)
This paper presents a model of pragmatic focus interpretation that is assumed to be part of a complete language comprehension model and that is inspired by Levelt's language processing model. The model is derived from our empirical data on the role of accentuation, prosodic indicators of uncertainty and context for pragmatic focus interpretation. In its present state, the model is restricted to these data, but nevertheless generates predictions.
A formal narrative representation is a procedure assigning a formal description to a natural language narrative. One of the goals of the computational models of narrative community is to understand this procedure better in order to automatize it. A formal framework fit for automatization should allow for objective and reproducible representations. In this paper, we present empirical work focussing on objectivity and reproducibility of the formal framework by Vladimir Propp (1928). The experiments consider Propp’s formalization of Russian fairy tales and formalizations done by test subjects in the same formal framework; the data show that some features of Propp’s system such as the assignment of the characters to the dramatis personae and some of the functions are not easy to reproduce.
We compare the use of überhaupt and sowieso in Dutch and German. We use the world-wide web as the main resource and pursue a zigzag strategy, trying to find usages going back and forth between dictionaries, intuitions and real data obtained through web search. To our surprise, the results more or less confirm the decision of Dutch dictionaries to consider überhaupt and sowieso synonymous. In German, we find no synonymy, but only a great overlap of usage conditions in declarative sentences.
Knowledge in textual form is always presented as visually and hierarchically structured units of text, which is particularly true in the case of academic texts. One research hypothesis of the ongoing project Knowledge ordering in texts - text structure and structure visualisations as sources of natural ontologies is that the textual structure of academic texts effectively mirrors essential parts of the knowledge structure that is built up in the text. The structuring of a modern dissertation thesis (e.g. in the form of an automatically generated table of contents - tocs), for example, represents a compromise between requirements of the text type and the methodological and conceptual structure of its subject-matter. The aim of the project is to examine how visual-hierarchical structuring systems are constructed, how knowledge structures are encoded in them, and how they can be exploited to automatically derive ontological knowledge for navigation, archiving, or search tasks. The idea of extracting domain concepts and semantic relations mainly from the structural and linguistic information gathered from tables of contents represents a novel approach to ontology learning.
This paper presents challenges and opportunities resulting from the application of geographical information systems (GIS) in the (digital) humanities. First, we provide an overview of the intersection and interaction between geography (and cartography), and the humanities. Second, the “GeoBib” project is used as a case study to exemplify challenges for such collaborative, interdisciplinary projects, both for the humanists and the geoscientists. Finally, we conclude with an outlook on further applications of GIS in the humanities, and the potential scientific benefit for both sides, humanities and geosciences.
In the project SemDok (Generic document structures in linearly organised texts), funded by the German Research Foundation (DFG), a discourse parser for a complex text type (scientific articles, for example) is being developed. Discourse parsing (henceforth DP) according to Rhetorical Structure Theory (RST) (Mann and Taboada, 2005; Marcu, 2000) deals with automatically assigning a text a tree structure in which discourse segments and the rhetorical relations between them, such as Concession, are marked. For identifying the combinable segments, declarative rules are employed, which describe linguistic and structural cues and constraints about possible combinations by referring to different XML annotation layers of the input text, and to external knowledge bases such as a discourse marker lexicon, a lexico-semantic ontology (later to be combined with a domain ontology), and an ontology of rhetorical relations. In our text-technological environment, the obvious choice of formalism to represent such ontologies is OWL (Smith et al., 2004). In this paper, we describe two OWL ontologies and how they are consulted by the discourse parser to solve certain tasks within DP. The first ontology is a taxonomy of rhetorical relations which was developed in the project. The second one is an OWL version of GermaNet, the model of which we designed together with our project partners.