Refine
Year of publication
Document Type
- Part of a Book (581)
- Conference Proceeding (561)
- Article (453)
- Book (66)
- Working Paper (25)
- Doctoral Thesis (21)
- Other (18)
- Part of Periodical (12)
- Preprint (12)
- Contribution to a Periodical (6)
Language
- English (1764) (remove)
Keywords
- Korpus <Linguistik> (416)
- Deutsch (409)
- Computerlinguistik (161)
- Konversationsanalyse (138)
- Interaktion (116)
- Englisch (112)
- Annotation (97)
- Gesprochene Sprache (93)
- Automatische Sprachanalyse (75)
- Wörterbuch (73)
Publicationstate
- Veröffentlichungsversion (960)
- Zweitveröffentlichung (248)
- Postprint (236)
- Ahead of Print (6)
- Preprint (5)
- Erstveröffentlichung (2)
Reviewstate
- Peer-Review (828)
- (Verlags)-Lektorat (409)
- Peer-review (24)
- Qualifikationsarbeit (Dissertation, Habilitationsschrift) (18)
- Verlags-Lektorat (14)
- Peer-Revied (8)
- Review-Status-unbekannt (6)
- Abschlussarbeit (Bachelor, Master, Diplom, Magister) (Bachelor, Master, Diss.) (3)
- (Verlags-)Lektorat (2)
- Peer review (2)
Publisher
- de Gruyter (104)
- Benjamins (87)
- IDS-Verlag (81)
- Springer (63)
- European Language Resources Association (ELRA) (56)
- Association for Computational Linguistics (46)
- European Language Resources Association (42)
- Oxford University Press (35)
- Elsevier (33)
- Institut für Deutsche Sprache (33)
Speakers’ dialogical orientation to the particular others they talk to is implemented by practices of recipient-design. One such practice is the use of negation as a means to constrain interpretations of speaker’s actions by the partner. The paper situates this use of negation within the larger context of other recipient-designed uses of negation which negate assumptions the speaker makes about what the addressee holds to be true (second-order assumptions) or what the addressee assumes the speaker holds to be true (third- order assumptions). The focus of the study is on the ways in which speakers use negation to disclaim interpretations of their turns which partners have displayed or may possibly arrive at. Special emphasis is given to the positionally sensitive uses of negation, which may occur before, after or inserted between the nucleus actions whose interpretation is constrained by the negation. Interactional motivations and rhetorical potentials of the practice are pointed out, partly depending on the position of the negation vis-à-vis the nucleus action. The analysis shows that the concept of ‘recipient design’ is in need of distinctions which have not been in focus in prior research.
"Standard language" is a contested concept, ideologically, empirically and theoretically. This is particularly true for a language such as German, where the standardization of the spoken language was based on the written standard and was established with respect to a communicative situation, i.e. public speech on stage (Bühnenaussprache), which most speakers never come across. As a consequence, the norms of the oral standard exhibit many features which are infrequent in the everyday speech even of educated speakers. This paper discusses ways to arrive at a more realistic conception of (spoken) standard German, which will be termed "standard usage". It must be founded on empirical observations of speakers linguistic choices in everyday situations. Arguments in favor of a corpus-based notion of standard have to consider sociolinguistic, political, and didactic concerns. We report on the design of a large study of linguistic variation conducted at the Institute for the German Language (project "Variation in Spoken German", Variation des gesprochenen Deutsch) with the aim of arriving at a representative picture of "standard usage" in contemporary German. It systematically takes into account both diatopic variation covering the multi-national space in which German an official language, and diastratic variation in terms of varying degrees of formality. Results of the study of phonetic and morphosyntactic variation are discussed. At least for German, a corpus-based notion of "standard usage" inevitably includes some degree of pluralism concerning areal variation, and it needs to do justice to register-based variation as well.
With the advent of mobile devices, mediatized political discourse became more dynamic. I assume that the microblog Twitter can be considered as a medium for spatial coordination during protests. Therefore, the case of neo-Nazi demonstrations and counter-protests in the city of Dresden that occurred in February 2012 is analysed. Data consists of microposts that occurred during the event. Quantitative analysis of hashtag and retweet frequencies was performed as well as qualitative speech act pattern analysis and a tempo-spatial discourse analysis on selected subsets of microposts. Results show that a common linguistic practice is verbal georeferencing and by that constructing space. Empirical analysis indicates a strong relation between communicational online space and physical offline place: Protest participants permanently reconfigure spatial context discursively and thus the contested protest area becomes a temporarily meaningful place.
"What makes this so complicated?" On the value of disorienting dilemmas in language instruction
(2017)
This thesis is a corpus linguistic investigation of the language used by young German speakers online, examining lexical, morphological, orthographic, and syntactic features and changes in language use over time. The study analyses the language in the Nottinghamer Korpus deutscher YouTube‐Sprache ("Nottingham corpus of German YouTube language", or NottDeuYTSch corpus), one of the first large corpora of German‐language comments taken from the videosharing website YouTube, and built specifically for this project. The metadatarich corpus comprises c.33 million tokens from more than 3 million comments posted underneath videos uploaded by mainstream German‐language youthorientated YouTube channels from 2008‐2018.
The NottDeuYTSch corpus was created to enable corpus linguistic approaches to studying digital German youth language (Jugendsprache), having identified the need for more specialised web corpora (see Barbaresi 2019). The methodology for compiling the corpus is described in detail in the thesis to facilitate future construction of web corpora. The thesis is situated at the intersection of Computer‐Mediated Communication (CMC) and youth language, which have been important areas of sociolinguistic scholarship since the 1980s, and explores what we can learn from a corpus‐driven, longitudinal approach to (online) youth language. To do so, the thesis uses corpus linguistic methods to analyse three main areas:
1. Lexical trends and the morphology of polysemous lexical items. For this purpose, the analysis focuses on geil, one of the most iconic and productive words in youth language, and presents a longitudinal analysis, demonstrating that usage of geil has decreased, and identifies lexical items that have emerged as potential replacements. Additionally, geil is used to analyse innovative morphological productiveness, demonstrating how different senses of geil are used as a base lexeme or affixoid in compounding and derivation.
2. Syntactic developments. The novel grammaticalization of several subordinating conjunctions into both coordinating conjunctions and discourse markers is examined. The investigation is supported by statistical analyses that demonstrate an increase in the use of non‐standard syntax over the timeframe of the corpus and compares the results with other corpora of written language.
3. Orthography and the metacommunicative features of digital writing. This analysis identifies orthographic features and strategies in the corpus, e.g. the repetition of certain emoji, and develops a holistic framework to study metacommunicative functions, such as the communication of illocutionary force, information structure, or the expression of identities. The framework unifies previous research that had focused on individual features, integrating a wide range of metacommunicative strategies within a single, robust system of analysis.
By using qualitative and computational analytical frameworks within corpus linguistic methods, the thesis identifies emergent linguistic features in digital youth language in German and sheds further light on lexical and morphosyntactic changes and trends in the language of young people over the period 2008‐2018. The study has also further developed and augmented existing analytical frameworks to widen the scope of their application to orthographic features associated with digital writing.
Positioning analysis, a variant of discourse analysis, was used to explore the narratives of 40 psychiatric patients (11 females and 29 males; mean age = 40 years) who had manifest difficulties with engagement with statutory mental health services. Positioning analysis is a qualitative method that captures how people linguistically position the roles and identities of themselves and others in their day-to-day lives and narratives. The language of disengagement incorporated the passive positioning of self in relation to their lives and treatment through the use of metaphor, the passive voice and them and us attribution, while the discourse of engagement incorporated more active positioning of self, achieved through the use of the personal pronoun we and metaphoric references to balanced relationships. The findings corroborate previous thematic analysis that highlighted the importance of identity and agency in the ‘making or breaking’ of therapeutic relationships (Priebe et al. 2005). Implications are discussed in relation to how positioning analysis may help signal and emphasize important life and therapeutic experiences in spoken narratives as well as clinical consultations.
The purpose of this paper is to describe the functions of ‘where’-based relative elements' in six Balkan languages, paying particular attention to non-standard varieties.2 Relative elements based on an originally interrogative pronoun meaning ‘where’ are attested in all Balkan languages and, more generally, in all European languages. In accordance with the locative meaning of the original pronoun, ‘where’-based relative elements are primarily used to relativize locatives. However, it will be shown that in some Balkan languages, and especially in non-standard varieties, these elements have extended their functional domain. This process does not appear to be random, but rather to pattern with the following hierarchy: locative > unspecific connector > other syntactic positions (indirect/direct object, subject).3 Additionally, ‘where’-based relative elements will be compared with ‘what’-based ones in order to highlight common patterns of development.
The present investigation targets the phenomenon commonly called control. Many languages including German and Polish employ non-finite clauses (besides finite clauses) as propositional complements. The subject of these complement clauses is left unexpressed and must generally be interpreted co-referentially with the subject or object of the matrix clause (subject or object control). However. there are also infinitive-selecting verbs that do not allow for a co- referential interpretation of the embedded subject - semantically, the embedded infinitives of these anti-control verbs are thus less dependent on or less unifiable with the matrix proposition. In Polish anti-control constructions, non-finite complements are overtly marked with the complementizer zeby, suggesting that they are structurally more complex (namely. containing a C-projection) than the non-finite complements in control constructions lacking zeby (modulo special contexts. viz. 'control switch'). In a comparative perspective, the paper brings corpuslinguistic and experimental evidence to bear on the question whether surface appearances notwithstanding, the infinitival complements of anti-control verbs in German should similarly be analyzed as truly sentential, i.e., C-headed structures.
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.
This study explores how ‘gatherings’ turn into ‘encounters’ in a virtual world (VW) context. Most communication technologies enable only focused encounters between distributed participants, but in VWs both gatherings and encounters can occur. We present close sequential analysis of moments when after a silent gathering, interaction among participants in a VW is gradually resumed, and also investigate the social actions in the verbal (re-)opening turns. Our findings show that like in face-to-face situations, also in VWs participants often use different types of embodied resources to achieve the transition, rather than rely on verbal means only. However, the transition process in VWs has distinctive characteristics compared to the one in face-to-face situations. We discuss how participants in a VW use virtually embodied pre-beginnings to display what we call encounter-readiness, instead of displaying lack of presence by avatar stillness. The data comprise 40 episodes of video-recorded team interactions in a VW.
This conference booklet provides information about 10th International Contrastive Linguistics Conference (ICLC-10) that took place in Mannheim, Germany, from 18 to 21 July 2023. It contains
– a description of the conference aims,
– details on the conference venue,
– information on committees,
– the conference program,
– the abstracts of the keynotes, oral and poster presentations, and
– an author index.
This paper focusss on the first Slavonic-Romanian lexicons, compiled in the second half of the 17th century and their use(rs), proposing a method of investigating the manner in which lexical information available in the above corpus relates, if at all, to the vocabulary of texts from the same period. We chose to investigate their relation to an anonymous Old Testament translation made from Church Slavonic, also from the second half of the 17th century, which was supposed to be produced in the same geographical area, in the same Church Slavonic school or even by the same author as the lexicons. After applying a lemmatizer on both the Biblical text (Books of Genesis and Daniel) and the Romanian material from the lexicons, we analyse the results and double the statistical analysis with a series of case studies, focusing on some common lexemes that might be an indicator of the relatedness of the texts. Even if the analysis points out that the lexicons might not have been compiled as a tool for the translation of religious texts, it proves to be a useful method that reveals interesting data and provides the basis for more extensive approaches.
This paper presents the application of the <tiger2/> format to various linguistic scenarios with the aim of making it the standard serialisation for the ISO 24615 [1] (SynAF) standard. After outlining the main characteristics of both the SynAF metamodel and the <tiger2/> format, as extended from the initial Tiger XML format [2], we show through a range of different language families how <tiger2/> covers a variety of constituency and dependency based analyses.
A "polyglottal" speech synthesis - modifications for a replica of Kempelen's speaking machine
(2019)
This introductory tutorial describes a strictly corpus-driven approach for uncovering indications for aspects of use of lexical items. These aspects include ‘(lexical) meaning’ in a very broad sense and involve different dimensions, they are established in and emerge from respective discourses. Using data-driven mathematical-statistical methods with minimal (linguistic) premises, a word’s usage spectrum is summarized as a collocation profile. Self-organizing methods are applied to visualize the complex similarity structure spanned by these profiles. These visualizations point to the typical aspects of a word’s use, and to the common and distinctive aspects of any two words.
In the management of cooperation, the fit of a requested action with what the addressee is presently doing is a pervasively relevant consideration. We present evidence that imperative turns are adapted to, and reflexively create, contexts in which the other person is committed to the course of action advanced by the imperative. This evidence comes from systematic variation in the design of imperative turns, relative to the fittedness of the imperatively mandated action to the addressee’s ongoing trajectory of actions, what we call the “dine of commitment”. We present four points on this dine: Responsive imperatives perform an operation on the deontic dimension of what the addressee has announced or already begun to do (in particular its permissibility); local-project imperatives formulate a new action advancing a course of action in which the addressee is already actively engaged; global-project-imperatives target a next task for which the addressee is available on the grounds of their participation in the overall event, and in the absence of any competing work; and competitive imperatives draw on a presently otherwise engaged addressee on the grounds of their social commitment to the relevant course of actions. These four turn shapes are increasingly complex, reflecting the interactional work required to bridge the increasing distance between what the addressee is currently doing, and what the imperative mandates. We present data from German and Polish informal and institutional settings.
This manual introduces a conversation analytically informed coding scheme for episodes involving the direct social sanctioning of problem behavior in informal social interaction which was developed in the project Norms, Rules, and Morality across Languages (NoRM-aL) at the Leibniz-Institute for the German Language. It outlines the background for its development, delimits the phenomena to which the coding scheme can be applied and provides instructions for its use.
The scheme asks for basic information about the recording and the participants involved in the episode, before taking stock of different features of the sanctioning episode as a whole. This is followed by sets of specific coding questions about the sanctioning move itself (such as its timing and composition) and the reaction it engenders. The coding enables researchers to get a bird’s eye view on recurrent features of such episodes in larger quantities of data and allows for comparisons across different languages and informal settings.
To build a comparable Wikipedia corpus of German, French, Italian, Norwegian, Polish and Hungarian for contrastive grammar research, we used a set of XSLT stylesheets to transform the mediawiki anntations to XML. Furthermore, the data has been amnntated with word class information using different taggers. The outcome is a corpus with rich meta data and linguistic annotation that can be used for multilingual research in various linguistic topics.
A comparison between morphological complexity measures: typological data vs. language corpora
(2016)
Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.
Authors like Fillmore 1986 and Goldberg 2006 have made a strong case for regarding argument omission in English as a lexical and construction-based affordance rather than one based on general semantico-pragmatic constraints. They do not, however, address the question of how grammatical restrictions on null complementation might interact with broader narrative conventions, in particular those of genre. In this paper, we attempt to remedy this oversight by presenting a comprehensive overview of genre-based argument omissions and offering a construction-based analysis of genre-based omission conventions. We consider five genre-based omission types: instructional imperatives (Culy 1996, Bender 1999), labelese, diary style (Haegeman 1990), match reports (Ruppenhofer 2004) and quotative clauses. We show that these omission types share important traits; all, for example, have anaphoric rather than indefinite construals. We also show, however, that the omission types differ from each other in idiosyncratic ways. We then address several interrelated representational problems posed by the grammatical treatment of genre-based omissions. For example, the constructions that represent genre-based omission conventions must interact with the lexical entries of verbs, many of which do not generally permit omitted arguments. Accordingly, we offer constructional analyses of genre-based omissions that allow constructions to override lexical valence constraints.
Song lyrics can be considered as a text genre that has features of both written and spoken discourse, and potentially provides extensive linguistic and cultural information to scientists from various disciplines. However, pop songs play a rather subordinate role in empirical language research so far - most likely due to the absence of scientifically valid and sustainable resources. The present paper introduces a multiply annotated corpus of German lyrics as a publicly available basis for multidisciplinary research. The resource contains three types of data for the investigation and evaluation of quite distinct phenomena: TEI-compliant song lyrics as primary data, linguistically and literary motivated annotations, and extralinguistic metadata. It promotes empirically/statistically grounded analyses of genre-specific features, systemic-structural correlations and tendencies in the texts of contemporary pop music. The corpus has been stratified into thematic and author-specific archives; the paper presents some basic descriptive statistics, as well as the public online frontend with its built-in evaluation forms and live visualisations.
This report presents a corpus of articulations recorded with Schlieren photography, a recording technique to visualize aeroflow dynamics for two purposes. First, as a means to investigate aerodynamic processes during speech production without any obstruction of the lips and the nose. Second, to provide material for lecturers of phonetics to illustrates these aerodynamic processes. Speech production was recorded with 10 kHz frame rate for statistical video analyses. Downsampled videos (500 Hz) were uplodad to a youtube channel for illustrative purposes. Preliminary analyses demonstrate potential in applying Schlieren photography in research.
In this paper, we will present a first attempt to classify commonly confused words in German by consulting their communicative functions in corpora. Although the use of so-called paronyms causes frequent uncertainties due to similarities in spelling, sound and semantics, up until now the phenomenon has attracted little attention either from the perspective of corpus linguistics or from cognitive linguistics. Existing investigations rely on structuralist models, which do not account for empirical evidence. Still, they have developed an elaborate model based on formal criteria, primarily on word formation (cf. Lăzărescu 1999). Looking from a corpus perspective, such classifications are incompatible with language in use and cognitive elements of misuse.
This article sketches first lexicological insights into a classification model as derived from semantic analyses of written communication. Firstly, a brief description of the project will be provided. Secondly, corpus-assisted paronym detection will be focused. Thirdly, in the main section the paper concerns the description of the datasets for paronym classification and the classification procedures. As a work in progress, new insights will continually be extended once spoken and CMC data are added to the investigations.
This paper investigates the conditions that govern the choice between the German neuter singular relative pronouns das ‘that’ and was ‘what’. We show that das requires a lexical head noun, while in all other cases was is usually the preferred option; therefore, the distribution of das and was is most successfully captured by an approach that does not treat was as an exception but analyzes it as the elsewhere case that applies when the relativizer fails to pick up a lexical gender feature from the head noun. We furthermore show how the non-uniform behavior of different types of nominalized adjectives (positives allow both options, while superlatives trigger was) can be attributed to semantic differences rooted in syntactic structure. In particular, we argue that superlatives select was due to the presence of a silent counterpart of the quantifier alles ‘all’ that is part of the superlative structure.
This paper presents a short insight into a new project at the "Institute for the German Language” (IDS) (Mannheim). It gives an insight into some basic ideas for a corpus-based dictionary of spoken German, which will be developed and compiled by the new project "The Lexicon of spoken German” (Lexik des gesprochenen Deutsch, LeGeDe). The work is based on the "Research and Teaching Corpus of Spoken German” (Forschungs- und Lehrkorpus Gesprochenes Deutsch, FOLK), which is implemented in the "Database for Spoken German” (Datenbank für Gesprochenes Deutsch, DGD). Both resources, the database and the corpus, have been developed at the IDS.
This paper presents the prototype of a lexicographic resource for spoken German in interaction, which was conceived within the framework of the LeGeDe-project (LeGeDe=Lexik des gesprochenen Deutsch). First of all, it summarizes the theoretical and methodological approaches that were used for the initial planning of the resource. The headword candidates were selected by analyzing corpus-based data. Therefore, the data of two corpora (written and spoken German) were compared with quantitative methods. The information that was gathered on the selected headword candidates can be assigned to two different sections: meanings and functions in interaction.
Additionally, two studies on the expectations of future users towards the resource were carried out. The results of these two studies were also taken into account in the development of the prototype. Focusing on the presentation of the resource’s content, the paper shows both the different lexicographical information in selected dictionary entries, and the information offered by the provided hyperlinks and external texts. As a conclusion, it summarizes the most important innovative aspects that were specifically developed for the implementation of such a resource.
Ph@ttSessionz and Deutsch heute are two large German speech databases. They were created for different purposes: Ph@ttSessionz to test Internet-based recordings and to adapt speech recognizers to the voices of adolescent speakers, Deutsch heute to document regional variation of German. The databases differ in their recording technique, the selection of recording locations and speakers, elicitation mode, and data processing.
In this paper, we outline how the recordings were performed, how the data was processed and annotated, and how the two databases were imported into a single relational database system. We present acoustical measurements on the digit items of both databases. Our results confirm that the elicitation technique affects the speech produced, that f0 is quite comparable despite different recording procedures, and that large speech technology databases with suitable metadata may well be used for the analysis of regional variation of speech.
There have been several attempts to annotate communicative functions to utterances of verbal feedback in English previously. Here, we suggest an annotation scheme for verbal and non-verbal feedback utterances in French including the categories base, attitude, previous and visual. The data comprises conversations, maptasks and negotiations from which we extracted ca. 13,000 candidate feedback utterances and gestures. 12 students were recruited for the annotation campaign of ca. 9,500 instances. Each instance was annotated by between 2 and 7 raters. The evaluation of the annotation agreement resulted in an average best-pair kappa of 0.6. While the base category with the values acknowledgement, evaluation, answer, elicit and other achieves good agreement, this is not the case for the other main categories. The data sets, which also include automatic extractions of lexical, positional and acoustic features, are freely available and will further be used for machine learning classification experiments to analyse the form-function relationship of feedback.
In this paper, an exploratory data-driven method is presented that extracts word-types from diachronic corpora that have undergone the most pronounced change in frequency of occurrence in a given period of time. Combined with statistical methods from time series analysis, the method is able to find meaningful patterns and relationships in diachronic corpora, an idea that is still uncommon in linguistics. This indicates that the approach can facilitate an improved understanding of diachronic processes.
The main objective of this article is to describe the current activities at the Mannheim Institute for German Language regarding the implementation of a domain-specific ontology for German grammar. We differentiate ontology bases from ontology management Systems, point out the benefits of database-driven Solutions, and go Step by Step through all phases of the ontology lifecycle. In Order to demonstrate the practical use of our approach, we outline the interface between our ontology and the grammis web Information System, and compare the ontology-based retrieval mechanism with traditional full text search.
We present a descriptive analysis on the two datasets from the shared task on Source, Subjective Expression and Target Extraction from Political Speeches (STEPS), the only existing German dataset for opinion role extraction of its size. Our analysis discusses the individual properties of the three components, subjective expressions, sources and targets and their relations towards each other. Our observations should help practitioners and researchers when building a system to extract opinion roles from German data.
The present paper reports the first results of the compilation and annotation of a blog corpus for German. The main aim of the project is the representation of the blog discourse structure and relations between its elements (blog posts, comments) and participants (bloggers, commentators). The data included in the corpus were manually collected from the scientific blog portal SciLogs. The feature catalogue for the corpus annotation includes three types of information which is directly or indirectly provided in the blog or can be construed by means of statistical analysis or computational tools. At this point, only directly available information (e.g. title of the blog post, name of the blogger etc.) has been annotated. We believe, our blog corpus can be of interest for the general study of blog structure or related research questions as well as for the development of NLP methods and techniques (e.g. for authorship detection).
Most cultures have metaphors for time that involve movement, for example, ‘time passes’. Although time is objectively measured, it is subjectively understood, as we can perceive time as stationary, whereby we move towards future events, or we can perceive ourselves as stationary, with time moving past us and events moving towards us. This paper reports a series of studies that first examines whether people think about time in a metaphor-consistent manner (Study 1) and then explores the relationship between ‘time perspective’, level of perceived personal agency, and time representations (Study 2), the relationship between emotional experiences and time representation (Study 3), and whether this relationship is bidirectional by manipulating either emotional experiences (Study 4) or time representation (Study 5). Results provide bidirectional evidence for an ego-moving representation of time, with happiness eliciting more agentic control, and evidence for a time-moving passivity associated with emotional experiences of anxiety and depression. This bidirectional relationship suggests that our representation of time is malleable, and therefore, current emotional experiences may change through modification of time representations.
We present an implemented XML data model and a new, simplified query language for multi-level annotated corpora. The new query language involves automatic conversion of queries into the underlying, more complicated MMAXQL query language. It supports queries for sequential and hierarchical, but also associative (e.g. coreferential) relations. The simplified query language has been designed with non-expert users in mind.
This paper discusses a specific subclass of English it-clefts posited in the theoretical literature, so-called predicational clefts. The main point of the paper is to show that there is no need to postulate such a separate class. Predicational clefts look special because of the narrow focus on the adjective within an indefinite pivot, but their special properties can all be derived from this narrow focus in a focus analysis in which it-clefts express contrasting focus. Contrasting focus means that besides the assertion of the proposition expressed in the cleft, there is one contrasting proposition which is excluded. The focus on the adjective in apparent predicational clefts gives rise to a narrow set of relevant alternatives, all of which differ only in the adjectival property within the pivot. The analysis developed here can account for many of the observations for apparent predicational clefts. Other properties are shown to be not conclusive. Thus, predicational clefts need not be considered a special subclass beyond their special focus characteristics.
A key difference between traditional humanities research and the emerging field of digital humanities is that the latter aims to complement qualitative methods with quantitative data. In linguistics, this means the use of large corpora of text, which are usually annotated automatically using natural language processing tools. However, these tools do not exist for historical texts, so scholars have to work with unannotated data. We have developed a system for systematic iterative exploration and annotation of historical text corpora, which relies on an XML database (BaseX) and in particular on the Full Text and Update facilities of XQuery.
In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that have undergone the most pronounced change in frequency in a given period of time, is computationally very cheap and that it allows interpretations of diachronic trends that are both intuitively plausible and motivated from the perspective of information theory. Furthermore, it demonstrates that the method is able to identify correlated linguistic changes and diachronic shifts that can be linked to historical events. Finally, it can help to improve diachronic POS tagging and complement existing NLP approaches. This indicates that the approach can facilitate an improved understanding of diachronic processes in language change.
Linguistic query systems are special purpose IR applications. We present a novel state-of-the-art approach for the efficient exploitation of very large linguistic corpora, combining the advantages of relational database management systems (RDBMS) with the functional MapReduce programming model. Our implementation uses the German DEREKO reference corpus with multi-layer linguistic annotations and several types of text-specific metadata, but the proposed strategy is language-independent and adaptable to large-scale multilingual corpora.
So far, there have been few descriptions on creating structures capable of storing lexicographic data, ISO 24613:2008 being one of the latest. Another one is by Spohr (2012), who designs a multifunctional lexical resource which is able to store data of different types of dictionaries in a user-oriented way. Technically, his design is based on the principle of a hierarchical XML/OWL (eXtensible Markup Language/Web Ontology Language) representation model. This article follows another route in describing a model based on entities and relations between them; MySQL (usually referred to as: Structured Query Language) describes a database system of tables containing data and definitions of relations between them. The model was developed in the context of the project "Scientific eLexicography for Africa" and the lexicographic database to be built thereof will be implemented with MySQL. The principles of the ISO model and of Spohr's model are adhered to with one major difference in the implementation strategy: we do not place the lemma in the centre of attention, but the sense description — all other elements, including the lemma, depend on the sense description. This article also describes the contained lexicographic data sets and how they have been collected from different sources. As our aim is to compile several prototypical internet dictionaries (a monolingual Northern Sotho dictionary, a bilingual learners' Xhosa–English dictionary and a bilingual Zulu–English dictionary), we describe the necessary microstructural elements for each of them and which principles we adhere to when designing different ways of accessing them. We plan to make the model and the (empty) database with all graphical user interfaces that have been developed, freely available by mid-2015.
We present a gold standard for semantic relation extraction in the food domain for German. The relation types that we address are motivated by scenarios for which IT applications present a commercial potential, such as virtual customer advice in which a virtual agent assists a customer in a supermarket in finding those products that satisfy their needs best. Moreover, we focus on those relation types that can be extracted from natural language text corpora, ideally content from the internet, such as web forums, that are easy to retrieve. A typical relation type that meets these requirements are pairs of food items that are usually consumed together. Such a relation type could be used by a virtual agent to suggest additional products available in a shop that would potentially complement the items a customer has already in their shopping cart. Our gold standard comprises structural data, i.e. relation tables, which encode relation instances. These tables are vital in order to evaluate natural language processing systems that extract those relations.
This paper argues that there is a correlation between functional and purely grammatical patterning in language, yet the nature of this correlation has to be explored. This claim is based on the results of a corpus-driven study of the Slavic aspect, drawing on the socalled Distributional Hypothesis. According to the East-West Theory of the Slavic aspect, there is a broad east-west isogloss dividing the Slavic languages into an eastern group and a western group. There are also two transitional zones in the north and south, which share some properties with each group (Dickey 2000; Barentsen 1998, 2008). The East-West Theory uses concepts of cognitive grammar such as totality and temporal definiteness, and is based on various parameters of aspectual usage in discourse, including contexts such as habituals, general factuals, historical (narrative) present, performatives, sequenced events in the past etc. The purpose of the above-mentioned study is to challenge the semantic approach to the Slavic aspect by comparing the perfective and imperfective verbal aspect on the basis of purely grammatical co-occurrence patterns (see also Janda & Lyashevskaya 2011). The study focused on three Slavic languages: Russian, which, following the East-West Theory, belongs to the eastern group, Czech, which belongs to the western group, and Polish, which is considered as transitional in its aspectual patterning.
We present a testsuite for POS tagging German web data. Our testsuite provides the original raw text as well as the gold tokenisations and is annotated for parts-of-speech. The testsuite includes a new dataset for German tweets, with a current size of 3,940 tokens. To increase the size of the data, we harmonised the annotations in already existing web corpora, based on the Stuttgart-Tübingen Tag Set. The current version of the corpus has an overall size of 48,344 tokens of web data, around half of it from Twitter. We also present experiments, showing how different experimental setups (training set size, additional out-of-domain training data, self-training) influence the accuracy of the taggers. All resources and models will be made publicly available to the research community.