Refine
Year of publication
- 2015 (101) (remove)
Document Type
- Conference Proceeding (38)
- Part of a Book (33)
- Article (20)
- Book (5)
- Working Paper (2)
- Master's Thesis (1)
- Preprint (1)
- Review (1)
Language
- English (101) (remove)
Keywords
- Korpus <Linguistik> (34)
- Deutsch (21)
- Computerlinguistik (12)
- Annotation (11)
- Englisch (10)
- Automatische Sprachanalyse (6)
- Corpus annotation (6)
- Corpus linguistics (6)
- Corpus technology (6)
- Datenbanksystem (6)
Publicationstate
- Veröffentlichungsversion (49)
- Postprint (11)
- Zweitveröffentlichung (9)
- Preprint (2)
Reviewstate
Publisher
- Institut für Deutsche Sprache (9)
- Springer (7)
- Narr (6)
- De Gruyter (4)
- Benjamins (3)
- Frank & Timme (3)
- Palgrave Macmillan (3)
- de Gruyter (3)
- Association for Computational Linguistics (2)
- Elsevier (2)
Based on specific linguistic landmarks in the speech signal, this study investigates pitch level and pitch span differences in English, German, Bulgarian and Polish. The analysis is based on 22 speakers per language (11 males and 11 females). Linear mixed models were computed that include various linguistic measures of pitch level and span, revealing characteristic differences across languages and between language groups. Pitch level appeared to have significantly higher values for the female speakers in the Slavic than the Germanic group. The male speakers showed slightly different results, with only the Polish speakers displaying significantly higher mean values for pitch level than the German males. Overall, the results show that the Slavic speakers tend to have a wider pitch span than the German speakers. But for the linguistic measure, namely for span between the initial peaks and the non-prominent valleys, we only find the difference between Polish and German speakers. We found a flatter intonation contour in German than in Polish, Bulgarian and English male and female speakers and differences in the frequency of the landmarks between languages. Concerning “speaker liveliness” we found that the speakers from the Slavic group are significantly livelier than the speakers from the Germanic group.
Ein integriertes Datenbank-, Such- und Tagging-Tool (IDaSTo) wird vorgestellt, das sich besonders für Variablenanalysen, für Paralleltexte und für diachronische Untersuchungen eignet. Relevante Kategorien bzw. Variablen können individuell definiert, Tags frei im Text und auf verschiedenen Wegen gesetzt und ihre Häufigkeiten in den verlinkten Statistiken direkt abgerufen werden.
The present thesis introduces KoralQuery, a protocol for the generic representation of queries to linguistic corpora. KoralQuery defines a set of types and operations which serve as abstract representations of linguistic entities and configurations. By combining these types and operations in a nested structure, the protocol may express linguistic structures of arbitrary complexity. It achieves a high degree of neutrality with regard to linguistic theory, as it provides flexible structures that allow for the setting of certain parameters to access several complementing and concurrent sources and layers of annotation on the same textual data. JSON-LD is used as a serialisation format for KoralQuery, which allows for the well-defined and normalised exchange of linguistic queries between query engines to promote their interoperability. The automatic translation of queries issued in any of three supported query languages to such KoralQuery serialisations is the second main contribution of this thesis. By employing the introduced translation module, query engines may also work independently of particular query languages, as their backend technology may rely entirely on the abstract KoralQuery representations of the queries. Thus, query engines may provide support for several query languages at once without any additional overhead. The original idea of a general format for the representation of linguistic queries comes from an initiative called Corpus Query Lingua Franca (CQLF), whose theoretic backbone and practical considerations are outlined in the first part of this thesis. This part also includes a brief survey of three typologically different corpus query languages, thus demonstrating their wide variety of features and defining the minimal target space of linguistic types and operations to be covered by KoralQuery.
The task-oriented and format-driven development of corpus query systems has led to the creation of numerous corpus query languages (QLs) that vary strongly in expressiveness and syntax. This is a severe impediment for the interoperability of corpus analysis systems, which lack a common protocol. In this paper, we present KoralQuery, a JSON-LD based general corpus query protocol, aiming to be independent of particular QLs, tasks and corpus formats. In addition to describing the system of types and operations that Koral- Query is built on, we exemplify the representation of corpus queries in the serialized format and illustrate use cases in the KorAP project.
Hierarchical predictive coding has been identified as a possible unifying principle of brain function, and recent work in cognitive neuroscience has examined how it may be affected by age–related changes. Using language comprehension as a test case, the present study aimed to dissociate age-related changes in prediction generation versus internal model adaptation following a prediction error. Event-related brain potentials (ERPs) were measured in a group of older adults (60–81 years; n = 40) as they read sentences of the form “The opposite of black is white/yellow/nice.” Replicating previous work in young adults, results showed a target-related P300 for the expected antonym (“white”; an effect assumed to reflect a prediction match), and a graded N400 effect for the two incongruous conditions (i.e. a larger N400 amplitude for the incongruous continuation not related to the expected antonym, “nice,” versus the incongruous associated condition, “yellow”). These effects were followed by a late positivity, again with a larger amplitude in the incongruous non-associated versus incongruous associated condition. Analyses using linear mixed-effects models showed that the target-related P300 effect and the N400 effect for the incongruous non-associated condition were both modulated by age, thus suggesting that age-related changes affect both prediction generation and model adaptation. However, effects of age were outweighed by the interindividual variability of ERP responses, as reflected in the high proportion of variance captured by the inclusion of by-condition random slopes for participants and items. We thus argue that – at both a neurophysiological and a functional level – the notion of general differences between language processing in young and older adults may only be of limited use, and that future research should seek to better understand the causes of interindividual variability in the ERP responses of older adults and its relation to cognitive performance.
One was a distinguished natural scientist and engineer, the other a self-taught scientist and vilified as a conman: Christian Gottlieb Kratzenstein (1723–1795) and Wolfgang von Kempelen (1734–1804). Some of the former’s postula-tions on human physiology and articulation of speech proved wrong in later years. Most of the latter’s theories are considered applicable even today. The perhaps most contrasting approaches to speech synthesis during the 18th century are linked to their names. There are many essential differences between their approaches which show that these two researchers were not only representatives of different schools of thought, but also representatives of two different scientific eras. A speculative and philosophical approach on the one hand versus an empirical and logical approach on the other hand. Both Kratzenstein and Kempelen published books on their research. But while the “Tentamen” [4] of the physician Kratzen-stein remains rather vague and imprecise in its descriptions of vowel production and synthesis, the “Mechanismus” [8] of the engineer Kempelen shows much more precision and correctness in almost every respect of human speech and lan-guage. The goal of this paper is to discuss the differences between these two con-temporaneous researchers on speech synthesis and to compare their theories with present-days findings.
With an increasing amount of text data available it is possible to automatically extract a variety of information about language. One way to obtain knowledge about subtle relations and analogies between words is to observe words which are used in the same context. Recently, Mikolov et al. proposed a method to efficiently compute Euclidean word representations which seem to capture subtle relations and analogies between words in the English language. We demonstrate that this method also captures analogies in the German language. Furthermore, we show that we can transfer information extracted from large non-annotated corpora into small annotated corpora, which are then, in turn, used for training NLP systems.
We investigate whether non-configurational languages, which display more word order variation than configurational ones, require more training data for a phenomenon to be parsed successfully. We perform a tightly controlled study comparing the dative alternation for English (a configurational language), German, and Russian (both non-configurational). More specifically, we compare the performance of a dependency parser when only canonical word order is present with its performance on data sets when all word orders are present. Our results show that for all languages, canonical data not only is easier to parse, but there exists no direct correspondence between the size of training sets containing free(er) word order variation and performance.
This paper aims at showing how quantitative corpus linguistic analysis can inform qualitative analysis of digital media discourse with respect to the mediality of language in use. Using the example of protest discourse in Twitter, in the field of anti-Islamic ‘Pegida’ demonstrations, a three-step method of collecting, reducing and interpreting salient data is proposed. Each step is aligned with operative medial features of the microblog: hashtags, retweets and @-interactions. The exemplary analysis reveals the importance of discussions of attendance numbers in protest discourse and the asymmetry between administrative (i.e. the police) and non-administrative discourse agents. Furthermore, it exemplifies how frequency analysis and sequence analysis can be combined for research in media linguistics.