A comparison between morphological complexity measures: typological data vs. language corpora
(2016)
Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity has resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages spanning 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS) and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings and can be used interchangeably.
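Corpus-derived complexity measures of the kind compared here can be illustrated with a small sketch. The two functions below are generic stand-ins, not the measures actually used in the paper: word-form entropy and type-token ratio, two simple corpus statistics that tend to rise with morphological richness. The toy sentences are invented for illustration.

```python
from collections import Counter
from math import log2

def word_entropy(tokens):
    """Shannon entropy (in bits) over word-form frequencies: a simple
    corpus-derived proxy for morphological richness."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def type_token_ratio(tokens):
    """Ratio of distinct word forms to tokens; for equal-sized samples,
    higher values suggest more productive morphology."""
    return len(set(tokens)) / len(tokens)

# Toy comparison: a morphologically richer sample has more distinct forms.
analytic = "the dog sees the dog and the dog sees the cat".split()
synthetic = "canis canem videt canes canum vident cani canibus".split()

print(type_token_ratio(analytic), type_token_ratio(synthetic))
print(word_entropy(analytic), word_entropy(synthetic))
```

In practice such measures are computed over corpus samples of equal token counts, since the type-token ratio in particular is highly sensitive to sample size.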
We investigate the optional omission of the infinitival marker in a Swedish future tense construction. Over the last two decades the frequency of omission has been increasing rapidly, and this process has received considerable attention in the literature. We test whether the accumulated knowledge can yield accurate predictions of language variation and change. We extracted all occurrences of the construction from a very large collection of corpora. The dataset was automatically annotated with language-internal predictors which have previously been shown or hypothesized to affect the variation. We trained several models to make two kinds of predictions: whether the marker will be omitted in a specific utterance, and how large the proportion of omissions will be in a given time period. For most of the approaches we tried, we were not able to achieve better-than-baseline performance. The only exception was predicting the proportion of omissions using autoregressive integrated moving average (ARIMA) models for one-step-ahead forecasts, and in this case time was the only predictor that mattered. Our data suggest that most of the language-internal predictors do have some effect on the variation, but the effect is not strong enough to yield reliable predictions.
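The one-step-ahead forecasting setup can be sketched in plain Python. The full ARIMA machinery used in the study is replaced here by a bare AR(1) model fitted with ordinary least squares, and the omission proportions are hypothetical values chosen purely for illustration.

```python
def ar1_one_step(series):
    """Fit y_t = a + b * y_{t-1} by ordinary least squares and return the
    one-step-ahead forecast. A bare-bones stand-in for an ARIMA(1,0,0) fit."""
    x = series[:-1]  # lagged values y_{t-1}
    y = series[1:]   # current values y_t
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a + b * series[-1]

# Hypothetical yearly proportions of infinitival-marker omission (illustrative).
props = [0.10, 0.13, 0.17, 0.22, 0.28, 0.35]
forecast = ar1_one_step(props)
print(round(forecast, 3))
```

Because the lagged value dominates such a model, the forecast essentially extrapolates the recent trend, which matches the paper's finding that time was the only predictor that mattered.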
Metalinguistic awareness of standard vs standard usage. The case of determiners in spoken German
(2015)
This article presents results from two user studies of the dictionary elexiko, carried out in January and March 2011 respectively. For a newly designed, still growing, comprehensive online dictionary of contemporary German such as elexiko, research on dictionary use has so far been conducted only to a limited extent, even though the need to clarify users' needs and opinions is considerable. Such clarification can, on the one hand, serve to confirm decisions made about the dictionary's content and presentation; on the other hand, it also provides impetus for improving them on the basis of actual, rather than presumed, needs and opinions regarding dictionary use.
In this paper, we present the concept and the results of two studies addressing (potential) users of monolingual German online dictionaries such as www.elexiko.de. Drawing on the example of elexiko, these studies aimed to collect empirical data on possible extensions of the content of monolingual online dictionaries (e.g. the search function), to evaluate how users understand the terminology of the user interface, to find out which types of information users expect in each lexicographic module, and to investigate general questions regarding the function and reception of examples illustrating the use of a word. The design and distribution of the surveys are comparable to the studies described in chapters 5-8 of this volume. We also explain how the data obtained in our studies were used to further improve the elexiko dictionary.
This article presents the results of an online user study on the function and reception of corpus citations in the monolingual German online dictionary elexiko. The results are interpreted against the background of general metalexicographic and design considerations, and an outlook points to further relevant research questions.
The Google Ngram Corpora seem to offer a unique opportunity to study linguistic and cultural change in quantitative terms. To avoid breaking any copyright laws, the data sets are not accompanied by any metadata regarding the texts the corpora consist of. Some of the consequences of this strategy are analyzed in this article. I chose the example of measuring censorship in Nazi Germany, which received widespread attention and was published in a paper that accompanied the release of the Google Ngram data (Michel et al. (2010): Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176–82). I show that without proper metadata, it is unclear whether the results actually reflect any kind of censorship at all. Collectively, the findings imply that observed changes in this period of time can only be linked directly to World War II to a certain extent. Therefore, instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’. On a more general level, the analysis demonstrates the importance of metadata, the availability of which is not just a nice add-on, but a powerful source of information for the digital humanities.
Recently, a claim was made, on the basis of the German Google Books 1-gram corpus (Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science 2010; 331: 176–82), that there was a linear relationship between six non-technical non-Nazi words and three ‘explicitly Nazi words’ during World War II (Caruana-Galizia. 2015. Politics and the German language: Testing Orwell’s hypothesis using the Google N-Gram corpus. Digital Scholarship in the Humanities [Online]. http://dsh.oxfordjournals.org/cgi/doi/10.1093/llc/fqv011 (accessed 15 April 2015)). Here, I try to show that apparent relationships like this are the result of misspecified models that do not take the temporal structure of time-series data into account. The main point of this article is to demonstrate why such analyses risk incorrect statistical inference: the apparent effects are statistically meaningless and can lead to wrong conclusions.
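The statistical pitfall can be sketched as follows: two series that are unrelated except for a shared trend correlate strongly in levels, but the apparent relationship largely disappears once the series are differenced, as standard time-series practice requires. The frequency series below are synthetic illustrations, not actual Ngram data.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def diff(series):
    """First differences: removes a (roughly) linear common trend."""
    return [b - a for a, b in zip(series, series[1:])]

# Two unrelated series that merely share a downward trend, like most
# word frequencies in a shrinking wartime corpus.
years = range(30)
freq_a = [100 - 3 * t + (5 if t % 2 else -5) for t in years]
freq_b = [80 - 2 * t + (4 if t % 3 else -4) for t in years]

r_levels = pearson(freq_a, freq_b)              # inflated by the shared trend
r_diffs = pearson(diff(freq_a), diff(freq_b))   # trend removed

print(round(r_levels, 2), round(r_diffs, 2))
```

Regressing one trending series on another in levels is the classic "spurious regression" setup; differencing (the "I" in ARIMA) is the usual first defence.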