This paper presents the results of an online user study on the function and reception of illustrative examples in the monolingual German online dictionary elexiko. The results are interpreted against the background of general metalexicographic and conceptual considerations, and an outlook points to further relevant research questions.
This chapter summarizes the typical steps of an empirical investigation. Every step is illustrated using examples from our research project into online dictionary use or other relevant studies. This chapter does not claim to contain anything new, but presents a brief guideline for lexicographical researchers who are interested in conducting their own empirical research.
In this paper, we present the concept and the results of two studies addressing (potential) users of monolingual German online dictionaries, such as www.elexiko.de. Drawing on the example of elexiko, the aim of those studies was to collect empirical data on possible extensions of the content of monolingual online dictionaries, e.g. the search function, to evaluate how users comprehend the terminology of the user interface, to find out which types of information are expected to be included in each specific lexicographic module and to investigate general questions regarding the function and reception of examples illustrating the use of a word. The design and distribution of the surveys are comparable to the studies described in chapters 5-8 of this volume. We also explain how the data obtained in our studies were used for further improvement of the elexiko dictionary.
This chapter presents empirical findings on the question of which criteria make a good online dictionary, using data on expectations and demands collected in the first study (N=684), complemented by additional results from the second study (N=390), which examined more closely whether the respondents had differentiated views on individual aspects of the criteria rated in the first study. Our results show that the classical criteria of reference books (e.g. reliability, clarity) were rated highest by our participants, whereas the unique characteristics of online dictionaries (e.g. multimedia, adaptability) were rated and ranked as (partly) unimportant. To verify whether or not the poor rating of these innovative features was a result of the fact that the subjects are not used to online dictionaries incorporating those features, we integrated an experiment into the second study. Our results revealed a learning effect: participants in the learning-effect condition, i.e. respondents who were first presented with examples of possible innovative features of online dictionaries, judged adaptability and multimedia to be more useful than participants who did not have this information. Thus, our data point to the conclusion that developing innovative features is worthwhile but that it is necessary to be aware of the fact that users can only be convinced of their benefits gradually.
In this paper, the authors use the 2012 log files of two German online dictionaries (Digital Dictionary of the German Language and the German version of Wiktionary) and the 100,000 most frequent words in the Mannheim German Reference Corpus from 2009 to answer the question of whether dictionary users really do look up frequent words, first asked by de Schryver et al. (2006). By using an approach to the comparison of log files and corpus data which is completely different from that of the aforementioned authors, we provide empirical evidence that indicates, contrary to the results of de Schryver et al. and Verlinde/Binon (2010), that the corpus frequency of a word can indeed be an important factor in determining what online dictionary users look up. Finally, we incorporate word class information readily available in Wiktionary into our analysis to improve our results considerably.
The main aim of the study presented in this chapter was to try out eye tracking as a way of collecting data about dictionary use, since it is a new and not yet widely used technology in research into dictionary use. As the topic of research, we decided to evaluate the new web design of the IDS dictionary portal OWID. In mid-2011, when the study was conducted, the relaunch of the web design had been completed internally but not yet released externally. It was therefore a good time to see whether users get along well with the new design decisions. A total of 38 people participated in our study, all of them students aged 20-30 years. Besides the results, the chapter also includes critical comments on methodological aspects of our study.
Questions of design
(2014)
All lexicographers working on online dictionary projects that do not wish to use an established form of design for their online dictionary, or simply have new kinds of lexicographic data to present, face the problem of what kind of arrangement is best suited for the intended users of the dictionary. In this chapter, we present data about questions relating to the design of online dictionaries. This will provide projects that use these or similar ways of presenting their lexicographic data with valuable information about how potential dictionary users assess and evaluate them. In addition, the answers to corresponding open-ended questions show, detached from concrete design models, which criteria potential users value in a good online representation. Clarity and an uncluttered look seem to dominate in many answers, as well as the possibility of customization, provided the latter is not tied to an overly complex usability model.
The first international study (N=684) we conducted within our research project on online dictionary use included very general questions on that topic. In this chapter, we present the corresponding results on questions such as the use of both printed and online dictionaries, the types of dictionaries used, the devices used to access online dictionaries and the willingness to pay for premium content. The data we collected show that our respondents use both printed and online dictionaries and, according to their self-reports, many different kinds of dictionaries. In this context, our results revealed some clear cultural differences: in German-speaking areas spelling dictionaries are more common than in other linguistic areas, where thesauruses are widespread. Only a minority of our respondents are willing to pay for premium content, but most of the respondents are prepared to accept advertising. Our results also demonstrate that our respondents mainly tend to use dictionaries on big-screen devices, e.g. desktop computers or laptops.
This article presents empirical findings about what criteria make for a good online dictionary, using data on expectations and demands collected in an online questionnaire (N=684), complemented by additional results from a second questionnaire (N=390) which looked more closely at whether respondents had differentiated views on individual aspects of the criteria rated in the first study. Our results show that the classical criteria of reference books (such as reliability and clarity) were rated highest by our participants, whereas the unique characteristics of online dictionaries (such as multimedia and adaptability) were rated and ranked as (partly) unimportant. To verify whether or not the poor ratings of these innovative features were a result of the fact that our subjects are unfamiliar with online dictionaries incorporating such features, we incorporated an experiment into the second study. Our results revealed a learning effect: participants in the learning-effect condition, i.e. respondents who were first presented with examples of possible innovative features of online dictionaries, judged adaptability and multimedia to be more useful than participants who were not given that information. Thus, our data point to the conclusion that developing innovative features is worthwhile but that it should be borne in mind that users can only be persuaded of their benefits gradually. In addition, we present data about questions relating to the design of online dictionaries.
We present studies using the 2013 log files from the German version of Wiktionary. We investigate several lexicographically relevant variables and their effect on look-up frequency: Corpus frequency of the headword seems to have a strong effect on the number of visits to a Wiktionary entry. We then consider the question of whether polysemic words are looked up more often than monosemic ones. Here, we also have to take into account that polysemic words are more frequent in most languages. Finally, we present a technique to investigate the time-course of look-up behaviour for specific entries. We exemplify the method by investigating influences of (temporary) social relevance of specific headwords.
In order to demonstrate why it is important to correctly account for the (serially dependent) structure of temporal data, we document an apparently spectacular relationship between population size and lexical diversity: for five out of seven investigated languages, there is a strong relationship between the population size of a country and the lexical diversity of its primary language. We show that this relationship is the result of a misspecified model that does not consider the temporal aspect of the data by presenting a similar but nonsensical relationship between the global annual mean sea level and lexical diversity. Given that several recent studies present surprising links between different economic, cultural, political and (socio-)demographical variables on the one hand and cultural or linguistic characteristics on the other hand, but seem to suffer from exactly this problem, we explain the cause of the misspecification and show that it has profound consequences. We demonstrate how a simple transformation of the time series can often solve problems of this type and argue that the evaluation of the plausibility of a relationship is important in this context. We hope that our paper will help both researchers and reviewers to understand why it is important to use special models for the analysis of data with a natural temporal ordering.
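The kind of misspecification described in this abstract can be illustrated with a small sketch. It is not the authors' analysis: the two series below are synthetic, the function names are hypothetical, and first differencing is used here only as one common example of a "simple transformation". Two series that share nothing except an upward trend correlate strongly in levels, while their year-to-year changes do not.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def first_difference(series):
    """Replace each value by its change relative to the previous time step."""
    return [b - a for a, b in zip(series, series[1:])]


def make_trending_series(n, slope, rng):
    """Synthetic data (an assumption): linear trend plus independent noise."""
    return [slope * t + rng.gauss(0.0, 1.0) for t in range(n)]


rng = random.Random(42)
series_a = make_trending_series(100, 1.0, rng)  # stand-in for e.g. population size
series_b = make_trending_series(100, 0.5, rng)  # stand-in for e.g. lexical diversity

# Correlating the raw levels mostly measures the shared trend.
r_levels = pearson(series_a, series_b)

# Correlating the changes removes the trend and exposes the lack of a link.
r_diffs = pearson(first_difference(series_a), first_difference(series_b))

print(f"correlation of levels:      {r_levels:.3f}")
print(f"correlation of differences: {r_diffs:.3f}")
```

In this toy setting the level correlation is driven entirely by the common trend, which is why it disappears once the series are differenced. Real diachronic data may call for other transformations or dedicated time-series models.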
This thesis consists of the following three papers that all have been published in international peer-reviewed journals:
Chapter 3: Koplenig, Alexander (2015c). The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv037]
Chapter 4: Koplenig, Alexander (2015b). Why the quantitative analysis of diachronic corpora that does not consider the temporal aspect of time-series can lead to wrong conclusions. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv030]
Chapter 5: Koplenig, Alexander (2015a). Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis. Published in: Corpus Linguistics and Linguistic Theory. Berlin/Boston: de Gruyter. [doi:10.1515/cllt-2014-0049]
Chapter 1 introduces the topic by describing and discussing several basic concepts relevant to the statistical analysis of corpus linguistic data. Chapter 2 presents a method to analyze diachronic corpus data and a summary of the three publications. Chapters 3 to 5 each represent one of the three publications. All papers are printed in this thesis with the permission of the publishers.
This paper explores speakers’ notions of the situational appropriacy of linguistic variants. We conducted a web-based survey in which we collected ratings of the appropriacy of variants of linguistic variables in spoken German. A range of quantitative methods (cluster analysis, factor analysis and various forms of visualization techniques) is applied in order to analyze metalinguistic awareness and the differences in the evaluation of written vs. spoken stimuli. First, our data show that speakers’ ratings of the appropriacy of linguistic variants vary reliably with two rough clusters representing formal and informal speech situations and genres. The findings confirm that speakers adhere to a notion of spoken standard German which takes genre and register-related variation into account. Secondly, our analysis reveals a written language bias: metalinguistic awareness is strongly influenced by the physical mode of the presentation of linguistic items (spoken vs. written).
Languages employ different strategies to transmit structural and grammatical information. While, for example, grammatical dependency relationships in sentences are mainly conveyed by the ordering of the words in languages like Mandarin Chinese or Vietnamese, the word ordering is much less restricted in languages such as Inupiatun or Quechua, as these languages (also) use the internal structure of words (e.g. inflectional morphology) to mark grammatical relationships in a sentence. Based on a quantitative analysis of more than 1,500 unique translations of different books of the Bible in almost 1,200 different languages that are spoken as a native language by approximately 6 billion people (more than 80% of the world population), we present large-scale evidence for a statistical trade-off between the amount of information conveyed by the ordering of words and the amount of information conveyed by internal word structure: languages that rely more strongly on word order information tend to rely less on word structure information and vice versa. Or put differently, if less information is carried within the word, more information has to be spread among words in order to communicate successfully. In addition, we find that, despite differences in the way information is expressed, there is also evidence for a trade-off between different books of the biblical canon that recurs with little variation across languages: the more informative the word order of the book, the less informative its word structure and vice versa. We argue that this might suggest that, on the one hand, languages encode information in very different (but efficient) ways. On the other hand, content-related and stylistic features are statistically encoded in very similar ways.
In the first volume of Corpus Linguistics and Linguistic Theory, Gries (2005. Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1(2). doi:10.1515/cllt.2005.1.2.277. http://www.degruyter.com/view/j/cllt.2005.1.issue-2/cllt.2005.1.2.277/cllt.2005.1.2.277.xml: 285) asked whether corpus linguists should abandon null-hypothesis significance testing. In this paper, I want to revive this discussion by defending the argument that the assumptions that allow inferences about a given population – in this case about the studied languages – based on results observed in a sample – in this case a collection of naturally occurring language data – are not fulfilled. As a consequence, corpus linguists should indeed abandon null-hypothesis significance testing.
Quantitatively oriented empirical linguistics usually aims to take large amounts of linguistic material into view at once and, by means of suitable methods of analysis, both to discover new phenomena and to investigate known phenomena more systematically. The aim of our contribution is to reflect methodologically, on the basis of two exemplary research questions, on where the quantitative-empirical approach to the analysis of lexical data really works as hoped and where there may even be limits inherent in the system. For this purpose, we single out two very different research questions: on the one hand, the near-real-time analysis of productive processes of vocabulary change, and on the other hand, the trade-off between word order regularity and word structure regularity in the languages of the world. These two research questions are situated at very different levels of abstraction. We hope, however, that they allow us to show, across a wide range, the levels at which the quantitative analysis of lexical data can take place. In addition, we would like to use these very different analyses to reflect on the possibilities and limits of the quantitative approach and thereby clarify the interpretive power of the methods.
Classical null hypothesis significance tests are not appropriate in corpus linguistics, because the randomness assumption underlying these testing procedures is not fulfilled. Nevertheless, there are numerous scenarios where it would be beneficial to have some kind of test in order to judge the relevance of a result (e.g. a difference between two corpora) by answering the question whether the attribute of interest is pronounced enough to warrant the conclusion that it is substantial and not due to chance. In this paper, I outline such a test.
Large-scale empirical evidence indicates a fascinating statistical relationship between the estimated number of users of a language and its linguistic and statistical structure. In this context, the linguistic niche hypothesis argues that this relationship reflects a negative selection against morphological paradigms that are hard to learn for adults, because languages with a large number of speakers are assumed to be typically spoken and learned by greater proportions of adults. In this paper, this conjecture is tested empirically for more than 2000 languages. The results question the idea of the impact of non-native speakers on the grammatical and statistical structure of languages, as it is demonstrated that the relative proportion of non-native speakers does not significantly correlate with either morphological or information-theoretic complexity. While it thus seems that large numbers of adult learners/speakers do not affect the (grammatical or statistical) structure of a language, the results suggest that there is indeed a relationship between the number of speakers and (especially) information-theoretic complexity, i.e. entropy rates. A potential explanation for the observed relationship is discussed.
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
(2019)
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
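To make the role of the free parameter α more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the Tsallis-type form of the generalized entropy, H_α(p) = (Σ p_i^α − 1) / (1 − α), with the Shannon entropy as the limit α → 1, and a Jensen-Shannon-style divergence built from it. The abstract does not spell out these definitions, and all function names are hypothetical.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word type to its relative frequency in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def generalized_entropy(probs, alpha):
    """Generalized entropy of order alpha (assumed Tsallis-type form)."""
    values = [p for p in probs if p > 0]
    if abs(alpha - 1.0) < 1e-12:
        # Limit alpha -> 1: Shannon entropy (natural logarithm).
        return -sum(p * math.log(p) for p in values)
    return (sum(p ** alpha for p in values) - 1.0) / (1.0 - alpha)


def generalized_jsd(p, q, alpha):
    """Jensen-Shannon-style divergence of order alpha between two
    word frequency distributions given as {word: probability} dicts."""
    vocabulary = set(p) | set(q)
    mixture = [(p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in vocabulary]
    return (generalized_entropy(mixture, alpha)
            - 0.5 * generalized_entropy(p.values(), alpha)
            - 0.5 * generalized_entropy(q.values(), alpha))


# Toy texts (invented for illustration only).
text_a = "the cat sat on the mat and the dog sat on the rug".split()
text_b = "the senate passed the bill and the president signed the bill".split()
p = relative_frequencies(text_a)
q = relative_frequencies(text_b)

# Larger alpha gives more weight to frequent types, smaller alpha to rare ones.
for alpha in (0.5, 1.0, 2.0):
    print(f"alpha={alpha}: divergence={generalized_jsd(p, q, alpha):.4f}")
```

Under this definition, large α emphasizes high-frequency types such as function words, while small α emphasizes the long tail of rare types. Because the estimated frequencies themselves change with text length, values computed on texts of different sizes are not directly comparable, which is the sample-size problem the abstract describes.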