Refine
Year of publication
Document Type
- Article (39) (remove)
Has Fulltext
- yes (39)
Keywords
- Deutsch (17)
- Korpus <Linguistik> (10)
- Wörterbuch (9)
- Wortschatz (8)
- COVID-19 (6)
- Lexikostatistik (5)
- Sprachstatistik (5)
- Worthäufigkeit (5)
- Computerunterstützte Lexikographie (4)
- Datenanalyse (4)
Publicationstate
- Veröffentlichungsversion (27)
- Zweitveröffentlichung (11)
- Postprint (8)
Reviewstate
- Peer-Review (31)
- (Verlags)-Lektorat (5)
Publisher
- Leibniz-Institut für Deutsche Sprache (IDS) (4)
- Oxford University Press (4)
- Erich Schmidt (3)
- Buro van die WAT (2)
- MDPI (2)
- Springer (2)
- Springer Nature (2)
- Wiley (2)
- de Gruyter (2)
- Benjamins (1)
One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large scale quantitative cross-linguistic analysis of written language by training a language model on more than 6500 different documents as represented in 41 multilingual text collections consisting of ~ 3.5 billion words or ~ 9.0 billion characters and covering 2069 different languages that are spoken as a native language by more than 90% of the world population. We statistically infer the entropy of each language model as an index of what we call average prediction complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. In addition, we show that speaker population size predicts entropy. We argue that both results constitute evidence against the equi-complexity hypothesis from an information-theoretic perspective.
This paper presents the results of a survey on dictionary use in Europe, the largest survey of dictionary use to date with nearly 10,000 participants in nearly thirty countries. The paper focuses on the comparison of the results of the Slovenian participants with the results of the participants from other European countries. The comparisons are made both with the European averages, and with the results from individual countries, in order to determine in which aspects Slovenian participants share similarities with other dictionary users (and non-users) around Europe, and in which aspects they differ. The findings show that in many ways the Slovenian users are similar to their European counterparts, with some noticeable exceptions, including (much) stronger preference for digital dictionaries over print ones, above-average reliance on other people when dictionary does not contain the relevant information, and the largest difference between the price of a dictionary and the amount willing to spend on it.
Der Anlass dieser Untersuchung war zunächst anekdotische Evidenz: Eines der Kinder der Autor*innen macht 2022 Abitur und las in ihrer gesamten gymnasialen Laufbahn genau eine ›Ganzschrift‹ einer Autorin: Die Judenbuche von Annette von Droste-Hülshoff. Zweifellos ein lesenswerter Text, aber konnte es wirklich sein, dass man in Deutschland 2022 Abitur macht, sogar Deutsch-Leistungskurs gewählt hat und sonst kein Buch einer Autorin im Deutschunterricht liest? Auch in den Pflichtlektüren für das Deutschabitur ist im entsprechenden Bundesland bei den empfohlenen Texten kein Roman und kein Drama einer Verfasserin verzeichnet. Neugierig geworden, recherchierten wir nach einer Liste, welche Literatur für den Deutschunterricht an Gymnasien in Baden-Württemberg (wo die Anekdote sich ereignete) insgesamt empfohlen wurde, und fanden auf den Seiten des Kultusministeriums eine umfangreiche Liste, auf der 298 Werke verzeichnet sind. Eine Auswertung nach dem Geschlecht der Verfasser*innen ergab, dass von den Einträgen auf dieser Liste 31 Titel bzw. Autor*innen (von) Frauen sind, d.h. rund 10 %.
This contribution explores the relationship between the English CEFR (Common European Framework of Reference for Languages) vocabulary levels and user interest in English Wiktionary entries. User interest was operationalized through the number of views of these entries in Wikimedia server logs covering a period of four years (2019–2022). Our findings reveal a significant relationship between CEFR levels and user interest: entries classified at lower CEFR levels tend to attract more views, which suggests a greater user interest in more basic vocabulary. A multiple regression model controlling for other known or potential factors affecting interest: corpus frequency, polysemy, word prevalence, and age of acquisition confirmed that lower CEFR levels attract significantly more views even after taking into account the other predictors. These findings highlight the importance of CEFR levels in predicting which words users are likely to look up, with implications for lexicography and the development of language learning materials.
Are borrowed neologisms accepted more slowly into the German language than German words resulting from the application of word formation rules? This study addresses this question by focusing on two possible indicators for the acceptance of neologisms: a) frequency development of 239 German neologisms from the 1990s (loanwords as well as new words resulting from the application of word formation rules) in the German reference corpus DeReKo and b) frequency development in the use of pragmatic markers (‘flags’, namely quotation marks and phrases such as sogenannt ‘so-called’) with these words. In the second part of the article, a psycholinguistic approach to evaluating the (psychological) status of different neologisms and non-words in an experimentally controlled study and plans to carry out interviews in a field test to collect speakers’ opinions on the acceptance of the analysed neologisms are outlined. Finally, implications for the lexicographic treatment of both types of neologisms are discussed.
Are borrowed neologisms accepted more slowly into the German language than German words resulting from the application of wrd formation rules? This study addresses this question by focusing on two possible indicators for the acceptance of neologisms: a) frequency development of 239 German neologisms from the 1990s (loanwords as well as new words resulting from the application of word formation rules) in the German reference corpus DEREKO and b) frequency development in the use of pragmatic markers (‘flags’, namely quotation marks and phrases such as sogenannt ‘so-called’) with these words. In the second part of the article, a psycholinguistic approach to evaluating the (psychological) status of different neologisms and non-words in an experimentally controlled study and plans to carry out interviews in a field test to collect speakers’ opinions on the acceptance of the analysed neologisms are outlined. Finally, implications for the lexicographic treatment of both types of neologisms are discussed.
In the past two decades, more and more dictionary usage studies have been published, but most of them deal with questions related to what users appreciate about dictionaries, which dictionaries they use and what type of information they need in specific situations — presupposing that users actually consult lexicographic resources. However, language teachers and lecturers in linguistics often have the impression that students do not use enough high-quality dictionaries in their everyday work. With this in mind, we launched an international cooperation project to collect empirical data to evaluate what it is that students actually do while attempting to solve language problems. To this end, we applied a new methodological setting: screen recording in conjunction with a thinking-aloud task. The collected empirical data offers a broad insight into what users really do while they attempt to solve language-related tasks online.
cOWIDplus Analyse ist eine kontinuierlich aktualisierte Ressource zu der Frage, ob und wie stark sich der Wortschatz ausgewählter deutscher Online-Pressemeldungen während der Corona-Pandemie systematisch einschränkt und ob bzw. wann sich das Vokabular nach der Krise wieder ausweitet. In diesem Artikel erläutern die Autor*innen die hinter der Ressource stehende Forschungsfrage, die zugrunde gelegten Daten, die Methode sowie die bisherigen Ergebnisse.
Der Beitrag widmet sich den Geflüchteten als Teil der deutschlernenden Teilnehmer/innen in den staatlich verordneten Integrationskursen (IKs). Unsere Erhebung unter 305 Geflüchteten aus Syrien und anderen Ländern legt ihren Schwerpunkt auf die sprachlichen Hintergründe. Dabei werden soziodemografische Daten mit Angaben zum Spracherwerb in Beziehung gesetzt und als kollektive Sprachbiografien dargestellt. Des Weiteren beschreiben wir sieben Teilnehmergruppen von Geflüchteten in den IKs, die sich vor allem auf Grund der Faktoren Alter, Bildungsgrad und Arbeitserfahrung unterscheiden, für die aber auch Merkmale im Hinblick auf Herkunft und Mehrsprachigkeit eine Rolle spielen. Ferner werden Angaben zur Sozialsituation in Deutschland mit Einschätzungen zum Deutscherwerb in Beziehung gesetzt. Ein Vergleich mit anderen Studien verdeutlicht die Verschiebungen in der Zusammensetzung des IK. Unser Beitrag kann als Anregung verstanden werden, die Passgenauigkeit im Sinne der Deutschlernenden zu überdenken.
Eine europaweite Umfrage zu Wörterbuchbenutzung und -kultur. Ergebnisse der deutschen Teilnehmenden
(2018)
Gebrauchsgegenstand, Streitschlichter, Spielzeug, Nationalsymbol, Arbeitshilfe oder doch nur etwas, für das sich hauptsächlich Akademikerinnen und Akademiker interessieren? Welche Rolle spielen einsprachige Wörterbücher heute? Um unter anderen diesen Fragen nachzugehen, koordinierten wir gemeinsam mit Iztok Kosem (Universität Ljubljana) und Robert Lew (Adam-Mickiewicz Universität Poznań) die bis dato größte europaweite Umfrage zur Wörterbuchbenutzung und -kultur. Gemeinsam mit 26 ‚lokalen‘ Partnerinnen und Partnern aus ganz Europa führten wir im Rahmen des European Network of e-Lexicography (ENeL) diese Umfrage durch. Die Ergebnisse der Studie versprechen neue Einsichten in den gesellschaftlichen Status von Wörterbüchern in vielen europäischen Ländern. Durch die möglichst parallele Erhebung der Daten in den teilnehmenden Ländern werden außerdem interessante Vergleiche der lokalen ‚Wörterbuchkulturen‘ möglich sein. Im Fokus der Befragung standen allgemeine einsprachige Wörterbücher in der oder den jeweiligen Landessprache(n).
This article examines the contrasts and commonalities between languages for specific purposes (LSP) and their popularizations on the one hand and the frequency patterns of LSP register features in English and German on the other. For this purpose corpora of expertexpert and expert-lay communication are annotated for part-of-speech and phrase structure information. On this basis, the frequencies of pre- and post-modifications in complex noun phrases are statistically investigated and compared for English and German. Moreover, using parallel and comparable corpora it is tested whether English-German translations obey the register norms of the target language or whether the LSP frequency patterns of the source language Ñshine throughì. The results provide an empirical insight into language contact phenomena involving specialized communication.
Wiktionary is increasingly gaining influence in a wide variety of linguistic fields such as NLP and lexicography, and has great potential to become a serious competitor for publisher-based and academic dictionaries. However, little is known about the "crowd" that is responsible for the content of Wiktionary. In this article, we want to shed some light on selected questions concerning large-scale cooperative work in online dictionaries. To this end, we use quantitative analyses of the complete edit history files of the English and German Wiktionary language editions. Concerning the distribution of revisions over users, we show that — compared to the overall user base — only very few authors are responsible for the vast majority of revisions in the two Wiktionary editions. In the next step, we compare this distribution to the distribution of revisions over all the articles. The articles are subsequently analysed in terms of rigour and diversity, typical revision patterns through time, and novelty (the time since the last revision). We close with an examination of the relationship between corpus frequencies of headwords in articles, the number of article visits, and the number of revisions made to articles.
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
This replication study aims to investigate a potential bias toward addition in the German language, building upon previous findings of Winter and colleagues who identified a similar bias in English. Our results confirm a bias in word frequencies and binomial expressions, aligning with these previous findings. However, the analysis of distributional semantics based on word vectors did not yield consistent results for German. Furthermore, our study emphasizes the crucial role of selecting appropriate translational equivalents, highlighting the significance of considering language-specific factors when testing for such biases for languages other than English.
Based on the privative derivational suffix -los, we test statements found in the literature on word formation using a – at least in this field – novel empirical basis: a list of affective-emotional ratings of base nouns and associated -los derivations. In addition to a frequency analysis based on the German Reference Corpus, we show that, in general, emotional polarity (so-called valence, positive vs. negative emotions) is reversed by suffixation with -los. This change is stronger for more polarized base nouns. The perceived intensity of emotion (so-called arousal) is generally lower for -los derivations than for base nouns. Finally, to capture the results theoretically, we propose a prototypical -los construction in the framework of Construction Morphology.
Computational language models (LMs), most notably exemplified by the widespread success of OpenAI's ChatGPT chatbot, show impressive performance on a wide range of linguistic tasks, thus providing cognitive science and linguistics with a computational working model to empirically study different aspects of human language. Here, we use LMs to test the hypothesis that languages with more speakers tend to be easier to learn. In two experiments, we train several LMs—ranging from very simple n-gram models to state-of-the-art deep neural networks—on written cross-linguistic corpus data covering 1293 different languages and statistically estimate learning difficulty. Using a variety of quantitative methods and machine learning techniques to account for phylogenetic relatedness and geographical proximity of languages, we show that there is robust evidence for a relationship between learning difficulty and speaker population size. However, contrary to expectations derived from previous research, our results suggest that languages with more speakers tend to be harder to learn.
Im vorliegenden Beitrag gehen wir von der Prämisse aus, dass die Angemessenheit sprachlicher Formen nicht pauschal, sondern anhand des jeweiligen Kontexts zu beurteilen ist. Anhand einer Online-Fragebogenstudie mit durch weil eingeleiteten Nebensätzen untersuchen wir die Hypothese, dass Varianten, die nicht dem Schriftstandard entsprechen, in Kommunikationsformen, die sich weniger an standard- und schriftsprachlichen Normen orientieren, als (mindestens) ebenso angemessen oder zumindest unterschiedlich wahrgenommen werden wie eine schriftstandardsprachliche Variante. Wir untersuchen dies anhand von drei Aufgaben: Rezeption, Produktion und Assoziation zu bestimmten Medien und Textsorten. Wir können zeigen, dass die schriftnormgerechte Variante durchweg als am akzeptabelsten eingeschätzt wird. In allen drei Aufgaben finden sich aber auch eindeutige und übereinstimmende Effekte, die nahelegen, dass die verschiedenen Varianten in Abhängigkeit der Textsorte doch unterschiedlich eingeschätzt, produziert und assoziiert werden.
Dictionary usage research views dictionaries primarily as tools for solving linguistic problems. A large proportion of dictionary use now takes place online and can thus be easily monitored using tracking technologies. Using the data gathered through tracking usage data, we hope to optimize user experiences of dictionaries and other linguistic resources. Usage statistics are also used for external evaluation of linguistic resources. In this paper, we pursue the following three questions from a quantitative perspective: (1) What new insights can we gain from collecting and analysing usage data? (2) What limitations of the data and/or the collection process do we need to be aware of? (3) How can these insights and limitations inform the development and evaluation of linguistic resources?