Refine
Year of publication
Document Type
- Article (25) (remove)
Has Fulltext
- yes (25)
Is part of the Bibliography
- yes (25) (remove)
Keywords
- Deutsch (7)
- Korpus <Linguistik> (4)
- Sprachstatistik (4)
- Wortschatz (4)
- Benutzer (3)
- COVID-19 (3)
- Computerunterstützte Lexikographie (3)
- Lexikostatistik (3)
- Online-Medien (3)
- Vielfalt (3)
Publicationstate
- Veröffentlichungsversion (25) (remove)
Reviewstate
- Peer-Review (14)
- (Verlags)-Lektorat (9)
Publisher
In many countries of the world, perspectives on gender equality and racism have changed in recent decades. One result has been more attention being devoted to traces of androcentric and racist language in society. This also affects dictionaries. In lexicography there are discussions about whether or to what extent social asymmetries are inscribed in dictionaries and if this is still acceptable. The issue of the nature of description plays an important role in this discussion. If sexist usages are often found in language use, i.e. in the corpus data on which the dictionary is based, does the dictionary also have to show them? How is this, in turn, compatible with the normative power of dictionaries? Do dictionaries contribute to the perpetuation of gender stereotypes by showcasing them under the banner of descriptive principles? And what roles do lexicographers play in this process? The article deals with these questions on the basis of individual lexicographical examples and current discussions in the lexicographic and public community.
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
Der Anlass dieser Untersuchung war zunächst anekdotische Evidenz: Eines der Kinder der Autor*innen macht 2022 Abitur und las in ihrer gesamten gymnasialen Laufbahn genau eine ›Ganzschrift‹ einer Autorin: Die Judenbuche von Annette von Droste-Hülshoff. Zweifellos ein lesenswerter Text, aber konnte es wirklich sein, dass man in Deutschland 2022 Abitur macht, sogar Deutsch-Leistungskurs gewählt hat und sonst kein Buch einer Autorin im Deutschunterricht liest? Auch in den Pflichtlektüren für das Deutschabitur ist im entsprechenden Bundesland bei den empfohlenen Texten kein Roman und kein Drama einer Verfasserin verzeichnet. Neugierig geworden, recherchierten wir nach einer Liste, welche Literatur für den Deutschunterricht an Gymnasien in Baden-Württemberg (wo die Anekdote sich ereignete) insgesamt empfohlen wurde, und fanden auf den Seiten des Kultusministeriums eine umfangreiche Liste, auf der 298 Werke verzeichnet sind. Eine Auswertung nach dem Geschlecht der Verfasser*innen ergab, dass von den Einträgen auf dieser Liste 31 Titel bzw. Autor*innen (von) Frauen sind, d.h. rund 10 %.
Im vorliegenden Beitrag gehen wir von der Prämisse aus, dass die Angemessenheit sprachlicher Formen nicht pauschal, sondern anhand des jeweiligen Kontexts zu beurteilen ist. Anhand einer Online-Fragebogenstudie mit durch weil eingeleiteten Nebensätzen untersuchen wir die Hypothese, dass Varianten, die nicht dem Schriftstandard entsprechen, in Kommunikationsformen, die sich weniger an standard- und schriftsprachlichen Normen orientieren, als (mindestens) ebenso angemessen oder zumindest unterschiedlich wahrgenommen werden wie eine schriftstandardsprachliche Variante. Wir untersuchen dies anhand von drei Aufgaben: Rezeption, Produktion und Assoziation zu bestimmten Medien und Textsorten. Wir können zeigen, dass die schriftnormgerechte Variante durchweg als am akzeptabelsten eingeschätzt wird. In allen drei Aufgaben finden sich aber auch eindeutige und übereinstimmende Effekte, die nahelegen, dass die verschiedenen Varianten in Abhängigkeit der Textsorte doch unterschiedlich eingeschätzt, produziert und assoziiert werden.
Das Kommunizieren in Sozialen Medien und der Umgang mit Hypertexten ist im Jahr 2020 kein Randphänomen mehr. Die sprachlichen Besonderheiten internetbasierter Kommunikation und Sozialer Medien sind mittlerweile auch gut erforscht und beschrieben, allerdings werden diese bislang in deutschen Grammatiken, mit Ausnahme von Hoffmann (2014), allenfalls am Rande behandelt. Selbst neuere Ansätze zur Textanalyse, z. B. Ágel (2017), konzentrieren sich auf gestaltstabile, linear organisierte Schrifttexte. Dasselbe gilt für Ansätze, die primär für die Bewertung von Schreibprodukten in Bildungskontexten entwickelt wurden.
cOWIDplus Analyse ist eine kontinuierlich aktualisierte Ressource zu der Frage, ob und wie stark sich der Wortschatz ausgewählter deutscher Online-Pressemeldungen während der Corona-Pandemie systematisch einschränkt und ob bzw. wann sich das Vokabular nach der Krise wieder ausweitet. In diesem Artikel erläutern die Autor*innen die hinter der Ressource stehende Forschungsfrage, die zugrunde gelegten Daten, die Methode sowie die bisherigen Ergebnisse.
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
(2019)
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
This paper presents the results of a survey on dictionary use in Europe, the largest survey of dictionary use to date with nearly 10,000 participants in nearly thirty countries. The paper focuses on the comparison of the results of the Slovenian participants with the results of the participants from other European countries. The comparisons are made both with the European averages, and with the results from individual countries, in order to determine in which aspects Slovenian participants share similarities with other dictionary users (and non-users) around Europe, and in which aspects they differ. The findings show that in many ways the Slovenian users are similar to their European counterparts, with some noticeable exceptions, including (much) stronger preference for digital dictionaries over print ones, above-average reliance on other people when dictionary does not contain the relevant information, and the largest difference between the price of a dictionary and the amount willing to spend on it.
We present ESDexplorer (https://owid.shinyapps.io/ESDexplorer), a browser application which allows the user to explore the data from a large European survey on dictionary use and culture. We built ESDexplorer with several target groups in mind: our cooperation partners, other researchers, and a more general public interested in the results. Also, we present in detail the architecture and technological realisation of the application and discuss some legal aspects of data protection that motivated some architectural choices.
In the past two decades, more and more dictionary usage studies have been published, but most of them deal with questions related to what users appreciate about dictionaries, which dictionaries they use and what type of information they need in specific situations — presupposing that users actually consult lexicographic resources. However, language teachers and lecturers in linguistics often have the impression that students do not use enough high-quality dictionaries in their everyday work. With this in mind, we launched an international cooperation project to collect empirical data to evaluate what it is that students actually do while attempting to solve language problems. To this end, we applied a new methodological setting: screen recording in conjunction with a thinking-aloud task. The collected empirical data offers a broad insight into what users really do while they attempt to solve language-related tasks online.
Eine europaweite Umfrage zu Wörterbuchbenutzung und -kultur. Ergebnisse der deutschen Teilnehmenden
(2018)
Gebrauchsgegenstand, Streitschlichter, Spielzeug, Nationalsymbol, Arbeitshilfe oder doch nur etwas, für das sich hauptsächlich Akademikerinnen und Akademiker interessieren? Welche Rolle spielen einsprachige Wörterbücher heute? Um unter anderen diesen Fragen nachzugehen, koordinierten wir gemeinsam mit Iztok Kosem (Universität Ljubljana) und Robert Lew (Adam-Mickiewicz Universität Poznań) die bis dato größte europaweite Umfrage zur Wörterbuchbenutzung und -kultur. Gemeinsam mit 26 ‚lokalen‘ Partnerinnen und Partnern aus ganz Europa führten wir im Rahmen des European Network of e-Lexicography (ENeL) diese Umfrage durch. Die Ergebnisse der Studie versprechen neue Einsichten in den gesellschaftlichen Status von Wörterbüchern in vielen europäischen Ländern. Durch die möglichst parallele Erhebung der Daten in den teilnehmenden Ländern werden außerdem interessante Vergleiche der lokalen ‚Wörterbuchkulturen‘ möglich sein. Im Fokus der Befragung standen allgemeine einsprachige Wörterbücher in der oder den jeweiligen Landessprache(n).
Languages employ different strategies to transmit structural and grammatical information. While, for example, grammatical dependency relationships in sentences are mainly conveyed by the ordering of the words for languages like Mandarin Chinese, or Vietnamese, the word ordering is much less restricted for languages such as Inupiatun or Quechua, as these languages (also) use the internal structure of words (e.g. inflectional morphology) to mark grammatical relationships in a sentence. Based on a quantitative analysis of more than 1,500 unique translations of different books of the Bible in almost 1,200 different languages that are spoken as a native language by approximately 6 billion people (more than 80% of the world population), we present large-scale evidence for a statistical trade-off between the amount of information conveyed by the ordering of words and the amount of information conveyed by internal word structure: languages that rely more strongly on word order information tend to rely less on word structure information and vice versa. Or put differently, if less information is carried within the word, more information has to be spread among words in order to communicate successfully. In addition, we find that–despite differences in the way information is expressed–there is also evidence for a trade-off between different books of the biblical canon that recurs with little variation across languages: the more informative the word order of the book, the less informative its word structure and vice versa. We argue that this might suggest that, on the one hand, languages encode information in very different (but efficient) ways. On the other hand, content-related and stylistic features are statistically encoded in very similar ways.
In order to demonstrate why it is important to correctly account for the (serial dependent) structure of temporal data, we document an apparently spectacular relationship between population size and lexical diversity: for five out of seven investigated languages, there is a strong relationship between population size and lexical diversity of the primary language in this country. We show that this relationship is the result of a misspecified model that does not consider the temporal aspect of the data by presenting a similar but nonsensical relationship between the global annual mean sea level and lexical diversity. Given the fact that in the recent past, several studies were published that present surprising links between different economic, cultural, political and (socio-)demographical variables on the one hand and cultural or linguistic characteristics on the other hand, but seem to suffer from exactly this problem, we explain the cause of the misspecification and show that it has profound consequences. We demonstrate how simple transformation of the time series can often solve problems of this type and argue that the evaluation of the plausibility of a relationship is important in this context. We hope that our paper will help both researchers and reviewers to understand why it is important to use special models for the analysis of data with a natural temporal ordering.
In this contribution, we present a novel approach for the analysis of cross-reference structures in digital dictionaries on the basis of the complete dictionary database. Using paradigmatic items in the German Wiktionary as an example, we show how analyses based on graph theory can be fruitfully applied in this context, e. g. to gain an overview of paradigmatic references as a whole or to detect closely connected groups of headwords. Furthermore, we connect information about cross-reference structures with corpus frequencies and log file statistics. In this way, we can answer questions such as the following ones: Are frequent words paradigmatically linked more closely than others? Are closely linked headwords or headwords that stand more solitary in the dictionary visited significantly more often?
We present studies using the 2013 log files from the German version of Wiktionary. We investigate several lexicographically relevant variables and their effect on look-up frequency: Corpus frequency of the headword seems to have a strong effect on the number of visits to a Wiktionary entry. We then consider the question of whether polysemic words are looked up more often than monosemic ones. Here, we also have to take into account that polysemic words are more frequent in most languages. Finally, we present a technique to investigate the time-course of look-up behaviour for specific entries. We exemplify the method by investigating influences of (temporary) social relevance of specific headwords.
The methods utilized in the area of research into dictionary use are established research methods in the social sciences. After explicating the different steps of a typical empirical investigation, this article provides examples of how these different methods are used in various user studies conducted in the field of using online dictionaries. Thereby, different kinds of data collection (surveys as online questionnaires, log files and eye tracking) as well as different research design structures (for instance, ex-post-facto design or experimental design) are discussed.