OPUS 4 | Search

Corpus size strongly matters when analysing word frequency distributions (2024)

In a previous study, Aceves and Evans present a large-scale quantitative information-theoretic analysis of parallel corpus data in ~1,000 languages to show that there are apparently strong associations between the way languages encode information into words and patterns of communication, e.g. the configuration of semantic information. During the peer review process, one reviewer raised the question of the extent to which the presented results depend on different corpus sizes (see the Peer Review File). This is a very important question given that most, if not all, of the quantities associated with word frequency distributions vary systematically with corpus size. While Aceves and Evans claim that corpus size does not affect the results presented, I challenge this view by presenting reanalyses of the data that clearly suggest that it does.

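Weniger ist mehr? Eine Analyse zur „Neigung zum Hinzufügen“ im Deutschen anhand des neuen Häufigkeitsdatensatzes DeReKoGram [Less is more? An analysis of the "tendency to add" in German based on the new frequency dataset DeReKoGram] (2024)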

Wolfer, Sascha ; Koplenig, Alexander ; Kupietz, Marc ; Müller-Spitzer, Carolin

What lexical factors drive look-ups in the English Wiktionary? (2024)

Lew, Robert ; Wolfer, Sascha

This study aims to establish what lexical factors make it more likely for dictionary users to consult specific articles in a dictionary using the English Wiktionary log files, which include records of user visits over the course of 6 years. Recent findings suggest that lexical frequency is a significant factor predicting look-up behavior, with the more frequent words being more likely to be consulted. Three further lexical factors are brought into focus: (1) age of acquisition; (2) lexical prevalence; and (3) degree of polysemy operationalized as the number of dictionary senses. Age of acquisition and lexical prevalence data were obtained from recent published studies and linked to the list of visited Wiktionary lemmas, whereas polysemy status was derived from Wiktionary entries themselves. Regression modeling confirms the significance of corpus frequency in explaining user interest in looking up words in the dictionary. However, the remaining three factors also make a contribution whose nature is discussed and interpreted. Knowing what makes dictionary users look up words is both theoretically interesting and practically useful to lexicographers, telling them which lexical items should be prioritized in lexicographic work.

Predicting English Wiktionary Consultations (2022)

Wolfer, Sascha ; Lew, Robert

Dictionaries have been part and parcel of literate societies for many centuries. They assist in communication, particularly across different languages, to aid in understanding, creating, and translating texts. Communication problems arise whenever a native speaker of one language comes into contact with a speaker of another language. At the same time, English has established itself as a lingua franca of international communication. This marked tendency gives lexicography of English a particular significance, as English dictionaries are used intensively and extensively by huge numbers of people worldwide.

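Korpusfrequenzen und andere Metriken zur Strukturierung von DaF-Lehrmaterial [Corpus frequencies and other metrics for structuring teaching material for German as a foreign language (DaF)] (2021)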

Perkuhn, Rainer

Corpora and foreign-language didactics have long-standing points of contact, also beyond guided or self-directed work with the data, some of them with non-digital offshoots whose corpus-linguistic dimensions have only been explored in recent decades. Word frequency counts, including comparative ones, in data collections of any size or tailored to specific needs can be combined with further metrics that allow a differentiated assessment of didactic relevance. Collocation and co-occurrence analyses help to identify typical formulation patterns. This contribution first presents these two approaches. The shortcoming of considering them separately is that neither suffices on its own to assess the appropriateness of formulations with regard to native-like naturalness and the advancement of the learner's proficiency. As a remedy, a combination is outlined that brings both perspectives together.

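Spiele und Spielgeräte: Tamagotchi, Sudoku oder doch Puzzle? (Aus der Rubrik Neuer Wortschatz) [Games and play equipment: Tamagotchi, Sudoku, or Puzzle after all? (From the column "Neuer Wortschatz", New Vocabulary)] (2021)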

Klosa-Kückelhaus, Annette

Von AHA bis zweite Welle – Ein kommentiertes Glossar zur Coronakrise [From AHA to second wave: an annotated glossary of the coronavirus crisis] (2021)

Klosa-Kückelhaus, Annette

As the world changes, new vocabulary emerges, especially in times of major social upheaval or as a result of crises, because new things, new circumstances, and "new normalities" have to be named so that people can communicate about them. At the same time, older words come to be used more frequently because they are currently especially relevant for communication. The terms presented in this glossary address such linguistic effects of the coronavirus crisis.

Font matters: efficient adaptation to monospaced vs. proportional fonts is accompanied by effect-size differences for word frequency and predictability (2019)

Jarosch, Julian ; Schlesewsky, Matthias ; Füssel, Stephan ; Kretzschmar, Franziska

Zwischen den Jahren oder eine Zeit zwischen den Zeiten. Sprachliche Betrachtungen zur "Normalität" [Between the years, or a time between times. Linguistic reflections on "normality"] (2020)

Möhrs, Christine

(Social) Distancing, (soziale) Distanz oder (soziale) Distanzierung? [(Social) distancing, (social) "Distanz", or (social) "Distanzierung"?] (2020)

Klosa-Kückelhaus, Annette

Bilder und Metaphern im Wortschatz rund um die Coronapandemie [Images and metaphors in the vocabulary surrounding the coronavirus pandemic] (2020)

Klosa-Kückelhaus, Annette

Verbale "Heißzeit" in Deutschland? [A verbal "Heißzeit" (hot age) in Germany?] (2020)

Park, Maike

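Historische Krise - Corona-Gewinner - Wumms: Eine Corona-Geschichte zu Wörtern, die die Wirtschaft bewegt [Historic crisis, corona winners, "Wumms": a corona story about words that the economy sets in motion] (2020)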

Möhrs, Christine

Von Dauerwellen und anderen Wellen in Coronazeiten [Of permanent waves and other waves in coronavirus times] (2020)

Klosa-Kückelhaus, Annette

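Schule trotz(t) Corona: Schule unter dem sprachlichen Einfluss der Corona-Pandemie [School despite (and defying) corona: school under the linguistic influence of the coronavirus pandemic] (2020)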

Möhrs, Christine

Medizinisches Vokabular rund um die Coronapandemie [Medical vocabulary surrounding the coronavirus pandemic] (2020)

Klosa-Kückelhaus, Annette

Über Plexiglas, Abstand und Distanz [On Plexiglas, "Abstand", and "Distanz" (two German words for distance)] (2020)

Klosa-Kückelhaus, Annette

Von Aluhüten, Verschwörungstheorien und Coronaskepsis [Of tinfoil hats, conspiracy theories, and corona skepticism] (2020)

Klosa-Kückelhaus, Annette

Ein Wortnetz entspinnt sich um "Corona" [A web of words unfolds around "Corona"] (2020)

Möhrs, Christine

Maske oder Mundschutz? ["Maske" (mask) or "Mundschutz" (mouth covering)?] (2020)

Klosa-Kückelhaus, Annette

Lexical frequency co-determines the speed-curvature relation in articulation (2018)

Tomaschek, Fabian ; Arnold, Denis ; Bröker, Franziska ; Baayen, R. Harald

The relation between speed and curvature provides a characterization of the spatio-temporal orchestration of kinematic movements. For hand movements, this relation has been reported to follow a power law with exponent -1/3. The same power law has been claimed to govern articulatory movements. We studied the functional form of speed as predicted by curvature using electromagnetic articulography, focusing on three sensors: the tongue tip, the tongue body, and the lower lip. Of specific interest to us was the question of whether the speed-curvature relation is modified by articulatory practice, gauged with words’ frequencies of occurrence. Although analyses imposing linearity a priori indeed supported a power law, relaxation of this linearity assumption revealed that the effect of curvature on speed levels off substantially for lower values of curvature. A modification of the power law is proposed that takes this curvature into account. Furthermore, controlling statistically for number of phones and word duration, we observed that the speed-curvature function was further modulated by an interaction of lexical frequency by curvature, such that for increasing frequency, speed decreased slightly for low curvatures while it increased slightly for high curvatures. The modulation of the balance between speed and curvature by lexical frequency provides further evidence that the skill of articulation improves with practice on a word-to-word basis, and challenges theories of speech production.

A New Application for Raising in HPSG: Complex Prepositions (2003)

Trawiński, Beata

One of the most popular techniques used in HPSG-based studies to describe linguistic phenomena is the raising mechanism. Besides ordinary raising verbs or adjectives, this tool has been applied for handling verbal complexes and discontinuous constituents, among other phenomena. In this paper, a new application for raising within the HPSG paradigm will be discussed, thereby investigating data from the prepositional domain. We will analyze linguistic properties of word combinations in German consisting of a preposition, a noun, and another preposition (such as auf Grund von (‘by virtue of’)), thus arguing that raising is the most appropriate method for satisfactorily describing the crucial syntactic features which are typical for those expressions. The objective of this paper is thus to demonstrate the efficiency of the raising mechanism as used in HPSG, and therefore, to emphasize the importance of designing a satisfactory uniform theory of raising within this grammar framework.

Growing trees from morphs: Towards data-driven morphological parsing (2015)

Steiner, Petra ; Ruppenhofer, Josef

We present a quantitative approach to disambiguating flat morphological analyses and producing more deeply structured analyses. Based on existing morphological segmentations, possible combinations of resulting word trees for the next level are filtered first by criteria of linguistic plausibility and then by weighting procedures based on the geometric mean. The frequencies for weighting are derived from three different sources (counts of morphs in a lexicon, counts of largest constituents in a lexicon, counts of token frequencies in a corpus) and can be used either to find the best analysis on the level of morphs or on the next higher constituent level. The evaluation shows that for this task corpus-based frequency counts are slightly superior to counts of lexical data.

A new application for raising in HPSG: complex prepositions (2008)

Trawiński, Beata

One of the most popular techniques used in HPSG-based studies to describe linguistic phenomena is the raising mechanism. Besides ordinary raising verbs or adjectives, this tool has been applied for handling verbal complexes and discontinuous constituents, among other phenomena. In this paper, a new application for raising within the HPSG paradigm will be discussed, thereby investigating data from the prepositional domain. We will analyze linguistic properties of word combinations in German consisting of a preposition, a noun, and another preposition (such as auf Grund von (‘by virtue of’)), thus arguing that raising is the most appropriate method for satisfactorily describing the crucial syntactic features which are typical for those expressions. The objective of this paper is thus to demonstrate the efficiency of the raising mechanism as used in HPSG, and therefore, to emphasize the importance of designing a satisfactory uniform theory of raising within this grammar framework.
