OPUS 4 | Search

64 search hits

1 to 10

Why the quantitative analysis of diachronic corpora that does not consider the temporal aspect of time-series can lead to wrong conclusions (2017)

Koplenig, Alexander

Recently, a claim was made, on the basis of the German Google Books 1-gram corpus (Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science 2010; 331: 176–82), that there was a linear relationship between six non-technical non-Nazi words and three ‘explicitly Nazi words’ during World War II (Caruana-Galizia. 2015. Politics and the German language: Testing Orwell’s hypothesis using the Google N-Gram corpus. Digital Scholarship in the Humanities [Online]. http://dsh.oxfordjournals.org/cgi/doi/10.1093/llc/fqv011 (accessed 15 April 2015)). Here, I try to show that apparent relationships like this are the result of misspecified models that do not take into account the temporal aspect of time-series data. The main point of this article is to demonstrate why such analyses run the risk of incorrect statistical inference, where apparent effects are meaningless and can lead to wrong conclusions.

Web corpora - the best possible solution for tracking rare phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian (2017)

Jurkiewicz-Rohrbacher, Edyta ; Kolaković, Zrinka ; Hansen, Björn

Complex linguistic phenomena, such as Clitic Climbing in Bosnian, Croatian and Serbian, are often described intuitively, only from the perspective of the main tendency. In this paper, we argue that web corpora currently offer the best source of empirical material for studying Clitic Climbing in BCS. They thus allow the most accurate description of this phenomenon, as less frequent constructions can be tracked only in big, well-annotated data sources. We compare the properties of web corpora for BCS with traditional sources and give examples of studies on CC based on web corpora. Furthermore, we discuss problems related to web corpora and suggest some improvements for the future.

Visualizing Language Change in a Corpus of Contemporary German (2017)

Fankhauser, Peter ; Kupietz, Marc

Visualisierung linguistischer Daten mit der freien Grafik- und Statistikumgebung R (2017)

Wolfer, Sascha ; Hansen-Morath, Sandra

Visual Correlation for Detecting Patterns in Language Change (2017)

Fankhauser, Peter ; Kupietz, Marc

Verwechslungsfehler oder Wandel - sprachliche Veränderungen der Gegenwart und ihre lexikografische Dokumentation (2017)

Storjohann, Petra ; Schnörch, Ulrich

Lexicographic meaning descriptions of German lexical items which are formally and semantically similar and therefore easily confused (so-called paronyms) often do not reflect the current usage of these items. They can even contradict one’s personal intuition or disagree with lexical usage as observed in public discourse. The reasons are manifold. Language data used for compiling dictionaries is either outdated, or lexicographic practice is rather conventional and does not take advantage of corpus-assisted approaches to semantic analysis. Despite various modern electronic or online reference works, speakers face uncertainties when dealing with easily confusable words. Examples are sensibel/sensitiv (sensitive) or kindisch/kindlich (childish/childlike). Existing dictionaries often do not provide satisfactory answers as to how to use these sets correctly. Numerous questions addressed in online forums show where uncertainties with paronyms lie and why users demand further assistance concerning proper contextual usage (cf. Storjohann 2015). There are different reasons why users misuse certain items or mix up words which are similar in form and meaning. As data from written and more spontaneous language resources suggest, some confusions arise due to ongoing semantic change in the current use of some paronyms. This paper identifies shortcomings of contemporary German dictionaries and discusses innovative ways of empirical lexicographic work that might pave the way for a new data-driven, descriptive reference work of confusable German terms. Currently, such a guide is being developed at the Institute for German Language in Mannheim, drawing on corpora and diverse corpus-analytical methods. Its objective is to compile a dictionary with contrastive entries which is a useful reference tool in situations of language doubt. At the same time, it aims at sensitizing users to context dependency and language change.

Variationistische Korpusstudien (2017)

Lüdeling, Anke

This paper deals with some aspects of the variationist annotation of corpus data. Several examples show that comparing categories within one corpus, or comparing two corpora, can be interpreted in variationist terms only under certain conditions. Because defining variables is often difficult, and because the assignment of variants to variables can differ depending on the research question, variables and variants must be annotated in a corpus in a way that is transparent and reproducible for everyone. The paper argues for an open corpus architecture in which variables and variants can be added to an existing corpus at any time.

User's Guide for the ZAS Database of Clause-Embedding Predicates (2017)

Stiebels, Barbara ; McFadden, Thomas ; Schwabe, Kerstin ; Solstad, Torgrim ; Kellner, Elisa ; Sommer, Livia ; Stoltmann, Katarzyna

The ZAS database of clause-embedding predicates (2017)

Stiebels, Barbara ; McFadden, Thomas ; Schwabe, Kerstin ; Solstad, Torgrim ; Kellner, Elisa ; Sommer, Livia ; Stoltmann, Katarzyna

The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII (2017)

Koplenig, Alexander

The Google Ngram Corpora seem to offer a unique opportunity to study linguistic and cultural change in quantitative terms. To avoid breaking any copyright laws, the data sets are not accompanied by any metadata regarding the texts the corpora consist of. Some of the consequences of this strategy are analyzed in this article. I chose the example of measuring censorship in Nazi Germany, which received widespread attention and was published in a paper that accompanied the release of the Google Ngram data (Michel et al. (2010): Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176–82). I show that without proper metadata, it is unclear whether the results actually reflect any kind of censorship at all. Collectively, the findings imply that observed changes in this period of time can only be linked directly to World War II to a certain extent. Therefore, instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’. On a more general level, the analysis demonstrates the importance of metadata, the availability of which is not just a nice add-on, but a powerful source of information for the digital humanities.
