Refine
Document Type
- Article (3)
Language
- English (3)
Has Fulltext
- yes (3)
Is part of the Bibliography
- yes (3)
Keywords
- Sprachstatistik (3) (remove)
Publicationstate
- Postprint (1)
- Veröffentlichungsversion (1)
Reviewstate
- Peer-Revied (3) (remove)
Publisher
- Oxford University Press (OUP) (2)
- Buske (1)
The Google Ngram Corpora seem to offer a unique opportunity to study linguistic and cultural change in quantitative terms. To avoid breaking any copyright laws, the data sets are not accompanied by any metadata regarding the texts the corpora consist of. Some of the consequences of this strategy are analyzed in this article. I chose the example of measuring censorship in Nazi Germany, which received widespread attention and was published in a paper that accompanied the release of the Google Ngram data (Michel et al. (2010): Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176–82). I show that without proper metadata, it is unclear whether the results actually reflect any kind of censorship at all. Collectively, the findings imply that observed changes in this period of time can only be linked directly to World War II to a certain extent. Therefore, instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’. On a more general level, the analysis demonstrates the importance of metadata, the availability of which is not just a nice add-on, but a powerful source of information for the digital humanities.
Recently, a claim was made, on the basis of the German Google Books 1-gram corpus (Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science 2010; 331: 176–82), that there was a linear relationship between six non-technical non-Nazi words and three ‘explicitly Nazi words’ in times of World War II (Caruana-Galizia. 2015. Politics and the German language: Testing Orwell’s hypothesis using the Google N-Gram corpus. Digital Scholarship in the Humanities [Online]. http://dsh.oxfordjournals.org/cgi/doi/10.1093/llc/fqv011 (accessed 15 April 2015)). Here, I try to show that apparent relationships like this are the result of misspecified models that do not take into account the temporal aspect of time-series data. The main point of this article is to demonstrate why such analyses run the risk of incorrect statistical inference, where potential effects are both meaningless and can potentially lead to wrong conclusions.
The article investigates the conditions under which the w-relativizer was appears instead of the d-relativzer das in German relative clauses. Building on Wiese 2013, we argue that was constitutes the elsewhere case that applies when identification with the antecedent cannot be established by syntactic means via upward agreement with respect to phi-features. Corpuslinguistic results point to the conclusion that this is the case whenever there is no lexical nominal in the antecedent that, following Geach 1962 and Baker 2003, supplies a criterion of identity needed to establish sameness of reference between the antecedent and the relativizer.