TY - JOUR
A1 - Koplenig, Alexander
T1 - The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII
JF - Digital Scholarship in the Humanities
N2 - The Google Ngram Corpora seem to offer a unique opportunity to study linguistic and cultural change in quantitative terms. To avoid breaking any copyright laws, the data sets are not accompanied by any metadata regarding the texts the corpora consist of. Some of the consequences of this strategy are analyzed in this article. I chose the example of measuring censorship in Nazi Germany, which received widespread attention and was published in a paper that accompanied the release of the Google Ngram data (Michel et al. (2010): Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176–82). I show that without proper metadata, it is unclear whether the results actually reflect any kind of censorship at all. Collectively, the findings imply that observed changes in this period of time can only be linked directly to World War II to a certain extent. Therefore, instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’. On a more general level, the analysis demonstrates the importance of metadata, the availability of which is not just a nice add-on, but a powerful source of information for the digital humanities.
KW - Sprachwandel
KW - Sprachstatistik
KW - Metadaten
KW - Kulturwandel
KW - Korpus
KW - Datenstruktur
Y1 - 2017
U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-49493
N1 - Preprint is published under http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-31557 Advance Access published September 12, 2015
VL - 32
IS - 1
SP - 169
EP - 188
PB - Oxford University Press (OUP)
CY - Oxford
ER -

TY - JOUR
A1 - Koplenig, Alexander
T1 - Why the quantitative analysis of diachronic corpora that does not consider the temporal aspect of time-series can lead to wrong conclusions
JF - Digital Scholarship in the Humanities
N2 - Recently, a claim was made, on the basis of the German Google Books 1-gram corpus (Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science 2010; 331: 176–82), that there was a linear relationship between six non-technical non-Nazi words and three ‘explicitly Nazi words’ in times of World War II (Caruana-Galizia. 2015. Politics and the German language: Testing Orwell’s hypothesis using the Google N-Gram corpus. Digital Scholarship in the Humanities [Online]. http://dsh.oxfordjournals.org/cgi/doi/10.1093/llc/fqv011 (accessed 15 April 2015)). Here, I try to show that apparent relationships like this are the result of misspecified models that do not take into account the temporal aspect of time-series data. The main point of this article is to demonstrate why such analyses run the risk of incorrect statistical inference, where potential effects are both meaningless and can potentially lead to wrong conclusions.
KW - Sprachstatistik
KW - Korpus
KW - Internet
Y1 - 2017
U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-43679
SN - 2055-768X
N1 - Advance Access published August 13, 2015 Print version: Digital Scholarship Humanities (2017) 32 (1): 159-168.
VL - 32
IS - 1
SP - 159
EP - 168
PB - Oxford University Press (OUP)
CY - Oxford
ER -