The impact of lacking metadata and data truncation for the measurement of cultural and linguistic change using the Google Ngram datasets

Because of legal restrictions, the Google Ngram Corpora datasets (a) are not accompanied by any metadata about the texts that make up the corpora, and (b) are truncated so that the author of a text cannot be inferred indirectly from an n-gram. This article discusses some of the consequences of this strategy.

Metadata
Author:	Alexander Koplenig
URN:	urn:nbn:de:bsz:mh39-31557
Publisher:	Institut für Deutsche Sprache
Place of publication:	Mannheim
Document Type:	Preprint
Language:	German
Year of first Publication:	2014
Date of Publication (online):	2014/10/17
Tag:	N-gram
GND Keyword:	Data structure; Corpus <Linguistics>; Cultural change; Metadata; Language statistics; Language change
Page Number:	28 pp., 2 appendices
Note:	An updated version of this paper, entitled "The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram datasets – reconstructing the composition of the German corpus in times of WWII", has been accepted for publication in the journal "Digital Scholarship in the Humanities" (http://dsh.oxfordjournals.org/content/early/2015/09/02/llc.fqv037).
DDC classes:	400 Language / 410 Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Computational linguistics
Licence:	Protected by copyright
