
Organizing corpora at the Stanford Literary Lab. Balancing simplicity and flexibility in metadata management

  • This article describes ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). The work is shaped by a tension between two competing requirements: the corpora need to be merged into higher-order collections that can be analyzed as units, but it is also necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both requirements, using Apache Spark to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).
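
The distinction between "canonical" and "combined" corpora can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python structures rather than Spark and Parquet, and the corpus names, metadata keys, field mappings, and sample records are all hypothetical. The idea shown is only that each canonical record keeps its original metadata untouched, while the combined collection exposes a small shared schema plus a key that leads back to the source record.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class CanonicalText:
    """One text as it exists in its source corpus, metadata left untouched."""
    corpus: str
    identifier: str
    metadata: Dict[str, Any]


@dataclass
class CombinedRow:
    """A row in the merged collection: shared fields plus a key back to the source."""
    corpus: str
    identifier: str
    title: str
    year: int


# Hypothetical per-corpus mappings: shared field name -> corpus-specific key.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "corpus_a": {"title": "title", "year": "pub_year"},
    "corpus_b": {"title": "name", "year": "date"},
}


def combine(texts: List[CanonicalText]) -> List[CombinedRow]:
    """Project every canonical record onto the shared schema."""
    rows = []
    for t in texts:
        fields = FIELD_MAPS[t.corpus]
        rows.append(CombinedRow(
            corpus=t.corpus,
            identifier=t.identifier,
            title=str(t.metadata[fields["title"]]),
            year=int(t.metadata[fields["year"]]),  # normalise "1813" and 1851 alike
        ))
    return rows


def index_canonical(texts: List[CanonicalText]) -> Dict[Tuple[str, str], CanonicalText]:
    """Index canonical records by (corpus, identifier) for lookups from combined rows."""
    return {(t.corpus, t.identifier): t for t in texts}


# Made-up sample records with deliberately different metadata layouts.
CANONICAL = [
    CanonicalText("corpus_a", "a-001",
                  {"title": "Example Novel A", "pub_year": 1851, "genre": "novel"}),
    CanonicalText("corpus_b", "b-001",
                  {"name": "Example Novel B", "date": "1813", "library": "Example Library"}),
]

COMBINED = combine(CANONICAL)
LOOKUP = index_canonical(CANONICAL)

if __name__ == "__main__":
    for row in COMBINED:
        original = LOOKUP[(row.corpus, row.identifier)]
        print(row.title, row.year, sorted(original.metadata))
```

Analysis that needs only the shared fields can run over `COMBINED` as a single unit, while corpus-specific fields such as the hypothetical `library` key remain reachable through `LOOKUP`.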


Author: David McClure, Mark Algee-Hewitt, Steele Douris, Erik Fredner, Hannah Walser
Parent Title (English): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017
Publisher: Institut für Deutsche Sprache
Place of publication: Mannheim
Editor: Piotr Bański, Marc Kupietz, Harald Lüngen, Paul Rayson, Hanno Biber, Evelyn Breiteneder, Simon Clematide, John Mariani, Mark Stevenson, Theresa Sick
Document Type: Conference Proceeding
Year of first Publication: 2017
Date of Publication (online): 2017/07/05
Tag: Corpus linguistics; Corpus technology; Literary corpus; Metadata
GND Keyword: Datenmanagement; Englisch; Korpus <Linguistik>; Metadaten; Texttechnologie
Page Number: 5
First Page: 25
Last Page: 29
DDC classes: 400 Language
Open Access?: yes
Leibniz-Classification: Language, Linguistics
Conferences, Workshops: CMLC-5 + BigNLP / 5th Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing
Licence (German): Creative Commons - Attribution-NonCommercial-NoDerivs 3.0 Germany