Organizing corpora at the Stanford Literary Lab. Balancing simplicity and flexibility in metadata management
- This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements – the corpora need to be merged together into higher-order collections that can be analyzed as units; but, at the same time, it’s also necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both of these requirements – Apache Spark is used to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).
| Author: | David McClure, Mark Algee-Hewitt, Steele Douris, Erik Fredner, Hannah Walser |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-62617 |
| Parent Title (English): | Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017 |
| Publisher: | Institut für Deutsche Sprache |
| Place of publication: | Mannheim |
| Editor: | Piotr Bański, Marc Kupietz, Harald Lüngen, Paul Rayson, Hanno Biber, Evelyn Breiteneder, Simon Clematide, John Mariani, Mark Stevenson, Theresa Sick |
| Document Type: | Conference Proceeding |
| Language: | English |
| Year of first Publication: | 2017 |
| Date of Publication (online): | 2017/07/05 |
| Publication state: | Published version |
| Review state: | Peer-reviewed |
| Tag: | Corpus linguistics; Corpus technology; Literary corpus; Metadata |
| GND Keyword: | Datenmanagement; Englisch; Korpus <Linguistik>; Metadaten; Texttechnologie |
| Page Number: | 5 |
| First Page: | 25 |
| Last Page: | 29 |
| DDC classes: | 400 Language |
| Open Access?: | yes |
| Leibniz-Classification: | Language, Linguistics |
| Linguistics-Classification: | Corpus linguistics |
| Conferences, Workshops: | CMLC-5 + BigNLP / 5th Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing |
| Licence: | Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Germany (CC BY-NC-ND 3.0 DE) |


