Year of publication: 2017
Language: English
Review state: Peer-Review
Keywords: Corpus linguistics; Corpus technology; Datenmanagement; Englisch; Korpus <Linguistik>; Literary corpus; Metadata; Metadaten; Texttechnologie
This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements: the corpora need to be merged into higher-order collections that can be analyzed as units, but at the same time it is necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both requirements. Apache Spark is used to index the data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).
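The abstract does not spell out how the “canonical” and “combined” corpora relate, so the following is only a minimal sketch of one plausible reading: each canonical corpus keeps its own metadata fields untouched, while a combined corpus holds a small shared set of fields plus a pointer back to the source record. Everything named here is hypothetical (the corpus names, field names, `CombinedRow`, `build_combined`, `resolve`), and plain Python structures stand in for the Spark and Parquet tables mentioned above.

```python
from dataclasses import dataclass

# Hypothetical canonical corpora. Each one keeps its own metadata
# fields and identifier scheme, exactly as delivered.
CANONICAL = {
    "corpus_a": [
        {"id": "a1", "title": "Novel One", "pub_year": 1850, "word_count": 120000},
        {"id": "a2", "title": "Novel Two", "pub_year": 1861, "word_count": 95000},
    ],
    "corpus_b": [
        {"doc_id": 7, "name": "Novel Three", "date": "1872-03-01", "genre": "gothic"},
    ],
}

# Assumed name of the identifier field in each canonical corpus.
ID_FIELDS = {"corpus_a": "id", "corpus_b": "doc_id"}


@dataclass(frozen=True)
class CombinedRow:
    """One row of a combined corpus: shared fields plus a back-pointer."""
    corpus: str      # which canonical corpus the text comes from
    source_id: str   # identifier of the record inside that corpus
    title: str
    year: int


def _from_a(rec):
    return CombinedRow("corpus_a", str(rec["id"]), rec["title"], rec["pub_year"])


def _from_b(rec):
    # corpus_b stores an ISO date string; only the year is shared.
    return CombinedRow("corpus_b", str(rec["doc_id"]), rec["name"], int(rec["date"][:4]))


# One adapter per canonical corpus maps its fields onto the shared ones.
ADAPTERS = {"corpus_a": _from_a, "corpus_b": _from_b}


def build_combined(canonical):
    """Merge all canonical corpora into one list of CombinedRow objects."""
    rows = []
    for name, records in canonical.items():
        rows.extend(ADAPTERS[name](rec) for rec in records)
    return rows


def resolve(row, canonical):
    """Follow a combined row back to its full canonical record."""
    key = ID_FIELDS[row.corpus]
    for rec in canonical[row.corpus]:
        if str(rec[key]) == row.source_id:
            return rec
    raise KeyError((row.corpus, row.source_id))


combined = build_combined(CANONICAL)
```

In this sketch the combined rows can be analyzed as a single unit (for example, filtered by `year`), while `resolve(combined[2], CANONICAL)` still returns the original `corpus_b` record with its corpus-specific `genre` field intact.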