TY  - CHAP
U1  - Conference publication
A1  - McClure, David
A1  - Algee-Hewitt, Mark
A1  - Douris, Steele
A1  - Fredner, Erik
A1  - Walser, Hannah
ED  - Bański, Piotr
ED  - Kupietz, Marc
ED  - Lüngen, Harald
ED  - Rayson, Paul
ED  - Biber, Hanno
ED  - Breiteneder, Evelyn
ED  - Clematide, Simon
ED  - Mariani, John
ED  - Stevenson, Mark
ED  - Sick, Theresa
T1  - Organizing corpora at the Stanford Literary Lab. Balancing simplicity and flexibility in metadata management
T2  - Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017
N2  - This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements – the corpora need to be merged together into higher-order collections that can be analyzed as units; but, at the same time, it’s also necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both of these requirements – Apache Spark is used to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).
KW  - Korpus <Linguistik>
KW  - Englisch
KW  - Texttechnologie
KW  - Datenmanagement
KW  - Metadaten
KW  - Corpus linguistics
KW  - Literary corpus
KW  - Metadata
KW  - Corpus technology
Y1  - 2017
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-62617
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-62617
SP  - 25
EP  - 29
S1  - 5
PB  - Institut für Deutsche Sprache
CY  - Mannheim
ER  -