TY  - CHAP
U1  - Konferenzveröffentlichung
A1  - McClure, David
A1  - Algee-Hewitt, Mark
A1  - Douris, Steele
A1  - Fredner, Erik
A1  - Walser, Hannah
ED  - Bański, Piotr
ED  - Kupietz, Marc
ED  - Lüngen, Harald
ED  - Rayson, Paul
ED  - Biber, Hanno
ED  - Breiteneder, Evelyn
ED  - Clematide, Simon
ED  - Mariani, John
ED  - Stevenson, Mark
ED  - Sick, Theresa
T1  - Organizing corpora at the Stanford Literary Lab. Balancing simplicity and flexibility in metadata management
T2  - Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017
N2  - This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements – the corpora need to be merged together into higher-order collections that can be analyzed as units; but, at the same time, it’s also necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both of these requirements – Apache Spark is used to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).
KW  - Korpus
KW  - Englisch
KW  - Texttechnologie
KW  - Datenmanagement
KW  - Metadaten
KW  - Corpus linguistics
KW  - Literary corpus
KW  - Metadata
KW  - Corpus technology
Y1  - 2017
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-62617
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-62617
SP  - 25
EP  - 29
S1  - 5
PB  - Institut für Deutsche Sprache
CY  - Mannheim
ER  - 