Refine
Document Type
Language
- English (2)
Has Fulltext
- yes (2)
Is part of the Bibliography
- no (2)
Keywords
- Metadata (2) (remove)
Publicationstate
- Veröffentlichungsversion (2) (remove)
Reviewstate
- Peer-Review (1)
Publisher
This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements – the corpora need to be merged together into higher-order collections that can be analyzed as units; but, at the same time, it’s also necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both of these requirements – Apache Spark is used to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíek et al., 2014; van Uytvanck, 2010).
Historical cabinet protocols are a useful resource which enable historians to identify the opinions expressed by politicians on different subjects and at different points of time. While cabinet protocols are often available in digitized form, so far the only method to access their information content is by keyword-based search, which often returns sub-optimal results. We present a method for enriching German cabinet protocols with information about the originators of statements. This requires automatic speaker attribution. In order to avoid costly manual annotation of training data, we design a rule-based system which exploits morpho-syntactic cues. Unlike many other approaches, our method can also deal with cases in which the speaker is not explicitly identified in the sentence itself. This is an important capability as 45% of all sentences in the data constitute reported speech whose speakers are not explicitly marked. Our system is able to detect implicit speakers by taking into account signals of speaker continuity. We show that such a system obtains good results, especially with respect to recall which is particularly important for information access.