Keywords
- DSSSL (1)
- Deutsch (1)
- Deutsches Referenzkorpus (DeReKo) (1)
- Institut für Deutsche Sprache <Mannheim> (1)
- Korpus <Linguistik> (1)
- Morphology (1)
- Speech Corpora (1)
- Speech Lexica (1)
- Text Technology (1)
- Textkorpus (1)
Publisher
- European Language Resources Association (ELRA) (2)
We describe a general two-stage procedure for re-using a custom corpus for spoken language system development. The first stage transforms character-based markup into XML; the second uses DSSSL stylesheets to enhance the XML markup with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).
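The two stages described above can be illustrated with a minimal sketch. It is written in Python rather than DSSSL and is not the authors' implementation. The input convention (`SPEAKER: word word ...`), the element names `dialogue`, `turn`, and `w`, the toy `LEXICON`, and the functions `to_xml` and `tag_on_demand` are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: lower-cased word form -> lexical information.
LEXICON = {
    "guten": {"pos": "ADJ", "lemma": "gut"},
    "morgen": {"pos": "N", "lemma": "Morgen"},
}


def to_xml(lines):
    """Stage 1 (assumed input format): 'SPEAKER: word word ...' lines -> XML tree."""
    root = ET.Element("dialogue")
    for line in lines:
        speaker, _, text = line.partition(":")
        turn = ET.SubElement(root, "turn", who=speaker.strip())
        for form in text.split():
            ET.SubElement(turn, "w").text = form
    return root


def tag_on_demand(root, lexicon, categories):
    """Stage 2 as a filter: yield a tagged copy of each turn, one at a time.

    Only the requested lexical categories are attached, and the source tree
    is left untouched, so no fully tagged corpus has to be stored.
    """
    for turn in root.iter("turn"):
        tagged = ET.Element("turn", dict(turn.attrib))
        for w in turn.iter("w"):
            entry = lexicon.get(w.text.lower(), {})
            attrs = {c: entry[c] for c in categories if c in entry}
            ET.SubElement(tagged, "w", attrs).text = w.text
        yield tagged


root = to_xml(["A: guten Morgen", "B: hallo"])
for turn in tag_on_demand(root, LEXICON, ["pos"]):
    print(ET.tostring(turn, encoding="unicode"))
```

Passing a different `categories` list (for example `["pos", "lemma"]`) changes which tags are emitted without re-running stage 1, which is the sense in which the filter is parametrised in this sketch.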
This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future research interface. Due to the recent acquisition of new licenses, DeReKo grew by a factor of four in the first half of 2014, mostly in the area of newspaper text, and presently contains over 24 billion word tokens. Other strata, such as fictional texts, web corpora (in particular CMC texts), and spoken but conceptually written texts, have also grown significantly. We report on the newly acquired corpora that led to the major increase, on the principles and strategies behind our corpus acquisition activities, and on our solutions for the emerging legal, organisational, and technical challenges.