Enhancing speech corpus resources with multiple lexical tag layers
- We describe a general two-stage procedure for re-using a custom corpus in spoken language system development: first, a transformation from character-based markup to XML; second, a DSSSL stylesheet-driven enhancement of the XML markup with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and the lexicon (about 75k word form types).
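
The two stages and the "tagging on demand" idea can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses DSSSL stylesheets, whereas this sketch uses Python's standard `xml.etree.ElementTree` as a stand-in. The input format (whitespace-separated word forms), the element names `turn` and `w`, the layer names `pos` and `lemma`, and the toy `LEXICON` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: word form -> lexical layers (illustrative values only).
LEXICON = {
    "guten": {"pos": "ADJ", "lemma": "gut"},
    "tag": {"pos": "N", "lemma": "Tag"},
}


def to_xml(line):
    """Stage 1 (assumed input format): wrap each word form in a <w> element."""
    turn = ET.Element("turn")
    for form in line.split():
        ET.SubElement(turn, "w", {"form": form})
    return turn


def tag_on_demand(turn, layers):
    """Stage 2: attach only the requested lexical layers as child elements."""
    for w in turn.iter("w"):
        entry = LEXICON.get(w.get("form").lower(), {})
        for layer in layers:
            if layer in entry:
                ET.SubElement(w, layer).text = entry[layer]
    return turn


if __name__ == "__main__":
    # Request a single layer; other layers are never computed or stored.
    tagged = tag_on_demand(to_xml("guten Tag"), layers=["pos"])
    print(ET.tostring(tagged, encoding="unicode"))
```

Passing the layer list as a parameter is what makes this a filter rather than a batch job: a fully tagged corpus corresponds to requesting every layer once and storing the result, while the on-demand variant computes only the layers a given application asks for.
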
Author: | Andreas Witt, Harald Lüngen, Dafydd Gibbon |
---|---|
URN: | urn:nbn:de:bsz:mh39-45517 |
URL: | http://lrec-conf.org/proceedings/lrec2000/ |
Parent Title (English): | Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2000). Athens, Greece |
Publisher: | European Language Resources Association (ELRA) |
Place of publication: | Paris |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2000 |
Date of Publication (online): | 2016/01/11 |
Publication state: | Published version |
Review state: | Publisher's editorial review |
Tag: | DSSSL; Morphology; Speech Corpora; Speech Lexica; Text Technology; XML |
Page Number: | 5 |
DDC classes: | 400 Language / 410 Linguistics |
Open Access?: | yes |
Linguistics-Classification: | Corpus linguistics |