Transcription Bottleneck of Speech Corpus Exploitation

While written corpora can be exploited without any linguistic annotations, speech corpora need at least a basic transcription to be of any use for linguistic research. The basic annotation of speech data usually consists of time-aligned orthographic transcriptions. To answer phonetic or phonological research questions, phonetic transcriptions are needed as well. However, manual annotation is very time-consuming and requires considerable skill and near-native competence. Therefore, it can take years of speech corpus compilation and annotation before any analyses can be carried out. In this paper, approaches that address the transcription bottleneck of speech corpus exploitation are presented and discussed, including crowdsourcing the orthographic transcription, automatic phonetic alignment, and query-driven annotation. Currently, query-driven annotation and automatic phonetic alignment are being combined and applied in two speech research projects at the Institut für Deutsche Sprache (IDS), whereas crowdsourcing the orthographic transcription still awaits implementation.

Metadata
Author:Caren Brinckmann
URN:urn:nbn:de:bsz:mh39-68329
URL:http://webfolder.eurac.edu/EURAC/Publications/Institutes/autonomies/commul/LULCL%20II%202008_web_publication.pdf
ISBN:978-88-88906-52-2
Parent Title (English):Proceedings of the Second Colloquium on Lesser Used Languages and Computer Linguistics (LULCL II). "Combining efforts to foster computational support of minority languages"
Series (Serial Number):EURAC book (54)
Publisher:Europäische Akademie
Place of publication:Bozen
Editor:Verena Lyding
Document Type:Part of a Book
Language:English
Year of first Publication:2009
Date of Publication (online):2017/12/13
Publication state:Published version
Review state:(Publisher's) editorial review
Tag:Institut für Deutsche Sprache <Mannheim>
GND Keyword:Annotation; Gesprochene Sprache; Korpus <Linguistik>; Lautschrift
First Page:165
Last Page:179
Dewey Decimal Classification:400 Language / 430 German
BDSL-Classification:Language in the 20th century. Contemporary language
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Open Access?:Yes
Licence (German):German copyright law (UrhG) applies