Evaluating Workflows for Creating Orthographic Transcripts for Oral Corpora by Transcribing from Scratch or Correcting ASR-Output
- Research projects incorporating spoken data either select existing speech corpora or plan to record new data. In both cases, recordings need to be transcribed to make them accessible to analysis. Underestimating the effort of transcribing can be risky. Automatic Speech Recognition (ASR) holds the promise of considerably reducing transcription effort. However, few studies have so far attempted to evaluate this potential. The present paper compares the effort of manual transcription with that of correcting ASR output. We took recordings from corpora of varying settings (interview, colloquial talk, dialectal, historic) and (i) compared two methods for creating orthographic transcripts, transcribing from scratch vs. correcting automatically created transcripts, and (ii) evaluated the influence of the corpus characteristics on the correcting efficiency. Results suggest that for the selected data and transcription conventions, transcribing and correcting still take equally long, at 7 times real-time on average. The more complex the primary data, the more time has to be spent on corrections. Despite the impressive recent developments in speech technology, ASR systems seem to require further improvement, or sufficient and appropriate training data, before they can be a real help for conversation analysts or dialectologists.
| Author: | Jan Gorisch, Thomas Schmidt |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-126955 |
| URL: | https://aclanthology.org/2024.lrec-main.0/ |
| Parent Title (English): | Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) |
| Publisher: | ELRA Language Resource Association |
| Place of publication: | Paris |
| Editor: | Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue |
| Document Type: | Conference Proceeding |
| Language: | English |
| Year of first Publication: | 2024 |
| Date of Publication (online): | 2024/06/04 |
| Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
| Publication state: | Published version |
| Tag: | ASR-correction; automatic transcription; corpus curation; oral corpora; spoken German |
| GND Keyword: | Automatische Spracherkennung; Deutsch; Gesprochene Sprache; Korpus <Linguistik> |
| First Page: | 6564 |
| Last Page: | 6574 |
| DDC classes: | 400 Language / 400 Language, Linguistics |
| Open Access?: | yes |
| BDSL-Classification: | Lexicography, dictionaries |
| Leibniz-Classification: | Language, Linguistics |
| Linguistics-Classification: | Corpus linguistics |
| Program areas: | Pragmatics |
| Licence (German): | Creative Commons - CC BY-NC - Namensnennung - Nicht kommerziell 4.0 International |


