Using Automatic Speech Recognition in Spoken Corpus Curation

The newest generation of speech technology has led to a sharp increase in audio-visual data that is enhanced with orthographic transcripts, for example through automatic subtitling on online platforms. Research data centers and archives hold a range of new and historical data that are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option for making such data accessible. This paper tests the usability of a state-of-the-art ASR system on two corpora of spoken German: a historical but regionally balanced corpus from the 1960s, and a relatively new corpus from 2012 recorded in a narrow area. We observed a regional bias in the ASR system, with higher recognition scores for the north of Germany and lower scores for the south. A detailed analysis of the narrow-region data revealed, despite relatively high ASR confidence, some specific word errors caused by a lack of regional adaptation. These findings need to be considered in decisions on further data processing and corpus curation, e.g. whether to correct automatic transcripts or to transcribe from scratch. Such geography-dependent analyses can also help ASR development to select data for training and adaptation in a targeted way and to increase sensitivity towards varieties of pluricentric languages.
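The abstract does not say how the recognition scores were computed. As an illustration only, the following minimal sketch assumes word error rate (WER) as the score and shows how it could be aggregated per region to expose a geographic bias. The function names, the region labels, and the toy utterances are hypothetical and are not taken from the paper or its corpora.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_region(samples):
    """Aggregate WER per region.

    samples: iterable of (region, reference, hypothesis) strings.
    Returns {region: total word errors / total reference words}.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for region, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[region] += edit_distance(ref, hyp)
        words[region] += len(ref)
    return {region: errors[region] / words[region]
            for region in words if words[region] > 0}


if __name__ == "__main__":
    # Made-up toy data, purely to show the call shape.
    toy = [
        ("north", "das ist ein test", "das ist ein test"),
        ("south", "das ist ein test", "das is ein fest"),
    ]
    print(wer_by_region(toy))
```

Pooling errors and reference words per region before dividing weights long utterances more heavily than averaging per-utterance WER would, which is one common convention among several.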

Metadata
Author:	Jan Gorisch, Michael Gref, Thomas Schmidt
URN:	urn:nbn:de:bsz:mh39-98732
URL:	http://www.lrec-conf.org/proceedings/lrec2020/index.html#6423
ISBN:	979-10-95546-34-4
Parent Title (English):	Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France
Publisher:	European Language Resources Association
Place of publication:	Paris
Editor:	Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Document Type:	Part of a Book
Language:	English
Year of first Publication:	2020
Date of Publication (online):	2020/06/03
Publication state:	Secondary publication
Review state:	Peer-reviewed
Tag:	ASR; Ripuarian; automatic transcription; corpus curation; oral corpora; pluricentric; spoken German
GND Keyword:	Automatische Spracherkennung; Gesprochene Sprache; Korpus <Linguistik>; Plurizentrische Sprache; Sprachgeografie
First Page:	6423
Last Page:	6428
Note:	Funded by the open-access monograph fund of the Leibniz Association (Leibniz-Gemeinschaft)
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Computational linguistics
Linguistics-Classification:	Corpus linguistics
Program areas:	P2: Oral Corpora
Licence (English):	Creative Commons - Attribution-NonCommercial 4.0 International
