TY  - CHAP
U1  - Conference publication
A1  - Gorisch, Jan
A1  - Gref, Michael
A1  - Schmidt, Thomas
ED  - Calzolari, Nicoletta
ED  - Béchet, Frédéric
ED  - Blache, Philippe
ED  - Choukri, Khalid
ED  - Cieri, Christopher
ED  - Declerck, Thierry
ED  - Goggi, Sara
ED  - Isahara, Hitoshi
ED  - Maegaard, Bente
ED  - Mariani, Joseph
ED  - Mazo, Hélène
ED  - Moreno, Asuncion
ED  - Odijk, Jan
ED  - Piperidis, Stelios
T1  - Using Automatic Speech Recognition in Spoken Corpus Curation
T2  - Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France
N2  - The newest generation of speech technology caused a huge increase of audio-visual data nowadays being enhanced with orthographic transcripts such as in automatic subtitling in online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option of making that data accessible. This paper tests the usability of a state-of-the-art ASR-System on a historical (from the 1960s), but regionally balanced corpus of spoken German, and a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR-System with higher recognition scores for the north of Germany vs. lower scores for the south. A detailed analysis of the narrow region data revealed – despite relatively high ASR-confidence – some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. correcting transcripts or transcribing from scratch. Such geography-dependent analyses can also have the potential for ASR-development to make targeted data selection for training/adaptation and to increase the sensitivity towards varieties of pluricentric languages.
KW  - Automatic speech recognition
KW  - Corpus
KW  - Pluricentric language
KW  - Linguistic geography
KW  - Spoken language
KW  - oral corpora
KW  - automatic transcription
KW  - ASR
KW  - corpus curation
KW  - pluricentric
KW  - spoken German
KW  - Ripuarian
Y1  - 2020
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98351
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98351
UR  - http://www.lrec-conf.org/proceedings/lrec2020/index.html
SN  - 979-10-95546-34-4
SB  - 979-10-95546-34-4
N1  - Due to changes made to the original text by the publisher, the full text for this entry has been removed. A new entry with the updated full text is available at the following link: https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9873
SP  - 6425
EP  - 6430
PB  - European Language Resources Association
CY  - Paris
ER  - 