Refine
Document Type
- Part of a Book (2)
- Conference Proceeding (1)
Language
- English (3)
Keywords
- Plurizentrische Sprache (3) (remove)
Publicationstate
Reviewstate
- Peer-Review (2)
- (Verlags)-Lektorat (1)
Publisher
The newest generation of speech technology caused a huge increase of audio-visual data nowadays being enhanced with orthographic transcripts such as in automatic subtitling in online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option of making that data accessible. This paper tests the usability of a state-of-the-art ASR-System on a historical (from the 1960s), but regionally balanced corpus of spoken German, and a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR-System with higher recognition scores for the north of Germany vs. lower scores for the south. A detailed analysis of the narrow region data revealed – despite relatively high ASR-confidence – some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. correcting transcripts or transcribing from scratch. Such geography-dependent analyses can also have the potential for ASR-development to make targeted data selection for training/adaptation and to increase the sensitivity towards varieties of pluricentric languages.
The newest generation of speech technology caused a huge increase of audio-visual data nowadays being enhanced with orthographic transcripts such as in automatic subtitling in online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option of making that data accessible. This paper tests the usability of a state-of-the-art ASR-System on a historical (from the 1960s), but regionally balanced corpus of spoken German, and a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR-System with higher recognition scores for the north of Germany vs. lower scores for the south. A detailed analysis of the narrow region data revealed – despite relatively high ASR-confidence – some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. correcting transcripts or transcribing from scratch. Such geography-dependent analyses can also have the potential for ASR-development to make targeted data selection for training/adaptation and to increase the sensitivity towards varieties of pluricentric languages.
The contribution will focus on aspects of pluricentricity in spoken Standard German. After a brief overview over the historical and dialectal background of the linguistic diversity in the German speaking area, the regionally balanced speech-corpus "German today” is presented, which has been collected for the analysis of the (regional) variation of spoken Standard German. Aspects of pluricentric German will be discussed by means of both the distribution of certain phonetic variables and a short analysis of regional differences in the use of certain conversational constructions. It is argued that pluricentric structures are constituted by a set of linguistic features on different levels of description. Above all, the analysis tries to reveal traces of the impact of both traditional dialects and national or even subnational political units on the constitution of the standard varieties.