In this paper, we present an overview of freely available web applications providing online access to spoken language corpora. We explore and discuss various solutions with which the corpus providers and corpus platform developers address the needs of researchers who are working with spoken language. The paper aims to contribute to the long-overdue exchange and discussion of methods and best practices in the design of online access to spoken language corpora.
KonsortSWD is the NFDI consortium for the social, behavioural, educational and economic sciences. To serve the highly diverse data types and research methods in these fields, the participants are expanding an existing research data infrastructure within the NFDI and adding new integrating services. The basis is the 41 research data centres (Forschungsdatenzentren, FDZ) currently accredited by the Rat für Sozial- und Wirtschaftsdaten (German Data Forum). FDZ are specialised collections, each devoted to specific research data, e.g. from qualitative social research, and can therefore advise researchers on the basis of in-depth expertise on these data. In addition to supporting the FDZ, KonsortSWD is building new services in the areas of data production, data access and technical solutions.
We present web services implementing a workflow for transcripts of spoken language following TEI guidelines, in particular ISO 24624:2016 "Language resource management - Transcription of spoken language". The web services are available at our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.
The newest generation of speech technology has led to a sharp increase in audio-visual data that is enriched with orthographic transcripts, for example through automatic subtitling on online platforms. Research data centers and archives hold a range of new and historical data that are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option for making such data accessible. This paper tests the usability of a state-of-the-art ASR system on two corpora of spoken German: a historical but regionally balanced corpus (from the 1960s) and a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias in the ASR system, with higher recognition scores for the north of Germany and lower scores for the south. A detailed analysis of the narrow-region data revealed, despite relatively high ASR confidence, some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. whether to correct transcripts or to transcribe from scratch. Such geography-dependent analyses may also help ASR development to select data for training and adaptation in a targeted way and to increase sensitivity towards varieties of pluricentric languages.
Gesprochene Lernerkorpora: Methodisch-technische Aspekte der Erhebung, Erschließung und Nutzung [Spoken learner corpora: methodological and technical aspects of collection, indexing and use]
(2022)
This article provides an overview of methodological and technical issues that arise in the collection, indexing and use of spoken learner corpora, i.e. corpora containing spoken utterances of learners of a target language. After an introductory discussion of the most important features that distinguish this type of corpus from written learner corpora and from spoken corpora with L1 speakers, we go into more detail on questions of corpus design. The main part of the paper is an overview of the methodological and technical procedures involved in the individual steps of collecting, indexing, providing and using spoken learner corpora. The main aim of this overview is to highlight practices that can be considered best practices according to the current state of research. Finally, we outline the challenges that still exist for this type of corpus.
This presentation introduces a new collaborative project: the International Comparable Corpus (ICC) (https://korpus.cz/icc), to be compiled from European national, standard(ised) languages, using the protocols for text categories and their quantities of texts in the International Corpus of English (ICE).
Einleitung [Introduction]
(2023)