This article discusses questions concerning the creation, annotation and sharing of spoken language corpora. We use the Hamburg Map Task Corpus (HAMATAC), a small corpus in which advanced learners of German were recorded solving a map task, as an example to illustrate our main points. We first give an overview of the corpus creation and annotation process including recording, metadata documentation, transcription and semi-automatic annotation of the data. We then discuss the manual annotation of disfluencies as an example case in which many of the typical and challenging problems for data reuse – in particular the reliability of interpretative annotations – are revealed.
EXMARaLDA
(2014)
FOLK is the "Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK)" (Eng.: research and teaching corpus of spoken German). The project has set itself the aim of building a corpus of German conversations which a) covers a broad range of interaction types in private, institutional and public settings, b) is sufficiently large and diverse and of sufficient quality to support different qualitative and quantitative research approaches, c) is transcribed, annotated and made accessible according to current technological standards, and d) is available to the scientific community on a sound legal basis and without unnecessary restrictions on usage. This paper gives an overview of the corpus design, the strategies for acquisition of a diverse range of interaction data, and the corpus construction workflow from recording via transcription and annotation to dissemination.
The Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD2, http://dgd.ids-mannheim.de) is the central platform for publishing and disseminating spoken language corpora from the Archive of Spoken German (Archiv für Gesprochenes Deutsch, AGD, http://agd.ids-mannheim.de) at the Institute for the German Language in Mannheim. The corpora contained in the DGD2 come from a variety of sources, some of them in-house projects, some of them external projects. Most of the corpora were originally intended either for research into the (dialectal) variation of German or for studies in conversation analysis and related fields. The AGD has taken over the task of permanently archiving these resources and making them available for reuse to the research community. To date, the DGD2 offers access to 19 different corpora, totalling around 9000 speech events, 2500 hours of audio recordings or 8 million transcribed words. This paper gives an overview of the data made available via the DGD2, of the technical basis for its implementation, and of the most important functionalities it offers. The paper concludes with information about the users of the database and future plans for its development.