RefCo and its checker: improving language documentation corpora’s reusability through a semi-automatic review process

  • The QUEST (QUality ESTablished) project aims to ensure the reusability of audio-visual datasets (Wamprechtshammer et al., 2022) by devising quality criteria and curation processes. RefCo (Reference Corpora) is an initiative within QUEST, carried out in collaboration with DoReCo (Documentation Reference Corpus; Paschen et al., 2020), that focuses on language documentation projects. Previously, Aznar and Seifart (2020) introduced a set of quality criteria dedicated to documenting fieldwork corpora. Based on these criteria, we establish a semi-automatic review process for existing and work-in-progress corpora, in particular for language documentation. The goal is to improve the quality of a corpus by increasing its reusability. A central part of this process is a template for machine-readable corpus documentation, together with automatic data verification based on this documentation. In addition to the documentation and automatic verification, the process involves a human review and potentially results in a RefCo certification of the corpus. For each of these steps, we provide guidelines and manuals. We describe the evaluation process in detail, highlight the current limits of automatic evaluation, and explain how the manual review is organized accordingly.

Author:Herbert Lange, Jocelyn Aznar
Parent Title (English):Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). Marseille, 20-25 June 2022
Publisher:European Language Resources Association (ELRA)
Place of publication:Paris
Editor:Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Document Type:Conference Proceeding
Year of first Publication:2022
Date of Publication (online):2023/01/17
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Tag:QUEST; annotated corpora; language documentation; language resources; oral language; quality checking; reusability
GND Keyword:Annotation; Computerlinguistik; Datensatz; Dokumentation; Gesprochene Sprache; Korpus <Linguistik>; Richtlinie; Sprachdaten; Zertifizierung
First Page:2721
Last Page:2729
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Program areas:S2: Research Coordination and Infrastructures
Licence (English):Creative Commons - Attribution-NonCommercial 4.0 International