The QUEST (QUality ESTablished) project aims to ensure the reusability of audio-visual datasets (Wamprechtshammer et al., 2022) by devising quality criteria and curation processes. RefCo (Reference Corpora) is an initiative within QUEST, carried out in collaboration with DoReCo (Documentation Reference Corpus; Paschen et al., 2020), that focuses on language documentation projects. Previously, Aznar and Seifart (2020) introduced a set of quality criteria dedicated to documenting fieldwork corpora. Based on these criteria, we establish a semi-automatic review process for existing and work-in-progress corpora, in particular for language documentation. The goal is to improve the quality of a corpus by increasing its reusability. A central part of this process is a template for machine-readable corpus documentation and automatic data verification based on this documentation. In addition to the documentation and automatic verification, the process involves a human review and potentially results in a RefCo certification of the corpus. For each of these steps, we provide guidelines and manuals. We describe the evaluation process in detail, highlight the current limits of automatic evaluation, and explain how the manual review is organized accordingly.
Metadata provides important information relevant both to finding and to understanding corpus data. Meaningful linguistic data requires both reasonable annotations and documentation of these annotations. This documentation is part of the metadata of a dataset. While corpus documentation has often been provided in the form of accompanying publications, machine-readable metadata that both contains the bibliographic information and documents the corpus data has many advantages. Metadata standards allow for the development of common tools and interfaces. In this paper I add a new perspective from an archive’s point of view: I look at the metadata provided for four learner corpora and discuss the suitability of established standards for machine-readable metadata. I am aware that there is ongoing work towards metadata standards for learner corpora. However, I would like to keep the discussion going and add another point of view: increasing findability and reusability of learner corpora in an archiving context.