Metadata formats for learner corpora: case study and discussion

  • Metadata provides important information relevant to both finding and understanding corpus data. Meaningful linguistic data requires both reasonable annotations and documentation of these annotations. This documentation is part of the metadata of a dataset. While corpus documentation has often been provided in the form of accompanying publications, machine-readable metadata, which both contains the bibliographic information and documents the corpus data, has many advantages. Metadata standards allow for the development of common tools and interfaces. In this paper I want to add a new perspective from an archive’s point of view, look at the metadata provided for four learner corpora, and discuss the suitability of established standards for machine-readable metadata. I am aware that there is ongoing work towards metadata standards for learner corpora. However, I would like to keep the discussion going and add another point of view: increasing findability and reusability of learner corpora in an archiving context.

Author:Herbert Lange
Parent Title (English):Proceedings of the 11th workshop on natural language processing for computer-assisted language learning (NLP4CALL 2022)
Series (Serial Number):Linköping Electronic Conference Proceedings (190)
NEALT Proceedings Series (47)
Publisher:LiU Electronic Press
Place of publication:Linköping
Editor:David Alfter, Elena Volodina, Thomas François, Piet Desmet, Frederik Cornillie, Arne Jönsson, Evelina Rennes
Document Type:Conference Proceeding
Year of first Publication:2022
Date of Publication (online):2023/01/16
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Tag:FAIR; learner corpora; metadata; metadata standards
GND Keyword:Annotation; Archivierung; Computerlinguistik; Datensatz; Dokumentation; Korpus <Linguistik>; Metadaten
First Page:108
Last Page:113
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Program areas:S2: Research coordination and infrastructures
Licence (English):Creative Commons - Attribution 4.0 International