Unlocking the corpus: enriching metadata with state-of-the-art NLP methodology and linked data

In research data management, descriptive metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles (Wilkinson et al., 2016). Extracting semantic metadata from textual research data is currently not part of most metadata workflows, particularly when a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. Our approach is to add semantic metadata at the text level to facilitate searching the data. We show how to enrich metadata with three NLP methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or described by certain keywords, or to identify people, places, and organisations mentioned in texts without having to read them, and at the same time to facilitate the creation of task-tailored subcorpora. To enhance the usability of the data in this way, we explore options based on the German Reference Corpus DeReKo, the largest linguistically motivated collection of German language material (Kupietz & Keibel, 2009; Kupietz et al., 2010, 2018), which contains multiple newspapers, books, transcriptions, etc., and enrich its metadata at the level of subportions, i.e. newspaper articles. We received access to a number of data files in DeReKo’s native XML format, I5. To develop the methodology, we focus on a single XML file containing all issues of one newspaper from a whole year. The following sections give only an overview of our approach; we intend, however, to provide a detailed description of the experiments and the selection of data in a subsequent longer contribution.
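
As a rough illustration of article-level enrichment, the sketch below extracts keywords per article from a small XML sample. It makes several assumptions that the abstract does not confirm: the `<corpus>`/`<article id="…">` layout is a hypothetical stand-in and not the I5 schema, the sample sentences are invented, and TF-IDF is used only as a simple placeholder for whatever keyword extraction method the authors actually applied.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature corpus: NOT the real I5 schema, and the texts are invented.
SAMPLE_XML = """
<corpus>
  <article id="a1">Der Stadtrat beschließt den Haushalt. Der Haushalt sieht neue Schulen vor.</article>
  <article id="a2">Der Verein gewinnt das Spiel. Das Spiel endet nach Verlängerung.</article>
  <article id="a3">Der Stadtrat diskutiert über das Wetter.</article>
</corpus>
"""


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(xml_string, top_n=2):
    """Return the top_n TF-IDF keywords for each <article> element, keyed by its id."""
    root = ET.fromstring(xml_string)
    docs = {a.get("id"): tokenize(a.text or "") for a in root.iter("article")}
    n_docs = len(docs)

    # Document frequency: in how many articles does each token occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    keywords = {}
    for doc_id, tokens in docs.items():
        if not tokens:  # skip empty articles to avoid division by zero
            keywords[doc_id] = []
            continue
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[doc_id] = ranked[:top_n]
    return keywords
```

Calling `tfidf_keywords(SAMPLE_XML)` returns a dictionary that maps each article id to its highest-scoring terms, which could then be written into that article's metadata record. Terms that occur in every article (here the determiner "der") receive a score of zero, so they never outrank content words.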

Metadata
Author:Jennifer Ecker, Stefan Fischer, Pia Schwarz, Thorsten Trippel, Antonina Werthmann, Rebecca Wilm
URN:urn:nbn:de:bsz:mh39-128531
URL:https://www.clarin.eu/sites/default/files/CLARIN2024_ConferenceProceedings_final.pdf
ISSN:2773-2177
Parent Title (English):CLARIN Annual Conference Proceedings 2024. 15 – 17 October 2024, Barcelona, Spain
Publisher:CLARIN
Place of publication:Utrecht
Editor:Vincent Vandeghinste, Thalassia Kontino
Document Type:Part of a Book
Language:English
Year of first Publication:2024
Date of Publication (online):2024/10/10
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:German Reference Corpus (DeReKo); NLP; keyword extraction; semantic metadata; topic modeling
GND Keyword:Computerlinguistik; Datenmanagement; Deutsch; Korpus <Linguistik>; Metadaten; Named Entity Recognition; Natürliche Sprache; XML
First Page:181
Last Page:184
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Linguistics-Classification:Computational linguistics
Linguistics-Classification:Corpus linguistics
Program areas:Digital Linguistics
Licence (English):Creative Commons - Attribution 4.0 International