Unlocking the corpus. Enriching metadata with state-of-the-art NLP methodology and linked data

In research data management, metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles. Metadata in catalogues and registries are usually recorded by archivists or by subject matter experts, i.e. researchers involved in creating or assembling the data, or they are provided in the data preparation workflow. Extracting metadata from textual research data is currently not part of most metadata workflows, especially when a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. For a large newspaper corpus, the basic descriptive metadata may consist of information such as the title or year of publication. Our approach is to add semantic metadata at the text level to facilitate searching the data. We show how to enrich metadata with three methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or are described by certain keywords, or to identify people, places, and organisations mentioned in texts, without actually having to read them.
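As a rough illustration of text-level enrichment, the sketch below adds a keyword field to basic article metadata. It uses a plain TF-IDF ranking as a stand-in; the abstract does not say which keyword extraction method the authors use. The record layout, the function names `tokenize` and `enrich_with_keywords`, and the sample articles are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens longer than two characters."""
    return [t for t in re.findall(r"[a-zäöüß]+", text.lower()) if len(t) > 2]


def enrich_with_keywords(records, top_k=3):
    """Return copies of the records with a 'keywords' field ranked by TF-IDF.

    Assumption: each record is a dict with a 'text' entry next to its basic
    metadata (here 'title' and 'year'). This layout is illustrative only.
    """
    tokenized = [tokenize(r["text"]) for r in records]
    n_docs = len(records)

    # Document frequency: in how many texts does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    enriched = []
    for record, tokens in zip(records, tokenized):
        term_freq = Counter(tokens)
        total = max(len(tokens), 1)  # guard against empty texts
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        enriched.append({**record, "keywords": ranked[:top_k]})
    return enriched


# Hypothetical sample data standing in for newspaper articles.
articles = [
    {"title": "Article A", "year": 2001,
     "text": "The election results surprised the election observers."},
    {"title": "Article B", "year": 2002,
     "text": "The football match ended after the football final."},
]

for record in enrich_with_keywords(articles, top_k=1):
    print(record["title"], record["keywords"])
```

Terms that occur in every text receive a score of zero, so frequent function words drop out without a stop word list. Named entity recognition and topic modeling would add further fields to the same records in a comparable way.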

Metadata
Author:Jennifer Ecker, Stefan Fischer, Pia Schwarz, Thorsten Trippel, Antonina Werthmann, Rebecca Wilm
URN:urn:nbn:de:bsz:mh39-134414
DOI:https://doi.org/10.3384/ecp216.11
ISBN:978-91-8118-188-3
ISSN:1650-3740
Parent Title (English):Selected papers from the CLARIN Annual Conference 2024. Barcelona, Spain, 15–17 October 2024
Series (Serial Number):Linköping Electronic Conference Proceedings (216)
Publisher:Linköping University Electronic Press
Place of publication:Linköping
Editor:Vincent Vandeghinste, Thalassia Kontino
Document Type:Part of a Book
Language:English
Year of first Publication:2025
Date of Publication (online):2025/09/15
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer review
Tag:DeReKo; German Reference Corpus; NER; Named Entity Recognition; keyword extraction; knowledge base; semantic metadata enrichment; topic modeling
GND Keyword:Automatische Sprachanalyse; Computerlinguistik; Datenmanagement; Korpus <Linguistik>; Metadaten; Named-Entity-Recognition
First Page:127
Last Page:140
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Linguistics-Classification:Computational linguistics
Linguistics-Classification:Corpus linguistics
Program areas:Digital Linguistics
Licence (English):Creative Commons - Attribution 4.0 International