Unlocking the corpus. Enriching metadata with state-of-the-art NLP methodology and linked data
In research data management, metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles. Metadata in catalogues and registries are usually recorded either by archivists or by subject matter experts, i.e. researchers involved in creating or assembling the data, or they are provided in the data preparation workflow. Extracting metadata from textual research data is currently not part of most metadata workflows, all the more so when a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. For a large corpus of newspapers, the basic descriptive metadata may include, for example, the title or the year of publication. Our approach is to add semantic metadata at the text level to facilitate searching the data. We show how to enrich metadata with three methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or are described by certain keywords, and to identify people, places, and organisations mentioned in texts without having to read them.
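
As an illustration of text-level enrichment, the following minimal sketch attaches keywords to each article record. It assumes a plain TF-IDF ranking, which is a stand-in technique and not necessarily the method used in the paper; the record fields (`title`, `year`, `text`), the sample texts, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens (including German umlauts and ß).
    return re.findall(r"[a-zäöüß]+", text.lower())


def tfidf_keywords(texts, top_k=3):
    """Return the top_k TF-IDF-ranked tokens for each text (illustrative only)."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each token occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            tok: (count / len(tokens)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        result.append(ranked[:top_k])
    return result


def enrich(records, top_k=3):
    """Return copies of the records with an added 'keywords' field."""
    keywords = tfidf_keywords([r["text"] for r in records], top_k)
    return [dict(r, keywords=kw) for r, kw in zip(records, keywords)]


# Hypothetical article-level records with basic descriptive metadata only.
records = [
    {"title": "Article A", "year": 2001,
     "text": "The council approved the budget for the new bridge."},
    {"title": "Article B", "year": 2002,
     "text": "The team won the match after the late goal."},
]

enriched = enrich(records)
```

Each enriched record keeps its original descriptive fields and gains a searchable `keywords` list; tokens that occur in every text receive a score of zero and therefore rank last. A realistic pipeline would additionally need stop-word handling and lemmatisation.
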
| Author: | Jennifer Ecker, Stefan Fischer, Pia Schwarz, Thorsten Trippel, Antonina Werthmann, Rebecca Wilm |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-134414 |
| DOI: | https://doi.org/10.3384/ecp216.11 |
| ISBN: | 978-91-8118-188-3 |
| ISSN: | 1650-3740 |
| Parent Title (English): | Selected papers from the CLARIN Annual Conference 2024. Barcelona, Spain, 15–17 October 2024 |
| Series (Serial Number): | Linköping Electronic Conference Proceedings (216) |
| Publisher: | Linköping University Electronic Press |
| Place of publication: | Linköping |
| Editor: | Vincent Vandeghinste, Thalassia Kontino |
| Document Type: | Part of a Book |
| Language: | English |
| Year of first Publication: | 2025 |
| Date of Publication (online): | 2025/09/15 |
| Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
| Publication state: | Published version |
| Review state: | Peer-reviewed |
| Tag: | DeReKo; German Reference Corpus; NER; Named Entity Recognition; keyword extraction; knowledge base; semantic metadata enrichment; topic modeling |
| GND Keyword: | Automatische Sprachanalyse; Computerlinguistik; Datenmanagement; Korpus <Linguistik>; Metadaten; Named-Entity-Recognition |
| First Page: | 127 |
| Last Page: | 140 |
| DDC classes: | 400 Language / 400 Language, Linguistics |
| Open Access?: | yes |
| Linguistics-Classification: | Computational linguistics |
| Linguistics-Classification: | Corpus linguistics |
| Program areas: | Digitale Sprachwissenschaft |
| Licence (English): | Creative Commons - Attribution 4.0 International |


