
Semiautomatic data generation for academic Named Entity Recognition in German text corpora

  • An NER model is trained to recognize three types of entities in academic contexts: person, organization, and research area. Training data is generated semiautomatically from newspaper articles with the help of word lists for the individual entity types, an off-the-shelf NE recognizer, and an LLM. Experiments fine-tuning a BERT model with different strategies of post-processing the automatically generated data result in several NER models achieving overall F1 scores of up to 92.45%.
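The word-list step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the label names (`PER`, `ORG`, `RES`), the sample word-list entries, the example sentence, and the function `tag_with_word_lists` are all hypothetical, and the off-the-shelf NE recognizer and the LLM are left out entirely. The sketch only shows how exact matches against per-type word lists could be turned into BIO-tagged tokens of the kind typically used to fine-tune a BERT token classifier.

```python
from typing import Dict, List, Tuple

# Hypothetical word lists, one per entity type; the lists used in the paper
# are not reproduced here.
WORD_LISTS: Dict[str, List[str]] = {
    "ORG": ["Universität Mannheim"],
    "RES": ["Computerlinguistik"],
}


def tag_with_word_lists(
    tokens: List[str], word_lists: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Assign BIO tags to tokens by exact matching against word-list entries."""
    tags = ["O"] * len(tokens)

    # Flatten to (label, entry tokens) pairs and try longer entries first,
    # so that a multi-word entry is not blocked by a shorter overlapping one.
    entries = sorted(
        (
            (label, entry.split())
            for label, items in word_lists.items()
            for entry in items
        ),
        key=lambda pair: len(pair[1]),
        reverse=True,
    )

    for label, entry_tokens in entries:
        n = len(entry_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[start:start + n])
            if tokens[start:start + n] == entry_tokens and span_is_free:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"

    return list(zip(tokens, tags))


if __name__ == "__main__":
    sentence = "Sie forscht an der Universität Mannheim zur Computerlinguistik"
    for token, tag in tag_with_word_lists(sentence.split(), WORD_LISTS):
        print(f"{token}\t{tag}")
```

Exact matching of this kind yields noisy labels on its own, which is presumably why the abstract also mentions an NE recognizer, an LLM, and post-processing of the generated data.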

Metadata
Author:Pia Schwarz
URN:urn:nbn:de:bsz:mh39-128423
URL:https://aclanthology.org/2024.konvens-main.20
Parent Title (English):Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024). September 10-13, 2024
Publisher:Association for Computational Linguistics
Place of publication:Wien
Editor:Pedro Henrique Luz de Araujo, Andreas Baumann, Dagmar Gromann, Brigitte Krenn, Benjamin Roth, Michael Wiegand
Document Type:Conference Proceeding
Language:English
Year of first Publication:2024
Date of Publication (online):2024/10/07
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:BERT model; academic Named Entity Recognition; data generation; named entity; text corpora
GND Keyword:Computerlinguistik; Deutsch; Großes Sprachmodell; Korpus <Linguistik>; Named Entity Recognition
First Page:173
Last Page:181
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Linguistics-Classification:Computational Linguistics
Linguistics-Classification:Corpus Linguistics
Program areas:Digital Linguistics
Licence (English):Creative Commons - Attribution 4.0 International