Semiautomatic data generation for academic Named Entity Recognition in German text corpora
- An NER model is trained to recognize three types of entities in academic contexts: person, organization, and research area. Training data is generated semiautomatically from newspaper articles with the help of word lists for the individual entity types, an off-the-shelf NE recognizer, and an LLM. Experiments fine-tuning a BERT model with different strategies of post-processing the automatically generated data result in several NER models achieving overall F1 scores of up to 92.45%.
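The word-list step mentioned in the abstract can be pictured as a simple gazetteer lookup that turns tokens into BIO labels. The following is only a minimal sketch under stated assumptions, not the procedure used in the paper: the label names `PER`, `ORG`, and `RES`, the example word-list entries, and the function `bio_tag` are all hypothetical, and the off-the-shelf NE recognizer and the LLM step are left out entirely.

```python
from typing import Dict, List, Tuple

# Hypothetical word lists, one per entity type (assumed labels, invented entries).
WORD_LISTS: Dict[str, List[str]] = {
    "PER": ["Anna Müller"],
    "ORG": ["Universität Mannheim"],
    "RES": ["Computerlinguistik"],
}


def bio_tag(tokens: List[str], word_lists: Dict[str, List[str]]) -> List[str]:
    """Assign BIO tags by exact, longest-first matching against the word lists."""
    # Split each entry on whitespace, skip empty entries, and try longer entries first.
    entries: List[Tuple[List[str], str]] = sorted(
        (
            (entry.split(), label)
            for label, items in word_lists.items()
            for entry in items
            if entry.strip()
        ),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for words, label in entries:
            n = len(words)
            if tokens[i:i + n] == words:
                # First token of the match gets B-, the rest get I-.
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                break
        else:
            # No entry matched at this position; leave the token as "O".
            i += 1
    return tags


sentence = "Anna Müller forscht an der Universität Mannheim zu Computerlinguistik"
for token, tag in zip(sentence.split(), bio_tag(sentence.split(), WORD_LISTS)):
    print(f"{token}\t{tag}")
```

Exact matching of this kind misses inflected forms and ambiguous words, which is one reason a pipeline like the one described would combine it with further components and post-processing.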
Author: | Pia Schwarz |
---|---|
URN: | urn:nbn:de:bsz:mh39-128423 |
URL: | https://aclanthology.org/2024.konvens-main.20 |
Parent Title (English): | Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024). September 10-13, 2024 |
Publisher: | Association for Computational Linguistics |
Place of publication: | Vienna |
Editor: | Pedro Henrique Luz de Araujo, Andreas Baumann, Dagmar Gromann, Brigitte Krenn, Benjamin Roth, Michael Wiegand |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2024 |
Date of Publication (online): | 2024/10/07 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publication state: | Published version |
Review state: | Peer-reviewed |
Tag: | BERT model; academic Named Entity Recognition; data generation; named entity; text corpora |
GND Keyword: | Computerlinguistik; Deutsch; Großes Sprachmodell; Korpus <Linguistik>; Named Entity Recognition |
First Page: | 173 |
Last Page: | 181 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Linguistics-Classification: | Computational linguistics |
Linguistics-Classification: | Corpus linguistics |
Program areas: | Digital Linguistics |
Licence (English): | Creative Commons - Attribution 4.0 International |