Domain adaptation with linked encyclopedic data: A case study for historical German

This paper outlines a proposal for the use of knowledge graphs for historical German domain adaptation. From the EncycNet project, the encyclopedia-based knowledge graph from the early 20th century was borrowed to examine whether text-based domain adaptation using the source encyclopedia’s text or graph-based adaptation produces a better domain-specific model. To evaluate the approach, a novel historical test dataset based on a second encyclopedia of the early 20th century was created. This dataset is categorized by knowledge type (factual, linguistic, lexical) with special attention paid to distinguishing simple and expert knowledge. The main finding is that, surprisingly, simple knowledge has the most potential for improvement, whereas expert knowledge lags behind. In this study, broad signals like simple definitions and word origin yielded the best results, while more specialized knowledge such as synonyms was not as effectively represented. A follow-up study focusing on simple contemporary lexical knowledge was carried out to control for historicity and text genre; its results confirm that language models can still be enhanced by incorporating simple lexical knowledge using the proposed workflow.

Metadata
Author:Thora Hagen
URN:urn:nbn:de:bsz:mh39-130788
URL:https://ceur-ws.org/Vol-3834/paper92.pdf
ISSN:1613-0073
Parent Title (English):CHR 2024: Computational Humanities Research Conference, December 4 – 6, 2024, Aarhus, Denmark
Series (Serial Number):Proceedings of the Computational Humanities Research Conference (2024)
Publisher:CEUR Workshop Proceedings
Place of publication:Aachen
Editor:Wouter Haverals, Marijn Koolen, Laure Thompson
Document Type:Conference Proceeding
Language:English
Year of first Publication:2024
Date of Publication (online):2025/03/27
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:EncycNet; domain adaptation; encyclopedic data; encyclopedic knowledge; historical German; knowledge graphs; language models
GND Keyword:Semantics
First Page:443
Last Page:461
DDC classes:400 Language / 430 German
Open Access?:yes
BDSL-Classification:History of the German language / General
Linguistics-Classification:Text linguistics / Written language
Licence (English):Creative Commons - Attribution 4.0 International