
Type- and Token-based Word Embeddings in the Digital Humanities

  • In the general perception of the NLP community, the new dynamic, context-sensitive, token-based embeddings produced by language models such as BERT have replaced the older static, type-based embeddings such as word2vec or fastText because of their better performance. We show that this is not the case for one area of application for word embeddings: the abstract representation of the meaning of words in a corpus. This application is especially important for the Computational Humanities, for example to trace the development of words or ideas. The main contributions of our paper are: 1) a systematic comparison of dynamic and static embeddings with respect to word similarity; 2) an evaluation of methods for converting token embeddings into type embeddings; 3) new evaluation datasets for word similarity in German. The main goal of our contribution is to make an evidence-based argument that research on static embeddings, which largely stopped after 2019, should be continued: not only do static embeddings require less computing power and smaller corpora, but for this specific set of applications their performance is on par with that of dynamic embeddings.
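One of the questions the abstract raises is how to convert token embeddings into type embeddings. A common baseline for this conversion, shown here only as an illustrative sketch and not necessarily the method the paper finds best, is to average the contextual vectors of all occurrences of a word type in a corpus. The function name and the mock vectors below are hypothetical stand-ins for real BERT outputs:

```python
import numpy as np

def type_embedding(token_vectors):
    """Collapse the contextual (token) vectors of all occurrences of a
    word into a single static (type) vector by averaging them."""
    return np.mean(np.stack(token_vectors), axis=0)

# Mock contextual vectors standing in for BERT outputs: three
# occurrences of the same word type (dimension 2 for brevity;
# real BERT vectors have e.g. 768 dimensions).
occurrences = [
    np.array([1.0, 0.0]),
    np.array([0.0, 1.0]),
    np.array([1.0, 1.0]),
]

vec = type_embedding(occurrences)  # -> array([0.666..., 0.666...])
```

In a real pipeline the occurrence vectors would come from running a language model over every sentence containing the word; the averaging step itself is unchanged.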

Metadata
Author: Anton Ehrmanntraut, Thora Hagen, Leonard Konle, Fotis Jannidis
URN: urn:nbn:de:bsz:mh39-130808
URL: https://ceur-ws.org/Vol-2989/long_paper35.pdf
ISSN: 1613-0073
Parent Title (English): CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands
Series (Serial Number): Proceedings of the Computational Humanities Research Conference (2021)
Publisher: CEUR Workshop Proceedings
Place of publication: Aachen
Editor: Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, Joris van Zundert
Document Type: Conference Proceeding
Language: English
Year of first Publication: 2021
Date of Publication (online): 2025/03/28
Publishing Institution: Leibniz-Institut für Deutsche Sprache (IDS)
Publication state: Published version
Review state: Peer-reviewed
Tag: BERT; fastText; token-based word embeddings; type-based word embeddings; word embeddings
GND Keyword: Korpus <Linguistik>; Neurolinguistisches Programmieren
First Page: 16
Last Page: 38
DDC classes: 400 Language / 430 German
Open Access?: yes
Linguistics-Classification: Computational linguistics
Licence (English): Creative Commons Attribution 4.0 International