Type- and Token-based Word Embeddings in the Digital Humanities
In the general perception of the NLP community, the new dynamic, context-sensitive, token-based embeddings from language models like BERT have replaced the older static, type-based embeddings like word2vec or fastText because of their better performance. We show that this is not the case for one area of application for word embeddings: the abstract representation of the meaning of words in a corpus. This application is especially important for the Computational Humanities, for example to trace the development of words or ideas. The main contributions of our paper are: 1) We offer a systematic comparison between dynamic and static embeddings with respect to word similarity. 2) We test which method is best for converting token embeddings to type embeddings. 3) We contribute new evaluation datasets for word similarity in German. The main goal of our contribution is to make an evidence-based argument that research on static embeddings, which basically stopped after 2019, should be continued, not only because static embeddings need less computing power and smaller corpora, but also because, for this specific set of applications, their performance is on par with that of dynamic embeddings.

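
To illustrate what converting token embeddings to type embeddings can mean in practice, here is a minimal sketch that averages all context-specific vectors of a word form into one vector and then compares two words by cosine similarity. Averaging is only one common pooling strategy and is an assumption here; the abstract does not state which method the paper found best. The vectors, the example words, and the function names `pool_token_vectors` and `cosine` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from math import sqrt


def pool_token_vectors(token_vectors):
    """Average all token vectors of each word form into one type vector."""
    grouped = defaultdict(list)
    for word, vector in token_vectors:
        grouped[word].append(vector)
    return {
        word: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for word, vectors in grouped.items()
    }


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


# Made-up 3-dimensional "token embeddings"; a real model such as BERT
# would produce one high-dimensional vector per occurrence in context.
toy_tokens = [
    ("Bank", [1.0, 0.0, 0.0]),  # hypothetical occurrence in one context
    ("Bank", [0.0, 1.0, 0.0]),  # hypothetical occurrence in another context
    ("Ufer", [0.0, 1.0, 0.0]),
]

type_vectors = pool_token_vectors(toy_tokens)
similarity = cosine(type_vectors["Bank"], type_vectors["Ufer"])
```

The resulting type vectors can be scored against a word similarity dataset in the same way as static embeddings such as fastText, which is what makes the two families comparable for this task.
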
| Author: | Anton Ehrmanntraut, Thora Hagen, Leonard Konle, Fotis Jannidis |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-130808 |
| URL: | https://ceur-ws.org/Vol-2989/long_paper35.pdf |
| ISSN: | 1613-0073 |
| Parent Title (English): | CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands |
| Series (Serial Number): | Proceedings of the Computational Humanities Research Conference (2021) |
| Publisher: | CEUR Workshop Proceedings |
| Place of publication: | Aachen |
| Editor: | Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, Joris van Zundert |
| Document Type: | Conference Proceeding |
| Language: | English |
| Year of first Publication: | 2021 |
| Date of Publication (online): | 2025/03/28 |
| Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
| Publicationstate: | Published version |
| Reviewstate: | Peer-Review |
| Tag: | BERT; fastText; token-based word embeddings; type-based word embeddings; word embeddings |
| GND Keyword: | Korpus <Linguistik>; Neurolinguistisches Programmieren |
| First Page: | 16 |
| Last Page: | 38 |
| DDC classes: | 400 Language / 430 German |
| Open Access?: | yes |
| Linguistics-Classification: | Computational linguistics |
| Licence (English): | Creative Commons - Attribution 4.0 International |


