Count-based and predictive language models for exploring DeReKo

We present the use of count-based and predictive language models for exploring language use in the German Reference Corpus DeReKo. For collocation analysis along the syntagmatic axis, we employ traditional association measures based on co-occurrence counts as well as predictive association measures derived from the output weights of skip-gram word embeddings. For inspecting the semantic neighbourhood of words along the paradigmatic axis, we visualize the high-dimensional word embeddings in two dimensions using t-distributed stochastic neighbour embedding (t-SNE). Together, these visualizations provide a complementary, explorative approach to analysing very large corpora in addition to corpus querying. Moreover, we discuss count-based and predictive models with respect to scalability and maintainability in very large corpora.
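As an illustration of what a count-based association measure looks like, the following minimal sketch computes pointwise mutual information (PMI) from windowed co-occurrence counts. This is an assumption for illustration only: the abstract does not say which association measures the authors use, and the function names `cooccurrence_counts` and `pmi`, the window size, and the toy token list are all hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count (node, collocate) pairs within +/- `window` tokens of each node."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, node in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(node, tokens[j])] += 1
    return word_counts, pair_counts


def pmi(node, collocate, word_counts, pair_counts):
    """PMI = log2( p(node, collocate) / (p(node) * p(collocate)) ).

    Returns -inf when the pair was never observed in the window.
    """
    total_pairs = sum(pair_counts.values())
    total_words = sum(word_counts.values())
    joint = pair_counts[(node, collocate)] / total_pairs
    if joint == 0:
        return float("-inf")
    p_node = word_counts[node] / total_words
    p_collocate = word_counts[collocate] / total_words
    return math.log2(joint / (p_node * p_collocate))


# Toy example (hypothetical data, not from DeReKo).
tokens = ["x", "y", "x", "y", "z", "z"]
word_counts, pair_counts = cooccurrence_counts(tokens, window=1)
score_xy = pmi("x", "y", word_counts, pair_counts)
score_yz = pmi("y", "z", word_counts, pair_counts)
```

In this toy sequence, "x" and "y" alternate, so their PMI is positive, while "y" and "z" co-occur less often than their frequencies would suggest and receive a negative score. On a corpus of DeReKo's size, the counts would have to be aggregated offline rather than in memory as shown here.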

Metadata
Author:Peter Fankhauser, Marc Kupietz
URN:urn:nbn:de:bsz:mh39-111107
URL:http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.5.pdf
ISBN:979-10-95546-83-2
Parent Title (English):Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022
Publisher:European Language Resources Association (ELRA)
Place of publication:Paris
Editor:Piotr Bański, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen
Document Type:Conference Proceeding
Language:English
Year of first Publication:2022
Date of Publication (online):2022/07/01
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publication state:Published version
Review state:Peer review
Tag:Deutsches Referenzkorpus (DeReKo); German Reference Corpus (DeReKo); collocation analysis; language models; word embeddings
GND Keyword:Assoziationsmaß; Deutsch; Kollokation; Korpus <Linguistik>; Paradigma; Syntagma
First Page:27
Last Page:31
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Program areas:S1: Corpus linguistics
Program areas:S2: Research coordination and infrastructures
Licence (English):Creative Commons - Attribution-NonCommercial 4.0 International