TY  - CHAP
U1  - Konferenzveröffentlichung
A1  - Fankhauser, Peter
A1  - Kupietz, Marc
ED  - Bański, Piotr
ED  - Barbaresi, Adrien
ED  - Clematide, Simon
ED  - Kupietz, Marc
ED  - Lüngen, Harald
T1  - Count-based and predictive language models for exploring DeReKo
T2  - Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022
N2  - We present the use of count-based and predictive language models for exploring language use in the German Reference Corpus DeReKo. For collocation analysis along the syntagmatic axis we employ traditional association measures based on co-occurrence counts as well as predictive association measures derived from the output weights of skipgram word embeddings. For inspecting the semantic neighbourhood of words along the paradigmatic axis we visualize the high dimensional word embeddings in two dimensions using t-stochastic neighbourhood embeddings. Together, these visualizations provide a complementary, explorative approach to analysing very large corpora in addition to corpus querying. Moreover, we discuss count-based and predictive models w.r.t. scalability and maintainability in very large corpora.
KW  - language models
KW  - word embeddings
KW  - collocation analysis
KW  - Korpus
KW  - Deutsch
KW  - Kollokation
KW  - Syntagma
KW  - Assoziationsmaß
KW  - Paradigma
KW  - German Reference Corpus (DeReKo)
KW  - Deutsches Referenzkorpus (DeReKo)
Y1  - 2022
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111107
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111107
UR  - http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.5.pdf
SN  - 979-10-95546-83-2
SB  - 979-10-95546-83-2
SP  - 27
EP  - 31
PB  - European Language Resources Association (ELRA)
CY  - Paris
ER  -