Introducing DeReKoGram: A novel frequency dataset with lemma and part-of-speech information for German

  • We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
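
The vocabulary-growth case study described in the abstract can be illustrated with a minimal sketch. It assumes each fold is already available as a mapping from word form to frequency; the function name `vocabulary_growth` and the toy data are hypothetical and do not reflect the actual DeReKoGram file format or the authors' scripts.

```python
from collections import Counter


def vocabulary_growth(folds):
    """Return (n_folds, vocabulary_size, n_hapax) after each added fold.

    `folds` is an iterable of dicts mapping a word form to its frequency
    in that fold (an assumed, simplified representation).
    """
    cumulative = Counter()
    growth = []
    for n_folds, fold_counts in enumerate(folds, start=1):
        # Add this fold's frequencies to the running totals.
        cumulative.update(fold_counts)
        vocab_size = len(cumulative)
        # Hapax legomena: types with a cumulative frequency of exactly 1.
        n_hapax = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((n_folds, vocab_size, n_hapax))
    return growth


# Invented toy folds, for illustration only.
toy_folds = [
    {"der": 5, "Hund": 1, "läuft": 1},
    {"der": 4, "Hund": 2, "Katze": 1},
    {"der": 6, "schläft": 1},
]

for n_folds, vocab_size, n_hapax in vocabulary_growth(toy_folds):
    print(n_folds, vocab_size, n_hapax)
```

Accumulating the counts fold by fold keeps the hapax count correct: a type that is a hapax in one fold stops being one as soon as it occurs again in a later fold.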

Metadata
Author:Sascha Wolfer, Alexander Koplenig, Marc Kupietz, Carolin Müller-Spitzer
URN:urn:nbn:de:bsz:mh39-122939
DOI:https://doi.org/10.3390/data8110170
Parent Title (English):Data
Publisher:MDPI
Place of publication:Basel
Document Type:Article
Language:English
Year of first Publication:2023
Date of Publication (online):2023/11/13
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:German; corpus frequency; dataset; language; n-grams; vocabulary growth
GND Keyword:Data analysis; German; Corpus <Linguistics>; Lemma; Language data; Vocabulary
Volume:8
Issue:11
Page Number:10
Note:Funded by the journal fund of the Leibniz Association (Leibniz-Gemeinschaft)
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Program areas:L3: Lexik empirisch und digital
Program areas:S1: Korpuslinguistik
Licence (English):Creative Commons - Attribution 4.0 International