Building NLP resources for Dzongkha: A tagset and a tagged corpus

  • This paper describes the application of probabilistic part-of-speech taggers to the Dzongkha language. A tag set of 66 tags, based on the Penn Treebank, was designed, and a training corpus of 40,247 tokens was used to train the models. Two statistical taggers were compared, each using a lexicon extracted from the training corpus together with a lexicon from an available word list. The best result was 93.1% accuracy in a 10-fold cross-validation on the training set. The winning tagger was then applied to annotate a 570,247-token corpus.
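
The 10-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' setup: the abstract does not name the two statistical taggers or say how the folds were drawn, so the sketch assumes sentence-level folds and substitutes a simple most-frequent-tag baseline. The function names (`train_baseline`, `accuracy`, `cross_validate`) and the toy English tokens are hypothetical placeholders.

```python
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Build a word -> most frequent tag lexicon plus a fallback tag.

    Assumption: this baseline stands in for the paper's statistical
    taggers, which the abstract does not name.
    """
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def accuracy(model, sentences):
    """Token-level accuracy; unknown words receive the fallback tag."""
    lexicon, default_tag = model
    correct = total = 0
    for sent in sentences:
        for word, gold in sent:
            correct += lexicon.get(word, default_tag) == gold
            total += 1
    return correct / total if total else 0.0


def cross_validate(sentences, k=10):
    """Mean held-out accuracy over k sentence-level folds.

    Assumes len(sentences) >= k so that every fold is non-empty.
    """
    folds = [sentences[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [s for j, f in enumerate(folds) if j != i for s in f]
        scores.append(accuracy(train_baseline(training), held_out))
    return sum(scores) / k


if __name__ == "__main__":
    # Toy placeholder corpus (not Dzongkha data): lists of (token, tag) pairs.
    corpus = [[("a", "DT"), ("dog", "NN")] for _ in range(10)]
    print(f"mean 10-fold accuracy: {cross_validate(corpus):.3f}")
```

Each fold is held out once while the model is trained on the remaining nine, and the reported figure is the mean of the ten held-out accuracies. A real experiment would replace the baseline with the taggers under comparison and the toy corpus with the annotated training data.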

Author: Chungku Chungku, Jurmey Rabgay, Gertrud Faaß
Parent Title (German): Proceedings of the Eighth Workshop on Asian Language Resources
Publisher: Coling 2010 Organizing Committee
Place of publication: Beijing
Editor: Sarmad Hussain, Virach Sornlertlamvanich, Hammam Riza
Document Type: Conference Proceeding
Year of first Publication: 2010
Date of Publication (online): 2023/05/03
Publishing Institution: Leibniz-Institut für Deutsche Sprache (IDS)
GND Keyword: Daten; Dzongkha; Korpus <Linguistik>; Sprachverarbeitung; Text-to-Speech
First Page: 103
Last Page: 110
DDC classes: 400 Language / 400 Language, Linguistics
Open Access?: yes
Licence (English): Creative Commons - Attribution-NonCommercial-ShareAlike 3.0 Unported