Building NLP resources for Dzongkha: A tagset and a tagged corpus

This paper describes the application of probabilistic part-of-speech taggers to the Dzongkha language. A tag set of 66 tags, based on the Penn Treebank, was designed, and a corpus of 40,247 tokens was used for training. Using a lexicon extracted from the training corpus together with a lexicon from an available word list, we compared two statistical taggers. The best result was 93.1% accuracy in a 10-fold cross-validation on the training set. The better-performing tagger was then used to annotate a 570,247-token corpus.

Metadata
Author:	Chungku Chungku, Jurmey Rabgay, Gertrud Faaß
URN:	urn:nbn:de:bsz:mh39-116303
Parent Title (German):	Proceedings of the Eighth Workshop on Asian Language Resources
Publisher:	Coling 2010 Organizing Committee
Place of publication:	Beijing
Editor:	Sarmad Hussain, Virach Sornlertlamvanich, Hammam Riza
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2010
Date of Publication (online):	2023/05/03
Publishing Institution:	Leibniz-Institut für Deutsche Sprache (IDS)
Publication state:	Published version
Review state:	Peer-reviewed
GND Keyword:	Data; Dzongkha; Corpus <Linguistics>; Language processing; Text-to-Speech
First Page:	103
Last Page:	110
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Linguistics-Classification:	Corpus linguistics
Licence (English):	Creative Commons - Attribution-NonCommercial-ShareAlike 3.0 Unported
