Building NLP resources for Dzongkha: A tagset and a tagged corpus
- This paper describes the application of probabilistic part-of-speech taggers to the Dzongkha language. A tag set of 66 tags, based on the Penn Treebank tag set, is designed. A training corpus of 40,247 tokens is used to train the models. Using a lexicon extracted from the training corpus together with a lexicon from an available word list, we compared two statistical taggers. The best result was 93.1% accuracy in a 10-fold cross-validation on the training set. The winning tagger was then applied to annotate a 570,247-token corpus.
Author: | Chungku Chungku, Jurmey Rabgay, Gertrud Faaß |
---|---|
URN: | urn:nbn:de:bsz:mh39-116303 |
Parent Title (German): | Proceedings of the Eighth Workshop on Asian Language Resources |
Publisher: | Coling 2010 Organizing Committee |
Place of publication: | Beijing |
Editor: | Sarmad Hussain, Virach Sornlertlamvanich, Hammam Riza |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2010 |
Date of Publication (online): | 2023/05/03 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publicationstate: | Published version |
Reviewstate: | Peer-Review |
GND Keyword: | Daten; Dzongkha; Korpus <Linguistik>; Sprachverarbeitung; Text-to-Speech |
First Page: | 103 |
Last Page: | 110 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Linguistics-Classification: | Corpus linguistics |
Licence (English): | Creative Commons - Attribution-NonCommercial-ShareAlike 3.0 Unported |