Data point selection for self-training
- Problems for parsing morphologically rich languages are, amongst others, caused by the higher variability in structure due to less rigid word order constraints and by the higher number of different lexical forms. Both properties can result in sparse data problems for statistical parsing. We present a simple approach for addressing these issues. Our approach makes use of self-training on instances selected with regard to their similarity to the annotated data. Our similarity measure is based on the perplexity of part-of-speech trigrams of new instances measured against the annotated training data. Preliminary results show that our method outperforms a self-training setting where instances are simply selected by order of occurrence in the corpus and argue that self-training is a cheap and effective method for improving parsing accuracy for morphologically rich languages.
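
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the add-one smoothing, the `<unk>` handling, the toy tag set, and all function names (`train_trigram_model`, `perplexity`, `select_for_self_training`) are assumptions made for illustration. The sketch estimates a part-of-speech trigram model from the annotated data, scores each unannotated sentence by its perplexity under that model, and keeps the lowest-perplexity sentences as self-training candidates.

```python
import math
from collections import Counter


def trigrams(tags):
    """Return the POS trigrams of a sentence, padded with boundary symbols."""
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def train_trigram_model(tagged_sentences):
    """Count POS trigrams and their two-tag histories in the annotated data."""
    tri, hist = Counter(), Counter()
    vocab = {"</s>", "<unk>"}  # every symbol the model may predict
    for tags in tagged_sentences:
        vocab.update(tags)
        for gram in trigrams(tags):
            tri[gram] += 1
            hist[gram[:2]] += 1
    return tri, hist, vocab


def perplexity(tags, model):
    """Perplexity of one POS sequence under the model (add-one smoothing, an assumption)."""
    tri, hist, vocab = model
    # Tags never seen in the annotated data are mapped to a single unknown symbol.
    known = [t if t in vocab else "<unk>" for t in tags]
    grams = trigrams(known)
    log_prob = 0.0
    for gram in grams:
        p = (tri[gram] + 1) / (hist[gram[:2]] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / len(grams))


def select_for_self_training(candidates, model, k):
    """Keep the k candidates whose POS sequences are most similar to the annotated data."""
    return sorted(candidates, key=lambda tags: perplexity(tags, model))[:k]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    annotated = [["DET", "NOUN", "VERB"], ["DET", "ADJ", "NOUN", "VERB"]]
    unannotated = [["VERB", "VERB", "DET", "DET"], ["DET", "NOUN", "VERB"]]
    model = train_trigram_model(annotated)
    for tags in select_for_self_training(unannotated, model, k=1):
        print(tags, round(perplexity(tags, model), 2))
```

In a full self-training loop, the selected sentences would be parsed automatically and the parser output added to the training data. The baseline mentioned in the abstract corresponds to replacing the sort with plain corpus order.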
Author: | Ines Rehbein |
---|---|
URN: | urn:nbn:de:bsz:mh39-56043 |
URL: | https://aclanthology.info/pdf/W/W11/W11-3800.pdf |
ISBN: | 978-1-932432-73-2 |
Parent Title (English): | Proceedings of SPMRL 2011. The Second Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2011). October 6, 2011. Dublin, Ireland |
Publisher: | Association for Computational Linguistics |
Place of publication: | Stroudsburg, PA |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2011 |
Date of Publication (online): | 2016/11/21 |
GND Keyword: | Automatische Sprachanalyse; Satzanalyse |
First Page: | 62 |
Last Page: | 67 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Linguistics-Classification: | Computational linguistics |
Licence (English): | Creative Commons - Attribution-NonCommercial-ShareAlike 3.0 Unported |