Why is it so difficult to compare treebanks? TIGER and TüBa-D/Z revisited
This paper is a contribution to the ongoing discussion on treebank annotation schemes and their impact on PCFG parsing results. We provide a thorough comparison of two German treebanks: the TIGER treebank and the TüBa-D/Z. We use simple statistics on sentence length and vocabulary size, and more refined methods such as perplexity and its correlation with PCFG parsing results, as well as a Principal Components Analysis. Finally, we present a qualitative evaluation of a set of 100 sentences from the TüBa-D/Z, manually annotated in the TIGER as well as in the TüBa-D/Z annotation scheme, and show that even the existence of a parallel subcorpus does not support a straightforward and easy comparison of the two annotation schemes.
Author: | Ines Rehbein, Josef van Genabith |
---|---|
URN: | urn:nbn:de:bsz:mh39-57822 |
URL: | http://doras.dcu.ie/15264/ |
ISSN: | 1736-6305 |
Parent Title (English): | The Sixth International Workshop on Treebanks and Linguistic Theories (TLT ’07). Bergen, Norway. December 7–8, 2007 |
Series (Serial Number): | NEALT Proceedings Series (1) |
Publisher: | Northern European Association for Language Technology |
Place of publication: | Tartu |
Editor: | Koenraad De Smedt, Jan Hajič, Sandra Kübler |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2007 |
Date of Publication (online): | 2017/01/13 |
Publication state: | Published version |
Tag: | treebanks |
GND Keyword: | Annotation; Korpus <Linguistik>; Syntaktische Analyse |
First Page: | 115 |
Last Page: | 126 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
BDSL-Classification: | Grammar |
Linguistics-Classification: | Computational linguistics |
Linguistics-Classification: | Corpus linguistics |