Korpuslinguistik
Refine
Document Type
Language
- English (2)
Has Fulltext
- yes (2)
Keywords
- Automatische Sprachanalyse (2) (remove)
Publicationstate
Reviewstate
- (Verlags)-Lektorat (1)
- Peer-Review (1)
Publisher
- Charles University (1)
- Eigenverlag ÖGAI (1)
This paper presents an extension to the Stuttgart-Tübingen TagSet, the standard part-of-speech tag set for German, for the annotation of spoken language. The additional tags deal with hesitations, backchannel signals, interruptions, onomatopoeia and uninterpretable material. They allow one to capture phenomena specific to spoken language while, at the same time, preserving inter-operability with already existing corpora of written language.
In the NLP literature, adapting a parser to new text with properties different from the training data is commonly referred to as domain adaptation. In practice, however, the differences between texts from different sources often reflect a mixture of domain and genre properties, and it is by no means clear what impact each of those has on statistical parsing. In this paper, we investigate how differences between articles in a newspaper corpus relate to the concepts of genre and domain and how they influence parsing performance of a transition-based dependency parser. We do this by applying various similarity measures for data point selection and testing their adequacy for creating genre-aware parsing models.