Refine
Document Type
- Conference Proceeding (2)
- Article (1)
- Working Paper (1)
Has Fulltext
- yes (4)
Keywords
- Part-of-Speech-Tagging = POS (4) (remove)
Publicationstate
Reviewstate
Die Guidelines sind eine Erweiterung des STTS (Schiller et al. 1999) für die Annotation von Transkripten gesprochener Sprache. Dieses Tagset basiert auf der Annotation des FOLK-Korpus des IDS Mannheim (Schmidt 2014) und es wurde gegenüber dem STTS erweitert in Hinblick auf typisch gesprochensprachliche Phänomene bzw. Eigenheiten der Transkription derselben. Es entstand im Rahmen des Dissertationsprojekts „POS für(s) FOLK – Entwicklung eines automatisierten Part-of-Speech-Tagging von spontansprachlichen Daten“ (Westpfahl 2017 (i.V.)).
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.
Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. In order to capture the features of German spoken language, a distinct tagset is needed to respond to the kinds of elements which only occur in speech. In order to create such a coherent tagset the most prominent phenomena of spoken language need to be analyzed, especially with respect to how they differ from written language. First evaluations have shown that the most prominent cause (over 50%) of errors in the existing automatized POS-tagging of transcripts of spoken German with the Stuttgart Tübingen Tagset (STTS) and the treetagger was the inaccurate interpretation of speech particles. One reason for this is that this class of words is virtually absent from the current STTS. This paper proposes a recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics. The ultimate aim is to create a comprehensive reference corpus of spoken German data for the global research community. It is imperative that all phenomena are reliably recorded in future part-of-speech tag labels.