A harmonised testsuite for POS tagging of German social media data

We present a testsuite for POS tagging of German web data. The testsuite provides the original raw text together with gold tokenisations and part-of-speech annotations. It includes a new dataset of German tweets, currently 3,940 tokens in size. To increase the amount of data, we harmonised the annotations of existing web corpora, based on the Stuttgart-Tübingen Tag Set. The current version of the corpus comprises 48,344 tokens of web data, around half of it from Twitter. We also present experiments showing how different experimental setups (training set size, additional out-of-domain training data, self-training) influence tagger accuracy. All resources and models will be made publicly available to the research community.

Metadata
Author:	Ines Rehbein, Josef Ruppenhofer, Victor Zimmermann
URN:	urn:nbn:de:bsz:mh39-79318
URL:	https://konvens.org/proceedings/2018/PDF/konvens18_00.pdf
Parent Title (English):	Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018). September 19-21, 2018 Vienna, Austria
Publisher:	Austrian Academy of Sciences
Place of publication:	Vienna, Austria
Editor:	Adrien Barbaresi, Hanno Biber, Friedrich Neubarth, Rainer Osswald
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2018
Date of Publication (online):	2018/09/20
Publication state:	Published version
Review state:	Peer review
Tag:	Part-of-Speech-Tagging
GND Keyword:	Deutsch; Korpus <Linguistik>; Soziale Software
First Page:	18
Last Page:	28
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Corpus linguistics
Program areas:	Pragmatics
Program areas:	Digital Linguistics
Licence:	Protected by copyright
