Internet Corpora: A Challenge for Linguistic Processing

Natural language processing tools are mostly developed for and optimized on newspaper texts, and often show a substantial performance drop when applied to other types of text such as Twitter feeds, chat data or Internet forum posts. We explore a range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts. Our results show that these methods can improve tagger performance substantially.

Metadata
Author:	Andrea Horbach, Stefan Thater, Diana Steffen, Peter M. Fischer, Andreas Witt, Manfred Pinkal
URN:	urn:nbn:de:bsz:mh39-43565
DOI:	https://doi.org/10.1007/s13222-014-0172-z
ISSN:	1618-2162
Parent Title (German):	Datenbank-Spektrum
Document Type:	Article
Language:	English
Year of first Publication:	2015
Date of Publication (online):	2015/11/11
Review State:	Peer-reviewed
Tag:	Computer-mediated communication; Natural language processing; Part-of-speech tagging
GND Keyword:	Automatische Sprachanalyse; Internet; Korpus <Linguistik>; Natürliche Sprache
Volume:	15
Issue:	1
First Page:	41
Last Page:	47
Note:	For copyright reasons, this article is not freely accessible.
DDC classes:	400 Language / 410 Linguistics
Open Access?:	no
Licence:	Protected by copyright
