There’s no Data like More Data? Revisiting the Impact of Data Size on a Classification Task

In this paper we investigate the impact of data size on a Word Sense Disambiguation (WSD) task. We question the assumption that the knowledge acquisition bottleneck, known as one of the major challenges for WSD, can be solved simply by obtaining more and more training data. Our case study on 1,000 manually annotated instances of the German verb drohen (threaten) shows that the best performance is obtained not by training on the full data set, but by carefully selecting new training instances according to their informativeness for the learning process (Active Learning). We present a thorough evaluation of the impact of different sampling methods on the data sets and propose an improved method for uncertainty sampling that dynamically adapts the selection of new instances to the learning progress of the classifier, yielding more robust results during the initial stages of learning. A qualitative error analysis identifies problems for automatic WSD and discusses the reasons for the large gap in performance between human annotators and our automatic WSD system.
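
To make the idea of uncertainty sampling concrete, the following is a minimal sketch of the generic, entropy-based variant: from a pool of unlabeled instances, it picks those whose predicted sense distribution is closest to uniform. This is not the dynamically adapted method proposed in the paper, whose details are not given in this abstract. The function names, the batch size, and the probability values are all hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs, batch_size=1):
    """Return indices of the pool instances with the highest predictive entropy.

    pool_probs: one list of class probabilities per unlabeled instance.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs: four unlabeled instances, three senses.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # almost uniform, highest entropy
]

picked = select_most_uncertain(pool_probs, batch_size=2)
print(picked)  # [3, 1]
```

In an Active Learning loop, the selected instances would be passed to a human annotator, added to the training set, and the classifier retrained before the next selection round.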

Metadata
Author:	Ines Rehbein, Josef Ruppenhofer
URN:	urn:nbn:de:bsz:mh39-52935
URL:	http://www.coli.uni-saarland.de/projects/salsa/page.php?id=publications
Parent Title (English):	The seventh international conference on Language Resources and Evaluation (LREC). May 19-21, 2010, Valletta, Malta.
Publisher:	European Language Resources Association
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2010
Date of Publication (online):	2016/09/22
Publication state:	Published version
Review state:	Peer-reviewed
GND Keyword:	Annotation; Automatic language analysis; Document processing
First Page:	1206
Last Page:	1213
DDC classes:	400 Language / 410 Linguistics
Open Access?:	yes
Linguistics-Classification:	Computational linguistics
Licence (German):	Protected by copyright
