There’s no Data like More Data? Revisiting the Impact of Data Size on a Classification Task

In this paper we investigate the impact of data size on a Word Sense Disambiguation (WSD) task. We question the assumption that the knowledge acquisition bottleneck, known as one of the major challenges for WSD, can be solved simply by obtaining more and more training data. Our case study on 1,000 manually annotated instances of the German verb drohen (threaten) shows that the best performance is obtained not by training on the full data set, but by carefully selecting new training instances according to their informativeness for the learning process (Active Learning). We present a thorough evaluation of the impact of different sampling methods on the data sets and propose an improved method for uncertainty sampling that dynamically adapts the selection of new instances to the learning progress of the classifier, yielding more robust results during the initial stages of learning. A qualitative error analysis identifies problems for automatic WSD and discusses the reasons for the large gap in performance between human annotators and our automatic WSD system.
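
To make the term concrete, the following is a minimal sketch of plain entropy-based uncertainty sampling, the general Active Learning strategy the abstract refers to. It is not the authors' improved, dynamically adapting method, which this record does not describe in detail. The function names and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of one predicted sense distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, batch_size=1):
    """Return indices of the unlabeled instances with the highest entropy.

    pool_probs holds one probability distribution over word senses per
    unlabeled instance, as produced by the current classifier.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs for four unlabeled instances (illustrative only).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, little to learn
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # close to uniform, most uncertain
]

# Pick the two instances to hand to a human annotator next.
chosen = select_most_uncertain(pool_probs, batch_size=2)
print(chosen)
```

In an Active Learning loop, the selected instances would be annotated, added to the training set, and the classifier retrained before the next selection round.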

Metadata
Author: Ines Rehbein, Josef Ruppenhofer
URN: urn:nbn:de:bsz:mh39-52935
URL: http://www.coli.uni-saarland.de/projects/salsa/page.php?id=publications
Parent Title (English): The seventh international conference on Language Resources and Evaluation (LREC). May 19-21, 2010, Valletta, Malta.
Publisher: European Language Resources Association
Document Type: Conference Proceeding
Language: English
Year of first Publication: 2010
Date of Publication (online): 2016/09/22
Publication state: Published version
Review state: Peer-Review
GND Keyword: Annotation; Automatic language analysis; Document processing
First Page: 1206
Last Page: 1213
Dewey Decimal Classification: 400 Language / 410 Linguistics
Linguistics-Classification: Computational linguistics
Open Access?: Yes
Licence (German): German copyright law (UrhG) applies