Detection of abusive language: the problem of biased datasets

  • We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.
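
The abstract attributes the score drop to lexical bias introduced by focused sampling. As a rough, self-contained illustration of how such bias can be surfaced, the sketch below ranks tokens by pointwise mutual information (PMI) with the abusive class, so that sampling-specific topic words rise to the top even when they are not abusive themselves. This is a minimal sketch under assumed inputs (a list of (text, label) pairs); the data and all names are hypothetical, and it is not necessarily the authors' exact method.

    import math
    from collections import Counter

    def pmi_with_label(samples, target_label="abusive", min_count=5):
        """Rank tokens by PMI with target_label over (text, label) pairs."""
        doc_freq = Counter()    # number of texts containing each token
        label_freq = Counter()  # same count, restricted to target_label texts
        n_total = 0
        n_label = 0
        for text, label in samples:
            n_total += 1
            hit = (label == target_label)
            n_label += hit
            for token in set(text.lower().split()):
                doc_freq[token] += 1
                label_freq[token] += hit
        p_label = n_label / n_total
        scores = []
        for token, freq in doc_freq.items():
            if freq < min_count or label_freq[token] == 0:
                continue
            p_label_given_token = label_freq[token] / freq
            # PMI(token, label) = log2( P(label | token) / P(label) )
            scores.append((token, math.log2(p_label_given_token / p_label)))
        return sorted(scores, key=lambda item: -item[1])

    if __name__ == "__main__":
        # Toy data mimicking a focus-sampled dataset: "football" co-occurs
        # with abuse purely because of how the samples were collected.
        samples = [
            ("you football fans are all idiots", "abusive"),
            ("typical football fan, utterly clueless", "abusive"),
            ("the football final was a great match", "neutral"),
            ("lovely weather for a walk today", "neutral"),
        ]
        for token, score in pmi_with_label(samples, min_count=2)[:5]:
            print(f"{token}\t{score:+.2f}")

On the toy data, the innocuous topic word "football" gets a positive PMI with the abusive class, which is exactly the kind of dataset artifact that inflates in-dataset classification scores.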

Metadata
Author: Michael Wiegand, Josef Ruppenhofer, Thomas Kleinbauer
URN: urn:nbn:de:bsz:mh39-90165
URL: https://www.aclweb.org/anthology/N19-1060
ISBN: 978-1-950737-13-0
Parent Title (English): The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Proceedings of the Conference Vol. 1. Minneapolis, Minnesota, June 2 - June 7, 2019
Publisher: The Association for Computational Linguistics
Place of publication: Stroudsburg, PA, USA
Editor: Jill Burstein, Christy Doran, Thamar Solorio
Document Type: Conference Proceeding
Language: English
Year of first Publication: 2019
Date of Publication (online): 2019/07/03
Publication state: Published version
Review state: Peer-reviewed
GND Keyword: Automatische Sprachanalyse; Beleidigung; Schimpfwort; Verbalaggression
First Page: 602
Last Page: 608
DDC classes: 400 Language / 400 Language, Linguistics / 400 Language
Open Access?: yes
Leibniz-Classification: Language, Linguistics
Linguistics-Classification: Computational linguistics
Program areas: Pragmatics
Program areas: Digital linguistics
Licence: Creative Commons - CC BY - Attribution 4.0 International