Detection of abusive language: the problem of biased datasets
We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets created by focused sampling rather than random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.
Author: Michael Wiegand, Josef Ruppenhofer, Thomas Kleinbauer
URN: urn:nbn:de:bsz:mh39-90165
URL: https://www.aclweb.org/anthology/N19-1060
ISBN: 978-1-950737-13-0
Parent Title (English): The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Proceedings of the Conference, Vol. 1. Minneapolis, Minnesota, June 2–7, 2019
Publisher: The Association for Computational Linguistics
Place of publication: Stroudsburg, PA, USA
Editor: Jill Burstein, Christy Doran, Thamar Solorio
Document Type: Conference Proceeding
Language: English
Year of first Publication: 2019
Date of Publication (online): 2019/07/03
Publication state: Published version (Veröffentlichungsversion)
Review state: Peer-reviewed
GND Keywords: Automatische Sprachanalyse; Beleidigung; Schimpfwort; Verbalaggression (automatic language analysis; insult; swearword; verbal aggression)
First Page: 602
Last Page: 608
DDC classes: 400 Language / Linguistics
Open Access?: yes
Leibniz Classification: Language, Linguistics
Linguistics Classification: Computational linguistics