Removing spam from web corpora through supervised learning using FastText

Unlike traditional text corpora collected from trustworthy sources, web-based corpora have to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier recognized 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.

Metadata
Author:	Vít Suchomel
URN:	urn:nbn:de:bsz:mh39-62674
Parent Title (English):	Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017
Publisher:	Institut für Deutsche Sprache
Place of publication:	Mannheim
Editor:	Piotr Bański, Marc Kupietz, Harald Lüngen, Paul Rayson, Hanno Biber, Evelyn Breiteneder, Simon Clematide, John Mariani, Mark Stevenson, Theresa Sick
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2017
Date of Publication (online):	2017/07/06
Publication state:	Published version
Review state:	Peer-reviewed
Tag:	Corpus linguistics; Corpus technology; Web corpora; Web spam
GND Keyword:	Datenmanagement; Internet; Korpus <Linguistik>; Texttechnologie
Page Number:	5
First Page:	56
Last Page:	60
DDC classes:	400 Language
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Corpus linguistics
Conferences, Workshops:	CMLC-5 + BigNLP / 5th Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing
Licence (German):	Creative Commons - Namensnennung-Nicht kommerziell-Keine Bearbeitung 3.0 Deutschland
