TY  - CHAP
U1  - Conference publication
A1  - Suchomel, Vít
ED  - Bański, Piotr
ED  - Kupietz, Marc
ED  - Lüngen, Harald
ED  - Rayson, Paul
ED  - Biber, Hanno
ED  - Breiteneder, Evelyn
ED  - Clematide, Simon
ED  - Mariani, John
ED  - Stevenson, Mark
ED  - Sick, Theresa
T1  - Removing spam from web corpora through supervised learning using FastText
T2  - Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017
N2  - Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognize 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.
KW  - Corpus
KW  - Internet
KW  - Text technology
KW  - Data management
KW  - Corpus linguistics
KW  - Corpus technology
KW  - Web corpora
KW  - Web spam
Y1  - 2017
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-62674
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-62674
SP  - 56
EP  - 60
S1  - 5
PB  - Institut für Deutsche Sprache
CY  - Mannheim
ER  - 