TY  - CHAP
U1  - Konferenzveröffentlichung
A1  - Benko, Vladimír
ED  - Bański, Piotr
ED  - Barbaresi, Adrien
ED  - Biber, Hanno
ED  - Breiteneder, Evelyn
ED  - Clematide, Simon
ED  - Kupietz, Marc
ED  - Lüngen, Harald
ED  - Iliadi, Caroline
T1  - Deduplication in large web corpora
T2  - Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
N2  - Our paper tries to find answers to some questions related to deduplication process in large-scale web-crawled corpora. An experiment based on eight corpora from the Aranea family is introduced, and first results are presented.
KW  - corpus linguistics
KW  - deduplication
KW  - corpus processing
KW  - web corpora
KW  - Korpus
Y1  - 2019
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-90221
U6  - https://doi.org/10.14618/ids-pub-9022
DO  - https://doi.org/10.14618/ids-pub-9022
SP  - 17
EP  - 21
PB  - Leibniz-Institut für Deutsche Sprache
CY  - Mannheim
ER  -