Deduplication in large web corpora

This paper addresses several questions about the deduplication process in large-scale web-crawled corpora. It introduces an experiment based on eight corpora from the Aranea family and presents first results.

Metadata
Author:	Vladimír Benko
URN:	urn:nbn:de:bsz:mh39-90221
DOI:	https://doi.org/10.14618/ids-pub-9022
Parent Title (English):	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Publisher:	Leibniz-Institut für Deutsche Sprache
Place of publication:	Mannheim
Editor:	Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2019
Date of Publication (online):	2019/07/04
Publication state:	Published version
Review state:	Peer review
Tag:	corpus linguistics; corpus processing; deduplication; web corpora
GND Keyword:	Korpus <Linguistik>
First Page:	17
Last Page:	21
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Corpus linguistics
Conferences, Workshops:	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Licence:	Creative Commons - CC BY - Attribution 4.0 International
