Deduplication in large web corpora

  • Our paper addresses several questions related to the deduplication process in large-scale web-crawled corpora. An experiment based on eight corpora from the Aranea family is introduced, and first results are presented.

Metadata
Author:Vladimír Benko
URN:urn:nbn:de:bsz:mh39-90221
DOI:https://doi.org/10.14618/ids-pub-9022
Parent Title (English):Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Publisher:Leibniz-Institut für Deutsche Sprache
Place of publication:Mannheim
Editor:Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi
Document Type:Conference Proceeding
Language:English
Year of first Publication:2019
Date of Publication (online):2019/07/04
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:corpus linguistics; corpus processing; deduplication; web corpora
GND Keyword:Korpus <Linguistik>
First Page:17
Last Page:21
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Conferences, Workshops:Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Licence:Creative Commons - CC BY - Attribution 4.0 International