Deduplication in large web corpora

  • This paper addresses several questions related to the deduplication process in large-scale web-crawled corpora. An experiment based on eight corpora from the Aranea family is introduced, and first results are presented.

Metadata
Author:Vladimír Benko
URN:urn:nbn:de:bsz:mh39-90221
DOI:https://doi.org/10.14618/ids-pub-9022
Parent Title (English):Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Publisher:Leibniz-Institut für Deutsche Sprache
Place of publication:Mannheim
Editor:Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi
Document Type:Conference Proceeding
Language:English
Year of first Publication:2019
Date of Publication (online):2019/07/04
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:corpus linguistics; corpus processing; deduplication; web corpora
GND Keyword:Korpus <Linguistik>
First Page:17
Last Page:21
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Conferences, Workshops:Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Licence (German):Creative Commons - CC BY - Attribution 4.0 International