Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures

  • Common Crawl is a very large, heterogeneous multilingual corpus composed of documents crawled from the internet. It exceeds 20TB of data and is distributed as a set of more than 50 thousand plain text files, each containing many documents written in a wide variety of languages. Although each document has an associated metadata block, that metadata carries no information about the language in which the document is written, which makes Common Crawl extremely difficult to use for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean Common Crawl and classify it by language; we specifically design it to run efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can easily be reapplied to any kind of heterogeneous corpus and parameterised for a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl that is filtered, classified by language, shuffled at line level to avoid copyright issues, and ready to be used for NLP applications.
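
As a rough illustration of the kind of multithreaded, queue-based classification the abstract describes, the following Python sketch reads several shards concurrently and routes each line to a per-language bucket. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `detect_language`, `classify_corpus`, and `STOPWORDS` are hypothetical, the toy stopword vote stands in for whatever language identifier the pipeline actually uses, and shards are in-memory iterables rather than files on disk.

```python
import queue
import threading
from collections import defaultdict

# Hypothetical stand-in for a real language identifier. The abstract does not
# name the classifier, so this toy stopword vote is purely an assumption.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "fr": {"le", "la", "et", "est"},
    "de": {"der", "die", "und", "ist"},
}


def detect_language(line):
    """Return the language whose stopwords match most words, or None."""
    words = set(line.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def classify_corpus(shards, workers=4, queue_size=1000):
    """Classify lines from several shards by language using worker threads."""
    # Bounded queue: readers block when it is full, which caps memory use.
    lines = queue.Queue(maxsize=queue_size)
    results = defaultdict(list)
    lock = threading.Lock()

    def reader(shard):
        # One reader per shard; empty lines are filtered out here.
        for line in shard:
            line = line.strip()
            if line:
                lines.put(line)

    def classifier():
        # Consume lines until a None sentinel arrives.
        while True:
            line = lines.get()
            if line is None:
                break
            lang = detect_language(line)
            if lang is not None:  # unclassifiable lines are dropped
                with lock:
                    results[lang].append(line)

    classifiers = [threading.Thread(target=classifier) for _ in range(workers)]
    readers = [threading.Thread(target=reader, args=(s,)) for s in shards]
    for t in classifiers + readers:
        t.start()
    for t in readers:
        t.join()
    for _ in classifiers:  # one sentinel per classifier thread
        lines.put(None)
    for t in classifiers:
        t.join()
    return dict(results)
```

The `workers` and `queue_size` parameters are the sort of knobs that could be tuned to a given machine; in an I/O-bound setting the number of concurrent readers would matter at least as much as the number of classifier threads.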

Metadata
Author:Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
URN:urn:nbn:de:bsz:mh39-90215
DOI:https://doi.org/10.14618/ids-pub-9021
Parent Title (English):Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Publisher:Leibniz-Institut für Deutsche Sprache
Place of publication:Mannheim
Editor:Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi
Document Type:Conference Proceeding
Language:English
Year of first Publication:2019
Date of Publication (online):2019/07/04
Publication state:Published version
Review state:Peer review
Tag:corpus infrastructures; corpus linguistics; corpus management; corpus processing
GND Keyword:Korpus <Linguistik>
First Page:9
Last Page:16
Dewey Decimal Classification:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Corpus linguistics
Conferences, Workshops:Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Licence:Creative Commons - CC BY - Attribution 4.0 International