TY  - CHAP
U1  - Conference publication
A1  - Ortiz Suárez, Pedro Javier
A1  - Sagot, Benoît
A1  - Romary, Laurent
ED  - Bański, Piotr
ED  - Barbaresi, Adrien
ED  - Biber, Hanno
ED  - Breiteneder, Evelyn
ED  - Clematide, Simon
ED  - Kupietz, Marc
ED  - Lüngen, Harald
ED  - Iliadi, Caroline
T1  - Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures
T2  - Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
N2  - Common Crawl is a considerably large, heterogeneous multilingual corpus composed of documents crawled from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this metadata lacks any information about the language in which the document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised for a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
KW  - corpus linguistics
KW  - corpus infrastructures
KW  - corpus processing
KW  - corpus management
KW  - Korpus <Linguistik>
Y1  - 2019
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-90215
U6  - https://doi.org/10.14618/ids-pub-9021
DO  - https://doi.org/10.14618/ids-pub-9021
SP  - 9
EP  - 16
PB  - Leibniz-Institut für Deutsche Sprache
CY  - Mannheim
ER  -