OPUS 4 | Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019

Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019

2 search hits

1 to 2

Sort by

Year
Year
Title
Title
Author
Author

Modelling large parallel corpora. The Zurich Parallel Corpus Collection (2019)

Graën, Johannes ; Kew, Tannon ; Shaitarova, Anastassia ; Volk, Martin

Text corpora come in many different shapes and sizes and carry heterogeneous annotations, depending on their purpose and design. The true benefit of corpora is rooted in their annotation and the method by which this data is encoded is an important factor in their interoperability. We have accumulated a large collection of multilingual and parallel corpora and encoded it in a unified format which is compatible with a broad range of NLP tools and corpus linguistic applications. In this paper, we present our corpus collection and describe a data model and the extensions to the popular CoNLL-U format that enable us to encode it.

Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22 July 2019 (2019)

Contents: 1. Johannes Graën, Tannon Kew, Anastassia Shaitarova and Martin Volk, "Modelling Large Parallel Corpora", S. 1-8 2. Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary, "Asynchronous Pipelines for Processing Huge Corpora on Medium to Low Resource Infrastructures", S. 9-16 3. Vladimír Benko, "Deduplication in Large Web Corpora", S. 17-22 4. Mark Davies, "The best of both worlds: Multi-billion word “dynamic” corpora", S. 23-28 5. Adrien Barbaresi, "On the need for domain-focused web corpora", S. 29-32 6. Marc Kupietz, Eliza Margaretha, Nils Diewald, Harald Lüngen and Peter Fankhauser, "What's New in EuReCo? Interoperability, Comparable Corpora, Licensing", S. 33-39

1 to 2

Open Access

Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Publicationstate

Reviewstate

Publisher

2 search hits