TY  - CHAP
U1  - Conference publication
A1  - Graën, Johannes
A1  - Kew, Tannon
A1  - Shaitarova, Anastassia
A1  - Volk, Martin
ED  - Bański, Piotr
ED  - Barbaresi, Adrien
ED  - Biber, Hanno
ED  - Breiteneder, Evelyn
ED  - Clematide, Simon
ED  - Kupietz, Marc
ED  - Lüngen, Harald
ED  - Iliadi, Caroline
T1  - Modelling large parallel corpora. The Zurich Parallel Corpus Collection
T2  - Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
N2  - Text corpora come in many different shapes and sizes and carry heterogeneous annotations, depending on their purpose and design. The true benefit of corpora is rooted in their annotation, and the method by which this data is encoded is an important factor in their interoperability. We have accumulated a large collection of multilingual and parallel corpora and encoded it in a unified format that is compatible with a broad range of NLP tools and corpus linguistic applications. In this paper, we present our corpus collection and describe a data model and the extensions to the popular CoNLL-U format that enable us to encode it.
KW  - corpus linguistics
KW  - parallel corpora
KW  - corpus management
KW  - Korpus <Linguistik>
Y1  - 2019
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-90207
U6  - https://doi.org/10.14618/ids-pub-9020
DO  - https://doi.org/10.14618/ids-pub-9020
SP  - 1
EP  - 8
PB  - Leibniz-Institut für Deutsche Sprache
CY  - Mannheim
ER  -