Modelling large parallel corpora. The Zurich Parallel Corpus Collection

Text corpora come in many different shapes and sizes and carry heterogeneous annotations, depending on their purpose and design. The true benefit of corpora is rooted in their annotation, and the method by which this data is encoded is an important factor in their interoperability. We have accumulated a large collection of multilingual and parallel corpora and encoded it in a unified format that is compatible with a broad range of NLP tools and corpus linguistic applications. In this paper, we present our corpus collection and describe a data model and the extensions to the popular CoNLL-U format that enable us to encode it.

Metadata
Authors:	Johannes Graën, Tannon Kew, Anastassia Shaitarova, Martin Volk
URN:	urn:nbn:de:bsz:mh39-90207
DOI:	https://doi.org/10.14618/ids-pub-9020
Parent Title (English):	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Publisher:	Leibniz-Institut für Deutsche Sprache
Place of publication:	Mannheim
Editor:	Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2019
Date of Publication (online):	2019/07/04
Publication state:	Published version
Review state:	Peer-reviewed
Tag:	corpus linguistics; corpus management; parallel corpora
GND Keyword:	Korpus <Linguistik>
First Page:	1
Last Page:	8
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Corpus linguistics
Conferences, Workshops:	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Licence:	Creative Commons - CC BY - Attribution 4.0 International
