Cleaning the Europarl Corpus for Linguistic Applications

We discovered several recurring errors in the current version of the Europarl Corpus, originating both from the website of the European Parliament and from the corpus compilation based on it. The most frequent error was incompletely extracted metadata, which leaves non-textual fragments within the textual parts of the corpus files. On average, this affects every second speaker change. We not only cleaned the Europarl Corpus by correcting several kinds of errors, but also aligned the speakers’ contributions across all available languages and compiled everything into a new XML-structured corpus. This facilitates a more sophisticated selection of data, e.g. querying the corpus for speeches by speakers of a particular political group or in particular language combinations.
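
The kind of selection the abstract mentions (speeches by political group, restricted to particular language combinations) could look roughly like the sketch below. The XML layout is an assumption made for illustration only: the element names `speech` and `text`, the attributes `speaker`, `group` and `language`, the sample data, and the helper `speeches_by_group` are all hypothetical and do not reproduce the actual schema of the cleaned corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names, attribute names and contents are
# illustrative assumptions, not the schema of the cleaned corpus.
SAMPLE = """<corpus>
  <speech speaker="A" group="PPE">
    <text language="de">Guten Morgen.</text>
    <text language="en">Good morning.</text>
  </speech>
  <speech speaker="B" group="PSE">
    <text language="en">Thank you.</text>
  </speech>
  <speech speaker="C" group="PPE">
    <text language="fr">Merci.</text>
    <text language="en">Thanks.</text>
  </speech>
</corpus>"""


def speeches_by_group(root, group, languages):
    """Collect speeches of one political group that exist in all given languages."""
    results = []
    for speech in root.iter("speech"):
        if speech.get("group") != group:
            continue
        # Map each available language code to the text of that version.
        texts = {t.get("language"): (t.text or "").strip()
                 for t in speech.findall("text")}
        # Keep the speech only if every requested language is present.
        if all(lang in texts for lang in languages):
            selected = {lang: texts[lang] for lang in languages}
            results.append((speech.get("speaker"), selected))
    return results


root = ET.fromstring(SAMPLE)
matches = speeches_by_group(root, "PPE", ["de", "en"])
for speaker, versions in matches:
    print(speaker, versions)
```

With speaker contributions aligned across languages, a filter like this needs only one pass over the speech elements; the real corpus may expose the same information under different names.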

Metadata
Author:	Johannes Graën, Dolores Batinić, Martin Volk
URN:	http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus-2857
ISBN:	978-3-934105-46-1
Parent Title (English):	Proceedings of the 12th Edition of the KONVENS Conference Vol. 1. Hildesheim, Germany. October 8 – 10, 2014
Publisher:	Universitätsverlag Hildesheim
Place of publication:	Hildesheim
Editor:	Josef Ruppenhofer, Gertrud Faaß
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2014
Date of Publication (online):	2017/02/27
Publication state:	Published version
Review state:	(Publisher's) editorial review
Tag:	corpus linguistics
GND Keyword:	Computerlinguistik; Korpus <Linguistik>
First Page:	222
Last Page:	227
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Linguistics-Classification:	Computational linguistics
Linguistics-Classification:	Corpus linguistics
Licence (German):	Creative Commons - Attribution 3.0 Germany
