A comparable Wikipedia corpus: from wiki syntax to POS tagged XML

  • To build a comparable Wikipedia corpus of German, French, Italian, Norwegian, Polish and Hungarian for contrastive grammar research, we used a set of XSLT stylesheets to transform the MediaWiki annotations to XML. The data was then annotated with word class information using different taggers. The result is a corpus with rich metadata and linguistic annotation that can be used for multilingual research on various linguistic topics.

Author:Noah Bubenhofer, Stefanie Haupt, Horst Schwinn
Parent Title (English):[Arbeiten zur Mehrsprachigkeit / B] Arbeiten zur Mehrsprachigkeit = Working papers in multilingualism / Sonderforschungsbereich 538 Mehrsprachigkeit 538, Universität Hamburg
Publisher:Universität Hamburg
Place of publication:Hamburg
Document Type:Article
Year of first Publication:2011
Date of Publication (online):2016/08/22
Tag:Comparable Corpus; Multilingual Corpus; POS-Tagging; XSLT
GND Keyword:Kontrastive Grammatik; Korpus <Linguistik>; Wikipedia
First Page:141
Last Page:144
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Licence (German):Urheberrechtlich geschützt (protected by copyright)