Building linguistic corpora from Wikipedia articles and discussions

  • Wikipedia is a valuable resource that can serve as a linguistic corpus or as a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus - DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The approach generates rich and valid corpora regardless of language. We also introduce a method for segmenting user contributions on talk pages into postings.

Metadata
Author: Eliza Margaretha, Harald Lüngen
URN: urn:nbn:de:bsz:mh39-33306
Parent Title (English): Journal of Language Technology and Computational Linguistics. Special issue on building and annotating corpora of computer-mediated communication. Issues and challenges at the interface between computational and corpus linguistics
Editor: Michael Beißwenger, Angelika Storrer, Nelleke Oostdijk, Henk van den Heuvel
Document Type: Article
Language: German
Year of first Publication: 2014
Date of Publication (online): 2014/12/16
Tag: Deutsches Referenzkorpus (DeReKo)
GND Keyword: Computerlinguistik; Korpus <Linguistik>; Wikipedia
Volume: 29
Issue: 2
First Page: 59
Last Page: 82
Dewey Decimal Classification: 400 Language / 430 German
Leibniz-Classification: Language, Linguistics
Linguistics-Classification: Corpus linguistics
Open Access?: Yes
Licence (German): Creative Commons - Attribution 3.0 Germany