TY  - JOUR
U1  - Journal article, scholarly, peer-reviewed
A1  - Margaretha, Eliza
A1  - Lüngen, Harald
ED  - Beißwenger, Michael
ED  - Storrer, Angelika
ED  - Oostdijk, Nelleke
ED  - van den Heuvel, Henk
T1  - Building linguistic corpora from Wikipedia articles and discussions
JF  - Journal of Language Technology and Computational Linguistics. Special issue on building and annotating corpora of computer-mediated communication. Issues and challenges at the interface between computational and corpus linguistics
N2  - Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus - DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion approach generates rich and valid corpora regardless of language. We also introduce a method to segment user contributions in talk pages into postings.
KW  - Wikipedia
KW  - Corpus
KW  - Computational linguistics
KW  - Deutsches Referenzkorpus (DeReKo)
Y1  - 2014
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-33306
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-33306
VL  - 29
IS  - 2
SP  - 59
EP  - 82
ER  -