Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison

In this paper, we describe preliminary results from an ongoing experiment in which we classify two large unstructured text corpora, a web corpus and a newspaper corpus, by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with the meta data required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
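
The second step described in the abstract (learning annotated topic domains from per-document topic distributions and scoring the result with 10-fold cross-validation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data generator, the three domain names, the nearest-centroid classifier, and all function names are hypothetical, and none of them is taken from the paper, which does not name its tools or classifier in the abstract.

```python
import random
from collections import defaultdict


def make_toy_data(n_per_domain=20, seed=0):
    """Build synthetic 'topic distributions' (hypothetical stand-in for
    the output of a topic model). Each document gets one dominant topic
    that matches its domain, plus random noise on the other topics."""
    rng = random.Random(seed)
    domains = ["Science", "Politics", "Sports"]  # hypothetical labels
    features, labels = [], []
    for d, name in enumerate(domains):
        for _ in range(n_per_domain):
            weights = [rng.random() for _ in domains]
            weights[d] += 3.0  # make the domain's own topic dominant
            total = sum(weights)
            features.append([w / total for w in weights])  # sums to 1
            labels.append(name)
    return features, labels


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sqdist(centroids[label], x))


def cross_validate(features, labels, k=10, seed=0):
    """k-fold cross-validation: every document is held out exactly once,
    and the classifier is refit on the remaining folds each time."""
    correct = 0
    for test_idx in k_fold_indices(len(features), k, seed):
        held_out = set(test_idx)
        by_label = defaultdict(list)
        for i, (x, y) in enumerate(zip(features, labels)):
            if i not in held_out:
                by_label[y].append(x)
        centroids = {y: centroid(v) for y, v in by_label.items()}
        correct += sum(
            predict(centroids, features[i]) == labels[i] for i in test_idx
        )
    return correct / len(features)


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"10-fold accuracy on toy data: {cross_validate(X, y):.2f}")
```

The toy data are deliberately easy to separate, so the accuracy printed here says nothing about the roughly 70% reported above; the sketch only shows how topic distributions serve as feature vectors and how the held-out folds are scored.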

Author: Roland Schäfer, Felix Bildhauer
Parent Title (English): Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, Berlin, Germany, August 7-12, 2016
Publisher: Association for Computational Linguistics
Place of publication: Berlin
Document Type: Conference Proceeding
Year of first Publication: 2016
Date of Publication (online): 2016/09/26
Tag: Corpus comparison; Text classification; Topic modeling
GND Keyword: Annotation; Korpus <Linguistik>; Textlinguistik
First Page: 1
Last Page: 6
Dewey Decimal Classification: 400 Language / 410 Linguistics
Leibniz-Classification: Language, Linguistics
Open Access?: Yes
Licence (German): German copyright law (UrhG) applies