Refine
Year of publication
- 2016 (1) (remove)
Document Type
Language
- English (1)
Has Fulltext
- yes (1)
Is part of the Bibliography
- yes (1)
Keywords
- Annotation (1)
- Korpus <Linguistik> (1)
- Korpusvergleich (1)
- Textklassifikation (1)
- Textlinguistik (1)
- Topikmodellierung (1)
Publicationstate
Reviewstate
Publisher
In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora—a web corpus and a newspaper corpus—by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with meta data required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross validations. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.