Korpuslinguistik
Refine
Document Type
Language
- English (2)
Has Fulltext
- yes (2)
Keywords
- Annotation (2)
- Korpus <Linguistik> (2)
- Corpus annotation (1)
- Corpus query language (1)
- Corpus technology (1)
- Datenbanksystem (1)
- Korpusvergleich (1)
- Large corpora (1)
- Textklassifikation (1)
- Textlinguistik (1)
Publicationstate
Reviewstate
- (Verlags)-Lektorat (1)
- Peer-Review (1)
Publisher
In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora—a web corpus and a newspaper corpus—by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with meta data required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross validations. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called texrex, wrappers for existing linguistic annotation tools as well as an online query software called Colibri2. By detailed descriptions of the implementation and systematic evaluations of the performance of the software on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of up to at least 100 billion tokens. I also introduce our running demo system which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish