In this paper, we deal with register-driven variation from a probabilistic perspective, as proposed in Schäfer, Bildhauer, Pankratz and Müller (2022). We compare two approaches to analysing this variation within HPSG. On the one hand, we consider a multiple-grammar approach, combine it with the architecture proposed in the CoreGram project (Müller 2015), and discuss its advantages and disadvantages. On the other hand, we consider a single-grammar approach and argue that it appears to be superior due to its computational efficiency and cognitive plausibility.
In this paper, we describe preliminary results from an ongoing experiment in which we classify two large unstructured text corpora, a web corpus and a newspaper corpus, by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with the metadata required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains, using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
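The second step described above (learning annotated topic domains from induced topic distributions, evaluated by 10-fold cross-validation) can be illustrated with a minimal sketch. The abstract does not name the classifier or the topic model, so everything here is an assumption: the nearest-centroid classifier, the function names `centroid_fit`, `centroid_predict` and `cross_val_accuracy`, the two domain labels, and the toy topic proportions, which are invented for illustration and are not the authors' data. The topic distributions are taken as given input, as if a topic model had already been run.

```python
from collections import defaultdict

def centroid_fit(X, y):
    """Compute the mean topic-proportion vector for each domain label."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def centroid_predict(centroids, vec):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda lab: sqdist(centroids[lab], vec))

def cross_val_accuracy(X, y, k=10):
    """k-fold CV: fold i holds out every k-th document, starting at index i."""
    correct = 0
    for i in range(k):
        train_X = [x for j, x in enumerate(X) if j % k != i]
        train_y = [t for j, t in enumerate(y) if j % k != i]
        model = centroid_fit(train_X, train_y)
        for j in range(i, len(X), k):
            correct += centroid_predict(model, X[j]) == y[j]
    return correct / len(X)

# Toy input (hypothetical): 20 documents, 3 induced topics, 2 domains.
# Each row is a per-document topic distribution that sums to 1.
X, y = [], []
for j in range(20):
    eps = (j % 5) * 0.02
    if j % 2 == 0:
        X.append([0.7 + eps, 0.2 - eps, 0.1])
        y.append("politics")
    else:
        X.append([0.1, 0.2 - eps, 0.7 + eps])
        y.append("science")

print(cross_val_accuracy(X, y, k=10))
```

On real data the folds would normally be shuffled and stratified by domain; the interleaved split here only keeps the sketch short and deterministic.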