Korpuslinguistik
Refine
Document Type
- Part of a Book (2)
- Conference Proceeding (2)
Has Fulltext
- yes (4)
Is part of the Bibliography
- yes (4) (remove)
Keywords
- Korpus <Linguistik> (3)
- Sprachstatistik (2)
- Annotation (1)
- Automatische Sprachanalyse (1)
- Deutsch (1)
- Fallstudie (1)
- Fugenelement (1)
- Grammatik (1)
- Komposition <Wortbildung> (1)
- Kompositum (1)
Publicationstate
- Veröffentlichungsversion (2)
- Erstveröffentlichung (1)
- Postprint (1)
- Zweitveröffentlichung (1)
Reviewstate
- (Verlags)-Lektorat (3)
- Peer-Review (1)
Publisher
- Association for Computational Linguistics (1)
- Charles University (1)
- Narr (1)
- Wilhelm Fink (1)
Gegenstand des Beitrags sind korpuslinguistische Zugänge zur Variation im Auftreten des Fugenelements in Komposita aus zwei Nomen (Arbeit I s I weg). Die qualitative Vorstudie zeigt, dass die Verfügung nach Erstglied auf Vokal (Bühne I n I spiel, See I ufer) entgegen manchen Hinweisen aus bisherigen Korpusuntersuchungen sehr weitgehend linguistisch systematisierbar ist. Die Hauptstudie fokussiert dann die sehr variable Verfügung nach Erstglied auf Konsonant (Arbeit I s I weg vs. Heimat I art). Sie modelliert statistisch den Einfluss von Größen, deren Bedeutung in der bisherigen Forschung nur angenommen, aber nicht überprüft werden konnte. Dabei führt sie auch neue Einflussgrößen ein und gibt deutliche Hinweise darauf, dass die Variation in größerem Ausmaß als bisher vermutet einzelfallspezifisch geregelt ist.
In the NLP literature, adapting a parser to new text with properties different from the training data is commonly referred to as domain adaptation. In practice, however, the differences between texts from different sources often reflect a mixture of domain and genre properties, and it is by no means clear what impact each of those has on statistical parsing. In this paper, we investigate how differences between articles in a newspaper corpus relate to the concepts of genre and domain and how they influence parsing performance of a transition-based dependency parser. We do this by applying various similarity measures for data point selection and testing their adequacy for creating genre-aware parsing models.
In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora—a web corpus and a newspaper corpus—by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with meta data required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross validations. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.