The dataset contains 10,113 corpus attestations of constructions in which a noun occurs with a dass-clause or a zu-infinitive (das Versprechen, dass man sich irgendwann wiedersieht ‘the promise that we will see each other again some day’ vs. das Versprechen, sich irgendwann wiederzusehen ‘the promise to see each other again some day’).
The data were drawn from:
1. the Korpusgrammatik research corpus (Bubenhofer et al. 2014), based on the German Reference Corpus DeReKo (Kupietz et al. 2010, 2018), release 2017-II.
2. the “Forum” subcorpus of the DECOW16B web corpus (Schäfer & Bildhauer 2012).
In this paper, we deal with register-driven variation from a probabilistic perspective, as proposed in Schäfer, Bildhauer, Pankratz, and Müller (2022). We compare two approaches to analysing this variation within HPSG. On the one hand, we consider a multiple-grammar approach and combine it with the architecture proposed in the CoreGram project (Müller 2015), discussing its advantages and disadvantages. On the other hand, we consider a single-grammar approach and argue that it appears to be superior due to its computational efficiency and cognitive plausibility.
The dataset contains 16,604 corpus attestations of noun phrases with genitive and von-attributes (die Ideen zahlreicher Kinder, die Ideen von zahlreichen Kindern ‘the ideas of numerous children’), where the genitive attributes can appear pre- or postnominally (Mannheims Sehenswürdigkeiten ‘Mannheim's sights’, die Sehenswürdigkeiten Mannheims ‘the sights of Mannheim’).
Each attestation comes with information on country, decade, and medium. In addition, there are annotations for the head and/or attribute lemma (e.g. name type, inflection class), the phrase as a whole (e.g. definiteness, case), and the attribute phrase (e.g. case distinction, length). Numerous special cases are annotated as well (e.g. a genitive with an uninflected adjective as in Gebäck Mannheimer Bäckereien, or phrases with an adjectivally inflected attribute noun as in die Ideen Jugendlicher, die Ideen von Jugendlichen).
The present chapter investigates the relative order of attributive adjectives in German. Based on corpus data, our results corroborate previous findings that semantics is the most important factor in accounting for adjective order. Going beyond previous studies, we also consider coordinated structures (such as mit [[großem, verwildertem] Garten] ‘with (a) large, overgrown garden’), where both adjectives are of equal rank. While adjective order in embedded structures (mit [ schwierigem [ familiärem Hintergrund ]] ‘with (a) difficult domestic background’) can be predicted rather accurately on semantic grounds, we show that predictions can also be made for coordinated structures, albeit with lower accuracy. Using regression analysis, we examine how semantic factors interact with a number of other explanatory variables.
In the NLP literature, adapting a parser to new text with properties different from the training data is commonly referred to as domain adaptation. In practice, however, the differences between texts from different sources often reflect a mixture of domain and genre properties, and it is by no means clear what impact each of those has on statistical parsing. In this paper, we investigate how differences between articles in a newspaper corpus relate to the concepts of genre and domain and how they influence the parsing performance of a transition-based dependency parser. We do this by applying various similarity measures for data point selection and testing their adequacy for creating genre-aware parsing models.
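As a rough illustration of similarity-based data point selection, the sketch below ranks candidate training articles by their similarity to a target text and keeps the top k. It assumes cosine similarity over bag-of-words counts, which is only one conceivable measure and not necessarily one of those tested in the paper; the function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a token-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two Counter-based frequency vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_training_articles(target, candidates, k):
    """Return the k candidate texts most similar to the target text."""
    target_vec = bag_of_words(target)
    ranked = sorted(
        candidates,
        key=lambda text: cosine_similarity(target_vec, bag_of_words(text)),
        reverse=True,
    )
    return ranked[:k]


# Toy data (assumption, not taken from the newspaper corpus).
articles = [
    "the team won the match",
    "the bank raised interest rates",
    "the striker scored in the match",
]
selected = select_training_articles("the match ended with a late goal", articles, k=2)
```

In a genre-aware setting, the selected articles would then serve as training data for a parser model aimed at texts resembling the target.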
In this paper, we describe preliminary results from an ongoing experiment in which we classify two large unstructured text corpora, a web corpus and a newspaper corpus, by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with the metadata required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
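The second step, learning topic domains from induced topic distributions and evaluating with k-fold cross-validation, might be sketched as follows. This is a minimal sketch under several assumptions: the per-document topic distributions are taken as already computed by some topic model, a simple nearest-centroid classifier stands in for whatever supervised learner the authors used, the folds are formed by simple index striding without shuffling or stratification, and the feature vectors and domain labels are invented toy values.

```python
def nearest_centroid_fit(features, labels):
    """Average the topic vectors of each domain to get one centroid per domain."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, vec):
    """Predict the domain whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validated_accuracy(features, labels, folds=10):
    """Hold out every folds-th document in turn; return overall accuracy."""
    n = len(features)
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, n, folds))
        train_x = [features[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        centroids = nearest_centroid_fit(train_x, train_y)
        correct += sum(
            nearest_centroid_predict(centroids, features[i]) == labels[i]
            for i in test_idx
        )
    return correct / n


# Toy topic distributions over two topics (assumption, not the paper's data).
features = [[0.9, 0.1] if i % 2 == 0 else [0.2, 0.8] for i in range(20)]
labels = ["DomainA" if i % 2 == 0 else "DomainB" for i in range(20)]
accuracy = cross_validated_accuracy(features, labels, folds=10)
```

On real data the domains overlap far more than in this perfectly separable toy set, so the accuracy would be well below what the toy example yields.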