Refine
Year of publication
Document Type
- Conference Proceeding (17)
- Article (4)
- Part of a Book (3)
- Other (1)
Has Fulltext
- yes (25)
Keywords
- Korpus <Linguistik> (16)
- Deutsch (5)
- Automatische Sprachanalyse (4)
- Englisch (4)
- Sprachwandel (4)
- Deutsches Referenzkorpus (DeReKo) (3)
- Sprachvariante (3)
- Visualisierung (3)
- Automatische Sprachverarbeitung (2)
- Computerlinguistik (2)
Publicationstate
- Veröffentlichungsversion (14)
- Postprint (5)
- Zweitveröffentlichung (1)
Reviewstate
- Peer-Review (14)
- (Verlags)-Lektorat (3)
- Peer-Revied (1)
Publisher
- European Language Resources Association (ELRA) (4)
- Leibniz-Institut für Deutsche Sprache (IDS) (2)
- Zenodo (2)
- Association for Computational Linguistics (1)
- Deutsche Gesellschaft für Sprachwissenschaft (1)
- European Language Resources Association (1)
- European language resources association (ELRA) (1)
- Frontiers Media S.A. (1)
- Jagiellonian University; Pedagogical University (1)
- Lancaster University (1)
The automatic recognition of idioms poses a challenging problem for NLP applications. Whereas native speakers can intuitively handle multiword expressions whose compositional meanings are hard to trace back to individual word semantics, there is still ample scope for improvement regarding computational approaches. We assume that idiomatic constructions can be characterized by gradual intensities of semantic non-compositionality, formal fixedness, and unusual usage context, and introduce a number of measures for these characteristics, comprising count-based and predictive collocation measures together with measures of context (un)similarity. We evaluate our approach on a manually labelled gold standard, derived from a corpus of German pop lyrics. To this end, we apply a Random Forest classifier to analyze the individual contribution of features for automatically detecting idioms, and study the trade-off between recall and precision. Finally, we evaluate the classifier on an independent dataset of idioms extracted from a list of Wikipedia idioms, achieving state-of-the art accuracy.
In order to differentiate between figurative and literal usage of verb-noun combinations for the shared task on the disambiguation of German Verbal Idioms issued for KONVENS 2021, we apply and extend an approach originally developed for detecting idioms in a dataset consisting of random ngram samples. The classification is done by implementing a rather shallow, statistics-based pipeline without intensive preprocessing and examinations on the morphosyntactic and semantic level. We describe the overall approach, the differences between the original dataset and the dataset of the KONVENS task, provide experimental classification results, and analyse the individual contributions of our feature sets.
Linguistic Variation and Change in 250 Years of English Scientific Writing: A Data-Driven Approach
(2020)
We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665. Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register). Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.
Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
(2014)
We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.
We evaluate a graph-based dependency parser on DeReKo, a large corpus of contemporary German. The dependency parser is trained on the German dataset from the SPMRL 2014 Shared Task which contains text from the news domain, whereas DeReKo also covers other domains including fiction, science, and technology. To avoid the need for costly manual annotation of the corpus, we use the parser’s probability estimates for unlabeled and labeled attachment as main evaluation criterion. We show that these probability estimates are highly correlated with the actual attachment scores on a manually annotated test set. On this basis, we compare estimated parsing scores for the individual domains in DeReKo, and show that the scores decrease with increasing distance of a domain to the training corpus.
Forschungsdatenmanagement in den Geisteswissenschaften am Beispiel der germanistischen Linguistik
(2013)
Die Kernaufgabe des Instituts für Deutsche Sprache (IDS) ist die Erforschung und Dokumentation der deutschen Sprache. Dazu sammelt und archiviert das IDS einen umfangreichen Bestand an Forschungsprimärdaten in Form von Korpora der geschriebenen und gesprochenen Sprache sowie Sekundärdaten, wie zum Beispiel lexikographische Ressourcen. Dieser Beitrag gibt einen Überblick über den Datenbestand des IDS und die laufenden Forschungskooperationen im Bereich der Langzeitarchivierung. In diesem Kontext wird das im Aufbau befindliche Langzeitarchivdes IDS mit seiner Architektur, den zugrundeliegenden Prinzipien zur Daten- und Metadatenmodellierung sowie den daraus abgeleiteten Erfassungsprozessen vorgestellt. Der Beitrag schließt ab mit einem Ausblick auf die Herausforderungen und Perspektiven des Forschungsdatenmanagements aus Sicht der germanistischen Linguistik.