Speech introducers (Redeeinleiter) are linguistic expressions of various parts of speech that precede, interrupt, or follow reported speech and introduce a direct or indirect speech report. This makes them highly varied and therefore well suited as an object of study for analysing the lexical diversity of sub-vocabularies.
The data for this study are the manually annotated direct and indirect speech introducers of the Redewiedergabe-Korpus, which consists of fictional and non-fictional text excerpts published between 1840 and 1920. The aim of the analysis is to determine how the sub-vocabulary of direct speech introducers and that of indirect speech introducers differ in lexical diversity, and how these differences can be explained. To this end, a set of quantitative methods is developed with which the lexical diversity of sub-vocabularies can be determined and which can serve as a standard repertoire in future studies of lexical diversity.
The aim of this work was to develop software that combines quantitative and qualitative corpus linguistic methods. The work as a whole therefore consists of two parts: an open-source software package and the written part. The written part presented here is a complete documentation (manual), supplemented by the author's own publications that arose during the dissertation project. Chapter 1.2, "Korpora und beispielhafte Fragestellungen" (corpora and sample research questions, p. 8), illustrates sample research questions using corpora that are provided with and integrated into the CorpusExplorer. In addition, under "?? ?? (p. ??)", analyses are linked to a wide variety of prototypical research questions that take both quantitative and qualitative perspectives. The CorpusExplorer was designed to be particularly user-friendly. Its target audience is defined very broadly: it is intended for use in both research and teaching. The CorpusExplorer therefore addresses students and researchers alike, each with their specific needs. Its use in research is shown by (A) the integrated articles and (B) the fact that other researchers have already adopted the CorpusExplorer for their work. Its use in teaching was tested and optimised repeatedly by the author. In teaching, it is important that corpora are ready for analysis with a few mouse clicks and that various analyses and visualisations can be used directly. Students are thus given the opportunity to evaluate their own corpus material directly and by themselves. For researchers, the CorpusExplorer offers a very broad range of functions. Compared with other (publicly available) corpus linguistic software, it currently has what is probably the broadest range of applications (51 analysis modules, including further-developed methods; over 100 supported file formats for import and export; various taggers with 69 supported language models). It can thus be integrated into existing scripts, toolchains, and workflows for very different research questions. The CorpusExplorer not only bundles existing functions; it also develops existing methods further. These include, for example, (1) the development of a dedicated database structure geared to corpus linguistic needs, (2) the further development and optimisation of co-occurrence analysis into a quantitative co-occurrence analysis (no parameters such as search-window size or search word required; calculation of all co-occurrences for all tokens in a corpus), and (3) the linking of different analysis resources, such as n-gram analysis and co-occurrence analysis.
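The idea of a co-occurrence analysis that needs no search word and no window size can be illustrated with a minimal sketch. It is not the CorpusExplorer implementation, which the abstract does not describe. The function name `cooccurrence_counts`, the choice of the sentence as the co-occurrence unit, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, for every unordered pair of token types, the number of
    sentences in which both occur (assumption: sentence = co-occurrence unit)."""
    counts = Counter()
    for sentence in sentences:
        # Deduplicate so a pair is counted at most once per sentence;
        # sort so each pair has one canonical (alphabetical) order.
        types = sorted(set(sentence))
        counts.update(combinations(types, 2))
    return counts


# Hypothetical toy corpus: each sentence is a list of tokens.
corpus = [
    ["der", "hund", "bellt"],
    ["der", "hund", "schläft"],
    ["die", "katze", "schläft"],
]

counts = cooccurrence_counts(corpus)
print(counts[("der", "hund")])
```

Because every pair in every sentence is counted, no query term has to be chosen in advance. The cost is that the number of pairs grows quadratically with sentence length, so a real tool would likely add a significance measure and filtering on top of the raw counts.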
This thesis is a corpus linguistic investigation of the language used by young German speakers online, examining lexical, morphological, orthographic, and syntactic features and changes in language use over time. The study analyses the language in the Nottinghamer Korpus deutscher YouTube‐Sprache ("Nottingham corpus of German YouTube language", or NottDeuYTSch corpus), one of the first large corpora of German‐language comments taken from the video‐sharing website YouTube, and built specifically for this project. The metadata‐rich corpus comprises c. 33 million tokens from more than 3 million comments posted underneath videos uploaded by mainstream German‐language youth‐orientated YouTube channels from 2008‐2018.
The NottDeuYTSch corpus was created to enable corpus linguistic approaches to studying digital German youth language (Jugendsprache), in response to the need for more specialised web corpora (see Barbaresi 2019). The methodology for compiling the corpus is described in detail in the thesis to facilitate future construction of web corpora. The thesis is situated at the intersection of Computer‐Mediated Communication (CMC) and youth language, which have been important areas of sociolinguistic scholarship since the 1980s, and explores what we can learn from a corpus‐driven, longitudinal approach to (online) youth language. To do so, the thesis uses corpus linguistic methods to analyse three main areas:
1. Lexical trends and the morphology of polysemous lexical items. For this purpose, the analysis focuses on geil, one of the most iconic and productive words in youth language, and presents a longitudinal analysis, demonstrating that usage of geil has decreased, and identifies lexical items that have emerged as potential replacements. Additionally, geil is used to analyse innovative morphological productiveness, demonstrating how different senses of geil are used as a base lexeme or affixoid in compounding and derivation.
2. Syntactic developments. The novel grammaticalization of several subordinating conjunctions into both coordinating conjunctions and discourse markers is examined. The investigation is supported by statistical analyses that demonstrate an increase in the use of non‐standard syntax over the timeframe of the corpus and compares the results with other corpora of written language.
3. Orthography and the metacommunicative features of digital writing. This analysis identifies orthographic features and strategies in the corpus, e.g. the repetition of certain emoji, and develops a holistic framework to study metacommunicative functions, such as the communication of illocutionary force, information structure, or the expression of identities. The framework unifies previous research that had focused on individual features, integrating a wide range of metacommunicative strategies within a single, robust system of analysis.
By using qualitative and computational analytical frameworks within corpus linguistic methods, the thesis identifies emergent linguistic features in digital youth language in German and sheds further light on lexical and morphosyntactic changes and trends in the language of young people over the period 2008‐2018. The study has also further developed and augmented existing analytical frameworks to widen the scope of their application to orthographic features associated with digital writing.
This thesis describes work in three areas: grammar engineering, computer-assisted language learning and grammar learning. These three parts are connected by the concept of a grammar-based language learning application. Two types of grammars are of concern. The first we call resource grammars, extensive descriptions of natural languages. Part I focuses on this kind of grammar. The other type comprises domain-specific or application-specific grammars. These grammars only describe a fragment of natural language that is determined by the domain of a certain application. Domain-specific grammars are relevant for Part II and Part III. Another important distinction is between humans learning a new natural language using computational grammars (Part II) and computers learning grammars from example sentences (Part III). Part I of this thesis focuses on grammar engineering and grammar testing. It describes the development and evaluation of a computational resource grammar for Latin. Latin is known for its rich morphology and free word order, both of which have to be handled in a computationally efficient way. A special focus is on how computational grammars can be evaluated using corpus data. Such an evaluation is presented for the Latin resource grammar. Part II, the central part, describes a computer-assisted language learning application based on domain-specific grammars. The language learning application demonstrates how computational grammars can be used to guide the user input and how language learning exercises can be modeled as grammars. This allows us to put computational grammars at the center of the design of language learning exercises used to help humans learn new languages. Part III, the final part, is dedicated to a method for learning domain- or application-specific grammars based on a wide-coverage grammar and small sets of example sentences. Here a computer learns a grammar for a fragment of a natural language from example sentences, potentially without any additional human intervention. These learned grammars can be based, for example, on the Latin resource grammar described in Part I and used as domain-specific lesson grammars in the language learning application described in Part II.
In the atelic an-construction in German, verbs occur with an an-prepositional phrase (an etwas malen/essen/basteln). Based on corpus data, this volume argues against the widespread alternation analysis of the an-construction as the prepositional counterpart of the transitive verb use. Instead, an-phrases are analysed as event-internal modifiers of one-place activity verb variants that introduce a boundary relation rooted in the local meaning of the preposition. From an empirical perspective, an inventory of corpus-based methods and measures for investigating (putative) argument alternations is presented. Finally, the connection between argument realisation and metaphor formation, which has received little attention so far, is discussed.
The study was awarded the 2022 Wilhelm von Humboldt Prize of the Deutsche Gesellschaft für Sprachwissenschaft (German Linguistic Society).
This dissertation investigates discourse-pragmatic differences between variably linked arguments appearing in alternating argument structure constructions in the sense of Goldberg (1995) and Kay (manuscript). The properties that are studied include givenness, pragmatic relation (topic/focus), salience of referents, animacy, and others. They derive from the literature on sentence-type constructions such as topicalization and from research on the referential properties of NP form types.
The research carried out here has multiple uses. At the most basic level, it serves as an empirical check on existing characterizations of the pragmatic properties of the relevant arguments that are the result of syntactic and semantic analysis based on introspection alone. For instance, for the epistemic raising alternation involving verbs like seem, the predicted topicality difference between the subjects of the raised and unraised constructions (Langacker 1995) could not be confirmed.
This dissertation also addresses the question of what kinds of pragmatic factors, if any, are relevant to argument structure constructions. Based on the evidence of the dative alternation, the pragmatic influences on argument structure constructions do not seem to be different from, or more limited than, the ones found to be relevant to sentence-type constructions.
The kind of research undertaken here can also inform the syntactic and semantic analysis of constructions. In the case of the dative alternation, the discourse-pragmatic characteristics of the variably linked arguments provide evidence that Basilico’s (1998) analysis of the difference between the alternates in terms of VP-shells and a difference between thetic and categorical ‘inner’ predication, on the one hand does not account for all the data and on the other can be re-stated in pragmatic terms other than the thetic-categorical distinction.
In addition to studies of valence alternations, this dissertation also discusses various null instantiation phenomena, which provide further evidence for the need to specify discourse-pragmatic properties as part of argument structure constructions and lexical entries.
Finally, it is suggested that the use of randomly sampled corpus data and statistical modelling throughout this dissertation improves both empirical and analytical coverage.
Manual development of deep linguistic resources is time-consuming and costly and therefore often described as a bottleneck for traditional rule-based NLP. In my PhD thesis I present a treebank-based method for the automatic acquisition of LFG resources for German. The method automatically creates deep and rich linguistic representations from labelled data (treebanks) and can be applied to large data sets. My research is based on and substantially extends previous work on automatically acquiring wide-coverage, deep, constraint-based grammatical resources from the English Penn-II treebank (Cahill et al., 2002; Burke et al., 2004; Cahill, 2004). Best results for English show a dependency f-score of 82.73% (Cahill et al., 2008) against the PARC 700 dependency bank, outperforming the best hand-crafted grammar of Kaplan et al. (2004). Preliminary work has been carried out to test the approach on languages other than English, providing proof of concept for the applicability of the method (Cahill et al., 2003; Cahill, 2004; Cahill et al., 2005). While first results have been promising, a number of important research questions have been raised. The original approach presented first in Cahill et al. (2002) is strongly tailored to English and the data structures provided by the Penn-II treebank (Marcus et al., 1993). English is configurational and rather poor in inflectional forms. German, by contrast, features semi-free word order and a much richer morphology. Furthermore, treebanks for German differ considerably from the Penn-II treebank as regards data structures and encoding schemes underlying the grammar acquisition task. In my thesis I examine the impact of language-specific properties of German as well as linguistically motivated treebank design decisions on PCFG parsing and LFG grammar acquisition. 
I present experiments investigating the influence of treebank design on PCFG parsing and show which type of representations are useful for the PCFG and LFG grammar acquisition tasks. Furthermore, I present a novel approach to cross-treebank comparison, measuring the effect of controlled error insertion on treebank trees and parser output from different treebanks. I complement the cross-treebank comparison by providing a human evaluation using TePaCoC, a new testsuite for testing parser performance on complex grammatical constructions. Manual evaluation on TePaCoC data provides new insights on the impact of flat vs. hierarchical annotation schemes on data-driven parsing. I present treebank-based LFG acquisition methodologies for two German treebanks. An extensive evaluation along different dimensions complements the investigation and provides valuable insights for the future development of treebanks.
This thesis consists of the following three papers that all have been published in international peer-reviewed journals:
Chapter 3: Koplenig, Alexander (2015c). The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv037]
Chapter 4: Koplenig, Alexander (2015b). Why the quantitative analysis of diachronic corpora that does not consider the temporal aspect of time-series can lead to wrong conclusions. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv030]
Chapter 5: Koplenig, Alexander (2015a). Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis. Published in: Corpus Linguistics and Linguistic Theory. Berlin/Boston: de Gruyter. [doi:10.1515/cllt-2014-0049]
Chapter 1 introduces the topic by describing and discussing several basic concepts relevant to the statistical analysis of corpus linguistic data. Chapter 2 presents a method to analyze diachronic corpus data and a summary of the three publications. Chapters 3 to 5 each represent one of the three publications. All papers are printed in this thesis with the permission of the publishers.
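For readers unfamiliar with the Zipf–Mandelbrot law named in the title of the Chapter 5 paper, its usual textbook form is given below. The symbols are a notational assumption and are not taken from the paper, which may use a different parametrisation: f(r) is the frequency of the word type at frequency rank r, V is the number of types, C is a normalising constant, and α > 0 and β ≥ 0 are the two parameters.

```latex
f(r) = \frac{C}{(r + \beta)^{\alpha}}, \qquad r = 1, 2, \dots, V
```

With β = 0 this reduces to Zipf's law, f(r) = C / r^{α}.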