Sentiment analysis is the task of extracting and classifying opinionated content in natural language texts. Common subtasks are the distinction between opinionated and factual texts, the classification of polarity in opinionated texts, and the extraction of the participating entities of an opinion(-event), i.e. the source from which an opinion emanates and the target towards which it is directed. With the emergence of Web 2.0, which describes the shift towards a highly user-interactive communication medium, the amount of subjective content on the World Wide Web is steadily increasing. Thus, there is a growing need for the automatic processing of this type of content, which sentiment analysis provides. Both natural language processing, which is the task of providing computational methods for the analysis and representation of natural language, and machine learning, which is the task of building task-specific classification models on the basis of empirical data, may be instrumental in mastering the challenges of the automatic sentiment analysis of written text. Many problems in sentiment analysis have been approached with machine learning methods that exclusively use a fairly low-level feature design, such as bag of words, containing little linguistic information. In this thesis, we examine the effectiveness of linguistic features in various subtasks of sentiment analysis. Thus, we draw heavily on the insights gained by natural language processing. Linguistic features can be applied in various classification methods: in rule-based classification, where the linguistic features are directly encoded as a classifier; in supervised machine learning, where these features complement basic low-level features; or in bootstrapping methods, where these features form a rule-based classifier generating a labeled training set from which a supervised classifier can be trained.
In this thesis, we will in particular focus on scenarios where the combination of linguistic features and machine learning methods is effective. We will look at common text classification tasks, both coarse-grained and fine-grained, and extraction tasks.
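The bootstrapping setup described in this abstract can be illustrated with a small sketch. It is not the thesis's system: the lexicon, the negation rule, the example sentences, and the choice of a Naive Bayes classifier are all assumptions made for illustration. A rule-based classifier built from linguistic cues labels raw sentences, and a supervised bag-of-words classifier is then trained on those automatically generated labels.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy polarity lexicon and negation cues (illustrative assumptions).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}
NEGATORS = {"not", "never", "no"}


def rule_label(tokens):
    """Rule-based classifier: lexicon lookup; a negator flips the next polar word."""
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None  # the rules abstain when no polar evidence is found


def train_naive_bayes(labeled):
    """Collect bag-of-words counts per class from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(model, tokens):
    """Multinomial Naive Bayes with add-one smoothing; unseen words are ignored."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_counts):
        logp = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:
                logp += math.log((word_counts[label][tok] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical unlabeled sentences.
UNLABELED = [
    "great camera and excellent battery",
    "terrible screen and poor battery",
    "not good the lens is blurry",
    "i love the sharp lens",
    "the box arrived on tuesday",
]

# Step 1: the rule-based classifier generates a labeled training set.
bootstrapped = []
for sentence in UNLABELED:
    tokens = sentence.split()
    label = rule_label(tokens)
    if label is not None:
        bootstrapped.append((tokens, label))

# Step 2: a supervised classifier is trained on the generated labels.
model = train_naive_bayes(bootstrapped)
```

In this toy setting the trained classifier can label a sentence such as "blurry screen", on which the rule-based classifier abstains because it contains no lexicon word.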
This thesis deals with expressions consisting of two noun phrases connected by a comitative preposition, referred to as comitative constructions (CCs). It focuses on CCs in Polish, with some comparisons to other languages, and provides an analysis at the morphosyntax-semantics-pragmatics interface in the paradigm of Head-Driven Phrase Structure Grammar with the integrated model-theoretic semantic framework of Lexicalized Flexible Ty2. After three different readings of Polish CCs are postulated (accompanitive, conjunctive, and open and closed inclusive), a number of semantic phenomena are discussed which provide evidence for this classification. Further examination of the data shows that all CC types behave uniformly with regard to their syntactic properties but exhibit differences regarding agreement and person, number and gender resolution. These differences have previously been explained by syntactic stipulations. This thesis argues that a syntactic approach to CCs lacks real empirical motivation, and it demonstrates that some of the existing analyses are problematic for a number of empirical and/or theoretical reasons. It further offers an alternative analysis based on the assumption that all CC types have a uniform, adjunction-based syntactic structure, and that the crucial differences between them are semantic in nature, being triggered by the meaning of the comitative preposition. The core of the proposed semantic analysis consists of three different logical representations of the comitative preposition, whose truth conditions allow us to make the right predictions about the different behavior of the three CC types. All other lexical components of CCs, including plural pronouns, bear in each type of CC their customary forms and meanings.
Implementing this idea in a constraint-based framework whose description language incorporates a formal semantic representation language, and modeling the morphosyntactic, semantic, pragmatic and referential properties of CCs within a single grammatical paradigm, we arrive at an analysis that accounts for these expressions in a very natural way.
This dissertation investigates discourse-pragmatic differences between variably linked arguments appearing in alternating argument structure constructions in the sense of Goldberg (1995) and Kay (manuscript). The properties that are studied include givenness, pragmatic relation (topic/focus), salience of referents, animacy, and others. They derive from the literature on sentence-type constructions such as topicalization and from research on the referential properties of NP form types.
The research carried out here has multiple uses. At the most basic level, it serves as an empirical check on existing characterizations of the pragmatic properties of the relevant arguments that are the result of syntactic and semantic analysis based on introspection alone. For instance, for the epistemic raising alternation involving verbs like seem, the predicted topicality difference between the subjects of the raised and unraised constructions (Langacker 1995) could not be confirmed.
This dissertation also addresses the question of what kinds of pragmatic factors, if any, are relevant to argument structure constructions. Based on the evidence of the dative alternation, the kinds of pragmatic influences on argument structure constructions do not seem to be different from, or more limited than, the ones found to be relevant to sentence-type constructions.
The kind of research undertaken here can also inform the syntactic and semantic analysis of constructions. In the case of the dative alternation, the discourse-pragmatic characteristics of the variably linked arguments provide evidence that Basilico’s (1998) analysis of the difference between the alternates in terms of VP-shells and a difference between thetic and categorical ‘inner’ predication, on the one hand, does not account for all the data and, on the other, can be restated in pragmatic terms other than the thetic-categorical distinction.
In addition to studies of valence alternations, this dissertation also discusses various null instantiation phenomena, which provide further evidence for the need to specify discourse-pragmatic properties as part of argument structure constructions and lexical entries.
Finally, it is suggested that the use of randomly sampled corpus data and statistical modelling throughout this dissertation improves both empirical and analytical coverage.
Manual development of deep linguistic resources is time-consuming and costly and therefore often described as a bottleneck for traditional rule-based NLP. In my PhD thesis I present a treebank-based method for the automatic acquisition of LFG resources for German. The method automatically creates deep and rich linguistic representations from labelled data (treebanks) and can be applied to large data sets. My research is based on and substantially extends previous work on automatically acquiring wide-coverage, deep, constraint-based grammatical resources from the English Penn-II treebank (Cahill et al., 2002; Burke et al., 2004; Cahill, 2004). Best results for English show a dependency f-score of 82.73% (Cahill et al., 2008) against the PARC 700 dependency bank, outperforming the best hand-crafted grammar of Kaplan et al. (2004). Preliminary work has been carried out to test the approach on languages other than English, providing proof of concept for the applicability of the method (Cahill et al., 2003; Cahill, 2004; Cahill et al., 2005). While first results have been promising, a number of important research questions have been raised. The original approach presented first in Cahill et al. (2002) is strongly tailored to English and the data structures provided by the Penn-II treebank (Marcus et al., 1993). English is configurational and rather poor in inflectional forms. German, by contrast, features semi-free word order and a much richer morphology. Furthermore, treebanks for German differ considerably from the Penn-II treebank as regards data structures and encoding schemes underlying the grammar acquisition task. In my thesis I examine the impact of language-specific properties of German as well as linguistically motivated treebank design decisions on PCFG parsing and LFG grammar acquisition.
I present experiments investigating the influence of treebank design on PCFG parsing and show which types of representations are useful for the PCFG and LFG grammar acquisition tasks. Furthermore, I present a novel approach to cross-treebank comparison, measuring the effect of controlled error insertion on treebank trees and parser output from different treebanks. I complement the cross-treebank comparison by providing a human evaluation using TePaCoC, a new test suite for testing parser performance on complex grammatical constructions. Manual evaluation on TePaCoC data provides new insights into the impact of flat vs. hierarchical annotation schemes on data-driven parsing. I present treebank-based LFG acquisition methodologies for two German treebanks. An extensive evaluation along different dimensions complements the investigation and provides valuable insights for the future development of treebanks.
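A dependency f-score of the kind cited in this abstract is the harmonic mean of precision and recall over dependencies. The following is a minimal sketch, assuming that dependencies are compared as (relation, head, dependent) triples; the function name and the example triples are hypothetical and are not taken from the thesis or its evaluation software.

```python
def dependency_f_score(gold, predicted):
    """Return (precision, recall, f-score) over sets of dependency triples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical triples for "Der Hund sieht die Katze" (illustrative only).
gold = {
    ("SUBJ", "sehen", "Hund"),
    ("OBJ", "sehen", "Katze"),
    ("SPEC", "Hund", "der"),
    ("SPEC", "Katze", "die"),
}
predicted = {
    ("SUBJ", "sehen", "Hund"),
    ("OBJ2", "sehen", "Katze"),  # wrong relation label counts as an error
    ("SPEC", "Hund", "der"),
    ("SPEC", "Katze", "die"),
}

precision, recall, f_score = dependency_f_score(gold, predicted)
```

Here three of the four predicted triples match the gold standard, so precision, recall, and f-score all equal 0.75.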
The thesis describes a fully automatic system for the resolution of the pronouns 'it', 'this', and 'that' in unrestricted multi-party dialog in English. The referential relations considered include both normal NP-antecedence and discourse-deictic pronouns. The thesis contains a theoretical part with a comprehensive empirical study, and a practical part describing machine learning experiments.
This thesis investigates temporal and aspectual reference in the typologically unrelated African languages Hausa (Chadic, Afro–Asiatic) and Medumba (Grassfields Bantu). It argues that Hausa is a genuinely tenseless language and compares the interpretation of temporally unmarked sentences in Hausa to that of morphologically tenseless sentences in Medumba, where tense marking is optional and graded. The empirical behavior of the optional temporal morphemes in Medumba motivates an analysis as existential quantifiers over times and thus provides new evidence suggesting that languages vary in whether their (past) tense is pronominal or quantificational (see also Sharvit 2014). The thesis proposes for both Hausa and Medumba that the alleged future tense marker is a modal element that obligatorily combines with a prospective future shifter (which is covert in Medumba). Cross-linguistic variation in whether or not a future marker is compatible with non-future interpretation is proposed to be predictable from the aspectual architecture of the given language.
This dissertation offers a qualitative analysis of verbal interactions in German television talk shows between 1989 and 1994. It investigates how speakers of German formulate their own and others’ affiliation to national identities and social spaces. In particular, it examines classifications of place, person, and time that include group and place names as well as grammatically complex expressions, deictic pronouns and adverbs, and certain motion verbs. In addition, repair is discussed as a resource in re-formulating identities.
This thesis describes work in three areas: grammar engineering, computer-assisted language learning and grammar learning. These three parts are connected by the concept of a grammar-based language learning application. Two types of grammars are of concern. The first we call resource grammars: extensive descriptions of natural languages. Part I focuses on this kind of grammar. The others are domain-specific or application-specific grammars. These grammars describe only a fragment of natural language that is determined by the domain of a certain application. Domain-specific grammars are relevant for Part II and Part III. Another important distinction is between humans learning a new natural language using computational grammars (Part II) and computers learning grammars from example sentences (Part III). Part I of this thesis focuses on grammar engineering and grammar testing. It describes the development and evaluation of a computational resource grammar for Latin. Latin is known for its rich morphology and free word order; both have to be handled in a computationally efficient way. A special focus is on methods for evaluating computational grammars using corpus data. Such an evaluation is presented for the Latin resource grammar. Part II, the central part, describes a computer-assisted language learning application based on domain-specific grammars. The language learning application demonstrates how computational grammars can be used to guide the user input and how language learning exercises can be modeled as grammars. This allows us to put computational grammars at the center of the design of language learning exercises used to help humans learn new languages. Part III, the final part, is dedicated to a method for learning domain- or application-specific grammars based on a wide-coverage grammar and small sets of example sentences.
Here a computer learns a grammar for a fragment of a natural language from example sentences, potentially without any additional human intervention. These learned grammars can be based, e.g., on the Latin resource grammar described in Part I and used as domain-specific lesson grammars in the language learning application described in Part II.
A central question in psycholinguistics is how the human brain processes language in real time. To answer this question, the differences between auditory and visual processing have to be considered. The present dissertation examines the extent to which event-related potentials (ERPs) in the human electroencephalogram (EEG) interact with different modes of presentation during sentence comprehension. Besides the two classical modalities, auditory and rapid serial visual presentation (RSVP), the monitoring of readers’ eye movements was chosen as a new mode of presentation. Here, the temporal paradox between neuronal ERP effects and behavioral effects in the eye movement record was of particular interest. Specifically, by concurrently measuring ERPs and eye movements in natural reading, the dissertation aimed to shed light on the counterintuitive fact that difficulties in sentence comprehension arise earlier in eye movement measures than in the corresponding neuronal ERP effects. In contrast to RSVP and the auditory modality, reading offers a parafoveal preview of upcoming words (Rayner 1998), which enables the brain to process information about words before these are fixated for the first time (in foveal vision). When the word Gegenteil in example (1) below is fixated and processed, the brain concurrently processes some information about the upcoming parafoveal words von and weiß. (1) Schwarz ist das Gegenteil von weiß. (2) Schwarz […] blau. (3) Schwarz […] nett. The parafoveal preview mostly provides orthographic (word form) information, while semantic information is not conveyed (Inhoff & Starr 2004; White 2008). Whereas word form and lexical meaning are processed simultaneously with RSVP and auditory presentation, the parafoveal preview in natural reading allows for a temporal decoupling such that word forms are processed before meaning. This is one reason for the faster information uptake in reading.
The present dissertation is the first to systematically investigate the influence of the parafoveal preview in sentence processing. Participants read sentences such as (1)-(3), in which two adjectives were either antonyms (1), semantically related non-antonyms (2), or semantically unrelated non-antonyms (3). ERPs were computed for the last fixation before the target word (the sentence-final word in (1)-(3)), which was assumed to capture parafoveal processing, and for the first fixation on the target, which should reflect foveal processing. The results were compared to two experiments using identical stimuli with auditory and RSVP presentation, and the parafoveal preview clearly led to different ERP results. While the RSVP and auditory presentations replicated the finding of a P300 to the second antonym in (1) (Kutas & Iragui 1998; Roehm et al. 2007), there was no P300 in response to antonyms at any fixation position in natural reading. However, the dissociation of parafoveal and foveal processing in reading also made it possible to disentangle different processes underlying the N400. There was a reduced parafoveal N400 for (1) and (2) compared with (3), which could be attributed to the preactivation of the word forms of the expected antonyms and of semantically related non-antonyms. In foveal vision, all non-antonyms (2, 3) showed an enhanced N400 compared with (1) because they were unexpected and implausible in the sentence context. This dissociation between the preactivation of a word form and the contextual fit of a word’s meaning is impossible with the other two modes of presentation, because orthographic and semantic information become available almost at the same time and are thus processed simultaneously. Furthermore, the parafoveal N400 effect was not accompanied by changes in the duration of the corresponding fixation, whereas the foveal N400 was.
Similarly, with the concurrent measurement of ERPs and eye movements, the temporal paradox described above remained, as effects in the eye movement record preceded the neuronal ERP effects. Further support for these central findings came from two additional experiments that investigated different stimuli with concurrent ERP-eye tracking measures. Altogether, the experiments revealed that the previous findings on the language-related N400 can be replicated with natural reading, but they can also be differentiated qualitatively by virtue of the characteristics of natural reading. Although the behavioral and neuronal effects mirrored one another, not every neuronal effect necessarily translates into a behavioral output. Finally, even concurrent ERP-eye tracking measures cannot resolve the temporal paradox.