This thesis is a corpus linguistic investigation of the language used by young German speakers online, examining lexical, morphological, orthographic, and syntactic features and changes in language use over time. The study analyses the language in the Nottinghamer Korpus deutscher YouTube-Sprache ("Nottingham corpus of German YouTube language", or NottDeuYTSch corpus), one of the first large corpora of German-language comments taken from the video-sharing website YouTube, and built specifically for this project. The metadata-rich corpus comprises c. 33 million tokens from more than 3 million comments posted underneath videos uploaded by mainstream German-language youth-orientated YouTube channels from 2008 to 2018.
The NottDeuYTSch corpus was created to enable corpus linguistic approaches to studying digital German youth language (Jugendsprache), in response to an identified need for more specialised web corpora (see Barbaresi 2019). The methodology for compiling the corpus is described in detail in the thesis to facilitate future construction of web corpora. The thesis is situated at the intersection of Computer-Mediated Communication (CMC) and youth language, which have been important areas of sociolinguistic scholarship since the 1980s, and explores what we can learn from a corpus-driven, longitudinal approach to (online) youth language. To do so, the thesis uses corpus linguistic methods to analyse three main areas:
1. Lexical trends and the morphology of polysemous lexical items. The analysis focuses on geil, one of the most iconic and productive words in youth language. A longitudinal analysis demonstrates that usage of geil has decreased and identifies lexical items that have emerged as potential replacements. Additionally, geil is used to analyse innovative morphological productivity, demonstrating how different senses of geil are used as a base lexeme or affixoid in compounding and derivation.
2. Syntactic developments. The novel grammaticalization of several subordinating conjunctions into both coordinating conjunctions and discourse markers is examined. The investigation is supported by statistical analyses that demonstrate an increase in the use of non-standard syntax over the timeframe of the corpus, and the results are compared with other corpora of written language.
3. Orthography and the metacommunicative features of digital writing. This analysis identifies orthographic features and strategies in the corpus, e.g. the repetition of certain emoji, and develops a holistic framework to study metacommunicative functions, such as the communication of illocutionary force, information structure, or the expression of identities. The framework unifies previous research that had focused on individual features, integrating a wide range of metacommunicative strategies within a single, robust system of analysis.
By using qualitative and computational analytical frameworks within corpus linguistic methods, the thesis identifies emergent linguistic features in digital youth language in German and sheds further light on lexical and morphosyntactic changes and trends in the language of young people over the period 2008‐2018. The study has also further developed and augmented existing analytical frameworks to widen the scope of their application to orthographic features associated with digital writing.
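The longitudinal lexical analysis described above rests on comparing how often an item occurs per year, normalised by corpus size. A minimal sketch of that idea follows, assuming whitespace tokenisation and a hypothetical input of (year, comment text) pairs; the function name and data layout are illustrative assumptions and do not reflect the thesis's actual pipeline.

```python
from collections import defaultdict


def per_million_by_year(comments, target):
    """Return the frequency of `target` per million tokens for each year.

    `comments` is an iterable of (year, text) pairs (a hypothetical layout).
    Tokenisation is a plain lowercase whitespace split, which is an
    assumption made for brevity, not the thesis's method.
    """
    totals = defaultdict(int)  # tokens seen per year
    hits = defaultdict(int)    # occurrences of the target per year
    for year, text in comments:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += sum(1 for tok in tokens if tok == target)
    # Normalise so that years with different comment volumes are comparable.
    return {
        year: hits[year] * 1_000_000 / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Toy data, invented purely for illustration.
sample = [
    (2008, "das ist geil geil"),
    (2018, "das ist nice"),
    (2018, "voll geil heute oder"),
]
frequencies = per_million_by_year(sample, "geil")
```

Normalising per million tokens matters here because the number of comments per year varies; raw counts alone would conflate changes in usage with changes in corpus size.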
The thesis describes a fully automatic system for the resolution of the pronouns 'it', 'this', and 'that' in unrestricted multi-party dialog in English. The referential relations considered include both normal NP antecedence and discourse-deictic pronouns. The thesis contains a theoretical part with a comprehensive empirical study, and a practical part describing machine learning experiments.
Sentiment analysis is the task of extracting and classifying opinionated content in natural language texts. Common subtasks are the distinction between opinionated and factual texts, the classification of polarity in opinionated texts, and the extraction of the entities participating in an opinion (event), i.e. the source from which an opinion emanates and the target towards which it is directed. With the emergence of Web 2.0, which describes the shift towards a highly user-interactive communication medium, the amount of subjective content on the World Wide Web is steadily increasing. Thus, there is a growing need for the automatic processing of this type of content, which sentiment analysis provides. Both natural language processing, which provides computational methods for the analysis and representation of natural language, and machine learning, which builds task-specific classification models on the basis of empirical data, may be instrumental in mastering the challenges of the automatic sentiment analysis of written text. Many problems in sentiment analysis have been approached exclusively with machine learning methods using a fairly low-level feature design, such as bag of words, that contains little linguistic information. In this thesis, we examine the effectiveness of linguistic features in various subtasks of sentiment analysis, drawing heavily on the insights gained by natural language processing. Linguistic features can be applied in various classification methods: in rule-based classification, where the linguistic features are directly encoded as a classifier; in supervised machine learning, where these features complement basic low-level features; or in bootstrapping methods, where these features form a rule-based classifier that generates a labeled training set on which a supervised classifier can be trained.
In this thesis, we focus in particular on scenarios where the combination of linguistic features and machine learning methods is effective. We look at common text classification tasks, both coarse-grained and fine-grained, and at extraction tasks.
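The bootstrapping setup mentioned above, in which a rule-based classifier labels raw text and a supervised classifier is then trained on those labels, can be sketched as follows. This is only an illustration under stated assumptions: the tiny polarity lexicon, the negation rule, and the choice of a bag-of-words Naive Bayes learner are all hypothetical stand-ins and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon and negators (assumptions for illustration only).
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATORS = {"not", "never", "no"}


def tokenize(text):
    return text.lower().split()


def rule_label(text):
    """Rule-based classifier: lexicon lookup with negation flipping.

    Returns 'pos', 'neg', or None when the rules abstain.
    """
    tokens = tokenize(text)
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        # A directly preceding negator flips the polarity of the word.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None


def bootstrap_training_set(unlabeled):
    """Keep only the texts on which the rule-based classifier commits."""
    labeled = []
    for text in unlabeled:
        label = rule_label(text)
        if label is not None:
            labeled.append((text, label))
    return labeled


def train_naive_bayes(labeled):
    """Collect the counts needed for a multinomial Naive Bayes model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the most probable class, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # unseen words are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy unlabeled data, invented for illustration.
texts = [
    "great camera and good battery",
    "terrible screen and poor sound",
    "not good at all",
    "the box arrived on monday",
    "excellent battery",
]
training_set = bootstrap_training_set(texts)
model = train_naive_bayes(training_set)
```

The point of the second stage is that the trained classifier can generalise beyond the lexicon: in this toy example, predict(model, "the battery lasts") can draw on the word "battery", which the rules alone know nothing about.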
Writing and revising place high cognitive demands on authors. Even published texts are never entirely free of errors. For many errors, the origin can be reconstructed: functions in word processors are character-based and take no account of the elements and structures of the language being used. Authors must translate their revision intentions into a long, complex sequence of such character-based functions.
Editors for programmers, by contrast, have long offered language-specific editing functions that operate on the elements and structures of the programming language in use. These functions help to make changing program code easier and to avoid errors.
In this thesis, we transfer the principle of such language-specific functions in programming editors to functions for editing natural-language texts. We develop the concept of linguistically supported revision functions, taking into account current findings from writing research. We define information, movement, and modification functions that operate on elements and structures of natural language. Such functions are intended to relieve authors and help them avoid typical errors.
Language-specific functions rely on methods for recognising and identifying relevant elements and structures. For this purpose, we use computational-linguistic resources for morphological analysis and generation and for automatic part-of-speech tagging. Our evaluation of available resources shows that the situation for processing German is not as promising as originally assumed and as usually presented in the literature.
Our prototype implementation of linguistically supported revision functions for editing German texts shows the possibilities and limits of the concept, given the performance of the computational-linguistic resources available today and the properties of German.