Refine
Year of publication
Document Type
- Part of a Book (195)
- Conference Proceeding (176)
- Article (91)
- Book (31)
- Other (18)
- Working Paper (14)
- Doctoral Thesis (6)
- Preprint (3)
- Review (3)
- Master's Thesis (2)
Language
- English (279)
- German (262)
- French (2)
- Multiple languages (1)
Keywords
- Korpus <Linguistik> (544) (remove)
Publicationstate
- Veröffentlichungsversion (544) (remove)
Reviewstate
Publisher
- de Gruyter (62)
- Institut für Deutsche Sprache (43)
- IDS-Verlag (25)
- Leibniz-Institut für Deutsche Sprache (IDS) (25)
- Narr (22)
- Association for Computational Linguistics (18)
- European Language Resources Association (ELRA) (18)
- Leibniz-Institut für Deutsche Sprache (17)
- Heidelberg University Publishing (15)
- European Language Resources Association (13)
Der Beitrag zeigt, auf welch grundlegende Weise das Paradigma der Corpus-Driven-Linguistics (CDL) die linguistische Beschreibung sprachlichen Usus auf der Basis mathematisch-statistischer Clusteringverfahren bestimmt. Es soll deutlich werden, wie sich diese Prämissen im Forschungsschwerpunkt zur linguistischen Systematisierung und Interpretation von Kookkurrenzdaten manifestieren.
Arbeitet man als muttersprachlicher Sprecher des Deutschen mit Corpora gesprochener oder geschriebener deutscher Sprache, dann reflektiert man in aller Regel nur selten über die Vielzahl von kulturspezifischen Informationen, die in solchen Texten kodifiziert sind - vor allem, wenn es sich bei diesen Daten um Texte aus der Gegenwart handelt. In den meisten Fällen hat man nämlich keinerlei Probleme mit dem in den Daten präsupponierten und als allgemein bekannt erachteten Hintergrundswissen. Betrachtet man dagegen Daten in Corpora, die andere - vor allem nicht-indoeuropäische - Sprachen dokumentieren, dann wird einem schnell bewusst, wieviel an kulturspezifischem Wissen nötig ist, um diese Daten adäquat zu verstehen. In meinem Beitrag illustriere ich diese Beobachtung an einem Beispiel aus meinem Corpus des Kilivila, der austronesischen Sprache der Trobriand-Insulaner von Papua-Neuguinea. Anhand eines kurzen Ausschnitts einer insgesamt etwa 26 Minuten dauernden Dokumentation, worüber und wie sechs Trobriander miteinander tratschen und klatschen, zeige ich, was ein Hörer oder Leser eines solchen kurzen Daten-Ausschnitts wissen muss, um nicht nur dem Gespräch überhaupt folgen zu können, sondern auch um zu verstehen, was dabei abläuft und wieso ein auf den ersten Blick absolut alltägliches Gespräch plötzlich für einen Trobriander ungeheuer an Brisanz und Bedeutung gewinnt. Vor dem Hintergrund dieses Beispiels weise ich dann zum Schluss meines Beitrags darauf hin, wie unbedingt nötig und erforderlich es ist, in allen Corpora bei der Erschließung und Kommentierung von Datenmaterialien durch sogenannte Metadaten solche kulturspezifischen Informationen explizit zu machen.
"Themengebundene Verwendung(en)" als neuer Angabetyp unter der Rubrik "Besonderheiten des Gebrauchs"
(2011)
This thesis is a corpus linguistic investigation of the language used by young German speakers online, examining lexical, morphological, orthographic, and syntactic features and changes in language use over time. The study analyses the language in the Nottinghamer Korpus deutscher YouTube‐Sprache ("Nottingham corpus of German YouTube language", or NottDeuYTSch corpus), one of the first large corpora of German‐language comments taken from the videosharing website YouTube, and built specifically for this project. The metadatarich corpus comprises c.33 million tokens from more than 3 million comments posted underneath videos uploaded by mainstream German‐language youthorientated YouTube channels from 2008‐2018.
The NottDeuYTSch corpus was created to enable corpus linguistic approaches to studying digital German youth language (Jugendsprache), having identified the need for more specialised web corpora (see Barbaresi 2019). The methodology for compiling the corpus is described in detail in the thesis to facilitate future construction of web corpora. The thesis is situated at the intersection of Computer‐Mediated Communication (CMC) and youth language, which have been important areas of sociolinguistic scholarship since the 1980s, and explores what we can learn from a corpus‐driven, longitudinal approach to (online) youth language. To do so, the thesis uses corpus linguistic methods to analyse three main areas:
1. Lexical trends and the morphology of polysemous lexical items. For this purpose, the analysis focuses on geil, one of the most iconic and productive words in youth language, and presents a longitudinal analysis, demonstrating that usage of geil has decreased, and identifies lexical items that have emerged as potential replacements. Additionally, geil is used to analyse innovative morphological productiveness, demonstrating how different senses of geil are used as a base lexeme or affixoid in compounding and derivation.
2. Syntactic developments. The novel grammaticalization of several subordinating conjunctions into both coordinating conjunctions and discourse markers is examined. The investigation is supported by statistical analyses that demonstrate an increase in the use of non‐standard syntax over the timeframe of the corpus and compares the results with other corpora of written language.
3. Orthography and the metacommunicative features of digital writing. This analysis identifies orthographic features and strategies in the corpus, e.g. the repetition of certain emoji, and develops a holistic framework to study metacommunicative functions, such as the communication of illocutionary force, information structure, or the expression of identities. The framework unifies previous research that had focused on individual features, integrating a wide range of metacommunicative strategies within a single, robust system of analysis.
By using qualitative and computational analytical frameworks within corpus linguistic methods, the thesis identifies emergent linguistic features in digital youth language in German and sheds further light on lexical and morphosyntactic changes and trends in the language of young people over the period 2008‐2018. The study has also further developed and augmented existing analytical frameworks to widen the scope of their application to orthographic features associated with digital writing.
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.
This paper presents the application of the <tiger2/> format to various linguistic scenarios with the aim of making it the standard serialisation for the ISO 24615 [1] (SynAF) standard. After outlining the main characteristics of both the SynAF metamodel and the <tiger2/> format, as extended from the initial Tiger XML format [2], we show through a range of different language families how <tiger2/> covers a variety of constituency and dependency based analyses.
This introductory tutorial describes a strictly corpus-driven approach for uncovering indications for aspects of use of lexical items. These aspects include ‘(lexical) meaning’ in a very broad sense and involve different dimensions, they are established in and emerge from respective discourses. Using data-driven mathematical-statistical methods with minimal (linguistic) premises, a word’s usage spectrum is summarized as a collocation profile. Self-organizing methods are applied to visualize the complex similarity structure spanned by these profiles. These visualizations point to the typical aspects of a word’s use, and to the common and distinctive aspects of any two words.