Corpus Linguistics
This contribution examines how corpus-linguistic methods can be used to gain insights into the structure of meaning paraphrases in dictionaries. These insights are intended to support a comprehensive and systematic description of that structure, e.g. with a view to optimizing meaning paraphrases for so-called electronic dictionaries or to extracting lexical-semantic information for NLP purposes.
In many European languages, propositional arguments (PAs) can be realized as different types of structures. Cross-linguistically, complex structures with PAs show a systematic correlation between the strength of the semantic bond and the syntactic union (cf. Givón 2001; Wurmbrand/Lohninger 2023). Different languages also show similarities with respect to the (lexical) licensing of different PAs (cf. Noonan 1985; Givón 2001; Cristofaro 2003 on different predicate types). On a more fine-grained level, however, languages vary both in the syntactic-semantic properties of PAs and in their licensing and usage. This presentation takes a multi-contrastive view of different types of PAs as syntactic subjects and objects by looking at five European languages: EN, DE, IT, PL and HU. Our goal is to identify the parameters of variation in the clausal domain with PAs and thereby to contribute to a better understanding of the individual language systems on the one hand and of the nature of linguistic variation in the clausal domain on the other. Phenomena and Methodology: We investigate the following types of PAs: direct object (DO) clauses (1), prepositional object (PO) clauses (2), subject clauses (3), and nominalizations (4, 5). Additionally, we discuss clause union phenomena (6, 7). The parameters analyzed include, among others, finiteness, the linear position of the PA, the presence or absence of a correlative element, the presence or absence of a complementizer, and the lexical-semantic class of the embedding verb. The phenomena are analyzed on the basis of corpus data (using mono- and multilingual corpora), experimental data (acceptability judgement surveys) or introspective data.
In this paper, we present first results of training a classifier that sorts Russian texts into levels of difficulty. To classify texts into two levels of difficulty, we considered both surface-oriented features adopted from readability assessment and more linguistically informed, positional features. This text classification is the main focus of our Levelled Study Corpus of Russian (LeStCoR), in which we aim to build a corpus adapted for language learning purposes, selecting simpler texts for beginner second language learners and more complex texts for advanced learners. The most discriminative feature in our pilot study was a lexical feature that approximates how accessible the vocabulary is to the second language learner, measured as the proportion of familiar words in the texts. The best feature setting achieved an accuracy of 0.91 on a pilot corpus of 209 texts.
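The lexical feature named in this abstract, the proportion of familiar words in a text, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the LeStCoR implementation: the function names, the regex tokenizer, the toy word list and the 0.8 threshold are hypothetical, and the abstract does not say how familiarity was determined or how the feature was combined with others.

```python
import re

def familiar_word_proportion(text, familiar_words):
    """Return the share of tokens in `text` that occur in `familiar_words`.

    Tokenization is a plain regex split on word characters (an assumption);
    a real system for Russian would probably lemmatize first.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in familiar_words)
    return known / len(tokens)

def classify_difficulty(text, familiar_words, threshold=0.8):
    """Toy two-level decision: 'simple' if enough words are familiar.

    The threshold is hypothetical; the abstract describes a trained
    classifier, not a fixed cut-off.
    """
    share = familiar_word_proportion(text, familiar_words)
    return "simple" if share >= threshold else "complex"

# Hypothetical learner vocabulary (lower-cased word forms).
learner_vocabulary = {"я", "читаю", "книгу"}

easy_label = classify_difficulty("Я читаю книгу", learner_vocabulary)
hard_label = classify_difficulty("Я читаю монографию", learner_vocabulary)
```

In this toy setup the first sentence consists only of known word forms, while the second contains one unknown form out of three and falls below the assumed threshold.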
The paper reports on the results of a scientific colloquium dedicated to creating the standards and best practices needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is interoperability with respect to the structural representation of CMC genres, linguistic annotations, metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to take part in a discussion about standards for CMC corpora and about the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects currently underway all over Europe, there is a great window of opportunity for creating standards in a bottom-up approach.
The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox's Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software that is available to academic institutions free of charge, e.g. Helmut Schmid's TreeTagger. The second part focuses on the linguistic interpretability of the corpus annotations and on more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that, unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory- and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. Consequently, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher who uses the annotations blindly runs the risk of studying not the language but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them, and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology.
CONTRIBUTIONS TO THE STUDY OF GERMAN USAGE A CORPUS-BASED APPROACH
This paper outlines some basic assumptions and principles underlying corpus linguistics research and some application domains at the Institute for German Language in Mannheim. We briefly address three complementary but closely related tasks: first, the acquisition of very large corpora; second, research on statistical methods for automatically extracting information about associations between word configurations; and third, meeting the challenge of understanding the explanatory power of such methods both in theoretical linguistics and in other fields such as second language acquisition or lexicography. We argue that a systematic statistical analysis of huge bodies of text can reveal substantial insights into language usage and change, far beyond collocational patterning alone.
To make spoken language easier to analyze, recordings held on audio or video cassettes must first be transcribed. In this process, the representation of simultaneous speech, e.g. in score notation (Partiturschreibweise), and the annotation of situations, the behavior of individual discourse participants and the like play an important role. The variety of transcribed details and levels of information requires a differentiated coding scheme. Furthermore, conversation analysts want not only to locate particular passages in the written material (the transcript) but also to play back the corresponding acoustic event, which requires text and recording to be synchronized.
After an introduction that briefly outlines the history of and motivation for the components described in this paper, we summarize linguistic desiderata for making conversation corpora accessible and then present a model for discourse transcripts that forms the technical basis for the computer-supported discourse-analytic exploitation of conversation corpora at the Institut für Deutsche Sprache (IDS). We then sketch the technical process of corpus creation, followed by a description of three tools used in it: the DIDA editor, the SPRAT alignment system, and the DMM converter. Finally, we present the full-text database COSMAS II, which is used to analyze the resulting SGML discourse transcripts. The focus is on the ability of COSMAS II to support a wide range of search queries by means of the discourse metric derived from the discourse structure, and to formulate them as SGML search queries using the graphical query component. The paper closes with a brief look at planned further development.