Korpuslinguistik (Corpus Linguistics)
Within cognitive linguistics, there is an increasing awareness that the study of linguistic phenomena needs to be grounded in usage. Ideally, research in cognitive linguistics should be based on authentic language use, its results should be replicable, and its claims falsifiable. Consequently, more and more studies now turn to corpora as a source of data. While corpus-based methodologies have increased in sophistication, the use of corpus data is also associated with a number of unresolved problems. The study of cognition through off-line linguistic data is, arguably, indirect, even if such data has desirable qualities such as being natural, representative and plentiful. Several topics in this context stand out as particularly pressing issues. This discussion note addresses (1) converging evidence from corpora and experimentation, (2) whether corpora mirror psychological reality, (3) the theoretical value of corpus linguistic studies of ‘alternations’, (4) the relation between corpus linguistics and grammaticality judgments, and, lastly, (5) the nature of explanations in cognitive corpus linguistics. We do not claim to resolve these issues nor to cover all possible angles; instead, we strongly encourage reactions and further discussion.
How (and when) do speakers generalise from memorised exemplars of a construction to a productive schema? This paper offers a novel take on this issue through a corpus-based approach to semantic extension processes. Focusing on clusters of German ADJ N expressions involving the heavily polysemous adjective tief ‘deep’, it shows that type frequency (a commonly used measure of productivity) needs to be relativised to distinct semantic classes within the overall usage spectrum of a given construction in order to predict the occurrence of novel types within a particular region of this spectrum. Some methodological and theoretical implications for usage-based linguistic model building are considered.
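The idea of relativising type frequency to semantic classes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `type_frequencies` and `overall_type_frequency`, the class labels, and the toy list of nouns combined with tief are hypothetical and are not taken from the paper's data or method.

```python
from collections import defaultdict

def type_frequencies(tokens):
    """Count distinct noun types per semantic class.

    `tokens` is a list of (noun, semantic_class) pairs, one per corpus hit.
    """
    types_by_class = defaultdict(set)
    for noun, sem_class in tokens:
        types_by_class[sem_class].add(noun)
    return {sem_class: len(nouns) for sem_class, nouns in types_by_class.items()}

def overall_type_frequency(tokens):
    """Count distinct noun types across the whole construction."""
    return len({noun for noun, _ in tokens})

# Hypothetical toy observations of "tief + N"; the class labels are illustrative only.
tokens = [
    ("Wasser", "spatial"),
    ("Wasser", "spatial"),
    ("Tal", "spatial"),
    ("Trauer", "emotion"),
    ("Freude", "emotion"),
    ("Liebe", "emotion"),
    ("Ton", "pitch"),
]

per_class = type_frequencies(tokens)
overall = overall_type_frequency(tokens)
print(per_class)
print(overall)
```

In this sketch the overall type count hides the fact that the classes differ in how many types they host, which is the kind of difference a class-relative measure would expose.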
Linguistic corpora have been annotated by means of SGML-based markup languages for almost 20 years. We can, very roughly, differentiate between three distinct evolutionary stages of markup technologies. (1) Originally, single SGML tree-based document instances were deemed sufficient for the representation of linguistic structures. (2) Linguists began to realize that alternatives and extensions to the traditional model are needed. Formalisms such as, for example, NITE were proposed: the NITE Object Model (NOM) consists of multi-rooted trees. (3) We are now on the threshold of the third evolutionary stage: even NITE's very flexible approach is not suited for all linguistic purposes. As some structures cannot be modeled by multi-rooted trees, an even more flexible approach is needed in order to provide a generic annotation format that is able to represent genuinely arbitrary linguistic data structures.
We describe a general two-stage procedure for re-using a custom corpus for spoken language system development. It involves a transformation from character-based markup to XML, followed by DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).
Overlap in markup occurs where some markup structures do not nest, such as where the structural division of the text into lists, sections, etc., differs from the syntactic division of the text into sentences and phrases. The Multiple Annotation solution to this problem (redundant encoding in multiple forms) has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. However, it has the significant disadvantage that the separate files are independent of one another. These multiply annotated files can nevertheless be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) can be programmatically derived and used together for editing, for inference, or for unification of the multiply annotated documents.
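How the shared text can act as the implicit link between separately annotated files can be sketched as follows. This is a minimal illustration under stated assumptions, not the derivation described in the abstract: the helper names `spans` and `crosses`, the element names `doc`, `item` and `s`, and the sample sentence are all hypothetical. Each layer is reduced to character offsets over the same plain text, and the offsets are then compared to find structures that cross.

```python
import xml.etree.ElementTree as ET

def spans(xml_string):
    """Return (plain_text, [(tag, start, end), ...]) for one annotation layer."""
    root = ET.fromstring(xml_string)
    found = []
    parts = []
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Text following a child still belongs to the parent element.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        found.append((elem.tag, start, pos))

    walk(root)
    return "".join(parts), found

def crosses(a, b):
    """True if two spans overlap without one containing the other."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

# Two hypothetical layers over the same text: list items vs. sentences.
layer_a = "<doc><item>Buy milk. Then</item><item> go home.</item></doc>"
layer_b = "<doc><s>Buy milk.</s> <s>Then go home.</s></doc>"

text_a, spans_a = spans(layer_a)
text_b, spans_b = spans(layer_b)

# The identical text is what ties the independent files together.
assert text_a == text_b

crossing = [(a, b) for a in spans_a for b in spans_b if crosses(a, b)]
print(crossing)
```

Because both layers are expressed as offsets into one string, a crossing pair such as a list item that ends inside a sentence can be detected even though neither file refers to the other.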
In recent decades, the investigation of spoken language has become increasingly important in linguistic research. However, the spoken word is a fleeting phenomenon which is difficult to analyse and which requires an elaborate process of examination and appraisal. The Institute for the German Language (Institut für Deutsche Sprache) has the largest collection of recordings of spoken German, the German Speech Archive (Deutsches Spracharchiv [DSAv]). Up to now, the inadequate processing and accessibility of the valuable material held by the DSAv has been regarded as its major shortcoming. A solution to this problem is at hand now that a start has been made with the systematic modernization of the DSAv and, in particular, with the digitalization of its material. In recent years, we have been able to systematically exploit the unique opportunities provided by a new and easier form of access to the spoken language via the recorded sound signal, which can be realized digitally in the computer, and its linkage to the corresponding texts and documentary data. Through the integration of the existing data about the corpora and of the written versions of the texts into an information and full text database and through the linking of these data with the acoustic signal itself, it is now possible for us to construct a data pool which allows a better documentation of the material and provides rapid internal and external access to the sound recordings. Processed in such a way, the material of the German Speech Archive can now be regarded as having been saved for posterity. As a result, entirely new areas of inquiry and entirely new research perspectives have been opened up. This is true both for the work of the Institute itself and for linguistic research in German as a whole.
This work presents the most important syntactic, prosodic, semantic, and pragmatic properties of causal and conditional connectives in spoken German.
The study formulates the necessary theoretical foundations and shows the complex interaction of several factors that affect the interpretation of an utterance. Empirical data show that the contextual and pragmatic interpretation of the relations under investigation correlates strongly with their syntactic and prosodic patterns. However, this is not a one-to-one relationship, because the same readings can be marked differently by causal and conditional relations. On the basis of these results, the relationship between conditionality and causality is discussed.