Corpus Linguistics
To make spoken language easier to analyze, recordings held on audio or video cassettes must first be transcribed. In this process, the representation of simultaneous speech, e.g. in score notation (Partiturschreibweise), and the annotation of situations, of the behaviour of individual discourse participants and the like play an important role. The variety of transcribed details and levels of information requires a differentiated coding scheme. In addition, conversation analysts want not only to locate particular passages in the written material (the transcript) but also to play back the corresponding acoustic event, which requires that text and recording be synchronized.
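The two requirements named above, representing simultaneous speech and linking transcript passages to the recording, can be illustrated with a minimal Python sketch. It is not the coding scheme of the paper: the `Segment` class, its fields, the sample utterances and the time stamps are all hypothetical and serve only to show the idea of a time-aligned transcript.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed stretch of speech (hypothetical structure)."""
    speaker: str
    start: float  # offset into the recording, in seconds
    end: float
    text: str


def simultaneous(segments):
    """Return pairs of segments by different speakers whose time spans intersect."""
    pairs = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if a.speaker != b.speaker and a.start < b.end and b.start < a.end:
                pairs.append((a, b))
    return pairs


def find(segments, word):
    """Return the (start, end) times of every segment containing the word."""
    return [(s.start, s.end) for s in segments if word in s.text.split()]


# Invented sample data, not taken from any corpus.
TRANSCRIPT = [
    Segment("A", 0.0, 2.5, "ja das stimmt"),
    Segment("B", 2.0, 4.0, "aber nicht immer"),
    Segment("A", 4.0, 5.0, "hm"),
]
```

With such a structure, `find(TRANSCRIPT, "aber")` yields the time span that a player would need in order to replay the passage, and `simultaneous(TRANSCRIPT)` lists the stretches that a score notation would place on parallel lines.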
Following an introduction that briefly outlines the history of and motivation for the components described in this paper, we summarize linguistic desiderata for making conversation corpora accessible and then present a model for discourse transcripts that forms the technical basis for the computer-based discourse-analytic exploitation of conversation corpora at the Institut für Deutsche Sprache (IDS). We then sketch the technical process of corpus creation and describe three tools used in it: the DIDA editor, the SPRAT alignment system and the DMM converter. Finally, we present the full-text database COSMAS II, with which the resulting SGML discourse transcripts are analyzed. The focus is on the ability of COSMAS II to support a wide range of queries by means of the discourse metrics derived from the discourse structure, and to formulate them as SGML queries with the help of its graphical query component. We close with a brief look at the planned further development.
The motivation for this article is to describe a methodology for interrelating and analyzing language- and theory-specific corpus data from various languages. As an example phenomenon we use information structure (IS, see [3]) in treebanks from three languages: Spanish, Korean and Japanese. Korean and Japanese are typologically close, while both are typologically different from Spanish. The problem of annotating IS is therefore that there are diverging language-specific formal linguistic means for the realization of IS functions (like “topicalization / contrast”) on various levels such as prosody, morphology and word order. Hence, it is necessary to describe the relations between language-specific formal means and functional views on IS, and how to operationalize these relations for corpus analysis.
We present the annotation of information structure in the MULI project. To learn more about the information structuring means in prosody, syntax and discourse, theory-independent features were defined for each level. We describe the features and illustrate them on an example sentence. To investigate the interplay of features, the representation has to allow for inspecting all three layers at the same time. This is realised by a stand-off XML mark-up with the word as the basic unit. The theory-neutral XML stand-off annotation allows integrating this resource with other linguistic resources such as the Tiger Treebank for German or the Penn Treebank for English.
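The idea of stand-off mark-up with the word as the basic unit can be sketched in Python as follows. This is an illustration only, not the MULI format: every element name, attribute name (`id`, `refs`), label and the example sentence are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical base layer: every word carries an id that other layers point to.
WORDS = """<words>
  <w id="w1">Peter</w>
  <w id="w2">bought</w>
  <w id="w3">a</w>
  <w id="w4">book</w>
</words>"""

# Hypothetical stand-off layers; each element refers to words via "refs".
LAYERS = {
    "prosody": '<prosody><accent type="H*" refs="w4"/></prosody>',
    "syntax": '<syntax><phrase cat="NP" refs="w1"/>'
              '<phrase cat="NP" refs="w3 w4"/></syntax>',
    "discourse": '<discourse><entity status="new" refs="w3 w4"/></discourse>',
}


def layer_view(words_xml, layers):
    """Merge all layers into one per-word view keyed by word id."""
    view = {w.get("id"): {"form": w.text} for w in ET.fromstring(words_xml)}
    for name, xml_text in layers.items():
        for elem in ET.fromstring(xml_text):
            # Everything except the pointer attribute counts as a label.
            labels = {k: v for k, v in elem.attrib.items() if k != "refs"}
            for wid in elem.get("refs", "").split():
                view[wid].setdefault(name, []).append(labels)
    return view


VIEW = layer_view(WORDS, LAYERS)
```

Because the layers share nothing but word ids, `VIEW["w4"]` shows the prosodic, syntactic and discourse labels of one word side by side, and a further layer could be added without touching the existing ones.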
The aim of this paper is to highlight the need for corpora that have been annotated on the basis of acoustic information. The acoustic information should be coded in features or properties and is needed to inform further processing systems, i.e. to provide a basis for a speech recognition system that uses linguistic information. Feature annotation of existing corpora in combination with segmental annotation can provide powerful training material for speech recognition systems, but it also poses a challenge for the further processing of features into segments and syllables. We present here the theoretical preliminaries for the multilingual feature extraction system that we are currently working on.
Linguistische Korpora
(2004)
This paper proposes a methodology for querying linguistic data represented in different corpus formats. Examples of the need for queries over such heterogeneous resources are the corpus-based analysis of multimodal phenomena like the interaction of gestures and prosodic features, or syntax-related phenomena like information structure which exceed the expressive power of a tree-centered corpus format. Query languages (QLs) currently under development are strongly connected to corpus formats, like the NITE Object Model (NOM, Carletta et al., 2003) or the Meta-Annotation Infrastructure for ATLAS (MAIA, Laprun and Fiscus, 2002). The parallel development of linguistic query languages and corpus formats is due to the fact that general-purpose query languages like XQuery (Boag et al., 2003) do not fulfill the changing needs of linguistically motivated queries, e.g. to give access to (non-)hierarchically organized, theory- and language-dependent annotations of multimodal signals and/or text. This leads to the problem that existing corpus formats and query languages are hard to reuse. For unforeseen tasks they have to be redeveloped and reimplemented at considerable cost in time and money. This paper describes an approach for overcoming these problems and a sample application.
This paper describes a corpus of Japanese task-oriented dialogues, i.e. its data, annotations, analysis methodology and preliminary results for the modeling of co-referential phenomena. Current corpus-based approaches to co-reference concentrate on textual data from English or other European languages. Hence, the emerging language-general models of co-reference miss input from dialogue data of non-European languages. We aim to fill this gap and contribute to a model of co-reference on various language-specific and language-general levels.
Overlap in markup occurs where some markup structures do not nest, such as where the structural division of the text into lists, sections, etc., differs from the syntactic division of the text into sentences and phrases. The Multiple Annotation solution to this problem (redundant encoding in multiple forms) has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. But it has the significant disadvantage of independence of the separate files. These multiply annotated files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) can be programmatically derived and used together for editing, for inference, or for unification of the multiply annotated documents.
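How the shared text can serve as the implicit link between separately annotated files may be sketched in Python. The sketch below is not the Prolog or XML representation mentioned in the abstract; it only derives character spans from two hypothetical annotations of the same text and reports the element pairs that cross. The sample documents and all element names are invented.

```python
import xml.etree.ElementTree as ET

# Two hypothetical annotations of the same text, kept as separate documents.
STRUCTURE = "<doc><item>Buy milk. Call</item><item> Anna.</item></doc>"
SYNTAX = "<doc><s>Buy milk.</s><s> Call Anna.</s></doc>"


def spans(xml_text):
    """Return the plain text and a (tag, start, end) character span per element."""
    root = ET.fromstring(xml_text)
    out = []
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            walk(child)
            pos += len(child.tail or "")
        out.append((elem.tag, start, pos))

    walk(root)
    return "".join(root.itertext()), out


def crossing(a_spans, b_spans):
    """Return span pairs that overlap without one containing the other."""
    pairs = []
    for a in a_spans:
        for b in b_spans:
            if a[1] < b[1] < a[2] < b[2] or b[1] < a[1] < b[2] < a[2]:
                pairs.append((a, b))
    return pairs


TEXT_A, SPANS_A = spans(STRUCTURE)
TEXT_B, SPANS_B = spans(SYNTAX)
OVERLAPS = crossing(SPANS_A, SPANS_B)
```

Comparing `TEXT_A` with `TEXT_B` checks that both files really annotate the same text, which is the precondition for treating them as one interrelated unit. `OVERLAPS` then names exactly those elements that could not be nested in a single XML hierarchy.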