Language in the 20th Century. Contemporary Language
We present the annotation of information structure in the MULI project. To learn more about the means of information structuring in prosody, syntax and discourse, theory-independent features were defined for each level. We describe the features and illustrate them on an example sentence. To investigate the interplay of features, the representation has to allow for inspecting all three layers at the same time. This is realised by a stand-off XML mark-up with the word as the basic unit. The theory-neutral XML stand-off annotation allows integrating this resource with other linguistic resources such as the Tiger Treebank for German or the Penn Treebank for English.
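A stand-off representation with the word as the basic unit can be sketched roughly as follows. This is a minimal illustration only, not the MULI format: the element names (`w`, `mark`), the attributes (`from`, `to`, `accent`, `cat`, `status`) and the example sentence are all assumptions made for this sketch. Each layer stores no text of its own and only points to word identifiers, so all three layers can be inspected together for any word.

```python
import xml.etree.ElementTree as ET

# Base layer (hypothetical format): one <w> element per word, each with an id.
WORDS_XML = """
<words>
  <w id="w1">Peter</w>
  <w id="w2">bought</w>
  <w id="w3">a</w>
  <w id="w4">book</w>
</words>
""".strip()

# Stand-off layers (hypothetical): no text, only references to word ids.
LAYERS_XML = {
    "prosody": '<layer><mark from="w4" to="w4" accent="H*"/></layer>',
    "syntax": '<layer><mark from="w3" to="w4" cat="NP"/></layer>',
    "discourse": '<layer><mark from="w1" to="w1" status="given"/></layer>',
}


def load_word_ids(xml_text):
    """Return the word ids of the base layer in document order."""
    return [w.get("id") for w in ET.fromstring(xml_text).iter("w")]


def annotations_for(word_id, word_ids, layers):
    """Collect, per layer, the marks whose word span covers word_id."""
    pos = word_ids.index(word_id)
    found = {}
    for name, xml_text in layers.items():
        for mark in ET.fromstring(xml_text).iter("mark"):
            start = word_ids.index(mark.get("from"))
            end = word_ids.index(mark.get("to"))
            if start <= pos <= end:
                # Keep only the feature attributes, not the span pointers.
                attrs = {k: v for k, v in mark.attrib.items()
                         if k not in ("from", "to")}
                found.setdefault(name, []).append(attrs)
    return found


word_ids = load_word_ids(WORDS_XML)
print(annotations_for("w4", word_ids, LAYERS_XML))
```

Because every layer refers only to word ids, a further layer (for instance, one derived from a treebank) could be added as another entry without touching the base text or the existing layers.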
We present an XML-based metadata standard for the documentation of speech and multimedia corpora that was developed at the Institute for German Language (IDS) in Mannheim, Germany. The IDS is one of the major institutions providing German speech and language corpora to researchers. These corpora stem from many different sources and were previously documented in a rather heterogeneous fashion using a variety of data models and formats. In order to unify the documentation for existing and future corpora, the IDS-internal Archive for Spoken German collaborated with several projects and developed a set of standardised XML metadata schemas. These XML schemas build on existing internal and external documentation schemas (such as IMDI) and take into account the workflow of speech corpus production. In order to minimise redundancy, separate schemas were designed for projects, speakers, recording sessions, and entire corpora. The resulting schemas are tested in ongoing speech and multimedia projects at the IDS and are regularly revised. They are accompanied by element definitions, guidelines, and examples. In addition, a mapping to IMDI will be provided.
The metadata management system for speech corpora “memasysco” has been developed at the Institut für Deutsche Sprache (IDS) and is applied for the first time to document the speech corpus “German Today”. memasysco is based on a data model for the documentation of speech corpora and contains two generic XML schemas that drive data capture, XML native database storage, dynamic publishing, and information retrieval. The development of memasysco’s information architecture was mainly based on the ISLE MetaData Initiative (IMDI) guidelines for publishing metadata of linguistic resources. However, since we also have to support the corpus management process in research projects at the IDS, we need a finer atomic granularity for some documentation components as well as more restrictive categories to ensure data integrity. The XML metadata of different speech corpus projects are centrally validated and natively stored in an Oracle XML database. The extension of the system to the management of annotations of audio and video signals (e.g. orthographic and phonetic transcriptions) is planned for the near future.
The current paper presents a corpus containing 35 dialogues of spontaneously spoken southern German, including half an hour of articulography for 13 of the speakers. Speakers were seated in separate recording chambers, mimicking a telephone call, and recorded on individual audio channels. The corpus provides manually corrected word boundaries and automatically aligned segment boundaries. Annotations are provided in the Praat format. In addition to the audio recordings, the corpus includes a detailed questionnaire filled out by the speakers, assessing, among other things, their audio-visual consumption habits.
The aim of this paper is to highlight the actual need for corpora that have been annotated based on acoustic information. The acoustic information should be coded in features or properties and is needed to inform further processing systems, i.e. to provide a basis for a speech recognition system using linguistic information. Feature annotation of existing corpora in combination with segmental annotation can provide powerful training material for speech recognition systems, but it also poses challenges for the further processing of features into segments and syllables. We present here the theoretical preliminaries for the multilingual feature extraction system that we are currently working on.
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.
Report on the 15th Workshop on Conversation Research (Arbeitstagung zur Gesprächsforschung), 30 March to 1 April 2011, Mannheim
(2011)
Tools for Working with Corpora of Spoken Language. Text-Audio Alignment and COSMAS II
(2000)
The research project “German Today” aims to determine the amount of regional variation in (near-)standard German spoken by young and older educated adults, and to identify and locate the regional features. To this end, an extensive corpus of read and spontaneous speech is currently being compiled. German is a so-called pluricentric language. With our corpus we aim to determine whether national or regional standards really exist. Furthermore, the linguistic variation due to different contextual styles (read vs. spontaneous speech) will be analysed. Finally, the corpus will enable us to investigate whether linguistic change has occurred in the domain of the German standard language. The main focus of all research questions is on phonetic variation (lexical variation is only of minor interest). Read and spontaneous speech of four secondary school students (aged seventeen to twenty) and two fifty- to sixty-year-olds is recorded in 160 cities throughout the German-speaking area of Europe. All participants read a number of short texts and word lists, name pictures, translate from English, and take part in a sociobiographic interview and a map task experiment. The resulting corpus will comprise over 1000 hours of orthographically and (in part) phonetically transcribed speech.