OPUS 4 | Sprache im 20. Jahrhundert. Gegenwartssprache

40 search hits

11 to 20

Multi-dimensional annotation of linguistic corpora for investigating information structure (2004)

Baumann, Stefan ; Brinckmann, Caren ; Hansen-Schirra, Silvia ; Kruijff, Geert-Jan ; Kruijff-Korbayová, Ivana ; Neumann, Stella ; Teich, Elke

We present the annotation of information structure in the MULI project. To learn more about the information structuring means in prosody, syntax and discourse, theory-independent features were defined for each level. We describe the features and illustrate them on an example sentence. To investigate the interplay of features, the representation has to allow for inspecting all three layers at the same time. This is realised by a stand-off XML mark-up with the word as the basic unit. The theory-neutral XML stand-off annotation allows integrating this resource with other linguistic resources such as the TIGER Treebank for German or the Penn Treebank for English.

The MULI Project: Annotation and Analysis of Information Structure in German and English (2004)

Baumann, Stefan ; Brinckmann, Caren ; Hansen-Schirra, Silvia ; Kruijff, Geert-Jan ; Kruijff-Korbayová, Ivana ; Neumann, Stella ; Steiner, Erich ; Teich, Elke ; Uszkoreit, Hans

The goal of the MULI (MUltiLingual Information structure) project is to empirically analyse information structure in German and English newspaper texts. In contrast to other projects in which information structure is annotated and investigated (e.g. in the Prague Dependency Treebank, which mirrors the basic information about the topic-focus articulation of the sentence), we do not annotate theory-biased categories like topic-focus or theme-rheme. Trying to be as theory-independent as possible, we annotate those features which are relevant to information structure and on the basis of which typical patterns, co-occurrences or correlations can be determined. We distinguish between three annotation levels: syntax, discourse and prosody. The data is based on the TIGER Corpus for German and the Penn Treebank for English, since the existing information on part-of-speech and syntactic structure can be re-used for our purposes. The actual annotation of an English example sequence illustrates our choice of categories on each level. Their combination offers the possibility to investigate how information structure is realised and can be interpreted.

Minimally Redundant Metadata Schemas for Speech Corpora (2007)

Brinckmann, Caren ; Dickgießer, Sylvia ; Gasch, Joachim

We present an XML-based metadata standard for the documentation of speech and multimedia corpora that was developed at the Institute for German Language (IDS) in Mannheim, Germany. The IDS is one of the major institutions providing German speech and language corpora to researchers. These corpora stem from many different sources and were previously documented in a rather heterogeneous fashion using a variety of data models and formats. In order to unify the documentation for existing and future corpora, the IDS-internal Archive for Spoken German collaborated with several projects and developed a set of standardised XML metadata schemas. These XML schemas build on existing internal and external documentation schemas (such as IMDI) and take into account the workflow of speech corpus production. In order to minimise redundancy, separate schemas were designed for projects, speakers, recording sessions, and entire corpora. The resulting schemas are tested in ongoing speech and multimedia projects at the IDS and are regularly revised. They are accompanied by element definitions, guidelines, and examples. In addition, a mapping to IMDI will be provided.

memasysco: XML schema based metadata management system for speech corpora (2008)

Gasch, Joachim ; Brinckmann, Caren ; Dickgießer, Sylvia

The metadata management system for speech corpora “memasysco” has been developed at the Institut für Deutsche Sprache (IDS) and is applied for the first time to document the speech corpus “German Today”. memasysco is based on a data model for the documentation of speech corpora and contains two generic XML schemas that drive data capture, XML native database storage, dynamic publishing, and information retrieval. The development of memasysco’s information architecture was mainly based on the ISLE MetaData Initiative (IMDI) guidelines for publishing metadata of linguistic resources. However, since we also have to support the corpus management process in research projects at the IDS, we need a finer atomic granularity for some documentation components as well as more restrictive categories to ensure data integrity. The XML metadata of different speech corpus projects are centrally validated and natively stored in an Oracle XML database. The extension of the system to the management of annotations of audio and video signals (e.g. orthographic and phonetic transcriptions) is planned for the near future.

Accessing Language Specific Linguistic Information for Triphone Model Generation: Feature Tables in a Speech Recognition System (2005)

Kanokphara, Supphanat ; Geumann, Anja ; Carson-Berndsen, Julie

This paper is concerned with a novel methodology for generating phonetic questions used in tree-based state tying for speech recognition. In order to implement a speech recognition system, language-dependent knowledge which goes beyond annotated material is usually required. The approach presented here generates phonetic questions for decision trees based on a feature table that summarizes the articulatory characteristics of each sound. On the one hand, this method allows better language-specific triphone models to be defined given only a feature table as linguistic input. On the other hand, the feature-table approach facilitates efficient definition of triphone models for other languages, since again only a feature table for the language in question is required. The approach is exemplified with speech recognition systems for English and Thai.

Converting a Corpus into a Hypertext: An Approach Using XML Topic Maps and XSLT (2002)

Lenz, Eva Anna ; Storrer, Angelika

In the context of the HyTex project, our goal is to convert a corpus into a hypertext, basing conversion strategies on annotations which explicitly mark up the text-grammatical structures and relations between text segments. Domain-specific knowledge is represented in the form of a knowledge net, using topic maps. We use XML as an interchange format. In this paper, we focus on a declarative rule language designed to express conversion strategies in terms of text-grammatical structures and hypertext results. The strategies can be formulated in a concise formal syntax which is independent of the markup, and which can be transformed automatically into executable program code.

Visualizing Language Change in a Corpus of Contemporary German (2017)

Fankhauser, Peter ; Kupietz, Marc

Visual Correlation for Detecting Patterns in Language Change (2017)

Fankhauser, Peter ; Kupietz, Marc

A Multilingual Phonological Resource Toolkit for Ubiquitous Speech Technology (2004)

Aioanei, Daniel ; Carson-Berndsen, Julie ; Geumann, Anja ; Kelly, Robert ; Neugebauer, Moritz ; Wilson, Stephen

This paper outlines the generation process of a specific computational linguistic representation termed the Multilingual Time Map, conceptually a multi-tape finite state transducer encoding linguistic data at different levels of granularity. The first component acquires phonological data from syllable-labeled speech data, the second component defines feature profiles, the third component generates feature hierarchies and augments the acquired data with the defined feature profiles, and the fourth component displays the Multilingual Time Map as a graph.

The Karl Eberhards Corpus of spontaneously spoken southern German in dialogues - audio and articulatory recordings (2016)

Arnold, Denis ; Tomaschek, Fabian

The current paper presents a corpus containing 35 dialogues of spontaneously spoken southern German, including half an hour of articulography for 13 of the speakers. Speakers were seated in separate recording chambers, mimicking a telephone call, and recorded on individual audio channels. The corpus provides manually corrected word boundaries and automatically aligned segment boundaries. Annotations are provided in the Praat format. In addition to the audio recordings, speakers filled out a detailed questionnaire assessing, among other things, their audio-visual consumption habits.
