Refine
Year of publication
- 2007 (15) (remove)
Document Type
- Conference Proceeding (15) (remove)
Language
- English (15)
Has Fulltext
- yes (15)
Is part of the Bibliography
- no (15)
Keywords
- Korpus <Linguistik> (10)
- Deutsch (4)
- Annotation (3)
- Syntaktische Analyse (3)
- Kollokation (2)
- Metadaten (2)
- Anapher <Syntax> (1)
- Antonym (1)
- Archiv für Gesprochenes Deutsch (AGD) (1)
- Auszeichnungssprache (1)
Publicationstate
- Veröffentlichungsversion (15) (remove)
Reviewstate
- (Verlags)-Lektorat (10)
- Peer-Review (2)
- Review-Status-unbekannt (1)
Publisher
- University of Birmingham (3)
- University of Illinois (3)
- Association for Computational Linguistics (2)
- Extreme Markup Languages Conference (1)
- Northern European Association for Language Technology (1)
- Universita degli Studi di Bologna (1)
- University of Brimingham (1)
- University of Helsinki (1)
- University of Tartu (1)
- Universität des Saarlandes (1)
We present an XML-based metadata standard for the documentation of speech and multimedia corpora that was developed at the Institute for German Language (IDS) in Mannheim, Germany. The IDS is one of the major institutions providing German speech and language corpora to researchers. These corpora stem from many different sources and were previously documented in a rather heterogeneous fashion using a variety of data models and formats. In order to unify the documentation for existing and future corpora, the IDS- internal Archive for Spoken German collaborated with several projects and developed a set of standardised XML metadata schemas. These XML schemas build on existing internal and external documentation schemas (such as IMDI) and take into account the workflow of speech corpus production. In order to minimise redundancy, separate schemas were designed for projects, speakers, recording sessions, and entire corpora. The resulting schemas are tested in ongoing speech and multi-media projects at the IDS and are regularly revised. They are accompanied by element definitions, guidelines, and examples. In addition, a mapping to IMDI will be provided.
Evaluating phonological status: significance of paradigm uniformity vs. prosodic grouping effects
(2007)
A central concern of linguistic phonetics is to define criteria for determining the phonological status of sounds or sound properties observed in phonetic surface form. Based on acoustic measurements we show that the occurrence of syllabic sonorants vs. schwa-sonorant sequences in German is determined exclusively by segmental and prosodic structure, with no paradigm uniformity effects. We argue that these findings are consistent with a uniform representation of syllabic sonorants as schwa sonorant sequences in the lexicon. The stability of schwa in CVC-suffixes (e.g. the German diminutive suffix -chen), as opposed to its phonetic absence in a segmentally comparable underived context, is argued to be conditioned by the prosodic organisation of such suffixes external to the phonological word of the stem.
Trubetzkoy's recognition of a delimitative function of phonology, serving to signal boundaries between morphological units, is expressed in terms of alignment constraints in Optimality Theory, where the relevant constraints require specific morphological boundaries to coincide with phonological structure (Trubetzkoy 1936, 1939, McCarthy & Prince 1993). The approach pursued in the present article is to investigate the distribution of phonological boundary signals to gain insight into the criteria underlying morphological analysis. The evidence from English and Swedish suggests that necessary and sufficient conditions for word-internal morphological analysis concern the recognizability of head constituents, which include the rightmost members of compounds and head affixes. The claim is that the stability of word-internal boundary effects in historical perspective cannot in general be sufficiently explained in terms of memorization and imitation of phonological word form. Rather, these effects indicate a morphological parsing mechanism based on the recognition of word-internal head constituents. Head affixes can be shown to contrast systematically with modifying affixes with respect to syntactic function, semantic content, and prosodic properties. That is, head affixes, which cannot be omitted, often lack inherent meaning and have relatively unmarked boundaries, which can be obscured entirely under specific phonological conditions. By contrast, modifying affixes, which can be omitted, consistently have inherent meaning and have stronger boundaries, which resist prosodic fusion in all phonological contexts. While these correlations are hardly specific to English and Swedish it remains to be investigated to which extent they hold cross-linguistically. The observation that some of the constituents identified on the basis of prosodic evidence lack inherent meaning raises the issue of compositionality. I will argue that certain systematic aspects of word meaning cannot be captured with reference to the syntagmatic level, but require reference to the paradigmatic level instead. The assumption is then that there are two dimensions of morphological analysis: syntagmatic analysis, which centers on the criteria for decomposing words in terms of labelled constituents, and paradigmatic analysis, which centers on the criteria for establishing relations among (whole) words in the mental lexicon. While meaning is intrinsically connected with paradigmatic analysis (e.g. base relations, oppositeness) it is not essential to syntagmatic analysis.
On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees
(2007)
The Generalised Architecture for Sustainability (GENAU) provides a framework for the transformation of single-file, multi-layer annotations into multi-rooted trees. By employing constraints expressed in XCONCUR-CL, this procedure can be performed lossless, i.e., without losing information, especially with regard to the nesting of elements that belong to multiple annotation layers. This article describes how different types of linguistic corpora can be transformed using specialised tools, and how constraint rules can be applied to the resulting multi-rooted trees to add an additional level of validation.
We present a corpus-driven approach to the study of multi-word expressions, which constitute a significant part of. As a data basis, we use collocation profiles computed from DeReKo (Deutsches Referenzkorpus), the largest available collection of written German which has approximately two billion word tokens and is located at the Institute for the German Language (IDS). We employ a strongly usage-based approach to multi-word expressions, which we think of as conventionalised patterns in language use that manifest themselves in recurrent syntagmatic patterns of words. They are defined by their distinct function in language. To find multi-word expressions, we allow ourselves to be guided by corpus data and statistical evidence as much as possible, making interpretative steps carefully and in a monitored fashion. We develop a procedure of interpretation that leads us from the evidence of collocation profiles to a collection of recurrent word patterns and finally to multi-word expressions. When building up a collection of multi-word expressions in this fashion, it becomes clear that the expressions can be defined on different levels of generalisation and are interrelated in various ways. This will be reflected in the documentation and presentation of the findings. We are planning to add annotation in a way that allows grouping the multi-word expressions according to different features and to add links between them to reflect their relationships, thus constructing a network of multi-word expressions.
The central issue in corpus-driven linguistics is the detection and description of patterns in language usage. The features that constitute the notion of a pattern can be computed to a certain extent by statistical (collocation) methods, but a crucial part of the notion may vary depending on applications and users. Thus, typically, any computed collocation cluster will have to be interpreted hermeneutically. Often it might be captured by a generalized, more abstract pattern. We present a generic process model that supports the recognition, interpretation, and expression of the patterns inside and of the relations between clusters. By this, clusters can be merged virtually according to any notion of a 'pattern', and their relations can be exploited for different applications
Incompatibility (or co-hyponymy) is the most general type of semantic relation between lexical items, the meaning of which entails exclusion. Such items fall under a superordinate term or concept and denote sets which have no members in common (e.g. animal: dog-cat-mouse-lion-sheep; example from Cruse 2004). Traditionally, these have been of interest to lexical semanticists for the description of the structure of the lexicon. However, incompatibility is not just a relation that signifies a difference of meaning. This paper is a critical corpus-assisted re-evaluation of the phenomenon of incompatibility which argues that the relation in question sometimes also functions as a discourse marker. Incompatibles indicate recurrent intertextual patterns. This holds particularly true for socially or politically controversial lexical items such as Flexibilität (flexibility), Mobilität (mobility) or Globalisierung (globalisation). Corpus investigations of such words have revealed that among other semantically related terms, incompatibles have a crucial discourse focussing function. For the German lexical item Globalisierung, I will show how its lexical usage can be studied through a corpus-driven analysis of corresponding incompatibles. Incompatible terms are not contingent co-words but often occur in close contextual proximity and participate in regular syntagmatic structures (e.g. Globalisierung und Rationalisierung; Globalisierung und Modernisierung; Neoliberalismus, Globalisierung und Kapitalismus). Hence, these are easily extracted by conducting a computational collocation analysis. Such significant collocates provide a good insight into the discursive and thematic contexts of the search word. Following Teubert (2004), I will demonstrate how the meaning of such lexical items is constituted in discourse and how the examination of these particular collocates reveals their sense-constructing function and their pragmatic-discursive force. I will provide a brief discussion of the methodology used for such analyses, and I will explain why the complex semantic-pragmatic and thematic-communicative patterns implied in sets of incompatibles should be given a stronger emphasis in lexicography.