Refine
Year of publication
- 2009 (17) (remove)
Document Type
- Part of a Book (8)
- Conference Proceeding (4)
- Article (3)
- Book (1)
- Working Paper (1)
Language
- English (17) (remove)
Has Fulltext
- yes (17)
Is part of the Bibliography
- no (17) (remove)
Keywords
- Korpus <Linguistik> (7)
- Deutsch (4)
- Annotation (3)
- Computerlinguistik (2)
- Institut für Deutsche Sprache <Mannheim> (2)
- Schriftsprache (2)
- Verb (2)
- Auszeichnungssprache (1)
- Automatische Sprachanalyse (1)
- Automatisches Beweisverfahren (1)
Publicationstate
- Veröffentlichungsversion (12)
- Postprint (4)
- Zweitveröffentlichung (1)
Reviewstate
- (Verlags)-Lektorat (17) (remove)
Publisher
This dossier consists of an introduction to the region under study, followed by six sections each dealing with a specific level of the education system. These brief descriptions contain factual information presented in a readily accessible way. Sections eight to ten cover research, prospects, and summary statistics. For detailed information and political discussions about language use at the various levels of education, the reader is referred to other sources with a list of publications.
Concurrent standardization as a necessity: The genesis of the new official orthographic guidelines
(2009)
The new official orthographic guidelines were brought into force by the official state authorities on August 1st, 1998 and its principle goals were a standardized representation of the guidelines and a «gentle simplification in respect of content». This regulation was not supported by the public and in fact it was the starting point for a struggle for conceptual solutions and a quest for the achievement of' a consensus between different possible norms. Since orthography is an officially codified standard taking up a prominent position among linguistic standards, it is of particular socio-political importance. It was the foremost task of the Council for German Orthography (Rat für deutsche Rechtschreibung), instituted in December 2004, to elaborate a compromise in order to bring the «Orthographical war» (Die Zeit) to an end, which was led enthusiastically for more than a decade. - The concern of this article is to classify historically the agreement reached in 2006. Against this background, it can be stated that official guidelines will only be accepted, if they are based upon the usage in writing and if they take into account the interests of the reader. Both principles are characterizing the proposal made by the Council for German Orthography. An outlook on the Council's activities concerning orthographic standardization expected in the future will conclude this article.
From Proof Texts to Logic. Discourse Representation Structures for Proof Texts in Mathematics
(2009)
We present an extension to Discourse Representation Theory that can be used to analyze mathematical texts written in the commonly used semi-formal language of mathematics (or at least a subset of it). Moreover, we describe an algorithm that can be used to check the resulting Proof Representation Structures for their logical validity and adequacy as a proof.
In this paper we present an approach to faceted search in large language resource repositories. This kind of search which enables users to browse through the repository by choosing their personal sequence of facets heavily relies on the availability of descriptive metadata for the objects in the repository. This approach therefore informs the collection of a minimal set of metatdata for language resources. The work described in this paper has been funded by the EC within the ESFRI infrastructure project CLARIN.
While written corpora can be exploited without any linguistic annotations, speech corpora need at least a basic transcription to be of any use for linguistic research. The basic annotation of speech data usually consists of time-aligned orthographic transcriptions. To answer phonetic or phonological research questions, phonetic transcriptions are needed as well. However, manual annotation is very time-consuming and requires considerable skill and near-native competence. Therefore it can take years of speech corpus compilation and annotation before any analyses can be carried out. In this paper, approaches that address the transcription bottleneck of speech corpus exploitation are presented and discussed, including crowdsourcing the orthographic transcription, automatic phonetic alignment, and query-driven annotation. Currently, query-driven annotation and automatic phonetic alignment are being combined and applied in two speech research projects at the Institut für Deutsche Sprache (IDS), whereas crowdsourcing the orthographic transcription still awaits implementation.
In spite of the obvious importance that is accorded to the notion grammatical construction in any approach that sees itself as a construction grammar (CxG), there is as yet no generally accepted definition of the term across different variants of the framework. In particular, there are different assumptions about which additional requirements a given structure has to meet in order to be recognized as a construction besides being a ‘form-meaning pair’. Since the choice of a particular definition will determine the range of both relevant phenomena and concrete observations to be considered in empirical research within the framework, the issue is not just a mere terminological quibble but has important methodological repercussions especially for quantitative research in areas such as corpus linguistics. The present study illustrates some problems in identifying and delimiting such patterns in naturally occurring text and presents arguments for a usage-based interpretation of the term grammatical construction.
TEI Feature Structures as a Representation Format for Multiple Annotation and Generic XML Documents
(2009)
Feature structures are mathematical entities (rooted labeled directed acyclic graphs) that can be represented as graph displays, attribute value matrices or as XML adhering to the constraints of a specialized TEI tag set. We demonstrate that this latter ISO-standardized format can be used as an integrative storage and exchange format for sets of multiple annotation XML documents. This specific domain of application is rooted in the approach of multiple annotations, which marks a possible solution for XML-compliant markup in scenarios with conflicting annotation hierarchies. A more extreme proposal consists in the possible use as a meta-representation format for generic XML documents. For both scenarios our strategy concerning pertinent feature structure representations is grounded on the XDM (XQuery 1.0 and XPath 2.0 Data Model). The ubiquitous hierarchical and sequential relationships within XML documents are represented by specific features that take ordered list values. The mapping to the TEI feature structure format has been implemented in the form of an XSLT 2.0 stylesheet. It can be characterized as exploiting aspects of both the push and pull processing paradigm as appropriate. An indexing mechanism is provided with regard to the multiple annotation documents scenario. Hence, implicit links concerning identical primary data are made explicit in the result format. In comparison to alternative representations, the TEI-based format does well in many respects, since it is both integrative and well-formed XML. However, the result documents tend to grow very large depending on the size of the input documents and their respective markup structure. This may also be considered as a downside regarding the proposed use for generic XML documents. On the positive side, it may be possible to achieve a hookup to methods and applications that have been developed for feature structure representations in the fields of (computational) linguistics and knowledge representation.
We present data-driven methods for the acquisition of LFG resources from two German treebanks. We discuss problems specific to semi-free word order languages as well as problems arising from the data structures determined by the design of the different treebanks. We compare two ways of encoding semi-free word order, as done in the two German treebanks, and argue that the design of the TiGer treebank is more adequate for the acquisition of LFG resources. Furthermore, we describe an architecture for LFG grammar acquisition for German, based on the two German treebanks, and compare our results with a hand-crafted German LFG grammar.
The paper contributes to the raising vs. control debate with respect to modals through (A) novel data; (B) the investigation of a domain in which it has proven particularly problematic: volitional modality. We analyze oblique arguments of experiencer verbs embedded under German wollen ‘want’ and propose that they support both generalized raising and the abandonment of the classical version of the Theta Criterion. Byproducts of the analysis include a syntactic account involved in a class of datives in the language together with the initial characterization of a related modal in German which is expressed through the same item as volition and which we term weak.
In this paper we address the question of what is needed, in terms of morphosyntactic encoding, to relate a so-called verb-specific modifier to a nominal head. For the purposes of this paper we shall assume that the notion of a verb-specific modifier includes adverbs and their phrasal or clausal projections, adpositional phrases, and noun phrases featuring a particular semantic case such as locative or instrumental. Noun-specific modifiers, in turn, are considered to be first and foremost adjectives and adjective phrases, next participles and their phrasal projections and, finally, relative clauses.1 The basic motivation underlying this distinction relates to markedness.
This introductory tutorial describes a strictly corpus-driven approach for uncovering indications for aspects of use of lexical items. These aspects include ‘(lexical) meaning’ in a very broad sense and involve different dimensions, they are established in and emerge from respective discourses. Using data-driven mathematical-statistical methods with minimal (linguistic) premises, a word’s usage spectrum is summarized as a collocation profile. Self-organizing methods are applied to visualize the complex similarity structure spanned by these profiles. These visualizations point to the typical aspects of a word’s use, and to the common and distinctive aspects of any two words.
This article introduces the topic of ‘‘Multilingual language resources and interoperability’’. We start with a taxonomy and parameters for classifying language resources. Later we provide examples and issues of interoperatability, and resource architectures to solve such issues. Finally we discuss aspects of linguistic formalisms and interoperability.
We report on finished work in a project that is concerned with providing methods, tools, best practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainability platform implemented by the authors is described.
The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox' Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software for which academic licenses are available free of charge for academic institutions, e.g. Helmut Schmid's Tree Tagger. The second part focuses on the linguistic interpretability of the corpus annotations and more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher using the annotations blindly will run the risk of not actually studying the language but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology.
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation comprises the separation of conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for the representation of these annotations is a multi-rooted tree that again can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed as well as the advantages and drawbacks associated with the use of the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.