Refine
Year of publication
Document Type
- Part of a Book (4500)
- Article (2966)
- Book (996)
- Conference Proceeding (688)
- Part of Periodical (308)
- Review (257)
- Other (151)
- Working Paper (83)
- Doctoral Thesis (68)
- Report (35)
Language
- German (8078)
- English (1765)
- Russian (145)
- French (38)
- Multiple languages (22)
- Spanish (16)
- Portuguese (14)
- Italian (9)
- Polish (7)
- Ukrainian (5)
Keywords
- Deutsch (5140)
- Korpus <Linguistik> (940)
- Wörterbuch (605)
- Konversationsanalyse (451)
- Rezension (423)
- Grammatik (405)
- Rechtschreibung (374)
- Gesprochene Sprache (361)
- Sprachgebrauch (356)
- Interaktion (339)
Publicationstate
- Veröffentlichungsversion (3883)
- Zweitveröffentlichung (1642)
- Postprint (395)
- Preprint (10)
- Erstveröffentlichung (8)
- Ahead of Print (7)
- (Verlags)-Lektorat (4)
- Hybrides Open Access (2)
- Verlags-Lektorat (1)
- Verlagsveröffentlichung (1)
Reviewstate
- (Verlags)-Lektorat (3836)
- Peer-Review (1596)
- Verlags-Lektorat (94)
- Peer-review (56)
- Qualifikationsarbeit (Dissertation, Habilitationsschrift) (44)
- Review-Status-unbekannt (14)
- Peer-Revied (12)
- Abschlussarbeit (Bachelor, Master, Diplom, Magister) (Bachelor, Master, Diss.) (10)
- (Verlags-)Lektorat (9)
- Verlagslektorat (5)
Publisher
- de Gruyter (1334)
- Institut für Deutsche Sprache (1091)
- Schwann (638)
- Narr (484)
- Leibniz-Institut für Deutsche Sprache (IDS) (263)
- De Gruyter (245)
- Niemeyer (200)
- Lang (184)
- Narr Francke Attempto (170)
- IDS-Verlag (144)
Different Views on Markup
(2010)
In this chapter, two different ways of grouping information represented in document markup are examined: annotation levels, referring to conceptual levels of description, and annotation layers, referring to the technical realisation of markup using e.g. document grammars. In many current XML annotation projects, multiple levels are integrated into one layer, often leading to the problem of having to deal with overlapping hierarchies. As a solution, we propose a framework for XML-based multiple, independent XML annotation layers for one text, based on an abstract representation of XML documents with logical predicates. Two realisations of the abstract representation are presented, a Prolog fact base format together with an application architecture, and a specification for XML native databases. We conclude with a discussion of projects that have currently adopted this framework.
Digital Text Collections, Linguistic Research Data, and Mashups: Notes on the Legal Situation
(2008)
Comprehensive data repositories are an essential part of practically all research carried out in the digital humanities nowadays. For example, library science, literary studies, and computational and corpus linguistics strongly depend on online archives that are highly sustainable and that contain not only digitized texts but also audio and video data as well as additional information such as metadata and arbitrary annotations. Current Web technologies, especially those that are related to what is commonly referred to as the Web 2.0, provide a number of novel functions such as multiuser editing or the inclusion of third-party content and applications that are also highly attractive for research applications in the areas mentioned above. Hand in hand with this development goes a high degree of legal uncertainty. The special nature of the data entails that, in quite a few cases, there are multiple holders of personal rights (mostly copyright) to different layers of data that often have different origins. This article discusses the legal problems of multiple authorships in private, commercial, and research environments. We also introduce significant differences between European and U.S. law with regard to the handling of this kind of data for scientific purposes.
Grundlage dieses Artikels* 1 ist das Verbundprojekt „Nachhaltigkeit linguistischer Daten“ der drei Sonderforschungsbereiche 441, 538 und 632, dessen Ziel es ist, Lösungen für die nachhaltige Verfügbarkeit der an den SFBs vorhandenen Korpora zu entwickeln. Ein zentraler Aspekt betrifft die Klärung der Rechtslage für die Nutzung und Weitergabe linguistischer Ressourcen, die durch das Urheber- sowie das Datenschutzrecht geschützt sind. Eine als indifferent wahrgenommene rechtliche Situation wird in der Praxis oft als das entscheidende Hindernis für die Weitergabe linguistischer Daten angeführt. Tatsächlich jedoch sind Nutzung und Weitergabe von Daten zu wissenschaftlichen Zwecken normativ geregelt. Problematisch ist oftmals die Einordnung der speziellen linguistischen Daten als Schutzgegenstand sowie die Tatsache, dass an linguistische Daten und Datensammlungen aufgrund ihrer komplexen und vielschichtigen Beschaffenheit durchaus mehrere Urheber Rechte besitzen können, die sich auf verschiedene Inhalte beziehen. Der Beitrag gibt einen Überblick über das geltende Recht sowie die juristischen und natürlichen Personen, die potentiell Rechte an linguistisch aufbereiteten Datenkollektionen besitzen. Es ist nicht Gegenstand dieses Artikels, rechtsverbindliche Aussagen zu treffen, die auf eine Nutzung und Weitergabe jedweder Daten angewandt werden. Der Artikel orientiert sich in seiner Struktur und thematischen Tiefe bewusst nicht an einem juristischen Publikum, sondern beschreibt die Problematik aus geisteswissenschaftlicher Perspektive. Zusammen mit einem Überblick über das vom Umgang mit linguistischen Datensammlungen betroffene Recht, das Urheberrechtsgesetz (Abschnitt 1) und das Bundesdatenschutzgesetz (Abschnitt 2), wird in den jeweiligen Abschnitten auch eine Klassifikation der Daten aus juristischer Sicht vorgenommen. Anschließend werden Lösungsansätze vorgestellt, die im Rahmen des o. g. Verbundprojektes erarbeitet werden (Abschnitt 3).
Die Kausalkonjunktionen denn, weil, da im Deutschen und perché, poiché, siccome im Italienischen
(2011)
Gegenstand des vorliegenden Aufsatzes sind die deutschen Kausalkonjunktionen denn, weil und da und ihre (partiellen) italienischen Äquivalente perché, poiché und siccome. Sie werden vergleichend in syntaktischer und semantischer Hinsicht untersucht, mit dem Ziel, Gemeinsamkeiten und Unterschiede zwischen ihnen aufzuweisen.
Gegenstand des Workshop-Beitrags ist die Verknüpfung heterogener linguistischer Ressourcen. Eine bedeutende Teilmenge von Ressourcen in der gegenwärtigen linguistischen Forschung und Anwendung besteht zum einen aus XML-annotierten Textdokumenten und zum anderen aus externen Ressourcen wie Grammatiken, Lexika oder Ontologien. Es wird eine Architektur vorgestellt, die eine Integration heterogener Ressourcen erlaubt, wobei die Methoden zur Integration unabhängig von der jeweiligen Anwendung sind und somit verschiedene Verknüpfungen ermöglichen. Eine exemplarische Anwendung der Methodologie ist die Analyse anaphorischer Beziehungen.
Many XML-related activities (e.g. the creation of a new schema) already address issues with different languages, scripts, and cultures. Nevertheless, a need exists for additional mechanisms and guidelines for more effective internationalization (i18n) and localization (l10n) in XML-related contents and processes. The W3C Internationalization Tag Set Working Group (W3C ITS WG) addresses this need and works on data categories, representation mechanisms and guidelines related to i18n and l10n support in the XML realm. This paper describes initial findings from the (W3C ITS WG). Furthermore, the paper discusses how these findings relate to specific schema languages, and complementary technologies like namespace sectioning, schema annotation and the description of processing chains. The paper exemplifies why certain requirements only can be met by a combination of technologies, and discusses these technologies.
An approach to the unification of XML (Extensible Markup Language) documents with identical textual content and concurrent markup in the framework of XML-based multi-layer annotation is introduced. A Prolog program allows the possible relationships between element instances on two annotation layers that share PCDATA to be explored and also the computing of a target node hierarchy for a well-formed, merged XML document. Special attention is paid to identity conflicts between element instances, for which a default solution that takes into account metarelations that hold between element types on the different annotation layers is provided. In addition, rules can be specified by a user to prescribe how identity conflicts should be solved for certain element types.
In this paper, we present the Multiple Annotation approach, which solves two problems: the problem of annotating overlapping structures, and the problem that occurs when documents should be annotated according to different, possibly heterogeneous tag sets. This approach has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. The files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) are described. These representations serve as a base for several applications.
The paper discusses two topics: firstly an approach of using multiple layers of annotation is sketched out. Regarding the XML representation this approach is similar to standoff annotation. A second topic is the use of heterogeneous linguistic resources (e.g., XML annotated documents, taggers, lexical nets) as a source for semiautomatic multi-dimensional markup to resolve typical linguistic issues, dealing with anaphora resolution as a case study.
The aim of the paper is twofold. Firstly, an approach is presented how to select the correct antecedent for an anaphoric element according to the kind of text segments in which both of them occur. Basically, information on logical text structure (e.g. chapters, sections, paragraphs) is used in order to select the antecedent life span of a linguistic expression, i.e. some linguistic expressions are more likely to be chosen as an antecedent throughout the whole text than others. In addition, an appropriate search scope for an anaphora expressed by an expression can be defined according to the document structuring elements that include the linguistic expression. Corpus investigations give rise to the supposition that logical text structure influences the search scope of candidates for antecedents. Second, a solution is presented how to integrate the resources used for anaphora resolution. In this approach, multi-layered XML annotation is used in order to make a set of resources accessible for the anaphora resolution system.
In the mid-1990s, the Faculty of Linguistics and Literary-Studies at Bielefeld University began to establish the field Text technology, both in research and education. Text technology is a new field of research on the border of Computational Linguistics and Computational Philology.
This paper focuses on Text technology in academic education. In 2002, Text Technology was introduced as a minor subject for B.A. Programs. It is organized in modules: Module 1 introduces the characteristics of electronic texts and documents, typography, typesetting systems and hypertext. Module 2 introduces one or two programming languages relevant to the field of humanities computing. Markup languages and the principles of information structuring are the main topics of Module 3. The formal fundamentals of computer-based text processing, as formal languages and their grammars, Logics et cetera are subjects of another module. The paper ends with a short description of other Bachelor- and Master-Programs at Bielefeld University which contain text technological themes.
XML-based technologies offer powerful resources for open source applications in the field of e-learning. The paper describes a model of hypertext as interlinked structures that can be intertwined by cross-annotation linking. This infrastructure integrates multiple perspectives and allows creating a personal learning environment. We exemplify the approach in a case study: the Hamlet project. In the course of this project, several German translations of William Shakespeare’s Hamlet have been collected and annotated. Two different annotation layers are used to achieve a cross-linking reference between the various German translations. We will describe the theoretical background of cross-annotation linking and the actual technological implementation of the system. Additionally, we will use the personas method to gain insights into the potential benefit of the system as a personal learning environment.
Konklusiva und Konklusivität
(2004)
Ausdrucksalternativen bei Konnektoren – Varianten oder Fehler? Protokoll eines fiktiven Gesprächs
(2011)
This article introduces the topic of ‘‘Multilingual language resources and interoperability’’. We start with a taxonomy and parameters for classifying language resources. Later we provide examples and issues of interoperatability, and resource architectures to solve such issues. Finally we discuss aspects of linguistic formalisms and interoperability.
Geschlossene Klassen?
(2002)
Dieser Beitrag handelt einerseits von Konnektoren, einer Klasse von Satzverknüpfern, die vor allem die traditionell bekannten Konjunktionen umfasst, andererseits von Anaphorika, einer anderen Gruppe textteilverbindender Einheiten, bei denen man vor allem an die Pronomen denken muss. Beide Arten sprachlicher Ausdrücke dienen der Herstellung von Textkohärenz, aber auf je eigene Art und Weise; jede leistet also ihren besonderen Beitrag zur Konstitution des Diskurses, i.e. des Textes und der Rede. Jede der beiden Klassen repräsentiert somit ein grundlegendes Verfahren, aus Sätzen einen Text bzw. Textteil zu machen. Diese zwei Verfahren sollen im Folgenden kurz vorgestellt und in ihren wesentlichen Unterschieden erläutert werden. Sodann sollen in einem zweiten Hauptteil einige Wörter des Deutschen untersucht werden, die gleichzeitig nach beiden Prinzipien wirken.
Linguistische Korpora
(2004)
Making CONCUR work
(2005)
The SGML feature CONCUR allowed for a document to be simultaneously marked up in multiple conflicting hierarchical tagsets but validated and interpreted in one tagset at a time. Alas, CONCUR was rarely implemented, and XML does not address the problem of conflicting hierarchies at all. The MuLaX document syntax is a non-XML syntax that enables multiply-encoded hierarchies by distinguishing different “layers” in the hierarchy by adding a layer ID as a prefix to the element names. The IDs tie all the elements in a single hierarchy together in an “annotation layer”. Extraction of a single annotation layer results in a well-formed XML document, and each annotation layer may be associated with an XML schema. The MuLaX processing model works on the nodes of one annotation layer at a time through Xpath-like navigation. CONCUR lives!
This paper proposes a methodology for querying linguistic data represented in different corpus formats. Examples of the need for queries over such heterogeneous resources are the corpus-based analysis of multimodal phenomena like the interaction of gestures and prosodic features, or syntax-related phenomena like information structure which exceed the expressive power of a tree-centered corpus format. Query languages (QLs) currently under development are strongly connected to corpus formats, like the NITE Object Model (NOM, Carletta et al., 2003) or the Meta-Annotation Infrastructure for ATLAS (MAIA, Laprun and Fiscus, 2002). The parallel development of linguistic query languages and corpus formats is due to the fact that general purpose query languages like XQuery (Boag et al., 2003) do not fulfill the changing needs of linguistically motivated queries, e.g. to give access to (non-)hierarchically organized, theory and language dependent annotations of multi modal signals and/or text. This leads to the problem that existing corpus formats and query languages are hard to reuse. They have to be re developed and re-implemented time-consumingly and expensively for unforeseen tasks. This paper describes an approach for overcoming these problems and a sample application.
This paper describes a corpus of Japanese task-oriented dialogues, i.e. its data, annotations, analysis methodology and preliminary results for the modeling of co-referential phenomena. Current corpus based approaches to co-reference concentrate on textual data from English or other European languages. Hence, the emerging language-general models of co-reference miss input from dialogue data of non-European languages. We aim to fill this gap and contribute to a model of co-reference on various language-specific and language-general levels.
Overlap in markup occurs where some markup structures do not nest, such as where the structural division of the text into lists, sections, etc., differs from the syntactic division of the text into sentences and phrases. The Multiple Annotation solution to this problem (redundant encoding in multiple forms) has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. But it has the significant disadvantage of independence of the separate files. These multiply annotated files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) can be programmatically derived and used together for editing, for inference, or for unification of the multiply annotated documents.
Co-reference annotation and resources: a multilingual corpus of typologically diverse languages
(2002)
This article introduces a dialogue corpus containing data from two typologically different languages, Japanese and Kilivila. The corpus is annotated in accordance with language specific annotation schemes for co-referential and similar relations. The article describes the corpus data, the properties of language specific co-reference in the two languages and a methodology for its annotation. Examples from the corpus show how this methodology is used in the workflow of the annotation process.
We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).
This paper deals with the problem of how to interrelate theory-specific treebanks and how to transform one treebank format to another. Currently, two approaches to achieve these goals can be differentiated. The first creates a mapping algorithm between treebank formats. Categories of a source format are transformed into a target format via a given set of general or language-specific mapping rules. The second relates treebanks via a transformation to a general model of linguistic categories, for example based on the EAGLES recommendations for syntactic annotations of corpora, or relying on the HPSG framework. This paper proposes a new methodology as a solution for these desiderata.
Diese Arbeit beschreibt und analysiert die soziale und sprachliche Welt von Orchestermusikern. Am Beispiel des „Mikrokosmos“ eines Orchesters untersucht sie das Zusammenspiel von beruflich-strukturellen Bedingungen, kulturell vermittelten Orientierungen und sprachlichen Handlungsmustern. Dabei werden das notwendige Hintergrundwissen über die soziale Kategorien, Leitbilder und Kriterien für Professionalität sowie die Handlungsdispositionen von Orchestermusikern dargestellt. Elemente einer historischen Entwicklung der Institution Orchester werden ebenso wie biografische Erfahrungen der Musiker und Musikerinnen auf ihre Relevanz für die Kommunikation im Orchester hin untersucht.
Linguistic corpora have been annotated by means of SGML-based markup languages for almost 20 years. We can, very roughly, differentiate between three distinct evolutionary stages of markup technologies. (1)Originally, single SGML tree-based document instances were deemed sufficient for the representation of linguistic structures. (2) Linguists began to realize that alternatives and extensions to the traditional model are needed. Formalisms such as, for example, NITE were proposed: the NITE Object Model (NOM) consists of multi-rooted trees. (3) We are now on the threshold of the third evolutionary stage: even NITE's very flexible approach is not suited for all linguistic purposes. As some structures, such as these, cannot be modeled by multi-rooted trees, an even more flexible approach is needed in order to provide a generic annotation format that is able to represent genuinely arbitrary linguistic data structures.
Einführung in die Bände
(1999)
Einführung in die Bände
(1999)
On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees
(2007)
The Generalised Architecture for Sustainability (GENAU) provides a framework for the transformation of single-file, multi-layer annotations into multi-rooted trees. By employing constraints expressed in XCONCUR-CL, this procedure can be performed lossless, i.e., without losing information, especially with regard to the nesting of elements that belong to multiple annotation layers. This article describes how different types of linguistic corpora can be transformed using specialised tools, and how constraint rules can be applied to the resulting multi-rooted trees to add an additional level of validation.
Communication across all language barriers has long been a goal of humankind. In recent years, new technologies have enabled this at least partially. New approaches and different methods in the field of Machine Translation (MT) are continuously being improved, modified, and combined, as well. Significant progress has already been achieved in this area; many automatic translation tools, such as Google Translate and Babelfish, can translate not only short texts, but also complete web pages in real time. In recent years, new advances are being made in the mobile area; Googles Translate app for Android and iOS, for example, can recognize and translate words within photographs taken by the mobile device (to translate a restaurant menu, for instance). Despite this progress, a “perfect” machine translation system seems to be an impossibility because a machine translation system, however advanced, will always have some limitations. Human languages contain many irregularities and exceptions, and consequently go through a constant process of change, which is difficult to measure or to be processed automatically. This paper gives a short introduction of the state of the art of MT. It examines the following aspects: types of MT, the most conventional and widely developed approaches, and also the advantages and disadvantages of these different paradigms.
Researchers in many disciplines, sometimes working in close cooperation, have been concerned with modeling textual data in order to account for texts as the prime information unit of written communication. The list of disciplines includes computer science and linguistics as well as more specialized disciplines like computational linguistics and text technology. What many of these efforts have in common is the aim to model textual data by means of abstract data types or data structures that support at least the semi-automatic processing of texts in any area of written communication.
Der Beitrag verdeutlicht am Beispiel eines bestimmten Gesprächstyps – Reklamationsgesprächen –, welche Möglichkeiten die angewandte Diskursforschung zu seiner Analyse hat und wie entsprechende Untersuchungsergebnisse für Trainings fruchtbar gemacht werden können. Speziell wird gezeigt, wie die in Reklamationen auftretenden Kommunikationsprobleme den Aufgaben eines zugrundeliegenden Handlungsschemas zugeordnet werden können und welche Möglichkeiten zur sprachlich-kommunikativen Bewältigung dieser Aufgaben bestehen.
The Meta-data-Database of a Next Generation Sustainability Web-Platform for Language Resources
(2008)
Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.
This paper describes the efforts in the field of sustainability of the Institut für Deutsche Sprache (IDS) in Mannheim with respect to DEREKO (Deutsches Referenzkorpus) the Archive of General Reference Corpora of Contemporary Written German. With focus on re-usability and sustainability, we discuss its history and our future plans. We describe legal challenges related to the creation of a large and sustainable resource; sketch out the pipeline used to convert raw texts to the final corpus format and outline migration plans to TEI P5. Due to the fact, that the current version of the corpus management and query system is pushed towards its limits, we discuss the requirements for a new version which will be able to handle current and future DEREKO releases. Furthermore, we outline the institute’s plans in the field of digital preservation.
Der Beitrag verdeutlicht am Beispiel eines bestimmten Gesprächstyps - Reklamationsgesprächen -, welche Möglichkeiten die angewandte Diskursforschung zu seiner Analyse hat und wie entsprechende Untersuchungsergebnisse für Trainings fruchtbar gemacht werden können. Speziell wird gezeigt, wie die in Reklamationen auftretenden Kommunikationsprobleme den Aufgaben eines zugrundeliegenden Handlungsschemas zugeordnet werden können und welche Möglichkeiten zur sprachlich-kommunikativen Bewältigung dieser Aufgaben bestehen.
Gesprächstraining
(2004)
Der Beitrag untersucht die Praxis der Sprach- und Kommunikationsbewertung in betrieblichen Kommunikationstrainings. Wesentliches Ziel dieser Kommunikationstrainings ist die Veränderung des beruflichen Gesprächsverhaltens der Teilnehmerinnen auf der Grundlage von Gesprächssimulationen. Der Beitrag betrachtet damit einen Bereich, für den Sprach- und Kommunikationsberatung konstitutiv ist, der aber in seiner normierenden Praxis bisher kaum beachtet und analysiert wurde.
Nach einer genaueren Charakterisierung der Grundlagen und Arbeitsweise der untersuchten Kommunikationstrainings (Abschnitt2) wird an drei Beispielen aus dem empirischen Material gezeigt, wie Kritik am Kommunikationsverhalten erfolgt, sowie das zugrundeliegende Monierungsschema herausgearbeitet (Abschnitt3). Abschnitt 4 vertieft diese Analysen: Zunächst wird der Mechanismus beschrieben, auf dessen Grundlage kommunikative Probleme identifiziert werden (4.1), um dann die Normen zu untersuchen, die in diesem Prozeß eine Rolle spielen (4.2), und den Bewertungscharakter der Monita herauszustellen (4.3). Diese Analysen ermöglichen es dann, die impliziten Auflassungen von Kommunikation zu rekonstruieren, die dem Training zugrundeliegen (4.4). Den Abschluß bilden Überlegungen, in welchen Punkten sich eine diskusanalytisch fundierte Kommunikationsberatung von den hier untersuchten Trainings unterscheidet (Abschnitt5).
We report on finished work in a project that is concerned with providing methods, tools, best practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainability platform implemented by the authors is described.
Gesprochene Sprache
(2009)
Die Frage nach "Bewertungskriterien für die Sprachberatung" gibt zum einen Anlaß zum Nachdenken darüber, was wir tun, wenn wir Ratsuchenden sprachlichen Rat erteilen, und warum wir es in einer bestimmten kommunikativen Form, in einem bestimmten institutionellen (oder auch kommerziellen) Rahmen tun. Zum anderen gibt die Frage Anlaß, darüber nachzudenken, wie wir den jeweils erteilten Rat (also, das, was wir jemandem in einer sprachlichen Frage raten) begründen können, welche Kriterien wir etwa bei der Entscheidung über sogenannte sprachliche Zweifelsfälle zugrunde legen, warum wir bestimmte sprachliche Erscheinungsformen - von der Ebene der Orthographie bis zur Frage einer angemessenen Textinterpretation - so und nicht anders bewerten und auf dem Hintergrund solcher Bewertungen entsprechende Ratschläge geben oder Empfehlungen aussprechen.
The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox' Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software for which academic licenses are available free of charge for academic institutions, e.g. Helmut Schmid's Tree Tagger. The second part focuses on the linguistic interpretability of the corpus annotations and more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher using the annotations blindly will run the risk of not actually studying the language but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology.
Die Kommunikation älterer Menschen (untereinander wie mit jüngeren) ist in der sprachwissenschaftlichen Forschung der Bundesrepublik ein sträflich vernachlässigtes Feld (Abschnitt 1). Zunächst werden drei verschiedene alltagsweltliche Konzepte von Alter vorgestellt (Abschnitt 2). Auf der Grundlage der Analyse authentischer Aufzeichnungen versucht der Beitrag dann, einige der Besonderheiten des Kommunikationsverhaltens älterer Menschen exemplarisch zu veranschaulichen, und er stellt die Frage, ob sich diese Besonderheiten als ein eigenständiger Kommunikationsstil auffassen lassen (Abschnitt 3). In methodologischer Hinsicht charakterisiert der Beitrag drei verschiedene Zugänge zur Erfassung altersspezifischer Phänomene: die Erfassung spezifischer Phänomene in Listenform, die Rekonstruktion der Verfahren zur interaktiven Konstitution und Akzentuierung von Alter und letztlich die Herleitung altersspezifischer Phänomene im Kommunikationsverhalten aus den Veränderungen der sozialen Lebenssituation im Alter (Abschnitt 4). Die beiden letztgenannten Zugänge werden durch empirische Analysen exemplifiziert (Abschnitt 5.2 und 5.3). Zuvor (Abschnitt 5.1) werden noch vier verschiedene typische Konstellationen unterschieden, in denen alte Menschen kommunizieren.
Neologismen
(2015)
Die Stichwortliste bildet eine tragende Säule von gedruckten als auch von Online-Wörterbüchern. Daher ist die Konzeption und Erarbeitung einer Stichwortliste auch eine zentrale Aufgabe in der Anfangsphase des am IDS in Angriff genommenen Paronymwörterbuchs. Welche Überlegungen dabei angestellt, welche Ideen entwickelt und welche Pläne bzw. Entscheidungen schließlich umgesetzt wurden, wird in dem Beitrag aufgezeigt.
Gehören nun die Männer an den Herd? Anmerkungen zum Wandel der Rollenbilder von Mann und Frau
(2015)
The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to tackle the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges that face it. Next, we outline our software implementation plan and describe development to-date.
Vorwort
(1995)
This document presents ongoing work related to spoken language data within a project that aims to establish a common and unified infrastructure for the sustainable provision of linguistic primary research data at the Institut für Deutsche Sprache (IDS). In furtherance of its mission to “document the German language as it is currently used”, the project expects to enable the research community to access a broad empirical base of working material via a single platform. While the goal is to eventually cover all linguistically relevant digital resources of the IDS, including lexicographic information systems such as the IDS German Vocabulary Portal, OWID, written language corpora such as the IDS German Reference Corpus, DeReKo, and spoken language corpora such as the IDS German Speech Corpus for Research and Teaching, FOLK, the work presented here predominantly focuses on the latter type of data, i.e. speech corpora. Within this context, the present document pictures the project’s contributions to the development of standards and best practice guidelines concerning data storage, process documentation and legal issues for the sustainable preservation and long-term accessibility of primary linguistic research data.
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation comprises the separation of conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for the representation of these annotations is a multi-rooted tree that again can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed as well as the advantages and drawbacks associated with the use of the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.
Research today is often performed in collaborated projects composed of project partners with different backgrounds and from different institutions and countries. Standards can be a crucial tool to help harmonizing these differences and to create sustainable resources. However, choosing a standard depends on having enough information to evaluate and compare different annotation and metadata formats. In this paper we present ongoing work on an interactive, collaborative website that collects information on standards in the field of linguistics as a means to guide interested researchers.
Kommunikation und Emotion
(1991)
This paper describes the effort of the Institut für Deutsche Sprache (IDS), the central research institution for the German language, connected with Information and Communications Technology (ICT). Use of ICT in a language research institute is twofold. On the one hand, ICT provides basic services for researches to accomplish their daily work. On the other hand, several national and international institutions have a strong interest in ICT. Therefore, ICT can also be seen as an amplifier for language research. The first part of this paper reports on the activates of the IDS in internal and external ICT-related projects and initiatives. The second part describes a general strategy towards an ICT strategy that could be useful both for the IDS and other national language institutes. We think such a general strategy is necessary to create a strong foundation not only for the ICT-related projects, but as a basis for a modem research institute.
Gesprächstraining
(2007)
This paper shows how a corpus-driven approach leads to a new perspective on central issues of phraseology and on lexicographical applications. It argues that a data-driven pattem search (applying Statistical methods), an a posteriori interpretation of the data and a user oriented documentation of the usage of multi-word units (e. g. in lexicographical articles) constitute a step-by-step process where each step has its own informational value and useflilness. The description of multi-word units (Usuelle Wortverbindungen) presented in this paper focuses on the second Step, the high quality analysis and interpretation of collocation data, exemplified by the fields of multi-word units centered around the word formslIdee/Ideenl(idea/ideas).
Im Mittelpunkt des Beitrags steht eine Person, deren Identität wesentlich durch ihre Mehrsprachigkeit und Mehrkulturalität unter Dominanz des Deutsch-Russischen und ihren Beruf als Integrationsarbeiterin geprägt ist. Die lebenslange Entwicklung dieser Identität und ihre interpersonalen Bedingungen werden auf der Grundlage von sprachbiografischen Interviews und Proben deutsch- und russischsprachiger Kommunikation rekonstruiert. Die Diskussion dieser Entwicklung nimmt Bezug auf aktuelle Fragen der Vermittlung früher kindlicher Zweisprachigkeit und der gesellschaftlichen Gestaltung der sprachlichen Integration von Zuwanderern.