Refine
Document Type
- Conference Proceeding (12)
- Article (5)
- Part of a Book (2)
Has Fulltext
- yes (19)
Is part of the Bibliography
- no (19)
Keywords
- Korpus <Linguistik> (5)
- Langzeitarchivierung (5)
- Annotation (4)
- Digital Humanities (3)
- Metadaten (3)
- Sprachdaten (3)
- Auszeichnungssprache (2)
- Datenschutz (2)
- Digitale Sprachressourcen (2)
- E-Learning (2)
Publicationstate
- Veröffentlichungsversion (11)
- Postprint (3)
- Zweitveröffentlichung (3)
Reviewstate
- (Verlags)-Lektorat (13)
- Verlags-Lektorat (1)
Publisher
- University of Illinois (3)
- Extreme Markup Languages Conference (2)
- Narr (2)
- Oxford University Press (2)
- University of Oulu (2)
- Berlin-Brandenburgische Akademie der Wissenschaften (1)
- European Language Resources Association (ELRA) (1)
- ICCC Press (1)
- Johns Hopkins University Pres (1)
- Sociedad Española para el procesamiento del Lenguaje Natural (1)
We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s spe cific research needs. SPLICR also provides a graphical interface that enables users to query and to visualise corpora. The project in which the system is developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) To process and to archive sustainably the resources so that they are still available to the research community in five, ten, or even 20 years time. (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collec- tions of language resources.
This paper describes a new research initiative addressing the issue of sustainability of linguistic resources. The initiative is a cooperation between three collaborative research centres in Germany – the SFB 441 “Linguistic Data Structures” in Tübingen, the SFB 538 “Multilingualism” in Hamburg, and the SFB 632 “Information Structure” in Potsdam/Berlin. The aim of the project is to develop methods for sustainable archiving of the diverse bodies of linguistic data used at the three sites. In the first half of the paper, the data handling solutions developed so far at the three centres are briefly introduced. This is followed by an assessment of their commonalities and differences and of what these entail for the work of the new joint initiative. The second part then sketches seven areas of open questions with respect to sustainable data handling and gives a more detailed account of two of them – integration of linguistic terminologies and development of best practice guidelines.
Digital Text Collections, Linguistic Research Data, and Mashups: Notes on the Legal Situation
(2008)
Comprehensive data repositories are an essential part of practically all research carried out in the digital humanities nowadays. For example, library science, literary studies, and computational and corpus linguistics strongly depend on online archives that are highly sustainable and that contain not only digitized texts but also audio and video data as well as additional information such as metadata and arbitrary annotations. Current Web technologies, especially those that are related to what is commonly referred to as the Web 2.0, provide a number of novel functions such as multiuser editing or the inclusion of third-party content and applications that are also highly attractive for research applications in the areas mentioned above. Hand in hand with this development goes a high degree of legal uncertainty. The special nature of the data entails that, in quite a few cases, there are multiple holders of personal rights (mostly copyright) to different layers of data that often have different origins. This article discusses the legal problems of multiple authorships in private, commercial, and research environments. We also introduce significant differences between European and U.S. law with regard to the handling of this kind of data for scientific purposes.
Der vorliegende Artikel skizziert die Möglichkeiten, die durch den Gebrauch offener Standards im Bereich des eLearning und Web Based Trainig (WBT) eröffnet werden. Ausgehend von den Erfahrungen aus dem BMBF-Projekt MiLCA ("Medienintensive Lehrmodule in der Computerlinguistik-Ausbildung") werden die Vorteile einer XML basierten Markupsprache in Verbindung mit einer Open Source WBT-Plattform für die Strukturierung von Lernobjekten diskutiert. Dabei ist die Realisierung eines vollständigen XML Imports in das WBT-System nur der erste Schritt in einer sehr viel weiter gehenden Entwicklung, in der textlinguistische und computerlinguistische Methoden mehr und mehr an Bedeutung gewinnen. So wird zum Beispiel der Gebrauch von didaktisch motivierten Metadaten Autoren in die Lage versetzen, Lernobjekte adaptiv und lernerzentriert aufzubereiten. Die Integration von Ontologien und Taxonomien ist ein weiterer Aspekt, der noch präzisere Möglichkeiten der Wartung und Wiederverwendung von Lernobjekten eröffnet. Teil dieses Artikels ist ein annotiertes Beispiel-Lernobjekt zur Verdeutlichung der oben angesprochenen Entwicklungen und deren Auswirkungen auf die zukünftige akademische Ausbildung.
From Open Source to Open Information. Collaborative Methods in Creating XML-based Markup Languages
(2000)
Linguistic corpora have been annotated by means of SGML-based markup languages for almost 20 years. We can, very roughly, differentiate between three distinct evolutionary stages of markup technologies. (1)Originally, single SGML tree-based document instances were deemed sufficient for the representation of linguistic structures. (2) Linguists began to realize that alternatives and extensions to the traditional model are needed. Formalisms such as, for example, NITE were proposed: the NITE Object Model (NOM) consists of multi-rooted trees. (3) We are now on the threshold of the third evolutionary stage: even NITE's very flexible approach is not suited for all linguistic purposes. As some structures, such as these, cannot be modeled by multi-rooted trees, an even more flexible approach is needed in order to provide a generic annotation format that is able to represent genuinely arbitrary linguistic data structures.
On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees
(2007)
The Generalised Architecture for Sustainability (GENAU) provides a framework for the transformation of single-file, multi-layer annotations into multi-rooted trees. By employing constraints expressed in XCONCUR-CL, this procedure can be performed lossless, i.e., without losing information, especially with regard to the nesting of elements that belong to multiple annotation layers. This article describes how different types of linguistic corpora can be transformed using specialised tools, and how constraint rules can be applied to the resulting multi-rooted trees to add an additional level of validation.
Grundlage dieses Artikels* 1 ist das Verbundprojekt „Nachhaltigkeit linguistischer Daten“ der drei Sonderforschungsbereiche 441, 538 und 632, dessen Ziel es ist, Lösungen für die nachhaltige Verfügbarkeit der an den SFBs vorhandenen Korpora zu entwickeln. Ein zentraler Aspekt betrifft die Klärung der Rechtslage für die Nutzung und Weitergabe linguistischer Ressourcen, die durch das Urheber- sowie das Datenschutzrecht geschützt sind. Eine als indifferent wahrgenommene rechtliche Situation wird in der Praxis oft als das entscheidende Hindernis für die Weitergabe linguistischer Daten angeführt. Tatsächlich jedoch sind Nutzung und Weitergabe von Daten zu wissenschaftlichen Zwecken normativ geregelt. Problematisch ist oftmals die Einordnung der speziellen linguistischen Daten als Schutzgegenstand sowie die Tatsache, dass an linguistische Daten und Datensammlungen aufgrund ihrer komplexen und vielschichtigen Beschaffenheit durchaus mehrere Urheber Rechte besitzen können, die sich auf verschiedene Inhalte beziehen. Der Beitrag gibt einen Überblick über das geltende Recht sowie die juristischen und natürlichen Personen, die potentiell Rechte an linguistisch aufbereiteten Datenkollektionen besitzen. Es ist nicht Gegenstand dieses Artikels, rechtsverbindliche Aussagen zu treffen, die auf eine Nutzung und Weitergabe jedweder Daten angewandt werden. Der Artikel orientiert sich in seiner Struktur und thematischen Tiefe bewusst nicht an einem juristischen Publikum, sondern beschreibt die Problematik aus geisteswissenschaftlicher Perspektive. Zusammen mit einem Überblick über das vom Umgang mit linguistischen Datensammlungen betroffene Recht, das Urheberrechtsgesetz (Abschnitt 1) und das Bundesdatenschutzgesetz (Abschnitt 2), wird in den jeweiligen Abschnitten auch eine Klassifikation der Daten aus juristischer Sicht vorgenommen. Anschließend werden Lösungsansätze vorgestellt, die im Rahmen des o. g. Verbundprojektes erarbeitet werden (Abschnitt 3).
We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s specific research needs. SPLICR also provides an interface that enables users to query and to visualise corpora. The project in which the system is being developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) To process and to archive sustainably the resources so that they are still available to the research community in five, ten, or even 20 years time. (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annota-tions. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.
We report on finished work in a project that is concerned with providing methods, tools, best practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainability platform implemented by the authors is described.
This paper discusses work on the sustainability of linguistic resources as it was conducted in various projects, including the work of a three year project Sustainability of Linguistic Resources which finished in December 2008, a follow-up project, Sustainable linguistic data, and initiatives related to the work of the International Organization of Standardization (ISO) on developing standards for linguistic resources. The individual projects have been conducted at German collaborative research centres at the Universities of Potsdam, Hamburg and Tübingen, where the sustainability work was coordinated.
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation comprises the separation of conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for the representation of these annotations is a multi-rooted tree that again can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed as well as the advantages and drawbacks associated with the use of the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.
The Meta-data-Database of a Next Generation Sustainability Web-Platform for Language Resources
(2008)
Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.