Refine
Year of publication
Document Type
- Conference Proceeding (71) (remove)
Has Fulltext
- yes (71)
Keywords
- Korpus <Linguistik> (29)
- Digital Humanities (13)
- Annotation (12)
- Auszeichnungssprache (10)
- Computerlinguistik (8)
- Forschungsdaten (7)
- Langzeitarchivierung (7)
- XML (7)
- Sprachdaten (6)
- Metadaten (5)
Publicationstate
- Veröffentlichungsversion (54)
- Zweitveröffentlichung (9)
- Postprint (3)
Reviewstate
- (Verlags)-Lektorat (41)
- Peer-Review (16)
Publisher
- European Language Resources Association (ELRA) (15)
- Extreme Markup Languages Conference (6)
- University of Illinois (3)
- University of Oulu (3)
- Universität Tübingen (3)
- Universität zu Köln (3)
- Clarin (2)
- European Language Resources Association (2)
- McGill University & Université de Montréal (2)
- Nisaba (2)
Gegenstand des Workshop-Beitrags ist die Verknüpfung heterogener linguistischer Ressourcen. Eine bedeutende Teilmenge von Ressourcen in der gegenwärtigen linguistischen Forschung und Anwendung besteht zum einen aus XML-annotierten Textdokumenten und zum anderen aus externen Ressourcen wie Grammatiken, Lexika oder Ontologien. Es wird eine Architektur vorgestellt, die eine Integration heterogener Ressourcen erlaubt, wobei die Methoden zur Integration unabhängig von der jeweiligen Anwendung sind und somit verschiedene Verknüpfungen ermöglichen. Eine exemplarische Anwendung der Methodologie ist die Analyse anaphorischer Beziehungen.
This abstract discusses the possibility to adopt a CLARIN Data Protection Code of Conduct pursuant art. 40 of the General Data Protection Regulation. Such a code of conduct would have important benefits for the entire language research community. The final section of this abstract proposes a roadmap to the CLARIN Data Protection Code of Conduct, listing various stages of its drafting and approval procedures.
The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to tackle the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges that face it. Next, we outline our software implementation plan and describe development to-date.
The Meta-data-Database of a Next Generation Sustainability Web-Platform for Language Resources
(2008)
Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.
This paper describes the efforts in the field of sustainability of the Institut für Deutsche Sprache (IDS) in Mannheim with respect to DEREKO (Deutsches Referenzkorpus) the Archive of General Reference Corpora of Contemporary Written German. With focus on re-usability and sustainability, we discuss its history and our future plans. We describe legal challenges related to the creation of a large and sustainable resource; sketch out the pipeline used to convert raw texts to the final corpus format and outline migration plans to TEI P5. Due to the fact, that the current version of the corpus management and query system is pushed towards its limits, we discuss the requirements for a new version which will be able to handle current and future DEREKO releases. Furthermore, we outline the institute’s plans in the field of digital preservation.