Refine
Year of publication
Document Type
- Conference Proceeding (22)
- Part of a Book (12)
- Other (2)
- Report (1)
Has Fulltext
- yes (37)
Keywords
- Korpus <Linguistik> (37) (remove)
Publicationstate
- Veröffentlichungsversion (37) (remove)
Reviewstate
- (Verlags)-Lektorat (23)
- Peer-Review (8)
Publisher
This paper presents the application of the <tiger2/> format to various linguistic scenarios with the aim of making it the standard serialisation for the ISO 24615 [1] (SynAF) standard. After outlining the main characteristics of both the SynAF metamodel and the <tiger2/> format, as extended from the initial Tiger XML format [2], we show through a range of different language families how <tiger2/> covers a variety of constituency and dependency based analyses.
The present paper describes Corpus Query Lingua Franca (ISO CQLF), a specification designed at ISO Technical Committee 37 Subcommittee 4 “Language resource management” for the purpose of facilitating the comparison of properties of corpus query languages. We overview the motivation for this endeavour and present its aims and its general architecture. CQLF is intended as a multi-part specification; here, we concentrate on the basic metamodel that provides a frame that the other parts fit in.
Poster des Text+ Partners Leibniz-Institut für Deutsche Sprache Mannheim präsentiert beim Workshop "Wohin damit? Storing and reusing my language data" am 22. Juni 2023 in Mannheim. Das Poster wurde im Kontext der Arbeit des Vereins Nationale Forschungsdateninfrastruktur (NFDI) e.V. verfasst. NFDI wird von der Bundesrepublik Deutschland und den 16 Bundesländern finanziert, und das Konsortium Text+ wird gefördert durch die Deutsche Forschungsgemeinschaft (DFG) – Projektnummer 460033370. Die Autor:innen bedanken sich für die Förderung sowie Unterstützung. Ein Dank geht außerdem an alle Einrichtungen und Akteur:innen, die sich für den Verein und dessen Ziele engagieren.
This paper deals with the problem of how to interrelate theory-specific treebanks and how to transform one treebank format to another. Currently, two approaches to achieve these goals can be differentiated. The first creates a mapping algorithm between treebank formats. Categories of a source format are transformed into a target format via a given set of general or language-specific mapping rules. The second relates treebanks via a transformation to a general model of linguistic categories, for example based on the EAGLES recommendations for syntactic annotations of corpora, or relying on the HPSG framework. This paper proposes a new methodology as a solution for these desiderata.
This document presents ongoing work related to spoken language data within a project that aims to establish a common and unified infrastructure for the sustainable provision of linguistic primary research data at the Institut für Deutsche Sprache (IDS). In furtherance of its mission to “document the German language as it is currently used”, the project expects to enable the research community to access a broad empirical base of working material via a single platform. While the goal is to eventually cover all linguistically relevant digital resources of the IDS, including lexicographic information systems such as the IDS German Vocabulary Portal, OWID, written language corpora such as the IDS German Reference Corpus, DeReKo, and spoken language corpora such as the IDS German Speech Corpus for Research and Teaching, FOLK, the work presented here predominantly focuses on the latter type of data, i.e. speech corpora. Within this context, the present document pictures the project’s contributions to the development of standards and best practice guidelines concerning data storage, process documentation and legal issues for the sustainable preservation and long-term accessibility of primary linguistic research data.
This paper introduces the recently started DRuKoLA-project that aims at providing mechanisms to flexibly draw virtual comparable corpora from the German Reference Corpus DeReKo and the Reference Corpus of Contemporary Romanian Language CoRoLa in order to use these virtual corpora as empirical basis for contrastive linguistic research.
The aim of the paper is twofold. Firstly, an approach is presented how to select the correct antecedent for an anaphoric element according to the kind of text segments in which both of them occur. Basically, information on logical text structure (e.g. chapters, sections, paragraphs) is used in order to select the antecedent life span of a linguistic expression, i.e. some linguistic expressions are more likely to be chosen as an antecedent throughout the whole text than others. In addition, an appropriate search scope for an anaphora expressed by an expression can be defined according to the document structuring elements that include the linguistic expression. Corpus investigations give rise to the supposition that logical text structure influences the search scope of candidates for antecedents. Second, a solution is presented how to integrate the resources used for anaphora resolution. In this approach, multi-layered XML annotation is used in order to make a set of resources accessible for the anaphora resolution system.
Igel is a small XQuery-based web application for examining a collection of document grammars; in particular, for comparing related document grammars to get a better overview of their differences and similarities. In its initial form, Igel reads only DTDs and provides only simple lists of constructs in them (elements, attributes, notations, parameter entities). Our continuing work is aimed at making Igel provide more sophisticated and useful information about document grammars and building the application into a useful tool for the analysis (and the maintenance!) of families of related document grammars
KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DEREKO for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: An architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.
Als Teil der NFDI vernetzt Text+ ortsverteilt verschiedenste Daten und Dienste für die geisteswissenschaftliche Forschung und stellt sie der wissenschaftlichen Gemeinschaft FAIR zur Verfügung. In diesem Beitrag beschreiben wir die Umsetzung beispielhaft im Bereich der Text+ Datendomäne Sammlungen anhand von Korpora, die in verschiedenen Disziplinen Verwendung finden. Die Infrastruktur ist auf Erweiterbarkeit ausgelegt, so dass auch weitere Ressourcen über Text+ verfügbar gemacht werden können. Enthalten ist auch ein Ausblick auf weitere zu erwartende Entwicklungen. Ein Beitrag zur 9. Tagung des Verbands "Digital Humanities im deutschsprachigen Raum" - DHd 2023 Open Humanities Open Culture.
In this paper we present an approach to faceted search in large language resource repositories. This kind of search which enables users to browse through the repository by choosing their personal sequence of facets heavily relies on the availability of descriptive metadata for the objects in the repository. This approach therefore informs the collection of a minimal set of metatdata for language resources. The work described in this paper has been funded by the EC within the ESFRI infrastructure project CLARIN.
Providing online repositories for language resources is one of the main activities of CLARIN centres. The legal framework regarding liability of Service Providers for content uploaded by their users has recently been modified by the new Directive on Copyright in the Digital Single Market. A new category of Service Providers, Online Content-Sharing Service Providers (OCSSPs), was added. It is subject to a complex and strict framework, including the requirement to obtain licenses from rightholders for the hosted content. This paper provides the background and effect of these changes to law and aims to initiate a debate on how CLARIN repositories should navigate this new legal landscape.