As part of the ZuMult project, we are currently modelling a backend architecture that should provide query access to corpora from the Archive of Spoken German (AGD) at the Leibniz-Institute for the German Language (IDS). We are exploring how to reuse existing search engine frameworks that provide full-text indices and allow corpora to be queried with one of the corpus query languages (QLs) established and actively used in the corpus research community. For this purpose, we tested MTAS, an open-source Lucene-based search engine for querying text with multilevel annotations. We applied MTAS to three oral corpora stored in the TEI-based ISO standard for transcriptions of spoken language (ISO 24624:2016). These corpora differ from the corpus data that MTAS was developed for because they include interactions with two or more speakers and are enriched, inter alia, with timeline-based annotations. In this contribution, we report our test results and address issues that arise when search frameworks originally developed for querying written corpora are transferred to the field of spoken language.
This paper describes EXMARaLDA, an XML-based framework for the construction, dissemination and analysis of corpora of spoken language transcriptions. Starting from a prototypical example of a “partitur” (musical score) transcription, the EXMARaLDA “single timeline, multiple tiers” data model and format are presented alongside the EXMARaLDA Partitur-Editor, a tool for inputting and visualizing such data. This is followed by a discussion of the interaction of EXMARaLDA with other frameworks and tools that work with similar data models. Finally, this paper presents an extension of the “single timeline, multiple tiers” data model and describes its application within the EXMARaLDA system.
Time-based data models and the Text Encoding Initiative’s guidelines for transcription of speech
(2005)
This paper describes EXMARaLDA, a system for computer transcription of spoken discourse developed and used by the SFB "Mehrsprachigkeit" at the University of Hamburg. EXMARaLDA consists of several DTDs for XML coding of transcription data and some input and output tools for these formats. Apart from being a transcription system in its own right, EXMARaLDA also acts as a mediator among the older data formats in use at the SFB, and between these formats and a planned database of multilingual spoken discourse.
This paper describes the TEI-based ISO standard 24624:2016 “Transcription of spoken language” and other formats used within CLARIN for spoken language resources. It assesses the current state of support for the standard and the interoperability between these formats and with relevant tools and services. The main idea behind the paper is that a digital infrastructure providing language resources and services to researchers should also allow the combined use of resources and/or services from different contexts. This requires syntactic and semantic interoperability. We propose a solution based on the ISO/TEI format and describe the necessary steps for this format to work as an exchange format with basic semantic interoperability for spoken language resources across the CLARIN infrastructure and beyond.
This paper describes a new research initiative addressing the issue of sustainability of linguistic resources. This initiative is a cooperation between three linguistic collaborative research centres in Germany, which together comprise more than 40 individual research projects. These projects create a wide variety of language resources, especially corpora, tailored to their particular needs. The aim of the project described here is to ensure effective and sustainable access to these data by third-party researchers beyond the termination of these projects. This goal involves a number of measures, such as the definition of a common data format to completely capture the heterogeneous information encoded in the individual corpora, the development of user-friendly and sustainably usable tools for processing (e.g. querying) the data, and the specification of common inventories of metadata and terminology. Moreover, the project aims at formulating general rules of best practice for creating, accessing, and archiving linguistic resources.
Stellungnahme zu Wolfgang Schneiders Artikel "Annotate in Transkriptionen aus DV-technischer Sicht"
(2002)
This contribution addresses the workshop topic of “standardising policies within eHumanities infrastructures”. It relates 10 years of experience with language resource standards, gained in the development of EXMARaLDA, a system for the construction and exploitation of spoken language corpora. Section 2 gives an overview of the EXMARaLDA system, focussing on its relationship with existing and evolving standards for language resources. Section 3 presents the HIAT system as an example of an established community practice. Section 4 then addresses several issues that were encountered when trying to bring together HIAT, EXMARaLDA and the wider world of standards.