As part of the ZuMult project, we are currently modelling a backend architecture that should provide query access to corpora from the Archive of Spoken German (AGD) at the Leibniz-Institute for the German Language (IDS). We are exploring how to reuse existing search engine frameworks that provide full-text indices and allow corpora to be queried with one of the corpus query languages (QLs) established and actively used in the corpus research community. For this purpose, we tested MTAS, an open source Lucene-based search engine for querying text with multilevel annotations. We applied MTAS to three oral corpora stored in the TEI-based ISO standard for transcriptions of spoken language (ISO 24624:2016). These corpora differ from the corpus data that MTAS was developed for, because they include interactions with two or more speakers and are enriched, inter alia, with timeline-based annotations. In this contribution, we report our test results and address issues that arise when search frameworks originally developed for querying written corpora are transferred to the field of spoken language.
The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modern, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening up perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse extremely large amounts of primary data and annotations, while at the same time allowing an undistorted view of the primary un-annotated text, and thus fully satisfying expectations associated with a scientific tool. The project started in July 2011 and is funded until June 2014. The demo presentation in December will be the first version following a preliminary feature freeze, and will open the alpha testing phase of the project.
The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to tackle the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges that it faces. Next, we outline our software implementation plan and describe development to date.
When the substandard German varieties brought along by first-generation ethnic German repatriates (Aussiedler) from German language islands of the former Soviet Union come into contact with the standard language and the regional varieties spoken within Germany, changes of a specific kind arise that do not occur among native dialect speakers in the German-speaking area during convergence resulting from standard/dialect variation. Speakers who come from a language island activate their variation patterns in the course of their stay in Germany on the basis of their prior dialectal knowledge of German, and extend their repertoire into the standard and, in part, the regional varieties of German. The present publication is devoted to this process and its consequences.
The present paper describes Corpus Query Lingua Franca (ISO CQLF), a specification designed at ISO Technical Committee 37 Subcommittee 4 “Language resource management” for the purpose of facilitating the comparison of properties of corpus query languages. We give an overview of the motivation for this endeavour and present its aims and its general architecture. CQLF is intended as a multi-part specification; here, we concentrate on the basic metamodel, which provides a frame into which the other parts fit.
This contribution presents the background, design and results of a study of users of three oral corpus platforms in Germany. Roughly 5,000 registered users of the Database for Spoken German (DGD), the GeWiss corpus and the corpora of the Hamburg Centre for Language Corpora (HZSK) were asked to participate in a user survey. This quantitative approach was complemented by qualitative interviews with selected users. We briefly introduce the corpus resources involved in the study in section 2. Section 3 describes the methods employed in the user studies. Section 4 summarizes results of the studies focusing on selected key topics. Section 5 attempts a generalization of these results to larger contexts.
In this paper, we address two problems in indexing and querying spoken language corpora with overlapping speaker contributions. First, we look into how token distance and token precedence can be measured when multiple primary data streams are available and when transcriptions are tokenized but not synchronized with the sound at the level of individual tokens. We propose and experiment with a speaker-based search mode that enables any speaker’s transcription tier to serve as the basic tokenization layer, with the contributions of other speakers mapped to this tier. Secondly, we address two distinct methods of capturing speaker overlaps in the TEI-based ISO Standard for Spoken Language Transcriptions (ISO 24624:2016) and how they can be queried by MTAS, an open source Lucene-based search engine for querying text with multilevel annotations. We illustrate the problems, introduce possible solutions and discuss their benefits and drawbacks.
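The idea of a speaker-based search mode can be illustrated with a minimal sketch. It is not the implementation described in the paper and does not use MTAS; the `Contribution` class, the function names, and the rule of attaching another speaker's contribution to the most recent base-tier token are all assumptions made for illustration. It assumes only that contributions carry start and end times while individual tokens do not.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Contribution:
    """One speaker contribution; only its boundaries are time-aligned (assumption)."""
    speaker: str
    start: float  # seconds on the shared timeline
    end: float
    tokens: List[str]


def build_base_tier(
    contributions: List[Contribution], base_speaker: str
) -> Tuple[List[str], Dict[int, List[Contribution]]]:
    """Use one speaker's tokens as the position layer and attach all other
    contributions to a position on that layer (hypothetical mapping rule)."""
    base_tokens: List[str] = []
    attached: Dict[int, List[Contribution]] = {}
    for contrib in sorted(contributions, key=lambda c: c.start):
        if contrib.speaker == base_speaker:
            base_tokens.extend(contrib.tokens)
        else:
            # Attach to the last base token seen so far (position 0 if none yet).
            position = max(len(base_tokens) - 1, 0)
            attached.setdefault(position, []).append(contrib)
    return base_tokens, attached


def token_distance(position_a: int, position_b: int) -> int:
    """Distance measured on the base tier only; other speakers' tokens do not count."""
    return abs(position_b - position_a)


# Invented example data: speaker B overlaps speaker A with a short backchannel.
example = [
    Contribution("A", 0.0, 2.0, ["ja", "das", "ist"]),
    Contribution("B", 1.0, 1.5, ["mhm"]),
    Contribution("A", 2.0, 3.0, ["gut"]),
]
base_tokens, attached = build_base_tier(example, base_speaker="A")
distance_ist_gut = token_distance(base_tokens.index("ist"), base_tokens.index("gut"))
```

In this sketch, "ist" and "gut" are adjacent on speaker A's tier even though speaker B's "mhm" falls between them on the timeline, whereas a single merged token stream ordered by contribution start would place "mhm" between them.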