Using Full Text Indices for Querying Spoken Language Data

As part of the ZuMult project, we are currently modelling a backend architecture that should provide query access to corpora from the Archive of Spoken German (AGD) at the Leibniz-Institute for the German Language (IDS). We are exploring how to reuse existing search engine frameworks that provide full text indices and allow corpora to be queried with one of the corpus query languages (QLs) established and actively used in the corpus research community. For this purpose, we tested MTAS, an open source Lucene-based search engine for querying text with multilevel annotations. We applied MTAS to three oral corpora stored in the TEI-based ISO standard for transcriptions of spoken language (ISO 24624:2016). These corpora differ from the corpus data that MTAS was developed for, because they include interactions with two or more speakers and are enriched, inter alia, with timeline-based annotations. In this contribution, we report our test results and address issues that arise when search frameworks originally developed for querying written corpora are transferred to the field of spoken language.

Metadata
Author:	Elena Frick, Thomas Schmidt
URN:	urn:nbn:de:bsz:mh39-98143
URL:	http://corpora.ids-mannheim.de/cmlc-2020.html
ISBN:	979-10-95546-61-0
Parent Title (English):	Proceedings of the LREC 2020 Workshop, Language Resources and Evaluation Conference, 11–16 May 2020, 8th Workshop on Challenges in the Management of Large Corpora (CMLC-8)
Publisher:	European Language Resources Association
Place of publication:	Paris
Editor:	Piotr Bański, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen, Ines Pisetta
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2020
Date of Publication (online):	2020/05/12
Publication state:	Published version
Review state:	Peer review
Tag:	MTAS; TEI; oral corpora; query; spoken language data
GND Keyword:	Abfrage; Computerlinguistik; Gesprochene Sprache; Korpus <Linguistik>; Text Encoding Initiative
First Page:	40
Last Page:	46
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Computational linguistics
Linguistics-Classification:	Corpus linguistics
Program areas:	P2: Oral Corpora
Licence (English):	Creative Commons - Attribution-NonCommercial 4.0 International
