OPUS 4 | Search

38 search hits

1 to 10

Sort by

A corpus-based lexical resource of spoken German in interaction (2019)

Meliss, Meike ; Möhrs, Christine ; Ribeiro Silveira, Maria ; Schmidt, Thomas

This paper presents the prototype of a lexicographic resource for spoken German in interaction, which was conceived within the framework of the LeGeDe-project (LeGeDe=Lexik des gesprochenen Deutsch). First of all, it summarizes the theoretical and methodological approaches that were used for the initial planning of the resource. The headword candidates were selected by analyzing corpus-based data. Therefore, the data of two corpora (written and spoken German) were compared with quantitative methods. The information that was gathered on the selected headword candidates can be assigned to two different sections: meanings and functions in interaction. Additionally, two studies on the expectations of future users towards the resource were carried out. The results of these two studies were also taken into account in the development of the prototype. Focusing on the presentation of the resource’s content, the paper shows both the different lexicographical information in selected dictionary entries, and the information offered by the provided hyperlinks and external texts. As a conclusion, it summarizes the most important innovative aspects that were specifically developed for the implementation of such a resource.

A corpus-based lexical resource of spoken German in interaction (2019)

Meliss, Meike ; Möhrs, Christine ; Ribeiro Silveira, Maria ; Schmidt, Thomas

Addressing Cha(lle)nges in Long-Term Archiving of Large Corpora (2020)

Arnold, Denis ; Fisseni, Bernhard ; Kamocki, Paweł ; Schonefeld, Oliver ; Kupietz, Marc ; Schmidt, Thomas

This paper addresses long-term archival for large corpora. Three aspects specific to language resources are focused, namely (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) the conversion of data to new formats for digital preservation. It is motivated why language resources may have to be changed, and why formats may need to be converted. As a solution, the use of an intermediate proxy object called a signpost is suggested. The approach will be exemplified with respect to the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).

cGAT. Konventionen für das computergestützte Transkribieren in Anlehnung an das Gesprächsanalytische Transkriptionssystem 2 (GAT2) (2023)

Schmidt, Thomas ; Schütte, Wilfried ; Winterscheid, Jenny ; Schürmann, Mia ; Reineke, Silke ; Schedl, Evi

Mit dem cGAT-Handbuch stellt das FOLK-Projekt eine Richtlinie für das computergestützte Transkribieren nach GAT 2 zur Verfügung. Das Handbuch wurde anhand der Transkriptionspraxis in FOLK entwickelt und enthält eine Vielzahl von authentischen Beispielen, die mit dem zugehörigen Audio auch über die Datenbank für Gesprochenes Deutsch (DGD) abgerufen werden können.

CLARIN Web Services for TEI-annotated Transcripts of Spoken Language (2020)

Fisseni, Bernhard ; Schmidt, Thomas

We present web services which implement a workflow for transcripts of spoken language following the TEI guidelines, in particular ISO 24624:2016 “Language resource management – Transcription of spoken language”. The web services are available at our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.

Connecting resources: Which issues have to be solved to integrate CMC corpora from heterogeneous sources and for different languages? (2017)

Beißwenger, Michael ; Wigham, Ciara ; Etienne, Carole ; Fišer, Darja ; Grumt Suárez, Holger ; Herzberg, Laura ; Hinrichs, Erhard ; Horsmann, Tobias ; Karlova-Bourbonus, Natali ; Lemnitzer, Lothar ; Longhi, Julien ; Lüngen, Harald ; Ho-Dac, Lydia-Mai ; Parisse, Christophe ; Poudat, Céline ; Schmidt, Thomas ; Stemle, Egon W. ; Storrer, Angelika ; Zesch, Torsten

The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.

Construction and dissemination of a corpus of spoken interaction - tools and workflows in the FOLK project (2016)

Schmidt, Thomas

This paper is about the workflow for construction and dissemination of FOLK (Forschungs - und Lehrkorpus Gesprochenes Deutsch – Research and Teaching Corpus of Spoken German), a large corpus of authentic spoken interaction data, recorded on audio and video. Section 2 describes in detail the tools used in the individual steps of transcription, anonymization, orthographic normalization, lemmatization and POS tagging of the data, as well as some utilities used for corpus management. Section 3 deals with the DGD (Datenbank für Gesprochenes Deutsch - Database of Spoken German) as a tool for distributing completed data sets and making them available for qualitative and quantitative analysis. In section 4, some plans for further development are sketched.

Conversion and annotation web services for spoken language data in CLARIN (2017)

Schmidt, Thomas ; Hedeland, Hanna ; Jettka, Daniel

We present an approach to making existing CLARIN web services usable for spoken language transcriptions. Our approach is based on a new TEI-based ISO standard for such transcriptions. We show how existing tool formats can be transformed to this standard, how an encoder/decoder pair for the TCF format enables users to feed this type of data through a WebLicht tool chain, and why and how web services operating directly on the standard format would be useful.

Das Berliner Wendekorpus am Archiv für gesprochenes Deutsch (2019)

Schmidt, Thomas

Das Archiv für Gesprochenes Deutsch (AGD, Stift/Schmidt 2014) am Leibniz-Institut für Deutsche Sprache ist ein Forschungsdatenzentrum für Korpora des gesprochenen Deutsch. Gegründet als Deutsches Spracharchiv (DSAv) im Jahre 1932 hat es über Eigenprojekte, Kooperationen und Übernahmen von Daten aus abgeschlossenen Forschungsprojekten einen Bestand von bald 100 Variations-, Interview- und Gesprächskorpora aufgebaut, die u. a. dialektalen Sprachgebrauch, mündliche Kommunikationsformen oder die Sprachverwendung bestimmter Sprechertypen oder zu bestimmten Themen dokumentieren. Heute ist dieser Bestand fast vollständig digitalisiert und wird zu einem großen Teil der wissenschaftlichen Gemeinschaft über die Datenbank für Gesprochenes Deutsch (DGD) im Internet zur Nutzung in Forschung und Lehre angeboten.

Das MEND-Korpus im Archiv für Gesprochenes Deutsch: Entstehung, Möglichkeiten, Grenzen (2023)

Kaufmann, Göz ; Gorisch, Jan ; Schmidt, Thomas

For many reasons, Mennonite Low German is a language whose documentation and investigation is of great importance for linguistics. To date, most research projects that deal with this language and/ or its speakers have had a relatively narrow focus, with many of the data cited being of limited relevance beyond the projects for which they were collected. In order to create a resource for a broad range of researchers, especially those working on Mennonite Low German, the dataset presented here has been transformed into a structured and searchable corpus that is accessible online. The translations of 46 English, Spanish, or Portuguese stimulus sentences into Mennonite Low German by 321 consultants form the core of the MEND-corpus (Mennonite Low German in North and South America) in the Archive for Spoken German. In addition to describing the origin of this corpus and discussing possibilities and limitations for further research, we discuss the technical structure and search possibilities of the Database for Spoken German. Among other things, this database allows for a structured search of metadata, a context-sensitive token search, and the generation of virtual corpora that can be shared with others. Moreover, thanks to its text-sound alignment, one can easily switch from a particular text section of the corpus to the corresponding audio section. Aside from the desire to equip the reader with the technical knowledge necessary to use this corpus, a further goal of this paper is to demonstrate that the corpus still offers many possibilities for future research.

1 to 10

Open Access

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Publicationstate

Reviewstate

Publisher

38 search hits