430 Deutsch
Refine
Year of publication
Document Type
- Conference Proceeding (118) (remove)
Keywords
- Deutsch (56)
- Korpus <Linguistik> (43)
- Gesprochene Sprache (12)
- Englisch (7)
- Wörterbuch (7)
- Computerunterstützte Lexikographie (6)
- Annotation (5)
- Digital Humanities (5)
- Diskursanalyse (5)
- German (5)
Publicationstate
- Veröffentlichungsversion (40)
- Zweitveröffentlichung (27)
- Postprint (1)
Reviewstate
- Peer-Review (26)
- (Verlags)-Lektorat (24)
- Review-Status-unbekannt (5)
Publisher
- European Language Resources Association (ELRA) (10)
- Lexical Computing CZ s.r.o. (4)
- Edusp/Monferrer Produções (3)
- Evangelische Akademie Loccum (3)
- Universität zu Köln (3)
- Association for Computational Linguistics (2)
- Association of Internet Researchers (2)
- Clarin (2)
- Department of Linguistics and Scandinavian Studies, University of Oslo (2)
- ELRA (2)
We start by trying to answer a question that has already been asked by de Schryver et al. (2006): Do dictionary users (frequently) look up words that are frequent in a corpus. Contrary to their results, our results that are based on the analysis of log files from two different online dictionaries indicate that users indeed look up frequent words frequently. When combining frequency information from the Mannheim German Reference Corpus and information about the number of visits in the Digital Dictionary of the German Language as well as the German language edition of Wiktionary, a clear connection between corpus and look-up frequencies can be observed. In a follow-up study, we show that another important factor for the look-up frequency of a word is its temporal social relevance. To make this effect visible, we propose a de-trending method where we control both frequency effects and overall look-up trends.
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.
Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. In order to capture the features of German spoken language, a distinct tagset is needed to respond to the kinds of elements which only occur in speech. In order to create such a coherent tagset the most prominent phenomena of spoken language need to be analyzed, especially with respect to how they differ from written language. First evaluations have shown that the most prominent cause (over 50%) of errors in the existing automatized POS-tagging of transcripts of spoken German with the Stuttgart Tübingen Tagset (STTS) and the treetagger was the inaccurate interpretation of speech particles. One reason for this is that this class of words is virtually absent from the current STTS. This paper proposes a recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics. The ultimate aim is to create a comprehensive reference corpus of spoken German data for the global research community. It is imperative that all phenomena are reliably recorded in future part-of-speech tag labels.
Die durch die Covid-19-Pandemie bedingte Umstellung der Präsenzlehre auf digitale Lehr- und Lernformate stellte Lehrende und Studierende gleichermaßen vor eine Herausforderung. Innerhalb kürzester Zeit musste die Nutzung von Plattformen und digitalen Tools erlernt und getestet werden. Der Beitrag stellt exemplarisch Dienste und Werkzeuge von CLARIAH-DE vor und erläutert, wie die digitale Forschungsinfrastruktur Lehrende und Studierende auch im Rahmen der digitalen Lehre unterstützen kann.
Online Access Tools for Spoken German: The Resources of the Deutsches Spracharchiv in a Database
(2002)
This paper shows some details of the modernization of the Deutsches Spracharchiv (DSAv). It explores some future possibilities of linguistical documentation and analysis using the Web. The Institut für Deutsche Sprache (IDS) in Mannheim is the central institution for linguistic research in Germany. The DSAv in the IDS is the center for documentation and research of spoken German. These archives include the largest collection of sound recordings of spoken German (dialects and colloquial speech, including e.g. lots of extinct dialects of former German territories in Eastern Europe) - altogether more than 15,000 sound recordings. The lacking clarification and accessibility of this data material has been felt as an essential deficit. The opportunity to edit the sound signal digitally offers a much easier access to spoken language. Through the integration of the already existing information about the corpora and the transcribed texts in an information- and full text databank, as well as the linking of the data with the acoustic signal (alignment), arises a data-pool with considerably better documentation of the materials and a fast direct grasp of the recorded sounds. Thus, the DSAv initiates totally new research questions for the work at the IDS, as well as for linguistics altogether.