Refine
Year of publication
- 2020 (64) (remove)
Document Type
- Part of a Book (25)
- Conference Proceeding (19)
- Article (8)
- Book (5)
- Review (2)
- Doctoral Thesis (1)
- Master's Thesis (1)
- Other (1)
- Report (1)
- Working Paper (1)
Keywords
- Korpus <Linguistik> (64) (remove)
Publicationstate
- Veröffentlichungsversion (36)
- Zweitveröffentlichung (24)
- Postprint (4)
Reviewstate
Publisher
- European Language Resources Association (15)
- de Gruyter (8)
- CLARIN (6)
- Leibniz-Institut für Deutsche Sprache (IDS) (4)
- Heidelberg University Publishing (3)
- Linköping University Electronic Press (3)
- Universitätsverlag Rhein-Ruhr (2)
- Association for Computational Linguistics (1)
- Buro van die WAT (1)
- Digital Curation Centre (1)
Einleitung
(2020)
A corpus-based academic grammar of German is an enormous undertaking, especially if it aims at using state-of-the-art methodology while ensuring that its study results are verifiable. The Bausteine-series, which is being developed at the Leibniz Institute for the German Language (IDS), presents individual “building blocks” for such a grammar. In addition to the peer-reviewed texts, the series publishes the results of statistical analyses and, for selected topics, the underlying data sets.
The newest generation of speech technology caused a huge increase of audio-visual data nowadays being enhanced with orthographic transcripts such as in automatic subtitling in online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option of making that data accessible. This paper tests the usability of a state-of-the-art ASR-System on a historical (from the 1960s), but regionally balanced corpus of spoken German, and a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR-System with higher recognition scores for the north of Germany vs. lower scores for the south. A detailed analysis of the narrow region data revealed – despite relatively high ASR-confidence – some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. correcting transcripts or transcribing from scratch. Such geography-dependent analyses can also have the potential for ASR-development to make targeted data selection for training/adaptation and to increase the sensitivity towards varieties of pluricentric languages.
Am Leibniz-Institut für Deutsche Sprache (IDS) wurde im Programmbereich „Lexikografie und Sprachdokumentation“ ein neuartiges Wörterbuch entwickelt, das leicht verwechselbare Ausdrücke in ihrem aktuellen öffentlichen Sprachgebrauch deskriptiv beschreibt. Im Jahr 2018 erschien das elektronische Nachschlagewerk „Paronyme – Dynamisch im Kontrast“, das sich durch folgende drei Aspekte auszeichnet:
1) Erstens liegen mehrstufige kontrastive Beschreibungsebenen und flexible Darstellungsformen vor;
2) zweitens sind die Bedeutungserläuterungen kognitiv-konzeptuell angelegt, um einer langen Forderung nach einer stärker kognitiv ausgerichteten Lexikografie Rechnung zu tragen;
3) drittens werden Datengrundlagen und Analysemethoden genutzt, mit denen umfassend Paronyme ermittelt und diese anschließend erstmals empirisch ausgewertet werden konnten.
The newest generation of speech technology caused a huge increase of audio-visual data nowadays being enhanced with orthographic transcripts such as in automatic subtitling in online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option of making that data accessible. This paper tests the usability of a state-of-the-art ASR-System on a historical (from the 1960s), but regionally balanced corpus of spoken German, and a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR-System with higher recognition scores for the north of Germany vs. lower scores for the south. A detailed analysis of the narrow region data revealed – despite relatively high ASR-confidence – some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. correcting transcripts or transcribing from scratch. Such geography-dependent analyses can also have the potential for ASR-development to make targeted data selection for training/adaptation and to increase the sensitivity towards varieties of pluricentric languages.
Individuals with Autism Spectrum Disorder (ASD) experience a variety of symptoms sometimes including atypicalities in language use. The study explored diferences in semantic network organisation of adults with ASD without intellectual impairment. We assessed clusters and switches in verbal fuency tasks (‘animals’, ‘human feature’, ‘verbs’, ‘r-words’) via curve ftting in combination with corpus-driven analysis of semantic relatedness and evaluated socio-emotional and motor action related content. Compared to participants without ASD (n=39), participants with ASD (n=32) tended to produce smaller clusters, longer switches, and fewer words in semantic conditions (no p values survived Bonferroni-correction), whereas relatedness and content were similar. In ASD, semantic networks underlying cluster formation appeared comparably small without afecting strength of associations or content.
This thesis describes work in three areas: grammar engineering, computer-assisted language learning and grammar learning. These three parts are connected by the concept of a grammar-based language learning application. Two types of grammars are of concern. The first we call resource grammars, extensive descriptions a natural languages. Part I focuses on this kind of grammars. The other are domain-specific or application-specific grammars. These grammars only describe a fragment of natural language that is determined by the domain of a certain application. Domain-specific grammars are relevant for Part II and Part III. Another important distinction is between humans learning a new natural language using computational grammars (Part II) and computers learning grammars from example sentences (Part III). Part I of this thesis focuses on grammar engineering and grammar testing. It describes the development and evaluation of a computational resource grammar for Latin. Latin is known for its rich morphology and free word order, both have to be handled in a computationally efficient way. A special focus is on methods how computational grammars can be evaluated using corpus data. Such an evaluation is presented for the Latin resource grammar. Part II, the central part, describes a computer-assisted language learning application based on domain-specific grammars. The language learning application demonstrates how computational grammars can be used to guide the user input and how language learning exercises can be modeled as grammars. This allows us to put computational grammars in the center of the design of language learning exercises used to help humans learn new languages. Part III, the final part, is dedicated to a method to learn domain- or application-specific grammars based on a wide-coverage grammar and small sets of example sentences. Here a computer is learning a grammar for a fragment of a natural language from example sentences, potentially without any additional human intervention. These learned grammars can be based e.g. on the Latin resource grammar described in Part II and used as domain-specific lesson grammars in the language learning application described Part II.
Blogg Dir deinen Urlaub nach Tunesien! Zur Erläuterung des Musters [VImp PROPReflexivDat NPAkk]
(2020)
In diesem Beitrag soll das Muster [VImp PROPReflexivDat NPAkk] semantisch und syntaktisch erläutert werden. Dieses Muster, das semantisch mit Verben des Erwerbens wie anschaffen korreliert, wird auch im Zusammenhang mit Kommunikationsverben wie bloggen und facebooken sowie mit dem Kontaktverb rubbeln belegt. Mithilfe des Konzeptes der Koerzion bzw. der semantischen Anpassung soll das Kovorkommen des erwänhten Musters mit diesen Verben beschrieben und erklärt werden. Als empirische Quelle dient das Korpus für das Deutsche 2012 und 2014 aus den Corpora from the Web. Die vorliegende Untersuchung ist im Rahmen meiner Dissertationsarbeit zum Thema Argumentstruktur und Bedeutung medialer Kommunikationsverben des Deutschen und des Spanischen im Sprachvergleich durchgeführt worden.
In this article, we examine the current situation of data dissemination and provision for CMC corpora. By that we aim to give a guiding grid for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data management, we evaluate the 20 European CMC corpora listed in the CLARIN CMC Resource family, individuate successful strategies among the existing corpora and establish best practices for future projects. We give an overview of existing approaches to data referencing, dissemination and provision in European CMC corpora, and discuss the methods, formats and strategies used. Furthermore, we discuss the need for community standards and offer recommendations for best practices when creating a new CMC corpus.
In this Paper, we describe a schema and models which have been developed for the representation of corpora of computer-mediated communicatin (CMC corpora) using the representation framework provided by the Text Encoding Initiative (TEI). We characterise CMC discourse as dialogic, sequentially organised interchange between humans and point out that many features of CMC are not adequately handled by current corpus encoding schemas and tools. We formulate desiderata for a representation of CMC in encoding schemes and argue why the TEI is a suitable framework for the encoding of CMC corpora. We propose a model of basic CMC units (utterances, posts, and nonverbal activities) and the macro- and micro-level structures of interactions in CMC environments. Based on these models, we introduce CMC-core, a TEI customisation for the encoding of CMC corpora, which defines CMC-specific encoding features on the four levels of elements, model classes, attribute classes, and modules of the TEI infrastructure. The description of our customisation is illustrated by encoding examples from corpora by researchers of the TEI SIG CMC, representing a variety of CMC genres, i.e. chat, wiki talk, twitter, blog, and Second Life interactions. The material described, i.e. schemata, encoding examples, and documentation, is available from the of the TEI CMC SIG Wiki and will accompany a feature request to the TEI council in late 2019.