So-called "pragmaticalized multi-word units" are highly frequent in German and are sometimes subject to far-reaching phonetic reduction processes. These processes can produce realization variants that, in retrospect, can be traced back to more than one lexematic source form. The present study uses a perception experiment to investigate [ˈzɐmɐ], a particularly striking case of this kind.
Song lyrics can be considered a text genre that has features of both written and spoken discourse and that potentially provides extensive linguistic and cultural information to scientists from various disciplines. However, pop songs have so far played a rather subordinate role in empirical language research, most likely due to the absence of scientifically valid and sustainable resources. The present paper introduces a multiply annotated corpus of German lyrics as a publicly available basis for multidisciplinary research. The resource contains three types of data for the investigation and evaluation of quite distinct phenomena: TEI-compliant song lyrics as primary data, linguistically and literarily motivated annotations, and extralinguistic metadata. It promotes empirically and statistically grounded analyses of genre-specific features, systemic-structural correlations and tendencies in the texts of contemporary pop music. The corpus has been stratified into thematic and author-specific archives; the paper presents some basic descriptive statistics, as well as the public online frontend with its built-in evaluation forms and live visualisations.
This paper addresses long-term archiving of large corpora. It focuses on three aspects specific to language resources, namely (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) the conversion of data to new formats for digital preservation. The paper explains why language resources may have to be changed and why formats may need to be converted. As a solution, the use of an intermediate proxy object called a signpost is suggested. The approach is exemplified with respect to the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).
This study examines asymmetries between so-called inherent and contextual categories in relation to the morphological complexity of the nominal and verbal inflectional domain of languages. The observations are traced back to the influence of adult L2 learning in scenarios of intense language contact. A method for a simple comparison of the number of inherent versus contextual categories is proposed and applied to the German-based creole language Unserdeutsch (Rabaul Creole German) in comparison to its lexifier language. The same procedure is applied to two further language pairs. The grammatical systems of Unserdeutsch and other contact languages display a noticeable asymmetry regarding their structural complexity. An analysis of different kinds of evidence suggests that the key explanatory factor is the role of (adult) L2 acquisition in the history of a language, whereby languages with periods of widespread L2 acquisition tend to lose contextual features. This impression is reinforced by general tendencies in pidgin and creole languages. Beyond that, there seems to be a tendency for inherent categories to be more strongly associated with the verb, while contextual categories seem to be more strongly associated with the noun. This leads to an asymmetry in categorical complexity between the noun phrase and the verb phrase in languages that experienced periods of intense L2 learning.
CLARIN contractual framework for sharing language data: the perspective of personal data protection
(2020)
The article analyses the responsibility for ensuring compliance with the General Data Protection Regulation (GDPR) in research settings. As a general rule, organisations are considered the data controller (the party responsible for GDPR compliance). Research constitutes a unique setting influenced by academic freedom. This raises the question of whether academics could be considered the controller as well. There are some court cases and policy documents on this issue, but it is not settled yet. The analysis serves as a preliminary analytical background for redesigning the CLARIN contractual framework for sharing data.
We present web services which implement a workflow for transcripts of spoken language following the TEI guidelines, in particular ISO 24624:2016 “Language resource management – Transcription of spoken language”. The web services are available at our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.
In this paper, we describe a schema and models which have been developed for the representation of corpora of computer-mediated communication (CMC corpora) using the representation framework provided by the Text Encoding Initiative (TEI). We characterise CMC discourse as dialogic, sequentially organised interchange between humans and point out that many features of CMC are not adequately handled by current corpus encoding schemas and tools. We formulate desiderata for a representation of CMC in encoding schemes and argue why the TEI is a suitable framework for the encoding of CMC corpora. We propose a model of basic CMC units (utterances, posts, and nonverbal activities) and the macro- and micro-level structures of interactions in CMC environments. Based on these models, we introduce CMC-core, a TEI customisation for the encoding of CMC corpora, which defines CMC-specific encoding features on the four levels of elements, model classes, attribute classes, and modules of the TEI infrastructure. The description of our customisation is illustrated by encoding examples from corpora by researchers of the TEI CMC SIG, representing a variety of CMC genres, i.e. chat, wiki talk, Twitter, blog, and Second Life interactions. The material described, i.e. schemata, encoding examples, and documentation, is available from the TEI CMC SIG wiki and will accompany a feature request to the TEI Council in late 2019.