Refine
Year of publication
- 2020 (25) (remove)
Document Type
- Conference Proceeding (11)
- Part of a Book (9)
- Article (2)
- Book (2)
- Working Paper (1)
Has Fulltext
- yes (25)
Keywords
- Forschungsdaten (25) (remove)
Publicationstate
Reviewstate
- Peer-Review (19)
- (Verlags)-Lektorat (4)
Publisher
- CLARIN (6)
- European Language Resources Association (5)
- Linköping University Electronic Press (2)
- BDÜ, Weiterbildungs- und Fachverlagsgesellschaft mbh (1)
- Digital Curation Centre (1)
- European Language Resources Association (ELRA) (1)
- L'Harmattan (1)
- Leibniz-Institut für Deutsche Sprache (IDS) (1)
- MDPI (1)
- Narr Francke Attempto (1)
This paper reports on recent developments within the European Reference Corpus EuReCo, an open initiative that aims at providing and using virtual and dynamically definable comparable corpora based on existing national, reference or other large corpora. Given the well-known shortcomings of other types of multilingual corpora such as parallel/translation corpora (shining-through effects, over-normalization, simplification, etc.) or web-based comparable corpora (covering only web material), EuReCo provides a unique linguistic resource offering new perspectives for fine-grained contrastive research on authentic cross-linguistic data, applications in translation studies and foreign language teaching and learning.
The 12th Web as Corpus workshop (WAC-XII) looks at the past, present, and future of web corpora given the fact that large web corpora are nowadays provided mostly by a few major initiatives and companies, and the diversity of the early years appears to have faded slightly. Also, we acknowledge the fact that alternative sources of data (such as data from Twitter and similar platforms) have emerged, some of them only available to large companies and their affiliates, such as linguistic data from social media and other forms of the deep web. At the same time, gathering interesting and relevant web data (web crawling) is becoming an ever more intricate task as the nature of the data offered on the web changes (for example the death of forums in favour of more closed platforms).
In this article, we examine the current situation of data dissemination and provision for CMC corpora. By that we aim to give a guiding grid for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data management, we evaluate the 20 European CMC corpora listed in the CLARIN CMC Resource family, individuate successful strategies among the existing corpora and establish best practices for future projects. We give an overview of existing approaches to data referencing, dissemination and provision in European CMC corpora, and discuss the methods, formats and strategies used. Furthermore, we discuss the need for community standards and offer recommendations for best practices when creating a new CMC corpus.
The theme of the AFinLA 2020 Yearbook Methodological turns in applied language studies is discussed in this introductory article from three interrelated perspectives, variously addressed in the three plenary presentations at the AFinLA Autumn Symposium 2019 as well as in the thirteen contributions to the yearbook. In the first set of articles presented, the authors examine the role and impact of technological development on the study of multimodal digital and non-digital contexts and discourses and ensuing new methods. The second set of studies in the yearbook revisits issues of language proficiency, critically discussing relevant concepts and approaches. The third set of articles explores participation and participatory research approaches, reflecting on the roles of the researcher and the researched community.
Der vorliegende Beitrag stellt Konzeptualisierungen von Sprachlagen des individuellen Spektrums linguistischer ungeschulter Personen dar und vergleicht dabei erstmalig attitudinale Daten aus Wien und ruralen Regionen Österreichs. Im Fokus der Pilotstudie stehen dabei insbesondere Konzepte zu ‚Dialekt‘, ‚Hochdeutsch‘ sowie ‚Wienerisch‘, die sich in einem mehrstufigen Analyseverfahren als äußerst heterogen erweisen. Ihre Heterogenität speist sich aus einem komplexen Zusammenspiel soziodemografischer, regionaler, situativer sowie dynamischer Parameter, die kontrastiv diskutiert werden und auf distinkte Sprachlagenkonzepte hindeuten.
Dieser Beitrag beschreibt, welche Schritte nötig sind, um die Daten des Archivs der Grafen v. Platen (AGP) für Forschungsdateninfrastrukturen (FDI) zugänglich zu machen: die Daten konvertieren, die Metadaten extrahieren, Daten und Metadaten indizieren sowie die Datenmodelle für Daten und Metadaten so ergänzen, dass sie die Bestände des Archivs sinnvoll erfassen. Zugleich wird begründet, weshalb man überhaupt solchen Aufwand treiben sollte: nämlich, damit die Daten einem größeren Publikum zur Verfügung stehen und überdies mit Werkzeugen bearbeitet werden können, die in den Infrastrukturen zur Verfügung stehen, und damit eine weitere Verlinkung und Kombination mit externen Ressourcen erfolgen kann, sodass ein deutlicher Mehrwert entstehen kann.
CLARIN contractual framework for sharing language data: the perspective of personal data protection
(2020)
The article analyses the responsibility for ensuring compliance with the General Data Protection Regulation (GDPR) in research settings. As a general rule, organisations are considered the data controller (responsible party for the GDPR compliance). Research constitutes a unique setting influenced by academic freedom. This raises the question of whether academics could be considered the controller as well. However, there are some court cases and policy documents on this issue. It is not settled yet. The analysis serves a preliminary analytical background for redesigning CLARIN contractual framework for sharing data.
N-grams are of utmost importance for modern linguistics and language theory. The legal status of n-grams, however, raises many practical questions. Traditionally, text snippets are considered copyrightable if they meet the originality criterion, but no clear indicators as to the minimum length of original snippets exist; moreover, the solutions adopted in some EU Member States (the paper cites German and French law as examples) are considerably different. Furthermore, recent developments in EU law (the CJEU's Pelham decision and the new right of newspaper publishers) also provide interesting arguments in this debate. The proposed paper presents the existing approaches to the legal protection of n-grams and tries to formulate some clear guidelines as to the length of n-grams that can be freely used and shared.
The CMDI Explorer
(2020)
We present the CMDI Explorer, a tool that empowers users to easily explore the contents of complex CMDI records and to process selected parts of them with little effort. The tool allows users, for instance, to analyse virtual collections represented by CMDI records, and to send collection items to other CLARIN services such as the Switchboard for subsequent processing. The CMDI Explorer hence adds functionality that many users felt was lacking from the CLARIN tool space.
Signposts for CLARIN
(2020)
An implementation of CMDI-based signposts and its use is presented in this paper. Arnold et al. 2020 present Signposts as a solution to challenges in long-term preservation of corpora, especially corpora that are continuously extended and subject to modification, e.g., due to legal injunctions, but also may overlap with respect to constituents, and may be subject to migrations to new data formats. We describe the contribution Signposts can make to the CLARIN infrastructure and document the design for the CMDI profile.