Corpus Linguistics
Validating the Performativity Hypothesis to Neg-Raising using corpus data: Evidence from Polish
(2021)
The main aim of this contribution is to present the range of lexicographic information in LeGeDe, an electronic prototype for lexical and interactional features of spoken German. The focus lies on a detailed description of the different classes of lexicographic information, using illustrative examples and figures from the resource. In addition to highlighting the lexicographic microstructure and providing an overview of the outer texts and the multimedia content on offer, the contribution presents detailed background data on the conception of the LeGeDe resource. Innovative aspects and possible applications are outlined, and forward-looking desiderata are offered.
The Joint Science Conference (Gemeinsame Wissenschaftskonferenz) has approved the "Text+" consortium. "Text+" aims to preserve text- and language-based research data in the long term and to enable their broad use in research. After several years of preparation, the initiative is now starting and will initially be funded for five years by the German Research Foundation (Deutsche Forschungsgemeinschaft).
Die LeGeDe-Ressource: korpusbasierte lexikografische Einblicke und anwendungsorientierte Ausblicke [The LeGeDe resource: corpus-based lexicographic insights and application-oriented outlooks]
(2021)
The article presents the lexicographic online resource LeGeDe, the first corpus-based prototype for distinctive features of the lexis of German in interaction. It addresses the challenges facing this innovative project, discusses possible application-oriented uses in teaching German as a foreign language (DaF) and German as a second language (DaZ), and, by way of outlook, identifies desiderata for further work on the lexicographic codification of features specific to spoken German.
Auf dem Weg zu einer Kartographie: automatische und manuelle Analysen am Beispiel des Korpus ISW [Towards a cartography: automatic and manual analyses using the ISW corpus as an example]
(2021)
This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc), which will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC is committed to re-using existing multilingual resources as much as possible, and its design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.
This paper presents the QUEST project and describes concepts and tools that are being developed within its framework. The goal of the project is to establish quality criteria and curation criteria for annotated audiovisual language data. Building on resources previously developed by the participating institutions, QUEST also develops tools that can be used to facilitate and verify adherence to these criteria. An important focus of the project is making these tools accessible to researchers without a substantial technical background and helping them produce high-quality data. The main tools we intend to provide are a questionnaire and automatic quality assurance for depositors of language resources, both developed as web applications. They are accompanied by a knowledge base, which will contain recommendations and descriptions of best practices established in the course of the project. Conceptually, we consider three main data maturity levels in order to decide on a suitable level of strictness for the quality assurance. This division has been introduced so that a set of ideal quality criteria does not prevent researchers from depositing or even assessing their (legacy) data. The tools described in the paper are work in progress and are expected to be released by the end of the QUEST project in 2022.
The article focuses on determining the responsible parties and the division of potential liability arising from sharing language data (LD) containing personal data (PD). A key issue here is to identify who must ensure GDPR compliance. The authors aim to answer 1) whether an individual researcher is a controller and 2) whether sharing LD results in joint controllership or separate controllership (whether the data's transferee becomes the controller, the joint controller or the processor). The article also analyses the legal relations of the parties involved in data sharing and their potential liability. The final section outlines data sharing in the CLARIN context. The analysis serves as a preliminary analytical background for redesigning the CLARIN contractual framework for sharing data.
Towards comprehensive definitions of data quality for audiovisual annotated language resources
(2021)
Though digital infrastructures such as CLARIN have been successfully established and now provide large collections of digital resources, the lack of widely accepted standards for data quality and documentation still makes re-use of research data a difficult endeavour, especially for more complex resource types. The article gives a detailed overview of relevant characteristics of audiovisual annotated language resources and reviews possible approaches to data quality in terms of their suitability for the current context. Finally, various strategies are suggested in order to arrive at comprehensive and adequate definitions of data quality for this specific resource type and possibly for digital language resources in general.
N-grams are of utmost importance for modern linguistics and language technology. The legal status of n-grams, however, raises many practical questions. Traditionally, text snippets are considered copyrightable if they meet the originality criterion, but no clear indicators as to the minimum length of original snippets exist; moreover, the solutions adopted in some EU Member States (the paper cites German and French law as examples) are considerably different. Furthermore, recent developments in EU law (the CJEU's Pelham decision and the new right of press publishers) also provide interesting arguments in this debate. The paper presents the existing approaches to the legal protection of n-grams and tries to formulate some clear guidelines as to the length of n-grams that can be freely used and shared.
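For readers unfamiliar with the term, the following minimal sketch illustrates what word n-grams of a given length look like and how they are typically counted. It is not taken from the paper: the function name `word_ngrams`, the whitespace tokenization, and the sample sentence are all assumptions made purely for illustration.

```python
from collections import Counter


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return all contiguous word n-grams of a text.

    Hypothetical helper for illustration only: tokens are obtained by
    lowercasing and splitting on whitespace, which is far cruder than
    the tokenization used in real corpus tools.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.lower().split()
    # A text with fewer than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample sentence, not corpus data.
sample = "the cat sat on the mat and the cat slept"
bigram_counts = Counter(word_ngrams(sample, 2))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

The parameter `n` is the snippet length that the legal discussion turns on: the larger `n` is, the more of the original wording each extracted snippet reproduces.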