Refine
Document Type
- Part of a Book (7)
- Conference Proceeding (6)
- Article (1)
Has Fulltext
- yes (14)
Keywords
- Sprachdaten (14) (remove)
Publicationstate
- Veröffentlichungsversion (10)
- Postprint (3)
- Zweitveröffentlichung (3)
Reviewstate
- (Verlags)-Lektorat (8)
- Peer-Review (6)
Publisher
- de Gruyter (3)
- European Language Resources Association (ELRA) (2)
- Clarin (1)
- European language resources association (ELRA) (1)
- Nyelvtudományi Kutatóközpont / Hungarian Research Centre for Linguistics (1)
- Oxford University Press (1)
- Technische Informationsbibliothek (1)
- University of Oulu (1)
- Universität Leipzig (1)
- Wilhelm Fink (1)
The Leibniz-Institute for the German Language (IDS) was established in Mannheim in 1964. Since then, it has been at the forefront of innovation in German linguistics as a hub for digital language data. This chapter presents various lessons learnt from over five decades of work by the IDS, ranging from the importance of sustainability, through its strong technical base and FAIR principles, to the IDS’ role in national and international cooperation projects and its expertise on legal and ethical issues related to language resources and language technology.
Das vom BMBF geförderte Verbundprojekt CLARIAH-DE, an dem über 25 Partnerinstitutionen mitwirken, unter ihnen auch das IDS, hat zum Ziel, mit der Entwicklung einer Forschungsinfrastruktur zahlreiche Angebote zur Verfügung zu stellen, die die Bedingungen der Forschungsarbeit mit digitalen Werkzeugen, Diensten sowie umfangreichen Datenbeständen im Bereich der geisteswissenschaftlichen Forschung und benachbarter Disziplinen verbessern. Die in CLARIAH-DE entwickelte Infrastruktur bietet den Forschenden Unterstützung bei der Analyse und Aufbereitung von Sprachdaten für linguistische Untersuchungen in unterschiedlichsten Anwendungskontexten und leistet somit einen Beitrag zur Entwicklung der NFDI.
The Meta-data-Database of a Next Generation Sustainability Web-Platform for Language Resources
(2008)
Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.
We report on finished work in a project that is concerned with providing methods, tools, best practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainability platform implemented by the authors is described.
Ethical issues in Language Resources and Language Technology are often invoked, but rarely discussed. This is at least partly because little work has been done to systematize ethical issues and principles applicable in the fields of Language Resources and Language Technology. This paper provides an overview of ethical issues that arise at different stages of Language Resources and Language Technology development, from the conception phase through the construction phase to the use phase. Based on this overview, the authors propose a tentative taxonomy of ethical issues in Language Resources and Language Technology, built around five principles: Privacy, Property, Equality, Transparency and Freedom. The authors hope that this tentative taxonomy will facilitate ethical assessment of projects in the field of Language Resources and Language Technology, and structure the discussion on ethical issues in this domain, which may eventually lead to the adoption of a universally accepted Code of Ethics of the Language Resources and Language Technology community.
Open Science and language data: Expectations vs. reality. The role of research data infrastructures
(2023)
Language data are essential for any scientific endeavor. However, unlike numerical data, language data are often protected by copyright, as they easily meet the threshold of originality. The role of research infrastructures (such CLARIN, DARIAH, and Text+) is to bridge the gap between uses allowed by statutory exceptions and the requirements of Open Science. This is achieved on the one hand by sharing language data produced by research organisations with the widest possible circle of persons, and on the other by mutualizing efforts towards copyright clearance and appropriate licensing of datasets.
Was darf die sprachwissenschaftliche Forschung? Juristische Fragen bei der Arbeit mit Sprachdaten
(2022)
Sich in der Linguistik mit rechtlichen Themen beschäftigen zu müssen, ist auf den ersten Blick überraschend. Da jedoch in den Sprachwissenschaften empirisch gearbeitet wird und Sprachdaten, insbesondere Texte und Ton- und Videoaufnahmen sowie Transkripte gesprochener Sprache, in den letzten Jahren auch verstärkt Sprachdaten internetbasierter Kommunikation, als Basis für die linguistische Forschung dienen, müssen rechtliche Rahmenbedingungen für jede Art von Datennutzung beachtet werden. Natürlich arbeiten auch andere Wissenschaften, wie z. B. die Astronomie oder die Meteorologie, empirisch. Jedoch gibt es einen grundsätzlichen Unterschied der empirischen Basis: Im Gegensatz zu Temperaturen, die gemessen, oder Konstellationen von Himmelskörpern, die beobachtet werden, basieren Sprachdaten auf schriftlichen, mündlichen oder gebärdeten Äußerungen von Menschen, wodurch sich juristisch begründete Beschränkungen ihrer Nutzung ergeben.
This abstract discusses the possibility to adopt a CLARIN Data Protection Code of Conduct pursuant art. 40 of the General Data Protection Regulation. Such a code of conduct would have important benefits for the entire language research community. The final section of this abstract proposes a roadmap to the CLARIN Data Protection Code of Conduct, listing various stages of its drafting and approval procedures.