Computational Linguistics
The aim of this project is to collect language data as close to the present moment as possible and to make it analysable. We want as many people as possible, not only linguists, to be able to explore and use language data. To this end, we compile a corpus, i.e. a curated collection of language data, from the RSS feeds of German-language online sources. We trace the development of the analysis tools from a prototype to the current form of the application, which is a complete reimplementation. In doing so, we discuss the architecture, some example analyses, and options for extension. Questions of scalability and performance are at the centre of the discussion. Our account can therefore be generalised to other data science projects.
The article entitled «Traitement de l’information: Spinfo, HKI et humanités numériques - l’expérience de Cologne» presents the history of the development of the digital humanities at the University of Cologne. The institutionalisation of the digital humanities began at a time when, in the German-speaking world, the scope of the discipline was still being defined by the work of a few pioneers. Among them, the role of Elisabeth Burr, who was active in Tübingen, Duisburg, Bremen and Leipzig among other places, deserves particular mention. The article traces the development of the digital humanities in Cologne from their beginnings in the 1960s, through their consolidation in the 1990s, to the last two decades, during which Cologne became an important centre of the discipline. The process illustrates how a new scientific discipline can become institutionalised within a German university. The article describes the perspective of two founding fields, linguistic information processing (in German: Sprachliche Informationsverarbeitung, Spinfo) and historical-cultural information processing (in German: Historisch Kulturwissenschaftliche Informationsverarbeitung, HKI), and their synthesis, which led in 2017 to the creation of the Institute for Digital Humanities. Today the institute is, from an internal point of view, part of the Faculty of Philosophy of the University of Cologne and, from an external point of view, an integral part of the international digital humanities community.
In this article, we examine the current situation of data dissemination and provision for CMC corpora. In doing so, we aim to provide a guiding framework for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data management, we evaluate the 20 European CMC corpora listed in the CLARIN CMC Resource family, identify successful strategies among the existing corpora and establish best practices for future projects. We give an overview of existing approaches to data referencing, dissemination and provision in European CMC corpora, and discuss the methods, formats and strategies used. Furthermore, we discuss the need for community standards and offer recommendations for best practices when creating a new CMC corpus.
The paper presents the results of empirical research conducted with students from the Faculty of Translation Studies of Ventspils University of Applied Sciences (VUAS) in Latvia. The study investigates the habits and practices of translation students in using dictionaries, including the types of dictionaries used, frequency of use, etc. The study also offers insight into how Latvian students evaluate the usefulness of dictionaries. The research describes the advantages and disadvantages of the dictionaries used by the respondents, the importance of the preface and the explanation of the terms and abbreviations used in dictionaries. The research conducted, as well as the insights, results and recommendations presented, will be relevant for the lexicographic community, as it reflects the experience of one Latvian university in improving the teaching of dictionary use and lexicographic culture in the country and complements dictionary use research with the Latvian experience.
This paper describes a method for extracting collocation data from text corpora based on a formal definition of syntactic structures. The method takes into account not only the POS-tagging level of annotation but also syntactic parsing (syntactic treebank model), and it introduces the possibility of controlling the canonical form of extracted collocations based on statistical data on forms with different properties in the corpus. Specifically, we describe the results of extraction from the syntactically tagged Gigafida 2.1 corpus. Using the new method, 4,002,918 collocation candidates in 81 syntactic structures were extracted. We evaluate the extracted data sample in more detail, mainly in relation to properties that affect the extraction of canonical forms: definiteness in adjectival collocations, grammatical number in noun collocations, comparison in adjectival and adverbial collocations, and letter case (uppercase and lowercase) in canonical forms. The conclusion highlights the potential of the methodology used for the grammatical description of collocation and phrasal syntax and the possibilities for improving the model in the process of compilation of a digital dictionary database for Slovene.
The normative layer of CLARIN is, alongside the organizational and technical layers, an essential part of the infrastructure. It consists of the regulatory framework (statutory law, case law, authoritative guidelines, etc.), the contractual framework (licenses, terms of service, etc.), and ethical norms. Navigating the normative layer requires expertise, experience, and qualified effort. In order to advise the Board of Directors, a standing committee dedicated to legal and ethical issues, the CLIC, was created. Since its establishment in 2012, the CLIC has made considerable efforts to provide not only the BoD but also the general public with information and guidance. It has published many articles (both in proceedings of CLARIN conferences and in its own White Paper Series) and developed several LegalTech tools. It also runs a Legal Information Platform, where accessible information on various issues affecting language resources can be found.
This chapter will present lessons learned from CLARIN-D, the German CLARIN national consortium. Members of the CLARIN-D communities and of the CLARIN-D consortium have been engaged in innovative, data-driven, and community-based research, using language resources and tools in the humanities and neighbouring disciplines. We will present different use cases and users’ stories that demonstrate the innovative research potential of large digital corpora and lexical resources for the study of language change and variation, for language documentation, for literary studies, and for the social sciences. We will emphasize the added value of making language resources and tools available in the CLARIN distributed research infrastructure and will discuss legal and ethical issues that need to be addressed in the use of such an infrastructure. Innovative technical solutions for accessing digital materials still under copyright and for data mining such materials will be presented. We will outline the need for close interaction with communities of interest in the areas of curriculum development, data management, and training the next generation of digital humanities scholars. The importance of community-supported standards for encoding language resources and the practice of community-based quality control for digital research data will be presented as a crucial step toward the provisioning of high quality research data. The chapter will conclude with a discussion of important directions for innovative research and for supporting infrastructure development over the next decade and beyond.
Standards in CLARIN
(2022)
This chapter looks at a fragment of the ongoing work of the CLARIN Standards Committee (CSC) on producing a shared set of recommendations on standards, formats, and related best practices supported by the CLARIN infrastructure and its participating centres. What might at first glance seem to be a straightforward goal has over the years proven to be rather complex, reflecting the robustness and heterogeneity of the emerging distributed digital research infrastructure and the various disciplines and research traditions of the language-based humanities that it serves and represents. Part of the chapter therefore reviews the various initiatives and proposals that strove to produce helpful standards-related guidance. The focus turns next to a subtask initiated in late 2019, its scope narrowed to one of the core activities and responsibilities of CLARIN backbone centres, namely the provision of data deposition services. Centres are obligated to publish their recommendations concerning the repertoire of data formats that are best suited for their research profiles. We look at how this requirement has been met by the particular centres and suggest that having centres maintain their information in the Standards Information System (SIS) is the way to improve on the current state of affairs.
Preface
(2022)
CLARIN stands for “Common Language Resources and Technology Infrastructure”. In 2012 CLARIN ERIC was established as a legal entity with the mission to create and maintain a digital infrastructure to support the sharing, use, and sustainability of language data (in written, spoken, or multimodal form) available through repositories from all over Europe, in support of research in the humanities and social sciences and beyond. Since 2016 CLARIN has had the status of Landmark research infrastructure and currently it provides easy and sustainable access to digital language data and also offers advanced tools to discover, explore, exploit, annotate, analyse, or combine such datasets, wherever they are located. This is enabled through a networked federation of centres: language data repositories, service centres, and knowledge centres with single sign-on access for all members of the academic community in all participating countries. In addition, CLARIN offers open access facilities for other interested communities of use, both inside and outside of academia. Tools and data from different centres are interoperable, so that data collections can be combined and tools from different sources can be chained to perform operations at different levels of complexity. The strategic agenda adopted by CLARIN and the activities undertaken are rooted in a strong commitment to the Open Science paradigm and the FAIR data principles. This also enables CLARIN to express its added value for the European Research Area and to act as a key driver of innovation and contributor to the increasing number of industry programmes running on data-driven processes and the digitalization of society at large.