OPUS 4 | Korpuslinguistik

The IVK-Ler corpus of adolescent foreign-language learners of German (2023)

This paper presents the IVK-Ler corpus, a longitudinal, annotated learner corpus of weekly writings produced by a group of 18 adolescents in a preparatory class. The corpus consists of 117 student texts collected between 2020 and 2021 and has a structure layered by student and text number. It includes metadata that enables researchers to analyze and track individual student progress in terms of syntactic competence and literacy. The annotation schema, manual and automatic annotation processes, and corpus representation are described in detail. The corpus currently includes target hypotheses and gold standard part-of-speech tags. Future work could include additional annotation layers for topological fields and dependency relations, as well as semantic and discourse annotations to make the corpus usable for tasks beyond syntactic evaluations.

Rover und TüNDRA: Such- und Visualisierungsplattformen für Wortnetze und Baumbanken [Rover and TüNDRA: Search and visualization platforms for wordnets and treebanks] (2021)

Hinrichs, Erhard ; Fischer, Patricia ; Strakatova, Yana

Suitable search and visualization tools, ideally in the form of web applications, are of great importance for user-friendly access to language resources. In this paper we present the web applications Rover and TüNDRA, which were developed at the CLARIN-D center in Tübingen as part of the BMBF project CLARIN-D.

Multilinguale Sprachressourcen für die linguistische Forschung [Multilingual language resources for linguistic research] (2021)

Hinrichs, Erhard

Language resources in digital form are available for an ever broader range of individual languages. Linguistically annotated corpora make it possible to search specifically for linguistic patterns at the word, phrase, and sentence level and to evaluate the results both quantitatively and qualitatively. In this paper I use selected examples to illustrate the added value that annotated text corpora can offer for linguistic research. Many of the language resources presented are made sustainably available within the CLARIN infrastructure. The corpora are either searchable through search portals or provided for download.

Text+: Sprach- und textbasierte Forschungsdateninfrastruktur [Text+: Language- and text-based research data infrastructure] (2020)

Hinrichs, Erhard ; Henrich, Andreas ; Rapp, Andrea ; Stein, Regine

Language Resources, Taxonomies and Metadata (2009)

Lemnitzer, Lothar ; Hinrichs, Erhard ; Witt, Andreas

In this paper we present an approach to faceted search in large language resource repositories. This kind of search, which enables users to browse the repository by choosing their own sequence of facets, relies heavily on the availability of descriptive metadata for the objects in the repository. The approach therefore informs the collection of a minimal set of metadata for language resources. The work described in this paper has been funded by the EC within the ESFRI infrastructure project CLARIN.

Connecting resources: Which issues have to be solved to integrate CMC corpora from heterogeneous sources and for different languages? (2017)

Beißwenger, Michael ; Wigham, Ciara ; Etienne, Carole ; Fišer, Darja ; Grumt Suárez, Holger ; Herzberg, Laura ; Hinrichs, Erhard ; Horsmann, Tobias ; Karlova-Bourbonus, Natali ; Lemnitzer, Lothar ; Longhi, Julien ; Lüngen, Harald ; Ho-Dac, Lydia-Mai ; Parisse, Christophe ; Poudat, Céline ; Schmidt, Thomas ; Stemle, Egon W. ; Storrer, Angelika ; Zesch, Torsten

The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices needed to facilitate both the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations, metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to take part in a discussion about standards for CMC corpora and about the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects currently underway all over Europe, there is a great window of opportunity for creating standards in a bottom-up approach.

Collecting Legally Relevant Metadata by Means of a Decision-Tree-Based Questionnaire System (2007)

Lehmberg, Timm ; Chiarcos, Christian ; Hinrichs, Erhard ; Rehm, Georg ; Witt, Andreas

Digital Text Resources for the Humanities – Legal Issues (2007)

Rehm, Georg ; Witt, Andreas ; Hinrichs, Erhard ; Lehmberg, Timm ; Chiarcos, Christian ; Zimmermann, Felix ; Zinsmeister, Heike ; Dellert, Johannes

Sustainability of annotated resources in linguistics: A web-platform for exploring, querying, and distributing linguistic corpora and other resources (2009)

Rehm, Georg ; Schonefeld, Oliver ; Witt, Andreas ; Hinrichs, Erhard ; Reis, Marga

We report on completed work in a project concerned with providing methods, tools, best-practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainability platform implemented by the authors is described.

