The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to meet the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges it faces. Next, we outline our software implementation plan and describe development to date.
The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox's Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software that is available to academic institutions free of charge, e.g. Helmut Schmid's TreeTagger. The second part focuses on the linguistic interpretability of the corpus annotations and on more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that, unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory- and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher using the annotations blindly runs the risk of studying not the language itself but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology.
This paper describes the sustainability efforts of the Institut für Deutsche Sprache (IDS) in Mannheim with respect to DEREKO (Deutsches Referenzkorpus), the Archive of General Reference Corpora of Contemporary Written German. With a focus on re-usability and sustainability, we discuss its history and our future plans. We describe legal challenges related to the creation of a large and sustainable resource, sketch out the pipeline used to convert raw texts to the final corpus format, and outline migration plans to TEI P5. Because the current version of the corpus management and query system is being pushed to its limits, we discuss the requirements for a new version that will be able to handle current and future DEREKO releases. Furthermore, we outline the institute’s plans in the field of digital preservation.
This paper describes DeReKo (Deutsches Referenzkorpus), the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS) in Mannheim, and the rationale behind its development. We discuss its design, its legal background, how to access it, available metadata, linguistic annotation layers, underlying standards, ongoing developments, and aspects of using the archive for empirical linguistic research. The focus of the paper is on the advantages of DeReKo’s design as a primordial sample from which virtual corpora can be drawn for the specific purposes of individual studies. Both concepts, primordial sample and virtual corpus, are explained and illustrated in detail. Furthermore, we describe in more detail how DeReKo deals with the fact that all its texts are subject to third parties’ intellectual property rights, and how it deals with the issue of replicability, which is particularly challenging given DeReKo’s dynamic growth and the possibility of constructing from it an open number of virtual corpora.
Empirical synchronic language studies generally seek to investigate language phenomena for one point in time, even though this point in time is often not stated explicitly. To date, surprisingly little research has addressed the implications of this time-dependency of synchronic research for the composition and analysis of data that are suitable for conducting such studies. Existing solutions and practices tend to be too general to meet the needs of all kinds of research questions. In this theoretical paper, which is aimed at both corpus creators and corpus users, we propose to take a decidedly synchronic perspective on the relevant language data. Such a perspective may be realised either in terms of sampling criteria or in terms of analytical methods applied to the data. As a general approach for both realisations, we introduce and explore the FReD strategy (Frequency Relevance Decay), which models the relevance of language events from a synchronic perspective. This general strategy represents a whole family of synchronic perspectives that may be customised to meet the requirements imposed by the specific research questions and language domain under investigation.
Korpuslinguistik
(2012)
The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modern, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening up perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse extremely large amounts of primary data and annotations, while at the same time allowing an undistorted view of the primary un-annotated text, and thus fully satisfying expectations associated with a scientific tool. The project started in July 2011 and is funded until June 2014. The demo presentation in December will be the first version following a preliminary feature freeze, and will open the alpha testing phase of the project.