The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines a series of data layer types that represent different structures and relations within an annotated corpus; each layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB’s existing CQP syntax for corpus queries, and will also make possible more radical departures, relative not only to the current version of CWB but also to other contemporary corpus-analysis software.
Complex linguistic phenomena, such as Clitic Climbing (CC) in Bosnian, Croatian and Serbian (BCS), are often described intuitively, only from the perspective of the main tendency. In this paper, we argue that web corpora currently offer the best source of empirical material for studying CC in BCS. They thus allow the most accurate description of this phenomenon, as less frequent constructions can be tracked only in big, well-annotated data sources. We compare the properties of web corpora for BCS with traditional sources and give examples of studies on CC based on web corpora. Furthermore, we discuss problems related to web corpora and suggest some improvements for the future.
This special issue is devoted to several “unwieldy” phenomena of German that pose challenges both for usage-oriented grammaticography and for more theory-based linguistics, because they do not quite fit the established explanatory patterns and category grids, or because they appear to violate supposedly general principles and rules. Long discriminated against as “exceptions” by grammars (and their users), and for a long time left as “marginal phenomena” in the blind spot of a syntax-centered modern linguistics, they have recently been receiving renewed attention, not least from theoretically oriented linguistics.
Sachkunde erwünscht
(1990)
Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora.
The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognize 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.
In a project called "A Library of a Billion Words" we needed an implementation of the CTS protocol that is capable of handling a text collection containing at least 1 billion words. Because the existing solutions did not work at this scale or were still in development, I started an implementation of the CTS protocol using methods that MySQL provides. Last year we published a paper that introduced a prototype with the core functionalities that was not yet compliant with the CTS specifications (Tiepmar et al., 2013). The purpose of this paper is to describe and evaluate the MySQL-based implementation now that it fulfills version 5.0 rc.1 of the specifications, and to mark it as finished and ready to use. Further information, online instances of CTS for all described datasets, and binaries can be accessed via the project's website.
The Czech National Corpus (CNC) is a long-term project striving for extensive and continuous mapping of the Czech language. This effort consists mostly of compiling, maintaining and providing free public access to a range of corpora, with the aim of offering diverse, representative, and high-quality data for empirical research, mainly in linguistics. Since 2012, the CNC has been officially recognized as a research infrastructure funded by the Czech Ministry of Education, Youth and Sports, which has caused a recent shift towards user service-oriented operation of the project. All project-related resources are now integrated into the CNC research portal at http://www.korpus.cz/. Currently, the CNC has an established and growing community of more than 4,500 active users in the Czech Republic and abroad, who submit almost 1,900 queries per day through one of the user interfaces. The paper discusses the main CNC objectives for each particular domain, aiming at an overview of the current situation supplemented by an outline of future plans.
In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called texrex, wrappers for existing linguistic annotation tools, and online query software called Colibri2. Through detailed descriptions of the implementation and systematic evaluations of the performance of the software on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of up to at least 100 billion tokens. I also introduce our running demo system, which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish.