The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines several types of data layer to represent different structures and relations within an annotated corpus; each such layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB’s existing CQP syntax for corpus queries, and will also make possible more radical departures, not only from the current version of CWB but also from other contemporary corpus-analysis software.
Complex linguistic phenomena, such as clitic climbing (CC) in Bosnian, Croatian and Serbian (BCS), are often described intuitively, only from the perspective of the main tendency. In this paper, we argue that web corpora currently offer the best source of empirical material for studying CC in BCS. They thus allow the most accurate description of this phenomenon, as less frequent constructions can be tracked only in big, well-annotated data sources. We compare the properties of web corpora for BCS with traditional sources and give examples of studies on CC based on web corpora. Furthermore, we discuss problems related to web corpora and suggest some improvements for the future.
One of the specific historical and cultural characteristics of Russian political discourse is its orientation towards precedents. It is considered correct to follow the behavioural models set by one of the “heroes” (Peter I, Lenin, Stalin, etc.), to reproduce standard texts, and to compare present situations with past ones (the Time of Troubles, the Weimar Republic, the NEP or “New Economic Policy” (1921-1928), etc.). One of the peculiarities of the present time in Russia is the deep conflict between different social groups oriented towards different precedents. Each group has its own variant of the national myth and uses the same linguistic means to actualise it. Therefore, it is very important to analyse changes in the national cognitive foundation. Precedential phenomena are the central components of this foundation.
Who understands Low German today and who can speak it? Who makes use of media and cultural events in Low German? What images do people in northern Germany associate with Low German and what is their view of their regional language?
These and further questions are answered in this brochure with the help of representative data collected in a telephone survey of a total of 1,632 people from eight federal states (Bremen, Hamburg, Lower Saxony, Mecklenburg-West Pomerania and Schleswig-Holstein as well as Brandenburg, North Rhine-Westphalia and Saxony-Anhalt).
Discourse analysis in general, and media discourse analysis in particular, are currently attracting increased attention from linguists. This interest can be seen in the tendency to apply the term ‘discourse’ to various sciences and academic disciplines. It is possible to trace its dispersion both horizontally, i.e. in different sciences, and vertically, i.e. on various linguistic levels. Furthermore, the majority of interpretations of the term ‘discourse’ appearing in the works of modern scholars have arisen as a result of the interdisciplinary nature of language study within the cognitive paradigm in linguistics.
Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora.
The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognize 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.
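As a reading aid for the evaluation terms used above, the following is a minimal sketch of how precision and recall are computed for a binary spam classifier. It is not the code used in the study; the function name `precision_recall`, the labels "spam" and "ham", and the toy data are illustrative assumptions. Under this definition, "recognizing 71% of web spam documents" would correspond to a recall of 0.71 on the spam class.

```python
def precision_recall(gold, predicted, positive="spam"):
    """Return (precision, recall) for the class named by `positive`.

    `gold` and `predicted` are equal-length sequences of labels.
    The label names are hypothetical, chosen for illustration only.
    """
    pairs = list(zip(gold, predicted))
    # True positives: spam documents that were flagged as spam.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: clean documents wrongly flagged as spam.
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: spam documents the classifier missed.
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented, not taken from the paper).
gold = ["spam", "spam", "spam", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham"]
precision, recall = precision_recall(gold, predicted)
```

Low precision means that clean texts are removed along with spam; low recall means that spam remains in the corpus. Both matter when a filter is applied to a corpus before keyword analysis.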
In a project called "A Library of a Billion Words" we needed an implementation of the CTS protocol that is capable of handling a text collection containing at least 1 billion words. Because the existing solutions did not work at this scale or were still in development, I started an implementation of the CTS protocol using methods that MySQL provides. Last year we published a paper that introduced a prototype with the core functionalities without being compliant with the specifications of CTS (Tiepmar et al., 2013). The purpose of this paper is to describe and evaluate the MySQL-based implementation now that it fulfils version 5.0 rc.1 of the specifications, and to mark it as finished and ready to use. Further information, online instances of CTS for all described datasets, and binaries can be accessed via the project's website.
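For readers unfamiliar with CTS, texts and passages are addressed by CTS URNs of the general form `urn:cts:<namespace>:<work>[:<passage>]`. The sketch below splits such a URN into its parts. It is a simplified illustration, not the MySQL-based implementation described in the abstract: the names `CtsUrn` and `parse_cts_urn` are hypothetical, and features such as subreferences are deliberately ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    namespace: str                # e.g. "greekLit"
    work_parts: Tuple[str, ...]   # text group, work, optional version, ...
    passage: Optional[str]        # citation or range, e.g. "1.1-1.10"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy and passage.

    Simplified sketch: only the colon-separated top-level structure
    is validated; passage ranges are kept as an unparsed string.
    """
    fields = urn.split(":")
    if len(fields) not in (4, 5) or fields[0] != "urn" or fields[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = fields[2], fields[3]
    if not namespace or not work:
        raise ValueError(f"missing namespace or work component: {urn!r}")
    # The passage component is optional; an empty one is treated as absent.
    passage = fields[4] if len(fields) == 5 and fields[4] else None
    return CtsUrn(namespace, tuple(work.split(".")), passage)


example = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1-1.10")
```

Here `example.work_parts` holds the text group and work identifiers, and `example.passage` holds the requested range. A server would resolve these against its stored text collection.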