Korpuslinguistik
Refine
Year of publication
- 2017 (27) (remove)
Document Type
- Conference Proceeding (17)
- Part of a Book (6)
- Article (1)
- Book (1)
- Other (1)
- Working Paper (1)
Language
- English (27) (remove)
Keywords
- Korpus <Linguistik> (27)
- Corpus linguistics (11)
- Corpus technology (6)
- Texttechnologie (6)
- Datenmanagement (4)
- Internet (4)
- Deutsch (3)
- Englisch (3)
- Prädikat (3)
- Web corpora (3)
Publicationstate
- Veröffentlichungsversion (24)
- Postprint (2)
Reviewstate
- Peer-Review (20)
- Peer-review (4)
Publisher
- Institut für Deutsche Sprache (12)
- Leibniz-Institut für Deutsche Sprache (IDS) (2)
- University of Birmingham (2)
- Charles University (1)
- De Gruyter (1)
- Editions Tradulex (1)
- European Network of e-Lexicography (ENeL) (1)
- Izdatel´stvo Sankt-Peterburgskogo gosudarstvennogo universiteta (1)
- Lexical Computing CZ s.r.o. (1)
- Linköping University (1)
Making 1:n explorable: a search interface for the ZAS database of clause-embedding predicates
(2017)
We introduce a recently published corpus-based database of German clause-embedding predicates and present an innovative web application for exploring it. The application displays the predicates and the corpus examples for these predicates in two separate tables that can be browsed and searched in real time. While familiar web interface paradigms make it easy for users to get started, the data presentation and the interactive advanced search components for the two tables are designed to accommodate remarkably complex query needs without the need for resorting to a dedicated query language or a more specialized tool. The 1:n relationship between predicates and their examples is exploited in the two tables in that, e.g. the predicate table also shows, for each predicate and each example attribute, all values that occur in the examples for this predicate. An easy-to-use visual query builder for arbitrary Boolean combinations of search criteria can optionally be displayed to pre-filter the underlying data presented in both tables. Several options for altering quantifier scope can be activated with simple checkboxes and considerably widen the space of searchable constellations.
CoMParS is a resource under construction in the context of the long-term project German Grammar in European Comparison (GDE) at the IDS Mannheim. The principal goal of GDE is to create a novel contrastive grammar of German against the background of other European languages. Alongside German, which is the central focus, the core languages for comparison are English, French, Hungarian and Polish, representing different typological classes. Unlike traditional contrastive grammars available for German, which usually cover language pairs and are based on formal grammatical categories, the new GDE grammar is developed in the spirit of functionalist typology. This implies that, instead of formal criteria, cognitively motivated functional domains in terms of Givón (1984) are used as tertia comparationis. The purpose of CoMParS is to document the empirical basis of the theoretical assumptions of GDE-V and to illustrate the otherwise rather abstract content of grammar books by as many as possible naturally occurring and adequately presented multilingual examples, including information on their use in specific contexts and registers. These examples come from existing parallel corpora, and our presentation will focus on the legal aspects and consequences of this choice of language data.
Our paper describes an experiment aimed to assessment of lexical coverage in web corpora in comparison with the traditional ones for two closely related Slavic languages from the lexicographers’ perspective. The preliminary results show that web corpora should not be considered ― inferior, but rather ― different.
Contents:
1. Andreas Dittrich: Intra-connecting a small exemplary literary corpus with semantic web technologies for exploratory literary studies, S. 1
2. John Kirk, Anna Čermáková: From ICE to ICC: The new International Comparable Corpus, S. 7
3. Dawn Knight, Tess Fitzpatrick, Steve Morris, Jeremy Evas, Paul Rayson, Irena Spasic, Mark Stonelake, Enlli Môn Thomas, Steven Neale, Jennifer Needs, Scott Piao, Mair Rees, Gareth Watkins, Laurence Anthony, Thomas Michael Cobb, Margaret Deuchar, Kevin Donnelly, Michael McCarthy, Kevin Scannell: Creating CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh), S. 13
4. Marc Kupietz, Andreas Witt, Piotr Bański, Dan Tufiş, Dan Cristea, Tamás Váradi: EuReCo - Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research, S. 15
5. Harald Lüngen, Marc Kupietz: CMC Corpora in DeReKo, S. 20
6. David McClure, Mark Algee-Hewitt, Douris Steele, Erik Fredner, Hannah Walser: Organizing corpora at the Stanford Literary Lab, S. 25
7. Radoslav Rábara, Pavel Rychlý ,Ondřej Herman: Accelerating corpus search using multiple cores, S. 30
8. John Vidler, Stephen Wattam: Keeping Properties with the Data: CL-MetaHeaders – An Open Specification, S. 35
9. Vladimir Benko: Are Web Corpora Inferior? The Case of Czech and Slovak, S. 43
10. Edyta Jurkiewicz-Rohrbacher, Zrinka Kolaković, Björn Hansen: Web Corpora – the best possible solution for tracking phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian, S. 49
11. Vít Suchomel: Removing Spam from Web Corpora Through Supervised Learning Using FastText, S. 56
Unlike traditional text corpora collected from trustworthy sources, the content of web based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer generated text from web corpora.
The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognize 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.
Complex linguistic phenomena, such as Clitic Climbing in Bosnian, Croatian and Serbian, are often described intuitively, only from the perspective of the main tendency. In this paper, we argue that web corpora currently offer the best source of empirical material for studying Clitic Climbing in BCS. They thus allow the most accurate description of this phenomenon, as less frequent constructions can be tracked only in big, well-annotated data sources. We compare the properties of web corpora for BCS with traditional sources and give examples of studies on CC based on web corpora. Furthermore, we discuss problems related to web corpora and suggest some improvements for the future.
CMC Corpora in DeReKo
(2017)
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in DeReKo, the German Reference Corpus, namely Wikipedia (discussion) corpora, the Usenet news corpus, and the Dortmund Chat Corpus. The data and corpora have been converted to I5, the TEI customization to represent texts in DeReKo, and are researchable via the web-based IDS corpus research interfaces and in the case of Wikipedia and chat also downloadable from the IDS repository and download server, respectively.
Creating CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes - The National Corpus of Contemporary Welsh)
(2017)
CorCenCC is an interdisciplinary and multiinstitutional project that is creating a large-scale, open-source corpus of contemporary Welsh. CorCenCC will be the first ever large-scale corpus to represent spoken, written and electronicallymediated Welsh (compiling an initial data set of 10 million Welsh words), with a functional design informed, from the outset, by representatives of all anticipated academic and community user groups.
Corpus researchers, along with many other disciplines in science are being put under continual pressure to show accountability and reproducibility in their work. This is unsurprisingly difficult when the researcher is faced with a wide array of methods and tools through which to do their work; simply tracking the operations done can be problematic, especially when toolchains are often configured by the developers, but left largely as a black box to the user. Here we present a scheme for encoding this ‘meta data’ inside the corpus files themselves in a structured data format, along with a proof-of-concept tool to record the operations performed on a file.