Refine
Year of publication
- 2015 (25) (remove)
Document Type
- Conference Proceeding (15)
- Part of a Book (4)
- Article (3)
- Book (1)
- Master's Thesis (1)
- Other (1)
Has Fulltext
- yes (25)
Keywords
- Korpus <Linguistik> (25) (remove)
Publicationstate
- Veröffentlichungsversion (25) (remove)
Reviewstate
Publisher
- Institut für Deutsche Sprache (10)
- German Society for Computational Linguistics & Language Technology (GSCL) (2)
- Association for Computational Linguistics (1)
- Association for Computational Linguistics ( ACL ); Curran Associates, Inc. (1)
- Gesellschaft für Sprachtechnologie and Computerlinguistik (1)
- International Phonetic Association (IPA) (1)
- Linköping University Electronic Press, Linköpings universitet (1)
- Stauffenburg (1)
- The Association for Computational Linguistics (1)
- de Gruyter (1)
The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines a series of types of data layer to represent different structures and relations within an annotated corpus; each such layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB’s existing CQP-syntax for corpus queries, and also make possible more radical departures relative not only to the current version of CWB but also to other contemporary corpus-analysis software.
Valenz und Kookkurrenz
(2015)
Speakers’ linguistic experience is for the most part experience with language as used in conversational interaction. Though highly relevant for usage-based linguistics, the study of such data is as yet often left to other frameworks such as conversation analysis and interactional linguistics (Couper-Kuhlen and Selting 2001). On the basis of a case study of salient usage patterns of the two German motion verbs kommen and gehen in spontaneous conversation, the present paper argues for a methodological integration of quantitative corpus-linguistic methods with qualitative conversation analytic approaches to further the usage-based study of conversational interaction.
Tagset und Richtlinie für das PoSTagging von Sprachdaten aus Genres internetbasierter Kommunikation
(2015)
The Czech National Corpus (CNC) is a longterm project striving for extensive and continuous mapping of the Czech language. This effort results mostly in compilation, maintenance and providing free public access to a range of various corpora with the aim to offer a diverse, representative, and high-quality data for empirical research mainly in linguistics. Since 2012, the CNC is officially recognized as a research infrastructure funded by the Czech Ministry of Education, Youth and Sports which has caused a recent shift towards user service-oriented operation of the project. All project-related resources are now integrated into the CNC research portal at http://www.korpus.cz/. Currently, the CNC has an established and growing user community of more than 4,500 active users in the Czech Republic and abroad who put almost 1,900 queries per day using one of the user interfaces. The paper discusses the main CNC objectives for each particular domain, aiming at an overview of the current situation supplemented by an outline of future plans.
In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called texrex, wrappers for existing linguistic annotation tools as well as an online query software called Colibri2. By detailed descriptions of the implementation and systematic evaluations of the performance of the software on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of up to at least 100 billion tokens. I also introduce our running demo system which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish
Contents:
1. Michal Křen: Recent Developments in the Czech National Corpus, S. 1
2. Dan Tufiş, Verginica Barbu Mititelu, Elena Irimia, Stefan Dumitrescu, Tiberiu Boros, Horia Nicolai Teodorescu: CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language, S. 5
3. Sebastian Buschjäger, Lukas Pfahler, Katharina Morik: Discovering Subtle Word Relations in Large German Corpora, S. 11
4. Johannes Graën, Simon Clematide: Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora, S. 15
5. Stefan Evert, Andrew Hardie: Ziggurat: A new data model and indexing format for large annotated text corpora, S. 21
6. Roland Schäfer: Processing and querying large web corpora with the COW14 architecture, S. 28
7. Jochen Tiepmar: Release of the MySQL-based implementation of the CTS protocol, S. 35
Opinion Holder and Target Extraction for Verb-based Opinion Predicates – The Problem is Not Solved
(2015)
We offer a critical review of the current state of opinion role extraction involving opinion verbs. We argue that neither the currently available lexical resources nor the manually annotated text corpora are sufficient to appropriately study this task. We introduce a new corpus focusing on opinion roles of opinion verbs from the Subjectivity Lexicon and show potential benefits of this corpus. We also demonstrate that state-of-the-art classifiers perform rather poorly on this new dataset compared to the standard dataset for the task showing that there still remains significant research to be done.