Refine
Year of publication
- 2015 (25) (remove)
Document Type
- Conference Proceeding (15)
- Part of a Book (4)
- Article (3)
- Book (1)
- Master's Thesis (1)
- Other (1)
Has Fulltext
- yes (25)
Keywords
- Korpus <Linguistik> (25) (remove)
Publicationstate
- Veröffentlichungsversion (25) (remove)
Reviewstate
Publisher
- Institut für Deutsche Sprache (10)
- German Society for Computational Linguistics & Language Technology (GSCL) (2)
- Association for Computational Linguistics (1)
- Association for Computational Linguistics ( ACL ); Curran Associates, Inc. (1)
- Gesellschaft für Sprachtechnologie and Computerlinguistik (1)
- International Phonetic Association (IPA) (1)
- Linköping University Electronic Press, Linköpings universitet (1)
- Stauffenburg (1)
- The Association for Computational Linguistics (1)
- de Gruyter (1)
Speakers’ linguistic experience is for the most part experience with language as used in conversational interaction. Though highly relevant for usage-based linguistics, the study of such data is as yet often left to other frameworks such as conversation analysis and interactional linguistics (Couper-Kuhlen and Selting 2001). On the basis of a case study of salient usage patterns of the two German motion verbs kommen and gehen in spontaneous conversation, the present paper argues for a methodological integration of quantitative corpus-linguistic methods with qualitative conversation analytic approaches to further the usage-based study of conversational interaction.
Opinion Holder and Target Extraction for Verb-based Opinion Predicates – The Problem is Not Solved
(2015)
We offer a critical review of the current state of opinion role extraction involving opinion verbs. We argue that neither the currently available lexical resources nor the manually annotated text corpora are sufficient to appropriately study this task. We introduce a new corpus focusing on opinion roles of opinion verbs from the Subjectivity Lexicon and show potential benefits of this corpus. We also demonstrate that state-of-the-art classifiers perform rather poorly on this new dataset compared to the standard dataset for the task showing that there still remains significant research to be done.
This article reports on the on-going CoRoLa project, aiming at creating a reference corpus of contemporary Romanian (from 1945 onwards), opened for online free exploitation by researchers in linguistics and language processing, teachers of Romanian, students. We invest serious efforts in persuading large publishing houses and other owners of IPR on relevant language data to join us and contribute the project with selections of their text and speech repositories. The CoRoLa project is coordinated by two Computer Science institutes of the Romanian Academy, but enjoys cooperation of and consulting from professional linguists from other institutes of the Romanian Academy. We foresee a written component of the corpus of more than 500 million word forms, and a speech component of about 300 hours of recordings. The entire collection of texts (covering all functional styles of the language) will be pre-processed and annotated at several levels, and also documented with standardized metadata. The pre-processing includes cleaning the data and harmonising the diacritics, sentence splitting and tokenization. Annotation will include morpho-lexical tagging and lemmatization in the first stage, followed by syntactic, semantic and discourse annotation in a later stage.
In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called texrex, wrappers for existing linguistic annotation tools as well as an online query software called Colibri2. By detailed descriptions of the implementation and systematic evaluations of the performance of the software on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of up to at least 100 billion tokens. I also introduce our running demo system which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish
Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.
Feedback utterances are among the most frequent in dialogue. Feedback is also a crucial aspect of all linguistic theories that take social interaction involving language into account. However, determining communicative functions is a notoriously difficult task both for human interpreters and systems. It involves an interpretative process that integrates various sources of information. Existing work on communicative function classification comes from either dialogue act tagging where it is generally coarse grained concerning the feed- back phenomena or it is token-based and does not address the variety of forms that feed- back utterances can take. This paper introduces an annotation framework, the dataset and the related annotation campaign (involving 7 raters to annotate nearly 6000 utterances). We present its evaluation not merely in terms of inter-rater agreement but also in terms of usability of the resulting reference dataset both from a linguistic research perspective and from a more applicative viewpoint.
Feedback utterances are among the most frequent in dialogue. Feedback is also a crucial aspect of linguistic theories that take social interaction, involving language, into account. This paper introduces the corpora and datasets of a project scrutinizing this kind of feedback utterances in French. We present the genesis of the corpora (for a total of about 16 hours of transcribed and phone force-aligned speech) involved in the project. We introduce the resulting datasets and discuss how they are being used in on-going work with focus on the form-function relationship of conversational feedback. All the corpora created and the datasets produced in the framework of this project will be made available for research purposes.
Valenz und Kookkurrenz
(2015)