Refine
Year of publication
Document Type
- Conference Proceeding (52)
- Article (22)
- Part of a Book (15)
- Review (2)
- Book (1)
- Part of Periodical (1)
- Working Paper (1)
Is part of the Bibliography
- no (94) (remove)
Keywords
- Korpus <Linguistik> (94) (remove)
Publicationstate
- Veröffentlichungsversion (94) (remove)
Reviewstate
- Peer-Review (94) (remove)
Publisher
- Institut für Deutsche Sprache (16)
- IDS-Verlag (13)
- Association for Computational Linguistics (9)
- Leibniz-Institut für Deutsche Sprache (8)
- European Language Resources Association (3)
- GSCL (3)
- Schmidt (3)
- Gesellschaft für Sprachtechnologie und Computerlinguistik (2)
- LiU Electronic Press (2)
- Universitäts- und Landesbibliothek Darmstadt (2)
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.
Ph@ttSessionz and Deutsch heute are two large German speech databases. They were created for different purposes: Ph@ttSessionz to test Internet-based recordings and to adapt speech recognizers to the voices of adolescent speakers, Deutsch heute to document regional variation of German. The databases differ in their recording technique, the selection of recording locations and speakers, elicitation mode, and data processing.
In this paper, we outline how the recordings were performed, how the data was processed and annotated, and how the two databases were imported into a single relational database system. We present acoustical measurements on the digit items of both databases. Our results confirm that the elicitation technique affects the speech produced, that f0 is quite comparable despite different recording procedures, and that large speech technology databases with suitable metadata may well be used for the analysis of regional variation of speech.
We present an implemented XML data model and a new, simplified query language for multi-level annotated corpora. The new query language involves automatic conversion of queries into the underlying, more complicated MMAXQL query language. It supports queries for sequential and hierarchical, but also associative (e.g. coreferential) relations. The simplified query language has been designed with non-expert users in mind.
We present a gold standard for semantic relation extraction in the food domain for German. The relation types that we address are motivated by scenarios for which IT applications present a commercial potential, such as virtual customer advice in which a virtual agent assists a customer in a supermarket in finding those products that satisfy their needs best. Moreover, we focus on those relation types that can be extracted from natural language text corpora, ideally content from the internet, such as web forums, that are easy to retrieve. A typical relation type that meets these requirements are pairs of food items that are usually consumed together. Such a relation type could be used by a virtual agent to suggest additional products available in a shop that would potentially complement the items a customer has already in their shopping cart. Our gold standard comprises structural data, i.e. relation tables, which encode relation instances. These tables are vital in order to evaluate natural language processing systems that extract those relations.
We apply a decision tree based approach to pronoun resolution in spoken dialogue. Our system deals with pronouns with NP- and non-NP-antecedents. We present a set of features designed for pronoun resolution in spoken dialogue and determine the most promising features. We evaluate the system on twenty Switchboard dialogues and show that it compares well to Byron’s (2002) manually tuned system.
The Manatee corpus management system on which the Sketch Engine is built is efficient, but unable to harness the power of today’s multiprocessor machines. We describe a new, compatible implementation of Manatee which we develop in the Go language and report on the performance gains that we obtained.
Im Beitrag werden ausgewählte semantische und syntaktische Eigenschaften von AcI-Konstruktionen bei Wahrnehmungsverben im Deutschen, Italienischen und Ungarischen anhand einer Korpusanalyse dargestellt. Dabei wird in erster Linie auf Eigenschaften eingegangen, denen in der bisherigen Forschung wenig Aufmerksamkeit gewidmet wurde. Das Hauptziel ist, syntaktische Eigenschaften der Konstruktion aufzudecken, die sich von den Eigenschaften von Sätzen mit einer weniger markierten syntaktischen Struktur unterscheiden. Des Weiteren wird auch auf den Grammatikalisierungsgrad der Konstruktion in den einzelnen Vergleichssprachen eingegangen.
This paper presents Release 2.0 of the SALSA corpus, a German resource for lexical semantics. The new corpus release provides new annotations for German nouns, complementing the existing annotations of German verbs in Release 1.0. The corpus now includes around 24,000 sentences with more than 36,000 annotated instances. It was designed with an eye towards NLP applications such as semantic role labeling but will also be a useful resource for linguistic studies in lexical semantics.
We describe a simple and efficient Java object model and application programming interface (API) for (possibly multi-modal) annotated natural language corpora. Corpora are represented as elements like Sentences, Turns, Utterances, Words, Gestures and Markables. The API allows linguists to access corpora in terms of these discourse-level elements, i.e. at a conceptual level they are familiar with, with the flexibility offered by a general purpose programming language. It is also a contribution to corpus standardization efforts because it is based on a straightforward and easily extensible data model which can serve as a target for conversion of different corpus formats.