Refine
Year of publication
- 2014 (20) (remove)
Document Type
- Conference Proceeding (9)
- Part of a Book (7)
- Article (4)
Has Fulltext
- yes (20)
Keywords
- Korpus <Linguistik> (20) (remove)
Publicationstate
- Veröffentlichungsversion (20) (remove)
Reviewstate
- (Verlags)-Lektorat (11)
- Peer-Review (5)
Publisher
Annotating Spoken Language
(2014)
Automatic Food Categorization from Large Unlabeled Corpora and Its Impact on Relation Extraction
(2014)
We present a weakly-supervised induction method to assign semantic information to food items. We consider two tasks of categorizations being food-type classification and the distinction of whether a food item is composite or not. The categorizations are induced by a graph-based algorithm applied on a large unlabeled domain-specific corpus. We show that the usage of a domain-specific corpus is vital. We do not only outperform a manually designed open-domain ontology but also prove the usefulness of these categorizations in relation extraction, outperforming state-of-the-art features that include syntactic information and Brown clustering.
In this paper, we present the concept and the results of two studies addressing (potential) users of monolingual German online dictionaries, such as www.elexiko.de. Drawing on the example of elexiko, the aim of those studies was to collect empirical data on possible extensions of the content of monolingual online dictionaries, e.g. the search function, to evaluate how users comprehend the terminology of the user interface, to find out which types of information are expected to be included in each specific lexicographic module and to investigate general questions regarding the function and reception of examples illustrating the use of a word. The design and distribution of the surveys is comparable to the studies described in the chapters 5-8 of this volume. We also explain, how the data obtained in our studies were used for further improvement of the elexiko-dictionary.
We discovered several recurring errors in the current version of the Europarl Corpus originating both from the web site of the European Parliament and the corpus compilation based thereon. The most frequent error was incompletely extracted metadata leaving non-textual fragments within the textual parts of the corpus files. This is, on average, the case for every second speaker change. We not only cleaned the Europarl Corpus by correcting several kinds of errors, but also aligned the speakers’ contributions of all available languages and compiled every- thing into a new XML-structured corpus. This facilitates a more sophisticated selection of data, e.g. querying the corpus for speeches by speakers of a particular political group or in particular language combinations.
We compare several different corpus- based and lexicon-based methods for the scalar ordering of adjectives. Among them, we examine for the first time a low- resource approach based on distinctive- collexeme analysis that just requires a small predefined set of adverbial modifiers. While previous work on adjective intensity mostly assumes one single scale for all adjectives, we group adjectives into different scales which is more faithful to human perception. We also apply the methods to both polar and non-polar adjectives, showing that not all methods are equally suitable for both types of adjectives.
This contribution presents the newest version of our ’Wortverbindungsfelder’ (fields of multi-word expressions), an experimental lexicographic resource that focusses on aspects of MWEs that are rarely addressed in traditional descriptions: Contexts, patterns and interrelations. The MWE fields use data from a very large corpus of written German (over 6 billion word forms) and are created in a strictly corpus-based way. In addition to traditional lexicographic descriptions, they include quantitative corpus data which is structured in new ways in order to show the usage specifics. This way of looking at MWEs gives insight in the structure of language and is especially interesting for foreign language learners.
Designing a Bilingual Speech Corpus for French and German Language Learners: a Two-Step Process
(2014)
We present the design of a corpus of native and non-native speech for the language pair French-German, with a special emphasis on phonetic and prosodic aspects. To our knowledge there is no suitable corpus, in terms of size and coverage, currently available for the target language pair. To select the target L1-L2 interference phenomena we prepare a small preliminary corpus (corpus1), which is analyzed for coverage and cross-checked jointly by French and German experts. Based on this analysis, target phenomena on the phonetic and phonological level are selected on the basis of the expected degree of deviation from the native performance and the frequency of occurrence. 14 speakers performed both L2 (either French or German) and L1 material (either German or French). This allowed us to test, recordings duration, recordings material, the performance of our automatic aligner software. Then, we built corpus2 taking into account what we learned about corpus1. The aims are the same but we adapted speech material to avoid too long recording sessions. 100 speakers will be recorded. The corpus (corpus1 and corpus2) will be prepared as a searchable database, available for the scientific community after completion of the project.
Recent work suggests that concreteness and imageability play an important role in the meanings of figurative expressions. We investigate this idea in several ways. First, we try to define more precisely the context within which a figurative expression may occur, by parsing a corpus annotated for metaphor. Next, we add both concreteness and imageability as “features” to the parsed metaphor corpus, by marking up words in this corpus using a psycholinguistic database of scores for concreteness and imageability. Finally, we carry out detailed statistical analyses of the augmented version of the original metaphor corpus, cross-matching the features of concreteness and imageability with others in the corpus such as parts of speech and dependency relations, in order to investigate in detail the use of such features in predicting whether a given expression is metaphorical or not.