Automatic Food Categorization from Large Unlabeled Corpora and Its Impact on Relation Extraction
(2014)
We present a weakly-supervised induction method to assign semantic information to food items. We consider two categorization tasks: food-type classification and the distinction of whether a food item is composite or not. The categorizations are induced by a graph-based algorithm applied to a large unlabeled domain-specific corpus. We show that the use of a domain-specific corpus is vital. We not only outperform a manually designed open-domain ontology but also demonstrate the usefulness of these categorizations in relation extraction, outperforming state-of-the-art features that include syntactic information and Brown clustering.
We discovered several recurring errors in the current version of the Europarl Corpus originating both from the web site of the European Parliament and the corpus compilation based thereon. The most frequent error was incompletely extracted metadata leaving non-textual fragments within the textual parts of the corpus files. This is, on average, the case for every second speaker change. We not only cleaned the Europarl Corpus by correcting several kinds of errors, but also aligned the speakers’ contributions in all available languages and compiled everything into a new XML-structured corpus. This facilitates a more sophisticated selection of data, e.g. querying the corpus for speeches by speakers of a particular political group or in particular language combinations.
We compare several different corpus-based and lexicon-based methods for the scalar ordering of adjectives. Among them, we examine for the first time a low-resource approach based on distinctive-collexeme analysis that requires only a small predefined set of adverbial modifiers. While previous work on adjective intensity mostly assumes one single scale for all adjectives, we group adjectives into different scales, which is more faithful to human perception. We also apply the methods to both polar and non-polar adjectives, showing that not all methods are equally suitable for both types of adjectives.
This study presents the results of a large-scale comparison of various measures of pitch range and pitch variation in two Slavic (Bulgarian and Polish) and two Germanic (German and British English) languages. The productions of twenty-two speakers per language (eleven male and eleven female) in two different tasks (read passages and number sets) are compared. Significant differences between the language groups are found: German and English speakers use lower pitch maxima, narrower pitch span, and generally less variable pitch than Bulgarian and Polish speakers. These findings support the hypothesis that linguistic communities tend to be characterized by particular pitch profiles.
This contribution presents the newest version of our ‘Wortverbindungsfelder’ (fields of multi-word expressions), an experimental lexicographic resource that focusses on aspects of MWEs that are rarely addressed in traditional descriptions: contexts, patterns and interrelations. The MWE fields use data from a very large corpus of written German (over 6 billion word forms) and are created in a strictly corpus-based way. In addition to traditional lexicographic descriptions, they include quantitative corpus data which is structured in new ways in order to show the usage specifics. This way of looking at MWEs gives insight into the structure of language and is especially interesting for foreign language learners.
Designing a Bilingual Speech Corpus for French and German Language Learners: a Two-Step Process
(2014)
We present the design of a corpus of native and non-native speech for the language pair French-German, with a special emphasis on phonetic and prosodic aspects. To our knowledge there is no suitable corpus, in terms of size and coverage, currently available for the target language pair. To select the target L1-L2 interference phenomena we prepare a small preliminary corpus (corpus1), which is analyzed for coverage and cross-checked jointly by French and German experts. Based on this analysis, target phenomena on the phonetic and phonological level are selected on the basis of the expected degree of deviation from the native performance and the frequency of occurrence. 14 speakers performed both L2 (either French or German) and L1 material (either German or French). This allowed us to test the recording duration, the recording material, and the performance of our automatic aligner software. Then, we built corpus2, taking into account what we learned from corpus1. The aims are the same, but we adapted the speech material to avoid overly long recording sessions. 100 speakers will be recorded. The corpus (corpus1 and corpus2) will be prepared as a searchable database, available to the scientific community after completion of the project.
This contribution deals with the various search, retrieval and selection processes that are necessary for foreign-language production and that receive particular attention in DICONALE-online, an onomasiologically and conceptually oriented, bilingual and bilateral verb dictionary of contemporary Spanish and German. The starting point of DICONALE is the unsatisfactory range of information that existing monolingual and bilingual learner's dictionaries offer for L2 output, which confirms the project team's view that a novel user- and situation-defined online reference work is needed. Two frames of reference form the basis for a complex, conceptually and frame-guided access path that is intended to help users search for and select means of expression and apply them appropriately. The novelty of this dictionary project lies mainly in making an onomasiological-conceptual perspective usable for the foreign-language production process and combining it with semasiological access, which makes it possible to highlight the inter- and intralingual differences between the lexemes of a lexical-semantic (sub)paradigm. The aim of the contribution is therefore to put up for discussion the starting point and the theoretical and methodological foundations of DICONALE-online from the specific perspective of user and situation orientation, to introduce the individual access paths for the search and retrieval process, and to present the information offered for selection and appropriate use from an inter- and intralingual perspective.
This study investigates cross-language differences in pitch range and variation in four languages from two language groups: English and German (Germanic) and Bulgarian and Polish (Slavic). The analysis is based on large multi-speaker corpora (48 speakers for Polish, 60 for each of the other three languages). Linear mixed models were computed that include various distributional measures of pitch level, span and variation, revealing characteristic differences across languages and between language groups. A classification experiment based on the relevant parameter measures (span, kurtosis and skewness values for pitch distributions for each speaker) succeeded in separating the language groups.
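The per-speaker parameters named in this abstract (span, skewness and kurtosis of the pitch distribution) can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `pitch_profile`, the semitone conversion relative to 1 Hz, the use of the full min-max range as span, and the population-moment estimators are all assumptions made for illustration.

```python
import math
from statistics import mean


def pitch_profile(f0_hz):
    """Summarise one speaker's F0 samples (hypothetical helper).

    Returns span in semitones plus skewness and excess kurtosis of the
    semitone-scaled distribution. Unvoiced frames (F0 <= 0) are ignored.
    """
    # Assumption: semitones relative to 1 Hz, so distances are logarithmic.
    st = [12 * math.log2(f) for f in f0_hz if f > 0]
    if len(st) < 2:
        raise ValueError("need at least two voiced F0 samples")
    n = len(st)
    m = mean(st)
    # Central moments (population estimators, an illustrative choice).
    m2 = sum((x - m) ** 2 for x in st) / n
    m3 = sum((x - m) ** 3 for x in st) / n
    m4 = sum((x - m) ** 4 for x in st) / n
    if m2 == 0:
        raise ValueError("F0 samples have zero variance")
    return {
        "span_st": max(st) - min(st),   # assumption: full range as span
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,   # excess kurtosis
    }


if __name__ == "__main__":
    # Made-up F0 values in Hz; 0.0 marks an unvoiced frame.
    print(pitch_profile([110.0, 0.0, 125.0, 140.0, 180.0, 132.0]))
```

A vector of such values per speaker could then serve as input to a classifier; which classifier the study used is not stated in the abstract.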
Recent work suggests that concreteness and imageability play an important role in the meanings of figurative expressions. We investigate this idea in several ways. First, we try to define more precisely the context within which a figurative expression may occur, by parsing a corpus annotated for metaphor. Next, we add both concreteness and imageability as “features” to the parsed metaphor corpus, by marking up words in this corpus using a psycholinguistic database of scores for concreteness and imageability. Finally, we carry out detailed statistical analyses of the augmented version of the original metaphor corpus, cross-matching the features of concreteness and imageability with others in the corpus such as parts of speech and dependency relations, in order to investigate in detail the use of such features in predicting whether a given expression is metaphorical or not.
This article gives an insight into the GeoBib project and the problems of using historical maps, and the geodata derived from them, in a WebGIS. The GeoBib project aims to provide an annotated and georeferenced online bibliography of early German- and Polish-language Holocaust and camp literature from 1933 to 1949. For this period, historical maps and geodata are collected, processed and visualized in the associated WebGIS of the GeoBib portal. A special feature is the laborious search for geodata and map material for the period between 1933 and 1949. The problems concerning the research and subsequent visualization of historical geodata and map material are a main focus of this article. In addition, concepts for the visualization of historical, incomplete map material are presented, and a possible solution to the existing challenges is outlined.
We study the influence of information structure on the salience of subjective expressions for human readers. Using an online survey tool, we conducted an experiment in which we asked users to rate main and relative clauses that contained either a single positive or negative or a neutral adjective. The statistical analysis of the data shows that subjective expressions are more prominent in main clauses where they are asserted than in relative clauses where they are presupposed. A corpus study suggests that speakers are sensitive to this differential salience in their production of subjective expressions.
While numerous studies in conversation analysis have examined how speakers co-construct a turn at talk (notably at the syntactic and prosodic levels), the way in which the co-construction is subsequently evaluated has not yet been studied in depth in the interactionist literature. Here, we study two practices that allow a speaker to validate a co-construction, namely simple agreement and other-repetition of the completion. Through a sequential and multimodal analysis of several co-construction sequences in French, we show that with these two practices, which at first sight seem to function similarly, speakers carry out very different evaluations: whereas simple agreement validates the proposed completion only as one possible version, other-repetition validates it as a completely adequate completion. This contribution shows that interactants exploit audible as well as visible resources to display whether and in what sense they accept a coparticipant's completion of their turn. We underline the importance of studying in detail the different possible formats of turns that evaluate a completion, in order to distinguish different forms of “acceptance” and to reveal how speakers can finely negotiate their position as (co-)author or recipient of a turn at talk.
We investigate how the granularity of POS tags influences POS tagging, and furthermore, how POS tagging performance relates to parsing results. For this, we use the standard “pipeline” approach, in which a parser builds its output on previously tagged input. The experiments are performed on two German treebanks, using three POS tagsets of different granularity, and six different POS taggers, together with the Berkeley parser. Our findings show that less granularity of the POS tagset leads to better tagging results. However, both too coarse-grained and too fine-grained distinctions on POS level decrease parsing performance.
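One simple reason why a coarser tagset tends to yield higher tagging accuracy can be sketched as follows: confusions between fine-grained tags that fall into the same coarse class stop counting as errors. This is an illustration only, not the experimental setup of the paper. The fine-grained tags are taken from the STTS tagset as an assumption (the abstract does not name its tagsets), and the mapping `COARSE` and the helpers `coarsen` and `accuracy` are hypothetical.

```python
# Hypothetical mapping from a few STTS tags to coarse classes.
COARSE = {
    "NN": "NOUN", "NE": "NOUN",        # common noun, proper noun
    "VVFIN": "VERB", "VAFIN": "VERB",  # finite full verb, finite auxiliary
    "ADJA": "ADJ", "ADJD": "ADJ",      # attributive, predicative adjective
    "ART": "DET",
}


def coarsen(tagged, mapping=COARSE, default="X"):
    """Replace each fine-grained tag by its coarse class."""
    return [(word, mapping.get(tag, default)) for word, tag in tagged]


def accuracy(gold, pred):
    """Token-level tagging accuracy over two aligned tagged sequences."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(g_tag == p_tag for (_, g_tag), (_, p_tag) in zip(gold, pred))
    return correct / len(gold)


if __name__ == "__main__":
    # Made-up example: the tagger confuses a proper noun with a common noun.
    gold = [("Haus", "NN"), ("Berlin", "NE"), ("ist", "VAFIN")]
    pred = [("Haus", "NN"), ("Berlin", "NN"), ("ist", "VAFIN")]
    print("fine-grained:", accuracy(gold, pred))
    print("coarse:      ", accuracy(coarsen(gold), coarsen(pred)))
```

The same collapsing also removes distinctions a parser may need, which is consistent with the reported finding that overly coarse tagsets decrease parsing performance.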
Recent work on error detection has shown that the quality of manually annotated corpora can be substantially improved by applying consistency checks to the data and automatically identifying incorrectly labelled instances. These methods, however, cannot be used for automatically annotated corpora where errors are systematic and cannot easily be identified by looking at the variance in the data. This paper targets the detection of POS errors in automatically annotated corpora, so-called silver standards, showing that by combining different measures sensitive to annotation quality we can identify a large part of the errors and obtain a substantial increase in accuracy.
The annotation of parts of speech (POS) in linguistically annotated corpora is a fundamental annotation layer which provides the basis for further syntactic analyses, and many NLP tools rely on POS information as input. However, most POS annotation schemes have been developed with written (newspaper) text in mind and thus do not carry over well to text from other domains and genres. Recent discussions have concentrated on the shortcomings of present POS annotation schemes with regard to their applicability to data from domains other than newspaper text.
We report on the two systems we built for Task 1 of the German Sentiment Analysis Shared Task, the task on Source, Subjective Expression and Target Extraction from Political Speeches (STEPS). The first system is a rule-based system relying on a predicate lexicon specifying extraction rules for verbs, nouns and adjectives, while the second is a translation-based system that has been obtained with the help of the (English) MPQA corpus.
We examine the task of separating types from brands in the food domain. Framing the problem as a ranking task, we convert simple textual features extracted from a domain-specific corpus into a ranker without the need for labeled training data. Such a method should rank brands (e.g. sprite) higher than types (e.g. lemonade). Apart from that, we also exploit knowledge induced by semi-supervised graph-based clustering for two different purposes. On the one hand, we produce an auxiliary categorization of food items according to the Food Guide Pyramid, and assume that a food item is a type when it belongs to a category unlikely to contain brands. On the other hand, we directly model the task of brand detection using seeds provided by the output of the textual ranking features. We also harness Wikipedia articles as an additional knowledge source.
This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multi-ethnic urban areas in Germany. The first release of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we present our approach to using the tagger ensembles for identifying error patterns in the automatically tagged data.
This article presents preliminary results indicating that speakers have a different pitch range when they speak a foreign language compared to the pitch variation that occurs when they speak their native language. To this end, a learner corpus with French and German speakers was analyzed. Results suggest that speakers indeed produce a smaller pitch range in the respective L2. This is true for both groups of native speakers. A possible explanation for this finding is that speakers are less confident in their productions; they therefore concentrate more on segments and words and refrain from realizing a more native-like pitch range. For language teaching, the results suggest that learners should be trained extensively on the more pronounced use of pitch in the foreign language.