Lexicographic meaning descriptions of German lexical items which are formally and semantically similar and therefore easily confused (so-called paronyms) often do not reflect their current usage. They can even contradict one’s personal intuition or disagree with lexical usage as observed in public discourse. The reasons are manifold. Either the language data used for compiling dictionaries is outdated, or lexicographic practice is rather conventional and does not take advantage of corpus-assisted approaches to semantic analysis. Despite various modern electronic or online reference works, speakers face uncertainties when dealing with easily confusable words such as sensibel/sensitiv (sensitive) or kindisch/kindlich (childish/childlike). Existing dictionaries often do not provide satisfactory answers as to how to use these sets correctly. Numerous questions posted in online forums show where the uncertainties with paronyms lie and why users demand further assistance concerning proper contextual usage (cf. Storjohann 2015). There are different reasons why users misuse certain items or mix up words which are similar in form and meaning. As data from written and more spontaneous language resources suggest, some confusions arise from ongoing semantic change in the current use of some paronyms. This paper identifies shortcomings of contemporary German dictionaries and discusses innovative ways of empirical lexicographic work that might pave the way for a new data-driven, descriptive reference work of confusable German terms. Such a guide is currently being developed at the Institute for German Language in Mannheim, using corpora and diverse corpus-analytical methods. Its objective is to compile a dictionary with contrastive entries that serves as a useful reference tool in situations of language doubt. At the same time, it aims to sensitize users to context dependency and language change.
As the nature of negative polarity items (NPIs) and their licensing contexts is still under much debate, a broad empirical basis is an important cornerstone to support further insights in this area of research. The work discussed in this paper is intended as a contribution to realizing this objective. We briefly introduce the phenomenon of NPIs and outline major theories about their licensing as well as various licensing contexts before discussing our two major topics. Firstly, we describe a corpus-based retrieval method for NPI candidates that ranks the candidates according to their distributional dependence on the licensing contexts. Our method extracts single-word candidates and is extended to also capture multi-word candidates. The basic idea for automatically collecting NPI candidates from a large corpus is that an NPI behaves like a kind of collocate of its licensing contexts. The actual NPIs are then identified by manual inspection and interpretation of the candidate lists. Secondly, we present an online repository for NPIs and other items that show distributional idiosyncrasies, which offers a sustainable empirical database for further (theoretical) research on these items.
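The ranking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name an association measure, so the sketch assumes a plain ratio (the share of a word's sentences that also contain a licensor), a hypothetical licensor word list, and a hypothetical `min_freq` cut-off. It also handles single-word candidates only.

```python
from collections import Counter


def rank_npi_candidates(sentences, licensors, min_freq=2):
    """Rank words by how strongly they depend on licensing contexts.

    sentences: list of token lists.
    licensors: set of lower-cased tokens taken to mark a licensing
               context (an assumption; real licensing contexts are
               far richer than a word list).
    Returns (word, score) pairs, highest score first, where score is
    the share of the word's sentences that contain a licensor.
    """
    total = Counter()     # sentences containing the word
    licensed = Counter()  # of those, sentences that also contain a licensor
    for tokens in sentences:
        lowered = [t.lower() for t in tokens]
        has_licensor = any(t in licensors for t in lowered)
        # Count each word once per sentence; licensors are not candidates.
        for word in set(lowered) - set(licensors):
            total[word] += 1
            if has_licensor:
                licensed[word] += 1
    scored = [
        (word, licensed[word] / total[word])
        for word in total
        if total[word] >= min_freq
    ]
    # Sort by score, then by frequency, then alphabetically for stability.
    return sorted(scored, key=lambda p: (-p[1], -total[p[0]], p[0]))
```

As in the abstract, the output is only a candidate list: a high score shows a distributional dependence, and deciding which candidates are actual NPIs remains a manual step.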
Catching the common cause: extraction and annotation of causal relations and their participants
(2017)
In this paper, we present a simple yet effective method for the automatic identification and extraction of causal relations from text, based on a large English-German parallel corpus. The goal of this effort is to create a lexical resource for German causal relations. The resource will consist of a lexicon that describes constructions that trigger causality as well as the participants of the causal event, and will be augmented by a corpus with annotated instances for each entry, which can be used as training data to develop a system for automatic classification of causal relations. Focusing on verbs, our method harvested a set of 100 different lexical triggers of causality, including support verb constructions. At the moment, our corpus includes over 1,000 annotated instances. The lexicon and the annotated data will be made available to the research community.
Feedback utterances are among the most frequent utterances in dialogue. Feedback is also a crucial aspect of linguistic theories that take social interaction involving language into account. This paper introduces the corpora and datasets of a project scrutinizing this kind of feedback utterance in French. We present the genesis of the corpora involved in the project (about 16 hours of transcribed and phone force-aligned speech in total). We introduce the resulting datasets and discuss how they are being used in ongoing work focusing on the form-function relationship of conversational feedback. All the corpora created and the datasets produced in the framework of this project will be made available for research purposes.
This paper gives a brief overview of a new project at the "Institute for the German Language" (IDS) in Mannheim. It outlines some basic ideas for a corpus-based dictionary of spoken German, which will be developed and compiled by the new project "The Lexicon of spoken German" (Lexik des gesprochenen Deutsch, LeGeDe). The work is based on the "Research and Teaching Corpus of Spoken German" (Forschungs- und Lehrkorpus Gesprochenes Deutsch, FOLK), which is implemented in the "Database for Spoken German" (Datenbank für Gesprochenes Deutsch, DGD). Both resources, the database and the corpus, have been developed at the IDS.
As a result of a recent curation project, the Dortmund Chat Corpus is available in CLARIN-D research infrastructures for download and querying. A legal opinion had recommended that standard anonymisation measures be applied to the corpus before its republication. This paper reports on the anonymisation campaign that was conducted for the corpus. Anonymisation was realised as categorisation; the paper introduces the taxonomy of anonymisation categories that was applied and demonstrates how it was applied to the TEI files. The results of the anonymisation campaign as well as issues of quality assessment are discussed. Finally, the paper discusses pseudonymisation as an alternative to categorisation for anonymising CMC data, as well as possibilities for automating the process.
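The difference between categorisation and pseudonymisation can be sketched as follows. This is an illustration only, not the procedure used for the corpus: it works on plain strings with character offsets rather than TEI files, the category label `NAME` is hypothetical and not taken from the paper's taxonomy, and numbered placeholders stand in for the substitute names a real pseudonymisation would choose.

```python
def categorise(text, spans):
    """Replace each (start, end, category) span with its category label.

    Every mention of any name collapses to the same label, so the text
    no longer shows which mentions refer to the same person.
    """
    out, cursor = [], 0
    for start, end, category in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{category}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def pseudonymise(text, spans):
    """Replace each span with a substitute that is stable per original string.

    The same original string always maps to the same substitute, so
    coreference between mentions stays visible.
    """
    mapping, counters = {}, {}
    out, cursor = [], 0
    for start, end, category in sorted(spans):
        original = text[start:end]
        if original not in mapping:
            counters[category] = counters.get(category, 0) + 1
            mapping[original] = f"{category}{counters[category]}"
        out.append(text[cursor:start])
        out.append(mapping[original])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

For the input `"Anna met Ben; Anna left."` with the three names marked as `NAME` spans, `categorise` yields `"[NAME] met [NAME]; [NAME] left."` while `pseudonymise` yields `"NAME1 met NAME2; NAME1 left."`, which shows the trade-off: categorisation hides more, pseudonymisation keeps the text more usable for analysis.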
An integrated database, search, and tagging tool (IDaSTo) is presented that is particularly suited to variable analyses, parallel texts, and diachronic studies. Relevant categories or variables can be defined individually, tags can be set freely in the text and in various ways, and their frequencies can be retrieved directly from the linked statistics.