Refine
Year of publication
- 2017 (11) (remove)
Document Type
- Part of a Book (11) (remove)
Has Fulltext
- yes (11)
Keywords
- Deutsch (5)
- Korpus <Linguistik> (4)
- Annotation (2)
- Computerlinguistik (2)
- Corpus linguistics (2)
- Kontrastive Linguistik (2)
- Phraseologie (2)
- Automatische Sprachverarbeitung (1)
- Autorschaft (1)
- Bairisch (1)
Publicationstate
- Veröffentlichungsversion (8)
- Postprint (2)
Reviewstate
- Peer-Review (11) (remove)
Publisher
- Institut für Deutsche Sprache (2)
- Springer (2)
- Synchron (2)
- The Association for Computational Linguistics (2)
- Editions Tradulex (1)
- Language Science Press (1)
- Steiner (1)
We use a convolutional neural network to perform authorship identification on a very homogeneous dataset of scientific publications. In order to investigate the effect of domain biases, we obscure words below a certain frequency threshold, retaining only their POS-tags. This procedure improves test performance due to better generalization on unseen data. Using our method, we are able to predict the authors of scientific publications in the same discipline at levels well above chance.
We introduce a method for error detection in automatically annotated text, aimed at supporting the creation of high-quality language resources at affordable cost. Our method combines an unsupervised generative model with human supervision from active learning. We test our approach on in-domain and out-of-domain data in two languages, in AL simulations and in a real world setting. For all settings, the results show that our method is able to detect annotation errors with high precision and high recall.
In my talk, I present an empirical approach to detecting and describing proverbs as frozen sentences with specific functions in current language use. We have developed this approach in the EU project ‘SprichWort’ (based on the German Reference Corpus). The first chapter illustrates selected aspects of our complex, iterative procedure to validate proverb candidates. Based on our corpus-driven lexpan methodology of slot analysis I then discuss semantic restrictions of proverb patterns. Furthermore, I show different degrees of proverb quality ranging from genuine proverbs to non-proverb realizations of the same abstract pattern. On the one hand, the corpus validation reveals that proverbs are definitely perceived and used as relatively fixed entities and often as sentences. On the other hand, proverbs are not only interpreted as an interesting unique phenomenon but also as part of the whole lexicon, embedded in networks of different lexical items.
Researchers interested in the sounds of speech or the physical gestures of Speakers make use of audio and video recordings in their work. Annotating these recordings presents a different set of requirements to the annotation of text. Special purpose tools have been developed to display video and audio Signals and to allow the creation of time-aligned annotations. This chapter reviews the most widely used of these tools for both manual and automatic generation of annotations on multimodal data.
We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring in the form of as a single token (without punctuation between their components). Relying on linguistic criteria, our approach implies to have an operational notion of compounds which can be systematically applied as well as (web) corpora which are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.
Modeling the properties of German phrasal compounds within a usage-based constructional approach
(2017)
This paper discusses phrasal compounds in German (e.g.“Man-muss-doch-überalles-reden-können”-Credo, ‘one-should-be-able-to-talk-about-everything motto’). It provides the first empirically based investigation and description of this wordformation type within the theoretical framework of construction grammar. While phrasal compounds pose a problem for “traditional” generative approaches, I argue that a usage-based constructional model (e.g. Langacker 1987; Goldberg 2006) which takes into consideration aspects of frequency provides a suitable approach to modeling and explaining their properties. For this purpose, a large inventory of phrasal compounds was extracted from the German Reference Corpus (DeReKo) and modeled as pairings of form and meaning at different levels of specificity and abstractness within a bottom-up process.
Overall, this paper not only presents a new and original approach to phrasal compounds, but also offers interesting perspectives for dealing with composition in general.
CMC Corpora in DeReKo
(2017)
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in DeReKo, the German Reference Corpus, namely Wikipedia (discussion) corpora, the Usenet news corpus, and the Dortmund Chat Corpus. The data and corpora have been converted to I5, the TEI customization to represent texts in DeReKo, and are researchable via the web-based IDS corpus research interfaces and in the case of Wikipedia and chat also downloadable from the IDS repository and download server, respectively.