This article presents results produced within the binational research project SDiv. Focusing on the text types within the corpus of Öffentliche Bekanntmachungen (public announcements), it first determines the communicative characteristics, the historical frame, the quantitative distribution of the text types, and their classification within the corpus in question. In a second step, the linguistic structures and routines are analysed. The aim of the text-linguistic analysis is the reconstruction of the "communication needs and communication conditions" (Mattheier, 1998: 4) as they were operative in connection with the public announcements of the 19th century. Text types and text-type histories have the advantage of functioning as "switch points between the history of the language and the history of the speech community" (Mattheier, 1998: 4); the present study of the text types within the public announcements therefore also contributes to the history of German and to the social history of the multilingual speakers of German in Luxembourg. The article is organized into seven sections: 1. Introduction, 2. Historical frame, 3. Corpus, 4. Theoretical embedding, 5. Language-external factors, 6. Language-internal factors (e.g. grammatical features), and 7. Conclusion.
In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora—a web corpus and a newspaper corpus—by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with metadata required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
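The two-step pipeline the abstract describes — induce topic features over documents, then learn the manually annotated domains from those features — can be sketched in plain Python. The seed lexicons and the nearest-centroid classifier below are illustrative stand-ins (the paper uses unsupervised topic modeling and a proper supervised learner), and all names are hypothetical:

```python
from collections import Counter

# Hypothetical seed lexicons standing in for induced topics
# (the paper derives these with unsupervised topic modeling instead).
TOPIC_SEEDS = {
    "sports":  {"match", "goal", "team", "season"},
    "finance": {"market", "shares", "bank", "profit"},
}

def topic_features(text):
    """Map a document to a normalized per-topic score vector."""
    tokens = Counter(text.lower().split())
    scores = {t: sum(tokens[w] for w in seeds)
              for t, seeds in TOPIC_SEEDS.items()}
    total = sum(scores.values()) or 1
    return {t: s / total for t, s in scores.items()}

def classify(text, centroids):
    """Assign the label whose feature centroid is closest (squared distance)."""
    feats = topic_features(text)
    return min(centroids, key=lambda lbl: sum(
        (feats[t] - centroids[lbl][t]) ** 2 for t in TOPIC_SEEDS))

# Tiny labelled "gold standard" used to fit the centroids.
train = [
    ("The team scored a late goal to win the match", "sports"),
    ("Shares fell as the bank reported lower profit", "finance"),
]
centroids = {lbl: topic_features(txt) for txt, lbl in train}
print(classify("Profit at the bank rose as the market recovered", centroids))
# → finance
```

The point of the sketch is the decoupling: the topic features are induced independently of the gold labels, and only the second, supervised step consumes them.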
On the basis of a law text corpus which consists of judicial decisions and jurisprudential papers on so-called assisted suicide from 1977 to 2011, agonal centres are determined within the paradigm of corpus-based pragma-semiotic text analysis. Agonal centres are defined as action-guiding concepts that are in conflict with each other concerning the general acceptance of event interpretations, options for action, claims of validity, contextual knowledge and values. These action-guiding concepts are derived with the help of quantitative and qualitative methods. Discourse linguistic interpretations are thus rendered more objective with the help of semi-automatic methods; furthermore, specific features of the discourse and approaches to interpretation can be derived from (un)expected linguistic significances of occurrence, distribution, frequency etc. at the linguistic surface. Finally, these agonal centres specific to the language of law are compared to agonal centres which are determined on the basis of a media corpus on the same issue. This provides a comparative insight into the constitution of a seemingly identical fact in everyday and specialised language, which demonstrates the sociopolitical relevance of analysing the constitution of reality as instructed by language.
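The quantitative side of such a comparison — spotting words whose occurrence or frequency differs unexpectedly between the law corpus and the media corpus — can be sketched as a simple keyness computation. The log-likelihood (G²) measure is a standard choice for this kind of corpus comparison; the toy corpora below are invented for illustration:

```python
import math
from collections import Counter

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood (G2) statistic for one word across two corpora."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    g2 = 0.0
    if freq_a:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2

def keywords(corpus_a, corpus_b, top=3):
    """Rank words by how strongly they are overused in corpus_a."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    ta, tb = sum(ca.values()), sum(cb.values())
    scored = {w: log_likelihood(ca[w], ta, cb[w], tb)
              for w in ca if ca[w] / ta > cb[w] / tb}
    return sorted(scored, key=scored.get, reverse=True)[:top]

# Invented token lists standing in for the law and media corpora.
law   = "dignity autonomy statute statute liability".split()
media = "family hope doctor autonomy story".split()
print(keywords(law, media))
```

Words ranked high here would be candidates for closer qualitative inspection — the semi-automatic division of labour the abstract describes.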
This paper presents C-WEP, the Collection of Writing Errors by Professional Writers of German. It currently consists of 245 sentences with grammatical errors. All sentences are taken from published texts. All authors are professional writers with high skill levels with respect to German, the genres, and the topics. The purpose of this collection is to provide seeds for more sophisticated writing support tools, as only a very small proportion of those errors can be detected by state-of-the-art checkers. C-WEP is annotated on various levels and freely available.
The CELEX database is one of the standard lexical resources for German. It yields a wealth of data, especially for phonological and morphological applications. The morphological part comprises deep-structure morphological analyses of German. However, as it was developed in the 1990s, both encoding and spelling are outdated. About one fifth of the over 50,000 entries contain umlauts or characters such as ß. A modern version cannot be obtained by simple substitution. In this paper, we briefly describe the original content and form of the orthographic and morphological database for German in CELEX. Then we present our work on modernizing the linguistic data. Lemmas and morphological analyses are transferred to a modern standard of encoding by first merging orthographic and morphological information of the lemmas and their entries and then performing a second substitution for the morphs within their morphological analyses. Changes to modern German spelling are performed by substitution rules according to orthographical standards. We show an example of the use of the data for the disambiguation of morphological structures. The discussion describes prospects of future work on this or similar lexicons. The Perl script is publicly available on our website.
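The two-stage substitution the abstract outlines — first re-encode legacy special-character codes, then apply spelling-reform rules — can be sketched as follows. The legacy umlaut codes below are placeholders, not CELEX's actual notation, and the single spelling rule is just one example of the 1996 reform (word-final ß after a short vowel became ss, as in daß → dass):

```python
# Placeholder legacy codes (NOT CELEX's actual notation) mapped to
# modern UTF-8 characters.
UMLAUT_CODES = {'"a': "ä", '"o': "ö", '"u': "ü", '"s': "ß"}

# One spelling-reform substitution as an example: "daß" became "dass".
SPELLING_RULES = {"daß": "dass"}

def modernize(lemma):
    """Re-encode a legacy lemma, then apply spelling-reform substitutions."""
    for code, char in UMLAUT_CODES.items():
        lemma = lemma.replace(code, char)
    return SPELLING_RULES.get(lemma, lemma)

print(modernize('da"s'))  # legacy 'da"s' -> 'daß' -> 'dass'
```

Ordering matters here, which is why a naive one-pass substitution fails: the spelling rules operate on the modern characters, so they can only apply after the encoding step has produced them.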
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.
This contribution presents the background, design and results of a study of users of three oral corpus platforms in Germany. Roughly 5,000 registered users of the Database for Spoken German (DGD), the GeWiss corpus and the corpora of the Hamburg Centre for Language Corpora (HZSK) were asked to participate in a user survey. This quantitative approach was complemented by qualitative interviews with selected users. We briefly introduce the corpus resources involved in the study in section 2. Section 3 describes the methods employed in the user studies. Section 4 summarizes results of the studies, focusing on selected key topics. Section 5 attempts a generalization of these results to larger contexts.
Sense relations (2016)
This contribution focuses on the prefield (Vorfeld) of the German sentence, in cases where the prefield consists either of one or more clause constituents together with a non-finite part of the verbal complex, or of the non-finite part of the verbal complex alone. For these types of prefield occupation, variants and their information-structural particularities are examined. Furthermore, the paper pursues the question of whether, contrary to a frequently stated rule that the prefield of the German sentence can be occupied by only a single constituent, clear and acceptable attestations can be found in the Wikipedia corpora indicating that German does allow a prefield filled with more than one constituent.
In order to develop its full potential, global communication needs linguistic support systems such as Machine Translation (MT). In the past decade, free online MT tools have become available to the general public, and the quality of their output is increasing. However, the use of such tools may entail various legal implications, especially as far as processing of personal data is concerned. This is even more evident if we take into account that their business model is largely based on providing translation in exchange for data, which can subsequently be used to improve the translation model, but also for commercial purposes. The purpose of this paper is to examine how free online MT tools fit in the European data protection framework, harmonised by the EU Data Protection Directive. The perspectives of both the user and the MT service provider are taken into account.