Corpus Linguistics
This contribution addresses the question of how and to what extent corpus-based approaches can contribute to the investigation and evaluation of language change. The evaluation of language change is interesting in this respect because, first, it is of considerable public interest; second, it is not among the core topics of linguistics; and third, it touches both on the humanities-oriented aspects of linguistics and on the empirical ones, which are more typical of the so-called hard sciences. The latter probably holds less controversially for the question of language decay (good vs. bad German, viewed diachronically) than for the question of correct vs. incorrect German: answering it clearly requires empirical, measurable criteria on the one hand, but on the other hand also further criteria, as well as a decision on how to classify and weight the different kinds of criteria and a justification of that decision. To approach the question, common, easily operationalizable hypotheses about symptoms of a potential decay of German are first tested on several DeReKo-based corpora and discussed with regard to their generalizability and scope. The second part outlines further empirical approaches to studying change, variation and dynamics that could contribute to the discussion of specific aspects of language decay. The final part places the approaches presented in the overall context of a linguistic investigation of language decay and reflects on them against the background of the public discourse surrounding it.
This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multi-ethnic urban areas in Germany. The first release of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we present our approach to using the tagger ensembles for identifying error patterns in the automatically tagged data.
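The abstract above mentions using tagger ensembles to identify error patterns rather than to improve accuracy. The paper's actual procedure is not given here, so the following is only a minimal sketch of one plausible reading: count the tag combinations on which several taggers disagree and inspect the most frequent ones. The function name, the input format and the toy tag sequences are assumptions made for illustration.

```python
from collections import Counter


def disagreement_patterns(tagger_outputs):
    """Count tag combinations on which the taggers disagree.

    tagger_outputs: list of tag sequences, one per tagger, all aligned
    to the same tokens (hypothetical input format, not from the paper).
    """
    patterns = Counter()
    for tags in zip(*tagger_outputs):
        if len(set(tags)) > 1:  # at least two taggers assign different tags
            patterns[tags] += 1
    return patterns


# Toy data: three hypothetical taggers, STTS-like tags for four tokens.
tagger_a = ["PPER", "VVFIN", "ADV", "NN"]
tagger_b = ["PPER", "VVFIN", "PTKVZ", "NN"]
tagger_c = ["PPER", "VVFIN", "ADV", "NE"]

patterns = disagreement_patterns([tagger_a, tagger_b, tagger_c])
for combo, count in patterns.most_common():
    print(combo, count)
```

Frequent disagreement combinations point at tokens worth checking manually; they do not by themselves say which tagger is right.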
Part-of-speech tagging (POS tagging) of spoken data requires different means of annotation than POS tagging of written and edited texts. To capture the features of spoken German, a distinct tagset is needed that accounts for the kinds of elements which occur only in speech. Creating such a coherent tagset requires analyzing the most prominent phenomena of spoken language, especially with respect to how they differ from written language. First evaluations have shown that the most prominent cause of errors (over 50%) in the existing automated POS tagging of transcripts of spoken German with the Stuttgart-Tübingen Tagset (STTS) and the TreeTagger was the inaccurate interpretation of speech particles. One reason for this is that this class of words is virtually absent from the current STTS. This paper proposes a recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics. The ultimate aim is to create a comprehensive reference corpus of spoken German data for the global research community. To that end, all phenomena must be reliably captured by future part-of-speech tag labels.
The corpus-linguistic approach of the project »Korpusgrammatik« opens up new perspectives on linguistic reality in general and on grammatical regularities in particular. This volume explains how the standard can be investigated with corpus-linguistic methods, how the project corpora are structured and made accessible in a corpus database, how an automatic query system can be used to tackle linguistic variability and even make it measurable, and finally where the limits of quantitative corpus analyses lie. Pilot studies indicate how the approach broadens our grammatical horizons and advances grammaticography.
We investigate how the granularity of POS tags influences POS tagging and how POS tagging performance relates to parsing results. For this, we use the standard "pipeline" approach, in which a parser builds its output on previously tagged input. The experiments are performed on two German treebanks, using three POS tagsets of different granularity and six different POS taggers, together with the Berkeley parser. Our findings show that a less granular POS tagset leads to better tagging results. However, both too coarse-grained and too fine-grained distinctions at the POS level decrease parsing performance.
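As a rough illustration of why coarser tagsets tend to yield higher tagging scores, the sketch below scores the same toy predictions once with fine-grained tags and once after projecting them onto a coarser set. This is a simplification: the study summarized above uses taggers operating on tagsets of different granularity, whereas this only re-maps tags at evaluation time. The mapping `COARSE` and the tag sequences are invented for illustration.

```python
def coarsen(tag, mapping):
    """Project a fine-grained tag onto a coarser class (unmapped tags pass through)."""
    return mapping.get(tag, tag)


def accuracy(gold, predicted):
    """Token-level accuracy of predicted tags against gold tags."""
    if len(gold) != len(predicted):
        raise ValueError("sequences must be aligned")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical coarse mapping over a few STTS-like tags.
COARSE = {"VVFIN": "V", "VAFIN": "V", "VVINF": "V", "NN": "N", "NE": "N"}

gold = ["NE", "VAFIN", "NN", "VVINF"]
pred = ["NN", "VAFIN", "NN", "VVFIN"]

fine_acc = accuracy(gold, pred)
coarse_acc = accuracy(
    [coarsen(t, COARSE) for t in gold],
    [coarsen(t, COARSE) for t in pred],
)
print(fine_acc, coarse_acc)
```

Confusions within one coarse class (here NE vs. NN and VVINF vs. VVFIN) stop counting as errors after the projection, which raises the score without telling us anything about downstream parsing quality.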
Newspapers became extremely popular in Germany during the 18th and 19th centuries, and thus increasingly influential for modern German. However, due to the lack of digitized historical newspaper corpora for German, this influence could not be analyzed systematically. In this paper, we introduce the Mannheim Corpus of Digital Newspapers and Magazines, which in its current release comprises 21 newspapers and magazines from the 18th and 19th centuries. With over 4.1 million tokens in about 650 volumes, it currently constitutes the largest historical corpus dedicated to newspapers in German. We briefly discuss the prospect of the corpus for analyzing the evolution of news as a genre in its own right and the influence of contextual parameters such as region and register on the language of news. We then focus on one historically influential aspect of newspapers: their role in disseminating foreign words in German. Our preliminary quantitative results indeed indicate that newspapers use foreign words significantly more frequently than other genres, in particular belles lettres.
We describe a systematic and application-oriented approach to training and evaluating named entity recognition and classification (NERC) systems, the purpose of which is to identify an optimal system and to train an optimal model for tagging named entities in DeReKo, a very large general-purpose corpus of contemporary German (Kupietz et al., 2010). DeReKo's strong dispersion with respect to genre, register and time forces us to base our decision for a specific NERC system on an evaluation performed on a representative sample of DeReKo, rather than on performance figures that have been reported for the individual NERC systems when evaluated on more uniform and less diverse data. We create and manually annotate such a representative sample as evaluation data for three different NERC systems, for each of which various models are trained on multiple training data sets. The proposed sampling method can be viewed as a generally applicable method for sampling evaluation data from an unbalanced target corpus for any sort of natural language processing.
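The abstract above does not spell out the sampling method, so the following is not the authors' procedure. It is a generic sketch of proportional stratified sampling, one common way to draw an evaluation sample that mirrors the composition of an unbalanced corpus. The function name, the `genre` field and the toy corpus are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, n, seed=0):
    """Draw roughly n documents, proportionally to the size of each stratum.

    docs: list of documents; key: function returning a document's stratum.
    Each stratum contributes at least one document, so the result can
    deviate slightly from n because of rounding.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)

    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = max(1, round(n * len(members) / len(docs)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Toy unbalanced corpus: 80 news, 15 fiction and 5 science documents.
corpus = (
    [{"id": i, "genre": "news"} for i in range(80)]
    + [{"id": i, "genre": "fiction"} for i in range(80, 95)]
    + [{"id": i, "genre": "science"} for i in range(95, 100)]
)

sample = stratified_sample(corpus, key=lambda d: d["genre"], n=20)
print(len(sample))
```

In practice the strata would combine several dimensions, for example genre, register and time, and the sample would be drawn at the level of sentences or text segments rather than whole documents.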