Corpus linguistics (Korpuslinguistik)
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.
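The four annotation layers can be pictured as parallel fields on each token. The following is a minimal sketch, not the corpus's actual data format: the `Token` class, the helper functions and the example utterance are invented for illustration, and only tags from the standard STTS inventory are used (none of the extended tags mentioned in the abstract).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token with the four layers named in the abstract (hypothetical model)."""
    transcription: str  # modified orthography, as transcribed
    normalization: str  # standard orthography
    lemma: str
    pos: str            # STTS part-of-speech tag


# Invented example utterance ("ich hab n auto"), not taken from the corpus.
utterance = [
    Token("ich", "ich", "ich", "PPER"),
    Token("hab", "habe", "haben", "VAFIN"),
    Token("n", "ein", "ein", "ART"),
    Token("auto", "Auto", "Auto", "NN"),
]


def layer(tokens, name):
    """Return one annotation layer as a list, e.g. layer(tokens, 'pos')."""
    return [getattr(t, name) for t in tokens]


def deviating_forms(tokens):
    """Tokens whose transcribed form differs from the normalized form, ignoring case."""
    return [t for t in tokens if t.transcription.lower() != t.normalization.lower()]


print(" ".join(layer(utterance, "normalization")))
print([t.transcription for t in deviating_forms(utterance)])
```

Keeping the layers aligned token by token makes it easy to query one layer (for example lemma or POS) while still displaying the transcribed form.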
The IFCASL corpus is a French-German bilingual phonetic learner corpus designed, recorded and annotated in a project on individualized feedback in computer-assisted spoken language learning. The motivation for setting up this corpus was that there is no phonetically annotated and segmented corpus of comparable size and coverage for this language pair. In contrast to most learner corpora, the IFCASL corpus incorporates data for a language pair in both directions, i.e. in our case French learners of German and German learners of French. In addition, the corpus is complemented by two sub-corpora of native speech by the same speakers. The corpus provides spoken data from about 100 speakers with comparable productions, annotated and segmented on the word and the phone level, with more than 50% manually corrected data. The paper reports on inter-annotator agreement and the optimization of the acoustic models for forced speech-text alignment in exercises for computer-assisted pronunciation training. Example studies based on the corpus data with a phonetic focus include topics such as the realization of /h/ and glottal stop, final devoicing of obstruents, vowel quantity and quality, pitch range, and tempo.
The present paper reports the first results of the compilation and annotation of a blog corpus for German. The main aim of the project is the representation of the blog discourse structure and of the relations between its elements (blog posts, comments) and participants (bloggers, commentators). The data included in the corpus were manually collected from the scientific blog portal SciLogs. The feature catalogue for the corpus annotation includes three types of information: information that is directly provided in the blog, information that is indirectly provided, and information that can be derived by means of statistical analysis or computational tools. At this point, only directly available information (e.g. title of the blog post, name of the blogger, etc.) has been annotated. We believe our blog corpus can be of interest for the general study of blog structure and related research questions as well as for the development of NLP methods and techniques (e.g. for authorship detection).
Preposition-noun combinations with a recurrent zero article in adverbial use, e.g. nach Belieben ('at will'), auf Knopfdruck ('at the push of a button'), ohne Ende ('without end') or bei Nacht ('at night'), are a type that multiword research has so far tended to neglect. They are the object of study of the ongoing research project "Präpositionale Wortverbindungen kontrastiv" (participating institutions: IDS Mannheim, University of Santiago de Compostela, University of Trnava), into which our talk offers an insight. We outline how such combinations, as well as more abstract prepositional patterns of the type [in + SUBX-Zeit(en)] (e.g. in Echtzeit 'in real time', in Krisenzeiten 'in times of crisis'), can be investigated on the basis of corpora and described lexicographically from a contrastive perspective (German, Spanish, Slovak). Of particular interest, especially for foreign language learners, are the semantic and functional restrictions to which such entities are subject. Building on the theoretical and empirical assumptions of the model "Usuelle Wortverbindungen" developed at the IDS (cf. Steyer 2013), the project first identifies collocation and cotext patterns for the binary German multiword units inductively in very large corpora; these are then systematically compared with Spanish and Slovak. Methodologically, we draw in all three languages on, among other things, co-occurrence profiles of the combinations and slot analyses of defined search patterns. One aim of the project is to develop a novel prototype for a multilingual presentation of the object of study (specifically for foreign language learners).
In this paper, we describe preliminary results from an ongoing experiment in which we classify two large unstructured text corpora (a web corpus and a newspaper corpus) by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with the metadata required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
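The second step described above (learning topic domains from induced topic distributions, evaluated by 10-fold cross-validation) can be sketched as follows. This is only an illustration under strong assumptions: the topic proportions are randomly generated toy data rather than the output of a topic model, the labels "politics" and "sports" are hypothetical stand-ins for the 13 domains, and a nearest-centroid classifier is swapped in because the abstract does not name the learner that was used.

```python
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(X, y):
    """Nearest-centroid 'training': one mean topic vector per domain label."""
    by_label = {}
    for vec, label in zip(X, y):
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(model, vec):
    """Assign the domain whose centroid is closest to the document's topic vector."""
    return min(model, key=lambda label: sq_dist(model[label], vec))


def cross_val_accuracy(X, y, k=10, seed=0):
    """Overall accuracy over k folds; each document is held out exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Toy stand-in for induced topic distributions: three topics per document,
# with one dominant topic per (hypothetical) domain.
rng = random.Random(1)


def toy_doc(dominant):
    weights = [0.1 + rng.uniform(-0.05, 0.05), 0.1 + rng.uniform(-0.05, 0.05)]
    weights[dominant] += 0.7
    return weights + [1.0 - sum(weights)]  # proportions sum to 1


X = [toy_doc(0) for _ in range(20)] + [toy_doc(1) for _ in range(20)]
y = ["politics"] * 20 + ["sports"] * 20

accuracy = cross_val_accuracy(X, y, k=10)
print(f"10-fold accuracy on toy data: {accuracy:.2f}")
```

The toy data are deliberately well separated, so the resulting figure says nothing about the accuracy reported in the abstract; the sketch only shows how topic proportions serve as features and how the folds are formed.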
This paper describes the workflow for the construction and dissemination of FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch, Research and Teaching Corpus of Spoken German), a large corpus of authentic spoken interaction data, recorded on audio and video. Section 2 describes in detail the tools used in the individual steps of transcription, anonymization, orthographic normalization, lemmatization and POS tagging of the data, as well as some utilities used for corpus management. Section 3 deals with the DGD (Datenbank für Gesprochenes Deutsch, Database of Spoken German) as a tool for distributing completed data sets and making them available for qualitative and quantitative analysis. Section 4 sketches some plans for further development.
There have been several previous attempts to annotate utterances of verbal feedback in English with communicative functions. Here, we suggest an annotation scheme for verbal and non-verbal feedback utterances in French that includes the categories base, attitude, previous and visual. The data comprise conversations, map tasks and negotiations, from which we extracted ca. 13,000 candidate feedback utterances and gestures. Twelve students were recruited for the annotation campaign, which covered ca. 9,500 instances. Each instance was annotated by between 2 and 7 raters. The evaluation of the annotation agreement resulted in an average best-pair kappa of 0.6. While the base category, with the values acknowledgement, evaluation, answer, elicit and other, achieves good agreement, this is not the case for the other main categories. The data sets, which also include automatic extractions of lexical, positional and acoustic features, are freely available and will further be used for machine learning classification experiments to analyse the form-function relationship of feedback.
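For readers unfamiliar with the agreement measure, the sketch below computes Cohen's kappa for two raters and then takes the highest value over all rater pairs. The abstract does not define "best-pair kappa", so reading it as the maximum pairwise kappa is an assumption, and the ratings are invented; only the label names come from the base category listed above.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labelled the same instances in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of instances")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions per label.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def best_pair_kappa(ratings):
    """Highest pairwise kappa among raters (assumed reading of 'best-pair')."""
    return max(
        cohen_kappa(ratings[r1], ratings[r2])
        for r1, r2 in combinations(sorted(ratings), 2)
    )


# Invented ratings for six feedback instances, using the base-category values.
ratings = {
    "rater1": ["acknowledgement", "acknowledgement", "answer",
               "evaluation", "acknowledgement", "other"],
    "rater2": ["acknowledgement", "acknowledgement", "answer",
               "acknowledgement", "acknowledgement", "other"],
    "rater3": ["acknowledgement", "evaluation", "answer",
               "evaluation", "other", "other"],
}

print(round(best_pair_kappa(ratings), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because a frequent label such as acknowledgement inflates raw agreement.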
We introduce our pipeline for integrating computer-mediated communication (CMC) and social media (SM) corpora into the CLARIN-D corpus infrastructure. The pipeline was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource conforming to current technical and legal standards. We describe how the resource has been prepared and restructured in terms of TEI encoding, linguistic annotations, and anonymisation. The output is a CLARIN-conformant resource integrated in the CLARIN-D research infrastructure.
Following a brief general introduction, this contribution presents the Datenbank für Gesprochenes Deutsch (DGD, Database of Spoken German) and the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK, Research and Teaching Corpus of Spoken German) as instruments specifically for conversation-analytic work. Using the example of sprich as a discourse marker for reformulations, it illustrates step by step the resources and tools for systematic corpus- and database-driven searches: uses of the token, context, metadata and position searches are shown, each in relation to and in interplay with qualitative case analyses, including the annotation of hits according to analytically relevant (structural and functional) categories. Finally, das heißt is brought in as a further reformulation indicator for a comparative analysis. This contribution is a more detailed elaboration of a shorter, more technical and didactic online guide on this topic (Kaiser/Schmidt 2016) and has a stronger focus on content and analysis.