Refine
Year of publication
- 2016 (47) (remove)
Document Type
- Conference Proceeding (21)
- Part of a Book (10)
- Article (8)
- Book (3)
- Working Paper (3)
- Doctoral Thesis (1)
- Part of Periodical (1)
Keywords
- Korpus <Linguistik> (47) (remove)
Publicationstate
- Veröffentlichungsversion (31)
- Zweitveröffentlichung (5)
- Postprint (2)
Reviewstate
Publisher
- European Language Resources Association (ELRA) (5)
- Nisaba (4)
- Winter (4)
- Gesellschaft für Sprachtechnologie und Computerlinguistik (3)
- Institut für Deutsche Sprache (3)
- European Language Resources Association (2)
- de Gruyter (2)
- Academic Publishing Division of the Faculty of Arts of the University of Ljubljana (1)
- Amsterdam [u.a.] (1)
- Association for Computational Linguistics (1)
The present paper reports the first results of the compilation and annotation of a blog corpus for German. The main aim of the project is the representation of the blog discourse structure and relations between its elements (blog posts, comments) and participants (bloggers, commentators). The data included in the corpus were manually collected from the scientific blog portal SciLogs. The feature catalogue for the corpus annotation includes three types of information which is directly or indirectly provided in the blog or can be construed by means of statistical analysis or computational tools. At this point, only directly available information (e.g., title of the blog post, name of the blogger etc.) has been annotated. We believe, our blog corpus can be of interest for the general study of blog structure or related research questions as well as for the development of NLP methods and techniques (e.g. for authorship detection).
The present paper reports the first results of the compilation and annotation of a blog corpus for German. The main aim of the project is the representation of the blog discourse structure and relations between its elements (blog posts, comments) and participants (bloggers, commentators). The data included in the corpus were manually collected from the scientific blog portal SciLogs. The feature catalogue for the corpus annotation includes three types of information which is directly or indirectly provided in the blog or can be construed by means of statistical analysis or computational tools. At this point, only directly available information (e.g. title of the blog post, name of the blogger etc.) has been annotated. We believe, our blog corpus can be of interest for the general study of blog structure or related research questions as well as for the development of NLP methods and techniques (e.g. for authorship detection).
Dieser Beitrag stellt nach einer kurzen allgemeinen Einführung die Datenbank für Gesprochenes Deutsch (DGD) und das Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK) als Instrumente speziell für gesprächsanalytisches Arbeiten vor. Anhand des Beispiels sprich als Diskursmarker für Reformulierungen werden Schritt für Schritt die Ressourcen und Tools für systematische korpus- und datenbankgesteuerte Recherchen illustriert: Nutzungsmöglichkeiten der Token-, Kontext-, Metadaten- und Positionssuche werden gezeigt, jeweils in Bezug auf und im wechselseitigen Verhältnis mit qualitativen Fallanalysen, auch mit Belegannotationen nach analyserelevanten (strukturellen und funktionalen) Kategorien. Schließlich wird das heißt als weiterer Reformulierungsindikator für eine vergleichende Analyse herangezogen. Dieser Beitrag stellt eine detailliertere Ausarbeitung einer kürzeren, eher technisch-didaktischen Online-Handreichung (Kaiser/ Schmidt 2016) zu diesem Thema dar, und hat einen stärker inhaltlich-analytischen Fokus.
Editorial
(2016)
Journal for language technology and computational linguistics. Corpus linguistic software tools
(2016)
With the growing availability and importance of (large) corpora in all fields of linguistics, the role of software tools is gradually moving from useful, possibly intelligent informationtechnological “helpers” towards scientific instruments that are as integral parts of the research process as data, methodology and interpretations. Both aspects are present in this special issue of JLCL on corpus linguistic software tools.
This paper is about the workflow for construction and dissemination of FOLK (Forschungs - und Lehrkorpus Gesprochenes Deutsch – Research and Teaching Corpus of Spoken German), a large corpus of authentic spoken interaction data, recorded on audio and video. Section 2 describes in detail the tools used in the individual steps of transcription, anonymization, orthographic normalization, lemmatization and POS tagging of the data, as well as some utilities used for corpus management. Section 3 deals with the DGD (Datenbank für Gesprochenes Deutsch - Database of Spoken German) as a tool for distributing completed data sets and making them available for qualitative and quantitative analysis. In section 4, some plans for further development are sketched.
The IFCASL corpus is a French-German bilingual phonetic learner corpus designed, recorded and annotated in a project on individualized feedback in computer-assisted spoken language learning. The motivation for setting up this corpus was that there is no phonetically annotated and segmented corpus for this language pair of comparable of size and coverage. In contrast to most learner corpora, the IFCASL corpus incorporate data for a language pair in both directions, i.e. in our case French learners of German, and German learners of French. In addition, the corpus is complemented by two sub-corpora of native speech by the same speakers. The corpus provides spoken data by about 100 speakers with comparable productions, annotated and segmented on the word and the phone level, with more than 50% manually corrected data. The paper reports on inter-annotator agreement and the optimization of the acoustic models for forced speech-text alignment in exercises for computer-assisted pronunciation training. Example studies based on the corpus data with a phonetic focus include topics such as the realization of /h/ and glottal stop, final devoicing of obstruents, vowel quantity and quality, pitch range, and tempo.
The current paper presents a corpus containing 35 dialogues of spontaneously spoken southern German, including half an hour of articulography for 13 of the speakers. Speakers were seated in separate recording chambers, mimicking a telephone call, and recorded on individual audio channels. The corpus provides manually corrected word boundaries and automatically aligned segment boundaries. Annotations are provided in the Praat format. In addition to audio recordings, speakers filled out a detailed questionnaire, assessing among others their audio-visual consumption habits.
Die Rolle der antizipatorischen Verstehensdokumentation erweist sich in den Interviews aus dem Israelkorpus m. E. als besonders wichtig. Es wird von der Tatsache ausgegangen, dass es sich bei den Informanten um Personen mit besonders delikaten biographischen Hintergründen handele. Die Interviewerinnen müssen demzufolge mit der starken emotionalen Belastung rechnen, der die Interviewten während der Rekonstruktion ihrer Lebensgeschichte ausgesetzt sind. Ein sehr direkter Frage-Antwort-Stil könnte wegen dieser emotionalen Belastung als unangenehm empfunden werden. Der Einsatz von Verfahren antizipatorischer Verstehensdokumentation weist stattdessen m. E. eindeutig darauf hin, wie sich die Interviewerinnen offensichtlich um Empathie bemühen und im Sinne einer intersubjektiven Inreraktionskonstitution mit den Interviewten kooperieren. Ziel dieses Beitrages ist es zu zeigen, wie solche Verfahren der antizipatorischen Verstehensdokumentation durch den systematischen Einsatz der Konnektoren und, also, dann realisiert werden können.