Refine
Year of publication
- 2016 (47) (remove)
Document Type
- Conference Proceeding (21)
- Part of a Book (10)
- Article (8)
- Book (3)
- Working Paper (3)
- Doctoral Thesis (1)
- Part of Periodical (1)
Keywords
- Korpus <Linguistik> (47) (remove)
Publicationstate
- Veröffentlichungsversion (31)
- Zweitveröffentlichung (5)
- Postprint (2)
Reviewstate
Publisher
- European Language Resources Association (ELRA) (5)
- Nisaba (4)
- Winter (4)
- Gesellschaft für Sprachtechnologie und Computerlinguistik (3)
- Institut für Deutsche Sprache (3)
- European Language Resources Association (2)
- de Gruyter (2)
- Academic Publishing Division of the Faculty of Arts of the University of Ljubljana (1)
- Amsterdam [u.a.] (1)
- Association for Computational Linguistics (1)
The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped like.
Dieser Beitrag stellt nach einer kurzen allgemeinen Einführung die Datenbank für Gesprochenes Deutsch (DGD) und das Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK) als Instrumente speziell für gesprächsanalytisches Arbeiten vor. Anhand des Beispiels sprich als Diskursmarker für Reformulierungen werden Schritt für Schritt die Ressourcen und Tools für systematische korpus- und datenbankgesteuerte Recherchen illustriert: Nutzungsmöglichkeiten der Token-, Kontext-, Metadaten- und Positionssuche werden gezeigt, jeweils in Bezug auf und im wechselseitigen Verhältnis mit qualitativen Fallanalysen, auch mit Belegannotationen nach analyserelevanten (strukturellen und funktionalen) Kategorien. Schließlich wird das heißt als weiterer Reformulierungsindikator für eine vergleichende Analyse herangezogen. Dieser Beitrag stellt eine detailliertere Ausarbeitung einer kürzeren, eher technisch-didaktischen Online-Handreichung (Kaiser/ Schmidt 2016) zu diesem Thema dar, und hat einen stärker inhaltlich-analytischen Fokus.
The current paper presents a corpus containing 35 dialogues of spontaneously spoken southern German, including half an hour of articulography for 13 of the speakers. Speakers were seated in separate recording chambers, mimicking a telephone call, and recorded on individual audio channels. The corpus provides manually corrected word boundaries and automatically aligned segment boundaries. Annotations are provided in the Praat format. In addition to audio recordings, speakers filled out a detailed questionnaire, assessing among others their audio-visual consumption habits.
In this paper, we present first results of training a classifier for discriminating Russian texts into different levels of difficulty. For the classification we considered both surface-oriented features adopted from readability assessments and more linguistically informed, positional features to classify texts into two levels of difficulty. This text classification is the main focus of our Levelled Study Corpus of Russian (LeStCoR), in which we aim to build a corpus adapted for language learning purposes – selecting simpler texts for beginner second language learners and more complex texts for advanced learners. The most discriminative feature in our pilot study was a lexical feature that approximates accessibility of the vocabulary by the second language learner in terms of the proportion of familiar words in the texts. The best feature setting achieved an accuracy of 0.91 on a pilot corpus of 209 texts.
The IFCASL corpus is a French-German bilingual phonetic learner corpus designed, recorded and annotated in a project on individualized feedback in computer-assisted spoken language learning. The motivation for setting up this corpus was that there is no phonetically annotated and segmented corpus for this language pair of comparable of size and coverage. In contrast to most learner corpora, the IFCASL corpus incorporate data for a language pair in both directions, i.e. in our case French learners of German, and German learners of French. In addition, the corpus is complemented by two sub-corpora of native speech by the same speakers. The corpus provides spoken data by about 100 speakers with comparable productions, annotated and segmented on the word and the phone level, with more than 50% manually corrected data. The paper reports on inter-annotator agreement and the optimization of the acoustic models for forced speech-text alignment in exercises for computer-assisted pronunciation training. Example studies based on the corpus data with a phonetic focus include topics such as the realization of /h/ and glottal stop, final devoicing of obstruents, vowel quantity and quality, pitch range, and tempo.
The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped like.
When becoming integrated into the German vocabulary, foreign words reflect paradigmatic changes regarding orthography, grammar as well as semantics. In this context,German orthography is also highly determined by orthographic codification, which continues to influence the development of spelling to the present day. This study compares digital linguistically annotated corpora containing texts written by professional as well as non-professional writers; these corpora contain several billion foreign words (of Greek, Latin and French origin, and in the second part of the study of English/American and Italian origin), studied over a period of 20 years following the German orthographic reform of 1996. The results may potentially help the official regulations to adapt to the spelling practices observed – either by describing the rules more precisely or by proposing possible spelling variants or eliminating those which are not in common use. The study may also help to support correct lexicographic codification in dictionaries.