Creating an extensible, levelled study corpus of Russian

In this paper, we present first results of training a classifier that assigns Russian texts to levels of difficulty. To classify texts into two levels of difficulty, we considered both surface-oriented features adopted from readability assessment and more linguistically informed, positional features. This text classification is the main focus of our Levelled Study Corpus of Russian (LeStCoR), in which we aim to build a corpus adapted for language learning purposes: simpler texts are selected for beginner second language learners and more complex texts for advanced learners. The most discriminative feature in our pilot study was a lexical feature that approximates how accessible the vocabulary is to a second language learner, measured as the proportion of familiar words in a text. The best feature setting achieved an accuracy of 0.91 on a pilot corpus of 209 texts.
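
The lexical feature mentioned in the abstract, the proportion of familiar words in a text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `familiar_word_ratio`, the regex-based tokenisation, the matching on lower-cased surface forms (without lemmatisation), and the tiny vocabulary set are all assumptions made for illustration only.

```python
import re


def familiar_word_ratio(text, familiar_words):
    """Return the share of tokens in `text` that occur in `familiar_words`.

    Assumption: tokens are runs of lower-cased Cyrillic letters and are
    compared as surface forms; a real system would likely lemmatise first.
    """
    tokens = re.findall(r"[а-яё]+", text.lower())
    if not tokens:
        return 0.0  # empty text: define the ratio as 0 to avoid division by zero
    known = sum(1 for token in tokens if token in familiar_words)
    return known / len(tokens)


# Hypothetical learner vocabulary (illustrative, not taken from the paper)
familiar = {"я", "читаю", "книгу", "и", "в"}

# Four of the five tokens are in the vocabulary
ratio = familiar_word_ratio("Я читаю книгу в библиотеке.", familiar)
print(ratio)
```

A higher ratio would suggest a text that is more accessible to the learner whose vocabulary the set approximates; how the ratio is thresholded or combined with other features is left open here.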

Metadata
Author:	Dolores Batinić, Sandra Birzer, Heike Zinsmeister
URN:	urn:nbn:de:bsz:mh39-59235
URL:	https://www.linguistics.rub.de/bla/
ISSN:	2190-0949
Parent Title (English):	Proceedings of the 13th Conference on Natural Language Processing (KONVENS) Bochum, Germany September 19–21, 2016
Series (Serial Number):	Bochumer Linguistische Arbeitsberichte (16)
Publisher:	Ruhr-Universität Bochum
Place of publication:	Bochum
Editor:	Stefanie Dipper, Friedrich Neubarth, Heike Zinsmeister
Document Type:	Part of a Book
Language:	English
Year of first Publication:	2016
Date of Publication (online):	2017/02/27
Publication state:	Published version
Review state:	(Publisher's) editorial review
Tag:	Levelled Study Corpus of Russian (LeStCoR)
GND Keyword:	Corpus <linguistics>; Russian
First Page:	38
Last Page:	43
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Linguistics-Classification:	Corpus linguistics
Licence (German):	Protected by copyright
