Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian
- The paper presents the long-standing tradition of Romanian corpus acquisition and processing, which culminates in the reference corpus of contemporary Romanian (CoRoLa). It describes the decisions behind the kinds of texts collected, as well as the processing and annotation steps, and highlights the structure and importance of the corpus metadata. The reader is also introduced to the three ways of exploring the rich linguistic data of the corpus. Beyond corpus querying, word embeddings extracted from the corpus are useful for various natural language processing applications and, when user-friendly interfaces make the data accessible, for linguists.
Author: | Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Vasile Păiș, Radu Ion, Nils Diewald, Maria Mitrofan, Mihaela Onofrei |
---|---|
URN: | urn:nbn:de:bsz:mh39-93851 |
URL: | http://www.lingv.ro/index.php?option=com_content&view=article&id=342%3Arrl-arhiva-2019&catid=36%3Areviste-ilb&Itemid=95 |
ISSN: | 0035-3957 |
Parent Title (Multiple languages): | Revue Roumaine de Linguistique. On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo |
Publisher: | Editura Academiei Române |
Place of publication: | Bucureşti |
Document Type: | Article |
Language: | English |
Year of first Publication: | 2019 |
Date of Publication (online): | 2019/11/11 |
Publication state: | Secondary publication |
Reviewstate: | Peer-Review |
Tag: | Romanian corpus; acquisition; annotation; metadata; query |
GND Keyword: | Annotation; Korpus <Linguistik>; Metadaten; Rumänisch |
Volume: | 64 |
Issue: | 3 |
First Page: | 227 |
Last Page: | 240 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Leibniz-Classification: | Language, Linguistics |
Linguistics-Classification: | Corpus linguistics |
Program areas: | Digital Linguistics |
Licence (German): | Protected by copyright |