CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language

  • This article reports on the ongoing CoRoLa project, which aims to create a reference corpus of contemporary Romanian (from 1945 onwards), open for free online use by researchers in linguistics and language processing, teachers of Romanian, and students. We are investing serious effort in persuading large publishing houses and other owners of IPR on relevant language data to join us and contribute selections of their text and speech repositories to the project. The CoRoLa project is coordinated by two Computer Science institutes of the Romanian Academy, and benefits from the cooperation and advice of professional linguists from other institutes of the Romanian Academy. We foresee a written component of more than 500 million word forms and a speech component of about 300 hours of recordings. The entire collection of texts (covering all functional styles of the language) will be pre-processed, annotated at several levels, and documented with standardized metadata. Pre-processing includes cleaning the data, harmonising the diacritics, sentence splitting and tokenization. Annotation will include morpho-lexical tagging and lemmatization in the first stage, followed by syntactic, semantic and discourse annotation in a later stage.
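
The pre-processing steps named in the abstract (diacritic harmonisation, sentence splitting, tokenization) can be illustrated with a minimal Python sketch. It is not the CoRoLa pipeline, whose tools are not described here. It assumes that harmonising diacritics means mapping the legacy cedilla forms of s and t to the comma-below forms used in standard Romanian orthography, and it uses deliberately naive regular expressions for the other two steps. The function names are hypothetical.

```python
import re

# Assumption: "harmonising the diacritics" is taken to mean replacing the
# legacy cedilla letters with the comma-below letters of standard Romanian.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})

# Naive boundary: sentence-final punctuation, whitespace, then an uppercase
# letter (ASCII plus the Romanian capitals with diacritics).
SENTENCE_BOUNDARY = re.compile(
    r"(?<=[.!?])\s+(?=[A-Z\u0102\u00C2\u00CE\u0218\u021A])"
)

# Naive tokens: word characters (hyphenated forms kept whole) or a single
# punctuation mark.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def harmonise_diacritics(text: str) -> str:
    """Map cedilla variants of s/t to their comma-below counterparts."""
    return text.translate(CEDILLA_TO_COMMA)


def split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation followed by a capital letter."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Return word and punctuation tokens for one sentence."""
    return TOKEN.findall(sentence)


if __name__ == "__main__":
    raw = "\u015etiin\u0163a e important\u0103. \u0162ara s-a schimbat!"
    for sentence in split_sentences(harmonise_diacritics(raw)):
        print(tokenize(sentence))
```

A production pipeline would need abbreviation handling in the sentence splitter and a linguistically motivated treatment of hyphenated clitics such as "s-a", which this sketch keeps as a single token.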

Metadata
Author:Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Ștefan Daniel Dumitrescu, Tiberiu Boroș, Horia Nicolai Teodorescu, Dan Cristea, Andrei Scutelnicu, Cecilia Bolea, Alex Moruz, Laura Pistol
URN:urn:nbn:de:bsz:mh39-38385
Parent Title (English):Proceedings of the 3rd Workshop on Challenges in the Management of Large Corpora (CMLC-3), Lancaster, 20 July 2015.
Publisher:Institut für Deutsche Sprache
Place of publication:Mannheim
Editor:Piotr Bański, Hanno Biber, Evelyn Breiteneder, Marc Kupietz, Harald Lüngen, Andreas Witt
Document Type:Conference Proceeding
Language:English
Year of first Publication:2015
Date of Publication (online):2015/07/02
Publicationstate:Published version
Reviewstate:Peer-reviewed
Tag:Corpus linguistics; Distributional semantics; Relation extraction; Semantic similarity; Text mining
GND Keyword:Corpus <linguistics>; Text mining; Similarity search
First Page:5
Last Page:10
Dewey Decimal Classification:400 Language / 410 Linguistics
Linguistics-Classification:Corpus linguistics
Conferences, Workshops:CMLC-3 / 3rd Workshop on Challenges in the Management of Large Corpora
Open Access?:Yes
Licence (German):Creative Commons - Attribution-NonCommercial-ShareAlike 3.0 Germany