
SCyDia – OCR for Serbian Cyrillic with diacritics

  • In the ongoing retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of the paper editions. One essential step is therefore needed before dictionary-making can move into the digital environment: OCRing the pages with the highest possible accuracy. OCR is not a new technology, and many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. Available solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. They are not efficient enough, however, when documents contain accented text and each character must be extracted precisely together with its diacritics. This paper presents the OCR software “SCyDia”, developed to overcome this issue. We demonstrate the organizational structure of “SCyDia” and the first results. “SCyDia” is a web-based software solution that relies on the open-source software “Tesseract” in the background. It also contains a module for semi-automatic text correction. We have already processed over 15,000 pages, 13 dialectal dictionaries, and five dialectal monographs. At this point in the project, we have analyzed the accuracy of “SCyDia” by processing 13 dialectal dictionaries. The results were analyzed manually by an expert who examined a number of randomly selected pages from each dictionary. The preliminary results show great promise, with accuracy ranging from 97.19% to 99.87%.
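
The abstract does not say how the reported accuracy figures were computed. As a minimal sketch, assuming they are character-level accuracy (one minus the edit distance divided by the length of a manually corrected reference), such a figure could be obtained as shown below. The function names `levenshtein` and `char_accuracy` are illustrative and not part of SCyDia. Both strings are Unicode-normalized first, because an accented letter can be encoded either as one code point or as a base letter plus a combining mark.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy of ocr_output against reference (assumed metric)."""
    # NFC makes equivalent encodings of the same accented letter compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - levenshtein(ref, hyp) / len(ref))
```

Under this assumed metric, an OCR result that drops a combining accent from a letter counts as one character error, which is why diacritics weigh on the score even when every base letter is recognized.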

Metadata
Author:Velibor Ilić, Lenka Bajčetić, Snežana Petrović, Ana Španović
URN:urn:nbn:de:bsz:mh39-111953
URL:https://euralex2022.ids-mannheim.de/wp-content/uploads/2022/07/Proceedings_11.07.2022.pdf
DOI:https://doi.org/10.14618/ids-pub-11195
ISBN:978-3-937241-87-6
Parent Title (English):Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
Publisher:IDS-Verlag
Place of publication:Mannheim
Editor:Annette Klosa-Kückelhaus, Stefan Engelberg, Christine Möhrs, Petra Storjohann
Document Type:Part of a Book
Language:English
Year of first Publication:2022
Date of Publication (online):2022/08/23
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:Cyrillic; OCR; SCyDia; Serbian language; convolutional neural networks; retro-digitization
GND Keyword:Dialectology; Digitization; Electronic dictionary; Cyrillic script <printed type>; OCR processing; Serbian
First Page:387
Last Page:400
DDC classes:400 Language / 420 English
Open Access?:yes
Linguistics-Classification:Lexicography
Conferences, Workshops:Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
Licence:Creative Commons - CC BY-SA - Attribution - ShareAlike 4.0 International