OCR Nachkorrektur des Royal Society Corpus (OCR Post-Correction of the Royal Society Corpus)

  • We present an approach for the automatic detection and correction of OCR-induced misspellings in historical texts. The main objective is the post-correction of the digitized Royal Society Corpus, a set of historical documents from 1665 to 1869. Because of the age of the source material, the OCR procedure made mistakes, leaving the files corrupted by thousands of misspellings, which motivates a post-processing step. The current correction technique is a pattern-based approach that suffers from poor recall because it does not generalize. To generalize from the patterns, we propose to use the noisy channel model: from the pattern-based substitutions we train a corpus-specific error model, complemented by a language model. With an F1-score of 0.61, the presented technique significantly outperforms the pattern-based approach, which has an F1-score of 0.28. Owing to its more accurate error model, it also outperforms other implementations of the noisy channel model.
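To make the noisy channel idea concrete: for an observed OCR token w, the model picks the candidate c that maximizes P(w | c) · P(c), where P(w | c) comes from the error model and P(c) from the language model. The following is a minimal sketch of that scoring rule, not the authors' implementation. All counts, the word list, and the function names (`channel_prob`, `language_prob`, `candidates`, `correct`) are hypothetical. It also assumes at most one substitution per token, a unigram language model with add-one smoothing, and a channel probability of 1.0 for leaving a token unchanged.

```python
import math
from collections import Counter

# Hypothetical counts of (intended, observed) character sequences, standing in
# for substitutions harvested from pattern-based corrections.
CHANNEL_COUNTS = Counter({
    ("s", "s"): 80, ("s", "f"): 20,   # e.g. a long s misread as f
    ("f", "f"): 100,
    ("w", "w"): 90, ("w", "vv"): 10,  # e.g. w misread as vv
})

# Toy unigram counts standing in for a language model (assumed values).
WORD_COUNTS = Counter({"said": 50, "same": 40, "fame": 5, "water": 30})
TOTAL_WORDS = sum(WORD_COUNTS.values())
VOCAB_SIZE = len(WORD_COUNTS)


def channel_prob(intended: str, observed: str) -> float:
    """Estimate P(observed | intended) from the substitution counts."""
    total = sum(c for (i, _), c in CHANNEL_COUNTS.items() if i == intended)
    return CHANNEL_COUNTS[(intended, observed)] / total if total else 0.0


def language_prob(word: str) -> float:
    """Add-one smoothed unigram probability; unseen words share one extra slot."""
    return (WORD_COUNTS[word] + 1) / (TOTAL_WORDS + VOCAB_SIZE + 1)


def candidates(observed: str):
    """Yield (candidate, channel probability) pairs with at most one substitution."""
    # Simplifying assumption: keeping the token unchanged has channel probability 1.0.
    yield observed, 1.0
    for intended_seg, observed_seg in CHANNEL_COUNTS:
        if intended_seg == observed_seg:
            continue
        start = observed.find(observed_seg)
        while start != -1:
            end = start + len(observed_seg)
            candidate = observed[:start] + intended_seg + observed[end:]
            yield candidate, channel_prob(intended_seg, observed_seg)
            start = observed.find(observed_seg, start + 1)


def correct(observed: str) -> str:
    """Return the candidate maximizing log P(observed | c) + log P(c)."""
    best, _ = max(
        candidates(observed),
        key=lambda pair: math.log(pair[1]) + math.log(language_prob(pair[0])),
    )
    return best
```

Calling `correct("faid")` or `correct("vvater")` with these toy counts shows how an error model derived from substitution patterns can generalize to tokens that no single hand-written pattern lists explicitly, as long as the language model supports the corrected form. A context-free unigram model will also rewrite rare but valid words, which is one reason a stronger language model matters in practice.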

Metadata
Author:Carsten Klaus, Peter Fankhauser, Dietrich Klakow
URN:urn:nbn:de:bsz:mh39-85353
URL:https://dhd2019.org/programm/do/postersession/poster-147/
URL:https://zenodo.org/record/2596095#.XKx0hqTgqUk
DOI:https://doi.org/10.5281/zenodo.2596095
Parent Title (German):6. Jahrestagung des Verbands Digital Humanities im deutschsprachigen Raum, Multimedial & Multimodal, 25.03. - 29.03.2019, Frankfurt & Mainz [DHd 2019 Digital Humanities: multimedial & multimodal. Konferenzabstracts]
Publisher:Zenodo
Place of publication:Frankfurt [u.a.]
Editor:Patrick Sahle
Document Type:Conference Proceeding
Language:English
Year of first Publication:2019
Date of Publication (online):2019/02/27
Contributing Corporation:Verband Digital Humanities im deutschsprachigen Raum
Publication state:Published version
Review state:Peer review
GND Keyword:Automatische Sprachverarbeitung; Digital Humanities; Korrektur; OCR-Schrift
First Page:337
Last Page:339
Dewey Decimal Classification:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Computational linguistics
Licence (English):Creative Commons - Attribution-NoDerivs 4.0 International