Robust extraction of marked-up text sections from scientific document printouts
- We present a simple tool for extracting text and markup information from printouts of scientific and other documents. While the heavy lifting of OCR is done by off-the-shelf Tesseract, our focus is on the detection, extraction, and basic categorization of color-highlighted text sections, as well as on providing a framework for downstream processing of the extraction results. The tool can be useful for document analysis tasks that require, or benefit from, working with printed paper.
Author: | Mark-Christoph Müller |
---|---|
URN: | urn:nbn:de:bsz:mh39-124275 |
URL: | https://iapr.org/archives/das2022/wp-content/uploads/2022/05/DAS2022_short-paper-booklet.pdf |
Parent Title (English): | 15th IAPR International Workshop, DAS 2022, La Rochelle, France, May 22–25, 2022 |
Publisher: | La Rochelle University |
Place of publication: | La Rochelle |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2022 |
Date of Publication (online): | 2024/01/09 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publication state: | Published version |
Review state: | Peer-reviewed |
Tag: | Document Images; Information Extraction; Multi-modality; Natural Language Processing; OCR |
GND Keyword: | Dokument (document); Kategorisierung (categorization); Optische Zeichenerkennung (optical character recognition); Schriftstück (written document); Texttechnologie (text technology) |
First Page: | 32 |
Last Page: | 35 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Linguistics-Classification: | Text linguistics / Written language |
Licence (German): | Protected by copyright |