Robust extraction of marked-up text sections from scientific document printouts

We present a simple tool for extracting text and markup information from printouts of scientific (and other) documents. While the heavy lifting of OCR is done by off-the-shelf Tesseract, our focus is on the detection, extraction, and basic categorization of color-highlighted text sections, as well as on providing a framework for downstream processing of the extraction results. The tool can be useful for document analysis tasks that must use printed paper, or that benefit from being able to do so.

Metadata
Author:	Mark-Christoph Müller
URN:	urn:nbn:de:bsz:mh39-124275
URL:	https://iapr.org/archives/das2022/wp-content/uploads/2022/05/DAS2022_short-paper-booklet.pdf
Parent Title (English):	15th IAPR International Workshop, DAS 2022, La Rochelle, France, May 22–25, 2022
Publisher:	La Rochelle University
Place of publication:	La Rochelle
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2022
Date of Publication (online):	2024/01/09
Publishing Institution:	Leibniz-Institut für Deutsche Sprache (IDS)
Publication state:	Published version
Review state:	Peer-reviewed
Tag:	Document Images; Information Extraction; Multi-modality; Natural Language Processing; OCR
GND Keyword:	Dokument; Kategorisierung; Optische Zeichenerkennung; Schriftstück; Texttechnologie
First Page:	32
Last Page:	35
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Linguistics-Classification:	Text linguistics / Written language
Licence (German):	Protected by copyright
