Robust extraction of marked-up text sections from scientific document printouts

  • We present a simple tool for extracting text and markup information from printouts of scientific and other documents. While the heavy lifting of OCR is done by off-the-shelf Tesseract, our focus is on the detection, extraction, and basic categorization of color-highlighted text sections, and on providing a framework for downstream processing of the extraction results. The tool can be useful for document analysis tasks that require printed paper or benefit from being able to use it.

Metadata
Author:Mark-Christoph Müller
URN:urn:nbn:de:bsz:mh39-124275
URL:https://iapr.org/archives/das2022/wp-content/uploads/2022/05/DAS2022_short-paper-booklet.pdf
Parent Title (English):15th IAPR International Workshop, DAS 2022, La Rochelle, France, May 22–25, 2022
Publisher:La Rochelle University
Place of publication:La Rochelle
Document Type:Conference Proceeding
Language:English
Year of first Publication:2022
Date of Publication (online):2024/01/09
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:Document Images; Information Extraction; Multi-modality; Natural Language Processing; OCR
GND Keyword:Document; Categorization; Optical character recognition; Written document; Text technology
First Page:32
Last Page:35
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Linguistics-Classification:Text linguistics / Written language
Licence (German):Protected by copyright