OPUS 4 | 400 Language, Linguistics

400 Language (135)
401 Philosophy and theory of language (2)
402 Miscellany
403 Dictionaries, encyclopedias
404 Special topics (1)
405 Serial publications
406 Organizations, management
407 Education, research, related topics (1)
408 Treatment by groups of people
409 Geographic and person-related treatment

2 search hits

Robust extraction of marked-up text sections from scientific document printouts (2022)

Müller, Mark-Christoph

We present a simple tool for extracting text and markup information from printouts of (not only) scientific documents. While the heavy-lifting OCR is done by off-the-shelf Tesseract, our focus is on detection, extraction, and basic categorization of color-highlighted text sections, as well as on providing a framework for downstream processing of extraction results. The tool can be useful for document analysis tasks that must use printed paper, or that benefit from being able to do so.

Word-level alignment of paper documents with their electronic full-text counterparts (2021)

Müller, Mark-Christoph ; Ghosh, Sucheta ; Wittig, Ulrike ; Rey, Maja

We describe a simple procedure for the automatic creation of word-level alignments between printed documents and their respective full-text versions. The procedure is unsupervised, uses standard, off-the-shelf components only, and reaches an F-score of 85.01 in the basic setup and up to 86.63 when using pre- and post-processing. Potential areas of application are manual database curation (incl. document triage) and biomedical expression OCR.
