400 Sprache, Linguistik
Refine
Document Type
Language
- English (2)
Has Fulltext
- yes (2)
Is part of the Bibliography
- yes (2)
Keywords
- Natural Language Processing (2) (remove)
Publicationstate
Reviewstate
- Peer-Review (2)
Publisher
- La Rochelle University (1)
- Zenodo (1)
We present a simple tool for extracting text and markup information from printouts of (not only) scientific documents. While the heavy-lifting OCR is done by off-the-shelf tesseract, our focus is on detection, extraction, and basic categorization of color-highlighted text sections, as well as on providing a framework for downstream processing of extraction results. The tool can be useful for document analysis tasks that must, or benefit from being able to, use printed paper.
In order to differentiate between figurative and literal usage of verb-noun combinations for the shared task on the disambiguation of German Verbal Idioms issued for KONVENS 2021, we apply and extend an approach originally developed for detecting idioms in a dataset consisting of random ngram samples. The classification is done by implementing a rather shallow, statistics-based pipeline without intensive preprocessing and examinations on the morphosyntactic and semantic level. We describe the overall approach, the differences between the original dataset and the dataset of the KONVENS task, provide experimental classification results, and analyse the individual contributions of our feature sets.