Word-level alignment of paper documents with their electronic full-text counterparts
- We describe a simple procedure for the automatic creation of word-level alignments between printed documents and their respective full-text versions. The procedure is unsupervised, uses standard, off-the-shelf components only, and reaches an F-score of 85.01 in the basic setup and up to 86.63 when using pre- and post-processing. Potential areas of application are manual database curation (incl. document triage) and biomedical expression OCR.
Author: | Mark-Christoph Müller, Sucheta Ghosh, Ulrike Wittig, Maja Rey |
---|---|
URN: | urn:nbn:de:bsz:mh39-110839 |
DOI: | https://doi.org/10.18653/v1/2021.bionlp-1.19 |
ISBN: | 978-1-954085-40-4 |
Parent Title (English): | Proceedings of the 20th Workshop on Biomedical Language Processing. June 11, 2021 |
Publisher: | Association for Computational Linguistics |
Place of publication: | Stroudsburg, Pennsylvania |
Editor: | Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2021 |
Date of Publication (online): | 2022/06/10 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publication state: | Published version |
Review state: | Peer-reviewed |
Tag: | biomedical language processing; document triage; manual database curation; word-level alignment |
GND Keyword: | Ausrichten <Technik>; Computerlinguistik; Optische Zeichenerkennung; Volltext; XML |
First Page: | 168 |
Last Page: | 179 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Linguistics-Classification: | Computational linguistics |