Word-Level Alignment of Paper Documents with their Electronic Full-Text Counterparts
Mark-Christoph Müller, Sucheta Ghosh, Ulrike Wittig, and Maja Rey

TL;DR
This paper presents a simple, unsupervised method for automatically aligning words in printed documents with their electronic full-text versions. It reaches an F-score of 85.01 in the basic setup and up to 86.63 with pre- and post-processing, with potential applications in manual database curation and biomedical expression OCR.
Contribution
It introduces an unsupervised procedure for word-level alignment between printed documents and their full-text versions, built only from standard, off-the-shelf components.
Findings
Achieved an F-score of 85.01 in the basic setup
Improved the F-score to up to 86.63 with pre- and post-processing
Potentially applicable to manual database curation (including document triage) and biomedical expression OCR
Abstract
We describe a simple procedure for the automatic creation of word-level alignments between printed documents and their respective full-text versions. The procedure is unsupervised, uses standard, off-the-shelf components only, and reaches an F-score of 85.01 in the basic setup and up to 86.63 when using pre- and post-processing. Potential areas of application are manual database curation (incl. document triage) and biomedical expression OCR.
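The abstract does not name the components the authors used, so the following is only a rough illustration of what word-level alignment and its F-score evaluation can look like, not the paper's procedure. It assumes the printed page has already been turned into a token list by some OCR step and aligns it with the full-text tokens using Python's standard `difflib.SequenceMatcher`. The function names `align_words` and `f_score` and the sample tokens are hypothetical.

```python
from difflib import SequenceMatcher


def align_words(ocr_tokens, text_tokens):
    """Pair up token positions that match exactly, in order.

    Assumption: exact string matching stands in for whatever
    similarity measure a real system would use.
    """
    matcher = SequenceMatcher(a=ocr_tokens, b=text_tokens, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # Each block is a run of `size` identical tokens starting at
        # position block.a in the OCR list and block.b in the full text.
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


def f_score(predicted, gold):
    """Harmonic mean of precision and recall over alignment pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the OCR step misreads "enzyme" as "enzyrne".
ocr = ["The", "enzyrne", "was", "purified"]
full_text = ["The", "enzyme", "was", "purified"]
gold = [(0, 0), (1, 1), (2, 2), (3, 3)]

pairs = align_words(ocr, full_text)
score = f_score(pairs, gold)
```

In this sketch the misread token stays unaligned, which costs recall but not precision. Pre- and post-processing of the kind the abstract mentions would presumably aim to recover such cases, although the abstract does not say how.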
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Biomedical Text Mining and Ontologies · Scientific Computing and Data Management
