Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, Graham Neubig

TL;DR
This paper introduces a semi-supervised learning approach with lexically-aware decoding for OCR post-correction, significantly improving recognition accuracy on endangered languages by leveraging raw images and language models.
Contribution
It proposes a semi-supervised self-training method combined with lexically-aware decoding implemented with weighted finite-state automata (WFSA), enabling better OCR post-correction when annotated data is limited.
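To make the idea of lexically-aware decoding concrete, here is a minimal sketch in Python. It is not the authors' WFSA implementation: it swaps the automaton for a plain word-count table, and it assumes the lexicon is built from previously recognized text. The names `build_lexicon`, `rescore`, and the `weight` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def build_lexicon(recognized_lines):
    # Assumption: the lexicon is a simple count of whitespace-separated words
    # seen in already-recognized text.
    return Counter(w for line in recognized_lines for w in line.split())

def rescore(candidates, lexicon, weight=0.5):
    """Pick the best word from (word, model_log_prob) candidates.

    The model score is combined with an add-one smoothed unigram
    log-probability from the lexicon, so words seen before are favored.
    """
    total = sum(lexicon.values())
    vocab = len(lexicon) + 1  # one extra slot for unseen words

    def lm_logp(word):
        return math.log((lexicon[word] + 1) / (total + vocab))

    best = max(candidates, key=lambda c: c[1] + weight * lm_logp(c[0]))
    return best[0]

lexicon = build_lexicon(["the cat sat", "the dog sat", "the end"])
# The model slightly prefers the misrecognized "tbe"; the lexicon prefers "the".
candidates = [("tbe", math.log(0.55)), ("the", math.log(0.45))]
choice = rescore(candidates, lexicon)
```

In this toy setup the lexicon term outweighs the small model preference for "tbe", so the word that already appears in the recognized vocabulary is chosen. A WFSA would apply the same kind of preference during decoding rather than as a separate rescoring step.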
Findings
Achieved 15-29% relative error reduction on four endangered languages.
Demonstrated the effectiveness of combining self-training with lexically-aware decoding.
Provided open-source data and code for reproducibility.
Abstract
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware…
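The self-training loop described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's model: `ToyCorrector` is a hypothetical stand-in that memorizes word-level corrections instead of training a neural network, it assumes each OCR line and its correction have the same number of words, and the number of rounds is arbitrary.

```python
from collections import Counter, defaultdict

class ToyCorrector:
    """Hypothetical stand-in for a post-correction model (word-level lookup)."""

    def __init__(self):
        self.table = {}

    def fit(self, pairs):
        # Count which corrected word each OCR word maps to, keep the majority.
        votes = defaultdict(Counter)
        for ocr_line, corrected_line in pairs:
            for src, tgt in zip(ocr_line.split(), corrected_line.split()):
                votes[src][tgt] += 1
        self.table = {s: c.most_common(1)[0][0] for s, c in votes.items()}

    def predict(self, ocr_line):
        # Words never seen in training are passed through unchanged.
        return " ".join(self.table.get(w, w) for w in ocr_line.split())

def self_train(model, labeled, unlabeled, rounds=2):
    # Step 1: train on the small manually corrected set.
    model.fit(labeled)
    for _ in range(rounds):
        # Step 2: label the raw OCR output with the model's own predictions.
        pseudo = [(line, model.predict(line)) for line in unlabeled]
        # Step 3: retrain on gold plus pseudo-labeled data, then repeat.
        model.fit(labeled + pseudo)
    return model

labeled = [("tbe cat", "the cat")]
unlabeled = ["tbe dog", "a cat"]
model = self_train(ToyCorrector(), labeled, unlabeled)
corrected = model.predict("tbe bird")
```

The lookup model gains little from its own outputs, so the sketch only shows the control flow: train, pseudo-label the unannotated data, retrain. With a neural model, the pseudo-labeled pairs are what let the unannotated images contribute to training.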
Taxonomy
Topics: Handwritten Text Recognition Techniques · Text and Document Classification Technologies
