Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Shruti Rijhwani; Daisy Rosenblum; Antonios Anastasopoulos; Graham Neubig

arXiv:2111.02622 · cs.CL · November 5, 2021

TL;DR

This paper introduces a semi-supervised learning approach with lexically-aware decoding for OCR post-correction, significantly improving recognition accuracy on endangered languages by leveraging raw images and language models.

Contribution

The paper proposes a semi-supervised self-training method combined with lexically-aware decoding, implemented with weighted finite-state automata (WFSA), so that OCR post-correction can improve even when annotated data is limited.

Findings

01. Achieved a 15-29% relative error reduction on four endangered languages.

02. Demonstrated the effectiveness of combining self-training with lexically-aware decoding.

03. Provided open-source data and code for reproducibility.
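As a reminder of what "relative error reduction" measures, here is a minimal sketch. The function name and the example error rates are illustrative assumptions, not figures from the paper; only the 15-29% range above comes from the source.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that the new system removes.

    Both arguments are error rates on the same test set, for example
    character error rate (CER) expressed as a fraction between 0 and 1.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline CER of 0.20 that drops to 0.15 is a 25% relative reduction,
# even though the absolute reduction is only 5 percentage points.
example = relative_error_reduction(0.20, 0.15)
```

A relative figure like this depends on the baseline, so the same absolute gain counts for more when the starting error rate is already low.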

Abstract

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware…
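The self-training idea described in the abstract can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' implementation: the character-substitution "model", the names `train_char_map` and `self_train`, and the sample strings are all hypothetical stand-ins for the neural post-correction model and real OCR data.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]       # (first-pass OCR line, corrected line)
Model = Callable[[str], str]  # maps an OCR line to a corrected line


def train_char_map(pairs: List[Pair]) -> Model:
    """Toy stand-in for a post-correction model (an assumption).

    Learns, by majority vote over equal-length pairs, which target
    character each source character should be replaced with.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for src, tgt in pairs:
        if len(src) != len(tgt):
            continue  # the toy model only handles substitutions
        for s, t in zip(src, tgt):
            votes[s][t] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}

    def correct(line: str) -> str:
        return "".join(table.get(ch, ch) for ch in line)

    return correct


def self_train(train_fn: Callable[[List[Pair]], Model],
               labeled: List[Pair],
               unlabeled: List[str],
               rounds: int = 2) -> Model:
    """Iteratively retrain on the model's own outputs (pseudo-labels)."""
    model = train_fn(labeled)
    for _ in range(rounds):
        # Label the unannotated lines with the current model ...
        pseudo = [(src, model(src)) for src in unlabeled]
        # ... then retrain on the gold pairs plus the pseudo-labeled pairs.
        model = train_fn(labeled + pseudo)
    return model


# Hypothetical data: one annotated pair and one unannotated OCR line.
model = self_train(train_char_map, [("he1lo", "hello")], ["wor1d"])
corrected = model("1ist")
```

The loop only shows the data flow of self-training. In a real system the pseudo-labeled pairs can carry the model's own mistakes, which is one reason to pair self-training with a consistency mechanism.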

Code & Models

Repositories

shrutirij/ocr-post-correction (Official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Text and Document Classification Technologies