AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions

Martin Kišš; Karel Beneš; Michal Hradiš

arXiv:2104.13037·cs.CV·January 26, 2022

TL;DR

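This paper introduces AT-ST, a self-training strategy for OCR in low-resource domains. A seed system trained on related-domain data transcribes unannotated target-domain data, and the machine transcriptions, combined with masking augmentation, are used to train a better system, reducing character error rate by up to 55% on handwritten and 38% on printed text.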

Contribution

The paper presents a self-training approach that combines confidence-based data selection with an aggressive masking augmentation to improve OCR in limited-annotation scenarios.

Findings

01

Achieves up to 55% reduction in character error rate for handwritten OCR.

02

Achieves up to 38% reduction in character error rate for printed OCR.

03

Masking-based data augmentation reduces the error rate by about 10%.
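The findings above are stated in terms of character error rate (CER). As a minimal sketch of how such figures are typically computed, the Python below measures CER as the Levenshtein edit distance divided by the number of reference characters, and then a relative reduction between two systems. Reading the reported percentages as relative reductions is an assumption here, and the function names and the example numbers are illustrative only; they are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Character error rate: total edit errors over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the improved error rate undercuts the baseline."""
    return (baseline - improved) / baseline
```

Under this reading, a hypothetical system that moves from 10% CER to 4.5% CER would show a relative reduction of 55%.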

Abstract

This paper addresses text recognition for domains with limited manual annotations using a simple self-training strategy. Our approach should reduce human annotation effort when target-domain data is plentiful, such as when transcribing a collection of a single person's correspondence or a large manuscript. We propose to train a seed system on large-scale data from related domains mixed with the available annotated data from the target domain. The seed system transcribes the unannotated data from the target domain, and these machine transcriptions are then used to train a better system. We study several confidence measures and eventually decide to use the posterior probability of a transcription for data selection. Additionally, we propose to augment the data using an aggressive masking scheme. By self-training, we achieve up to 55% reduction in character error rate for handwritten data and up to 38% on printed data. The…
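The data selection step described in the abstract can be sketched as follows. This is a minimal illustration, assuming the seed system returns a log posterior probability for each transcribed line; the `Hypothesis` class, the `log_posterior` field, and the `keep_fraction` parameter are hypothetical names introduced here, and the rule of keeping a fixed fraction of the most confident lines is an assumption rather than the authors' exact procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    line_id: str
    text: str             # transcription produced by the seed system
    log_posterior: float  # assumed: log P(text | line image) from the decoder


def select_confident(hyps: list[Hypothesis], keep_fraction: float) -> list[Hypothesis]:
    """Keep the most confident fraction of machine-transcribed lines."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not hyps:
        return []
    # Higher log posterior means the seed system is more certain of the line.
    ranked = sorted(hyps, key=lambda h: h.log_posterior, reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]
```

The selected lines would then be mixed with the annotated data to train the next system. A threshold on the posterior itself, rather than a fixed fraction, would be an equally plausible variant of the same idea.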

Code & Models

Repositories

DCGM/pero-ocr (official)
