An Effective Data Augmentation Method by Asking Questions about Scene Text Images
Xu Yao, Lei Kang

TL;DR
This paper introduces a novel data augmentation approach for scene and handwritten text recognition that uses question-answering tasks to improve OCR models' reasoning about text structure, leading to better accuracy.
Contribution
It proposes a VQA-inspired framework that generates natural-language questions about text attributes to enhance OCR training with structured reasoning tasks.
Findings
Significant reduction in character error rate (CER) and word error rate (WER) on the WordArt and Esposalles datasets.
Consistent improvements over baseline OCR models.
Enhanced reasoning about character-level text attributes.
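For readers unfamiliar with the two metrics named in the findings, the sketch below shows how CER and WER are conventionally computed: Levenshtein edit distance divided by the reference length, over characters and whitespace-separated words respectively. The function names are illustrative, and the paper's exact evaluation protocol (case handling, punctuation, normalization) is not stated in this summary, so this is an assumption-laden illustration rather than the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under these definitions, a one-character substitution in a five-character word gives a CER of 0.2, while the same error counts as one full word error for WER.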
Abstract
Scene text recognition (STR) and handwritten text recognition (HTR) face significant challenges in accurately transcribing textual content from images into machine-readable formats. Conventional OCR models often predict transcriptions directly, which limits detailed reasoning about text structure. We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency, with answers derived from ground-truth text. These auxiliary tasks encourage finer-grained reasoning, and the OCR model aligns visual features with textual queries to jointly reason over images and questions. Experiments on WordArt and Esposalles datasets show consistent improvements over baseline models, with significant…
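The question-generation step described in the abstract can be illustrated with a minimal sketch that derives presence, position, and frequency questions from a ground-truth transcription. The question templates, the function name build_qa_pairs, and the choice of a lowercase Latin alphabet for negative presence questions are all assumptions made for illustration; the summary does not give the authors' actual templates or sampling strategy.

```python
from collections import Counter
import string


def build_qa_pairs(text, alphabet=string.ascii_lowercase):
    """Derive (question, answer) pairs from a ground-truth transcription.

    Hypothetical templates; answers come only from the ground-truth text.
    """
    if not text:
        return []
    pairs = []
    counts = Counter(text)

    # Presence and frequency questions for each distinct character in the text.
    for ch in sorted(counts):
        pairs.append((f'Is the character "{ch}" present in the text?', "yes"))
        pairs.append((f'How many times does "{ch}" appear in the text?',
                      str(counts[ch])))

    # One negative presence question, using the first absent alphabet character.
    absent = [c for c in alphabet if c not in counts]
    if absent:
        pairs.append((f'Is the character "{absent[0]}" present in the text?',
                      "no"))

    # Position questions, using 1-based indexing.
    for i, ch in enumerate(text, start=1):
        pairs.append((f"Which character is at position {i}?", ch))

    return pairs
```

In a training pipeline, each image would be paired with a sample of these questions alongside its transcription target, so the model sees the same image under several textual queries.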
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling
