An Effective Data Augmentation Method by Asking Questions about Scene Text Images

Xu Yao; Lei Kang

arXiv:2603.03580·cs.CV·March 5, 2026

Open Access

TL;DR

This paper introduces a data augmentation approach for scene and handwritten text recognition that adds question-answering tasks to OCR training, pushing models to reason about text structure and lowering character and word error rates.

Contribution

The paper proposes a VQA-inspired framework that generates natural-language questions about character-level text attributes, adding structured reasoning tasks to OCR training.

Findings

01. Significant reduction in CER and WER on WordArt and Esposalles datasets.

02. Consistent improvements over baseline OCR models.

03. Enhanced reasoning about character-level text attributes.
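
For readers unfamiliar with the two metrics named in the findings, the sketch below computes character error rate (CER) and word error rate (WER) using their standard edit-distance definitions. The function names are illustrative, and this listing does not state whether the paper applies any normalization (case folding, punctuation stripping) before scoring, so none is assumed here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word edits divided by the number of reference words (whitespace split)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

For example, `cer("kitten", "sitting")` is 0.5: three character edits against a six-character reference.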

Abstract

Scene text recognition (STR) and handwritten text recognition (HTR) face significant challenges in accurately transcribing textual content from images into machine-readable formats. Conventional OCR models often predict transcriptions directly, which limits detailed reasoning about text structure. We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency, with answers derived from ground-truth text. These auxiliary tasks encourage finer-grained reasoning, and the OCR model aligns visual features with textual queries to jointly reason over images and questions. Experiments on WordArt and Esposalles datasets show consistent improvements over baseline models, with significant…
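
The question-generation step described in the abstract can be illustrated with a minimal sketch. It derives presence, position, and frequency questions from a ground-truth transcription alone. The question templates, the lowercase distractor alphabet, the case-sensitive matching, and the function name `generate_questions` are all assumptions made for illustration; the paper's actual templates and sampling strategy are not given in this listing.

```python
import random
from collections import Counter


def generate_questions(text, rng=None, n_absent=1):
    """Build (question, answer) pairs from one ground-truth transcription.

    Hypothetical templates: the real wording used by the authors is unknown.
    """
    rng = rng or random.Random(0)
    counts = Counter(text)
    qa = []

    # Presence (positive) and frequency questions for characters in the text.
    for ch in sorted(counts):
        qa.append((f"Does the text contain '{ch}'?", "yes"))
        qa.append((f"How many times does '{ch}' appear?", str(counts[ch])))

    # Presence (negative) questions for sampled characters not in the text.
    # The lowercase Latin alphabet is an assumed distractor pool.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    absent = [c for c in alphabet if c not in counts]
    for ch in rng.sample(absent, min(n_absent, len(absent))):
        qa.append((f"Does the text contain '{ch}'?", "no"))

    # Position questions, using 1-based indices.
    for i, ch in enumerate(text, start=1):
        qa.append((f"Which character is at position {i}?", ch))

    return qa
```

Each returned pair would be attached to the same image as its transcription, so a single image-text pair yields several auxiliary training examples whose answers require no extra annotation.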

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling