SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

Yibo Peng; Peng Xia; Ding Zhong; Kaide Zeng; Siwei Han; Yiyang Zhou; Jiaqi Liu; Ruiyi Zhang; Huaxiu Yao

arXiv:2602.22426·cs.CV·February 27, 2026

Open Access

TL;DR

SimpleOCR is a training strategy that improves multimodal models' ability to read text embedded in images. It trains on visualized questions, which structurally forces visual engagement, and improves performance by up to 5.4% on OOD benchmarks.

Contribution

The paper introduces SimpleOCR, a plug-and-play training method that mitigates modality laziness in MLLMs by structurally constraining training with visualized questions.

Findings

01. SimpleOCR improves model performance by up to 5.4% on out-of-distribution (OOD) benchmarks.

02. Without SimpleOCR, models lose up to 12.7% in performance in the Visualized-Question (VQ) setting.

03. SimpleOCR is data-efficient, outperforming RL-based methods while using fewer training samples.

Abstract

Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into…
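The Visualized-Question setting described in the abstract can be illustrated with a minimal sketch. It assumes the Pillow library and draws the question in a white band above the image. The function name `render_visualized_question`, the band layout, the default font, and the wrapping width are all hypothetical choices made for illustration; this page does not specify how the authors render the text.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_visualized_question(image, question, chars_per_line=60,
                               line_height=14, margin=8):
    """Return a new image with `question` drawn in a white band above `image`.

    Hypothetical helper: the layout is an assumption, not the paper's recipe.
    """
    font = ImageFont.load_default()
    # Wrap the question so long queries do not run off the canvas.
    lines = textwrap.wrap(question, width=chars_per_line) or [""]
    band_height = 2 * margin + line_height * len(lines)

    # Canvas is the original image plus a text band on top.
    canvas = Image.new("RGB", (image.width, image.height + band_height), "white")
    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line,
                  fill="black", font=font)

    # Paste the original image below the band.
    canvas.paste(image.convert("RGB"), (0, band_height))
    return canvas


# Usage: the model would then receive only `vq_image`, with the question
# removed from (or replaced by a generic instruction in) the text prompt.
source = Image.new("RGB", (400, 200), (0, 0, 255))
vq_image = render_visualized_question(source, "What is the total shown on the receipt?")
```

Because the question now exists only as pixels, a model can answer correctly only by reading it from the image, which is the structural constraint the abstract refers to.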

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques