SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao

TL;DR
SimpleOCR is a training strategy that enhances multimodal models' ability to read embedded text in images by structurally forcing visual engagement, leading to significant performance improvements across benchmarks.
Contribution
The paper introduces SimpleOCR, a plug-and-play training method that mitigates modality laziness in MLLMs by structurally constraining training with visualized questions.
Findings
SimpleOCR improves model performance by up to 5.4% on out-of-distribution (OOD) benchmarks.
Without SimpleOCR, models show a performance drop of up to 12.7% in the VQ setting.
SimpleOCR achieves high data efficiency, outperforming RL-based methods with fewer samples.
Abstract
Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into…
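To make the Visualized-Question idea concrete, here is a minimal sketch of how a text query might be rendered onto an image so that the question can only be read visually. It assumes the Pillow library. The function name `render_visualized_question`, the white banner placed above the image, the default bitmap font, and the character-count wrapping are all illustrative assumptions; the abstract does not specify the authors' rendering layout.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_visualized_question(image, question, chars_per_line=40,
                               line_height=14, margin=6):
    """Return a new image with `question` drawn in a white banner above `image`.

    Hypothetical helper: layout, font, and wrapping are assumptions made for
    illustration, not the rendering used in the paper.
    """
    # Wrap by character count; a rough heuristic, not a pixel-accurate fit.
    lines = textwrap.wrap(question, width=chars_per_line) or [""]
    banner_height = 2 * margin + line_height * len(lines)

    # Canvas = banner for the question text + the original image below it.
    canvas = Image.new("RGB", (image.width, image.height + banner_height), "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)

    # Paste the original image unchanged underneath the banner.
    canvas.paste(image.convert("RGB"), (0, banner_height))
    return canvas
```

In a VQ-style sample built this way, the model would receive the returned image and no separate copy of the question in the text prompt, so answering requires reading the rendered text.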
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques
