TL;DR
This paper introduces OCR-Reasoning, a new benchmark for evaluating multimodal large language models on complex text-rich image reasoning tasks, revealing significant limitations of current models in this challenging domain.
Contribution
The paper presents OCR-Reasoning, a comprehensive benchmark with annotated reasoning processes, enabling systematic assessment of MLLMs' capabilities in text-rich visual reasoning.
Findings
Current MLLMs perform poorly on OCR-Reasoning: none of the evaluated models exceeds 50% accuracy.
The benchmark highlights significant gaps in models' reasoning abilities.
OCR-Reasoning provides a new standard for evaluating text-rich image reasoning.
Abstract
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but…
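As a rough illustration of how per-ability accuracy on a benchmark of this shape might be tallied, here is a minimal Python sketch. It is not the authors' evaluation code: the record fields, the ability and task labels, and the exact-match comparison are all assumptions made for illustration, and the paper's own scoring of answers and reasoning may work differently (the reviews mention an LLM-as-Judge setup).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical schema; the released dataset may use different field names.
    ability: str              # one of the 6 core reasoning abilities
    task: str                 # one of the 18 practical reasoning tasks
    reference_answer: str     # human-annotated final answer
    reference_reasoning: str  # human-annotated reasoning process (unused here)
    predicted_answer: str     # final answer produced by the model


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def accuracy_by_ability(examples):
    """Return {ability: fraction of examples whose final answer matches the reference}.

    Exact match after normalization is an assumed stand-in for the benchmark's real scoring.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex.ability] += 1
        if normalize(ex.predicted_answer) == normalize(ex.reference_answer):
            correct[ex.ability] += 1
    return {ability: correct[ability] / total[ability] for ability in total}
```

Grouping by `ability` rather than reporting one overall number mirrors the benchmark's breakdown into core reasoning abilities; the same function could be keyed on `task` for the finer-grained view.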
Peer Reviews
Decision: ICLR 2026 Poster
1. OCR-Reasoning is the first benchmark to systematically assess reasoning processes in text-rich image scenarios, addressing a long-overlooked need. 2. The comprehensive evaluation includes multiple model categories and zero-shot settings, ensuring generalizable results. 3. Detailed error analysis and qualitative case studies deepen understanding of model limitations beyond accuracy metrics.
1. While the handwritten data in OCR-Reasoning provides valuable transcribed college-level STEM problems, it would be beneficial to consider incorporating more everyday real-world handwritten scenarios to further enhance the benchmark's coverage of diverse text-rich reasoning tasks commonly encountered in practice. 2. The paper presents an interesting observation that CoT prompting may have backfired on VL-Rethinker-7B, potentially due to conflicting built-in reflection mechanisms. It would str
1. Filling text-rich image reasoning evaluation gaps: Existing text-rich image benchmarks focus on text extraction but lack systematic reasoning assessment. OCR-Reasoning addresses this, measuring MLLMs’ reasoning in practical scenarios. 2. Sample design forcing reasoning: Few answers in its samples are directly extractable from OCR results; models must actively reason, avoiding reliance on text extraction to truly reflect their reasoning levels. 3. Comprehensive annotations for in-depth evaluat
1. Limited dataset scale: Most of the data collection and annotation relies on manual work, and the high associated cost keeps the dataset at a scale only comparable to previous methods, with no larger-scale expansion.
The paper demonstrates strong originality by introducing a new benchmark, OCR-Reasoning, that evaluates multimodal large language models on text-rich image reasoning—a domain previously underserved by existing datasets focused mainly on text extraction. Its dual annotation of final answers and reasoning steps represents a creative and meaningful extension of prior benchmarks. In terms of quality, the study employs a rigorous and transparent methodology, including systematic dataset curation, ex
The dataset is relatively small (1,069 samples), limiting generalization and coverage of diverse real-world scenarios. The reliance on LLM-as-Judge introduces potential bias; incorporating human or cross-model validation would improve reliability. The paper lacks deeper diagnostic analysis explaining why models fail, and provides limited quantitative comparison with prior benchmarks. Finally, details on dataset release and reproducibility are insufficient, which may hinder adoption.
