OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Mingxin Huang; Yongxin Shi; Dezhi Peng; Songxuan Lai; Zecheng Xie; Lianwen Jin

arXiv:2505.17163·cs.LG·May 26, 2025

TL;DR

This paper introduces OCR-Reasoning, a new benchmark for evaluating multimodal large language models on complex text-rich image reasoning tasks, revealing significant limitations of current models in this challenging domain.

Contribution

The paper presents OCR-Reasoning, a comprehensive benchmark with annotated reasoning processes, enabling systematic assessment of MLLMs' capabilities in text-rich visual reasoning.

Findings

1. Current MLLMs perform poorly on OCR-Reasoning, with accuracy below 50%.

2. The benchmark highlights significant gaps in models' reasoning abilities.

3. OCR-Reasoning provides a new standard for evaluating text-rich image reasoning.

Abstract

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 4

Strengths

1. OCR-Reasoning is the first benchmark to systematically assess reasoning processes in text-rich image scenarios, addressing a long-overlooked need.
2. The comprehensive evaluation includes multiple model categories and zero-shot settings, ensuring generalizable results.
3. Detailed error analysis and qualitative case studies deepen understanding of model limitations beyond accuracy metrics.

Weaknesses

1. While the handwritten data in OCR-Reasoning provides valuable transcribed college-level STEM problems, it would be beneficial to incorporate more everyday real-world handwritten scenarios to further enhance the benchmark's coverage of diverse text-rich reasoning tasks commonly encountered in practice.
2. The paper presents an interesting observation that CoT prompting may have backfired on VL-Rethinker-7B, potentially due to conflicting built-in reflection mechanisms. It would str

Reviewer 02·Rating 8·Confidence 4

Strengths

1. Filling text-rich image reasoning evaluation gaps: Existing text-rich image benchmarks focus on text extraction but lack systematic reasoning assessment. OCR-Reasoning addresses this, measuring MLLMs’ reasoning in practical scenarios.
2. Sample design forcing reasoning: Few answers in its samples are directly extractable from OCR results; models must actively reason, avoiding reliance on text extraction to truly reflect their reasoning levels.
3. Comprehensive annotations for in-depth evaluat

Weaknesses

1. Limited dataset scale: Most of the data collection and annotation relies on manual work, and the high associated cost keeps the dataset at a scale only comparable to previous benchmarks, preventing larger-scale expansion.

Reviewer 03·Rating 6·Confidence 4

Strengths

The paper demonstrates strong originality by introducing a new benchmark, OCR-Reasoning, that evaluates multimodal large language models on text-rich image reasoning—a domain previously underserved by existing datasets focused mainly on text extraction. Its dual annotation of final answers and reasoning steps represents a creative and meaningful extension of prior benchmarks. In terms of quality, the study employs a rigorous and transparent methodology, including systematic dataset curation, ex

Weaknesses

The dataset is relatively small (1,069 samples), limiting generalization and coverage of diverse real-world scenarios. The reliance on LLM-as-Judge introduces potential bias; incorporating human or cross-model validation would improve reliability. The paper lacks deeper diagnostic analysis explaining why models fail, and provides limited quantitative comparison with prior benchmarks. Finally, details on dataset release and reproducibility are insufficient, which may hinder adoption.
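The human or cross-model validation suggested above could, for instance, be quantified as chance-corrected agreement between two sets of correct/incorrect verdicts. The following is a minimal sketch, assuming each judge emits one boolean verdict per example; the function name `cohens_kappa` and the sample verdict lists are hypothetical illustrations, not taken from the paper or its evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement (Cohen's kappa) between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Fraction of examples on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (True = answer judged correct), for illustration only.
judge_verdicts = [True, True, False, False, True, False, True, False]
human_verdicts = [True, True, False, True, True, False, False, False]

kappa = cohens_kappa(judge_verdicts, human_verdicts)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa near 1 would indicate that the automated judge tracks the second rater closely, while a value near 0 would indicate agreement no better than chance. The same function applies unchanged when the second list comes from a different judge model rather than a human annotator.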
