LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?

Maoyuan Ye; Haibin He; Qihuang Zhong; Jing Zhang; Juhua Liu; Bo Du

arXiv:2505.12307 · cs.CV · November 27, 2025

TL;DR

LogicOCR introduces a comprehensive benchmark to evaluate large multimodal models' logical reasoning on text-rich images, revealing current limitations and proposing a training-free method to improve their perception of textual cues.

Contribution

The paper presents LogicOCR, a new benchmark of 2780 questions over generated and real-world text-rich images, and proposes TextCue, a training-free method to enhance multimodal reasoning in LMMs.

Findings

1. LMMs lag behind text-only models in multimodal reasoning.
2. Test-time scaling and input modality significantly affect performance.
3. TextCue improves accuracy by approximately 1.8% in experiments.

Abstract

Recent advances in Large Multimodal Models (LMMs) have revolutionized their reasoning and Optical Character Recognition (OCR) capabilities. However, their complex logical reasoning performance on text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 2780 questions with two subsets, i.e., LogicOCR-Gen with 1100 multi-choice questions on generated images, and LogicOCR-Real with 1680 meticulously designed free-form questions on real-world images. For constructing LogicOCR-Gen, we first curate a text corpus from the Chinese National Civil Servant Examination, and customize an automatic pipeline to steer GPT-Image-1 to generate images with varied layouts and fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified. We evaluate a range of representative LMMs under Chain-of-Thought (CoT) and…
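
To make the multi-choice evaluation on LogicOCR-Gen concrete, the following is a minimal sketch of how accuracy could be computed from Chain-of-Thought responses. It is not the authors' evaluation code: the helper names `extract_choice` and `accuracy`, the answer pattern, and the assumption that options are labeled A to D are all hypothetical choices made for illustration.

```python
import re
from typing import List, Optional

# Hypothetical answer pattern (an assumption, not the benchmark's protocol):
# matches phrases such as "answer is B" or "Answer: (C)" with options A-D.
_ANSWER_RE = re.compile(r"[Aa]nswer\s*(?:is|:)\s*\(?([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in a response, or None if absent."""
    matches = _ANSWER_RE.findall(response)
    # Take the final match so that a revised answer at the end of a
    # Chain-of-Thought trace overrides an earlier tentative one.
    return matches[-1] if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    # A response with no extractable letter counts as incorrect.
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    demo_responses = [
        "Step 1 ... so the answer is B.",
        "Answer: (C)",
        "I am not sure.",
    ]
    demo_gold = ["B", "C", "A"]
    print(f"accuracy = {accuracy(demo_responses, demo_gold):.3f}")
```

Counting unparseable responses as wrong is one possible design choice; a real harness might instead fall back to a second extraction pass or report the parse-failure rate separately.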

Code & Models

Repositories

mililab/logicocr (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Topic Modeling