Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Yige Xu; Yongjie Wang; Zizhuo Wu; Kaisong Song; Jun Lin; Zhiqi Shen

arXiv:2604.16256·cs.CV·April 20, 2026

TL;DR

This paper introduces CrossMath, a benchmark for comparing how vision-language models reason over text-only, image-only, and image+text versions of the same problem. The evaluation indicates that these models reason mainly in the textual domain and that fine-tuning on modality-aligned data improves their performance.

Contribution

The study provides a controlled benchmark for modality-specific reasoning and demonstrates that current VLMs rely heavily on textual reasoning, with limited visual reasoning capabilities.

Findings

01. VLMs perform better with text-only inputs than with combined image+text inputs.

02. Fine-tuning on CrossMath significantly improves reasoning performance across modalities.

03. Current VLMs show a substantial modality gap, favoring textual reasoning over visual reasoning.
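To make the "modality gap" concrete, the sketch below shows one plausible way to compute it: per-format accuracy over aligned problems, then the difference between text-only accuracy and each other format. This is an illustration only, not the paper's evaluation code. The function names, the record layout, and the toy records are all assumptions made for the example, and the numbers are made up rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return accuracy per input format.

    `records` is an iterable of (problem_id, fmt, is_correct) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _problem_id, fmt, is_correct in records:
        total[fmt] += 1
        correct[fmt] += int(bool(is_correct))
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


def modality_gap(acc, reference="text"):
    """Gap = reference accuracy minus accuracy of each other format.

    A positive value means the model does better on the reference format.
    """
    return {fmt: acc[reference] - a for fmt, a in acc.items() if fmt != reference}


# Toy, made-up outcomes: each problem appears once in every format,
# mirroring the aligned text-only / image-only / image+text design.
toy_records = [
    ("p1", "text", True), ("p1", "image", False), ("p1", "image+text", True),
    ("p2", "text", True), ("p2", "image", True), ("p2", "image+text", False),
    ("p3", "text", True), ("p3", "image", False), ("p3", "image+text", True),
    ("p4", "text", False), ("p4", "image", False), ("p4", "image+text", False),
]

acc = accuracy_by_format(toy_records)
gaps = modality_gap(acc)
print(acc)
print(gaps)
```

Because every problem carries the same task-relevant information in each format, a gap computed this way can be attributed to the input modality rather than to differences in problem content.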

Abstract

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a…

Code & Models

Repositories

xuyige/CrossMath (GitHub)
