TL;DR
This paper introduces CrossMath, a benchmark that compares how vision-language models (VLMs) reason over text-only, image-only, and image+text versions of the same problem. It finds that VLMs reason mainly in the textual domain and benefit from fine-tuning on modality-aligned data.
Contribution
The study provides a controlled benchmark for modality-specific reasoning and shows that current VLMs rely heavily on textual reasoning and have limited visual reasoning capability.
Findings
VLMs perform better with text-only inputs than with combined image+text inputs.
Fine-tuning on CrossMath significantly improves reasoning performance across modalities.
Current VLMs show a substantial modality gap, favoring textual reasoning over visual reasoning.
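As a rough illustration of how a modality gap like the one described in these findings could be quantified, the sketch below computes per-modality accuracy and the difference relative to text-only input. The record layout, the function names (`accuracy_by_modality`, `modality_gap`), and the toy data are all assumptions made for this example; they are not taken from the paper or its code, and the numbers are not the paper's results.

```python
from collections import defaultdict

# Hypothetical record layout: (problem_id, modality, is_correct).
# The modality names mirror the three formats named in the abstract.
MODALITIES = ("text", "image", "image+text")


def accuracy_by_modality(records):
    """Return {modality: fraction correct} for the modalities seen in records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _problem_id, modality, is_correct in records:
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality!r}")
        total[modality] += 1
        correct[modality] += int(is_correct)
    return {m: correct[m] / total[m] for m in total}


def modality_gap(acc, reference="text"):
    """Accuracy of the reference modality minus that of each other modality."""
    return {m: acc[reference] - a for m, a in acc.items() if m != reference}


# Toy, made-up outcomes for illustration only; not results from the paper.
toy_records = [
    (1, "text", True), (1, "image", False), (1, "image+text", True),
    (2, "text", True), (2, "image", False), (2, "image+text", False),
]
toy_acc = accuracy_by_modality(toy_records)
toy_gap = modality_gap(toy_acc)
```

A positive value in `toy_gap` means the model scored higher with text-only input than with that modality. Because every problem appears in all three formats with the same task-relevant information, a per-problem paired comparison would also be possible; the aggregate version is shown here only for brevity.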
Abstract
Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a…
