Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation

Yuting Li; Lai Wei; Kaipeng Zheng; Jingyuan Huang; Guilin Li; Bo Wang; Linghe Kong; Lichao Sun; Weiran Huang

arXiv:2506.09736·cs.CV·September 30, 2025

TL;DR

This paper finds that current multimodal large language models make poor use of visual inputs during reasoning: language-only models given image captions match or outperform them. It proposes a simple visual perturbation framework that improves perceptual robustness without algorithmic modifications or additional training data.

Contribution

It introduces a visual perturbation framework that improves multimodal reasoning performance and plugs into existing post-training pipelines (SFT, DPO, and GRPO) without algorithmic modifications or additional training data.

Findings

01. Visual perturbation yields consistent improvements in mathematical reasoning across datasets.

02. Each perturbation type contributes differently to distinct aspects of visual reasoning.

03. Open-source models trained with perturbation achieve competitive performance.

Abstract

Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through…
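To make the three perturbations named in the abstract more concrete, here is a minimal sketch of what each could look like on an image stored as a NumPy array. This is an illustration under stated assumptions, not the authors' implementation: the function names, the nearest-neighbour resizing, the side-by-side layout for concatenation, the `max(lam, 1 - lam)` rule used to keep the original image dominant, and the `max_degrees` default are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np


def resize_nearest(image, height, width):
    """Nearest-neighbour resize (hypothetical helper, not from the paper)."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]


def distractor_concat(image, distractor):
    """Place an unrelated distractor image to the right of the original.

    Assumption: the distractor is resized to the original's shape and
    concatenated horizontally; the paper may use a different layout.
    """
    h, w = image.shape[:2]
    return np.concatenate([image, resize_nearest(distractor, h, w)], axis=1)


def dominance_preserving_mixup(image, other, lam):
    """Blend two images so that the original always keeps weight >= 0.5.

    Assumption: dominance is enforced with max(lam, 1 - lam).
    """
    weight = max(lam, 1.0 - lam)
    other = resize_nearest(other, *image.shape[:2])
    blended = weight * image.astype(np.float32) + (1.0 - weight) * other.astype(np.float32)
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


def rotate(image, degrees):
    """Rotate about the image centre; uncovered pixels are left as zeros."""
    h, w = image.shape[:2]
    theta = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, find the source pixel to copy.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi = np.rint(src_x).astype(int)
    yi = np.rint(src_y).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(image)
    out[valid] = image[yi[valid], xi[valid]]
    return out


def random_rotation(image, rng, max_degrees=15.0):
    """Rotate by an angle drawn uniformly from [-max_degrees, max_degrees].

    Assumption: the angle range is illustrative, not taken from the paper.
    """
    return rotate(image, rng.uniform(-max_degrees, max_degrees))
```

In a post-training pipeline such as SFT, DPO, or GRPO, one of these functions would be applied to each training image before it reaches the model, with the question and answer left unchanged. Which perturbation is applied, and how often, is a design choice this sketch leaves open.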

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. The paper is generally clearly written, though with some logical jumps (see Q2 under Weaknesses).
2. I appreciate the authors' motivation in conducting the analysis experiments in Section 3. Such experiments can help us understand where current MLLMs fall short, e.g., whether they cannot perceive the visual inputs well enough, whether they lack reasoning capability, or whether their reasoning is not well grounded in the inputs (more under Weaknesses, Q1).

Weaknesses

1. The motivation behind the experiments in Figure 1 is interesting, but I feel the conclusion that "MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning" is not well supported by the experimental setup. Specifically, Answer C outperforming Answer B does not necessarily imply that MLLMs fail to integrate visual descriptions during reasoning. It might stem from the fact that the explicitly produced caption may help ground models

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The main strength of the paper lies in its motivating experiment, which exposes the key weakness the authors target. By showing strong evidence that current MLLMs don't reliably use visual information, they motivate the need for augmentation-based training. The proposed algorithm uses conventional augmentation methods as visual perturbations and improves the model's performance across four benchmarks. Furthermore, the authors conduct an extensive study of which perturbation method affects each

Weaknesses

There are a few questions regarding the experimental setting that I would like the authors to address.
a) How does fine-tuning the vision tower affect performance? Is the performance gain primarily due to a weakness in the vision tower? What happens if you freeze the vision tower or the language model during training?
b) Can perturbations be applied at evaluation time to improve model performance further? That is, one could apply different kinds of perturbations to the image

Reviewer 03 · Rating 6 · Confidence 5

Strengths

I find the following aspects of this work remarkable:
1. The authors have clearly demonstrated their motivation. The lack-of-robustness issue of existing reasoning MLLMs is well explained through clear examples such as Table 1.
2. The design of the experiments, along with all the verifications demonstrating the effectiveness of VP, is comprehensive. I appreciate the authors' effort to cover as many corner cases as possible.

Weaknesses

Still, I find several design flaws and loopholes in VP. Of the following two concerns, the first is a severe flaw that, if left unresolved, makes me question whether the contribution of VP is genuine.
1. **VP seems to introduce new problems by rendering the original image unsolvable after perturbation.** Several of the more drastic perturbation strategies in VP, such as Random Crop 45%, may simply make the original task unanswerable. For example, in Figure 3, after th

Code & Models

Repositories

yutingli0606/vision-matters
PyTorch · Official

Taxonomy

Methods: Direct Preference Optimization · Shrink and Fine-Tune