Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation
Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Guilin Li, Bo Wang, Linghe Kong, Lichao Sun, Weiran Huang

TL;DR
This paper finds that current multimodal large language models make little effective use of raw visual input: language-only models given image captions match or beat them. It proposes a simple visual perturbation framework that improves reasoning robustness without algorithmic modifications or additional training data.
Contribution
It introduces a visual perturbation framework that improves multimodal reasoning performance and integrates into existing post-training pipelines (SFT, DPO, GRPO) without algorithmic modifications or additional training data.
Findings
Consistent improvements in mathematical reasoning across datasets.
Each perturbation type contributes to a different aspect of visual reasoning.
Competitive performance achieved with open-source models using perturbation.
Abstract
Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations (distractor concatenation, dominance-preserving mixup, and random rotation) that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through…
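The three perturbations named in the abstract can be illustrated with a toy sketch. This is only a reading of the perturbation names, not the authors' implementation: the function names, the grayscale list-of-rows image format, the same-size requirement, and the sampling ranges for the mixup weight and rotation angle are all assumptions made for illustration. A real pipeline would use an image library rather than nested lists.

```python
import math
import random

# Toy grayscale image: rows of pixel intensities in [0, 1] (assumed format).
Image = list[list[float]]


def concat_distractor(image: Image, distractor: Image) -> Image:
    """Place an unrelated image to the right of the original (same height assumed)."""
    if len(image) != len(distractor):
        raise ValueError("images must have the same height")
    return [row + d_row for row, d_row in zip(image, distractor)]


def dominant_mixup(image: Image, distractor: Image, lam: float) -> Image:
    """Blend two same-sized images; lam > 0.5 keeps the original dominant."""
    if not 0.5 < lam <= 1.0:
        raise ValueError("lam must be in (0.5, 1] so the original dominates")
    return [
        [lam * p + (1.0 - lam) * q for p, q in zip(row, d_row)]
        for row, d_row in zip(image, distractor)
    ]


def rotate(image: Image, degrees: float, fill: float = 0.0) -> Image:
    """Rotate about the image centre using nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse-map each output pixel back into the source image.
            dx, dy = x - cx, y - cy
            sx = round(cos_t * dx + sin_t * dy + cx)
            sy = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out


def perturb(image: Image, distractor: Image, rng: random.Random) -> Image:
    """Apply one randomly chosen perturbation (ranges below are assumed)."""
    choice = rng.choice(["concat", "mixup", "rotate"])
    if choice == "concat":
        return concat_distractor(image, distractor)
    if choice == "mixup":
        return dominant_mixup(image, distractor, rng.uniform(0.6, 0.9))
    return rotate(image, rng.uniform(-15.0, 15.0))
```

In a post-training setup such as SFT, DPO, or GRPO, a function like the hypothetical `perturb` would be applied to each training image before it reaches the model, leaving the question, the answer, and the training objective unchanged.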
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper is generally clearly written but with some logical jumps (please see Q2 in weakness). 2. I appreciate the authors' motivation in conducting analysis experiments in Section 3. Such experiments can help us understand where current MLLMs fall short, e.g., whether they cannot perceive the visual inputs well enough, whether they lack reasoning capability, or whether their reasoning is not well grounded in the inputs (more in weakness Q1).
1. The motivation for the experiments conducted in Figure 1 is interesting, but I feel the conclusion that "MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning" is not very well supported by the experiment setup. Specifically, the fact that Answer C performs better than Answer B does not necessarily imply that MLLMs fail to integrate visual descriptions during reasoning. It might be that the explicitly produced caption helps ground models
The main strength of the paper lies in its motivating experiment, which shows the key weaknesses it targets. By showing strong evidence that current MLLMs don't reliably use visual information, the authors motivate the necessity of augmentation-based training. The proposed algorithm uses conventional augmentation methods as visual perturbations and improves the model's performance across 4 benchmarks. Furthermore, the authors conduct an extensive study on which perturbation method affects each
There are a few questions regarding the experiment setting that I would like the authors to address. a) How does fine-tuning the vision tower affect performance? Is the performance gain primarily due to weaknesses in the vision tower? What happens if you freeze the vision tower or the language model during training? b) Can perturbations be applied at evaluation time to improve model performance further? That is, one could apply different kinds of perturbations to the image
I find the following aspects of this work remarkable: 1. The authors have clearly demonstrated their motivation. The lack-of-robustness issue of existing reasoning MLLMs is well explained through clear examples such as Table 1. 2. The design of the experiments, along with all the verifications to demonstrate the effectiveness of VP, is comprehensive. I appreciate the authors’ effort to cover as many corner cases as possible.
Still, I find several design flaws/loopholes with regard to VP. Of the following two concerns, the first is a major flaw that, if left unresolved, makes me question whether the contribution of VP is genuine. 1. **VP seems to introduce new problems by rendering the original image unsolvable after perturbation.** Several of the more drastic perturbation strategies from VP, such as Random Crop 45%, may simply make the original task unanswerable. For example, in Figure 3, after th
Taxonomy
Methods: Direct Preference Optimization · Supervised Fine-Tuning
