Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

Qihuang Zhong; Liang Ding; Wenjie Xuan; Juhua Liu; Bo Du; Dacheng Tao

arXiv:2605.11931·cs.CV·May 13, 2026

TL;DR

VISTA is a vision-aware self-improvement training framework that enhances the multimodal reasoning of Multimodal Large Language Models (MLLMs) by addressing data imbalance and language prior bias, with reported average performance gains of up to +13.66%.

Contribution

The paper introduces VISTA, a novel training method that leverages visual cues and a prefix resampling strategy to improve reasoning in multimodal models.

Findings

01

VISTA improves reasoning performance across various MLLMs and tasks.

02

VISTA achieves average performance gains of up to +13.66%.

03

VISTA effectively addresses data imbalance and language prior bias.

Abstract

Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial…
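The data imbalance described in the abstract can be illustrated with a toy sketch. It assumes a common self-improvement setup in which the model samples k reasoning traces per question and only traces with a correct final answer are kept for training; this page does not confirm that VISTA's pipeline works exactly this way. All names here (`toy_sampler`, `collect_self_generated_traces`, `pass_rate`) are hypothetical, and the sampler is a deterministic stand-in for an MLLM.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical toy question: an id, a gold answer, and a fixed "pass rate"
# standing in for how often the model answers it correctly (assumption).
Question = Dict[str, object]
Sampler = Callable[[Question, int, int], Tuple[str, str]]


def toy_sampler(question: Question, attempt: int, k: int) -> Tuple[str, str]:
    """Stand-in for sampling one reasoning trace from an MLLM (assumption).

    Deterministic: the first round(pass_rate * k) attempts are correct.
    """
    n_correct = round(float(question["pass_rate"]) * k)
    answer = question["gold"] if attempt < n_correct else "wrong"
    return f"trace-{question['id']}-{attempt}", str(answer)


def collect_self_generated_traces(
    questions: List[Question], sampler: Sampler, k: int
) -> List[Tuple[str, str]]:
    """Sample k traces per question and keep those with a correct answer."""
    kept = []
    for q in questions:
        for attempt in range(k):
            trace, answer = sampler(q, attempt, k)
            if answer == q["gold"]:
                kept.append((str(q["id"]), trace))
    return kept


questions = [
    {"id": "easy", "gold": "A", "pass_rate": 0.875},
    {"id": "medium", "gold": "B", "pass_rate": 0.5},
    {"id": "hard", "gold": "C", "pass_rate": 0.0},
]
kept = collect_self_generated_traces(questions, toy_sampler, k=8)
per_question = Counter(qid for qid, _ in kept)
print(dict(per_question))  # → {'easy': 7, 'medium': 4}
```

In this toy run the easy question contributes seven training traces and the hard question contributes none, which is the kind of skew the abstract calls data imbalance: simple samples dominate the self-generated training set while challenging ones are barely represented.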
