Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
Qihuang Zhong, Liang Ding, Wenjie Xuan, Juhua Liu, Bo Du, Dacheng Tao

TL;DR
VISTA is a vision-aware self-improvement training framework that enhances the reasoning of Multimodal Large Language Models (MLLMs) by addressing two shortcomings of self-improvement training, data imbalance and language prior bias, with average performance gains of up to +13.66%.
Contribution
The paper introduces VISTA, a novel training method that leverages visual cues and a prefix resampling strategy to improve reasoning in multimodal models.
Findings
VISTA improves reasoning performance across various MLLMs and tasks.
VISTA achieves average performance gains of up to +13.66%.
VISTA effectively addresses data imbalance and language prior bias.
Abstract
Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial…
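The data-imbalance shortcoming named in the abstract can be illustrated with a small sketch. It is not the authors' pipeline: it assumes a generic self-improvement loop in which the model samples several reasoning traces per question and only traces whose final answer is correct are kept for training. The filter rule, the pass rates, and the function names (`expected_kept_traces`, `training_share`) are hypothetical and chosen only for illustration.

```python
# Toy illustration (assumed setup, not the VISTA implementation):
# each question has a hypothetical pass rate, i.e. the fraction of
# sampled reasoning traces that end in a correct answer.

def expected_kept_traces(pass_rate, num_samples):
    """Expected number of traces kept per question after correctness filtering."""
    return {qid: num_samples * p for qid, p in pass_rate.items()}

def training_share(kept):
    """Fraction of the resulting training set contributed by each question."""
    total = sum(kept.values())
    return {qid: n / total for qid, n in kept.items()}

# Hypothetical pass rates for an easy, a medium, and a hard question.
pass_rate = {"easy": 0.875, "medium": 0.5, "hard": 0.125}

kept = expected_kept_traces(pass_rate, num_samples=16)
share = training_share(kept)

for qid in pass_rate:
    print(f"{qid}: kept={kept[qid]:.1f}, share={share[qid]:.2%}")
```

Under these assumed numbers the easy question supplies seven times as many training traces as the hard one, which is the kind of skew the abstract describes as simple samples being over-trained and challenging samples under-trained.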
