Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye

TL;DR
This paper introduces VC-STaR, a self-improving framework for visual reasoning in VLMs that leverages visual contrast to reduce hallucinations, creating a new dataset and enhancing model performance.
Contribution
The paper proposes a novel contrastive self-taught reasoning approach for VLMs, introducing a new dataset and outperforming existing methods in visual reasoning tasks.
Findings
VC-STaR outperforms existing self-improving methods.
Models finetuned with VC-STaR surpass those finetuned on state-of-the-art visual reasoning datasets.
Visual contrast enhances reasoning accuracy in VLMs.
Abstract
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets,…
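To make the notion of a contrastive VQA pair concrete, the following is a minimal sketch of how such pairs might be selected from a pool of VQA samples. It is not the authors' pipeline: the `VQASample` type, the `find_contrastive_pairs` helper, the use of cosine similarity over precomputed embeddings, and both threshold values are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple


@dataclass
class VQASample:
    sample_id: str
    question_emb: List[float]  # hypothetical text embedding of the question
    image_emb: List[float]     # hypothetical visual embedding of the image


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_contrastive_pairs(
    samples: List[VQASample],
    text_thresh: float = 0.9,   # assumed value, not taken from the paper
    image_thresh: float = 0.8,  # assumed value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Return ID pairs whose questions are near-synonymous and whose
    images are visually similar, under the assumed thresholds."""
    pairs = []
    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            same_question = cosine(a.question_emb, b.question_emb) >= text_thresh
            similar_image = cosine(a.image_emb, b.image_emb) >= image_thresh
            if same_question and similar_image:
                pairs.append((a.sample_id, b.sample_id))
    return pairs


# Toy embeddings: "a" and "b" ask nearly the same question about similar
# images; "c" shares an image with "a" but asks an unrelated question.
samples = [
    VQASample("a", [1.0, 0.0], [1.0, 0.1]),
    VQASample("b", [0.99, 0.05], [0.9, 0.2]),
    VQASample("c", [0.0, 1.0], [1.0, 0.1]),
]
pairs = find_contrastive_pairs(samples)
print(pairs)
```

In this toy pool only the pair of "a" and "b" passes both checks. The exhaustive pairwise loop is quadratic in the pool size; a real curation pass over many datasets would more plausibly use a nearest-neighbor index, which is outside the scope of this sketch.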
Peer Reviews
Decision: ICLR 2026 Oral
1) This paper proposes a task-agnostic, three-stage pipeline to construct contrastive pairs, with principled text/vision similarity thresholds to ensure both semantic anchoring and visual proximity.
2) The paper takes an insightful contrastive perspective, showing that learning from visual contrasts effectively reduces hallucinations and strengthens reasoning accuracy in VLMs.
3) Analyses isolate the impact of curation strategies, sample difficulty, and pair types, producing actionable design gu

1) In Table 1, the method fails to achieve top results on MMStar. While it improves over its backbone, it still lags behind several baselines, and the paper does not explain why its gains do not generalize to this benchmark.
2) In line 258, the authors state that only median-difficulty contrastive VQA pairs are retained for rationale generation. However, the paper does not discuss why hard samples are excluded or how they might be addressed. Exploring strategies for handling these challenging ca

1. The ideas presented are interesting, and the proposed approach has potential to spark further research.
2. The article is well-structured and easy to read, providing a smooth reading experience.

1. Unclear Motivation for Using Input Comparison to Solve the Hallucination Problem: While the paper proposes using input comparison to mitigate hallucination issues, the motivation or theoretical basis for this approach is not clearly explained. Could the authors elaborate on why this method is effective from a cognitive perspective? For instance, are there relevant studies from cognitive models or psychological theories that can support the effectiveness of this approach?
2. The Problem Being

- The proposed framework works in different VQA tasks, including math reasoning, general tasks, and hallucinations.
- The SFT dataset construction process is intuitive.
- A multi-step prompting strategy for contrasting and rethinking is proposed to properly leverage the “reference information” offered by the contrasted image pairs.

- The definition of “self-improving” is not clear to me. The collected public datasets for selecting samples to construct the proposed dataset is quite large (21 datasets). Does evaluating a model fine-tuned in 21 datasets in 5 downstream benchmark datasets really count as self-improving? It’s more a strategy to carefully select external knowledge instead of “self-improving.”
- If I understand correctly, after the model is tuned on the curated dataset consisting of contrastive pairs, the infere
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Generative Adversarial Networks and Image Synthesis
