Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Zhiyu Pan; Yizheng Wu; Jiashen Hua; Junyi Feng; Shaotian Yan; Bing Deng; Zhiguo Cao; Jieping Ye

arXiv:2603.02556·cs.CV·March 4, 2026

TL;DR

This paper introduces VC-STaR, a self-improving framework for visual reasoning in VLMs that leverages visual contrast to reduce hallucinations, creating a new dataset and enhancing model performance.

Contribution

The paper proposes a novel contrastive self-taught reasoning approach for VLMs, introducing a new dataset and outperforming existing methods in visual reasoning tasks.

Findings

01

VC-STaR outperforms existing self-improving methods.

02

The approach surpasses state-of-the-art visual reasoning datasets.

03

Visual contrast enhances reasoning accuracy in VLMs.

Abstract

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets,…
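The notion of a contrastive VQA pair described above (two visually similar images with synonymous questions) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `find_contrastive_pairs`, the record fields `q_emb` and `img_emb`, and both threshold values are hypothetical, and the embeddings are assumed to come from some text and image encoder that is not specified here.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_contrastive_pairs(samples, text_thresh=0.9, image_thresh=0.8):
    """Return id pairs whose questions AND images are both similar.

    `samples` is a list of dicts with hypothetical keys:
      'id'      - sample identifier
      'q_emb'   - question embedding (assumed precomputed)
      'img_emb' - image embedding (assumed precomputed)
    The thresholds are illustrative values, not taken from the paper.
    """
    pairs = []
    for a, b in combinations(samples, 2):
        # Questions must be near-synonymous.
        if cosine(a["q_emb"], b["q_emb"]) < text_thresh:
            continue
        # Images must be visually close.
        if cosine(a["img_emb"], b["img_emb"]) < image_thresh:
            continue
        pairs.append((a["id"], b["id"]))
    return pairs
```

With toy embeddings, a pair is kept only when both similarity checks pass; a sample whose question points in a different direction is filtered out even if its image is identical to another sample's.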

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1) This paper proposes a task-agnostic, three-stage pipeline to construct contrastive pairs, with principled text/vision similarity thresholds to ensure both semantic anchoring and visual proximity.
2) The paper takes an insightful contrastive perspective, showing that learning from visual contrasts effectively reduces hallucinations and strengthens reasoning accuracy in VLMs.
3) Analyses isolate the impact of curation strategies, sample difficulty, and pair types, producing actionable design gu

Weaknesses

1) In Table 1, the method fails to achieve top results on MMStar. While it improves over its backbone, it still lags behind several baselines, and the paper does not explain why its gains do not generalize to this benchmark.
2) In line 258, the authors state that only median-difficulty contrastive VQA pairs are retained for rationale generation. However, the paper does not discuss why hard samples are excluded or how they might be addressed. Exploring strategies for handling these challenging ca

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The ideas presented are interesting, and the proposed approach has potential to spark further research.
2. The article is well-structured and easy to read, providing a smooth reading experience.

Weaknesses

1. Unclear Motivation for Using Input Comparison to Solve the Hallucination Problem: While the paper proposes using input comparison to mitigate hallucination issues, the motivation or theoretical basis for this approach is not clearly explained. Could the authors elaborate on why this method is effective from a cognitive perspective? For instance, are there relevant studies from cognitive models or psychological theories that can support the effectiveness of this approach?
2. The Problem Being

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The proposed framework works in different VQA tasks, including math reasoning, general tasks, and hallucinations.
- The SFT dataset construction process is intuitive.
- A multi-step prompting strategy for contrasting and rethinking is proposed to properly leverage the “reference information” offered by the contrasted image pairs.

Weaknesses

- The definition of “self-improving” is not clear to me. The collection of public datasets used to select samples for the proposed dataset is quite large (21 datasets). Does evaluating a model fine-tuned on 21 datasets on 5 downstream benchmark datasets really count as self-improving? It’s more a strategy for carefully selecting external knowledge than “self-improving.”
- If I understand correctly, after the model is tuned on the curated dataset consisting of contrastive pairs, the infere

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Generative Adversarial Networks and Image Synthesis