Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning
Yashwant Pravinrao Bangde, Debaditya Roy

TL;DR
This paper introduces IECD$^2$, a dual-stream decoding framework that improves grounded vision-language reasoning by balancing language expressiveness with visual evidence fidelity, reducing hallucinations.
Contribution
The paper proposes a novel contrastive dual-stream decoding method that adaptively fuses instruction-driven and evidence-driven token probabilities for better grounded reasoning.
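As a rough illustration of fusing two per-step token distributions, here is a minimal sketch in plain Python. It is not the paper's method: the actual fusion rule is cut off in the abstract, so the gate below (Jensen-Shannon divergence between the two streams, normalized to [0, 1]) and the convex mixture are stand-in assumptions. The names `fuse_streams`, `js_divergence`, and `softmax` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; ranges from 0 to ln(2)."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)
    m = [0.5 * (x + y) for x, y in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

def fuse_streams(instr_logits, evid_logits):
    """Mix an instruction-driven and an evidence-driven next-token distribution.

    ASSUMPTION: the gate is the normalized JS divergence between the streams,
    so stronger disagreement shifts weight toward the evidence stream.
    This is an illustrative stand-in, not the fusion rule from the paper.
    """
    p_instr = softmax(instr_logits)
    p_evid = softmax(evid_logits)
    gate = js_divergence(p_instr, p_evid) / math.log(2.0)
    gate = min(1.0, max(0.0, gate))  # guard against floating-point drift
    fused = [(1.0 - gate) * a + gate * b for a, b in zip(p_instr, p_evid)]
    total = sum(fused)
    return [f / total for f in fused], gate
```

Under this assumed gate, identical streams leave the instruction-driven distribution unchanged, while sharply disagreeing streams push the fused distribution toward the evidence-driven one.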
Findings
IECD$^2$ improves task accuracy across multiple datasets.
The method reduces hallucinations in generated outputs.
It shows consistent performance gains over state-of-the-art approaches.
Abstract
Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior work has shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD$^2$), maintains two parallel probability distributions over tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a…
