Context-Aware Decoding for Faithful Vision-Language Generation
Mehrdad Fazli, Bowen Wei, Ziwei Zhu

TL;DR
This paper investigates the layer-wise generation process in vision-language models to understand hallucinations and introduces a training-free method, CEI, that reduces hallucinations by using context embeddings as grounding signals during decoding.
Contribution
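The paper uncovers a commitment-depth gap in LVLMs, in which truthful tokens settle on their final prediction at earlier decoder layers than hallucinatory ones. It also proposes CEI, a lightweight, training-free technique that significantly reduces hallucinations in vision-language generation.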
Findings
CEI outperforms state-of-the-art baselines on multiple benchmarks.
Dynamic CEI achieves the lowest hallucination rates among the evaluated methods.
Layer-wise analysis reveals that truthful tokens commit to their final prediction at earlier decoder layers than hallucinatory tokens.
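The layer-wise finding rests on the Logit Lens. Each decoder layer's hidden state is passed through the model's final normalization and unembedding matrix, which yields an intermediate next-token distribution for that layer. The toy NumPy sketch below illustrates this and is not the authors' code. It makes several assumptions. The final norm is modeled as a plain RMSNorm with no learned scale. The unembedding matrix `W_U` is given. `commitment_depth` uses a hypothetical definition: the first layer from which the final top-1 token stays top-1 with at least `threshold` probability. The paper's exact definition may differ.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def rms_norm(h, eps=1e-6):
    # Assumed stand-in for the model's final norm: RMSNorm with no learned scale.
    return h / np.sqrt((h ** 2).mean(axis=-1, keepdims=True) + eps)


def logit_lens(hidden_states, W_U):
    """Decode every layer's hidden state with the final norm + unembedding.

    hidden_states: (num_layers, d_model), one position's state at each layer.
    W_U: (d_model, vocab_size) unembedding matrix.
    Returns (num_layers, vocab_size) next-token distributions.
    """
    return softmax(rms_norm(hidden_states) @ W_U)


def commitment_depth(layer_probs, threshold=0.5):
    """Hypothetical definition: index of the first layer from which the final
    top-1 token stays top-1 with at least `threshold` probability mass.
    Returns num_layers if even the last layer falls below the threshold."""
    final_token = int(layer_probs[-1].argmax())
    committed = (layer_probs.argmax(axis=-1) == final_token) & (
        layer_probs[:, final_token] >= threshold
    )
    depth = len(committed)
    # Walk backwards to find where the trailing run of committed layers starts.
    for layer in range(len(committed) - 1, -1, -1):
        if not committed[layer]:
            break
        depth = layer
    return depth


# Toy example: 4 layers, d_model = vocab_size = 3, identity-like unembedding.
W_U = 5.0 * np.eye(3)
early = np.array([[1, 0, 0], [0, 1, 0], [0, 0.2, 1], [0, 0, 1]], dtype=float)
late = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0.2], [0, 0, 1]], dtype=float)
print(commitment_depth(logit_lens(early, W_U)), commitment_depth(logit_lens(late, W_U)))  # 2 3
```

Both toy trajectories end on the same token. The `early` trajectory locks in at layer 2, while the `late` one only does so at the final layer. This mimics, on made-up numbers, the kind of gap the analysis measures between truthful and hallucinatory tokens.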
Abstract
Hallucinations, that is, responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the…
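The abstract does not spell out how the context embedding is injected, so the following NumPy sketch is purely illustrative and is not the paper's formulation. It assumes a single injection layer and treats the context embedding as the last prompt token's hidden state at that layer. It also uses a hypothetical convex blend with weight `alpha`. The names `context_embedding`, `inject_context`, and `alpha` are ours. The sketch only shows the intended effect: a decoding-time state that drifts away from the prompt context is pulled back toward it.

```python
import numpy as np


def context_embedding(prompt_hidden_states):
    """Context embedding: the hidden state of the last input token.

    prompt_hidden_states: (prompt_len, d_model) states of the image + text
    prompt at the (assumed) injection layer.
    """
    return prompt_hidden_states[-1]


def inject_context(step_hidden, context, alpha=0.1):
    """Hypothetical injection rule: blend each generated token's hidden state
    at the injection layer with the fixed context embedding.

    step_hidden: (num_steps, d_model) or (d_model,) decoding-time states.
    alpha: assumed mixing weight in [0, 1]; 0 leaves decoding unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Broadcasting applies the same context vector to every decoding step.
    return (1.0 - alpha) * step_hidden + alpha * context


def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


prompt_states = np.array([[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]])  # last row = context
c = context_embedding(prompt_states)
drifted = np.array([1.0, 0.2, 0.0])  # a later decoding state drifting away from c
steered = inject_context(drifted, c, alpha=0.3)
print(cosine(drifted, c) < cosine(steered, c))  # True
```

In a real LVLM, the blend would be applied inside the forward pass at every generation step, for example via a forward hook on the chosen decoder layer. The choice of layer, weight, and blending rule are all open design decisions under this sketch's assumptions.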
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
