Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models
Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie

TL;DR
This paper introduces DeGF, a training-free decoding algorithm that uses text-to-image generative feedback to reduce hallucinations in large vision-language models, improving their reliability in multi-modal tasks.
Contribution
The paper proposes a novel self-correcting decoding method leveraging generative feedback from text-to-image models to mitigate hallucinations in LVLMs without additional training.
Findings
DeGF effectively reduces hallucinations across multiple benchmarks.
The approach surpasses state-of-the-art methods in hallucination mitigation.
It demonstrates robustness in diverse multi-modal tasks.
Abstract
While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding…
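The abstract mentions token-level feedback, and one of the reviews below notes that the token-level refinement is based on Jensen-Shannon divergence. The following is a minimal sketch of how such a divergence could gate a decoding step, not the authors' implementation. It assumes two next-token logit vectors, one conditioned on the original image and one on the image generated from the initial response. The function `refine_logits`, the `threshold` and `alpha` parameters, and both combination rules are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def refine_logits(logits_orig, logits_gen, threshold=0.1, alpha=1.0):
    """Hypothetical gating rule (an assumption, not taken from the paper).

    Low divergence: the two views agree, so the logits are added.
    High divergence: the two views disagree, so the generated-image
    logits are subtracted from the amplified original-image logits.
    Returns the refined distribution and the measured divergence.
    """
    p = softmax(logits_orig)
    q = softmax(logits_gen)
    d = js_divergence(p, q)
    if d < threshold:
        combined = [a + alpha * b for a, b in zip(logits_orig, logits_gen)]
    else:
        combined = [(1 + alpha) * a - alpha * b
                    for a, b in zip(logits_orig, logits_gen)]
    return softmax(combined), d


if __name__ == "__main__":
    # Toy logits over a three-token vocabulary, made up for illustration.
    probs, d = refine_logits([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
    print(f"JS divergence: {d:.4f}")
    print("refined distribution:", [round(x, 4) for x in probs])
```

The divergence is symmetric and bounded, which makes a single fixed threshold easier to reason about than one on a raw KL divergence. Note that both logit vectors require a forward pass per step, which is the overhead the reviewers raise.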
Peer Reviews
Decision: ICLR 2025 Poster
1. The idea of leveraging text-to-image generative models for LVLM hallucination mitigation is novel and interesting. 2. The paper is well-written and easy to follow. 3. Extensive experiments on multiple benchmarks demonstrate the effectiveness of the proposed method.
1. The effectiveness of the proposed method is heavily influenced by the quality and realism of the generated images, so the authors should perform more experiments and analysis on the quality of the generated images (both quantitative and qualitative, especially for long captions and real images). 2. Since the model relies on an additional diffusion model for inference, the added inference overhead should be discussed. 3. Experiments are limited to two LVLMs; experiments on more recent LVLMs (e.
1. Providing generative feedback to mitigate hallucinations is straightforward and reasonable. Token-level refinement based on Jensen-Shannon divergence makes correct use of the generative feedback. 2. The proposed Self-Correcting Decoding with Generative Feedback (DeGF) also achieves good results on POPE, CHAIR, MME, etc. 3. The paper is well written and has clear figures. The experiments are fairly extensive and clearly organized.
1. The computation costs are unaffordable for an LLM decoding strategy. Using a generative model like Stable Diffusion to provide generative feedback is unrealistic for practical deployment. Moreover, self-correcting decoding also doubles the inference cost, similar to contrastive decoding. 2. This approach uses an extra pretrained network (i.e., Stable Diffusion). Baselines should include methods that also employ an extra analysis network, such as Woodpecker [r1, r2], etc. Otherwise, it is unfai
Clear writing. The presentation and writing are clear and easy to follow. Well-motivated. A clear negative correlation between hallucination rates and CLIP similarities can be observed (Figure 3), which gives a strong empirical foundation for the proposed decoding approach. Sufficient experiments. The authors provide sufficient ablations and discussion to support their claims. The performance of the proposed DeGF is quite good.
A concern arises from numerical hallucinations. Figure 2 presents an overview of the proposed approach, DeGF, using numeric hallucinations as an example. A key premise of this method is that diffusion models can accurately perceive numbers. However, it is a common observation that diffusion models fail to accurately interpret numbers [A]. Have the authors considered the case when diffusion models fail to generate numerically accurate images? It would be good to include analysis on tolerance o
Taxonomy
Topics: Fractal and DNA sequence analysis · Topological and Geometric Data Analysis · EEG and Brain-Computer Interfaces
Methods: ALIGN
