Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, Bing Qin

TL;DR
This paper investigates how multimodal hallucinations in large vision-language models can snowball through interactions, leading to false claims, and proposes a training-free mitigation method called Residual Visual Decoding.
Contribution
It introduces MMHalSnowball, a framework to evaluate hallucination snowballing in LVLMs, and proposes a novel mitigation method that reduces hallucination effects without retraining.
Findings
LVLM performance drops by at least 31% due to hallucinations
The proposed method mitigates over 24% of hallucinations
LVLMs are prone to accepting and propagating generated hallucinations
Abstract
Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs' subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs' behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31%, indicating that LVLMs are prone to accept the generated…
Taxonomy
Topics: Misinformation and Its Impacts · Data-Driven Disease Surveillance
