TL;DR
This paper introduces HalluScope, a benchmark to analyze hallucinations in LVLMs, and proposes HalluVL-DPO, a fine-tuning framework that reduces hallucinations caused by textual priors, improving model grounding.
Contribution
The paper presents a new benchmark for understanding hallucinations in LVLMs and a fine-tuning method to mitigate hallucinations induced by textual instructions.
Findings
Hallucinations mainly result from reliance on textual priors and background knowledge.
HalluVL-DPO effectively reduces hallucinations related to textual instruction priors.
The optimized model maintains or improves performance on other benchmarks.
Abstract
Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually…
