Mitigating Hallucination in Visual Language Models with Visual Supervision
Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao, Jinqiao Wang, Ming Tang

TL;DR
This paper proposes methods to reduce hallucination in vision-language models by incorporating detailed visual annotations, auxiliary supervision, and a new benchmark for evaluation, leading to more accurate multi-modal responses.
Contribution
It introduces detailed relationship annotations, integrates SAM and mask prediction as supervision, and develops RAH-Bench for comprehensive hallucination evaluation.
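One way to picture mask prediction as auxiliary supervision is a weighted sum of the usual next-token loss and a per-pixel mask loss. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names, the choice of binary cross-entropy for the mask term, and the `mask_weight` parameter are all hypothetical.

```python
import math


def token_nll(target_token_probs):
    """Mean negative log-likelihood of the ground-truth tokens.

    `target_token_probs` holds the probability the model assigned to each
    correct next token (the standard auto-regressive text loss).
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def mask_bce(pred_mask, true_mask, eps=1e-7):
    """Mean per-pixel binary cross-entropy.

    `pred_mask` is a flat list of predicted foreground probabilities and
    `true_mask` a flat list of 0/1 labels (assumed form of the mask target).
    """
    total = 0.0
    for p, t in zip(pred_mask, true_mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(true_mask)


def joint_loss(target_token_probs, pred_mask, true_mask, mask_weight=1.0):
    """Text loss plus a weighted auxiliary mask loss (hypothetical weighting)."""
    return token_nll(target_token_probs) + mask_weight * mask_bce(pred_mask, true_mask)
```

Setting `mask_weight=0.0` recovers plain text-only training, which makes the auxiliary term easy to ablate in a sketch like this.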
Findings
8.4% improvement over LLaVA in hallucination mitigation
Enhanced accuracy in detailed image understanding
Widespread performance gains across multiple models
Abstract
Large vision-language models (LVLMs) suffer heavily from hallucination, occasionally generating responses that plainly contradict the image content. The key problem lies in their weak ability to comprehend detailed content in a multi-modal context, which can be mainly attributed to two factors: the training data and the loss function. The vision instruction dataset primarily focuses on global description, and the auto-regressive loss function favors text modeling rather than image understanding. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs, so that they can generate more precise responses without encountering hallucination. On one hand, we generate image-text pairs with detailed relationship annotations from the panoptic scene graph dataset (PSG). These conversations pay more attention to detailed facts in the…
Taxonomy
Topics: Misinformation and Its Impacts · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Segment Anything Model
