Mitigating Hallucination in Visual Language Models with Visual Supervision

Zhiyang Chen; Yousong Zhu; Yufei Zhan; Zhaowen Li; Chaoyang Zhao; Jinqiao Wang; Ming Tang

arXiv:2311.16479·cs.CV·November 29, 2023·2 cites

Open Access

TL;DR

This paper proposes methods to reduce hallucination in vision-language models by incorporating detailed visual annotations, auxiliary supervision, and a new benchmark for evaluation, leading to more accurate multi-modal responses.

Contribution

It introduces detailed relationship annotations, integrates SAM and mask prediction as supervision, and develops RAH-Bench for comprehensive hallucination evaluation.

Findings

01. 8.4% improvement over LLaVA in hallucination mitigation

02. Enhanced accuracy in detailed image understanding

03. Widespread performance gains across multiple models

Abstract

Large vision-language models (LVLMs) suffer heavily from hallucination, occasionally generating responses that clearly contradict the image content. The key problem lies in their weak ability to comprehend detailed content in a multi-modal context, which can be attributed mainly to two factors: the training data and the loss function. The vision instruction dataset primarily focuses on global description, and the auto-regressive loss function favors text modeling over image understanding. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs, so that they can generate more precise responses without hallucinating. On the one hand, we generate image-text pairs with detailed relationship annotations from the Panoptic Scene Graph (PSG) dataset. These conversations pay more attention to detailed facts in the…

Taxonomy

Topics: Misinformation and Its Impacts · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications

Methods: Segment Anything Model