See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs

Ziyun Dai; Xiaoqiang Li; Shaohua Zhang; Yuanchen Wu; Jide Li

arXiv:2507.22003·cs.CV·July 31, 2025

TL;DR

This paper introduces ViHallu, a visual-centric framework that uses visual variations and instructions to improve fine-grained visual understanding in LVLMs, significantly reducing hallucinations and enhancing visual-semantic alignment.

Contribution

The paper proposes a visual variation generation method and a visual instruction construction procedure that mitigate hallucinations in LVLMs by improving visual-semantic alignment.

Findings

01. ViHallu reduces hallucination tendencies in LVLMs.

02. Fine-grained visual understanding improves on multiple benchmarks.

03. The ViHallu-Instruction dataset is released for hallucination mitigation.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently hallucinate, generating textual responses that are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, and the challenges of visual-semantic alignment significantly limit their effectiveness, especially in fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images,…
