VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

Jiaqi Wang; Yifei Gao; Jitao Sang

arXiv:2411.15839 · cs.CV · March 18, 2025

TL;DR

VaLiD introduces a visual layer fusion contrastive decoding approach that mitigates hallucinations in large vision-language models by correcting visual encoding distortions, significantly improving the accuracy of generated content.

Contribution

The paper presents a novel visual encoding perspective and a contrastive decoding method to effectively reduce hallucinations in LVLMs, outperforming existing inference-time mitigation techniques.

Findings

1. VaLiD reduces hallucinations across multiple benchmarks.
2. It achieves state-of-the-art performance compared to baseline methods.
3. Visual layer fusion improves the reliability of model outputs.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage, typically attributing hallucinations to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model's reasoning capabilities. Specifically, earlier visual layers may retain key features but gradually distort as the information propagates toward the output layer. Building on these insights, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer…
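
The abstract is cut off before it describes the decoding rule, so the following is only a minimal sketch of generic inference-time contrastive decoding, not the paper's actual formulation. It assumes two next-token logit vectors are already available: `primary` from one forward pass and `reference` from a second pass that uses a different visual representation. Which pass plays which role in VaLiD, how visual layers are fused, and the values of `alpha` and `beta` are all assumptions made here for illustration; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(primary, reference, alpha=1.0, beta=0.1):
    """Combine two next-token logit vectors by contrast.

    Hypothetical helper, not taken from the paper:
      - alpha scales how strongly `reference` is subtracted.
      - beta is a plausibility cutoff: tokens whose probability under
        `primary` falls below beta * max probability are masked out, so
        the subtraction cannot promote tokens that were implausible.
    """
    if len(primary) != len(reference):
        raise ValueError("logit vectors must have the same length")
    probs = softmax(primary)
    cutoff = beta * max(probs)
    combined = []
    for p_logit, r_logit, p in zip(primary, reference, probs):
        if p < cutoff:
            combined.append(float("-inf"))  # masked: never selected
        else:
            combined.append((1 + alpha) * p_logit - alpha * r_logit)
    return combined


def greedy_pick(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy three-token vocabulary with made-up numbers.
primary = [2.0, 1.8, -3.0]
reference = [2.5, 0.5, -3.0]
print(greedy_pick(primary))
print(greedy_pick(contrastive_logits(primary, reference)))
```

In this toy case the reference pass strongly prefers token 0, so subtracting it shifts the greedy choice to token 1, while token 2 stays masked because it was already implausible under `primary`. Setting `alpha=0` recovers ordinary decoding for the unmasked tokens.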

Code & Models

Repositories

RicardoLuL/VaLiD_LVLMs_hallucinations
PyTorch · Official

Taxonomy

Topics: CCD and CMOS Imaging Sensors · COVID-19 diagnosis using AI · Image Processing Techniques and Applications