Extracting Visual Facts from Intermediate Layers for Mitigating Hallucinations in Multimodal Large Language Models
Haoran Zhou, Zihan Zhang, Hao Chen

TL;DR
This paper introduces EVA, a training-free, model-agnostic method that extracts and leverages visual factual knowledge from intermediate layers of multimodal large language models to significantly reduce object hallucinations.
Contribution
EVA is a novel, training-free approach that dynamically selects and utilizes intermediate-layer visual facts to mitigate hallucinations in MLLMs, applicable across various models and decoding strategies.
Findings
EVA significantly reduces hallucination rates in MLLMs.
EVA is compatible with different decoding strategies.
EVA demonstrates effectiveness across multiple benchmarks.
Abstract
Multimodal Large Language Models (MLLMs) have made significant strides by combining visual recognition and language understanding to generate content that is both coherent and contextually accurate. However, MLLMs continue to struggle with object hallucinations, where models produce seemingly plausible but factually incorrect outputs, including objects that do not exist in the image. Recent work has revealed that the prior knowledge in MLLMs significantly suppresses visual information in deep layers, causing hallucinatory outputs. However, how these priors suppress visual information at the intermediate layer stage in MLLMs remains unclear. We observe that visual factual knowledge and the differences between intermediate-layer prior/original probability distributions show similar evolutionary trends in intermediate layers. Motivated by this, we introduce Decoding by Extracting Visual…
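The abstract is cut off before the method is described, so the following is only a minimal sketch of the general idea it points at: compare, layer by layer, the next-token distribution computed with the image (the "original" distribution) against the one computed without it (the "prior" distribution), pick the intermediate layer where the two differ most, and use that gap to correct the final logits. It should not be read as the authors' algorithm. The use of Jensen-Shannon divergence, the argmax layer selection, the log-ratio correction, the `alpha` weight, and all function names are assumptions made for illustration, and the toy logits are invented.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    # Jensen-Shannon divergence (natural log); an assumed choice of distance.
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def select_layer(original_logits, prior_logits):
    # Hypothetical selection rule: the layer whose with-image distribution
    # differs most from its text-only prior distribution.
    scores = [
        js_divergence(softmax(o), softmax(p))
        for o, p in zip(original_logits, prior_logits)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def adjust_final_logits(final_logits, original_layer, prior_layer, alpha=1.0):
    # Assumed correction: add the selected layer's log-probability gap
    # (original minus prior) to the final-layer logits.
    po, pp = softmax(original_layer), softmax(prior_layer)
    return [
        f + alpha * (math.log(a) - math.log(b))
        for f, a, b in zip(final_logits, po, pp)
    ]

# Toy per-layer logits over a 3-token vocabulary (invented numbers).
original_logits = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.5, 0.0]]
prior_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
final_logits = [1.0, 0.9, 0.0]  # token 0 narrowly wins before correction

best, scores = select_layer(original_logits, prior_logits)
adjusted = adjust_final_logits(
    final_logits, original_logits[best], prior_logits[best]
)
top_before = max(range(len(final_logits)), key=final_logits.__getitem__)
top_after = max(range(len(adjusted)), key=adjusted.__getitem__)
```

In this toy setup the middle layer carries the largest gap between the two distributions, and adding that gap moves the top token from the one favored by the prior to the one the image-conditioned layer supports. In a real MLLM the per-layer logits would come from projecting intermediate hidden states through the language-model head, which this sketch does not attempt.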
