V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention

Nan Sun; Zhenyu Zhang; Xixun Lin; Kun Wang; Yanmin Shang; Naibin Gu; Shuohuan Wang; Yu Sun; Hua Wu; Haifeng Wang; Yanan Cao

arXiv:2512.03542 · cs.CV · December 4, 2025

TL;DR

This paper introduces V-ITI, a novel framework that detects visual neglect in multimodal large language models and intervenes at inference time to reduce hallucinations, improving reliability without sacrificing performance.

Contribution

V-ITI is the first approach to detect visual neglect via head-level activation patterns and selectively intervene, effectively mitigating hallucinations in MLLMs during inference.

Findings

01. V-ITI reduces hallucinations across eight benchmarks.

02. The framework maintains task performance while decreasing hallucinations.

03. It is applicable to various MLLM architectures.

Abstract

Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations: they produce content inconsistent with the input visuals, which undermines reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention scores or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene". This leads to the "over-intervention" problem, which in turn introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal that it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time…
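
The "when to intervene" idea in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not specify the detector or the intervention, so everything below is an assumption made for illustration: a per-head logistic probe (`neglect_score`) stands in for the detection of visual neglect from head-level activations, and an additive shift along a per-head `visual_directions` vector stands in for the intervention. The function names, the threshold, and the strength parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def neglect_score(activation: Vector, weights: Vector, bias: float) -> float:
    """Hypothetical logistic probe: estimated probability that a head neglects the image."""
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gated_intervention(
    head_activations: Sequence[Vector],
    probes: Sequence[Tuple[Vector, float]],
    visual_directions: Sequence[Vector],
    threshold: float = 0.5,
    strength: float = 1.0,
) -> Tuple[List[List[float]], List[int]]:
    """Shift only the heads whose probe score exceeds the threshold.

    Returns the (possibly modified) activations and the indices of the heads
    that were flagged. Heads that are not flagged are copied unchanged, which
    is the point of gating: no intervention where none is detected as needed.
    """
    new_activations: List[List[float]] = []
    flagged: List[int] = []
    for i, (act, (w, b), direction) in enumerate(
        zip(head_activations, probes, visual_directions)
    ):
        if neglect_score(act, w, b) > threshold:
            flagged.append(i)
            # Assumed intervention: additive shift along a per-head direction.
            new_activations.append([a + strength * d for a, d in zip(act, direction)])
        else:
            new_activations.append(list(act))
    return new_activations, flagged


# Toy usage with two heads of dimension 2 (all numbers are made up).
activations = [[1.0, 0.0], [0.0, 1.0]]
probes = [([2.0, 0.0], 0.0), ([2.0, 0.0], -1.0)]
directions = [[0.0, 1.0], [5.0, 5.0]]
shifted, flagged_heads = gated_intervention(activations, probes, directions)
```

In this toy setup only the first head crosses the threshold, so only its activation is shifted. An always-on intervention would also have modified the second head, which is the over-intervention case the abstract argues against.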

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Face Recognition and Perception