The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, Dimitris N. Metaxas

TL;DR
This paper analyzes how large vision-language models hallucinate ungrounded content and introduces VISTA, a training-free inference method that significantly reduces hallucinations by leveraging internal token dynamics and early layer activations.
Contribution
The paper reveals internal token dynamics in LVLMs and proposes VISTA, a novel inference-time intervention that reduces hallucination without external supervision.
Findings
VISTA reduces hallucination by about 40% on average.
VISTA outperforms existing methods across multiple benchmarks.
VISTA is applicable to various decoding strategies and architectures.
Abstract
Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss: visually grounded tokens gradually become less favored throughout generation; (2) early excitation: semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information: visually grounded tokens, though not eventually decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free…
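The ranking analysis described in the abstract can be illustrated with a small sketch. It is not the authors' code. It assumes that a logit vector over the vocabulary is already available for each layer (for example, by projecting intermediate hidden states through the model's output head, which is an assumption here rather than something the abstract states). The function names `token_rank`, `rank_trajectory`, and `peak_layer` are hypothetical, and the toy logits are made up for illustration.

```python
from typing import List, Sequence


def token_rank(logits: Sequence[float], token_id: int) -> int:
    """Rank of token_id within one logit vector (0 = most favored)."""
    target = logits[token_id]
    # Count how many vocabulary entries have a strictly larger logit.
    return sum(1 for value in logits if value > target)


def rank_trajectory(layer_logits: Sequence[Sequence[float]], token_id: int) -> List[int]:
    """Rank of token_id at every layer, ordered from first to last layer."""
    return [token_rank(logits, token_id) for logits in layer_logits]


def peak_layer(ranks: Sequence[int]) -> int:
    """Index of the layer where the token is ranked best (first one on ties)."""
    return min(range(len(ranks)), key=lambda i: ranks[i])


# Toy example (hypothetical numbers): 3 layers, vocabulary of 4 tokens.
toy_layer_logits = [
    [0.0, 1.0, 2.0, 3.0],  # layer 0: token 0 is least favored
    [3.0, 0.0, 1.0, 2.0],  # layer 1: token 0 is most favored
    [1.0, 3.0, 2.0, 0.0],  # layer 2 (final): token 0 has slipped again
]
ranks = rank_trajectory(toy_layer_logits, token_id=0)
best = peak_layer(ranks)
```

In this toy setup, token 0 is ranked best at the middle layer rather than the final one, which is the kind of pattern the abstract calls early excitation. Applying the same rank computation to the final-layer logits at successive generation steps would give the per-step view behind the other two observations.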
Taxonomy
Topics: Digital Media Forensic Detection · Image Retrieval and Classification Techniques · Data Visualization and Analytics
