The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

Zhuowei Li; Haizhou Shi; Yunhe Gao; Di Liu; Zhenting Wang; Yuxiao Chen; Ting Liu; Long Zhao; Hao Wang; Dimitris N. Metaxas

arXiv:2502.03628·cs.CV·July 2, 2025

TL;DR

This paper analyzes how large vision-language models hallucinate ungrounded content and introduces VISTA, a training-free inference method that significantly reduces hallucinations by leveraging internal token dynamics and early layer activations.

Contribution

The paper reveals internal token dynamics in LVLMs and proposes VISTA, a novel inference-time intervention that reduces hallucination without external supervision.

Findings

01

VISTA reduces hallucination by about 40% on average.

02

VISTA outperforms existing methods across multiple benchmarks.

03

VISTA is applicable to various decoding strategies and architectures.

Abstract

Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss - visually grounded tokens gradually become less favored throughout generation; (2) early excitation - semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information - visually grounded tokens, though not eventually decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free…
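The kind of analysis the abstract describes, tracking where a token ranks among all vocabulary logits, can be illustrated with a small sketch. It is not the authors' implementation. It assumes that a logit vector is available for each layer, for example by projecting intermediate hidden states through the output head, which the abstract does not specify. The function names `token_rank` and `rank_across_layers` and the toy logits are hypothetical.

```python
from typing import List

import numpy as np


def token_rank(logits: np.ndarray, token_id: int) -> int:
    """Return the 0-based rank of token_id (0 means highest logit)."""
    # Count how many vocabulary entries score strictly higher.
    return int(np.sum(logits > logits[token_id]))


def rank_across_layers(layer_logits: np.ndarray, token_id: int) -> List[int]:
    """Rank of one token at every layer.

    layer_logits has shape (num_layers, vocab_size); obtaining it from a
    real model is outside this sketch (assumption, see lead-in).
    """
    return [token_rank(row, token_id) for row in layer_logits]


# Toy example: 3 layers, vocabulary of 4 tokens (made-up numbers).
layer_logits = np.array([
    [0.1, 2.0, 0.3, 0.0],
    [0.5, 3.0, 1.0, 0.2],
    [2.5, 1.5, 1.0, 0.1],
])

# Token 1 is top-ranked in the first two layers but drops at the last one.
ranks = rank_across_layers(layer_logits, token_id=1)
print(ranks)
```

In this toy data, token 1 is ranked first in the earlier layers and falls behind token 0 at the final layer. That is the shape of the "early excitation" pattern, and the token's still-high final rank mirrors "hidden genuine information".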

Code & Models

Repositories

LzVv123456/VISTA (PyTorch, official)

Videos

The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering · SlidesLive

Taxonomy

Topics: Digital Media Forensic Detection · Image Retrieval and Classification Techniques · Data Visualization and Analytics