One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi

TL;DR
This paper introduces a unified, training-free framework that manipulates vision tokens in latent space to reduce hallucinations in multimodal large language models, balancing visual enhancement and bias correction.
Contribution
The paper proposes a unified, training-free approach that manipulates vision tokens through two modules, SVC and CRC, to mitigate hallucinations.
Findings
Reduces object hallucinations significantly
Improves POPE accuracy by 2% on LLaVA-1.5
Maintains low inference latency overhead
Abstract
Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language priors, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find that their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic…
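The abstract is truncated before it describes how the two modules operate, so the following is not the paper's method. It is a minimal sketch of the general idea behind insight (2), assuming a generic contrastive-decoding rule: next-token scores from a pass with vision tokens are contrasted against scores from a pass with the vision tokens removed, so that tokens favored by the language prior alone are penalized. The function names, the `alpha` weight, and the toy logits are all hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrast_without_vision(logits_full, logits_no_vision, alpha=1.0):
    # Hypothetical rule (an assumption, not taken from the paper):
    #   score = (1 + alpha) * log p(token | image, text) - alpha * log p(token | text only)
    # Tokens the model would emit without seeing the image are pushed down.
    lp_full = log_softmax(logits_full)
    lp_blind = log_softmax(logits_no_vision)
    return [(1.0 + alpha) * f - alpha * b for f, b in zip(lp_full, lp_blind)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Toy vocabulary and made-up logits for one decoding step.
vocab = ["cat", "dog", "table"]
full = [1.9, 2.0, 0.1]    # pass with vision tokens: "dog" narrowly wins
blind = [0.5, 2.5, 0.1]   # pass with vision tokens removed: prior strongly favors "dog"

scores = contrast_without_vision(full, blind, alpha=1.0)
plain_choice = vocab[argmax(full)]          # greedy choice without contrast: "dog"
contrasted_choice = vocab[argmax(scores)]   # greedy choice after contrast: "cat"
```

In this toy case the prior-driven token loses its narrow lead once the vision-free pass is subtracted. How the paper combines this kind of signal with the augmented-image branch is not recoverable from the truncated abstract.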
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Hallucinations in medical conditions
