TL;DR
This paper introduces ReVisiT, a training-free decoding method for Large Vision Language Models (LVLMs) that references vision tokens during text generation to make better use of visual semantics, with reported gains in both benchmark performance and computational cost.
Contribution
ReVisiT is a training-free decoding approach that uses vision tokens to guide text generation in LVLMs, improving how the model uses visual semantics.
Findings
ReVisiT achieves competitive or superior results on five benchmarks.
ReVisiT reduces computational cost by up to 2x.
Vision tokens encode meaningful visual semantics even during hallucinations.
Abstract
Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision…
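The abstract describes projecting vision tokens into the text token distribution under a vocabulary constraint. The following is only a minimal sketch of that single idea, not the authors' implementation. The function names, the toy unembedding matrix, and the choice of allowed token ids are all hypothetical. How ReVisiT actually builds the constraint and selects vision tokens is not specified in the text above.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project_to_vocab(hidden, unembed):
    # Dot product of one hidden state with each vocabulary row of a
    # (hypothetical) unembedding matrix, giving one logit per text token.
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def constrained_distribution(hidden, unembed, allowed_ids):
    # Assumption: the vocabulary constraint is a plain list of token ids.
    # Only those ids are kept and renormalized with softmax.
    logits = project_to_vocab(hidden, unembed)
    probs = softmax([logits[i] for i in allowed_ids])
    return dict(zip(allowed_ids, probs))

# Toy data: a 4-token vocabulary and a 2-dimensional hidden state.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
vision_hidden = [2.0, 0.5]   # hidden state of one vision token (made up)
allowed = [0, 1, 3]          # hypothetical constrained vocabulary

dist = constrained_distribution(vision_hidden, unembed, allowed)
best_token = max(dist, key=dist.get)
```

In this toy setup, token 2 has the largest unconstrained logit but is excluded by the constraint, so the resulting distribution covers only the allowed ids.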
