Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

Beomsik Cho; Jaehyung Kim

arXiv:2506.09522·cs.CV·May 14, 2026

TL;DR

This paper introduces ReVisiT, a training-free decoding method for LVLMs that references vision tokens during text generation to make better use of visual semantics, improving benchmark results while reducing computational cost.

Contribution

ReVisiT is a training-free decoding approach that uses vision tokens to guide LVLM text generation, improving how the model uses visual semantics.

Findings

01. ReVisiT achieves competitive or superior results on five benchmarks.

02. ReVisiT reduces computational cost by up to 2x.

03. Vision tokens encode meaningful visual semantics even during hallucinations.

Abstract

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision…
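The idea of projecting vision tokens into the text token distribution under a vocabulary constraint can be illustrated with a small sketch. The abstract is truncated before it states how the most relevant vision token is selected or how it is combined with the model's own prediction, so everything specific here is an assumption made for illustration: the function name `revisit_step_sketch`, the top-k candidate set used as the vocabulary constraint, the Jensen-Shannon divergence used as the relevance score, and the element-wise product used to combine distributions. It is not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(hidden, lm_head):
    # Project a hidden state onto the vocabulary: one logit per lm_head row.
    return [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def js_divergence(p, q):
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def revisit_step_sketch(text_hidden, vision_hiddens, lm_head, top_k=3):
    """One decoding step (illustrative assumptions only, see lead-in)."""
    base_logits = project(text_hidden, lm_head)

    # ASSUMPTION: the vocabulary constraint is the top-k tokens of the
    # model's own next-token prediction.
    order = sorted(range(len(base_logits)),
                   key=lambda i: base_logits[i], reverse=True)
    candidate_ids = order[:top_k]
    base = softmax([base_logits[i] for i in candidate_ids])

    # Project every vision token into the same constrained text distribution.
    # ASSUMPTION: "most relevant" means lowest JS divergence from `base`.
    best_idx, best_dist, best_div = -1, None, float("inf")
    for idx, v_hidden in enumerate(vision_hiddens):
        v_logits = project(v_hidden, lm_head)
        v_dist = softmax([v_logits[i] for i in candidate_ids])
        div = js_divergence(base, v_dist)
        if div < best_div:
            best_idx, best_dist, best_div = idx, v_dist, div

    # ASSUMPTION: combine by element-wise product, then renormalize.
    combined = [b * v for b, v in zip(base, best_dist)]
    total = sum(combined)
    combined = [c / total for c in combined]

    best_pos = max(range(len(combined)), key=combined.__getitem__)
    return candidate_ids[best_pos], best_idx, combined
```

In this sketch the text-only prediction proposes the candidate tokens and the selected vision token reweights them, so the visual evidence can change which candidate wins without any training. A real LVLM would supply `text_hidden`, `vision_hiddens`, and `lm_head` from its own forward pass.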

Code & Models

Repositories

bscho333/ReVisiT (GitHub)
