Do Vision Language Models Need to Process Image Tokens?
Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal

TL;DR
This paper investigates whether deep processing of image tokens is necessary in vision language models, finding that visual representations stabilize early and that task complexity influences the need for sustained visual processing.
Contribution
The paper provides a systematic analysis showing that visual representations stabilize within the early layers, and it questions whether deep processing of visual tokens is necessary in VLMs.
Findings
Visual representations rapidly converge to a stable state across layers.
Sustained visual processing is more critical for multi-token generation tasks.
Reducing visual depth affects reasoning trajectories more than final outputs.
Abstract
Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance, or whether visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, i.e., their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable…
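The three diagnostics named in the abstract (entropy, intrinsic dimensionality, trajectory curvature) can be illustrated with a minimal sketch. The estimators below are assumptions chosen for illustration, not the paper's definitions: spectral entropy of the token covariance, the participation ratio as a proxy for intrinsic dimensionality, and the mean turning angle between consecutive layer-to-layer updates as a proxy for curvature. The function names and the `(layers, tokens, dim)` array layout are likewise hypothetical.

```python
import numpy as np

def covariance_spectrum(x):
    """Eigenvalues of the token covariance for x of shape (tokens, dim)."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are the (unnormalized)
    # eigenvalues of its covariance matrix.
    s = np.linalg.svd(centered, compute_uv=False)
    return (s ** 2) / max(x.shape[0] - 1, 1)

def spectral_entropy(x, eps=1e-12):
    """Shannon entropy (nats) of the normalized covariance spectrum.
    ASSUMPTION: one possible entropy proxy, not the paper's estimator."""
    lam = covariance_spectrum(x)
    p = lam / (lam.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions before taking logs
    return float(-(p * np.log(p)).sum())

def participation_ratio(x, eps=1e-12):
    """(sum lam)^2 / sum lam^2: roughly the number of directions carrying
    variance. ASSUMPTION: a stand-in for intrinsic dimensionality."""
    lam = covariance_spectrum(x)
    return float(lam.sum() ** 2 / ((lam ** 2).sum() + eps))

def mean_curvature(hidden, eps=1e-12):
    """Mean turning angle (radians) between consecutive layer updates.

    hidden has shape (layers, tokens, dim); the result has length layers - 2.
    ASSUMPTION: a discrete proxy for trajectory curvature.
    """
    steps = hidden[1:] - hidden[:-1]   # per-layer displacement of each token
    a, b = steps[:-1], steps[1:]
    cos = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    )
    return np.arccos(np.clip(cos, -1.0, 1.0)).mean(axis=-1)

if __name__ == "__main__":
    # Synthetic stand-in for image-token hidden states: 6 layers, 32 tokens, 16 dims.
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(6, 32, 16))
    for layer, x in enumerate(hidden):
        print(layer, spectral_entropy(x), participation_ratio(x))
    print(mean_curvature(hidden))
```

Under these assumptions, a representation that has "stabilized" would show roughly flat entropy and participation-ratio curves across layers and a near-constant turning angle; with a real model, `hidden` would be the stacked per-layer hidden states restricted to the image-token positions.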
