What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Yingqi Fan, Junlong Tong, Anhao Zhao, Xiaoyu Shen

TL;DR
This paper introduces EmbedLens, a probing tool showing that visual tokens in multimodal large language models are sparse and redundant: only a subset carries meaningful image information, a finding that points toward more efficient model design.
Contribution
The work uncovers the semantic sparsity and redundancy in visual tokens and proposes a framework for analyzing their internal processing in multimodal large language models.
Findings
Only about 60% of visual tokens carry image-specific meaning.
Most internal visual computations are redundant for standard tasks.
Alive tokens align with mid-layer representations, not initial embeddings.
Abstract
Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, EmbedLens, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising about 60% of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most…
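The first sentence of the abstract describes visual tokens being projected into the language model's embedding space. A minimal sketch of that step follows, assuming a single linear projector and made-up dimensions. The function name `project_visual_tokens`, all sizes, and the random weights are hypothetical illustrations and are not taken from the paper, which may use a different projector.

```python
import numpy as np


def project_visual_tokens(patch_features, weight, bias):
    """Map vision-encoder patch features of shape (n_patches, d_vision)
    into the language model's embedding space (n_patches, d_model)
    using one linear layer (an assumed, simplified projector)."""
    return patch_features @ weight + bias


rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
n_patches, d_vision, d_model, n_text = 16, 32, 64, 5

patch_features = rng.normal(size=(n_patches, d_vision))  # stand-in for vision encoder output
weight = rng.normal(size=(d_vision, d_model)) * 0.02     # stand-in for learned projector weights
bias = np.zeros(d_model)
text_embeddings = rng.normal(size=(n_text, d_model))     # stand-in for embedded prompt tokens

# Visual tokens now live in the same space as the text embeddings,
# so the two sequences can be concatenated into one LLM input.
visual_tokens = project_visual_tokens(patch_features, weight, bias)
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)

print(llm_input.shape)  # (21, 64)
```

In this toy setup every one of the 16 projected rows is passed to the language model; the paper's finding is that only a subset of such rows (the alive tokens) carries image-specific meaning.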
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
