What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models

Yingqi Fan; Junlong Tong; Anhao Zhao; Xiaoyu Shen

arXiv:2603.00510 · cs.CV · March 3, 2026

Open Access

TL;DR

This paper introduces EmbedLens, a probing tool, and uses it to show that visual tokens in multimodal large language models are semantically sparse: only about 60% of them carry image-specific meaning, and much of the model's internal visual computation is redundant. The findings point toward more efficient model designs.

Contribution

The work uncovers the semantic sparsity and redundancy in visual tokens and proposes a framework for analyzing their internal processing in multimodal large language models.

Findings

01

Only about 60% of visual tokens carry image-specific meaning.

02

Most internal visual computations are redundant for standard tasks.

03

Alive tokens align with mid-layer representations, not initial embeddings.

Abstract

Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, EmbedLens, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising $\approx 60\%$ of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most…
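To make the sink/dead/alive partition concrete, the following is a minimal sketch of the bookkeeping it implies: given one category label per visual token, compute the alive share and keep only the alive tokens. It is not EmbedLens and not the authors' procedure. The excerpt above does not say how tokens are assigned to categories, so the labels here are made up by hand, and the names `alive_fraction`, `keep_alive`, and the `patch_i` placeholders are hypothetical.

```python
from collections import Counter

# Category names come from the abstract; how a token earns its label is
# NOT specified in this excerpt, so the labels below are assigned by hand.
SINK, DEAD, ALIVE = "sink", "dead", "alive"


def alive_fraction(labels):
    """Return the share of tokens labeled alive (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return Counter(labels)[ALIVE] / len(labels)


def keep_alive(tokens, labels):
    """Drop sink and dead tokens, preserving the original token order."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab == ALIVE]


# Toy input: 10 placeholder tokens with invented labels
# (6 alive, 3 dead, 1 sink), chosen only to mirror the ~60% figure.
tokens = [f"patch_{i}" for i in range(10)]
labels = [SINK, ALIVE, ALIVE, DEAD, ALIVE, ALIVE, DEAD, ALIVE, DEAD, ALIVE]

fraction = alive_fraction(labels)
kept = keep_alive(tokens, labels)
print(fraction, kept)
```

In a real pipeline the tokens would be embedding vectors rather than strings, and the labels would come from whatever criterion the paper defines; only the filtering step is illustrated here.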

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning