Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Zhenglun Kong; Yize Li; Fanhu Zeng; Lei Xin; Shvat Messica; Xue Lin; Pu Zhao; Manolis Kellis; Hao Tang; Marinka Zitnik

arXiv:2505.18227·cs.LG·January 14, 2026

TL;DR

This paper advocates for rethinking token reduction in generative models as a fundamental principle that enhances multimodal integration, coherence, and training stability, beyond just improving efficiency.

Contribution

It redefines token reduction from an efficiency tool to a core principle that influences model architecture and applications across vision, language, and multimodal systems.

Findings

01 Token reduction facilitates deeper multimodal integration.

02 It helps mitigate hallucinations and overthinking.

03 It enhances coherence and training stability.

Abstract

In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of Transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications.…
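
The efficiency argument in the abstract rests on the quadratic cost of self-attention: with n tokens, each attention head computes n × n query-key scores. The following is a minimal sketch of that arithmetic, not a method from the paper. The function names and the `keep_ratio` parameter are hypothetical, and the count covers only pairwise scores for one head, ignoring projections and feed-forward layers.

```python
def attention_score_count(num_tokens: int) -> int:
    """Pairwise query-key scores computed by one self-attention head."""
    return num_tokens * num_tokens


def reduced_cost_fraction(num_tokens: int, keep_ratio: float) -> float:
    """Fraction of attention scores left after keeping `keep_ratio` of the tokens.

    `keep_ratio` is an illustrative parameter (assumption), not a term
    taken from the paper.
    """
    kept = max(1, int(num_tokens * keep_ratio))  # always keep at least one token
    return attention_score_count(kept) / attention_score_count(num_tokens)


if __name__ == "__main__":
    n = 1024
    for keep_ratio in (1.0, 0.5, 0.25):
        fraction = reduced_cost_fraction(n, keep_ratio)
        print(f"keep {keep_ratio:.2f} of {n} tokens -> {fraction:.4f} of the scores")
```

Under these assumptions, keeping half of the tokens leaves a quarter of the pairwise scores, which is why token reduction has mostly been treated as an efficiency tool.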

Code & Models

Repositories

zlkong/awesome-token-compression-reduction
Official

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing