Vision-centric Token Compression in Large Language Model

Ling Xing; Alex Jinpeng Wang; Rui Yan; Xiangbo Shu; Jinhui Tang

arXiv:2502.00791 · cs.CL · December 12, 2025

Open Access

TL;DR

Vist is a novel vision-centric token compression framework for large language models that reduces computational costs by selectively converting distant tokens into images, maintaining accuracy while significantly decreasing token usage and resource consumption.

Contribution

The paper introduces Vist, a slow-fast compression framework that leverages vision encoders and a probability-informed visual enhancement to improve token efficiency in LLMs, outperforming existing methods.

Findings

01

Achieves same accuracy with 2.3x fewer tokens

02

Reduces FLOPs by 16% and memory by 50%

03

Outperforms CEPE by 7.6% on average across benchmarks

Abstract

Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with…
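The slow-fast split and the frequency-based masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_context` and `mask_high_frequency`, the `proximal_len` and `top_k` parameters, and the `[MASK]` placeholder are all hypothetical, and in-sequence token counts stand in for whatever probability statistics PVE actually uses. Rendering the distant tokens into images and encoding them with a vision encoder is omitted.

```python
from collections import Counter
from typing import List, Tuple


def split_context(tokens: List[str], proximal_len: int) -> Tuple[List[str], List[str]]:
    """Split a token sequence into (distant, proximal) parts.

    Assumption: the last `proximal_len` tokens form the proximal window
    (slow path, fed to the LLM as text); everything before it is the
    distant context (fast path, which would be rendered into images).
    """
    if proximal_len < 0:
        raise ValueError("proximal_len must be non-negative")
    if proximal_len == 0:
        return list(tokens), []
    return list(tokens[:-proximal_len]), list(tokens[-proximal_len:])


def mask_high_frequency(tokens: List[str], top_k: int, mask_token: str = "[MASK]") -> List[str]:
    """Replace the `top_k` most frequent tokens with `mask_token`.

    Assumption: frequency is counted within this sequence only, as a
    simple stand-in for a probability-informed masking criterion.
    """
    counts = Counter(tokens)
    frequent = {tok for tok, _ in counts.most_common(top_k)}
    return [mask_token if tok in frequent else tok for tok in tokens]


if __name__ == "__main__":
    tokens = "the cat sat on the mat near the door".split()
    distant, proximal = split_context(tokens, proximal_len=3)
    print("distant :", distant)
    print("proximal:", proximal)
    print("masked  :", mask_high_frequency(distant, top_k=1))
```

In this toy setup the function word "the" is the most frequent token in the distant context, so it is the one masked, which loosely mirrors the idea of steering attention away from function words and toward semantically richer tokens.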

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Focus