Vision-centric Token Compression in Large Language Model
Ling Xing, Alex Jinpeng Wang, Rui Yan, Xiangbo Shu, Jinhui Tang

TL;DR
Vist is a novel vision-centric token compression framework for large language models that reduces computational costs by selectively converting distant tokens into images, maintaining accuracy while significantly decreasing token usage and resource consumption.
Contribution
The paper introduces Vist, a slow-fast compression framework that leverages vision encoders and a probability-informed visual enhancement to improve token efficiency in LLMs, outperforming existing methods.
Findings
Achieves same accuracy with 2.3x fewer tokens
Reduces FLOPs by 16% and memory by 50%
Outperforms CEPE by 7.6% on average across benchmarks
Abstract
Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with…
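The slow-fast split and the frequency-based masking described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `split_context`, `mask_frequent`, the `window` and `top_k` parameters, and the `[MASK]` placeholder are all hypothetical names chosen for illustration. Whitespace-separated words stand in for tokens, and masking is applied to text here, whereas Vist renders the distant tokens into images for a vision encoder.

```python
from collections import Counter


def split_context(tokens, window):
    """Split tokens into (distant, proximal).

    The last `window` tokens form the proximal part (slow path, fed to the LLM);
    everything before them is the distant part (fast path, to be compressed).
    """
    if window <= 0:
        return list(tokens), []
    return list(tokens[:-window]), list(tokens[-window:])


def mask_frequent(tokens, corpus_counts, top_k, mask="[MASK]"):
    """Replace the `top_k` most frequent tokens with a mask placeholder.

    Assumption: raw frequency is used as a stand-in for "high-frequency tokens";
    the abstract does not say how frequency is estimated.
    """
    frequent = {tok for tok, _ in corpus_counts.most_common(top_k)}
    return [mask if tok in frequent else tok for tok in tokens]


tokens = "the cat sat on the mat and the dog ate the bone".split()
distant, proximal = split_context(tokens, window=4)
counts = Counter(tokens)  # toy "corpus" statistics taken from the same text
masked_distant = mask_frequent(distant, counts, top_k=1)

print(proximal)        # tokens kept for fine-grained reasoning
print(masked_distant)  # distant tokens with the most frequent word masked
```

In this toy input the only token masked is "the", which mirrors the idea of skipping function words while keeping content words available to the compressor.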
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Focus
