Representation Shift: Unifying Token Compression with FlashAttention
Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim

TL;DR
Representation Shift introduces a training-free, model-agnostic metric that unifies token compression with FlashAttention, enabling faster transformer-based models without retraining or attention maps, applicable across various architectures.
Contribution
It proposes a novel, training-free token importance metric that integrates token compression with FlashAttention, compatible with multiple model types and improving efficiency.
Findings
Achieves up to 5.5% speedup in video-text retrieval.
Enables effective token compression without retraining.
Generalizes beyond Transformers to CNNs and state space models.
Abstract
Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This seamlessly integrates token…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Ferroelectric and Negative Capacitance Devices
