Representation Shift: Unifying Token Compression with FlashAttention

Joonmyung Choi; Sanghyeok Lee; Byungoh Ko; Eunseo Kim; Jihyung Kil; Hyunwoo J. Kim

arXiv:2508.00367·cs.CV·August 4, 2025

TL;DR

Representation Shift is a training-free, model-agnostic token importance metric that needs no attention maps, so token compression can run alongside FlashAttention and speed up Transformer-based models without retraining.

Contribution

The paper proposes a training-free token importance metric that makes token compression compatible with FlashAttention and applies across multiple model types, improving efficiency.

Findings

01

Achieves up to 5.5% speedup in video-text retrieval.

02

Enables effective token compression without retraining.

03

Generalizes beyond Transformers to CNNs and state space models.

Abstract

Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This seamlessly integrates token…
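The idea in the abstract, scoring each token by how much a layer changes its representation and then dropping low-scoring tokens, can be illustrated with a minimal sketch. The truncated abstract does not give the exact formulation, so everything specific here is an assumption: the Euclidean distance as the measure of change, the rule that a larger shift means a more important token, and the hypothetical helper names `representation_shift` and `keep_top_tokens`. The point it shows is that the score needs only a layer's input and output, not an attention map.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def representation_shift(x_in: Sequence[Vector], x_out: Sequence[Vector]) -> List[float]:
    """Per-token change between a layer's input and output.

    Assumption: the change is measured as Euclidean (L2) distance.
    The source only says the metric measures "the degree of change".
    """
    return [math.dist(a, b) for a, b in zip(x_in, x_out)]


def keep_top_tokens(
    x_in: Sequence[Vector],
    x_out: Sequence[Vector],
    keep_ratio: float = 0.5,
) -> Tuple[List[Vector], List[int]]:
    """Keep the tokens with the largest shift (assumed to be the important ones).

    Returns the kept output tokens and their original indices,
    both in the original token order.
    """
    scores = representation_shift(x_in, x_out)
    n_keep = max(1, round(keep_ratio * len(scores)))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so positional structure is preserved.
    keep = sorted(ranked[:n_keep])
    return [x_out[i] for i in keep], keep


# Toy example: four 2-D tokens before and after a hypothetical layer.
x_in = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x_out = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]

kept, idx = keep_top_tokens(x_in, x_out, keep_ratio=0.5)
print(idx)  # [1, 3]
```

Because the score is computed from hidden states alone, a sketch like this could sit around any layer whose input and output have matching shapes, including one that uses a fused attention kernel internally.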

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Ferroelectric and Negative Capacitance Devices