Representative Attention For Vision Transformers
Yuntong Li, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

TL;DR
This paper introduces RPAttention, a linear attention mechanism for Vision Transformers that dynamically forms representative tokens in semantic space, enabling global context modeling at a cost that grows linearly with the number of tokens.
Contribution
It proposes a novel representation-driven token compression method that adapts to content structure, improving efficiency and global information exchange in Vision Transformers.
Findings
RPAttention reduces the complexity of token interaction from quadratic to linear in the number of tokens.
It maintains expressive global context modeling across vision tasks.
Experiments show improved performance on classification, detection, and segmentation.
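The quadratic-to-linear claim can be made concrete with a rough operation count. The symbols are assumptions introduced here, not taken from the paper: N is the number of spatial tokens, d the channel dimension, and M the number of representative tokens, assumed fixed and much smaller than N.

```latex
\[
\underbrace{O(N^{2} d)}_{\text{dense self-attention}}
\quad\longrightarrow\quad
\underbrace{O(N M d)}_{\text{tokens exchange information via } M \text{ representatives}},
\qquad M \ll N .
\]
```

With M held constant as resolution grows, the second cost scales linearly in N. This is the generic argument for proxy-based attention; the paper's exact accounting may differ.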
Abstract
Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial…
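The idea of routing global information through a small set of intermediate tokens can be illustrated with a minimal sketch. This is a generic two-stage proxy attention, not the authors' implementation: the function names, the absence of learned query/key/value projections, and the use of a fixed `reps` array as a stand-in for the representative tokens are all assumptions made for illustration. How RPAttention actually forms its representatives is not specified in the excerpt above.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # a is p x q, b is q x r, both as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attend(queries, keys, values):
    # Scaled dot-product attention without learned projections (an assumption).
    d = len(keys[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)

def proxy_attention(tokens, reps):
    # Hypothetical helper: `reps` (M x d) stands in for representative tokens.
    # Stage 1: each of the M representatives gathers from all N tokens: O(N*M*d).
    summaries = attend(reps, tokens, tokens)
    # Stage 2: each of the N tokens reads from the M summaries: O(N*M*d).
    return attend(tokens, reps, summaries)

random.seed(0)
N, M, d = 6, 2, 4
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
reps = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(M)]
out = proxy_attention(tokens, reps)
print(len(out), len(out[0]))
```

No N x N score matrix is ever built: both stages produce only N x M or M x N scores, which is where the linear scaling in N comes from when M is fixed.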
