Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction
Ziyang Wu, Tianjiao Ding, Yifu Lu, Druv Pai, Jingyuan Zhang, Weida Wang, Yaodong Yu, Yi Ma, Benjamin D. Haeffele

TL;DR
This paper introduces Token Statistics Self-Attention (TSSA), a linear-time attention mechanism for transformers that avoids computing pairwise similarities between tokens, achieving performance comparable to standard attention with improved efficiency across vision and language tasks.
Contribution
The paper presents a novel linear-time attention module derived from a variational form of the MCR$^2$ objective, enabling more efficient transformer architectures.
Findings
TSSA's computational complexity scales linearly with the number of tokens, with accuracy comparable to standard attention.
Replacing standard attention with TSSA significantly reduces computational cost.
TSSA's results call into question whether pairwise similarity is necessary for the success of transformers.
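To make the quadratic-versus-linear contrast in these findings concrete, here is a minimal sketch in plain Python. It is not the paper's TSSA operator, whose exact form is not given in this summary. The function `statistics_gating` is a hypothetical stand-in that only illustrates the general idea of updating tokens from aggregate per-feature statistics rather than from an n × n similarity matrix. Both function names and the gating formula are assumptions made for illustration.

```python
import math


def pairwise_attention(Z):
    """Softmax self-attention with Q = K = V = Z (no learned projections).

    Z is a list of n tokens, each a list of d floats.
    Cost grows as O(n^2 * d): every token is compared with every other token.
    """
    n, d = len(Z), len(Z[0])
    out = []
    for i in range(n):
        # One row of the n x n similarity matrix: n dot products for token i.
        scores = [
            sum(Z[i][c] * Z[j][c] for c in range(d)) / math.sqrt(d)
            for j in range(n)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(weights[j] * Z[j][c] for j in range(n)) / total for c in range(d)]
        )
    return out


def statistics_gating(Z):
    """Hypothetical illustration only (NOT the paper's TSSA).

    Computes one second-moment statistic per feature over all tokens, then
    rescales every token with it. Cost grows as O(n * d): no token pair is
    ever compared directly.
    """
    n, d = len(Z), len(Z[0])
    second_moment = [sum(Z[i][c] ** 2 for i in range(n)) / n for c in range(d)]
    gate = [1.0 / (1.0 + m2) for m2 in second_moment]  # assumed gating rule
    return [[Z[i][c] * gate[c] for c in range(d)] for i in range(n)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(pairwise_attention(tokens))
    print(statistics_gating(tokens))
```

Doubling the number of tokens roughly quadruples the work in `pairwise_attention` but only doubles it in `statistics_gating`, which is the scaling difference the findings refer to.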
Abstract
The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant computational burden, with the computational complexity scaling quadratically with the number of tokens. In this work, we propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens. We derive our network architecture by extending prior work that has shown that a transformer-style architecture naturally arises by "white-box" architecture design, where each layer of the network is designed to implement an incremental optimization step of a maximal coding rate reduction objective (MCR$^2$). Specifically, we derive a novel variational form of the MCR$^2$ objective and show that the architecture…
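For orientation, the maximal coding rate reduction objective mentioned in the abstract is commonly written as below in the earlier MCR$^2$ literature. This is the standard, non-variational form, given here as background only. The variational form derived in this paper is not shown in the excerpt above and is not reproduced here. The symbols are assumptions of this note: $Z \in \mathbb{R}^{d \times n}$ holds $n$ token features of dimension $d$, $\Pi_k$ is a diagonal membership matrix for group $k$, and $\epsilon$ is a quantization precision.

```latex
\Delta R(Z, \Pi, \epsilon)
  = \underbrace{\frac{1}{2}\log\det\!\Big(I + \frac{d}{n\epsilon^{2}}\, Z Z^{\top}\Big)}_{R(Z,\,\epsilon)}
  \;-\;
  \underbrace{\sum_{k=1}^{K} \frac{\operatorname{tr}(\Pi_k)}{2n}
      \log\det\!\Big(I + \frac{d}{\operatorname{tr}(\Pi_k)\,\epsilon^{2}}\, Z \Pi_k Z^{\top}\Big)}_{R_c(Z,\,\epsilon \mid \Pi)}
```

The first term rewards features that spread out overall, and the second penalizes the coding rate within each group, so maximizing the difference favors features that are compact within groups and diverse across them.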
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Image and Signal Denoising Methods · Digital Media Forensic Detection
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Adam
