Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction

Ziyang Wu; Tianjiao Ding; Yifu Lu; Druv Pai; Jingyuan Zhang; Weida Wang; Yaodong Yu; Yi Ma; Benjamin D. Haeffele

arXiv:2412.17810 · cs.LG · December 24, 2024 · 5 cites

Open Access · 1 Repo

TL;DR

This paper introduces Token Statistics Self-Attention (TSSA), a linear-time attention mechanism for transformers that replaces pairwise similarity computations, achieving comparable performance with improved efficiency across vision and language tasks.

Contribution

The paper presents a novel linear-time attention module derived from a variational form of the MCR$^2$ objective, enabling more efficient transformer architectures.
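
For reference, the MCR$^2$ objective as it is commonly stated in the earlier rate-reduction literature is sketched below. This page does not reproduce the formula, so the notation is an assumption borrowed from that prior work and may differ from the paper's: $Z \in \mathbb{R}^{d \times n}$ holds $n$ token features of dimension $d$, $\Pi_k$ are diagonal membership matrices for $K$ groups, and $\epsilon$ is a quantization precision. The variational form that the paper derives from this objective is not shown here.

```latex
% Coding rate of all tokens, and of the tokens grouped by membership \Pi
R(Z) = \frac{1}{2}\log\det\!\Big(I + \frac{d}{n\epsilon^{2}}\, Z Z^{\top}\Big),
\qquad
R_c(Z \mid \Pi) = \sum_{k=1}^{K} \frac{\operatorname{tr}(\Pi_k)}{2n}
  \log\det\!\Big(I + \frac{d}{\operatorname{tr}(\Pi_k)\,\epsilon^{2}}\, Z \Pi_k Z^{\top}\Big)

% Rate reduction: expand the whole set while compressing each group
\Delta R(Z \mid \Pi) = R(Z) - R_c(Z \mid \Pi)
```

Maximizing $\Delta R$ favors features that are spread out overall but compact within each group, which is the sense in which each layer is described as an incremental optimization step.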

Findings

01. TSSA achieves computational complexity that is linear in the number of tokens, with accuracy comparable to standard attention.

02. Replacing standard attention with TSSA reduces computational cost significantly.

03. The results call into question whether pairwise token similarity is necessary for the success of transformers.
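
The contrast between pairwise similarity and token statistics can be illustrated with a minimal pure-Python sketch. It is not the paper's TSSA operator: `second_moment_gate` is a hypothetical stand-in that assumes the "statistic" is a per-feature second moment taken across tokens, and `pairwise_attention` is plain softmax self-attention with no learned weights. The point is only the scaling: the first function scores every token against every other token, while the second makes a fixed number of passes over the tokens.

```python
import math


def pairwise_attention(tokens):
    """Softmax self-attention with Q = K = V = tokens (no learned weights).

    Every token is scored against every other token, so the work grows
    as O(n^2 * d) for n tokens of dimension d.
    """
    n, d = len(tokens), len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        # One similarity score per (query, key) pair: n scores per query.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(weights[j] * tokens[j][c] for j in range(n)) / total
            for c in range(d)
        ])
    return out


def second_moment_gate(tokens):
    """Hypothetical statistics-based operator (an assumption, NOT TSSA).

    Computes one second moment per feature across all tokens, then scales
    each feature by 1 / (1 + moment). No token-to-token scores are formed,
    so the work grows as O(n * d).
    """
    n, d = len(tokens), len(tokens[0])
    moment = [sum(t[c] ** 2 for t in tokens) / n for c in range(d)]
    gate = [1.0 / (1.0 + m) for m in moment]
    return [[t[c] * gate[c] for c in range(d)] for t in tokens]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 2.0]]
    print(pairwise_attention(tokens))
    print(second_moment_gate(tokens))
```

Doubling the number of tokens roughly quadruples the inner-loop work in `pairwise_attention` but only doubles it in `second_moment_gate`, which is the kind of gap the findings above refer to.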

Abstract

The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant computational burden, with the computational complexity scaling quadratically with the number of tokens. In this work, we propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens. We derive our network architecture by extending prior work which has shown that a transformer-style architecture naturally arises by "white-box" architecture design, where each layer of the network is designed to implement an incremental optimization step of a maximal coding rate reduction objective (MCR$^2$). Specifically, we derive a novel variational form of the MCR$^2$ objective and show that the architecture…

Code & Models

Repositories

robinwu218/tost · PyTorch · Official

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Image and Signal Denoising Methods · Digital Media Forensic Detection

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Adam