Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation
Franz A. Heinsen, Leo Kozachkov

TL;DR
This paper shows that self-attention in Transformers can be computed to arbitrary precision at constant cost per token, reducing memory use and computation by orders of magnitude relative to the standard formulation.
Contribution
The paper presents a symmetry-aware Taylor approximation of self-attention that maps queries and keys into a minimal polynomial-kernel feature basis, enabling fixed-cost computation per token in Transformer models.
Findings
Achieves orders-of-magnitude reductions in memory use and computation.
Enables unbounded token generation at a fixed cost per token.
Validates the correctness of the formulation empirically.
Abstract
The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application…
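The core idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it enumerates monomials naively instead of using the paper's efficient feed-forward transformations, it handles a single head, and it omits the usual 1/sqrt(d) scaling of the scores. The names `monomial_features`, `ConstantCostAttention`, and `softmax_attention` are hypothetical. The sketch relies on the identity (q·k)^p / p! = Σ_{|α|=p} q^α k^α / α!, so keeping one feature per unique monomial (rather than all d^p tensor entries) reproduces the truncated Taylor series of exp(q·k), and causal attention then reduces to updating two fixed-size running sums.

```python
import math
from itertools import combinations_with_replacement


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def monomial_features(x, order):
    """Features with dot(phi(q), phi(k)) == sum_{p<=order} (q.k)**p / p!."""
    feats = []
    for p in range(order + 1):
        # Each sorted index tuple is one unique degree-p monomial; symmetry
        # means there are C(d+p-1, p) of them rather than d**p.
        for idx in combinations_with_replacement(range(len(x)), p):
            alpha_fact = 1  # alpha! = product of factorials of multiplicities
            for i in set(idx):
                alpha_fact *= math.factorial(idx.count(i))
            value = 1.0
            for i in idx:
                value *= x[i]
            feats.append(value / math.sqrt(alpha_fact))
    return feats


class ConstantCostAttention:
    """Hypothetical single-head causal attention with a fixed-size state."""

    def __init__(self, d_key, d_value, order):
        n = sum(math.comb(d_key + p - 1, p) for p in range(order + 1))
        self.order = order
        self.S = [[0.0] * d_value for _ in range(n)]  # sum of phi(k) v^T
        self.z = [0.0] * n                            # sum of phi(k)

    def step(self, q, k, v):
        # Fold the new key/value into the running sums (size never grows).
        for a, f in enumerate(monomial_features(k, self.order)):
            self.z[a] += f
            for c, vc in enumerate(v):
                self.S[a][c] += f * vc
        fq = monomial_features(q, self.order)
        denom = dot(fq, self.z)
        return [sum(fq[a] * self.S[a][c] for a in range(len(fq))) / denom
                for c in range(len(v))]


def softmax_attention(qs, ks, vs):
    """Reference causal softmax attention (unscaled scores), O(n) per token."""
    outs = []
    for i, q in enumerate(qs):
        w = [math.exp(dot(q, k)) for k in ks[: i + 1]]
        total = sum(w)
        outs.append([sum(wj * v[c] for wj, v in zip(w, vs)) / total
                     for c in range(len(vs[0]))])
    return outs


# Toy data (made up for illustration).
qs = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
ks = [[0.2, 0.1], [-0.3, 0.5], [0.4, 0.4]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]

attn = ConstantCostAttention(d_key=2, d_value=2, order=6)
approx = [attn.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
exact = softmax_attention(qs, ks, vs)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
```

In this sketch the state (`S` and `z`) has a size set only by the key dimension, value dimension, and Taylor order, so the work per generated token does not depend on how many tokens came before. Raising `order` tightens the approximation at the price of a larger feature basis; for large score magnitudes a low order would be inaccurate, which is one reason practical use would want the score scaling omitted here.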
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Machine Learning in Materials Science · Graph Theory and Algorithms
