Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

Franz A. Heinsen; Leo Kozachkov

arXiv:2602.00294 · cs.LG · February 3, 2026

Open Access

TL;DR

This paper introduces a method for computing self-attention in Transformers at constant cost per token, which the authors report yields orders-of-magnitude reductions in memory use and computation.

Contribution

It presents a symmetry-aware Taylor approximation for self-attention, enabling fixed-cost computation and scalable multi-head attention in Transformer models.

Findings

01. Achieves orders-of-magnitude reductions in memory use and computation.

02. Enables unbounded token generation at fixed cost per token.

03. Validates the correctness of the formulation through empirical experiments.

Abstract

The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application…
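The idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it truncates the Taylor expansion of exp(q·k) at degree 2, uses the symmetry of the degree-2 tensor product to keep only the unique monomials x_i x_j with i <= j, and omits the usual 1/sqrt(d) scaling. The function names `taylor_features`, `streaming_attention`, and `causal_softmax_attention` are hypothetical and chosen for illustration only.

```python
import numpy as np

def taylor_features(x):
    """Map x of shape (d,) to features phi(x) such that
    phi(q) @ phi(k) == 1 + q.k + (q.k)**2 / 2 (degree-2 Taylor series of exp)."""
    d = x.shape[0]
    i, j = np.triu_indices(d)  # unique index pairs with i <= j
    # Each off-diagonal monomial x_i * x_j appears twice in the full tensor
    # product, so it gets weight sqrt(2); diagonal terms get weight 1.
    weight = np.where(i == j, 1.0, np.sqrt(2.0))
    degree2 = weight * x[i] * x[j] / np.sqrt(2.0)  # 1/sqrt(2) supplies the 1/2!
    return np.concatenate(([1.0], x, degree2))

def streaming_attention(Q, K, V):
    """Causal attention with a fixed-size running state (assumption: degree-2
    truncation, no 1/sqrt(d) scaling). Work per token does not grow with t."""
    n, d = Q.shape
    n_feat = 1 + d + d * (d + 1) // 2
    S = np.zeros((n_feat, V.shape[1]))  # running sum of phi(k_t) v_t^T
    z = np.zeros(n_feat)                # running sum of phi(k_t)
    out = np.empty_like(V)
    for t in range(n):
        phi_k = taylor_features(K[t])
        S += np.outer(phi_k, V[t])
        z += phi_k
        phi_q = taylor_features(Q[t])
        # 1 + s + s**2/2 is positive for every real s, so the denominator is > 0.
        out[t] = (phi_q @ S) / (phi_q @ z)
    return out

def causal_softmax_attention(Q, K, V):
    """Conventional causal softmax attention, for comparison only."""
    scores = Q @ K.T
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V
```

In this sketch the state has 1 + d + d(d+1)/2 rows regardless of how many tokens have been processed, which is where the constant cost per token comes from. The degree-2 truncation is only close to softmax attention when the scores q·k are small; higher orders would add further symmetric feature blocks.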

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Machine Learning in Materials Science · Graph Theory and Algorithms