ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Jiecheng Lu; Xu Han; Yan Sun; Viresh Pati; Yubin Kim; Siddhartha Somani; Shihao Yang

arXiv:2602.05230·cs.LG·February 6, 2026

Open Access

TL;DR

ZeroS is a linear attention mechanism that removes the constant zero-order term $1/t$ and reweights the remaining zero-sum softmax residuals. This permits both positive and negative attention weights while keeping $O(N)$ complexity, and the authors report results that match or exceed standard softmax attention on sequence benchmarks.

Contribution

ZeroS removes the constant zero-order term $1/t$ from linear attention and reweights the remaining zero-sum residuals, expanding the set of representable functions beyond convex combinations and improving stability while preserving $O(N)$ complexity.

Findings

01 ZeroS matches or exceeds standard softmax attention on sequence benchmarks.

02 ZeroS enables contrastive operations within a single attention layer.

03 ZeroS maintains linear complexity with enhanced expressiveness.

Abstract

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax…
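
The contrast between convex and zero-sum weights can be illustrated with a small sketch. It assumes only the decomposition named in the abstract (softmax weights over $t$ positions minus the uniform term $1/t$) and does not reproduce the paper's reweighting step or its $O(N)$ formulation. The function names, logits, and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_sum_residuals(scores):
    """Softmax weights minus the uniform term 1/t.

    Illustrative only: the reweighting that ZeroS applies to these
    residuals is not specified in the abstract and is omitted here.
    """
    t = len(scores)
    return [w - 1.0 / t for w in softmax(scores)]

# Hypothetical attention logits and scalar values for t = 4 positions.
scores = [2.0, 0.5, -1.0, 0.0]
values = [1.0, 4.0, -2.0, 3.0]

weights = softmax(scores)               # non-negative, sum to 1 (convex)
residuals = zero_sum_residuals(scores)  # signed, sum to 0 (zero-sum)

convex_out = sum(w * v for w, v in zip(weights, values))
zero_sum_out = sum(r * v for r, v in zip(residuals, values))
mean_value = sum(values) / len(values)

print("sum of softmax weights:", sum(weights))
print("sum of residuals:", sum(residuals))
print("residuals:", residuals)
# The zero-sum output is the softmax average minus the plain mean,
# i.e. a contrast between attended and uniformly averaged content.
print("zero-sum output:", zero_sum_out)
print("convex output minus mean:", convex_out - mean_value)
```

With convex weights the output always lies between the smallest and largest value; with the residuals it is a difference of two averages, which is one way to read the abstract's claim that signed weights allow contrastive operations in a single layer.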

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks