SUS backprop: linear backpropagation algorithm for long inputs in transformers

Sergey Pankov; Georges Harik

arXiv:2505.15080 · cs.LG · June 6, 2025

TL;DR

This paper introduces SUS backprop, a probabilistic method that cuts most of the gradient flow through attention in transformers. It reduces the cost of attention backpropagation from quadratic to linear in sequence length, with only a small increase in gradient variance.

Contribution

It proposes a simple probabilistic rule to sparsify attention backpropagation, significantly reducing computational cost for long sequences in transformers.

Findings

01. Reduces attention backpropagation complexity from $O(n^2)$ to $O(nc)$ for sequence length $n$.

02. Keeps the increase in gradient variance small (about 1%) even at high sparsity.

03. Supports long-sequence training when implemented with sparse matrices.
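
To make the first finding concrete, here is a small arithmetic sketch. It assumes that c is a per-token cap on the number of attention interactions kept during backpropagation (this excerpt does not define c), and the function name and the values of n and c are hypothetical, chosen only for illustration.

```python
def attention_backprop_pairs(n, c=None):
    """Count token-pair interactions visited per attention head.

    Dense backpropagation touches all n * n pairs. With a cap of c kept
    interactions per token (an assumed reading of c), at most n * c
    pairs are touched.
    """
    if c is None:
        return n * n
    return n * min(c, n)


n = 2000  # hypothetical sequence length
c = 25    # hypothetical per-token cap

dense = attention_backprop_pairs(n)
capped = attention_backprop_pairs(n, c)
print(dense, capped, dense // capped)
```

With these illustrative values the dense count is 4,000,000 pairs and the capped count is 50,000, a reduction by a factor of n/c = 80. The capped count grows linearly in n for fixed c, while the dense count grows quadratically.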

Abstract

It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. By cutting the parts that have little effect on the computation, one can potentially save a significant amount of backpropagation computation in exchange for a minimal increase in the stochastic gradient variance, in some situations. Such a situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length $n$. At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of other tokens in the sequence. These weights become promising targets for cutting backpropagation. We propose a simple probabilistic rule…
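
The opening claim of the abstract, that an unbiased estimator can stochastically cut gradient flow, can be illustrated with a generic keep-and-rescale sketch. This is not the paper's rule, which is truncated in this excerpt. The function name `stochastic_cut`, the gradient values, and the keep probabilities below are all hypothetical. Each gradient entry is kept with probability p and divided by p when kept, so its expectation equals the original entry.

```python
import numpy as np


def stochastic_cut(grad, keep_prob, rng):
    """Keep each entry with probability keep_prob, rescale survivors by 1/keep_prob.

    E[output] = keep_prob * (grad / keep_prob) = grad, so the estimator is
    unbiased as long as every keep_prob is strictly positive.
    """
    grad = np.asarray(grad, dtype=float)
    keep_prob = np.broadcast_to(np.asarray(keep_prob, dtype=float), grad.shape)
    mask = rng.random(grad.shape) < keep_prob
    out = np.zeros_like(grad)
    out[mask] = grad[mask] / keep_prob[mask]
    return out


rng = np.random.default_rng(0)

# Hypothetical upstream gradients through four edges of a graph.
grad = np.array([0.9, 0.05, 0.03, 0.02])
# Illustrative keep probabilities: the large entry is always kept,
# the small ones are usually cut.
keep_prob = np.array([1.0, 0.5, 0.3, 0.2])

samples = np.stack([stochastic_cut(grad, keep_prob, rng) for _ in range(20000)])
mean_estimate = samples.mean(axis=0)
print(mean_estimate)  # close to grad, up to sampling noise
```

Entries that are cut contribute nothing to a given backward pass, which is where the compute saving comes from. The price is extra variance, which stays small when the entries being cut are themselves small.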

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Complexity and Algorithms in Graphs · Parallel Computing and Optimization Techniques

Methods: Softmax · Attention Is All You Need