SUS backprop: linear backpropagation algorithm for long inputs in transformers
Sergey Pankov, Georges Harik

TL;DR
This paper introduces SUS backprop, a probabilistic method to cut most attention gradient flow in transformers, reducing backpropagation complexity from quadratic to linear for long sequences with minimal impact on gradient variance.
Contribution
It proposes a simple probabilistic rule to sparsify attention backpropagation, significantly reducing computational cost for long sequences in transformers.
Findings
Reduces attention backpropagation complexity from O(n^2) to O(nc).
Keeps the increase in gradient variance small (~1%) even at high sparsity.
Effective for long-sequence training when implemented with sparse matrices.
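The O(n^2) to O(nc) claim can be illustrated with a small counting sketch. It assumes a hypothetical keep rule, min(1, c * a), applied to each attention weight a. This rule is chosen only for illustration and is not taken from the paper, whose actual rule is not stated in this summary. The names softmax, expected_kept_per_row, n, and c are also assumptions of the sketch. Because each softmax row sums to 1, the expected number of retained entries per row is at most c under this rule, so the expected total is at most n * c rather than n * n.

```python
import numpy as np

def softmax(scores):
    # Numerically stable row-wise softmax.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_kept_per_row(attn, c):
    # Hypothetical keep rule (an assumption, not the paper's stated rule):
    # keep weight a with probability min(1, c * a).
    keep_prob = np.minimum(1.0, c * attn)
    # Sum of keep probabilities = expected number of retained entries per row.
    return keep_prob.sum(axis=-1)

rng = np.random.default_rng(0)
n, c = 512, 8                                      # illustrative sizes only
attn = softmax(4.0 * rng.standard_normal((n, n)))  # toy attention, no causal mask
kept = expected_kept_per_row(attn, c)

dense_entries = n * n        # entries touched by dense attention backprop
sparse_entries = kept.sum()  # expected entries touched after sparsification
```

Under this assumed rule the per-row bound holds for any attention matrix, since the sum over a row of min(1, c * a) cannot exceed c times the row sum.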
Abstract
It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. By cutting the parts that have little effect on the computation, one can potentially save a significant amount of backpropagation computation in exchange for a minimal increase in the stochastic gradient variance, in some situations. Such a situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length n. At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of other tokens in the sequence. These weights become promising targets for cutting backpropagation. We propose a simple probabilistic rule…
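The first sentence of the abstract can be made concrete with a generic sketch of an unbiased stochastic cut: keep each gradient entry with some probability p and rescale the survivors by 1/p, so that the expectation equals the original gradient. The function name stochastic_cut and the keep probabilities used here are hypothetical, and the sketch does not reproduce the paper's specific rule, which is truncated above.

```python
import numpy as np

def stochastic_cut(grad, keep_prob, rng):
    """Randomly zero entries of `grad`, rescaling survivors by 1/keep_prob.

    The result equals `grad` in expectation, provided keep_prob > 0
    wherever grad is nonzero.
    """
    grad = np.asarray(grad, dtype=float)
    keep_prob = np.asarray(keep_prob, dtype=float)
    keep = rng.random(grad.shape) < keep_prob  # True with probability keep_prob
    # Divide only where the entry is kept, so keep_prob == 0 never divides.
    safe_prob = np.where(keep, keep_prob, 1.0)
    return np.where(keep, grad / safe_prob, 0.0)

rng = np.random.default_rng(0)
grad = np.array([1.0, -2.0, 0.5])      # toy upstream gradient
keep_prob = np.array([0.9, 0.5, 0.1])  # hypothetical keep probabilities

# Average many independent cuts; the mean should approach `grad`.
samples = stochastic_cut(np.tile(grad, (200_000, 1)), keep_prob, rng)
mean_estimate = samples.mean(axis=0)
```

The per-entry variance of this estimator is g^2 * (1/p - 1), which is why cutting is cheapest, in variance terms, on entries whose gradient contribution is small.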
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Complexity and Algorithms in Graphs · Parallel Computing and Optimization Techniques
Methods: Softmax · Attention Is All You Need
