Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Yingyu Liang; Zhizhou Sha; Zhenmei Shi; Zhao Song; Yufa Zhou

arXiv:2408.13233·cs.LG·October 16, 2024·2 cites

Open Access

TL;DR

This paper proves that the gradients of a multi-layer transformer can be approximated in almost linear time $n^{1+o(1)}$ in the input sequence length $n$, with approximation error $1/\mathrm{poly}(n)$, instead of the quadratic time that becomes the bottleneck for long inputs.

Contribution

It presents a theoretical framework for fast gradient approximation in multi-layer transformers that covers general loss functions and practical sub-modules such as residual connections, causal masking, and multi-head attention, with the aim of making training more efficient.

Findings

01. Gradient computation time is reduced to almost linear in sequence length.

02. The approximation maintains a polynomially small error across the model.

03. Applicable to general loss functions and practical transformer sub-modules.

Abstract

The computational complexity of the self-attention mechanism in popular transformer architectures poses significant challenges for training and inference, and becomes the bottleneck for long inputs. Is it possible to significantly reduce the quadratic time complexity of computing the gradients in multi-layer transformer models? This paper proves that a novel fast approximation method can calculate the gradients in almost linear time $n^{1+o(1)}$, where $n$ is the input sequence length, while maintaining a polynomially small approximation error $1/\mathrm{poly}(n)$ across the entire model. Our theory holds for general loss functions and when the multi-layer transformer model contains many practical sub-modules, such as residual connections, causal masking, and multi-head attention. By improving the efficiency of gradient computation, we hope that this work will facilitate more effective…
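
The abstract does not say how the quadratic cost is avoided, so the following is only an illustrative sketch, not the paper's algorithm. It assumes, hypothetically, that an $n \times n$ attention-like matrix is already given in low-rank form $U V^\top$ with $U, V \in \mathbb{R}^{n \times k}$ and $k \ll n$. Under that assumption, the product with an $n \times d$ matrix $X$ can be evaluated in $O(nkd)$ operations by reassociating the multiplication, and the causally masked product can be evaluated at the same cost with a running prefix sum. All function names are made up for this example.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Plain dense product of two list-of-lists matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def dense_apply(U, V, X, causal=False):
    """Reference: form the n-by-n matrix U V^T explicitly (quadratic in n)."""
    A = matmul(U, transpose(V))
    if causal:
        # Keep only entries with column index j <= row index i.
        A = [[a if j <= i else 0.0 for j, a in enumerate(row)]
             for i, row in enumerate(A)]
    return matmul(A, X)

def lowrank_apply(U, V, X):
    """(U V^T) X evaluated as U (V^T X): O(n k d) work, no n-by-n matrix."""
    return matmul(U, matmul(transpose(V), X))

def causal_lowrank_apply(U, V, X):
    """Masked product: row i sees only j <= i, via a running k-by-d sum."""
    k, d = len(V[0]), len(X[0])
    S = [[0.0] * d for _ in range(k)]  # S = sum over j <= i of outer(V[j], X[j])
    out = []
    for u, v, x in zip(U, V, X):
        for a in range(k):
            for b in range(d):
                S[a][b] += v[a] * x[b]
        out.append([sum(u[a] * S[a][b] for a in range(k)) for b in range(d)])
    return out

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical sizes, chosen only so the dense reference stays cheap.
random.seed(0)
n, k, d = 64, 4, 3
U = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

print(max_abs_diff(lowrank_apply(U, V, X), dense_apply(U, V, X)))
print(max_abs_diff(causal_lowrank_apply(U, V, X), dense_apply(U, V, X, causal=True)))
```

Both printed differences should be at floating-point rounding level, since the fast and dense routines compute the same quantity in a different order. The sketch leaves out everything the paper actually has to establish: how to obtain a factorization with small enough $k$, how the error propagates through gradients, and how it accumulates across layers.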

Taxonomy

Topics: Quantum optics and atomic interactions · Magneto-Optical Properties and Applications · Neural Networks and Reservoir Computing