Finite-Time Analysis of Gradient Descent for Shallow Transformers
Enes Arda, Semih Cayci, Atilla Eryilmaz

TL;DR
This paper analyzes shallow Transformers trained by projected gradient descent, showing that the width required for its guarantees scales logarithmically with the sample size and that the optimization error is independent of the sequence length, unlike in recurrent models.
Contribution
It offers the first finite-time analysis of shallow Transformers in the kernel regime, highlighting their efficiency and robustness compared to recurrent architectures.
Findings
Width for guarantees scales logarithmically with sample size
Optimization error is independent of sequence length
Transformers are compared numerically with recurrent models on an autoregressive task
Abstract
Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size, and (ii) the optimization error is independent of the sequence length. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with the sequence length. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and compare Transformers with recurrent architectures on an autoregressive task.
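As a rough illustration of the training procedure named in the abstract, the sketch below runs projected gradient descent: each gradient step is followed by a projection that keeps the parameters within a fixed distance of their initial values. It is not the paper's algorithm or model. A plain linear least-squares model stands in for the Transformer, the projection set is assumed to be a Euclidean ball around the initialization (one common choice), and the names `project_onto_ball`, `projected_gd`, `radius`, `lr`, and `steps` are hypothetical.

```python
import math
import random


def project_onto_ball(w, w0, radius):
    """Project w onto the Euclidean ball of the given radius centred at w0."""
    diff = [wi - w0i for wi, w0i in zip(w, w0)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(w)  # already feasible, nothing to do
    scale = radius / norm
    return [w0i + scale * d for w0i, d in zip(w0, diff)]


def projected_gd(xs, ys, w0, radius, lr=0.1, steps=200):
    """Minimise the mean squared error of a linear model, staying near w0.

    Assumption: the linear model is a stand-in for illustration only;
    it is not the shallow Transformer analysed in the paper.
    """
    w = list(w0)
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum_i (<w, x_i> - y_i)^2 with respect to w.
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        # Gradient step followed by projection back onto the ball.
        w = [wi - lr * g for wi, g in zip(w, grad)]
        w = project_onto_ball(w, w0, radius)
    return w


# Toy teacher-student data: labels come from a fixed "teacher" vector.
random.seed(0)
w_true = [1.0, -2.0, 0.5]
xs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
ys = [sum(a * b for a, b in zip(w_true, x)) for x in xs]

w0 = [0.0, 0.0, 0.0]
w_hat = projected_gd(xs, ys, w0, radius=1.0)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_hat, w0)))
```

Because the teacher vector here lies outside the ball of radius 1 around `w0`, the projection is active and `w_hat` stays within distance 1 of the initialization, which is the kind of constraint that keeps training near the starting point.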
