Finite-Time Analysis of Gradient Descent for Shallow Transformers

Enes Arda; Semih Cayci; Atilla Eryilmaz

arXiv:2601.16514 · cs.LG · April 3, 2026

TL;DR

This paper analyzes shallow Transformers trained by projected gradient descent in the kernel regime. It shows that the width required for nonasymptotic guarantees scales logarithmically with the sample size, and that the optimization error is independent of the sequence length, whereas for recurrent models it can grow exponentially with the sequence length.

Contribution

It offers the first finite-time analysis of shallow Transformers in the kernel regime, and contrasts their width and optimization-error guarantees with those of recurrent architectures.

Findings

01. The width required for nonasymptotic guarantees scales logarithmically with the sample size $n$.

02. The optimization error is independent of the sequence length $T$.

03. Transformers outperform recurrent models on the autoregressive task studied.

Abstract

Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and compare Transformers with recurrent architectures on an autoregressive task.
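
As a rough illustration of the training procedure named in the abstract, the sketch below runs projected gradient descent on a toy quadratic loss. The abstract does not specify the paper's constraint set, loss, or step size, so everything here is an assumption made for illustration: the Euclidean ball around the initialization (a common way to keep parameters near their initial values in kernel-regime analyses), the helper names `project_to_ball` and `projected_gd`, and the toy loss itself.

```python
import math


def project_to_ball(w, w0, radius):
    """Project w onto the Euclidean ball of the given radius centered at w0.

    Assumption: the constraint set is a ball around the initialization w0;
    the paper's actual projection set is not given in the abstract.
    """
    diff = [a - b for a, b in zip(w, w0)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(w)
    scale = radius / norm
    return [b + scale * d for b, d in zip(w0, diff)]


def projected_gd(grad_fn, w0, radius, step_size, num_steps):
    """Alternate a plain gradient step with a projection back onto the ball."""
    w = list(w0)
    for _ in range(num_steps):
        g = grad_fn(w)
        w = [wi - step_size * gi for wi, gi in zip(w, g)]
        w = project_to_ball(w, w0, radius)
    return w


# Toy loss (hypothetical): 0.5 * ||w - target||^2, whose gradient is w - target.
target = [3.0, 4.0]
w_init = [0.0, 0.0]


def toy_grad(w):
    return [wi - ti for wi, ti in zip(w, target)]


w_final = projected_gd(toy_grad, w_init, radius=1.0, step_size=0.5, num_steps=50)
print(w_final)
```

Because the unconstrained minimizer `target` lies outside the unit ball, the iterates settle on the boundary point closest to it, which is `target` rescaled to unit norm. In the shallow-Transformer setting the same loop would act on the attention-head parameters, with `grad_fn` replaced by the gradient of the empirical loss.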
