Gradient Descent's Last Iterate is Often (slightly) Suboptimal

Guy Kornowski; Ohad Shamir

arXiv:2604.13870 · math.OC · April 16, 2026

TL;DR

This paper proves that, for convex Lipschitz optimization, the last iterate of gradient descent and of stochastic gradient descent cannot achieve the optimal convergence rate unless the total number of steps is known in advance.

Contribution

It confirms Jain et al.'s conjecture that no universal stepsize schedule can guarantee optimal last iterate convergence without knowing the total number of iterations in advance.

Findings

01

Last iterate convergence rate is at best $\tilde{O}(1/\sqrt{T})$ without prior knowledge of $T$.

02

Even in noiseless gradient descent, an extra poly-logarithmic factor in $T$ is unavoidable.

03

Any adaptive stopping time leads to suboptimal convergence guarantees.
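
For reference, the setting behind these findings can be written out as follows. The notation ($x_t$, $g_t$, $\eta_t$, $f^\star$) is assumed here for illustration and may differ from the paper's; a projection onto a bounded domain may also be applied after each step.

$$
x_{t+1} = x_t - \eta_t\, g_t, \qquad g_t \in \partial f(x_t) \ \ \text{(GD)} \quad \text{or} \quad \mathbb{E}[g_t \mid x_t] \in \partial f(x_t) \ \ \text{(SGD)}.
$$

A stepsize sequence is "anytime" if $\eta_t$ depends on $t$ alone and not on the horizon $T$. With the standard anytime choice $\eta_t \propto 1/\sqrt{t}$, the known last-iterate guarantee is

$$
\mathbb{E}[f(x_T)] - f^\star = O\!\left(\frac{\log T}{\sqrt{T}}\right),
$$

whereas the optimal rate for this problem class is $\Theta(1/\sqrt{T})$.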

Abstract

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T / \sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log…
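
To make the notion of a horizon-free ("anytime") stepsize schedule concrete, here is a minimal sketch of subgradient descent on the toy function $f(x) = |x|$, which is convex and 1-Lipschitz. The function names, the starting point, and the schedule constant `c` are illustrative assumptions, not taken from the paper. This easy instance does not exhibit the paper's lower bound; it only shows the mechanics of a stepsize that depends on $t$ and never on $T$, and how the last iterate's error can be read off at any horizon.

```python
import math


def subgradient_abs(x):
    """Return a subgradient of f(x) = |x| (0 is a valid choice at x = 0)."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def gd_last_iterate(x1, horizon, c=1.0):
    """Run subgradient descent with the anytime stepsize eta_t = c / sqrt(t).

    Returns a list whose entry t-1 is f(x_{t+1}), the last-iterate
    suboptimality after t steps (the minimum value of |x| is 0).
    """
    x = x1
    errors = []
    for t in range(1, horizon + 1):
        eta = c / math.sqrt(t)  # depends on t only, never on the horizon
        x = x - eta * subgradient_abs(x)
        errors.append(abs(x))
    return errors


# Hypothetical starting point; any |x1| <= 2 behaves similarly when c = 1.
errors = gd_last_iterate(x1=0.7, horizon=10_000)
for T in (10, 100, 1_000, 10_000):
    # Third column rescales the error by sqrt(T) to compare against 1/sqrt(T).
    print(T, errors[T - 1], errors[T - 1] * math.sqrt(T))
```

Because the schedule never looks at the horizon, a single run yields the last-iterate error for every $T$ at once. A horizon-dependent schedule such as that of Jain et al. would instead have to be recomputed for each choice of $T$.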
