Gradient Descent's Last Iterate is Often (slightly) Suboptimal
Guy Kornowski, Ohad Shamir

TL;DR
This paper proves that, for convex Lipschitz optimization, the last iterate of gradient descent and stochastic gradient descent cannot achieve the optimal convergence rate unless the total number of steps $T$ is known in advance: with any stepsize schedule chosen independently of $T$, an extra poly-logarithmic factor in $T$ is unavoidable.
Contribution
It confirms Jain et al.'s conjecture that no universal stepsize schedule can guarantee optimal last iterate convergence without knowing the total number of iterations in advance.
Findings
Last iterate convergence rate is at best $\tilde{O}(1/\sqrt{T})$ without prior knowledge of $T$.
Even in noiseless gradient descent, an extra poly-logarithmic factor in T is unavoidable.
The proof suggests that stopping times at which the last iterate is (slightly) suboptimal are unavoidably common.
Abstract
We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log…
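The difference between an "anytime" stepsize schedule and one that depends on the horizon $T$ can be sketched in a few lines of Python. This is only an illustration under assumptions that are not taken from the paper: the test function $f(x) = \sum_i |x_i|$, the schedules $1/\sqrt{t}$ and $1/\sqrt{T}$, and all function names are hypothetical choices. It implements neither the stepsize sequence of Jain et al. nor the paper's lower-bound construction.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def subgradient_descent(
    subgrad: Callable[[Vector], Vector],
    x0: Vector,
    stepsize: Callable[[int], float],
    T: int,
) -> Tuple[Vector, Vector]:
    """Run T steps of (sub)gradient descent.

    Returns the last iterate x_T and the average of x_1, ..., x_T.
    """
    x = list(x0)
    running_sum = [0.0] * len(x)
    for t in range(1, T + 1):
        g = subgrad(x)
        eta = stepsize(t)  # stepsize used at step t
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        running_sum = [s + xi for s, xi in zip(running_sum, x)]
    average = [s / T for s in running_sum]
    return x, average


def abs_sum(x: Vector) -> float:
    """Illustrative convex 1-Lipschitz-per-coordinate objective (an assumption)."""
    return sum(abs(xi) for xi in x)


def abs_sum_subgrad(x: Vector) -> Vector:
    """A valid subgradient of abs_sum: the sign of each coordinate (0 at 0)."""
    return [float((xi > 0) - (xi < 0)) for xi in x]


def anytime_stepsize(t: int) -> float:
    """Schedule that never looks at the horizon: eta_t = 1 / sqrt(t)."""
    return 1.0 / math.sqrt(t)


def fixed_horizon_stepsize(T: int) -> Callable[[int], float]:
    """Schedule that needs T up front: eta_t = 1 / sqrt(T) for every t."""
    eta = 1.0 / math.sqrt(T)
    return lambda t: eta


if __name__ == "__main__":
    x0 = [1.0, -2.0]
    for T in (10, 100, 1000):
        last_any, avg_any = subgradient_descent(abs_sum_subgrad, x0, anytime_stepsize, T)
        last_fix, _ = subgradient_descent(abs_sum_subgrad, x0, fixed_horizon_stepsize(T), T)
        print(T, abs_sum(last_any), abs_sum(avg_any), abs_sum(last_fix))
```

The anytime schedule can be stopped at any step without having been tuned for it, whereas `fixed_horizon_stepsize` must be rebuilt whenever $T$ changes. A toy objective like this one does not exhibit the worst-case gap discussed in the abstract; the sketch only makes the two kinds of schedule concrete.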
