Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

Zijian Liu; Zhengyuan Zhou

arXiv:2312.08531·cs.LG·March 20, 2026·1 cites



TL;DR

This paper provides a unified analysis of the last-iterate convergence of stochastic gradient methods, extending results to broader settings including non-Euclidean norms, composite objectives, and heavy-tailed noise.

Contribution

It introduces a comprehensive framework for proving last-iterate convergence rates of SGD under general conditions, overcoming previous limitations such as bounded noise and compact domains.

Findings

01

Unified convergence analysis for general domains and objectives

02

Extension to non-Euclidean norms and composite optimization

03

Convergence guarantees under heavy-tailed and sub-Weibull noise

Abstract

In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta) \log T / \sqrt{T})$ or $O(\sqrt{\log(1/\delta) / T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all the existing works are either limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the…
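To make the setting in the abstract concrete, here is a minimal sketch of plain subgradient SGD that returns its final iterate, run on the Lipschitz convex function f(x) = |x| over the whole real line (no compact domain, no projection) with Gaussian gradient noise (not almost surely bounded). The function name `sgd_last_iterate`, its parameters, the test function, and the 1/sqrt(t) step size are assumptions chosen for illustration; this is not the paper's unified algorithm or its proof technique.

```python
import math
import random


def sgd_last_iterate(x0, T, step_scale=1.0, noise_std=1.0, seed=0):
    """Run subgradient SGD on f(x) = |x| over the real line for T steps.

    Illustrative sketch only (hypothetical helper, not from the paper).
    Returns (last_iterate, averaged_iterate), where the average is taken
    over the T points visited before each update.
    """
    rng = random.Random(seed)
    x = x0
    running_sum = 0.0
    for t in range(1, T + 1):
        running_sum += x  # accumulate the point visited at step t
        subgrad = (x > 0) - (x < 0)  # a subgradient of |x| (0 at x = 0)
        noisy_grad = subgrad + rng.gauss(0.0, noise_std)  # unbounded noise
        x -= step_scale / math.sqrt(t) * noisy_grad  # no projection step
    return x, running_sum / T


if __name__ == "__main__":
    last, avg = sgd_last_iterate(x0=5.0, T=10_000)
    print("f(last iterate)     =", abs(last))
    print("f(averaged iterate) =", abs(avg))
```

Returning both points makes it easy to compare the final iterate, which is what practitioners usually keep, against the averaged iterate that classical analyses bound.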

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

As implicitly stated in my "Summary", I do think that the goal of the paper is interesting. A result involving high probability guarantees for the last iterate of SGD for convex problems is interesting in my opinion. The proofs are comprehensive and mostly carefully written (even though there are readability issues I will expand on in the later parts of my review). Having a unified result is also nice to cover different important settings.

Weaknesses

Even though I think the main result, high-probability rates for the last iterate of SGD without bounded domains, is interesting and worthy of acceptance (once the correctness is verified), there are many issues with the writing of the paper and proofs that should also be addressed and that prevented me from verifying the correctness. Right now, the repeating theme in the paper is for the authors to spend way too much time and effort showing their improvements in marginal cases, which confuses rea

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 2

Strengths

The work presents high-probability and in-expectation convergence results for the last iterate of SGD on general domains, for convex or strongly convex objectives.

Weaknesses

I find the work incremental compared to the literature. In fact, the work generalizes the convergence results for the last iterate of SGD on convex or strongly convex objectives to general domains that are not necessarily compact. I find the content of the paper better suited for publication in a math/optimization journal than at ICLR.

Reviewer 03 · Rating 8 · accept, good paper · Confidence 3

Strengths

This work makes a solid contribution to the understanding of the last-iterate convergence of stochastic gradient methods, which is an important problem in convex optimization and is of particular interest to the ML community, since in practice the theoretically sub-optimal choice of the last iterate is cheaper and thus more popular. The technical results are general and cover a wide range of settings, bypassing a few constraints of previous works including assumptions on compact domain and

Weaknesses

I do not find any substantial weakness. One limitation of this work is that it lacks a proof sketch or a discussion of the main idea in the main text. I believe the paper would benefit from adding a simple example of $z_t$ that explains how it is used to exploit convexity. Doing so would improve readability by giving readers more intuition about the design of $z_t$ and how it works.

Videos

Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods· slideslive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods

Methods: Stochastic Gradient Descent