Almost sure convergence of stochastic Hamiltonian descent methods

Måns Williamson; Tony Stillfjord

arXiv:2406.16649 · math.OC · July 1, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that stochastic Hamiltonian descent methods, including gradient normalization and soft clipping, converge almost surely to stationary points under various noise and smoothness conditions, using dynamical systems theory.

Contribution

It provides a unified theoretical framework showing almost sure convergence of these methods for a broad class of stochastic optimization problems.

Findings

01. Convergence holds for $L$-smooth objective functions, even when the variance of the stochastic gradients is possibly infinite.

02. The results extend to heavy-tailed noise with bounded variance.

03. The results apply to empirical risk minimization when the variance is possibly infinite but the expectation is finite.

Abstract

Gradient normalization and soft clipping are two popular techniques for tackling instability issues and improving convergence of stochastic gradient descent (SGD) with momentum. In this article, we study these types of methods through the lens of dissipative Hamiltonian systems. Gradient normalization and certain types of soft clipping algorithms can be seen as (stochastic) implicit-explicit Euler discretizations of dissipative Hamiltonian systems, where the kinetic energy function determines the type of clipping that is applied. We make use of dynamical systems theory to show in a unified way that all of these schemes converge to stationary points of the objective function, almost surely, in several different settings: a) for $L$-smooth objective functions, when the variance of the stochastic gradients is possibly infinite, b) under the $(L_{0}, L_{1})$-smoothness assumption, for…
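To make the abstract's central idea concrete, the following is a minimal sketch of how the choice of kinetic energy can turn one update rule into momentum SGD, soft clipping, or gradient normalization. It is not the authors' exact scheme. It assumes dynamics of the form q' = ∇φ(p), p' = -∇F(q) - γp, with the damping term treated implicitly and the gradient explicitly. The names `grad_kinetic` and `hamiltonian_descent`, the three kinetic energies, the step-size schedule, and the Gaussian noise model are all illustrative assumptions.

```python
import math
import random


def grad_kinetic(p, kind="soft_clip"):
    """Gradient of an assumed kinetic energy phi(p); it sets the clipping type."""
    norm = math.sqrt(sum(x * x for x in p))
    if kind == "quadratic":      # phi(p) = ||p||^2 / 2: plain momentum, no clipping
        scale = 1.0
    elif kind == "soft_clip":    # phi(p) = sqrt(1 + ||p||^2): update norm stays below 1
        scale = 1.0 / math.sqrt(1.0 + norm * norm)
    elif kind == "normalize":    # phi(p) = ||p||: unit-norm update direction
        scale = 1.0 / norm if norm > 0.0 else 0.0
    else:
        raise ValueError(f"unknown kinetic energy: {kind}")
    return [scale * x for x in p]


def hamiltonian_descent(grad_fn, q0, steps=2000, gamma=1.0,
                        kind="soft_clip", noise_std=0.1, seed=0):
    """Stochastic implicit-explicit Euler sketch for the assumed dynamics."""
    rng = random.Random(seed)
    q = [float(x) for x in q0]
    p = [0.0] * len(q)
    for k in range(steps):
        # Assumed decreasing step size: sum h = inf, sum h^2 < inf.
        h = (k + 10) ** -0.75
        # Noisy gradient; Gaussian noise is an illustrative choice only.
        g = [gi + noise_std * rng.gauss(0.0, 1.0) for gi in grad_fn(q)]
        # Momentum update: damping term implicit, gradient term explicit.
        p = [(pi - h * gi) / (1.0 + h * gamma) for pi, gi in zip(p, g)]
        # Position update through the gradient of the kinetic energy.
        q = [qi + h * vi for qi, vi in zip(q, grad_kinetic(p, kind))]
    return q
```

For example, `hamiltonian_descent(lambda q: q, [3.0, -2.0])` runs the soft-clipped variant on the quadratic objective F(q) = ||q||²/2. With `kind="soft_clip"` or `kind="normalize"`, each position step has norm at most `h` regardless of how large the noisy gradient is, which is the sense in which the kinetic energy controls the clipping.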

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The paper is written in an easy-to-understand way and the ideas flow smoothly between sections. On the technical side it has the following strengths: 1. Previous work has given guarantees that hold either in expectation or with high probability. These do not guarantee the convergence of every trajectory. However, the results presented in this paper can guarantee the convergence of almost all trajectories, i.e., there exists a set of initial points of measure zero whose trajectories are not gua

Weaknesses

The paper focuses primarily on proving almost sure convergence and makes no claims about the convergence rate. It could benefit from establishing a convergence rate for the family of methods given by equation 9 under one of the settings.

Reviewer 02 · Rating 3 · Confidence 5

Strengths

- The paper is very well-written, and the authors rigorously verify the standard assumptions of the ODE approach, even if this means including assumptions that may seem idealized.

Weaknesses

The paper doesn’t bring substantial new insights to the field; it essentially revisits classical methods, line by line, and demonstrates the applicability of standard ODE-based techniques. While it is certainly rigorous and methodical, the paper ultimately falls short of deepening our understanding of the algorithms themselves or providing fresh perspectives on their behavior. This makes it a rather conventional contribution, adhering closely to established approaches without pushing beyond them

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper is well written, structured, and thus easy to follow. It contributes to the theoretical understanding of SGD with momentum under gradient normalization and soft clipping. Although I didn't read every proof in extreme detail, I believe the formal arguments and results are correct. Minor detail: Since the authors initially present the formulation of the nearly Hamiltonian system in continuous time (see eq. (8)), perhaps the choice of the Lyapunov function in the proof of Theorem 5.7

Weaknesses

The main issue I have with the paper is that, because the proof strategy closely follows the approach developed by Kushner and Yin (2003) (cf. Section 5 and Theorem 5.2.1), it is not immediately clear why their proof cannot be directly invoked after showing that the sequence of iterates is finite almost surely. For example, Lemma A.3 can in fact be found in the first part of Theorem 5.2.1 but this is not explicitly mentioned by the authors. Could the authors explicitly discuss how their proof ex

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Stochastic processes and financial applications · Markov Chains and Monte Carlo Methods