Better LMO-based Momentum Methods with Second-Order Information
Sarit Khirirat, Abdurakhmon Sadiev, Yury Demidovich, Peter Richtárik

TL;DR
This paper introduces an improved LMO-based momentum method that incorporates second-order information, achieving faster convergence rates and better adaptability to different problem geometries in stochastic optimization.
Contribution
It extends the LMO framework by integrating Hessian-Corrected Momentum (HCM), providing convergence guarantees in arbitrary norms with an improved rate of O(1/K^{1/3}).
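A minimal sketch of how an LMO step with Hessian-corrected momentum could look, written in plain Python on a toy diagonal quadratic. This is not the paper's Algorithm 1: the function names (`lmo_linf`, `lmo_l2`, `hcm_lmo_run`), the constant step size `gamma`, the constant momentum parameter `alpha`, the choice β = 1 − α, and the noiseless gradients are all assumptions made for illustration. The update sketched is m_k = (1 − α)(m_{k−1} + H_k (x_k − x_{k−1})) + α g_k, followed by x_{k+1} = x_k + γ · lmo(m_k).

```python
import math

def lmo_linf(m):
    # Minimizer of <m, d> over the unit l_inf ball: d_i = -sign(m_i).
    return [-1.0 if v > 0 else 1.0 if v < 0 else 0.0 for v in m]

def lmo_l2(m):
    # Minimizer of <m, d> over the unit l_2 ball: d = -m / ||m||_2.
    norm = math.sqrt(sum(v * v for v in m))
    return [0.0] * len(m) if norm == 0.0 else [-v / norm for v in m]

def hcm_lmo_run(grad, hvp, x0, lmo, steps, gamma, alpha):
    """Run `steps` LMO updates driven by Hessian-corrected momentum.

    grad(x) returns a gradient estimate; hvp(x, v) returns a
    Hessian-vector product estimate. Both are user-supplied (assumed).
    Returns the final iterate and the last momentum buffer.
    """
    x_prev = list(x0)
    m = grad(x_prev)  # initialize the momentum with the first gradient
    x = [xi + gamma * di for xi, di in zip(x_prev, lmo(m))]
    for _ in range(steps - 1):
        g = grad(x)
        dx = [a - b for a, b in zip(x, x_prev)]
        h = hvp(x, dx)  # second-order correction: Hessian times displacement
        m = [(1 - alpha) * (mi + hi) + alpha * gi
             for mi, hi, gi in zip(m, h, g)]
        x_prev = x
        x = [xi + gamma * di for xi, di in zip(x, lmo(m))]
    return x, m

# Hypothetical toy problem: f(x) = 0.5 * sum(a_i * x_i^2) - sum(b_i * x_i),
# whose minimizer is x_i = b_i / a_i, i.e. [1.0, -0.5].
A_DIAG = [1.0, 4.0]
B = [1.0, -2.0]

def quad_grad(x):
    return [a * xi - b for a, xi, b in zip(A_DIAG, x, B)]

def quad_hvp(x, v):
    return [a * vi for a, vi in zip(A_DIAG, v)]

x_final, m_final = hcm_lmo_run(quad_grad, quad_hvp, [0.0, 0.0],
                               lmo_linf, steps=200, gamma=0.01, alpha=0.1)
print(x_final)
```

Swapping `lmo_linf` for `lmo_l2` changes only the geometry of the step, which is the sense in which the LMO view covers arbitrary norms. With a constant step size the iterates settle into a small neighborhood of the minimizer rather than converging exactly.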
Findings
Achieves a convergence rate of O(1/K^{1/3}) with HCM.
Demonstrates faster training of neural networks.
Validates theoretical results with experiments on MLPs and LSTMs.
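To make the gap between O(1/K^{1/3}) and the O(1/K^{1/4}) rate quoted in the reviews for LMO methods with Polyak momentum concrete, the following back-of-the-envelope sketch inverts a bound of the form 1/K^{1/p} ≤ ε. It ignores every constant hidden in the O(·) notation, which is an assumption; the real constants depend on the problem. The helper name `iterations_for_tolerance` is hypothetical.

```python
def iterations_for_tolerance(inv_eps: int, p: int) -> int:
    """Smallest K with 1 / K**(1/p) <= 1 / inv_eps, i.e. K >= inv_eps**p."""
    return inv_eps ** p

for inv_eps in (10, 100):
    k_quarter = iterations_for_tolerance(inv_eps, 4)  # rate 1/K^{1/4}
    k_third = iterations_for_tolerance(inv_eps, 3)    # rate 1/K^{1/3}
    print(f"eps = 1/{inv_eps}: K^(-1/4) needs {k_quarter}, "
          f"K^(-1/3) needs {k_third}")
```

Under this constant-free reading, the improved rate reduces the iteration count by a factor of 1/ε.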
Abstract
The use of momentum in stochastic optimization algorithms has shown empirical success across a range of machine learning tasks. Recently, a new class of stochastic momentum algorithms has emerged within the Linear Minimization Oracle (LMO) framework--leading to state-of-the-art methods, such as Muon, Scion, and Gluon, that effectively solve deep neural network training problems. However, traditional stochastic momentum methods offer convergence guarantees no better than the O(1/K^{1/4}) rate. While several approaches--such as Hessian-Corrected Momentum (HCM)--have aimed to improve this rate, their theoretical results are generally restricted to the Euclidean norm setting. This limitation hinders their applicability in problems where arbitrary norms are often required. In this paper, we extend the LMO-based framework by integrating HCM, and provide convergence guarantees under…
Peer Reviews
Decision: Submitted to ICLR 2026
* Strong theoretical contribution that raises the convergence rate of LMO-based methods from O(1/K^{1/4}) to O(1/K^{1/3}).
* Analysis covers arbitrary norms and relaxed smoothness, making the results broadly applicable to deep learning settings.
* Two well-motivated algorithmic variants, with and without Hessian smoothness, clarify trade-offs in assumptions.
* Experiments show consistent gains across tasks and norms, matching the theoretical expectations.
* Clear connection between theory and ex
* Experiments are small-scale and do not test large or modern models where Hessian-vector products may become costly.
* No wall-clock or computational cost analysis to justify practical efficiency compared to first-order baselines.
* Missing comparisons with strong modern baselines like STORM, MARS, or adaptive optimizers such as Adam.
* Theoretical analysis focuses on βₖ = 1 − αₖ, while experiments use different βₖ values without full explanation or theoretical support.
* Limited discussion of
This paper extends the study of LMO-type optimizers by incorporating second-order Hessian-corrected momentum. It provides a convergence analysis and proves that an LMO-type optimizer with second-order momentum achieves the optimal $O(1/K^{1/3})$ rate under second-order smoothness. Moreover, empirical experiments show that the proposed optimizer performs better than the other baselines.
There are two main concerns overall:
- I find the discussion related to variance reduction algorithms and the MARS scaling factor confusing and a deviation from the main part of this paper. From my understanding, variance reduction algorithms are vastly different from Hessian-corrected momentum. While this paper seems to focus on the latter, the former seems unrelated. Moreover, it is unclear to me what's the role of the scaling factor $\beta_t / (1-\alpha_t)$ in the proposed Algorithm 1 and relat
1. Establishes $O(K^{-1/3})$ convergence for LMO-based momentum with arbitrary norms and relaxed smoothness, improving on the $O(K^{-1/4})$ bound for LMO+Polyak momentum and aligning with best-known second-order momentum rates in Euclidean settings (Theorems 1-2).
2. Well-structured presentation: Algorithm 1 is easy to implement, assumptions are grouped and referenced, and Table 1 positions the results against prior LMO and second-order momentum rates; figures make the geometry ($\ell_2$ vs $\el
1. Empirical validation is modest in scope and scale: MLP on a 1k-sample dataset and a PTB LSTM, with plots primarily of training loss and gradient norm; there are no validation/test metrics (e.g., perplexity), runtime, or wall-clock/throughput comparisons to quantify the extra cost of Hessian-vector products.
2. The paper cites that HVPs are “roughly the same time as computing the gradient,” but does not measure this in practice; thus the compute-efficiency trade-off of the proposed methods r
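The reviews above turn on the cost of Hessian-vector products (HVPs). As a rough illustration that an HVP never requires forming the full Hessian, the sketch below approximates ∇²f(x)v with a central difference of gradients, which costs two gradient evaluations. This is a swapped-in technique chosen to keep the example dependency-free; it is not how the paper computes HVPs (automatic differentiation is the usual route), and the names `hvp_central_diff`, `toy_grad`, and the step `eps` are hypothetical.

```python
def hvp_central_diff(grad, x, v, eps=1e-5):
    """Approximate (Hessian of f at x) times v using two gradient calls."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus = grad(x_plus)
    g_minus = grad(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]

def toy_grad(x):
    # Gradient of the toy function f(x) = x0^2 * x1 + x1^3.
    return [2.0 * x[0] * x[1], x[0] ** 2 + 3.0 * x[1] ** 2]

# Exact Hessian at (1, 2) is [[4, 2], [2, 12]], so H @ [1, -1] = [2, -10].
hv = hvp_central_diff(toy_grad, [1.0, 2.0], [1.0, -1.0])
print(hv)  # close to [2.0, -10.0]
```

Whether such a product is cheap enough in practice at scale is exactly what the reviewers ask the authors to measure; this toy example says nothing about wall-clock cost.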
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Quantum Computing Algorithms and Architecture
