Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization
Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, Brian Bullins

TL;DR
Stacey introduces an accelerated $\ell_p$-steepest descent algorithm tailored for non-Euclidean optimization in deep learning, offering theoretical guarantees and empirical improvements over traditional methods.
Contribution
The paper presents a novel accelerated $\ell_p$-steepest descent algorithm with primal-dual iterates, addressing non-Euclidean structures in deep network training.
Findings
Faster convergence compared to SGD, AdamW, and Lion.
Higher final accuracy in image classification and LLM pretraining.
Effectiveness of non-Euclidean approaches demonstrated across datasets.
Abstract
While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either $\ell_2$ or $\ell_\infty$ norms, there remains a critical gap in handling the non-Euclidean structure observed in modern deep network training. In this work, we address this need by introducing a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and large language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of $p$ across various models and datasets, underscoring the…
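To make the role of the norm concrete, here is a minimal sketch of the textbook unit-norm $\ell_p$ steepest descent direction, i.e. the vector $d$ with $\|d\|_p = 1$ that minimizes $\langle g, d \rangle$ for a gradient $g$. It is not the Stacey algorithm: the acceleration and the interpolated primal-dual iterates are not specified in this summary and are left out. The function name `lp_steepest_direction` is hypothetical, and the sketch assumes $p > 1$ with dual exponent $q = p/(p-1)$.

```python
import math


def lp_steepest_direction(grad, p):
    """Return d with ||d||_p = 1 minimizing <grad, d> (illustrative helper).

    Assumption: this is the generic l_p steepest descent direction,
    not the accelerated update used by Stacey.
    """
    if p <= 1:
        raise ValueError("p must be greater than 1")
    if math.isinf(p):
        # p = infinity: the direction is the negated sign of each coordinate.
        return [-math.copysign(1.0, g) if g != 0 else 0.0 for g in grad]
    q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1
    dual_norm = sum(abs(g) ** q for g in grad) ** (1.0 / q)
    if dual_norm == 0.0:
        # Zero gradient: no descent direction exists, so return zeros.
        return [0.0 for _ in grad]
    scale = dual_norm ** (q - 1.0)
    # d_i = -sign(g_i) * |g_i|^(q-1) / ||g||_q^(q-1)
    return [-math.copysign(abs(g) ** (q - 1.0), g) / scale for g in grad]
```

With $p = 2$ this reduces to the normalized negative gradient, and with $p = \infty$ it reduces to the negated sign of the gradient, which are the two regimes the abstract contrasts. Intermediate values of $p$ interpolate between them.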
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data
