Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)
Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, Peter Richtárik

TL;DR
This paper introduces Gluon, a new LMO-based optimizer for large language models that bridges the gap between theoretical analysis and practical implementation, demonstrating improved convergence and empirical performance.
Contribution
Gluon is a novel LMO-based optimizer that incorporates layer-wise geometry and refined smoothness assumptions, aligning theory with practical layer-wise optimizer behavior.
Findings
Gluon achieves convergence guarantees with practical stepsizes.
Experiments on NanoGPT and a CNN validate the layer-wise smoothness assumptions.
Gluon outperforms prior LMO-based optimizers in large-scale tasks.
Abstract
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as Muon and Scion. After over a decade of Adam's dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called Gluon, capturing prior…
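To make "layer-wise LMO application" concrete, here is a minimal NumPy sketch of one update in which each layer gets its own LMO call over its own norm ball. It is an illustration under stated assumptions, not the paper's exact algorithm: the spectral-norm ball, the momentum form, and the names `lmo_spectral`, `layerwise_lmo_step`, `radii`, and `beta` are all hypothetical choices made for this example.

```python
import numpy as np


def lmo_spectral(grad, radius):
    """LMO over a spectral-norm ball: argmin of <grad, X> subject to ||X||_op <= radius.

    Assumption for this sketch: the layer's norm is the spectral (operator) norm.
    For grad = U diag(s) V^T, a minimizer is -radius * U V^T.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -radius * (u @ vt)


def layerwise_lmo_step(params, grads, momenta, radii, beta=0.9):
    """One hypothetical layer-wise step: every layer has its own radius and LMO call."""
    new_params, new_momenta = [], []
    for x, g, m, t in zip(params, grads, momenta, radii):
        m = beta * m + (1.0 - beta) * g           # per-layer momentum buffer
        new_params.append(x + lmo_spectral(m, t))  # move along the LMO direction
        new_momenta.append(m)
    return new_params, new_momenta
```

The per-layer `radii` play the role of layer-specific stepsizes, which is the practical behavior the abstract says earlier analyses overlooked. Other norm balls would need a different LMO in place of `lmo_spectral`.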
Taxonomy
Topics: Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
