Lions and Muons: Optimization via Stochastic Frank-Wolfe
Maria-Eleni Sfyraki, Jun-Kun Wang

TL;DR
This paper unifies recent optimizers Lion and Muon with the classical Stochastic Frank-Wolfe method, providing convergence guarantees and extending the framework to handle heavy-tailed noise in non-convex optimization.
Contribution
It shows that Lion and Muon are special cases of Stochastic Frank-Wolfe and develops robust variants for heavy-tailed noise, with theoretical guarantees and practical improvements.
Findings
Lion and Muon are special instances of Stochastic Frank-Wolfe.
New robust variants handle heavy-tailed gradient noise with strong guarantees.
Convergence to stationarity implies convergence to KKT points under norm constraints.
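The first finding can be checked numerically on a single step. The sketch below is not the paper's algorithm: it assumes a fixed direction `m` standing in for whatever momentum or gradient estimate the optimizer uses, and all function names are hypothetical. It relies on two standard facts: the linear minimization oracle over an $\ell_\infty$ ball of radius $r$ is $-r\,\mathrm{sign}(m)$, and over a spectral-norm ball it is $-r\,UV^\top$ for the reduced SVD $M = U S V^\top$. With $r = 1/\lambda$ and $\gamma = \eta\lambda$ (learning rate $\eta$, weight decay $\lambda$), the Frank-Wolfe step $(1-\gamma)x + \gamma\,\mathrm{LMO}(m)$ coincides with a sign-type or orthogonalized step with decoupled weight decay. The Frank-Wolfe gap over the $\ell_\infty$ ball, $\max_{\|v\|_\infty \le r} \langle g, x - v \rangle = \langle g, x \rangle + r\|g\|_1$, is included as the stationarity measure.

```python
import numpy as np


def lmo_linf(g, radius):
    """A minimizer of <g, v> over the ball {v : ||v||_inf <= radius}."""
    return -radius * np.sign(g)


def lmo_spectral(G, radius):
    """A minimizer of <G, V> over the spectral-norm ball {V : ||V||_2 <= radius}."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -radius * (U @ Vt)


def fw_step(x, direction, gamma, lmo, radius):
    """One Frank-Wolfe step: convex combination of x and the LMO output."""
    return (1.0 - gamma) * x + gamma * lmo(direction, radius)


def fw_gap_linf(grad, x, radius):
    """Frank-Wolfe gap over the l_inf ball: <grad, x> + radius * ||grad||_1."""
    return float(np.vdot(grad, x) + radius * np.abs(grad).sum())


rng = np.random.default_rng(0)
lr, weight_decay = 0.01, 0.1          # illustrative values, not from the paper
radius = 1.0 / weight_decay           # constraint radius implied by weight decay
gamma = lr * weight_decay             # Frank-Wolfe step size

# Vector case: sign-type step with decoupled weight decay (Lion-like).
x = rng.uniform(-0.5, 0.5, size=8)
m = rng.normal(size=8)                # stand-in for the momentum direction
lion_like = x - lr * (np.sign(m) + weight_decay * x)
same_lion = np.allclose(fw_step(x, m, gamma, lmo_linf, radius), lion_like)

# Matrix case: orthogonalized step with decoupled weight decay (Muon-like).
# An exact SVD is used here; this is an assumption made for clarity.
X = rng.normal(size=(4, 3))
M = rng.normal(size=(4, 3))
U, _, Vt = np.linalg.svd(M, full_matrices=False)
muon_like = X - lr * (U @ Vt + weight_decay * X)
same_muon = np.allclose(fw_step(X, M, gamma, lmo_spectral, radius), muon_like)

gap = fw_gap_linf(m, x, radius)       # non-negative whenever x lies in the ball
print(same_lion, same_muon, gap >= 0.0)
```

The identification $r = 1/\lambda$ is what turns a soft weight-decay term into a hard norm constraint in this view. The gap is zero exactly when no feasible point decreases the linearized objective, which is why it serves as a stationarity measure for the constrained problem.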
Abstract
Stochastic Frank-Wolfe is a classical optimization method for solving constrained optimization problems. On the other hand, recent optimizers such as Lion and Muon have gained quite significant popularity in deep learning. In this work, building on recent initiatives, we provide a unifying perspective by interpreting these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of a Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure in non-convex optimization for Frank-Wolfe methods. We further find that convergence to this gap implies convergence to a KKT point of the original problem under a norm constraint for Lion and Muon. Moreover, motivated by recent empirical findings that stochastic…
Peer Reviews
Decision: Submitted to ICLR 2026
- The unified view of Lion and Muon as instances of a Stochastic FW method (Algorithm 3), and the corresponding convergence results to KKT points, provide a good partial understanding of those methods.
- The high-probability convergence results are well positioned w.r.t. the state of the art (ignoring log factors).
- **Mismatch between the theoretical constrained setting and practical usage of Lion and Muon.** The paper models Lion and Muon as instances of Stochastic Frank-Wolfe over *bounded convex domains*. However, in their original formulations and practical implementations, **both optimizers operate over the unconstrained space** ($\mathbb{R}^d$ or $\mathbb{R}^{m \times n}$) with a *soft* weight decay regularization term, rather than a *hard* norm constraint. Consequently, the theoretical equivalen
__1) Unification of Modern Optimizers and Classical Theory:__ The paper offers a compelling and elegant theoretical connection between popular optimizers (Lion and Muon) and the classical Stochastic Frank-Wolfe framework through explicit norm constraints. This synthesis is precisely formalized (Theorems 1 and 2) and supported by proofs in Appendix. __2) Theoretical Rigor:__ The work provides convergence guarantees for both classical variance-bounded and heavy-tailed stochastic regimes. The resu
__1) Relationship between Algorithms 1–3 and the role of parameters.__ Algorithms 1 and 2 are special cases of Algorithm 3 for particular choices of the sequences $\beta_{1, t}, \beta_{2, t}, \gamma_t$. I find the current parameterization unnecessarily complicated: a) $\beta_{1, t}$ appears time-invariant in the proofs. b) $\beta_{2, t}$ does not seem to be used in the main convergence theorems. c) $\gamma_t$ is constrained by $\gamma_t \leq 1 - \beta_{2, t} = 1 - \beta$, which again removes
The paper makes a nice connection between two popular ML optimizers, Lion and Muon, and the well-studied Frank-Wolfe method, thus providing a nice foundation for studying the properties of these more recent algorithms, since it is possible to draw on a lot of existing theory. This is a nice connection and, as far as I can see, new. The paper supplies theoretical convergence guarantees (in terms of the Frank–Wolfe gap) for Lion and Muon in the non-convex, stochastic setting, including providing convergence
There could be better analysis of, and comparison to, the existing convergence analyses of Lion and Muon. The paper does discuss other work on Lion and Muon, but much of that discussion is quite technical and difficult to read. There are no clear statements on the novelty of these results or their improvements over prior ones, e.g., in terms of better convergence rates or generalization in terms of simple structures, nor a clear statement of the lack of such improvements. It is unclear, if similar
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Advanced Bandit Algorithms Research
Methods: Evolved Sign Momentum · Weight Decay
