Lions and Muons: Optimization via Stochastic Frank-Wolfe

Maria-Eleni Sfyraki; Jun-Kun Wang

arXiv:2506.04192·math.OC·February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper unifies recent optimizers Lion and Muon with the classical Stochastic Frank-Wolfe method, providing convergence guarantees and extending the framework to handle heavy-tailed noise in non-convex optimization.

Contribution

It shows Lion and Muon as special cases of Stochastic Frank-Wolfe and develops robust variants for heavy-tailed noise, with theoretical guarantees and practical improvements.
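A minimal sketch of why a sign-based or orthogonalized update with weight decay can be read as a Frank-Wolfe step. It is not the paper's Algorithm 3: the function names, the generic `direction` argument (standing in for whatever momentum-averaged gradient estimate the optimizer uses), and the parameter mapping `radius = 1 / weight_decay`, `gamma = lr * weight_decay` are assumptions worked out here by direct algebra.

```python
import numpy as np

def lmo_linf(g, radius):
    """Linear minimization oracle over the l-infinity ball:
    argmin of <g, s> subject to ||s||_inf <= radius."""
    return -radius * np.sign(g)

def lmo_spectral(G, radius):
    """Linear minimization oracle over the spectral-norm ball:
    argmin of <G, S> subject to ||S||_2 <= radius, via the thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -radius * (U @ Vt)

def frank_wolfe_step(x, direction, gamma, lmo, radius):
    """One Frank-Wolfe step: convex combination of x and the LMO output."""
    return (1.0 - gamma) * x + gamma * lmo(direction, radius)

def sign_step_with_weight_decay(x, direction, lr, weight_decay):
    """Lion-style step (hypothetical helper): sign update plus decoupled weight decay."""
    return x - lr * (np.sign(direction) + weight_decay * x)

# Check that the two updates coincide under the assumed parameter mapping.
rng = np.random.default_rng(0)
x, d = rng.normal(size=5), rng.normal(size=5)
lr, weight_decay = 0.01, 0.1
fw = frank_wolfe_step(x, d, lr * weight_decay, lmo_linf, 1.0 / weight_decay)
lion_like = sign_step_with_weight_decay(x, d, lr, weight_decay)
print(np.allclose(fw, lion_like))
```

Swapping `lmo_linf` for `lmo_spectral` gives the matrix analogue, where the sign is replaced by the orthogonal factor $UV^\top$ of the update direction, as in Muon-style updates.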

Findings

01. Lion and Muon are special instances of Stochastic Frank-Wolfe.

02. New robust variants handle heavy-tailed gradient noise with strong guarantees.

03. Convergence to stationarity implies convergence to KKT points under norm constraints.
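For reference, the Frank-Wolfe gap used as the stationarity measure has a closed form when the constraint set is a norm ball. The derivation below is the standard one and is not quoted from the paper; the symbols $f$, $\mathcal{C}$, $r$, and the dual norm $\|\cdot\|_*$ are notation assumed here.

$$
\mathcal{G}(x) = \max_{s \in \mathcal{C}} \langle \nabla f(x),\, x - s \rangle, \qquad \mathcal{C} = \{ s : \|s\| \le r \}
$$

$$
\mathcal{G}(x) = \langle \nabla f(x),\, x \rangle + r\, \|\nabla f(x)\|_*
$$

The second line follows because $\max_{\|s\| \le r} \langle -\nabla f(x), s \rangle = r\, \|\nabla f(x)\|_*$. For $x \in \mathcal{C}$ the gap is non-negative, and $\mathcal{G}(x) = 0$ holds exactly when $\langle \nabla f(x), s - x \rangle \ge 0$ for all $s \in \mathcal{C}$, which is first-order stationarity for the constrained problem. With the $\ell_\infty$ norm the dual norm is $\ell_1$ (the sign-update case), and with the spectral norm it is the nuclear norm (the orthogonalized-update case).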

Abstract

Stochastic Frank-Wolfe is a classical optimization method for solving constrained optimization problems. On the other hand, recent optimizers such as Lion and Muon have gained quite significant popularity in deep learning. In this work, building on recent initiatives, we provide a unifying perspective by interpreting these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of a Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure in non-convex optimization for Frank-Wolfe methods. We further find that convergence to this gap implies convergence to a KKT point of the original problem under a norm constraint for Lion and Muon. Moreover, motivated by recent empirical findings that stochastic…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The unified view of Lion and Muon as instances of a Stochastic FW method (Algorithm 3), and the corresponding convergence results to KKT points, provide a good partial understanding of those methods.
- The high-probability convergence results are well positioned with respect to the state of the art (ignoring log factors).

Weaknesses

- **Mismatch between the theoretical constrained setting and practical usage of Lion and Muon.** The paper models Lion and Muon as instances of Stochastic Frank-Wolfe over *bounded convex domains*. However, in their original formulations and practical implementations, **both optimizers operate over the unconstrained space** ($\mathbb{R}^d$ or $\mathbb{R}^{m \times n}$) with a *soft* weight decay regularization term, rather than a *hard* norm constraint. Consequently, the theoretical equivalen

Reviewer 02 · Rating 4 · Confidence 4

Strengths

__1) Unification of Modern Optimizers and Classical Theory:__ The paper offers a compelling and elegant theoretical connection between popular optimizers (Lion and Muon) and the classical Stochastic Frank-Wolfe framework through explicit norm constraints. This synthesis is precisely formalized (Theorems 1 and 2) and supported by proofs in the Appendix.

__2) Theoretical Rigor:__ The work provides convergence guarantees for both classical variance-bounded and heavy-tailed stochastic regimes. The resu

Weaknesses

__1) Relationship between Algorithms 1–3 and the role of parameters.__ Algorithms 1 and 2 are special cases of Algorithm 3 for particular choices of the sequences $\beta_{1, t}, \beta_{2, t}, \gamma_t$. I find the current parameterization unnecessarily complicated: a) $\beta_{1, t}$ appears time-invariant in the proofs. b) $\beta_{2, t}$ does not seem to be used in the main convergence theorems. c) $\gamma_t$ is constrained by $\gamma_t \leq 1 - \beta_{2, t} = 1 - \beta$, which again removes

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The paper makes a nice connection between two popular ML optimizers, Lion and Muon, and the well-studied Frank-Wolfe method, thus providing a solid foundation for studying the properties of these more recent algorithms, since it is possible to draw on a lot of existing theory. This is a nice connection and, as far as I can see, new. The paper supplies theoretical convergence guarantees (in terms of the Frank-Wolfe gap) for Lion and Muon in the non-convex, stochastic setting, including providing convergence

Weaknesses

There could be a better analysis of, and comparison with, the existing convergence analyses of Lion and Muon. The paper does discuss other work on Lion and Muon, but much of that discussion is quite technical and difficult to read. There are no clear statements on the novelty of, and improvements over, these results, e.g., in terms of a better convergence rate or a generalization in terms of simple structures, nor a clear statement that such improvements are lacking. It is unclear, if similar

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Advanced Bandit Algorithms Research

Methods: Evolved Sign Momentum · Weight Decay