Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Tim Tsz-Kit Lau; Weijie Su

arXiv:2605.18106·math.OC·May 19, 2026

TL;DR

This paper introduces a symmetry-compatible principle for optimizer design in deep learning: an optimizer's update rule should be equivariant under the symmetry group acting on the weight block it updates. Optimizers built this way are reported to train more stably and reach lower validation loss.

Contribution

The paper develops a unified framework for symmetry-aware optimizers across matrix parameter classes (embeddings, LM heads, SwiGLU MLPs, and MoE routers), extending beyond orthogonal groups to permutation and shared-shift symmetries.

Findings

01  Symmetry-compatible optimizers improve validation loss across multiple language models.

02  Experiments show enhanced training stability with the proposed methods.

03  The approach unifies and generalizes existing equivariant optimization techniques.

Abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive…
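To make the equivariance requirement in the abstract concrete, the following is a minimal numerical sketch, not the authors' implementation. It checks the identity update(P G Qᵀ) = P update(G) Qᵀ for random orthogonal P and Q on two update maps. The assumptions are ours: `polar_update` computes the exact polar factor by SVD as an idealized stand-in for the orthogonalized updates used by methods such as Muon, and `sign_update` is a simplified stand-in for a coordinate-wise rule. All function names are hypothetical.

```python
import numpy as np


def polar_update(grad):
    """Orthogonal polar factor U V^T of the gradient, via a reduced SVD.

    Assumption: an idealized stand-in for orthogonalized updates.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def sign_update(grad):
    """Entrywise sign: a simplified coordinate-wise update (assumption)."""
    return np.sign(grad)


def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def equivariance_gap(update, grad, p, q):
    """Frobenius norm of update(P G Q^T) - P update(G) Q^T; zero means equivariant."""
    return np.linalg.norm(update(p @ grad @ q.T) - p @ update(grad) @ q.T)


rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))  # hypothetical gradient of a 6 x 4 weight block
p = random_orthogonal(6, rng)
q = random_orthogonal(4, rng)

polar_gap = equivariance_gap(polar_update, grad, p, q)
sign_gap = equivariance_gap(sign_update, grad, p, q)
print(f"polar update gap: {polar_gap:.2e}")
print(f"sign update gap:  {sign_gap:.2e}")
```

Under these assumptions the polar update's gap should sit at floating-point round-off, while the coordinate-wise update's gap is of order one: rotating the parameter coordinates changes the step it takes, which is the disparity the abstract describes.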

Code & Models

Repositories

timlautk/equivariant_optimizers (GitHub)
