NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training

Hadi Mohaghegh Dolatabadi; Thalaiyasingam Ajanthan; Sameera Ramasinghe; Chamin P Hewa Koneputugodage; Shamane Siriwardhana; Violetta Shevchenko; Karol Pajak; James Snewin; Gil Avraham; and Alexander Long

arXiv:2603.03597·cs.LG·March 5, 2026

Open Access

TL;DR

NuMuon augments the Muon optimizer with a nuclear-norm constraint on the update direction, strengthening the low-rank structure of LLM weights. This improves compressibility while maintaining training quality at large scales.

Contribution

This work shows that Muon-trained models exhibit pronounced low-rank structure in their weight matrices despite Muon's full-rank updates, and proposes NuMuon to enforce that structure explicitly, leading to better compression without sacrificing convergence.
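To make the contrast between a full-rank update and a nuclear-norm-constrained one concrete, here is a minimal NumPy sketch. It relies on two standard linear-algebra facts rather than on the paper's algorithm: replacing every singular value of a matrix with 1 gives a full-rank, orthogonalized direction, and the maximizer of the inner product with a matrix over the unit nuclear-norm ball is its top singular pair, which has rank 1. The function names, the random matrix `G`, and the use of an exact SVD are assumptions made for illustration only. This is not NuMuon's actual update rule, which the truncated abstract on this page does not specify.

```python
import numpy as np

def orthogonalized_direction(G):
    """Return U @ Vt for G = U diag(s) Vt, i.e. set every singular value to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def nuclear_ball_direction(G):
    """Maximizer of <G, X> over the unit nuclear-norm ball: the rank-1 matrix u1 v1^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return np.outer(U[:, 0], Vt[0, :])

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 6))  # hypothetical stand-in for a gradient/momentum matrix

full = orthogonalized_direction(G)
rank1 = nuclear_ball_direction(G)

rank_full = np.linalg.matrix_rank(full)  # equals min(8, 6) for a generic G
rank_one = np.linalg.matrix_rank(rank1)  # a single outer product has rank 1
print(rank_full, rank_one)
```

The sketch only shows why a nuclear-norm constraint pushes a direction toward low rank. How the constraint is combined with Muon in practice is left to the paper.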

Findings

01. NuMuon increases model weight compressibility.
02. NuMuon improves post-compression model quality.
03. Models trained with NuMuon retain Muon's convergence benefits.
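As a rough illustration of what "compressibility" means for a weight matrix, the sketch below applies a plain truncated SVD to a synthetic, nearly low-rank matrix and reports the reconstruction error and parameter count. The matrix sizes, noise level, and helper names are assumptions chosen for the example. The compression pipelines evaluated in the paper may differ from this basic factorization.

```python
import numpy as np

def truncated_svd(W, rank):
    """Factor W into A (m x rank) and B (rank x n) using the top singular triplets."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # scale each kept left vector by its singular value
    B = Vt[:rank, :]
    return A, B

def relative_error(W, A, B):
    """Frobenius-norm reconstruction error relative to the norm of W."""
    return np.linalg.norm(W - A @ B) / np.linalg.norm(W)

rng = np.random.default_rng(0)
m, n, true_rank = 64, 48, 4
# Synthetic weight matrix: exactly rank 4 plus small dense noise (illustrative only).
W = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
W += 0.01 * rng.standard_normal((m, n))

A, B = truncated_svd(W, rank=4)
err = relative_error(W, A, B)
stored = A.size + B.size  # parameters kept after factorization
dense = W.size            # parameters in the original matrix
print(f"relative error {err:.4f}, parameters {stored} vs {dense}")
```

The more of a matrix's energy sits in its leading singular values, the smaller the rank needed for a given error, which is the property a low-rank-promoting optimizer would aim to strengthen.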

Abstract

The rapid progress of large language models (LLMs) is increasingly constrained by memory and deployment costs, motivating compression methods for practical deployment. Many state-of-the-art compression pipelines leverage the low-rank structure of trained weight matrices, a phenomenon often associated with the properties of popular optimizers such as Adam. In this context, Muon is a recently proposed optimizer that improves LLM pretraining via full-rank update steps, but its induced weight-space structure has not been characterized yet. In this work, we report a surprising empirical finding: despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. Motivated by this insight, we propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction,…

Taxonomy

Topics: Machine Learning in Materials Science · Big Data and Digital Economy · Stochastic Gradient Optimization Techniques