MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Lianhai Ren; Yucheng Ding; Xiao Liu; Qianxiao Li; Peng Cheng; Yeyun Gong

arXiv:2602.01734·cs.LG·February 3, 2026

TL;DR

This paper introduces MSign, an optimizer that prevents training instability in large language models by periodically restoring the stable rank of weight matrices, which avoids gradient explosions and makes training more robust.

Contribution

We propose MSign, a novel optimizer that periodically restores the stable rank of weight matrices to prevent training collapse in large language models.
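As a rough illustration of what a periodic stable-rank restoration step could look like, here is a minimal NumPy sketch. It rests on several assumptions that the summary does not confirm: that the "matrix sign" of a rectangular weight matrix is the polar factor `U @ Vt` from its thin SVD, that the result is rescaled to keep the original Frobenius norm, and that the step fires every `period` optimizer steps. The function names and the `period` parameter are hypothetical; the paper's actual schedule, scaling, and choice of layers may differ.

```python
import numpy as np


def stable_rank(w):
    """Squared Frobenius norm divided by squared spectral norm."""
    return float(np.linalg.norm(w, "fro") ** 2 / np.linalg.norm(w, 2) ** 2)


def matrix_sign(w):
    """Polar factor U @ Vt from the thin SVD (assumed meaning of 'matrix sign')."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt


def restore_stable_rank(w):
    """Replace w by its matrix sign, rescaled to the original Frobenius norm.

    The rescaling is an assumption made for this sketch.
    """
    s = matrix_sign(w)
    return s * (np.linalg.norm(w, "fro") / np.linalg.norm(s, "fro"))


def maybe_restore(w, step, period=100):
    """Apply the restoration every `period` steps (hypothetical schedule)."""
    if step > 0 and step % period == 0:
        return restore_stable_rank(w)
    return w


# A weight matrix dominated by one direction has a stable rank close to 1.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 6)) + 50.0 * np.outer(rng.normal(size=8), rng.normal(size=6))
w_restored = maybe_restore(w, step=100)
w_untouched = maybe_restore(w, step=101)

print(stable_rank(w), stable_rank(w_restored))
```

Because the polar factor has all singular values equal to one, the restored matrix reaches the maximum stable rank for its shape, min(rows, columns), while the rescaling keeps its overall magnitude unchanged.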

Findings

1. MSign effectively prevents training failures across models from 5M to 3B parameters.

2. MSign incurs less than 7% additional computational overhead.

3. In theory, declining stable rank and increasing alignment between adjacent layer Jacobians jointly cause the gradient norm to grow exponentially with network depth.

Abstract

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than…
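The stable rank defined in the abstract (squared Frobenius norm over squared spectral norm) can be computed directly. The following is a minimal NumPy sketch; the function name `stable_rank` and the example matrices are illustrative choices, not taken from the paper.

```python
import numpy as np


def stable_rank(w: np.ndarray) -> float:
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    fro_sq = np.linalg.norm(w, ord="fro") ** 2
    spec_sq = np.linalg.norm(w, ord=2) ** 2  # largest singular value, squared
    return float(fro_sq / spec_sq)


# An identity matrix has all singular values equal, so its stable rank is n.
identity_srank = stable_rank(np.eye(4))

# A rank-one matrix has a single nonzero singular value, so its stable rank is 1.
u = np.array([[1.0], [2.0], [3.0]])
v = np.array([[1.0, 0.0, -1.0, 2.0]])
rank_one_srank = stable_rank(u @ v)

print(identity_srank, rank_one_srank)
```

The stable rank always lies between 1 and the ordinary rank, so a value drifting toward 1 means the matrix's energy is concentrating in a single singular direction, which is the decline the abstract identifies as a precursor to collapse.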

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Stochastic Gradient Optimization Techniques