Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

Ziqing Wen; Zhouyang Liu; Jiahuan Wang; Ping Luo; Li Shen; Dongsheng Li; Tao Sun

arXiv:2605.05794 · cs.LG · May 8, 2026

TL;DR

This paper introduces MoLS, a method that uses signal-to-noise ratio estimates to automatically scale learning rates across modules in LLMs, improving training efficiency and performance.

Contribution

The paper proposes MoLS, a module-wise learning rate scaling method driven by signal-to-noise ratio (SNR) estimates, which addresses module-level gradient heterogeneity in LLM training without manually tuned module-specific learning rates.

Findings

01. MoLS improves convergence speed in LLM training benchmarks.

02. MoLS achieves performance comparable to manual module-specific learning rates.

03. MoLS is compatible with memory-efficient training algorithms.

Abstract

The impressive performance of large language models (LLMs) arises from their massive scale and heterogeneous module composition. However, this structural heterogeneity introduces additional optimization challenges. While adaptive optimizers such as Adam(W) provide per-parameter adaptivity, they do not explicitly account for module-level gradient heterogeneity, resulting in slower convergence, suboptimal performance, or training instability. Existing approaches typically rely on manually tuned module-specific learning rates or specific optimization strategies, which are computationally costly and difficult to generalize across tasks or models. To establish a more principled approach, we first analyze the noise-damping behavior of Adam in high-noise modules and introduce Module-wise Learning Rate Scaling via SNR (MoLS). MoLS estimates module-level SNRs to scale Adam updates,…
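
To make the idea of a module-level SNR concrete, the sketch below computes one plausible estimate from a stack of minibatch gradients: squared norm of the mean gradient divided by the summed per-parameter variance. This is an illustrative assumption, not the paper's estimator; the abstract is truncated before it defines the SNR or the rule that maps it to a learning-rate scale, so no scaling rule is shown. The names `module_snr` and `module_snrs`, and the module names `attn` and `mlp`, are hypothetical.

```python
import numpy as np


def module_snr(grad_samples, eps=1e-12):
    """Illustrative SNR estimate for one module (assumed definition).

    grad_samples: array-like of shape (num_minibatches, num_params),
    one flattened gradient of the module per row.
    """
    g = np.asarray(grad_samples, dtype=np.float64)
    signal = np.sum(g.mean(axis=0) ** 2)  # squared norm of the mean gradient
    noise = np.sum(g.var(axis=0))         # total variance across minibatches
    return float(signal / (noise + eps))


def module_snrs(grads_by_module):
    """Apply module_snr to each entry of a {module_name: grad_samples} dict."""
    return {name: module_snr(g) for name, g in grads_by_module.items()}


# Synthetic demo: two hypothetical modules with the same mean gradient
# but different noise levels.
rng = np.random.default_rng(0)
mean_grad = np.ones(64)
demo = {
    "attn": mean_grad + 0.1 * rng.standard_normal((32, 64)),  # low noise
    "mlp": mean_grad + 5.0 * rng.standard_normal((32, 64)),   # high noise
}
for name, snr in module_snrs(demo).items():
    print(f"{name}: SNR estimate = {snr:.3g}")
```

In this synthetic setup the low-noise module gets a much larger estimate than the high-noise one, which is the kind of module-level imbalance the abstract describes.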
