Extending $\mu$P: Spectral Conditions for Feature Learning Across Optimizers
Akshita Gupta, Marieme Ngom, Sam Foreman, Venkatram Vishwanath

TL;DR
This paper introduces a spectral condition framework to extend the maximal update parameterization ($\mu$P) to various optimizers, enabling hyperparameter transferability across model sizes and improving large-scale training efficiency.
Contribution
The paper develops a novel spectral condition approach to derive $\mu$P for multiple optimizers, facilitating hyperparameter transfer and scaling insights for large language models.
Findings
Zero-shot learning rate transfer across model widths.
Effective $\mu$P application to optimizers like AdamW, LAMB, and Shampoo.
Empirical validation on benchmark models.
Abstract
Several variations of adaptive first-order and second-order optimization methods have been proposed to accelerate and scale the training of large language models. The performance of these optimization routines is highly sensitive to the choice of hyperparameters (HPs), which are computationally expensive to tune for large-scale models. Maximal update parameterization ($\mu$P) is a set of scaling rules which aims to make the optimal HPs independent of the model size, thereby allowing the HPs tuned on a smaller (computationally cheaper) model to be transferred to train a larger, target model. Despite promising results for SGD and Adam, deriving $\mu$P for other optimizers is challenging because the underlying tensor programming approach is difficult to grasp. Building on recent work that introduced spectral conditions as an alternative to tensor programs, we propose a novel framework to…
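As a rough illustration of the spectral condition the abstract refers to, the sketch below assumes the form stated in the prior spectral-condition work: a layer's weight matrix should have spectral norm on the order of $\sqrt{n_{out}/n_{in}}$. The helper names `spectral_init` and `spectral_ratio` are hypothetical and do not come from the paper. The check only suggests that a Gaussian initialization with a suitably scaled standard deviation keeps the ratio to that target roughly constant as width grows.

```python
import numpy as np

def spectral_init(n_out, n_in, rng):
    """Gaussian init whose spectral norm scales like sqrt(n_out / n_in).

    Assumed scaling (not taken from the paper's text):
    sigma = min(1, sqrt(n_out / n_in)) / sqrt(n_in).
    """
    sigma = min(1.0, np.sqrt(n_out / n_in)) / np.sqrt(n_in)
    return rng.normal(0.0, sigma, size=(n_out, n_in))

def spectral_ratio(W):
    """Spectral norm of W divided by the target sqrt(n_out / n_in)."""
    n_out, n_in = W.shape
    return np.linalg.norm(W, ord=2) / np.sqrt(n_out / n_in)

rng = np.random.default_rng(0)

# Square hidden layers at increasing width: the ratio should stay O(1).
ratios = {n: spectral_ratio(spectral_init(n, n, rng)) for n in (128, 256, 512)}
for width, ratio in ratios.items():
    print(f"width={width:4d}  spectral norm / target = {ratio:.3f}")
```

The ratio settles near a width-independent constant rather than exactly 1, which is all an order-of-magnitude condition asks for.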
Peer Reviews
Decision·Submitted to ICLR 2026
1. A clear, interpretable derivation that produces width-invariant HPs (esp. LR) for multiple optimizers, broadening μP’s practical coverage with lighter machinery than Tensor Programs. 2. Closed-form LR scalings plus explicit forward scaling and $O(1)$ treatment for LN/bias—usable as a “cookbook.” 3. Multiple widths trained per model family: NanoGPT (128→2048) and LLaMA-2–style (256→2048 ≈154M→1.38B params on WikiText-103). LR–vs–loss sweeps show the same LR tuned on the smallest widt
1. The analysis repackages μP using the published spectral condition and **retains μP’s assumptions**. Under the same assumptions, Tensor Programs could in principle obtain the same optimizer scalings. A genuine advance would relax assumptions or prove depth-scaling for the added optimizers. 2. Core derivations (Result 4.1) use a **linear MLP, batch-1** (rank-1 gradients where spectral≈Frobenius). Transformer attention is not newly analyzed—assumptions are imported and only **validated empiri
1. The proposed framework largely simplifies tensor programs, and the results are clearly presented. The derivations are simple and applicable for a range of optimizers. 2. Beyond recovering the parameterization for AdamW, the paper provides new parameterizations for LAMB and Shampoo.
1. Muon optimizer is mentioned in the introduction, but there is no corresponding derivation for it. See Question 2. 2. The derivations rely on strong simplifications, such as batch size=1, $\beta_1=\beta_2=\epsilon=0$, and the dropping of weight decay. It is questionable whether the derived exponents remain valid when these hyperparameters are changed. 3. The results for Shampoo (Figure 2) do not show a clear zero-shot LR transfer. The losses often worsen with width across the LR grid. See Q
1. Addresses a High-Impact Problem: The computational cost of HP tuning is a significant bottleneck in training large models. $\mu$P is a powerful tool to mitigate this, but its applicability has been limited. Extending $\mu$P to a wider, more modern set of optimizers like LAMB, Sophia, and Shampoo is a valuable and practical contribution to the field. 2. Clear Practical Takeaways: The paper delivers actionable scaling rules for several optimizers (summarized in Table 2). 3. Strong Empirical Val
1. Incremental Novelty: The core conceptual leap, replacing complex tensor programs with more tractable spectral conditions, was introduced by prior work. This paper's main contribution is the application of this existing spectral framework to new optimizers. While this is a useful engineering and analytical contribution, the intellectual novelty of the methodology itself is limited. 2. Repetitive and Padded Structure: The main methodological idea is presented in Section 4.1 as a Generic Framewo
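One review above notes that the core derivations use a batch size of 1, where gradients are rank-1 and the spectral and Frobenius norms coincide. The sketch below is a small numerical check of that remark for a single linear layer. It is not code from the paper, and the variable names and shapes are illustrative assumptions. It also shows that the two norms separate once the batch contains several independent samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch size 1: the weight gradient of a linear layer y = W x is the
# outer product of the backpropagated signal and the input.
x = rng.normal(size=64)        # layer input (illustrative size)
delta = rng.normal(size=32)    # gradient w.r.t. the layer output
grad = np.outer(delta, x)      # shape (32, 64), rank 1

spec = np.linalg.norm(grad, ord=2)       # largest singular value
frob = np.linalg.norm(grad, ord="fro")   # root sum of squared entries
rank = np.linalg.matrix_rank(grad)

# Batch size 8: a sum of 8 outer products is generically rank 8,
# so the spectral norm is strictly smaller than the Frobenius norm.
X = rng.normal(size=(8, 64))
D = rng.normal(size=(8, 32))
grad_b = D.T @ X
spec_b = np.linalg.norm(grad_b, ord=2)
frob_b = np.linalg.norm(grad_b, ord="fro")

print(f"batch 1: rank={rank}, spectral={spec:.4f}, frobenius={frob:.4f}")
print(f"batch 8: spectral={spec_b:.4f}, frobenius={frob_b:.4f}")
```

This matches the reviewers' concern that exponents derived in the rank-1 setting may need separate justification at larger batch sizes.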
Taxonomy
Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Topic Modeling
