u-$\mu$P: The Unit-Scaled Maximal Update Parametrization

Charlie Blake; Constantin Eichenberg; Josef Dean; Lukas Balles; Luke Y. Prince; Björn Deiseroth; Andres Felipe Cruz-Salinas; Carlo Luschi; Samuel Weinbach; Douglas Orr

arXiv:2407.17465·cs.LG·January 13, 2025

TL;DR

u-$\mu$P combines the Maximal Update Parametrization ($\mu$P) with Unit Scaling to create a training scheme whose optimal hyperparameters are independent of model size, that trains easily in low precision, and that enables efficient hyperparameter tuning.

Contribution

The paper introduces u-$\mu$P, a scheme that integrates $\mu$P with Unit Scaling, simplifying training and hyperparameter optimization across model sizes.

Findings

01

u-$\mu$P models reach a loss equal to or lower than that of comparable $\mu$P models.

02

u-$\mu$P works out-of-the-box in FP8 precision.

03

The scheme's default hyperparameter values are near-optimal, which allows a simpler and more efficient sweeping strategy.

Abstract

The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a loss that is equal to or lower than comparable $\mu$P models and working out-of-the-box in FP8.
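
The "scale of one" idea in the abstract can be illustrated with a minimal sketch. It is not the paper's or the `unit-scaling` library's implementation. It assumes unit-variance Gaussian inputs and weights, covers only the forward pass of a single matrix multiply, and uses the hypothetical helper names `unit_scaled_matmul` and `output_std`. Dividing the product by $\sqrt{\text{fan-in}}$ keeps the output standard deviation close to one at every width tried here.

```python
import math
import random
import statistics


def unit_scaled_matmul(x, w):
    """Multiply x (batch x fan_in) by w (fan_in x fan_out), scaled by 1/sqrt(fan_in).

    Hypothetical helper for illustration only: with unit-variance inputs and
    weights, the scaling keeps each output element at roughly unit variance.
    """
    fan_in = len(w)
    scale = 1.0 / math.sqrt(fan_in)
    cols = list(zip(*w))  # columns of w, each of length fan_in
    return [[scale * sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in x]


def output_std(fan_in, fan_out=32, batch=64, seed=0):
    """Empirical standard deviation of the scaled matmul output at a given width."""
    rng = random.Random(seed)
    # Assumption: inputs and weights both start as N(0, 1) samples.
    x = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(batch)]
    w = [[rng.gauss(0.0, 1.0) for _ in range(fan_out)] for _ in range(fan_in)]
    y = unit_scaled_matmul(x, w)
    return statistics.pstdev([v for row in y for v in row])


if __name__ == "__main__":
    for width in (16, 64, 256):
        print(width, round(output_std(width), 2))
```

Without the `scale` factor the output standard deviation would grow roughly as $\sqrt{\text{fan-in}}$, which is the kind of width-dependent drift that makes low-precision formats such as FP8 harder to use. The gradient scaling that Unit Scaling also prescribes is not shown in this sketch.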

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01·Rating 8·Confidence 4

Strengths

I am very positive about this paper -- I think it is both a valuable contribution and an important direction for future work, as it is both practically and theoretically motivated and I agree with the concept of unit scale. I also appreciate the demonstration of failed HP transfer of muP for typical Llama-like models, which I have experienced myself. The experiments are very broad and consider not only HP transfer, but dependence between parameters, numerical properties during training, FP8 and

Weaknesses

While I am an advocate for the paper, I want to raise some points and irregularities that came to mind while reading and that I think need to be addressed to improve the work, its insights, or my score. They concern, in particular, the experimental setups:
- Why not use a larger dataset for the HP transfer experiments? I understand it is to compare to the setup of Yang et al., but I am asking because it is really small compared to modern training settings. For instance, in Fig. 4, 34k step

Reviewer 02·Rating 6·Confidence 4

Strengths

I think this is an important research direction with considerable practical value. The paper contains a lot of useful details (many of them in the appendix), which I really appreciate. The embedding scaling rule seems interesting and novel but also controversial.

Weaknesses

- Transfer across batch size, depth, and training steps is not convincing (Fig. 4).
- (Important) The learning rate in the embedding seems unnatural and contradicts the original $\mu$P paper. In the infinite-width setting, the update will go to zero, and the input layer is frozen; this doesn't seem right to me.
- (Important) Everett also studies hyperparameter transfer thoroughly and is highly related to this paper. Please make a more comprehensive comparison. In particular, the mean-field parameterizatio

Reviewer 03·Rating 8·Confidence 3

Strengths

The paper is well written and the idea is clearly delivered. The authors have conducted extensive experiments to compare the performance of the proposed method with the original $\mu$P.

Weaknesses

* As the authors discussed, this work lacks a comparison with other proposed methods (e.g., Large et al., 2024).
* There is no theoretical justification for choosing a different scaling for the embedding learning rate.

Code & Models

Repositories

graphcore-research/unit-scaling
PyTorch·Official
