TL;DR
u-μP combines the Maximal Update Parametrization (μP) with Unit Scaling to create a training scheme whose optimal hyperparameters are independent of model size, that is easy to train in low precision, and that enables efficient hyperparameter tuning.
Contribution
The paper introduces u-μP, a novel scheme that integrates μP with Unit Scaling, simplifying training and hyperparameter optimization across model sizes.
Findings
u-μP models achieve equal or lower loss than μP models.
u-μP works effectively in FP8 precision.
The scheme simplifies hyperparameter sweeping and training procedures.
Abstract
The Maximal Update Parametrization (μP) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-μP, which improves upon μP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: μP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-μP models reaching a loss that is equal to or lower than comparable μP models and working out-of-the-box in FP8.
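The "scale of one" property mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: it only shows the forward pass of a single linear layer, assuming unit-variance inputs, unit-variance weights, and a fixed 1/sqrt(fan_in) output multiplier. The names `unit_scaled_linear` and `output_std` are hypothetical helpers introduced here for illustration.

```python
import numpy as np


def unit_scaled_linear(x, w):
    """Linear layer forward pass with a fixed 1/sqrt(fan_in) output scale.

    Assumption for this sketch: x and w both have unit-variance entries,
    so the scaled output also has roughly unit variance.
    """
    fan_in = w.shape[0]
    return (x @ w) / np.sqrt(fan_in)


def output_std(width, batch=256, seed=0):
    """Standard deviation of the layer output for a square layer of this width."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))  # unit-scale activations
    w = rng.standard_normal((width, width))  # unit-scale weights
    return float(unit_scaled_linear(x, w).std())


# The output scale should stay close to 1 regardless of width.
stds = {width: output_std(width) for width in (64, 256, 1024)}
for width, std in stds.items():
    print(f"width={width:5d}  output std={std:.3f}")
```

Without the 1/sqrt(fan_in) multiplier the output standard deviation would grow as sqrt(width), which is the kind of size-dependent scale that both techniques are designed to avoid.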
Peer Reviews
Decision: ICLR 2025 Spotlight
I am very positive about this paper -- I think it is both a valuable contribution and an important direction for future work, as it is both practically and theoretically motivated and I agree with the concept of unit scale. I also appreciate the demonstration of failed HP transfer of muP for typical Llama-like models, which I have experienced myself. The experiments are very broad and consider not only HP transfer, but dependence between parameters, numerical properties during training, FP8 and
While I am an advocate for the paper, I want to raise some points/irregularities that came to my mind while reading and I think would need to be addressed, both to improve the work, its insights or my score. They concern, in particular, the experimental setups: - Why not use a larger dataset for the HP transfer experiments? I understand it is to compare to the setup of Yang et al., but I am asking because it is really small compared to modern training settings. For instance, in Fig. 4, 34k step
I think this is an important research direction with a lot of practical value. The paper contains many useful details (many of them are in the appendix). I really appreciate that. The embedding scaling rule seems interesting and novel but also controversial.
- Transfer across batch, depth, training steps is not convincing (Fig. 4). - (important) The learning rate in the embedding seems unnatural and contradicts the original mup paper. In the infinite-width setting, the update will go to zero, and the input layer is frozen; this doesn't seem right to me. - (Important) Everett also studies hyperparameter transfer thoroughly and is highly related to this paper. Please make a more comprehensive comparison. In particular, the mean-field parameterizatio
The paper is well-written and the idea is clearly delivered. The authors have conducted extensive experiments to compare the performance of the proposed methods with the original $\mu P$.
* As the authors discussed, this work lacks a comparison with other proposed methods (e.g., Large et al., 2024). * There is no theoretical justification for choosing a different scaling for the embedding learning rate.
Code & Models
- 🤗 Aleph-Alpha/umup-research-7b-bf16 (model)
- 🤗 Aleph-Alpha/umup-research-7b-fp8 (model) · ♡ 3
- 🤗 Aleph-Alpha/sp-baseline-research-7b-bf16 (model)
- 🤗 Aleph-Alpha/sp-baseline-research-1b-bf16 (model)
- 🤗 Aleph-Alpha/umup-research-1b-bf16 (model)
- 🤗 Aleph-Alpha/umup-research-1b-fp8 (model)
- 🤗 Aleph-Alpha/sp-baseline-research-3b-bf16 (model)
- 🤗 Aleph-Alpha/umup-research-3b-bf16 (model)
- 🤗 Aleph-Alpha/umup-research-3b-fp8 (model)
- 🤗 IlPakoZ/RNA-BERTa9700 (model) · 5 dl · ♡ 2
