Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi

TL;DR
This paper introduces a unified parameterisation for scaling and transferring hyperparameters across model modules, sizes, batch, and duration, enabling efficient hyperparameter transfer and improved training speed in large models.
Contribution
It extends previous hyperparameter transfer methods by unifying scaling axes and enabling per-module hyperparameter transfer with practical guidelines.
Findings
Hyperparameter transfer is effective across modules and scales.
Significant training speed improvements in large language models.
Unified parameterisation simplifies high-dimensional hyperparameter optimization.
Abstract
Hyperparameter tuning can dramatically impact training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as µP, have enabled transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of search for optimal global base hyperparameters at a small model size, and transfer to a large size. We extend these works in two key ways. To handle scaling along most important scaling axes, we propose the Complete Parameterisation that unifies scaling in width and depth -- using an adaptation of CompleteP -- as well as in batch-size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape, and propose practical…
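The "tune small, transfer large" practice mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited µP-style rule for Adam, in which the learning rate of hidden weight matrices shrinks in proportion to the width ratio while other parameters keep their base value. This is a simplification for illustration only, not the parameterisation proposed in the paper; the function names and the module names (`embedding`, `attention`, `mlp`) are hypothetical.

```python
def transfer_hidden_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Scale a learning rate tuned at base_width to target_width.

    Assumption: a µP-style rule for Adam on hidden weight matrices,
    lr_target = lr_base * base_width / target_width.
    """
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / target_width


def transfer_per_module_lrs(
    base_lrs: dict[str, float],
    base_width: int,
    target_width: int,
    hidden_modules: set[str],
) -> dict[str, float]:
    """Transfer per-module learning rates found at a small width.

    Modules listed in hidden_modules are rescaled; all others are kept
    unchanged (a simplifying assumption of this sketch).
    """
    return {
        name: transfer_hidden_lr(lr, base_width, target_width)
        if name in hidden_modules
        else lr
        for name, lr in base_lrs.items()
    }


# Hypothetical per-module learning rates found by a search at width 256.
small_model_lrs = {"embedding": 1e-2, "attention": 4e-3, "mlp": 8e-3}
large_model_lrs = transfer_per_module_lrs(
    small_model_lrs,
    base_width=256,
    target_width=1024,
    hidden_modules={"attention", "mlp"},
)
```

In this sketch the search runs only at the small width, and the large model reuses the result through a closed-form rescaling, which is the cost saving that hyperparameter transfer is meant to provide. Scaling in depth, batch size, and training duration would need additional rules that are not shown here.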
Peer Reviews
Decision: ICLR 2026 Poster
* How to set hyperparameters for large models is a crucial question, and investigating per-module hyperparameter optimization at the smaller scale is an important idea.
* A 27% speedup in large-scale training is a substantial improvement, and empirical evidence that per-module HP transfer is useful is important to a larger audience.
* The most closely related work is discussed throughout the paper, and the ideas are well placed in the literature.
* The paper is mostly well written I did not closely chec
* The paper argues at many points that Bayesian optimization is an ill fit for the problem setting, but does not include an actual empirical comparison.
* Saying "Our study covers all optimisation hyperparameters of modern models" in the abstract is too bold a statement. This already breaks down when other optimizers or learning rate schedules are considered, or, e.g., the choice of optimizer becomes a hyperparameter. I suggest making a slight adjustment here.
* Experimental rigor
1) The extension to per-module hyperparameters represents a meaningful advancement beyond prior work on global hyperparameter transfer.
2) Novel weight decay scaling rule for batch size (κ scaling) derived from SDE analysis is theoretically motivated and empirically validated.
3) Comprehensive experimental coverage across multiple scaling dimensions (width, depth, batch size, token horizon). Empirical results show a speedup improvement of up to 27%.
1) Model sizes are relatively limited: while the sizes of the models used in this paper are comparable to CompleteP, µP was evaluated on larger models. Addressing the question of scalability would have strengthened the guidelines proposed by the authors.
2) The evaluation does not include the model's performance on downstream tasks. CompleteP, for example, includes this kind of evaluation. Given that this paper builds directly on said paper, not including this in the evaluation reduces the re
- Extending the state of the art (Depth-μP) to include more recent architectural advances such as QK-normalization makes the proposed hyperparameter transfer framework significantly more practical and up to date.
- Addressing hyperparameter transfer across batch size and token budget is both sensible and novel, as this dimension of scaling has received limited attention in prior work.
- Clarity and notation: Certain sections of the paper are difficult to follow, with inconsistent or undefined symbols. For instance, the variable θ denotes model weights in Section 2 but appears to represent hyperparameters in Section 3. The second paragraph of Section 3.1 in particular is confusing and requires clearer explanations and consistent notation.
- Questionable argument against existing HPO methods: The paper claims that established hyperparameter optimization techniques (e.g., Bayes
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
