Understanding the Mechanisms of Fast Hyperparameter Transfer

Nikhil Ghosh; Denny Wu; Alberto Bietti

arXiv:2512.22768·cs.LG·December 30, 2025

Open Access 3 Reviews

TL;DR

This paper develops a theoretical framework for understanding how hyperparameters can be transferred efficiently across model scales, focusing on the role of problem structure and the Maximal Update Parameterization ($\mu$P).

Contribution

It introduces a formal framework for analyzing hyperparameter transfer, clarifies conditions for fast transfer, and investigates the mechanisms behind $\mu$P's effectiveness in large-scale models.

Findings

01

Fast transfer is asymptotically more compute-efficient than direct tuning.

02

Transfer success depends critically on problem structure and parameterization.

03

Empirical evidence supports the decomposition of optimization trajectories into width-stable and width-sensitive components.

Abstract

The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such transfer strategy, we develop a general conceptual framework for reasoning about HP transfer across scale, characterizing transfer as fast when the suboptimality it induces vanishes asymptotically faster than the finite-scale performance gap. We show formally that fast transfer is equivalent to useful transfer for compute-optimal grid search, meaning that transfer is asymptotically more compute-efficient than direct tuning. While empirical work has found that the Maximal Update Parameterization ($\mu$P) exhibits fast transfer when…
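The "fast transfer" criterion in the abstract can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration, not the paper's model or notation: the loss surface `toy_loss`, the drift of the optimal log learning rate in `optimal_log_lr`, and all constants are hypothetical. The sketch compares the extra loss incurred by reusing a small-width optimum at a larger width against the loss gap between the tuned small and large models, under one possible reading of those two quantities.

```python
def optimal_log_lr(width, b=1.0, beta=1.0):
    """Hypothetical optimal log learning rate; drifts toward 0 as width grows."""
    return b * width ** (-beta)


def toy_loss(log_lr, width, l_inf=1.0, a=1.0, alpha=0.5, c=1.0):
    """Hypothetical loss: a width-dependent floor plus a quadratic penalty
    for missing the width-dependent optimal log learning rate."""
    floor = l_inf + a * width ** (-alpha)
    return floor + c * (log_lr - optimal_log_lr(width)) ** 2


def transfer_suboptimality(small, large):
    """Extra loss at width `large` from reusing the optimum tuned at width `small`."""
    transferred = toy_loss(optimal_log_lr(small), large)
    tuned = toy_loss(optimal_log_lr(large), large)
    return transferred - tuned


def finite_scale_gap(small, large):
    """Loss gap between the tuned model at width `small` and at width `large`."""
    return toy_loss(optimal_log_lr(small), small) - toy_loss(optimal_log_lr(large), large)


if __name__ == "__main__":
    target = 2 ** 20  # stand-in for the large-scale model (assumed value)
    for n in (16, 64, 256, 1024):
        ratio = transfer_suboptimality(n, target) / finite_scale_gap(n, target)
        print(f"width={n:5d}  suboptimality/gap={ratio:.3e}")
```

In this toy setting the optimum drifts like $n^{-\beta}$ and the loss floor decays like $n^{-\alpha}$, so the ratio shrinks with the proxy width whenever $2\beta > \alpha$; that shrinking ratio is the sense in which transfer would be called fast here. With a slower drift (for example `beta=0.2`) the ratio would no longer vanish.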

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 1

Strengths

The paper is very well-written and easy to follow. I hardly see any typos. The motivation is well-presented and very clear. Though I am not really an expert in this field, I can follow the paper very easily. I think it is a good paper.

Weaknesses

First things first, since I am not an expert in this field, my opinion might be biased and not fairly evaluate the contribution of this paper.

1. One might argue that the paper only focuses on fast hyperparameter transfer in terms of width. In practice, one might want to know about fast hyperparameter transfer when scaling the depth instead. However, I would not consider it an issue, since it is the weakness of the Tensor Program series in general. Besides, I think that the setting is good eno

Reviewer 02 · Rating 4 · Confidence 2

Strengths

* Novel framework and formulation to try and explain the practical gains of an important theoretical finding in $\mu$Parameterization.

Weaknesses

* Hard to follow the notations in Section 4, especially with the incremental evolution of $\phi$ and its usage.
* Hard to see the practical implications of the finding, especially with the requirement of adjusting $k$ for any new task. Unless the only goal was to *explain* why fast HP transfer shows up in practice, the claim in abstract for validating this for LLMs might need to be revised.
* The understanding of the *main* paper on its own is a bit hard, with adequate parsing of the Appendix

Reviewer 03 · Rating 6 · Confidence 2

Strengths

* Understanding the mechanisms behind hyperparameter transfers from low to large scales is an important issue to study as it can inform future algorithmic directions
* Assumptions are clearly stated
* Existing literature on analysis of hyperparameter transfer across scales and related analysis methods is well covered

Weaknesses

* Unclear if there are any direct practical implications of the introduced conceptual framework
* In line with previous work, the conceptual framework targets a very specific choice of hyperparameters and optimization settings.
* Code to reproduce the experiments is not provided, although some details on the training runs and fixed hyperparameter settings are provided. What are the specifics of the "Llama-style" transformer architecture you used?

Taxonomy

Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms