Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks

Shenxi Wu; Haosong Zhang; Xingjian Ma; Shirui Bian; Yichi Zhang; Xi Chen; Wei Lin

arXiv:2602.07494·cs.LG·February 10, 2026

TL;DR

This paper introduces a universal law for how optimal learning rates should decay with effective depth in non-recurrent multi-path neural networks, enabling reliable zero-shot transfer of hyperparameters across architectures.

Contribution

It unifies the understanding of depth scaling in diverse architectures using a graph-based effective depth and establishes a universal -3/2 power law for learning rate decay under maximal-update criteria.

Findings

1. The optimal learning rate decays with effective depth following a -3/2 power law.
2. Experiments confirm the predicted slope across various architectures.
3. The law enables reliable zero-shot transfer of learning rates across depths and widths.
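
As a rough illustration of what zero-shot transfer under a -3/2 power law could look like in practice, the sketch below rescales a learning rate tuned at one effective depth to another. The function name `transfer_learning_rate`, its arguments, and the example values are hypothetical and not taken from the paper; the sketch also assumes that initialization and width are handled as the paper prescribes, which this summary does not spell out.

```python
def transfer_learning_rate(base_lr: float,
                           base_depth: int,
                           target_depth: int,
                           exponent: float = -1.5) -> float:
    """Rescale a tuned learning rate to a new effective depth.

    Assumes lr is proportional to depth ** exponent, with exponent = -3/2
    as stated in the findings. Names and defaults are illustrative only.
    """
    if base_depth <= 0 or target_depth <= 0:
        raise ValueError("effective depths must be positive")
    # Ratio form: the unknown proportionality constant cancels out.
    return base_lr * (target_depth / base_depth) ** exponent


# Hypothetical example: a rate tuned at effective depth 4, reused at depth 16.
small_model_lr = 1e-3
large_model_lr = transfer_learning_rate(small_model_lr, base_depth=4, target_depth=16)
print(large_model_lr)
```

Writing the rule as a ratio means only the tuned reference point is needed, not the constant in front of the power law.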

Abstract

Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depth scaling is less understood for modern architectures, whose computation graphs contain multiple parallel paths and residual aggregation. To unify various non-recurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce a graph-based notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we show that the optimal learning rate decays with effective depth following a universal -3/2 power law. Here, the maximal-update criterion maximizes the typical one-step representation change at initialization without causing instability, and effective depth is the minimal path length from input to output,…
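
The abstract describes effective depth as a minimal path length from input to output in the computation graph. A minimal sketch of that idea, assuming the graph is given as a list of directed edges and that every edge counts as one step, is a breadth-first search. The abstract is truncated at this point, so the paper may qualify which nodes or edges count; the function name `minimal_path_length` and the toy graph are hypothetical.

```python
from collections import deque


def minimal_path_length(edges, source, sink):
    """Return the number of edges on the shortest directed path source -> sink.

    Assumption: every edge contributes one unit of depth. The paper's full
    definition of effective depth may differ in what it counts.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return distance[node]
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:  # first visit is the shortest in BFS
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    raise ValueError("sink is not reachable from source")


# Toy multi-path graph: a two-layer branch and a one-layer branch in parallel.
toy_edges = [("in", "a"), ("a", "b"), ("b", "out"), ("in", "c"), ("c", "out")]
print(minimal_path_length(toy_edges, "in", "out"))
```

In this toy graph the shorter branch determines the result, which is the sense in which a minimal path length can differ from the nominal layer count of a multi-path network.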

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks