Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks
Shenxi Wu, Haosong Zhang, Xingjian Ma, Shirui Bian, Yichi Zhang, Xi Chen, Wei Lin

TL;DR
This paper introduces a universal law for how optimal learning rates should decay with effective depth in non-recurrent multi-path neural networks, enabling reliable zero-shot transfer of hyperparameters across architectures.
Contribution
It unifies the understanding of depth scaling in diverse architectures using a graph-based effective depth and establishes a universal -3/2 power law for learning rate decay under maximal-update criteria.
Findings
Optimal learning rate decays with effective depth following a -3/2 power law.
Experiments confirm the predicted slope across various architectures.
Enables reliable zero-shot transfer of learning rates across depths and widths.
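The findings above amount to a simple rescaling rule. If the optimal learning rate is proportional to effective depth raised to the power -3/2, a rate tuned at one depth can be carried to another by multiplying by the depth ratio raised to that power. The sketch below is an illustration under that assumption only, not the authors' code: the names `transfer_lr` and `fit_loglog_slope` are hypothetical, the depths passed in are assumed to be effective depths already computed for each architecture, and the data used in the slope check is synthetic rather than taken from the paper.

```python
import math


def transfer_lr(base_lr, base_depth, target_depth, exponent=-1.5):
    """Rescale a tuned learning rate to a new effective depth.

    Assumes lr_opt(L) is proportional to L**exponent, with exponent = -3/2
    as stated in the summary. All names here are illustrative.
    """
    if base_lr <= 0 or base_depth <= 0 or target_depth <= 0:
        raise ValueError("learning rate and depths must be positive")
    return base_lr * (target_depth / base_depth) ** exponent


def fit_loglog_slope(depths, lrs):
    """Least-squares slope of log(lr) against log(depth).

    A slope near -1.5 would be consistent with the stated power law.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(lr) for lr in lrs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic example: a rate tuned at effective depth 4, moved to depth 16.
    print(transfer_lr(0.01, base_depth=4, target_depth=16))

    # Synthetic sweep generated from the law itself, so the fitted slope
    # only checks the fitting routine, not the paper's claim.
    depths = [2, 4, 8, 16, 32]
    lrs = [transfer_lr(0.01, 4, d) for d in depths]
    print(fit_loglog_slope(depths, lrs))
```

In this sketch, quadrupling the effective depth multiplies the learning rate by 4^(-3/2) = 1/8. On real sweeps, the same slope fit applied to measured optimal learning rates would indicate how closely a given architecture follows the predicted exponent.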
Abstract
Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization (μP) helps explain why many hyperparameters transfer across width. Yet depth scaling is less understood for modern architectures, whose computation graphs contain multiple parallel paths and residual aggregation. To unify various non-recurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce a graph-based notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we show that the optimal learning rate decays with effective depth following a universal -3/2 power law. Here, the maximal-update criterion maximizes the typical one-step representation change at initialization without causing instability, and effective depth is the minimal path length from input to output,…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
