Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou

TL;DR
This paper classifies depthwise parametrizations of infinite-depth residual networks and introduces Depth-μP. It identifies feature diversity as a crucial factor in deep networks, shows that absolute value nonlinearities improve performance, and discusses fundamental limitations of infinite-depth limits in deep block networks.
Contribution
It extends the μP framework to depthwise parametrizations, introduces Depth-μP for residual networks, and analyzes feature diversity's role in deep learning performance.
Findings
Depth-μP enables depthwise hyperparameter transfer in residual networks.
Absolute value nonlinearities maximize feature diversity and improve performance.
Fundamental limitations exist for infinite-depth limits in deep block networks.
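
The first two findings can be illustrated with a small forward-pass sketch. It assumes a residual network with one linear layer per block and a block multiplier of a/sqrt(L), where L is the depth; that specific form is an assumption based on how Depth-μP is usually stated and is not spelled out in this summary. The learning-rate side of the parametrization is not modeled, and the function names are hypothetical. The sketch only shows that the depth-scaled multiplier keeps activations at a stable scale as blocks are stacked, while an unscaled multiplier lets them grow exponentially with depth.

```python
import math
import random


def depth_scaled_multiplier(depth, a=1.0):
    # Assumed block multiplier a / sqrt(L); not stated in the summary above.
    return a / math.sqrt(depth)


def rms(v):
    # Root-mean-square size of a vector's coordinates.
    return math.sqrt(sum(t * t for t in v) / len(v))


def resnet_forward(x, depth, block_multiplier, nonlinearity=abs, seed=0):
    """One-layer-per-block residual net at random initialization.

    Each block computes h <- h + block_multiplier * W @ phi(h),
    with the entries of W drawn i.i.d. from N(0, 1/n), n = width.
    """
    rng = random.Random(seed)
    n = len(x)
    std = 1.0 / math.sqrt(n)
    h = list(x)
    for _ in range(depth):
        act = [nonlinearity(v) for v in h]
        # Draw each row of W on the fly and take its dot product with act.
        branch = [sum(rng.gauss(0.0, std) * a for a in act) for _ in range(n)]
        h = [hi + block_multiplier * bi for hi, bi in zip(h, branch)]
    return h


if __name__ == "__main__":
    width, depth = 64, 32
    x0 = [1.0 if i % 2 == 0 else -1.0 for i in range(width)]  # rms(x0) == 1
    scaled = resnet_forward(x0, depth, depth_scaled_multiplier(depth))
    unscaled = resnet_forward(x0, depth, 1.0)
    print("rms with a/sqrt(L) multiplier:", rms(scaled))
    print("rms with multiplier 1:", rms(unscaled))
```

The default nonlinearity is the built-in `abs`, matching the absolute value nonlinearity named in the findings; passing a different callable (for example `lambda v: max(v, 0.0)`) swaps it out without changing the rest of the sketch.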
Abstract
By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called μP, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-μP, that extends μP, and show empirically that it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth-μP can be characterized as maximizing both feature learning and feature diversity. Exploiting this,…
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Tensor decomposition and applications
