Speed Limits for Deep Learning
Inbar Seroussi, Alexander A. Alemi, Moritz Helias, Zohar Ringel

TL;DR
This paper applies stochastic thermodynamics to establish theoretical speed limits for training neural networks, revealing that under certain conditions, training approaches optimal efficiency, supported by experiments on CNNs and FCNs.
Contribution
It introduces a novel thermodynamic framework to bound neural network training speed, providing analytical expressions and insights into optimal training regimes.
Findings
Training speed is bounded by thermodynamic principles.
Neural networks can operate near optimal training efficiency.
Experiments confirm the existence of non-optimal and optimal training regimes.
Abstract
State-of-the-art neural networks require extreme computational power to train. It is therefore natural to wonder whether they are optimally trained. Here we apply a recent advancement in stochastic thermodynamics which allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network, based on the ratio of their Wasserstein-2 distance and the entropy production rate of the dynamical process connecting them. Considering both gradient-flow and Langevin training dynamics, we provide analytical expressions for these speed limits for linear and linearizable neural networks, e.g., in the Neural Tangent Kernel (NTK) regime. Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense. Our results are consistent with small-scale experiments with…
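To make the ratio described in the abstract concrete, here is a minimal sketch, not the paper's own expressions. It assumes overdamped Langevin dynamics with a mobility and temperature (both set to 1 here as placeholders), for which the standard transport bound reads total entropy production >= W2^2 / (mobility * temperature * time). It also assumes the initial and final weight distributions are Gaussians with diagonal covariance, so the Wasserstein-2 distance has a closed form. The function names and all numerical values are hypothetical.

```python
import numpy as np


def w2_squared_diag_gaussians(mean_a, var_a, mean_b, var_b):
    """Squared Wasserstein-2 distance between two diagonal-covariance Gaussians.

    Closed form: ||mean_a - mean_b||^2 + ||sqrt(var_a) - sqrt(var_b)||^2.
    """
    mean_a = np.asarray(mean_a, dtype=float)
    mean_b = np.asarray(mean_b, dtype=float)
    std_a = np.sqrt(np.asarray(var_a, dtype=float))
    std_b = np.sqrt(np.asarray(var_b, dtype=float))
    return float(np.sum((mean_a - mean_b) ** 2) + np.sum((std_a - std_b) ** 2))


def min_transport_time(w2_squared, entropy_production, mobility=1.0, temperature=1.0):
    """Lower bound on the time needed to move between two distributions.

    Assumed convention (not taken from the paper): the total entropy
    production S over a duration t obeys S >= W2^2 / (mobility * temperature * t),
    which rearranges to t >= W2^2 / (mobility * temperature * S).
    """
    if entropy_production <= 0:
        raise ValueError("entropy_production must be positive")
    return w2_squared / (mobility * temperature * entropy_production)


# Toy numbers: weights start at the origin and end centered at (3, 4),
# with unit variance before and after, so only the means contribute.
w2_sq = w2_squared_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
t_min = min_transport_time(w2_sq, entropy_production=5.0)
print(w2_sq, t_min)
```

In this picture, a training run whose actual duration is close to `t_min` would be near the speed limit, while a much longer run would dissipate more than the transport between the two distributions strictly requires.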
Peer Reviews
Decision: Submitted to ICLR 2024
The paper introduces an innovative perspective by leveraging the speed limit concept from optimal transport theory to shed light on the training efficiency in deep learning. Given the importance and challenge of characterizing the training speed and efficiency of neural networks, new theoretical insights in this topic are timely and have the potential for significant impact.
However, I have significant concerns about the scope and significance of the results, as well as the validity of certain claims. Additionally, I believe there is ample room for substantial improvement in the writing and the clarity of the presentation. **On Significance** 1. The concept of 'optimal training speed' in the paper is highly specific to the introduced framework and appears to diverge from its conventional interpretation in the optimization literature. Unlike traditional contexts
## Originality To my knowledge, this work is original. But I am not a specialist of statistical physics applied to NNs. ## Clarity The authors made the effort to make their paper understandable to the reader who would not be a specialist in statistical physics applied to NNs. Overall, the paper is easy to read. ## Quality The experimental section, despite being narrow (only one setting has been tested), provides enough results to evaluate the significance and the limitation of the theoretic
## Significance ### Narrowness of the theoretical setting Only two setups have been studied: linear regression and NNs in the NTK regime. Moreover, continuous-time SGD does not faithfully model discrete SGD when training practical NNs on realistic data. The authors also do not discuss how $T_{SL}$ (lower bound on the training time) obtained in the NTK regime compares to a hypothetical $T_{SL}$ obtained in finite-width NNs. Would it be larger or smaller? ... ### Motivation Give
I think this is a technically sound paper that makes good contributions. The application of stochastic thermodynamics concepts to neural network training is novel. I have not seen it before in prior literature. The analysis done in the paper is insightful and, to the best of my knowledge, seems mathematically rigorous. The results on optimal scaling efficiency are intriguing. The writing is clear and relatively compact. It seems that the relevant prior works are cited correctly. Moreover this paper pr
It is unclear whether the near-optimal scaling efficiency result applies to large realistic models and datasets. The CIFAR-10 study used very small networks. It would be very nice to see an empirical example with a larger scope. Further, it would be nice to be more explicit about how much of the entropy production is due to the presence of nonzero initial weights. It would be nice to cover the case of either the perceptron or the NTK starting with $\theta_0 = 0$. It is rather unclear whether there a
Taxonomy
Topics: Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis · Gaussian Processes and Bayesian Inference
Methods: Neural Tangent Kernel · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
