Thermodynamic Natural Gradient Descent
Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles

TL;DR
This paper introduces a hybrid digital-analog approach to natural gradient descent (NGD) that exploits the thermodynamic properties of an analog system at equilibrium, enabling second-order neural network training at a per-iteration complexity comparable to that of first-order methods.
Contribution
It proposes a novel hybrid algorithm combining digital and analog computing for NGD, reducing computational costs and exploiting thermodynamic properties for neural network training.
Findings
Superiority over state-of-the-art digital methods in classification tasks
Effective fine-tuning of language models demonstrated
Analog thermodynamic system enables efficient second-order optimization
Abstract
Second-order training methods have better convergence properties than gradient descent but are rarely used in practice for large-scale training due to their computational overhead. This can be viewed as a hardware limitation (imposed by digital computers). Here we show that natural gradient descent (NGD), a second-order method, can have a similar computational complexity per iteration to a first-order method, when employing appropriate hardware. We present a new hybrid digital-analog algorithm for training neural networks that is equivalent to NGD in a certain parameter regime but avoids prohibitively costly linear system solves. Our algorithm exploits the thermodynamic properties of an analog system at equilibrium, and hence requires an analog thermodynamic computer. The training occurs in a hybrid digital-analog loop, where the gradient and Fisher information matrix (or any other…
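The abstract's central idea can be illustrated with a minimal sketch, under stated assumptions. An NGD step needs the vector x that solves the linear system F x = g, where F is the Fisher information matrix and g is the gradient. For a positive-definite F, the overdamped Langevin (Ornstein-Uhlenbeck) process dx = -(F x - g) dt + sqrt(2/beta) dW has stationary mean F^{-1} g, so time-averaging its trajectory estimates the natural gradient without an explicit solve. The 2x2 matrix, the step sizes, and all function names below are illustrative assumptions rather than values from the paper, and the digital Euler-Maruyama simulation only stands in for the analog thermodynamic hardware.

```python
import math
import random


def matvec(A, x):
    """Multiply a dense matrix (given as a list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def simulate_equilibrium_mean(F, g, beta=100.0, dt=0.01,
                              burn_in=2000, n_samples=20000, seed=0):
    """Euler-Maruyama simulation of dx = -(F x - g) dt + sqrt(2/beta) dW.

    Returns the time average of x after a burn-in period, which
    estimates the stationary mean F^{-1} g (F must be positive definite).
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(g)
    x = [0.0] * n
    noise_scale = math.sqrt(2.0 * dt / beta)
    running_sum = [0.0] * n
    for step in range(burn_in + n_samples):
        Fx = matvec(F, x)
        # Drift pulls x toward the solution of F x = g; noise models thermal fluctuations.
        x = [x_i - dt * (Fx_i - g_i) + noise_scale * rng.gauss(0.0, 1.0)
             for x_i, Fx_i, g_i in zip(x, Fx, g)]
        if step >= burn_in:
            running_sum = [s + x_i for s, x_i in zip(running_sum, x)]
    return [s / n_samples for s in running_sum]


def solve_2x2(F, g):
    """Exact 2x2 solve via Cramer's rule, used only as a reference."""
    (a, b), (c, d) = F
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]


# Hypothetical toy problem: a small positive-definite "Fisher" matrix and gradient.
F = [[2.0, 0.5], [0.5, 1.0]]
g = [1.0, -1.0]

natural_grad_estimate = simulate_equilibrium_mean(F, g)
natural_grad_exact = solve_2x2(F, g)

# One NGD-style parameter update using the sampled estimate.
theta = [0.3, -0.2]
learning_rate = 0.1
theta_next = [t - learning_rate * d for t, d in zip(theta, natural_grad_estimate)]
```

In this sketch the digital side supplies F and g and applies the parameter update, while the simulated dynamics play the role of the analog solve. The estimate is noisy, and its error shrinks as beta grows or as the averaging window lengthens.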
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
1. The presented approach seems to be compatible with alternative (to GPUs) hardware, although I’m not familiar with thermodynamic computing to judge it properly. 2. The proposed natural gradient approximation is theoretically justified.
Evaluation issues: While the paper’s approach is interesting, the evaluations could be significantly improved. First, MNIST performance. Fig. 3b shows that 1. Adam and TNGD achieve the same test error (line 367 claims “TNGD outperforms Adam”, and line 371 then claims TNGD achieves better test accuracy, which is clearly not the case in Fig. 3b); 2. TNGD overfits significantly more than Adam (0 train error for TNGD vs. train error somewhat close to test error for Adam), even though line 368 claims TNGD “generalizes better”.
This paper proposes a new algorithm that combines analog and digital computation to reduce the per-iteration cost of NGD, and performed thorough comparison of the scaling of the runtime between their method and various existing methods. The motivation and the method are clearly conveyed. The authors also performed several numerical experiments comparing the existing methods and TNGD.
The paper could be written more clearly and concisely. I find the introduction of certain aspects too detailed and not necessarily helpful for understanding the authors’ contribution (such as the Fisher information etc.) and certain aspects too unclear for people not coming from an optimization background (such as the introduction of fast matrix vector product, the convergence speed and performance is only vaguely discussed). Also, this paper uses many different notations, and the introduction o
I think this paper is very timely, and tackles an important topic. Using hardware (in this case, fundamental physics) to effectively compute will be essential to deep learning going forward. The paper is for the most part nicely written, and straightforward to follow. The experiments are convincing.
There are some terminological concerns/questions (below), which would improve the paper if addressed.
Taxonomy
Topics: Welding Techniques and Residual Stresses · Radiative Heat Transfer Studies
Methods: Natural Gradient Descent
