Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
Hiroshi Horii (SU), Sothea Has (KHM)

TL;DR
This paper derives a mathematical criterion for the optimal weight-initialization variance in deep neural networks by treating SGD as continuous-time Langevin dynamics and minimizing the resulting bound on the expected loss.
Contribution
It provides a theoretical formula for choosing initialization variance based on SGD dynamics, validated by experiments on MNIST datasets.
Findings
Optimal initialization variance improves final training loss
Theoretical bounds match experimental results
Guides better weight initialization practices
Abstract
Stochastic gradient descent (SGD), one of the most fundamental optimization algorithms in machine learning (ML), can be recast through a continuous-time approximation as a Fokker-Planck equation for Langevin dynamics, a viewpoint that has motivated many theoretical studies. Within this framework, we study the relationship between the quasi-stationary distribution derived from this equation and the initial distribution through the Kullback-Leibler (KL) divergence. As the quasi-steady-state distribution depends on the expected cost function, the KL divergence eventually reveals the connection between the expected cost function and the initialization distribution. By applying this to deep neural network models (DNNs), we can express the bounds of the expected loss function explicitly in terms of the initialization parameters. Then, by minimizing this bound, we obtain an optimal condition…
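A toy illustration of the idea in the abstract, not the paper's actual derivation: assume a quadratic loss with a single curvature, so that Langevin dynamics at a fixed temperature has an isotropic Gaussian stationary distribution with variance temperature / curvature. The KL divergence from a zero-mean Gaussian initialization to that distribution then has a closed form and is smallest when the two variances match. The names `temperature`, `curvature`, and the functions below are hypothetical and chosen only for this sketch.

```python
import math


def stationary_variance(temperature, curvature):
    # Assumption: loss L(theta) = 0.5 * curvature * |theta|^2 and dynamics
    # d(theta) = -grad L dt + sqrt(2 * temperature) dW, whose stationary
    # distribution is N(0, (temperature / curvature) * I).
    return temperature / curvature


def kl_isotropic_gaussians(sigma2_init, sigma2_target, dim):
    # KL( N(0, sigma2_init * I) || N(0, sigma2_target * I) ) in `dim` dimensions.
    ratio = sigma2_init / sigma2_target
    return 0.5 * dim * (ratio - 1.0 - math.log(ratio))


def best_init_variance(candidates, sigma2_target, dim):
    # Pick the candidate initialization variance with the smallest KL divergence.
    return min(
        candidates,
        key=lambda s2: kl_isotropic_gaussians(s2, sigma2_target, dim),
    )


if __name__ == "__main__":
    target = stationary_variance(temperature=0.1, curvature=0.2)
    for s2 in (0.25, 0.5, 1.0, 2.0):
        print(s2, kl_isotropic_gaussians(s2, target, dim=100))
    print("best:", best_init_variance([0.25, 0.5, 1.0, 2.0], target, dim=100))
```

In this simplified setting the minimizer is simply the stationary variance itself. The paper works with the quasi-stationary distribution of a DNN loss, where the resulting condition is not this simple.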
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Neural Networks and Applications
