Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective

Hiroshi Horii (SU); Sothea Has (KHM)

arXiv:2508.12834 · stat.ML · August 19, 2025

TL;DR

This paper derives a mathematical criterion for the optimal variance of weight initialization in deep neural networks. It treats stochastic gradient descent through its continuous-time (Fokker-Planck) dynamics and minimizes a bound on the expected loss expressed in terms of the initialization parameters.

Contribution

It provides a theoretical formula for choosing initialization variance based on SGD dynamics, validated by experiments on MNIST datasets.

Findings

01 Optimal initialization variance improves final training loss

02 Theoretical bounds match experimental results

03 Guides better weight initialization practices

Abstract

Stochastic gradient descent (SGD), one of the most fundamental optimization algorithms in machine learning (ML), can be recast through a continuous-time approximation as a Fokker-Planck equation for Langevin dynamics, a viewpoint that has motivated many theoretical studies. Within this framework, we study the relationship between the quasi-stationary distribution derived from this equation and the initial distribution through the Kullback-Leibler (KL) divergence. As the quasi-steady-state distribution depends on the expected cost function, the KL divergence eventually reveals the connection between the expected cost function and the initialization distribution. By applying this to deep neural network models (DNNs), we can express the bounds of the expected loss function explicitly in terms of the initialization parameters. Then, by minimizing this bound, we obtain an optimal condition…
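For orientation, the continuous-time picture the abstract refers to can be sketched in its textbook form. This is a sketch under stated assumptions, not the paper's derivation: it assumes isotropic, constant gradient noise with an effective temperature $T$, and it uses a true stationary distribution where the paper works with a quasi-stationary one. The symbols $\theta$, $L$, $T$, $p_0$, $p_s$, $Z$, $H$, $\sigma^2$, and $d$ are illustrative choices and may not match the paper's notation or its actual bound.

```latex
% Assumption: SGD approximated by Langevin dynamics with isotropic noise of temperature T.
\begin{align}
  d\theta_t &= -\nabla L(\theta_t)\,dt + \sqrt{2T}\,dW_t
  && \text{(Langevin dynamics)} \\
  \partial_t p(\theta,t) &= \nabla\cdot\bigl(p(\theta,t)\,\nabla L(\theta)\bigr) + T\,\Delta p(\theta,t)
  && \text{(Fokker-Planck equation)} \\
  p_s(\theta) &= \frac{1}{Z}\,e^{-L(\theta)/T},
  \qquad Z = \int e^{-L(\theta)/T}\,d\theta
  && \text{(stationary solution)} \\
  D_{\mathrm{KL}}(p_0 \,\|\, p_s)
  &= \frac{1}{T}\,\mathbb{E}_{p_0}\!\left[L(\theta)\right] + \log Z - H(p_0)
  && \text{(KL divergence from the initial distribution)}
\end{align}
% Assumption: Gaussian initialization p_0 = N(0, \sigma^2 I_d) in d dimensions,
% whose differential entropy is
\begin{equation}
  H(p_0) = \frac{d}{2}\,\log\!\left(2\pi e\,\sigma^2\right).
\end{equation}
```

The last identity follows directly from substituting $\log p_s = -L/T - \log Z$ into the definition of the KL divergence. It shows, in this simplified setting, how the divergence ties the expected cost under the initial distribution to the initialization variance $\sigma^2$ through the entropy term.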

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Neural Networks and Applications