Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
Ziqiao Wang, Yongyi Mao

TL;DR
This paper explores how stochastic differential equations (SDEs) can be used to understand the generalization behavior of stochastic gradient descent (SGD) by analyzing training trajectories and terminal states through an information-theoretic lens.
Contribution
It introduces novel generalization bounds for SGD by approximating its dynamics with SDEs, using information-theoretic methods for both trajectory and terminal state analyses.
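For orientation, the Xu and Raginsky [2017] bound cited in the abstract is usually stated as below. This is the standard form of that result, not a bound derived in this paper. The symbols are assumptions introduced here: $S$ is the training sample of $n$ i.i.d. points drawn from $\mu$, $W$ is the learned weights, and the loss $\ell(w, Z)$ is assumed $\sigma$-sub-Gaussian under $Z \sim \mu$ for every $w$.

```latex
\left| \operatorname{gen}(\mu, P_{W \mid S}) \right|
  \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S; W)}
```

Here $I(S; W)$ is the mutual information between the training sample and the output weights, so an algorithm whose output depends less on the data has a smaller bound.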
Findings
Trajectory-based bounds outperform previous results.
Terminal-state bounds decay rapidly, similar to stability bounds.
SDE approximation effectively captures SGD's generalization behavior.
Abstract
Stochastic differential equations (SDEs) have recently been shown to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, this provides two opportunities for better understanding the generalization behaviour of SGD through its SDE approximation. First, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain a trajectory-based generalization bound using the information-theoretic bound from Xu and Raginsky [2017]. Second, assuming mild conditions, we estimate the steady-state weight distribution of the SDE and use information-theoretic bounds from Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages, notably the trajectory-based bound…
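The view of SGD as full-batch gradient descent with Gaussian gradient noise can be sketched on a toy problem. The following is a minimal illustration, not the authors' code: it assumes a least-squares loss, minibatches sampled with replacement, and Gaussian noise whose covariance matches the empirical minibatch gradient covariance. All function names (`per_sample_grads`, `sgd_step`, `noisy_gd_step`, `run`, `make_data`) and hyperparameters are hypothetical.

```python
import numpy as np


def per_sample_grads(w, X, y):
    # Gradient of 0.5 * (x_i @ w - y_i)**2 for each sample i; shape (n, d).
    residual = X @ w - y
    return residual[:, None] * X


def sgd_step(w, X, y, lr, batch_size, rng):
    # Plain minibatch SGD; indices are sampled with replacement (assumption).
    idx = rng.integers(0, X.shape[0], size=batch_size)
    return w - lr * per_sample_grads(w, X[idx], y[idx]).mean(axis=0)


def noisy_gd_step(w, X, y, lr, batch_size, rng):
    # Full-batch gradient plus Gaussian noise whose covariance matches that
    # of a minibatch gradient: (per-sample gradient covariance) / batch_size.
    grads = per_sample_grads(w, X, y)
    full_grad = grads.mean(axis=0)
    centered = grads - full_grad
    cov = centered.T @ centered / X.shape[0] / batch_size
    noise = rng.multivariate_normal(np.zeros(w.size), cov)
    return w - lr * (full_grad + noise)


def run(step, X, y, steps=200, lr=0.05, batch_size=8, seed=0):
    # Iterate the chosen update rule from a zero initialization.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = step(w, X, y, lr, batch_size, rng)
    return w


def make_data(n=256, d=5, seed=1):
    # Synthetic linear-regression data with small label noise.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y, w_true


if __name__ == "__main__":
    X, y, w_true = make_data()
    print("SGD error:     ", np.linalg.norm(run(sgd_step, X, y) - w_true))
    print("noisy GD error:", np.linalg.norm(run(noisy_gd_step, X, y) - w_true))
```

Both update rules have the same mean step and the same step covariance, which is the sense in which the Gaussian-noise model stands in for SGD here. The sketch does not compute any of the paper's generalization bounds.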
Taxonomy
Topics: Neural Networks and Applications · Gaussian Processes and Bayesian Inference · Machine Learning and ELM
Methods: Stochastic Gradient Descent
