Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States

Ziqiao Wang; Yongyi Mao

arXiv:2211.10691 · cs.LG · June 11, 2024

Open Access

TL;DR

This paper explores how stochastic differential equations (SDEs) can be used to understand the generalization behavior of stochastic gradient descent (SGD) by analyzing training trajectories and terminal states through an information-theoretic lens.

Contribution

It introduces novel generalization bounds for SGD by approximating its dynamics with SDEs, using information-theoretic methods for both trajectory and terminal state analyses.

Findings

01 Trajectory-based bounds outperform previous results.

02 Terminal-state bounds decay rapidly, similar to stability bounds.

03 SDE approximation effectively captures SGD's generalization behavior.

Abstract

Stochastic differential equations (SDEs) have recently been shown to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, it provides two opportunities for better understanding the generalization behaviour of SGD through its SDE approximation. Firstly, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain a trajectory-based generalization bound using the information-theoretic bound from Xu and Raginsky [2017]. Secondly, assuming mild conditions, we estimate the steady-state weight distribution of the SDE and use information-theoretic bounds from Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages, notably the trajectory-based bound…
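
The idea of viewing SGD as full-batch gradient descent with Gaussian gradient noise can be illustrated with a minimal sketch. This is not the paper's code: the function names, the squared loss, and the with-replacement minibatch sampling (which gives a noise covariance equal to the per-sample gradient covariance divided by the batch size) are all assumptions made for illustration.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Gradients of the squared loss 0.5 * (x @ w - y)**2, one row per example."""
    residual = X @ w - y              # shape (n,)
    return residual[:, None] * X      # shape (n, d)

def sgd_step(w, X, y, lr, batch_size, rng):
    """One plain minibatch SGD step (indices drawn with replacement, an assumption)."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    return w - lr * per_sample_grads(w, X[idx], y[idx]).mean(axis=0)

def gaussian_noise_gd_step(w, X, y, lr, batch_size, rng):
    """One full-batch GD step plus Gaussian noise matched to the minibatch gradient covariance."""
    grads = per_sample_grads(w, X, y)
    full_grad = grads.mean(axis=0)
    centered = grads - full_grad
    # Covariance of a size-b minibatch gradient mean under with-replacement sampling.
    cov = centered.T @ centered / (X.shape[0] * batch_size)
    # Symmetric PSD square root via eigendecomposition; clip tiny negative eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    sqrt_cov = (eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T
    noise = sqrt_cov @ rng.standard_normal(w.shape[0])
    return w - lr * full_grad + lr * noise
```

Both step functions have the same mean update and the same one-step covariance at a given `w`; only the shape of the noise distribution differs, which is the substitution the trajectory-based analysis relies on.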

Taxonomy

Topics: Neural Networks and Applications · Gaussian Processes and Bayesian Inference · Machine Learning and ELM

Methods: Stochastic Gradient Descent