Precise gradient descent training dynamics for finite-width multi-layer neural networks

Qiyang Han; Masaaki Imaizumi

arXiv:2505.04898·cs.LG·May 9, 2025

Open Access

TL;DR

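This paper gives a non-asymptotic distributional characterization of gradient descent iterates for finite-width multi-layer neural networks. It captures Gaussian fluctuations in first-layer weights and supports estimating generalization error during training, which existing infinite-width theories do not cover.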

Contribution

It introduces the first finite-width, non-asymptotic state evolution theory for multi-layer neural networks, going beyond the infinite-width neural tangent kernel (NTK), mean-field (MF), and tensor program (TP) frameworks.

Findings

01

Captures Gaussian fluctuations in first-layer weights.

02

Allows weights to evolve from individual initializations.

03

Enables estimation of generalization error during training.

Abstract

In this paper, we provide the first precise distributional characterization of gradient descent iterates for general multi-layer neural networks under the canonical single-index regression model, in the "finite-width proportional regime" where the sample size and feature dimension grow proportionally while the network width and depth remain bounded. Our non-asymptotic state evolution theory captures Gaussian fluctuations in first-layer weights and concentration in deeper-layer weights, and remains valid for non-Gaussian features. Our theory differs from the existing neural tangent kernel (NTK), mean-field (MF), and tensor program (TP) theories in several key aspects. First, our theory operates in the finite-width regime, whereas these existing theories are fundamentally infinite-width. Second, our theory allows weights to evolve from individual initializations beyond the lazy training…
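
The setting described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it draws single-index data, keeps the sample size and feature dimension comparable while the width stays small, and runs full-batch gradient descent on a two-layer network. The function name `simulate_gd`, the tanh link and activation, the 1/sqrt(d) feature scaling, the noise level, and the learning rate are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def simulate_gd(n=400, d=200, width=4, steps=50, lr=0.1, seed=0):
    """Full-batch gradient descent on a two-layer tanh network fitted to
    single-index data y = tanh(<x, mu>) + noise. Every modelling choice here
    (link, activation, scaling, step size) is an illustrative assumption."""
    rng = np.random.default_rng(seed)

    # Proportional regime: n and d are comparable (n/d = 2 by default),
    # while the network width stays small and fixed.
    X = rng.standard_normal((n, d)) / np.sqrt(d)   # rows have norm close to 1
    mu = rng.standard_normal(d)                    # hidden index direction
    y = np.tanh(X @ mu) + 0.1 * rng.standard_normal(n)

    W = rng.standard_normal((d, width))              # first-layer weights
    a = rng.standard_normal(width) / np.sqrt(width)  # second-layer weights

    losses = []
    for _ in range(steps):
        H = np.tanh(X @ W)              # hidden activations, shape (n, width)
        resid = H @ a - y               # prediction residuals, shape (n,)
        losses.append(0.5 * np.mean(resid ** 2))

        # Gradients of the loss (1 / 2n) * sum_i (a^T tanh(W^T x_i) - y_i)^2,
        # both computed from the current weights before either is updated.
        grad_a = H.T @ resid / n
        grad_W = X.T @ (resid[:, None] * a[None, :] * (1.0 - H ** 2)) / n
        a = a - lr * grad_a
        W = W - lr * grad_W

    # Squared error on fresh data as an empirical proxy for generalization error.
    X_new = rng.standard_normal((n, d)) / np.sqrt(d)
    y_new = np.tanh(X_new @ mu) + 0.1 * rng.standard_normal(n)
    test_err = 0.5 * np.mean((np.tanh(X_new @ W) @ a - y_new) ** 2)
    return W, a, np.array(losses), test_err
```

Calling `simulate_gd()` returns the trained weights, the training-loss trajectory, and a held-out error. Rerunning with several seeds gives an empirical view of how the first-layer weights vary across draws of the data, which is the kind of quantity a distributional theory of the iterates would describe; the sketch itself makes no claim about matching the paper's state evolution.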

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Model Reduction and Neural Networks

Methods: Neural Tangent Kernel · Early Stopping