Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early Stopping

Yingzhen Yang; Ping Li

arXiv:2407.11353 · stat.ML · October 7, 2025

TL;DR

This paper demonstrates that over-parameterized neural networks trained with preconditioned gradient descent and early stopping can achieve sharp nonparametric regression rates, surpassing standard kernel regression and NTK regime results.

Contribution

The authors introduce a novel analysis framework for neural network training, showing improved generalization rates via a new kernel decomposition and local Rademacher complexity control.

Findings

1. Achieves a regression rate of O(n^{-2αs'/(2αs'+1)}) for target functions in the interpolation space.

2. Improves on the currently known nearly-optimal rate, which carries an additional log^2(1/δ) factor, and on standard NTK regression rates.

3. Provides theoretical evidence that PGD enables neural networks to escape the NTK regime.
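
To make the rate in the first finding concrete, the following minimal sketch evaluates the exponent 2αs'/(2αs'+1) and the extra log^2(1/δ) factor attached to the earlier rate. The function names are hypothetical, and the values α = 1 and s' = 3 are illustrative assumptions only; this page does not define α, so they should not be read as the paper's setting.

```python
import math
from fractions import Fraction


def rate_exponent(alpha, s_prime):
    """Exponent r in the rate n^(-r), with r = 2*alpha*s' / (2*alpha*s' + 1)."""
    a = Fraction(alpha) * Fraction(s_prime)
    return 2 * a / (2 * a + 1)


def rate(n, alpha, s_prime):
    """Value of n^(-r), ignoring the constant hidden by the O(.) notation."""
    return n ** (-float(rate_exponent(alpha, s_prime)))


def log_factor(delta):
    """The additional log^2(1/delta) factor carried by the earlier rate."""
    return math.log(1.0 / delta) ** 2


# Illustrative values only (assumed, not taken from the paper).
alpha, s_prime = 1, 3
print(rate_exponent(alpha, s_prime))  # 6/7
print(rate(10**4, alpha, s_prime))
print(log_factor(0.01))
```

The exponent is always below 1 and grows toward 1 as s' increases, so smoother targets (larger s') give faster rates under this formula.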

Abstract

In this paper, we study nonparametric regression using an over-parameterized two-layer neural network trained with algorithmic guarantees. We consider the setting where the training features are drawn uniformly from the unit sphere in $\mathbb{R}^{d}$, and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate of $\mathcal{O}(n^{-\frac{2\alpha s'}{2\alpha s'+1}})$ when the target function is in the interpolation space $[\mathcal{H}_{K}]^{s'}$ with $s' \geq 3$. This rate is even sharper than the currently known nearly-optimal rate of $\mathcal{O}(n^{-\frac{2\alpha s'}{2\alpha s'+1}}) \log^{2}(1/\delta)$ \citep{Li2024-edr-general-domain}, where $n$ is the size of the training data and $\delta \in (0, 1)$ is a…
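
The data assumption in the abstract, features drawn uniformly from the unit sphere in R^d, can be simulated with the standard trick of normalizing a vector of independent standard Gaussians. The sketch below is only an illustration of that assumption; the function names and the seed argument are hypothetical, and nothing here reproduces the paper's PGD algorithm or its experiments.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw one point uniformly from the unit sphere in R^d.

    A standard Gaussian vector is rotation invariant, so dividing it by
    its Euclidean norm gives a uniformly distributed direction.
    """
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        if norm > 0.0:  # guard against the (practically impossible) zero vector
            return [x / norm for x in g]


def sample_features(n, d, seed=0):
    """Return n feature vectors, each of unit Euclidean norm, in R^d."""
    rng = random.Random(seed)
    return [sample_unit_sphere(d, rng) for _ in range(n)]


features = sample_features(n=5, d=3)
for x in features:
    print(x, math.sqrt(sum(v * v for v in x)))
```

Each printed norm should equal 1 up to floating-point rounding.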

Taxonomy

Topics: Neural Networks and Applications · Face and Expression Recognition

Methods: Neural Tangent Kernel · Early Stopping