TL;DR
This paper provides a theoretical analysis of how data geometry influences the generalization ability of overparameterized neural networks trained below the edge of stability, revealing conditions under which models generalize well or tend to memorize.
Contribution
It introduces new theoretical bounds for two-layer ReLU networks that depend on data geometry, unifying previous empirical observations about generalization and memorization.
Findings
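For data supported on a mixture of low-dimensional balls, generalization bounds adapt to the intrinsic dimension of the data.
For isotropic distributions, rates deteriorate as probability mass concentrates toward the unit sphere.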
Gradient descent learns shared patterns when data is hard to shatter.
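To make the second finding concrete, here is a minimal sketch of an isotropic family whose mass moves toward the unit sphere as a parameter grows. The radial law P(R <= r) = r^alpha, the parameter name `alpha`, and the helper names are illustrative assumptions, not the family defined in the paper.

```python
import math
import random

def sample_isotropic(n, d, alpha, rng):
    """Draw n points in the unit ball of R^d: uniform direction, radius R with
    P(R <= r) = r**alpha. This radial family is a hypothetical choice."""
    points = []
    for _ in range(n):
        # A normalized Gaussian vector is uniformly distributed on the sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        # Inverse-CDF sampling of the radius.
        radius = rng.random() ** (1.0 / alpha)
        points.append([radius * v / norm for v in g])
    return points

def norm_of(p):
    return math.sqrt(sum(v * v for v in p))

def mean_radius(points):
    return sum(norm_of(p) for p in points) / len(points)

def shell_fraction(points, width=0.1):
    """Fraction of points lying within `width` of the unit sphere."""
    return sum(1 for p in points if norm_of(p) >= 1.0 - width) / len(points)

if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (1.0, 5.0, 50.0):
        pts = sample_isotropic(2000, 5, alpha, rng)
        print(alpha, round(mean_radius(pts), 3), round(shell_fraction(pts), 3))
```

Under this assumed law the expected radius is alpha / (alpha + 1), so larger `alpha` places most points in a thin shell near the sphere, which is the regime where the summary says rates deteriorate.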
Abstract
Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias, presenting results for overparameterized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: when the data is harder to "shatter" with respect to the activation thresholds of the…
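The abstract excerpt does not define "below the edge of stability". In the wider literature the term usually refers to gradient descent with step size eta operating where the largest Hessian eigenvalue of the loss (the sharpness) stays under 2 / eta. The sketch below illustrates that threshold on a one-dimensional quadratic, assuming this reading; the function names are hypothetical and the toy loss stands in for a network's training loss.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.
    Returns |x| after `steps` updates; the second derivative of f is `sharpness`."""
    x = x0
    for _ in range(steps):
        x -= step_size * sharpness * x  # gradient of f at x is sharpness * x
    return abs(x)

def below_edge_of_stability(sharpness, step_size):
    """Assumed stability condition: sharpness < 2 / step_size."""
    return sharpness < 2.0 / step_size

if __name__ == "__main__":
    step_size = 0.1  # threshold is 2 / 0.1 = 20
    for sharpness in (5.0, 19.0, 21.0):
        stable = below_edge_of_stability(sharpness, step_size)
        print(sharpness, stable, gd_on_quadratic(sharpness, step_size))
```

Each update multiplies x by (1 - step_size * sharpness), so the iterates shrink exactly when the sharpness is below 2 / step_size and grow once it exceeds that value.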