Provable Generalization in Overparameterized Neural Nets
Aviral Dhingra

TL;DR
This paper proposes a new capacity measure based on the effective rank of attention matrices in neural networks, providing a theoretical explanation for their generalization despite overparameterization.
Contribution
It introduces an alternative capacity measure for attention models and derives a generalization bound aligned with empirical scaling laws.
Findings
Effective rank correlates with generalization performance.
Spectral properties of attention matrices help explain why overparameterized neural networks generalize.
Bound matches empirical scaling laws up to logarithmic factors.
Abstract
Deep neural networks often contain far more parameters than training examples, yet they still manage to generalize well in practice. Classical complexity measures such as VC-dimension or PAC-Bayes bounds usually become vacuous in this overparameterized regime, offering little explanation for the empirical success of models like Transformers. In this work, I explore an alternative notion of capacity for attention-based models, based on the effective rank of their attention matrices. The intuition is that, although the parameter count is enormous, the functional dimensionality of attention is often much lower. I show that this quantity leads to a generalization bound whose dependence on sample size matches empirical scaling laws observed in large language models, up to logarithmic factors. While the analysis is not a complete theory of overparameterized learning, it provides evidence that…
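To make the notion of effective rank concrete, here is a minimal sketch in Python with NumPy. The abstract does not state which definition the paper uses, so this assumes one common choice: the exponential of the Shannon entropy of the normalized singular values. The function names (`effective_rank`, `attention_matrix`) and the random queries and keys are hypothetical, chosen for illustration only; they are not taken from the paper and do not reproduce its experiments.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_matrix(q, k):
    """Single-head scaled dot-product attention weights, shape (n, n)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)


def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank (an assumed definition, see lead-in).

    Normalizes the singular values into a probability distribution and
    returns exp(entropy). The result lies between 1 and min(matrix.shape).
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero singular values before taking logs
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# Illustrative input only: random queries and keys, not a trained model.
rng = np.random.default_rng(0)
n, d = 64, 16  # sequence length and head dimension (arbitrary choices)
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))

attn = attention_matrix(q, k)
erank = effective_rank(attn)
print(f"attention matrix shape: {attn.shape}, effective rank: {erank:.2f}")
```

Under this definition a matrix with equal singular values, such as the identity, has effective rank equal to its dimension, while a rank-one matrix has effective rank 1. The quantity therefore measures how many directions the attention map actually uses, independently of how many parameters produced it.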
