Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel
Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma

TL;DR
This paper shows that explicit regularization affects the generalization and sample efficiency of neural networks: a regularized neural net can learn with fewer samples than its induced kernel, the Neural Tangent Kernel (NTK).
Contribution
It introduces new analysis tools for comparing regularized neural nets with kernel methods, and proves that regularized neural nets can be optimized to a global minimum in a polynomial number of iterations.
Findings
On a constructed distribution, the optimal regularized neural net learns with fewer samples than the NTK.
For multi-layer feedforward ReLU nets, the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution.
Gradient descent can efficiently find the regularized global minimum in neural nets.
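The max normalized margin finding above can be illustrated on the simplest homogeneous model, a bias-free linear classifier. The sketch below is not the paper's construction: the toy data `Z`, the helper names `fit` and `normalized_margin`, the step size, and the squared-norm regularizer are all assumptions chosen for illustration. It minimizes a regularized logistic loss by gradient descent for several regularization strengths and records the normalized margin of each solution.

```python
import numpy as np

# Hypothetical toy data: each row is y_i * x_i for a linear, bias-free classifier.
Z = np.array([[1.0, 0.0], [0.0, 1.0], [4.0, 1.0]])

# For this data the max normalized margin is attained at w = (1, 1) / sqrt(2),
# because min_i w.z_i = min(w_1, w_2) whenever w_1, w_2 >= 0.
GAMMA_STAR = 1.0 / np.sqrt(2.0)


def fit(lam, steps=20000, lr=0.2):
    """Gradient descent on mean logistic loss + lam * ||w||^2 (assumed setup)."""
    w = np.zeros(Z.shape[1])
    for _ in range(steps):
        margins = Z @ w
        coef = 1.0 / (1.0 + np.exp(margins))  # sigmoid(-margin)
        grad = -(coef[:, None] * Z).mean(axis=0) + 2.0 * lam * w
        w -= lr * grad
    return w


def normalized_margin(w):
    """Smallest margin of the unit-norm version of w."""
    return float((Z @ w).min() / np.linalg.norm(w))


lambdas = [1.0, 0.1, 0.01]
margins = [normalized_margin(fit(lam)) for lam in lambdas]

if __name__ == "__main__":
    for lam, m in zip(lambdas, margins):
        print(f"lambda={lam:g}  normalized margin={m:.4f}  (max is {GAMMA_STAR:.4f})")
```

As the regularization strength shrinks, the printed normalized margins should increase toward the maximum of 1/sqrt(2) without exceeding it. With strong regularization the solution instead leans toward the average of the data rows, which has a smaller margin.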
Abstract
Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in d dimensions which the optimal regularized neural net learns with O(d) samples but the NTK requires Ω(d^2) samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new…
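Tool (i) in the abstract can be written out as follows for the binary case. This is a sketch under assumed notation, none of which appears in the excerpt above: f(Θ; x) is a positively homogeneous network with parameters Θ, the data (x_i, y_i) with y_i ∈ {−1, +1} are separable by some network, λ > 0 is the regularization strength, and r > 0 is the exponent on the norm.

```latex
% Weakly-regularized logistic (binary cross-entropy) loss
L_\lambda(\Theta) = \frac{1}{n} \sum_{i=1}^{n}
    \log\bigl(1 + \exp(-y_i f(\Theta; x_i))\bigr) + \lambda \lVert \Theta \rVert^{r}

% Normalized margin of a global minimizer \Theta_\lambda of L_\lambda
\gamma_\lambda = \min_{i} \; y_i \, f\!\left(
    \frac{\Theta_\lambda}{\lVert \Theta_\lambda \rVert}; x_i \right)

% Max normalized margin over all parameters of norm at most 1
\gamma^{\star} = \max_{\lVert \Theta \rVert \le 1} \; \min_{i} \; y_i \, f(\Theta; x_i)

% Statement of the tool: the margin of the regularized minimizer
% approaches the max margin as the regularizer is weakened
\gamma_\lambda \to \gamma^{\star} \quad \text{as } \lambda \to 0
```

The word "weakly" in the abstract corresponds to taking λ toward zero: the regularizer is kept in the objective, but its strength is made arbitrarily small.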
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM
Methods: Neural Tangent Kernel
