Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale
Piotr Zielinski, Shankar Krishnan, Satrajit Chatterjee

TL;DR
This paper validates the Coherent Gradients hypothesis at scale by developing new methods to suppress weak gradient directions, demonstrating improved generalization and reduced memorization in large neural networks trained on ImageNet.
Contribution
It introduces scalable algorithms for suppressing weak gradient directions, providing strong empirical evidence for CGH in large-scale neural network training.
Findings
Suppression of weak gradient directions reduces overfitting.
Easy examples tend to generalize better and are learned earlier.
New methods enable validation of CGH on large datasets like ImageNet.
Abstract
Coherent Gradients (CGH) is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. The key insight of CGH is that, since the overall gradient for a single step of SGD is the sum of the per-example gradients, it is strongest in directions that reduce the loss on multiple examples if such directions exist. In this paper, we validate CGH on ResNet, Inception, and VGG models on ImageNet. Since the techniques presented in the original paper do not scale beyond toy models and datasets, we propose new methods. By posing the problem of suppressing weak gradient directions as a problem of robust mean estimation, we develop a coordinate-based median of means approach. We present two versions of this algorithm, M3, which partitions a mini-batch into 3…
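The abstract's idea of treating weak-direction suppression as robust mean estimation can be illustrated with a small sketch. The abstract is truncated here, so this is not the paper's M3 algorithm; it is a generic coordinate-wise median of means, assuming per-example gradients are available as a NumPy array and that the batch is split into three groups. The function name `median_of_means` and the toy data are hypothetical.

```python
import numpy as np

def median_of_means(per_example_grads, num_groups=3):
    """Coordinate-wise median of group means (illustrative sketch only).

    per_example_grads: array of shape (batch_size, num_params).
    num_groups: number of groups the batch is split into (assumed 3 here).
    """
    grads = np.asarray(per_example_grads, dtype=float)
    if grads.shape[0] < num_groups:
        raise ValueError("need at least one example per group")
    # Split the batch into groups and average the gradients within each group.
    groups = np.array_split(grads, num_groups, axis=0)
    group_means = np.stack([g.mean(axis=0) for g in groups])
    # Take the median across groups, independently for every coordinate.
    return np.median(group_means, axis=0)

# Toy batch: coordinate 0 is supported by every example (a "strong" direction),
# coordinate 1 is supported by a single example (a "weak" direction).
grads = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 6.0],
    [1.0, 0.0],
])

plain_mean = grads.mean(axis=0)   # the single example still moves coordinate 1
robust = median_of_means(grads)   # the single example's contribution is dropped
print(plain_mean, robust)
```

In this toy batch the plain mean keeps a nonzero component along the coordinate that only one example supports, while the median of group means zeroes it out and leaves the shared coordinate unchanged.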
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Batch Normalization · 1x1 Convolution · Average Pooling · Kaiming Initialization · Residual Block · Residual Connection · Bottleneck Residual Block · Global Average Pooling · Convolution
