Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale

Piotr Zielinski; Shankar Krishnan; Satrajit Chatterjee

arXiv:2003.07422·cs.LG·July 22, 2020·5 cites

Open Access

TL;DR

This paper validates the Coherent Gradients hypothesis at scale by developing new methods to suppress weak gradient directions, demonstrating improved generalization and reduced memorization in large neural networks trained on ImageNet.

Contribution

The paper introduces scalable algorithms for suppressing weak gradient directions and uses them to provide empirical evidence for the Coherent Gradients hypothesis (CGH) in large-scale neural network training.

Findings

01. Suppression of weak gradient directions reduces overfitting.

02. Easy examples tend to generalize better and are learned earlier.

03. New methods enable validation of CGH on large datasets like ImageNet.

Abstract

Coherent Gradients (CGH) is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. The key insight of CGH is that, since the overall gradient for a single step of SGD is the sum of the per-example gradients, it is strongest in directions that reduce the loss on multiple examples if such directions exist. In this paper, we validate CGH on ResNet, Inception, and VGG models on ImageNet. Since the techniques presented in the original paper do not scale beyond toy models and datasets, we propose new methods. By posing the problem of suppressing weak gradient directions as a problem of robust mean estimation, we develop a coordinate-based median of means approach. We present two versions of this algorithm, M3, which partitions a mini-batch into 3…
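The coordinate-based median-of-means idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's M3 implementation, whose description is cut off above. It is a minimal pure-Python example under stated assumptions: the function names `mean_gradient` and `median_of_means`, the round-robin split into groups, and the toy per-example gradients are all hypothetical choices made for illustration.

```python
from statistics import median


def mean_gradient(grads):
    """Coordinate-wise mean of a list of equal-length gradient vectors."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]


def median_of_means(per_example_grads, num_groups=3):
    """Coordinate-wise median of group means.

    Illustrative only: the round-robin grouping is an assumption,
    not taken from the paper.
    """
    if len(per_example_grads) < num_groups:
        raise ValueError("need at least one example per group")
    # Split the mini-batch into num_groups groups (round-robin).
    groups = [per_example_grads[i::num_groups] for i in range(num_groups)]
    # Average the per-example gradients within each group.
    group_means = [mean_gradient(g) for g in groups]
    # Take the median across groups, independently for each coordinate.
    return [median(col) for col in zip(*group_means)]


# Toy mini-batch of six 2-D per-example gradients (made-up numbers).
# Coordinate 0 is shared by every example (a "strong" direction);
# coordinate 1 is driven by a single example (a "weak" direction).
batch = [
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 9.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
]

plain = mean_gradient(batch)     # ordinary mini-batch average
robust = median_of_means(batch)  # median of three group means

print(plain)
print(robust)
```

In this toy batch, the ordinary average keeps a nonzero component along the coordinate supported by one example, while the median of the three group means zeroes it and leaves the shared coordinate unchanged. That is the sense in which a robust mean estimate can suppress weak gradient directions.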

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Batch Normalization · 1x1 Convolution · Average Pooling · Kaiming Initialization · Residual Block · Residual Connection · Bottleneck Residual Block · Global Average Pooling · Convolution