The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
Karthik A. Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, Tom Goldstein

TL;DR
This paper introduces the concept of gradient confusion to analyze how neural network architecture influences training speed, showing that wider networks train faster while deeper ones slow down, with certain techniques mitigating these effects.
Contribution
It provides a formal analysis linking neural architecture, gradient confusion, and training efficiency, and offers practical insights on initialization and network design.
Findings
Wider networks exhibit lower gradient confusion and train faster.
Deeper networks tend to have higher gradient confusion, slowing training.
Alternative initialization and architectural techniques can reduce training difficulty.
Abstract
This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or…
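The abstract describes gradient confusion only informally, as negative correlation between stochastic gradients from different data samples. A minimal sketch of one way to measure that idea is to compute per-sample gradients and take the most negative pairwise inner product. Everything here is an assumption made for illustration and is not taken from the paper: the toy least-squares model, the function names `per_sample_gradients` and `min_pairwise_inner_product`, and the choice to report `max(0, -min inner product)` as the confusion value.

```python
import numpy as np


def per_sample_gradients(w, X, y):
    """Gradients of f_i(w) = 0.5 * (x_i . w - y_i)^2, one row per sample.

    Toy least-squares model, chosen only for illustration.
    """
    residuals = X @ w - y              # shape (n,)
    return residuals[:, None] * X      # shape (n, d)


def min_pairwise_inner_product(grads):
    """Smallest inner product <g_i, g_j> over all pairs with i != j."""
    gram = grads @ grads.T                          # all pairwise inner products
    off_diagonal = ~np.eye(len(grads), dtype=bool)  # mask out the i == j entries
    return gram[off_diagonal].min()


# Hypothetical data: 8 samples, 5 parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
y = rng.normal(size=8)
w = rng.normal(size=5)

grads = per_sample_gradients(w, X, y)
min_ip = min_pairwise_inner_product(grads)

# Assumed summary statistic: zero when no pair of gradients conflicts,
# positive when at least one pair points in opposing directions.
confusion = max(0.0, -min_ip)
print(f"min pairwise inner product: {min_ip:.4f}, confusion: {confusion:.4f}")
```

A value of zero means every pair of per-sample gradients has a non-negative inner product, so a step that helps one sample does not hurt another to first order. Larger values mean at least one pair of samples pulls the parameters in opposing directions.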
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Batch Normalization
