The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent

Karthik A. Sankararaman; Soham De; Zheng Xu; W. Ronny Huang; Tom Goldstein

arXiv:1904.06963 · cs.LG · July 8, 2020 · 35 cites

TL;DR

This paper introduces the concept of gradient confusion to analyze how neural network architecture influences training speed. It shows that wider networks train faster while deeper networks train more slowly, and that alternative initialization and architectural techniques can mitigate the slowdown.

Contribution

It provides a formal analysis linking neural architecture, gradient confusion, and training efficiency, and offers practical insights on initialization and network design.

Findings

1. Wider networks exhibit lower gradient confusion and train faster.
2. Deeper networks tend to have higher gradient confusion, slowing training.
3. Alternative initialization and architectural techniques can reduce training difficulty.

Abstract

This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or…
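To make the idea of negatively correlated per-sample gradients concrete, here is a minimal sketch in plain Python. It assumes that confusion can be summarized by the most negative inner product between any two per-sample gradients. The least-squares loss, the toy data, and the function names (`per_sample_gradient`, `min_pairwise_inner_product`, `confusion_level`) are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def per_sample_gradient(w, x, y):
    """Gradient of the toy loss 0.5 * (w.x - y)^2 with respect to w for one sample."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def min_pairwise_inner_product(grads):
    """Smallest inner product over all pairs of distinct per-sample gradients."""
    return min(
        sum(a * b for a, b in zip(g_i, g_j))
        for g_i, g_j in combinations(grads, 2)
    )


def confusion_level(grads):
    """Assumed summary: smallest eta >= 0 such that <g_i, g_j> >= -eta for all i != j."""
    return max(0.0, -min_pairwise_inner_product(grads))


# Hypothetical toy data: (features, target) pairs evaluated at w = 0.
w = [0.0, 0.0]
aligned = [([1.0, 0.0], 1.0), ([2.0, 0.0], 2.0)]       # samples agree on the update direction
conflicting = [([1.0, 0.0], 1.0), ([1.0, 0.0], -1.0)]  # samples pull w in opposite directions

aligned_grads = [per_sample_gradient(w, x, y) for x, y in aligned]
conflicting_grads = [per_sample_gradient(w, x, y) for x, y in conflicting]

print(confusion_level(aligned_grads))      # 0.0
print(confusion_level(conflicting_grads))  # 1.0
```

In the aligned case both samples push the weights in the same direction, so the summary is zero. In the conflicting case a step that helps one sample hurts the other, which is the situation the abstract associates with slower convergence.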

Videos

The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Batch Normalization