Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)
Guo-qing Jiang, Jinlong Liu, Zixiang Ding, Lin Guo, Wei Lin

TL;DR
This paper introduces VRGD, a variance reduced gradient descent method based on GSNR, which accelerates large batch training, reduces generalization gaps, and enables training with larger batch sizes without accuracy loss.
Contribution
The paper proposes VRGD, a novel gradient descent technique based on GSNR, with theoretical analysis and practical validation for large batch training.
Findings
VRGD accelerates training by 1-2x
Reduces generalization gap by over 65%
Enables larger batch sizes for BERT and DLRM without accuracy loss
Abstract
As models for natural language processing (NLP), computer vision (CV) and recommendation systems (RS) require surging computation, a large number of GPUs/TPUs are run in parallel with a large batch (LB) to improve training throughput. However, training such LB tasks often suffers from a large generalization gap and degraded final precision, which limits how far the batch size can be enlarged. In this work, we develop the variance reduced gradient descent technique (VRGD) based on the gradient signal to noise ratio (GSNR) and apply it to popular optimizers such as SGD/Adam/LARS/LAMB. We carry out a theoretical analysis of the convergence rate to explain its fast training dynamics, and a generalization analysis to demonstrate its smaller generalization gap in LB training. Comprehensive experiments demonstrate that VRGD can accelerate training, narrow the generalization gap and improve final accuracy. We…
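The abstract does not spell out how GSNR enters the update, so the following is only a minimal sketch under stated assumptions. It takes GSNR for each parameter to be the squared mean gradient divided by the gradient variance, estimated across k gradient groups (for example, one per device), and uses it to rescale a plain SGD step element-wise. The function names, the normalization by the mean GSNR, and the clipping range are illustrative choices, not the paper's confirmed algorithm.

```python
import numpy as np

def gsnr(per_group_grads, eps=1e-12):
    """Estimate per-parameter GSNR from k gradient estimates.

    per_group_grads: array of shape (k, d), one gradient per group/device.
    Assumed definition: (mean gradient)^2 / (gradient variance).
    """
    mean = per_group_grads.mean(axis=0)
    var = per_group_grads.var(axis=0)
    return mean ** 2 / (var + eps)

def gsnr_scaled_sgd_step(params, per_group_grads, lr=0.1, floor=0.1):
    """One SGD step with an element-wise GSNR-based scale (hypothetical scheme)."""
    r = gsnr(per_group_grads)
    # Normalize by the mean GSNR, then clip to [floor, 1] so that
    # noisy coordinates are damped but never fully frozen.
    scale = np.clip(r / r.mean(), floor, 1.0)
    return params - lr * scale * per_group_grads.mean(axis=0)

# Two groups agree on coordinate 0 and disagree on coordinate 1.
grads = np.array([[1.0, 0.5],
                  [1.0, -0.5]])
ratios = gsnr(grads)
new_params = gsnr_scaled_sgd_step(np.zeros(2), grads)
```

In this toy case the coordinate on which both groups agree has a far larger GSNR than the one on which they disagree, so it keeps the full learning rate while the noisy coordinate is damped.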
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · WordPiece · Dense Connections · Linear Layer · Softmax · Residual Connection · Attention Dropout · Dropout
