Robust Training of Neural Networks Using Scale Invariant Architectures
Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

TL;DR
This paper introduces a scale-invariant neural network architecture and training method that enables SGD to match the robustness and performance of adaptive optimizers like Adam, with added memory efficiency.
Contribution
The paper proposes a novel scale-invariant architecture and training procedure that allows SGD to achieve robustness comparable to adaptive methods, reducing memory usage.
Findings
Scale-invariant BERT (SIBERT) trained with SGD performs comparably to BERT trained with Adam.
Convergence of the proposed method depends only logarithmically on the scale of the initialization and of the loss.
Standard SGD may fail to converge under certain initializations.
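To make the role of initialization scale concrete, here is a minimal NumPy sketch of a scale-invariant function. The toy loss, the names `loss` and `grad`, and the random data are illustrative assumptions, not the paper's architecture. It checks three standard properties: rescaling the weights leaves the loss unchanged, the gradient is orthogonal to the weights, and the gradient shrinks by the same factor the weights grow.

```python
import numpy as np

def loss(w, x, y):
    """Toy scale-invariant loss: only the direction w/||w|| enters the prediction."""
    u = w / np.linalg.norm(w)
    return 0.5 * (x @ u - y) ** 2

def grad(w, x, y):
    """Analytic gradient of `loss` with respect to w."""
    n = np.linalg.norm(w)
    u = w / n
    r = x @ u - y
    # du/dw = (I - u u^T) / ||w||, which projects out the radial direction.
    return r * (x - (x @ u) * u) / n

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # hypothetical weights
x = rng.normal(size=5)   # hypothetical input
y = 0.3                  # hypothetical target
c = 10.0                 # rescaling factor

same_loss = np.isclose(loss(c * w, x, y), loss(w, x, y))
grad_orthogonal = np.isclose(grad(w, x, y) @ w, 0.0)
grad_rescaled = np.allclose(grad(c * w, x, y), grad(w, x, y) / c)
print(same_loss, grad_orthogonal, grad_rescaled)
```

Because the gradient scales inversely with the weight norm, a fixed learning rate takes very large effective steps from a tiny initialization and very small ones from a large initialization, which is one way to read the sensitivity of plain SGD noted above.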
Abstract
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture and make it scale invariant, i.e. the scale of parameter doesn't affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportional to weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is learning rate and $\lambda$ is weight decay. We show that this general approach is robust to rescaling of parameter and…
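The three-step recipe can be sketched on a toy problem as follows. This is a hedged illustration, not the authors' implementation. The clipping threshold `sqrt(2 * wd / lr) * ||w||` is our reading of step (3) and should be treated as an assumption. The toy model, the helper names (`sgd_step`, `train`), the hyperparameters, and the use of full-batch gradients in place of stochastic ones are all assumptions made for brevity.

```python
import numpy as np

def scale_invariant_loss(w, x, y):
    """Mean squared error of a model that depends only on the direction of w."""
    u = w / np.linalg.norm(w)
    return 0.5 * np.mean((x @ u - y) ** 2)

def scale_invariant_grad(w, x, y):
    """Analytic gradient of `scale_invariant_loss` with respect to w."""
    n = np.linalg.norm(w)
    u = w / n
    r = x @ u - y                      # residuals, shape (N,)
    g_u = x.T @ r / len(y)             # gradient with respect to u
    return (g_u - (g_u @ u) * u) / n   # chain rule through u = w / ||w||

def sgd_step(w, g, lr, wd, clip=True):
    """One SGD step with weight decay and optional relative gradient clipping."""
    if clip:
        # Assumed threshold: weight norm times sqrt(2 * wd / lr).
        max_norm = np.sqrt(2.0 * wd / lr) * np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        if g_norm > max_norm:
            g = g * (max_norm / g_norm)
    return (1.0 - lr * wd) * w - lr * g

def train(init_scale, steps=3000, lr=0.1, wd=0.01, seed=0):
    """Train the toy model from a given initialization scale; return final loss."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(64, 8))
    u_star = rng.normal(size=8)
    u_star /= np.linalg.norm(u_star)   # hypothetical ground-truth direction
    y = x @ u_star
    w = init_scale * rng.normal(size=8)
    for _ in range(steps):
        w = sgd_step(w, scale_invariant_grad(w, x, y), lr, wd)
    return scale_invariant_loss(w, x, y)

for scale in (0.1, 1.0, 10.0):
    print(f"init scale {scale:>5}: final loss {train(scale):.3e}")
```

One property of this particular sketch is easy to verify by hand: since the gradient is orthogonal to the weights, a clipped step satisfies $\|w'\|^2 \le (1 + (\eta\lambda)^2)\,\|w\|^2$, so clipping prevents the weight norm from jumping even when a tiny initialization produces a very large gradient.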
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Weight Decay · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Residual Connection · Softmax · Dropout
