Robust Training of Neural Networks Using Scale Invariant Architectures

Zhiyuan Li; Srinadh Bhojanapalli; Manzil Zaheer; Sashank J. Reddi; Sanjiv Kumar

arXiv:2202.00980 · cs.LG · July 20, 2022 · 1 citation

TL;DR

This paper introduces a scale-invariant neural network architecture and training method that enables SGD to match the robustness and performance of adaptive optimizers like Adam, with added memory efficiency.

Contribution

The paper proposes a novel scale-invariant architecture and training procedure that allows SGD to achieve robustness comparable to adaptive methods, reducing memory usage.

Findings

01

Scale-invariant BERT (SIBERT), trained with plain SGD, performs comparably to BERT trained with Adam.

02

The convergence of the proposed method depends only logarithmically on the scale of the initialization and of the loss.

03

Standard SGD may fail to converge under certain initializations.

Abstract

In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture and make it scale invariant, i.e., the scale of the parameters doesn't affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportional to the weight norm multiplied by $\sqrt{\frac{2\lambda}{\eta}}$, where $\eta$ is the learning rate and $\lambda$ is the weight decay. We show that this general approach is robust to rescaling of parameter and…
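
The three-step recipe in the abstract can be illustrated with a minimal sketch on a toy problem. This is not the authors' implementation. The toy loss, the helper names (`loss`, `grad`, `clip_relative`, `sgd_wd_step`), and the `clip_factor` parameter are all hypothetical. The clipping threshold is assumed to be $\sqrt{2\lambda/\eta}$ times the weight norm, which is the value at which a clipped step roughly balances the shrinkage from weight decay.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def loss(w, u):
    """Toy scale-invariant loss: 1 - cos(angle between w and unit vector u)."""
    return 1.0 - dot(w, u) / norm(w)

def grad(w, u):
    """Gradient of `loss`; it is orthogonal to w and scales as 1/||w||."""
    n = norm(w)
    cos = dot(w, u) / n
    return [-(ui - cos * wi / n) / n for wi, ui in zip(w, u)]

def clip_relative(g, w, lr, wd, clip_factor=1.0):
    """Rescale g so that ||g|| <= clip_factor * sqrt(2*wd/lr) * ||w||.

    `clip_factor` is a hypothetical proportionality constant.
    """
    max_norm = clip_factor * math.sqrt(2.0 * wd / lr) * norm(w)
    g_norm = norm(g)
    if g_norm <= max_norm:
        return list(g)
    return [gi * (max_norm / g_norm) for gi in g]

def sgd_wd_step(w, g, lr, wd):
    """Plain SGD with weight decay: w <- (1 - lr*wd) * w - lr * g."""
    return [(1.0 - lr * wd) * wi - lr * gi for wi, gi in zip(w, g)]

u = [0.6, 0.8]        # unit-norm target direction (assumed for the toy problem)
w = [1e-3, -5e-4]     # deliberately tiny initialization
lr, wd = 0.1, 0.1

g = clip_relative(grad(w, u), w, lr, wd)
w_next = sgd_wd_step(w, g, lr, wd)
```

In this toy setting the gradient is orthogonal to the weights, so a clipped step with `clip_factor = 1` multiplies the squared weight norm by at most $(1-\eta\lambda)^2 + 2\eta\lambda = 1 + \eta^2\lambda^2$. The same step without clipping would increase the norm of this tiny initialization by several orders of magnitude, because the gradient of a scale-invariant function grows as $1/\|w\|$.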


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Weight Decay · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Residual Connection · Softmax · Dropout