Proportionate gradient updates with PercentDelta

Sami Abu-El-Haija

arXiv:1708.07227 · cs.LG · August 25, 2017 · 5 cites

TL;DR

This paper introduces PercentDelta, a method that scales gradient updates to ensure proportional changes across all layers, resulting in faster training and higher accuracy in neural networks.

Contribution

It proposes a novel gradient scaling technique that equalizes layer-wise relative changes, improving training speed and accuracy.

Findings

01

Gradient magnitudes differ greatly across layers, so some layers train more slowly than others.

02

PercentDelta effectively equalizes layer updates during training.

03

The method achieves faster convergence and higher test accuracy on MNIST.

Abstract

Deep Neural Networks are generally trained using iterative gradient updates. Magnitudes of gradients are affected by many factors, including choice of activation functions and initialization. More importantly, gradient magnitudes can greatly differ across layers, with some layers receiving much smaller gradients than others, causing some layers to train slower than others and therefore slowing down the overall convergence. We analytically explain this disproportionality. Then we propose to explicitly train all layers at the same speed, by scaling the gradient w.r.t. every trainable tensor to be proportional to its current value. In particular, at every batch, we want to update all trainable tensors such that the relative change of the L1-norm of the tensors is the same across all layers of the network, throughout training time. Experiments on MNIST show that our method appropriately…
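
The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's exact update rule, which the truncated abstract does not spell out. It assumes one simple reading: each tensor's gradient is rescaled so that the L1 norm of its update is a fixed fraction of the tensor's own L1 norm. The function name `proportional_update` and the parameters `rate` and `eps` are hypothetical and chosen only for illustration.

```python
import numpy as np

def proportional_update(weights, grads, rate=0.01, eps=1e-12):
    """Apply one update step per tensor (illustrative sketch, not the paper's rule).

    Assumption: each gradient is rescaled so that
    ||update||_1 / ||tensor||_1 is approximately `rate` for every tensor,
    regardless of how large or small the raw gradient is.
    """
    updated = []
    for w, g in zip(weights, grads):
        w_norm = np.abs(w).sum()              # L1 norm of the current tensor
        g_norm = np.abs(g).sum()              # L1 norm of its gradient
        scale = rate * w_norm / (g_norm + eps)  # eps guards against a zero gradient
        updated.append(w - scale * g)
    return updated

# Two "layers" whose raw gradient magnitudes differ by several orders of magnitude.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
grads = [1e-4 * rng.normal(size=(4, 3)), 10.0 * rng.normal(size=(3, 2))]

new_weights = proportional_update(weights, grads, rate=0.01)

# Relative L1 change of each tensor after the step.
relative_changes = [
    np.abs(n - w).sum() / np.abs(w).sum()
    for n, w in zip(new_weights, weights)
]
print(relative_changes)
```

Under this assumption both tensors move by roughly the same relative amount even though their raw gradients are on very different scales, which is the "same speed across layers" behaviour the abstract describes.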

Taxonomy

Topics: Tensor decomposition and applications · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications