Adaptive Loss Scaling for Mixed Precision Training

Ruizhe Zhao; Brian Vogel; Tanvir Ahmed

arXiv:1910.12385 · cs.LG · October 29, 2019 · 6 cites

TL;DR

This paper proposes adaptive loss scaling, a mixed precision training method that automatically adjusts layer-wise loss scales during training. It removes the need to tune a model-specific loss scale hyperparameter by hand, and is reported to improve both training efficiency and accuracy.

Contribution

It introduces a novel adaptive loss scaling technique with layer-wise adjustments, improving mixed precision training's practicality and performance.

Findings

1. Reduces training time to convergence
2. Improves model accuracy
3. Eliminates hyperparameter tuning for loss scale

Abstract

Mixed precision training (MPT) is becoming a practical technique to improve the speed and energy efficiency of training deep neural networks by leveraging the fast hardware support for IEEE half-precision floating point that is available in existing GPUs. MPT is typically used in combination with a technique called loss scaling, which works by scaling up the loss value before the start of backpropagation in order to minimize the impact of numerical underflow on training. Unfortunately, existing methods make this loss scale value a hyperparameter that needs to be tuned per model, and a single scale cannot be adapted to different layers at different training stages. We introduce a loss scaling-based training method called adaptive loss scaling that makes MPT easier and more practical to use, by removing the need to tune a model-specific loss scale hyperparameter. We achieve this by…
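As a rough illustration of why loss scaling helps, the NumPy sketch below shows a small gradient underflowing to zero in half precision and surviving once the loss is scaled. It shows only conventional loss scaling with a single fixed scale, not the adaptive layer-wise method this paper proposes. The function names and the toy one-step "backward pass" are assumptions made for the example.

```python
import numpy as np

def backward_fp16(upstream_grad, local_grad):
    """Toy one-step chain rule whose result is stored in float16.

    Hypothetical helper: real frameworks do this per layer on tensors.
    """
    product = np.float32(upstream_grad) * np.float32(local_grad)
    return np.float16(product)

def scaled_backward(local_grad, loss_scale):
    """Backpropagate a loss multiplied by `loss_scale`, then unscale.

    Scaling the loss multiplies every gradient by the same factor, so the
    float16 value is larger; dividing in float32 recovers the true gradient.
    """
    grad_fp16 = backward_fp16(loss_scale, local_grad)
    return np.float32(grad_fp16) / np.float32(loss_scale)

tiny_grad = 1e-8  # below the smallest positive float16 value (about 6e-8)

without_scaling = scaled_backward(tiny_grad, loss_scale=1.0)
with_scaling = scaled_backward(tiny_grad, loss_scale=2.0 ** 16)

print("no loss scale:  ", without_scaling)  # gradient is lost to underflow
print("loss scale 2^16:", with_scaling)     # close to the true 1e-8
```

A scale that is too large pushes values past the float16 maximum (65504) instead. This is why a single fixed scale usually has to be tuned per model, and it is the tuning step the abstract says adaptive loss scaling removes.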

Taxonomy

Topics: Model Reduction and Neural Networks · Neural Networks and Applications · Advanced Neural Network Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Adaptive Robust Loss