FP8 Formats for Deep Learning
Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu

TL;DR
This paper introduces two FP8 floating-point encodings, E4M3 and E5M2, designed to accelerate deep learning training and inference while matching 16-bit training quality across a range of architectures and tasks, including large language models.
Contribution
The paper proposes two novel FP8 formats, E4M3 and E5M2, demonstrating their effectiveness in deep learning training and post-training quantization without hyperparameter changes.
Findings
FP8 formats match 16-bit training quality on multiple tasks
Effective for CNNs, RNNs, and Transformers, including large language models
Enables post-training quantization of language models that resisted fixed-point int8 quantization
Abstract
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training…
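The special-value rules described in the abstract can be illustrated with a small decoder. This is a minimal sketch, not the authors' implementation: the function name `decode_fp8` is hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) are taken from the full paper rather than from the abstract above. It shows how E5M2 reserves the all-ones exponent for infinities and NaNs in the IEEE 754 style, while E4M3 reserves only the single pattern with all exponent and mantissa bits set for NaN and uses the rest of that exponent range for ordinary values.

```python
import math

def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode one 8-bit pattern (0..255) into a Python float.

    Hypothetical helper for illustration. Biases are assumed from the
    full paper: 7 for E4M3 and 15 for E5M2.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError("fmt must be 'E4M3' or 'E5M2'")

    # Layout: 1 sign bit, then exponent bits, then mantissa bits.
    sign = -1.0 if byte >> 7 else 1.0
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_all_ones
    man = byte & man_all_ones

    if fmt == "E5M2" and exp == exp_all_ones:
        # IEEE 754 convention: zero mantissa is infinity, otherwise NaN.
        return sign * math.inf if man == 0 else math.nan
    if fmt == "E4M3" and exp == exp_all_ones and man == man_all_ones:
        # The only NaN mantissa pattern; E4M3 has no infinities.
        return math.nan
    if exp == 0:
        # Zero and subnormals: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal values, including E4M3's all-ones exponent with mantissa < 7.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# Largest finite magnitudes under the assumptions above.
print(decode_fp8(0x7E, "E4M3"))  # 0 1111 110
print(decode_fp8(0x7B, "E5M2"))  # 0 11110 11
```

Under these assumptions the largest finite E4M3 value is 448 and the largest finite E5M2 value is 57344, which shows the trade-off between the two encodings: E4M3 spends its extra mantissa bit on precision, and recovers some dynamic range by giving up infinities and all but one NaN pattern.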
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · CCD and CMOS Imaging Sensors
