Rethinking floating point for deep learning
Jeff Johnson

TL;DR
This paper introduces a novel 8-bit log float format with a hybrid log multiply/linear add. It is more energy efficient than integer hardware of equivalent bit width and stays close to float32 accuracy for deep learning inference without retraining.
Contribution
It presents a new floating point format that is more energy-efficient than integer hardware at the same bit width, with minimal accuracy loss and no need for retraining.
Findings
The 8-bit log float is within 0.9% top-1 accuracy of the float32 ResNet-50 model.
The log float multiply-add consumes 0.96x the power of an 8/32-bit integer multiply-add.
In 16 bits, the log float multiply-add consumes 0.59x the power of IEEE float16 while maintaining precision.
Abstract
Reducing hardware overhead of neural networks for faster or lower power inference and training is an active area of research. Uniform quantization using integer multiply-add has been thoroughly investigated, which requires learning many quantization parameters, fine-tuning training or other prerequisites. Little effort is made to improve floating point relative to this baseline; it remains energy inefficient, and word size reduction yields drastic loss in needed dynamic range. We improve floating point to be more energy efficient than equivalent bit width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson's posit format. With no network retraining, and drop-in replacement of all math and float32 parameters via round-to-nearest-even only, this open-sourced 8-bit…
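The idea behind a hybrid log multiply/linear add can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's format or hardware design: the names `to_log`, `log_mul`, `to_linear_fixed` and `dot`, and the bit widths `LOG_FRAC_BITS` and `ACC_FRAC_BITS`, are hypothetical, and the tapered posit-style encoding is left out. Values are stored as a sign plus a fixed-point log2 magnitude, so a multiply becomes an integer add of logs. Each product is then converted to a linear fixed-point integer and summed in an exact integer accumulator, in the spirit of Kulisch accumulation.

```python
import math

# Assumed widths for illustration only; the paper's actual format differs.
LOG_FRAC_BITS = 4    # fractional bits of the fixed-point log2 magnitude
ACC_FRAC_BITS = 24   # fractional bits of the linear fixed-point accumulator


def to_log(x):
    """Encode a float as (sign, fixed-point log2 magnitude); None means zero."""
    if x == 0.0:
        return None
    sign = -1 if x < 0 else 1
    # Python's round() rounds halves to even.
    log_fixed = round(math.log2(abs(x)) * (1 << LOG_FRAC_BITS))
    return sign, log_fixed


def log_mul(a, b):
    """Multiply in the log domain: signs multiply, log magnitudes add."""
    if a is None or b is None:
        return None
    return a[0] * b[0], a[1] + b[1]


def to_linear_fixed(p):
    """Convert a log-domain value to an integer with ACC_FRAC_BITS fraction bits."""
    if p is None:
        return 0
    sign, log_fixed = p
    value = 2.0 ** (log_fixed / (1 << LOG_FRAC_BITS))
    return sign * round(value * (1 << ACC_FRAC_BITS))


def dot(xs, ws):
    """Dot product: log-domain multiplies, exact integer (linear) accumulation."""
    acc = 0  # Python ints are unbounded, so the running sum never rounds
    for x, w in zip(xs, ws):
        acc += to_linear_fixed(log_mul(to_log(x), to_log(w)))
    return acc / (1 << ACC_FRAC_BITS)


# Powers of two are represented exactly in the log domain.
print(dot([2.0, 4.0, -0.5], [0.5, 0.25, 8.0]))
# Other values carry a rounding error from the coarse log encoding.
print(dot([3.0], [5.0]))
```

In this sketch the only rounding happens when encoding inputs into the log domain and when converting each product to fixed point. The sum itself is exact, so error does not grow with the order of accumulation.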
Taxonomy
Topics: Neural Networks and Applications · Advanced Neural Network Applications · Machine Learning and Data Classification
