To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability
Joonhyung Lee, Jeongin Bae, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

TL;DR
This paper investigates the stability and robustness of FP8 reduced-precision training for large language models, highlighting current limitations and proposing new evaluation methods to guide future research.
Contribution
The paper introduces new evaluation techniques and a metric for loss landscape sharpness, and uses them to analyze how reduced precision affects LLM training stability.
Findings
FP8 training methods are not yet robust enough for cost-effective use
Reduced precision affects training stability across seeds, learning rates, and datasets
Simulation of bit reductions reveals the relationship between precision and stability
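To make the last finding concrete, here is a minimal sketch of one way to simulate a bit reduction in software: rounding float32 values so that only a chosen number of explicit mantissa bits survive. This is an illustration under stated assumptions, not the paper's procedure, which this summary does not describe. The function name `reduce_mantissa` and its parameter `kept_bits` are hypothetical, the exponent range is left untouched (so overflow and underflow of a real FP8 format are not modeled), and NaN or infinite inputs are not handled.

```python
import numpy as np

def reduce_mantissa(x, kept_bits):
    """Round float32 values to `kept_bits` explicit mantissa bits.

    Hypothetical helper for illustration only. Uses round-to-nearest-even
    on the raw bit pattern and keeps the float32 exponent range, so it
    models reduced precision but not reduced dynamic range.
    """
    if not 0 <= kept_bits <= 23:
        raise ValueError("kept_bits must be between 0 and 23")
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    drop = 23 - kept_bits  # number of low-order mantissa bits to discard
    if drop == 0:
        return x.copy()
    bits = x.view(np.uint32)
    # Least significant bit of the part we keep, used to break ties to even.
    lsb = (bits >> np.uint32(drop)) & np.uint32(1)
    rounding = np.uint32((1 << (drop - 1)) - 1) + lsb
    # Mask that clears the discarded bits after rounding.
    mask = np.uint32(~((1 << drop) - 1) & 0xFFFFFFFF)
    return ((bits + rounding) & mask).view(np.float32)

# Example: 7 mantissa bits matches BF16's mantissa width, 3 matches FP8 E4M3's.
values = np.array([1.0, 1.1, 1.00390625], dtype=np.float32)
as_7_bits = reduce_mantissa(values, 7)
as_3_bits = reduce_mantissa(values, 3)
```

Sweeping `kept_bits` downward on weights or activations gives a rough picture of how rounding error grows as mantissa bits are removed. For normal values the relative error of this rounding is bounded by 2^-(kept_bits + 1).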
Abstract
The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent generations of accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8, with even fewer bits than FP16, can be a cost-effective option for LLM training. We argue that reduced-precision training schemes must have similar training stability and hyperparameter sensitivities to their higher-precision counterparts in order to be cost-effective. However, we find that currently available methods…
