Towards Fully FP8 GEMM LLM Training at Scale

Alejandro Hernández-Cano; Dhia Garbaya; Imanol Schlag; Martin Jaggi

arXiv:2505.20524 · cs.LG · October 28, 2025

TL;DR

This paper presents a class of LLM architectures that run every GEMM within the transformer blocks in FP8 during both the forward and backward passes, raising training throughput while matching the downstream performance of standard BF16 training and keeping long-term training stable.

Contribution

It introduces a new class of LLM architectures that support FP8 computation for all GEMMs within transformer blocks. The design reduces large outlier activations, which promotes stable long-term FP8 training and enables higher-throughput pre-training at scale.

Findings

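01. Supports FP8 computation for all GEMMs within transformer blocks, in both forward and backward passes

02. Delivers throughput gains while matching the downstream performance of standard BF16 training

03. Reduces large outlier activations, promoting stable long-term FP8 training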

Abstract

Despite the significant potential of FP8 data formats for large language model (LLM) pre-training, their adoption has been limited due to challenges in maintaining stability at scale. Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications (GEMMs) in sensitive components, such as attention projections, compromising potential throughput gains. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training. Our architecture design reduces large outlier activations, promoting stable long-term FP8 training. In addition, we identify key metrics to monitor low-precision training…
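
To make the link between outlier activations and FP8 stability concrete, here is a minimal NumPy sketch. It is not the paper's method or kernels. It assumes per-tensor scaling into the E4M3 format (maximum magnitude 448, 3 mantissa bits) and simulates the rounding in software. The helper names and the outlier magnitude of 1e5 are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

E4M3_MAX = 448.0     # largest finite magnitude in the OCP FP8 E4M3 format
E4M3_MIN_EXP = -6    # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3    # explicit mantissa bits in E4M3


def round_to_e4m3(x):
    """Simulate rounding to the nearest E4M3-representable value (saturating)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    # Exponent of each value; clamping it puts tiny values on the subnormal grid.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** E4M3_MIN_EXP)))
    step = 2.0 ** (exp - MANTISSA_BITS)   # spacing between neighbouring values
    return np.round(x / step) * step


def fake_quant_per_tensor(x):
    """Scale so the largest magnitude maps to E4M3_MAX, round, then rescale."""
    scale = E4M3_MAX / np.max(np.abs(x))
    return round_to_e4m3(x * scale) / scale


def bulk_relative_error(x, x_hat, bulk):
    """Relative L2 error measured only on the entries selected by `bulk`."""
    return np.linalg.norm(x[bulk] - x_hat[bulk]) / np.linalg.norm(x[bulk])


rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)    # toy "activations" with unit variance
bulk = np.ones(clean.size, dtype=bool)
bulk[0] = False                      # entry 0 will hold the outlier

spiky = clean.copy()
spiky[0] = 1e5                       # hypothetical outlier magnitude

err_clean = bulk_relative_error(clean, fake_quant_per_tensor(clean), bulk)
err_spiky = bulk_relative_error(spiky, fake_quant_per_tensor(spiky), bulk)
print(f"bulk error without outlier: {err_clean:.4f}")
print(f"bulk error with outlier:    {err_spiky:.4f}")
```

Under these assumptions, the single outlier sets the scale for the whole tensor, so the remaining entries are pushed toward the coarse subnormal end of the FP8 grid and their rounding error grows. Finer-grained scaling is one way around this, and an architecture that produces fewer large outliers is another.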

Videos

Towards Fully FP8 GEMM LLM Training at Scale · SlidesLive
