TMA-Adaptive FP8 Grouped GEMM: Eliminating Padding Requirements in Low-Precision Training and Inference on Hopper
Zhongling Su, Rong Fu, Weihan Cao, Jianfei Gao, Minxi Jin, Zhilin Pei, Hui Wang

TL;DR
This paper introduces TMA-Adaptive FP8 Grouped GEMM, a method that eliminates padding in low-precision training and inference, reducing memory and computational overhead while maintaining numerical accuracy.
Contribution
It proposes a dynamic, adaptive approach to FP8 grouped GEMM that removes padding requirements using a TMA descriptor pool and alignment-aware management, enabling efficient low-precision matrix multiplication.
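One way to picture the descriptor pool is sketched below. This is a minimal illustration, not the authors' implementation: it assumes the pool holds one descriptor per power-of-two box height below the block size, and that the "dual-phase" operation covers a residual of r rows with two possibly overlapping boxes of the same height. The names `build_pool` and `select_two_phase` and the default block size of 128 are hypothetical.

```python
def build_pool(block_m=128):
    """Hypothetical pool: box heights 1, 2, 4, ... below block_m (an assumption)."""
    heights = []
    h = 1
    while h < block_m:
        heights.append(h)
        h *= 2
    return heights


def select_two_phase(residual, pool):
    """Pick the largest box height h <= residual and return two (start_row, height)
    boxes that together cover rows [0, residual) without reading past the group."""
    if residual <= 0:
        raise ValueError("residual must be positive")
    h = max(x for x in pool if x <= residual)
    first = (0, h)               # phase 1: rows [0, h)
    second = (residual - h, h)   # phase 2: rows [residual - h, residual)
    return first, second


pool = build_pool(128)
phases = select_two_phase(100, pool)
```

Because the chosen height is more than half the residual, the two boxes always overlap or touch, so every valid row is covered and no row outside the group is accessed. The overlapping rows are simply transferred twice with identical data.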
Findings
Achieves a 1.7% to 20.4% speedup over a baseline of explicit padding followed by a state-of-the-art FP8 grouped GEMM.
Reduces memory usage by up to 23.8% relative to that padded baseline.
Maintains full numerical equivalence for valid data.
Abstract
Current FP8 grouped GEMM implementations require padding each group to a fixed alignment (e.g., 128), incurring memory and computational overhead. We propose TMA-Adaptive FP8 Grouped GEMM, which eliminates padding by dynamically adapting to variable group dimensions via (1) a TMA descriptor pool with preconfigured descriptors to handle all residual row cases through dynamic runtime selection and dual-phase load-store operations, achieving comprehensive coverage with minimal overhead, and (2) TMA-alignment-aware management to satisfy 16-byte global memory alignment and 128-byte shared memory alignment. Experiments demonstrate a 1.7% to 20.4% speedup with up to 23.8% memory reduction compared to a padding operation plus state-of-the-art FP8 grouped GEMM, while maintaining full numerical equivalence for valid data. The source code is publicly available at an…
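To make the padding cost concrete, the sketch below computes how many rows a pad-to-128 scheme adds for a set of variable-sized groups, and checks the 16-byte global memory alignment of each unpadded group's start address. The group sizes and the inner dimension `k` are made-up illustrative values, the function names are hypothetical, and the one-byte element size is the only property of FP8 that is assumed.

```python
def padded_rows(rows, alignment=128):
    """Round a group's row count up to the next multiple of `alignment`."""
    return -(-rows // alignment) * alignment


def padding_overhead(group_rows, alignment=128):
    """Return (valid_rows, padded_rows, fraction of padded rows that are padding)."""
    valid = sum(group_rows)
    padded = sum(padded_rows(r, alignment) for r in group_rows)
    return valid, padded, (padded - valid) / padded


def group_byte_offsets(group_rows, k, bytes_per_elem=1):
    """Start offset in bytes of each group in an unpadded, row-major [sum(rows), k] buffer."""
    offsets, start = [], 0
    for r in group_rows:
        offsets.append(start * k * bytes_per_elem)
        start += r
    return offsets


groups = [100, 130, 256]          # illustrative per-group row counts
valid, padded, frac = padding_overhead(groups)
offsets = group_byte_offsets(groups, k=512)
all_aligned = all(off % 16 == 0 for off in offsets)
```

With these illustrative sizes, 486 valid rows grow to 640 padded rows, so about 24% of the padded buffer holds no useful data. When `k` times the element size is a multiple of 16 bytes, every group starts on a 16-byte boundary regardless of its row count, which is why the alignment condition concerns the row stride and not the group size.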
