TL;DR
This paper develops bit-accurate models of GPU matrix multiply-accumulate units to explain numerical discrepancies and accuracy issues across architectures, aiding diagnosis and guiding future design.
Contribution
It introduces closed-loop feature probing (CLFP), a systematic framework for constructing complete arithmetic models of MMAUs, and uses it to provide the first bit-accurate analysis across multiple GPU architectures.
Findings
The models explain cross-platform numerical discrepancies.
The analysis identifies four precision bottlenecks that affect accuracy.
The paper provides software workarounds and design guidance.
Abstract
Modern AI accelerators rely on matrix multiply-accumulate units (MMAUs), such as NVIDIA Tensor Cores and AMD Matrix Cores, to accelerate deep neural network workloads. MMAUs expose only instruction-level or API-level interfaces of matrix multiply-accumulate (MMA) operations, while leaving internal floating-point arithmetic behaviors undocumented. Consequently, MMAUs across vendors and architectural generations often produce numerical discrepancies for identical inputs, and sometimes exhibit reduced numerical accuracy that can cause training instability. Diagnosing and understanding the root causes of these effects is challenging without white-box models of their arithmetic behaviors. This paper proposes closed-loop feature probing (CLFP), a generic and systematic framework for constructing complete arithmetic behavior models of MMA operations. Based on this framework, we analyze all MMA…
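To illustrate why undocumented internal arithmetic can yield different results for identical inputs, here is a minimal Python sketch. It is not the paper's CLFP framework or any vendor's actual behavior: the two accumulation models, the helper names (`f32`, `dot_sequential_f32`, `dot_fused_wide`), and the input vectors are all hypothetical, chosen only to show that the rounding point inside a dot product changes the answer.

```python
import struct
from fractions import Fraction


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_sequential_f32(a, b):
    """Hypothetical model A: round to FP32 after every multiply and every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def dot_fused_wide(a, b):
    """Hypothetical model B: sum the exact products, then round once to FP32."""
    total = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    return f32(float(total))


# Identical inputs for both models (illustrative values, not from the paper).
a = [2.0**24, 1.0, -(2.0**24), 1.0]
b = [1.0, 1.0, 1.0, 1.0]

print(dot_sequential_f32(a, b))  # 1.0: the first +1 is lost when 2**24 + 1 rounds to 2**24
print(dot_fused_wide(a, b))      # 2.0: the exact sum is 2, rounded once
```

Both functions compute the same mathematical dot product, yet they disagree because one rounds after each step and the other rounds only once. Real MMAUs may differ in many more ways (alignment, truncation, accumulation order), which is the kind of behavior a bit-accurate model is meant to pin down.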
