Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs
Anwar Hossain Zahid, Ignacio Laguna, Wei Le

TL;DR
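This paper studies compiler-induced numerical differences between NVIDIA and AMD GPUs: cases where the same program, run on the same input, produces different numerical results on the two platforms. It uses more than 600,000 generated tests to surface subtle inconsistencies relevant to scientific codes.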
Contribution
It presents a differential testing approach that uses Varity to generate short numerical tests in CUDA and HIP, and HIPIFY to convert CUDA tests into HIP, in order to detect compiler-induced numerical differences across GPU platforms.
Findings
Identified numerical differences due to math library calls
Detected discrepancies tied to floating-point precision (FP64 versus FP32)
Uncovered inconsistencies introduced by HIPIFY conversion
Abstract
As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced numerical differences between NVIDIA and AMD GPUs. Our approach uses Varity to generate thousands of short numerical tests in CUDA and HIP, and their inputs; then, we use differential testing to check if the program produced a numerical inconsistency when run on these GPUs. We also use the HIPIFY tool to convert CUDA tests into HIP and check if there are numerical inconsistencies induced by HIPIFY. We generated more than 600,000 tests and found subtle numerical differences that come from (1) math library calls, (2) differences in…
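The comparison step of differential testing can be illustrated with a small sketch. This is not Varity's implementation or the authors' classification scheme; the function names `bits64`, `classify`, and `compare_results` are hypothetical, and the sketch assumes each test prints a single FP64 result that has already been collected from both platforms. It compares the two results bit for bit, treating any two NaNs as consistent.

```python
import math
import struct


def bits64(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def classify(x: float) -> str:
    """Place a result in a coarse class (hypothetical classes, for illustration)."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"
    if x == 0.0:
        return "zero"
    return "number"


def compare_results(a: float, b: float) -> str:
    """Compare one test's result from platform A with its result from platform B.

    Returns "consistent", "class-mismatch" (e.g. inf vs. a finite number),
    or "value-mismatch" (same class, different bit pattern).
    """
    class_a, class_b = classify(a), classify(b)
    if class_a != class_b:
        return "class-mismatch"
    if class_a == "nan":
        # NaN payloads are ignored in this sketch; any two NaNs count as equal.
        return "consistent"
    if bits64(a) == bits64(b):
        return "consistent"
    return "value-mismatch"


if __name__ == "__main__":
    # Made-up values standing in for the outputs of one test on two GPUs.
    print(compare_results(0.1 + 0.2, 0.3))
    print(compare_results(float("inf"), 1e308))
```

A bitwise comparison is deliberately strict: it reports differences of a single unit in the last place, which a tolerance-based check would hide. Whether such small differences matter depends on the application.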
