Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems
Prasoon Sinha, Akhil Guliani, Rutwik Jain, Brandon Tran, Matthew D. Sinclair, Shivaram Venkataraman

TL;DR
This paper investigates the performance variability in large-scale GPU-accelerated systems, revealing significant differences even among identical GPUs, which impacts efficiency and future system design.
Contribution
It provides the first comprehensive characterization of GPU performance variability across multiple large-scale HPC systems, highlighting its prevalence and implications.
Findings
Average performance variation of 8% across GPUs
Outliers up to 1.5 times slower than median
Variability consistent across different applications and cooling methods
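The figures above can be read as statistics over per-GPU runtimes of the same workload. The following is a minimal sketch, not the paper's method: this summary does not say how the authors define "variation", so the code assumes it is the runtime range divided by the median, and it flags a GPU as an outlier when its runtime is at least 1.5 times the median. The function name `summarize_variability`, the GPU labels, and the runtimes are all hypothetical and made up for illustration.

```python
from statistics import median


def summarize_variability(runtimes_ms, outlier_factor=1.5):
    """Summarize the spread of per-GPU runtimes for one workload.

    runtimes_ms: dict mapping a GPU label to its runtime in milliseconds.
    outlier_factor: slowdown relative to the median that marks an outlier
                    (1.5 is an assumption taken from the Findings list).
    """
    if not runtimes_ms:
        raise ValueError("need at least one runtime")

    values = list(runtimes_ms.values())
    med = median(values)

    # Slowdown of each GPU relative to the median GPU (1.0 = median speed).
    slowdowns = {gpu: t / med for gpu, t in runtimes_ms.items()}

    # Assumed definition: (slowest - fastest) / median, as a percentage.
    variation_pct = 100.0 * (max(values) - min(values)) / med

    outliers = sorted(g for g, s in slowdowns.items() if s >= outlier_factor)
    return {
        "median_ms": med,
        "variation_pct": variation_pct,
        "slowdowns": slowdowns,
        "outliers": outliers,
    }


# Hypothetical runtimes, not measurements from the paper.
example = {"gpu0": 96.0, "gpu1": 98.0, "gpu2": 100.0, "gpu3": 104.0, "gpu4": 150.0}
summary = summarize_variability(example)
print(summary["variation_pct"], summary["outliers"])
```

With these made-up numbers, one slow GPU dominates the range-based metric, which is why a per-GPU slowdown against the median is reported alongside it.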
Abstract
Scientists are increasingly exploring and utilizing the massive parallelism of general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters, hyperscalers, national computing centers, and supercomputers have procured hardware to support this evolving application paradigm. These systems contain hundreds to tens of thousands of accelerators, enabling peta- and exa-scale levels of compute for scientific workloads. Recent work demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines have the same architecture and SKU (stock keeping unit). This variation occurs due to manufacturing variability and the chip's PM. However, while modern HPC systems widely employ accelerators such as GPUs, it is unclear how much this variability affects applications. Accordingly, we seek to characterize the extent…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management
