The anachronism of whole-GPU accounting

Igor Sfiligoi; David Schultz; Frank Würthwein; Benedikt Riedel; Dmitry Y. Mishin

arXiv:2205.09232·cs.DC·July 12, 2022

TL;DR

This paper argues that GPU accounting should shift from whole-GPU metrics to core-hour metrics to better reflect performance differences and sharing, supported by extensive empirical measurements across various GPU models and infrastructures.

Contribution

It introduces a new approach to GPU accounting based on core hours and validates it through comprehensive runtime experiments on multiple GPU models and sharing scenarios.

Findings

01

Whole-GPU accounting is outdated because GPU models that coexist in the same deployment differ widely in compute throughput.

02

GPU core hours provide a more accurate measure of compute output.

03

Sharing GPUs between applications, especially at the infrastructure level, further reduces the value of whole-GPU accounting.
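
The difference between the two accounting schemes can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that the "core" unit is the CUDA core count from NVIDIA's public specifications, and the `JobRecord` type and its `gpu_fraction` field are hypothetical names introduced here to represent a job that receives only part of a shared GPU.

```python
from dataclasses import dataclass

# CUDA core counts per GPU model, from NVIDIA's public specifications.
# Using CUDA cores as the accounting unit is an assumption of this sketch.
CUDA_CORES = {"T4": 2560, "V100": 5120, "A100": 6912}


@dataclass
class JobRecord:
    """Hypothetical accounting record for one job."""
    gpu_model: str
    wall_hours: float
    gpu_fraction: float = 1.0  # share of the GPU assigned to this job


def whole_gpu_hours(job: JobRecord) -> float:
    # Traditional accounting: every GPU counts the same, shared or not.
    return job.wall_hours


def gpu_core_hours(job: JobRecord) -> float:
    # Core-based accounting: scale by model size and by the assigned share.
    return CUDA_CORES[job.gpu_model] * job.gpu_fraction * job.wall_hours


if __name__ == "__main__":
    jobs = [
        JobRecord("T4", 10.0),
        JobRecord("A100", 10.0),
        JobRecord("A100", 10.0, gpu_fraction=0.5),
    ]
    for job in jobs:
        print(job.gpu_model, job.gpu_fraction,
              whole_gpu_hours(job), gpu_core_hours(job))
```

Under whole-GPU accounting all three jobs are charged the same 10 hours, while the core-hour figure separates them by GPU model and by the share of the device each job actually held.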

Abstract

NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order-of-magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as equal no longer reflects compute output. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming increasingly hard to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much like it is normally done for…
