Instant GPU Efficiency Visibility at Fleet Scale
Connor Pedersen, Dong H. Ahn, Michel Migdal, Collin Neale, Nik Konyuchenko

TL;DR
This paper introduces OFU, a hardware-level GPU efficiency metric that accurately monitors AI workload performance across diverse GPU generations and precisions without application modifications.
Contribution
The paper presents OFU, a novel, hardware-based GPU efficiency metric that requires no application instrumentation and is effective across multiple GPU models and numeric precisions.
Findings
After tile-quantization correction, OFU predicts application-level MFU to within 2 percentage points.
OFU correlates with application-level MFU at r = 0.78 across 608 production training jobs.
OFU detected a 2.5x efficiency regression in large-scale GPU fleets.
Abstract
We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric for AI workloads on HPC systems, derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. OFU requires no application instrumentation and works across GPU generations and numeric precisions. We characterize five properties of the OFU approximation -- tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting -- through controlled GEMM experiments on H100 and GB200 across FP16, TF32, FP8, and NVFP4. After tile-quantization correction, OFU predicts application-level MFU to within 2 percentage points. Against 608 production training jobs, OFU achieves r = 0.78 correlation with application-level MFU and surfaces two framework-level FLOPs miscalculations. Deployed across large-scale…
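To make the abstract's two ingredients concrete, the Python sketch below shows one way the counters and a tile-quantization correction might combine. The abstract does not give OFU's exact formula, so everything here is an assumption for illustration rather than the authors' method. That covers the `ofu_estimate` proxy (tensor pipe activity scaled by the ratio of observed to peak SM clock), the tile sizes, and all function and parameter names. The only general fact used is that a tiled GEMM pads each dimension up to a multiple of the tile size, so the hardware executes more multiply-accumulates than the problem strictly needs.

```python
import math


def ofu_estimate(tensor_pipe_activity: float,
                 sm_clock_mhz: float,
                 peak_clock_mhz: float) -> float:
    """Hypothetical OFU proxy (an assumption, not the paper's formula).

    tensor_pipe_activity: fraction of cycles the tensor pipe is busy, in [0, 1].
    sm_clock_mhz: sampled SM clock frequency.
    peak_clock_mhz: clock at which the peak FLOP rate is rated.
    """
    if not 0.0 <= tensor_pipe_activity <= 1.0:
        raise ValueError("tensor_pipe_activity must be in [0, 1]")
    if sm_clock_mhz < 0 or peak_clock_mhz <= 0:
        raise ValueError("clock frequencies must be positive")
    return tensor_pipe_activity * sm_clock_mhz / peak_clock_mhz


def tile_quantization_efficiency(m: int, n: int, k: int,
                                 tile_m: int, tile_n: int, tile_k: int) -> float:
    """Useful FLOPs divided by FLOPs executed once each GEMM dimension
    is rounded up to a whole number of tiles (tile sizes are hypothetical)."""
    padded_m = math.ceil(m / tile_m) * tile_m
    padded_n = math.ceil(n / tile_n) * tile_n
    padded_k = math.ceil(k / tile_k) * tile_k
    return (m * n * k) / (padded_m * padded_n * padded_k)


def corrected_ofu(raw_ofu: float, tile_efficiency: float) -> float:
    """Scale the hardware-side estimate by the useful-work fraction.
    Assumes the counter also counts the padded, non-useful work."""
    return raw_ofu * tile_efficiency


if __name__ == "__main__":
    raw = ofu_estimate(tensor_pipe_activity=0.5,
                       sm_clock_mhz=1500.0,
                       peak_clock_mhz=2000.0)
    eff = tile_quantization_efficiency(129, 128, 64, 128, 128, 64)
    print(f"raw OFU proxy:      {raw:.4f}")
    print(f"tile efficiency:    {eff:.4f}")
    print(f"corrected estimate: {corrected_ofu(raw, eff):.4f}")
```

In this toy case a GEMM with M = 129 on a 128-row tile pads to 256 rows, so roughly half of the executed tensor work is not useful. This is the kind of gap a tile-quantization correction would need to close before comparing a hardware-side estimate against application-level MFU.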
