GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

TL;DR
This paper systematically analyzes GPU memory and utilization estimators for training-aware resource management, highlighting their limitations and evaluating their accuracy, generalizability, and overhead across diverse models and hardware.
Contribution
It provides a comprehensive evaluation of existing estimators, introduces a lightweight ML-based estimator, and discusses the challenges in generalization and integration for GPU resource estimation.
Findings
Analytical models lack generalization to new architectures.
CPU-side libraries have high integration overhead.
ML-based estimators show promise but face generalization challenges.
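To make the first finding concrete, here is a minimal sketch of what an analytical estimator might look like. It is not the paper's method or any of the estimators it evaluates. The function name and parameters are hypothetical, and it counts only the static per-parameter state (weights, gradients, optimizer state) under the assumption of uniform precision. It ignores activations, framework overhead, and allocator fragmentation, which is one reason estimates of this kind can transfer poorly to new architectures.

```python
def estimate_static_training_bytes(num_params: int,
                                   bytes_per_param: int = 4,
                                   optimizer_states_per_param: int = 2) -> int:
    """Rough lower bound on training memory from parameter count alone.

    Hypothetical helper for illustration. Assumes weights, gradients, and
    optimizer states all use the same precision (4 bytes = fp32 by default).
    optimizer_states_per_param = 2 corresponds to an Adam-style optimizer
    (first and second moments); 0 corresponds to plain SGD without momentum.
    """
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param
    optimizer_state = num_params * bytes_per_param * optimizer_states_per_param
    return weights + gradients + optimizer_state


def bytes_to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Example: a model with 100 million parameters trained in fp32 with Adam.
    total = estimate_static_training_bytes(100_000_000)
    print(f"static state: {bytes_to_gib(total):.2f} GiB (activations not included)")
```

The estimate depends entirely on knowing the parameter count, precision, and optimizer up front, which reflects the "dependence on detailed model specifications" the abstract attributes to analytical models.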
Abstract
Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation -- a key proxy for contention -- enables interference-aware scheduling. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization remains comparatively understudied, further complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of…
