GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Ehsan Yousefzadeh-Asl-Miandoab; Reza Karimzadeh; Danyal Yorulmaz; Bulat Ibragimov; Pınar Tözün

arXiv:2602.17817·cs.DC·April 29, 2026

TL;DR

This paper systematically analyzes GPU memory and utilization estimators for training-aware resource management, highlighting their limitations and evaluating their accuracy, generalizability, and overhead across diverse models and hardware.

Contribution

It provides a comprehensive evaluation of existing estimators, introduces a lightweight ML-based estimator, and discusses the challenges in generalization and integration for GPU resource estimation.

Findings

01. Analytical models lack generalization to new architectures.

02. CPU-side libraries have high integration overhead.

03. ML-based estimators show promise but face generalization challenges.

Abstract

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation -- a key proxy for contention -- enables interference-aware scheduling. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization remains comparatively understudied, further complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of…
