xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads
Jiabo Shi, Dimitrios Pezaros, Yehia Elkhatib

TL;DR
xMem is a CPU-based framework that accurately estimates the GPU memory needs of deep learning training workloads, enabling better scheduling and resource utilization without consuming any GPU resources.
Contribution
It introduces a novel CPU-only dynamic analysis method for precise GPU memory estimation, outperforming existing approaches in accuracy and resource efficiency.
Findings
Reduces median relative error by 91%
Decreases probability of estimation failure by 75%
Increases memory conservation potential by 368%
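As a rough illustration of how a figure such as the reduction in median relative error can be computed, here is a minimal sketch. It assumes the conventional definition of relative error, |estimated - actual| / actual, which the summary above does not spell out. The function names and the memory values (in GB) are hypothetical and are not taken from the paper.

```python
from statistics import median

def relative_error(estimated, actual):
    # Assumed definition: absolute deviation as a fraction of the true peak memory.
    return abs(estimated - actual) / actual

def median_relative_error(pairs):
    # pairs: iterable of (estimated, actual) peak GPU memory values.
    return median(relative_error(e, a) for e, a in pairs)

def percent_reduction(baseline, improved):
    # How much smaller the improved error is, as a percentage of the baseline error.
    return 100.0 * (baseline - improved) / baseline

# Hypothetical (estimated, actual) pairs in GB, for illustration only.
baseline_pairs = [(6.0, 8.0), (12.0, 10.0), (3.0, 4.0)]
improved_pairs = [(7.8, 8.0), (10.2, 10.0), (3.9, 4.0)]

baseline_mre = median_relative_error(baseline_pairs)
improved_mre = median_relative_error(improved_pairs)
reduction = percent_reduction(baseline_mre, improved_mre)
print(f"baseline={baseline_mre:.3f} improved={improved_mre:.3f} reduction={reduction:.1f}%")
```

The median is used rather than the mean so that a few badly mis-estimated jobs do not dominate the summary statistic.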
Abstract
The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing, which helps prevent out-of-memory (OOM) errors and resource underutilization. However, existing estimation methods have limitations. Approaches relying on static analysis or historical data with machine learning often fail to accurately capture runtime dynamics. Furthermore, direct GPU analysis consumes scarce resources, and some techniques require intrusive code modifications. Thus, the key challenge lies in precisely estimating dynamic memory requirements, including memory allocator nuances, without consuming GPU resources or requiring intrusive code changes. To address this challenge, we propose xMem, a novel framework that leverages…
