xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads

Jiabo Shi; Dimitrios Pezaros; Yehia Elkhatib

arXiv:2510.21048 · cs.PF · October 27, 2025

TL;DR

xMem is a CPU-based framework that accurately estimates the GPU memory needs of deep learning training workloads, enabling better scheduling and resource utilization without consuming GPU resources.

Contribution

It introduces a novel CPU-only dynamic analysis method for precise GPU memory estimation, outperforming existing approaches in accuracy and resource efficiency.

Findings

01. Reduces median relative error by 91%
02. Decreases probability of estimation failure by 75%
03. Increases memory conservation potential by 368%
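
This summary does not define the first metric. As a minimal sketch, assuming the common definition of relative error as |estimate - actual| / actual, the median relative error over a set of jobs could be computed as below. The function names and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import median


def relative_error(estimated: float, actual: float) -> float:
    """Absolute estimation error as a fraction of the measured peak memory."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return abs(estimated - actual) / actual


def median_relative_error(pairs):
    """Median of per-job relative errors over (estimated, actual) pairs."""
    return median(relative_error(est, act) for est, act in pairs)


# Hypothetical (estimated, measured peak) pairs in GiB; NOT data from the paper.
samples = [(9.0, 10.0), (21.0, 20.0), (6.0, 8.0)]
mre = median_relative_error(samples)
print(f"median relative error: {mre:.2%}")
```

The median is less sensitive than the mean to a few badly mispredicted jobs, which is one plausible reason to report it for estimators of this kind.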

Abstract

The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing, which helps prevent out-of-memory (OOM) errors and resource underutilization. However, existing estimation methods have limitations. Approaches relying on static analysis or historical data with machine learning often fail to accurately capture runtime dynamics. Furthermore, direct GPU analysis consumes scarce resources, and some techniques require intrusive code modifications. Thus, the key challenge lies in precisely estimating dynamic memory requirements, including memory allocator nuances, without consuming GPU resources or requiring intrusive code changes. To address this challenge, we propose xMem, a novel framework that leverages…
