CARMA: Collocation-Aware Resource Manager
Ehsan Yousefzadeh-Asl-Miandoab, Florina M. Ciorba, Pınar Tözün

TL;DR
CARMA is a resource management system that improves GPU utilization for deep learning workloads by intelligently collocating tasks while minimizing memory errors and performance interference.
Contribution
CARMA introduces a collocation-aware GPU resource manager with risk analysis, memory need estimation, and recovery techniques to enhance efficiency and robustness.
Findings
Increases GPU SM utilization by 54%.
Reduces workload makespan by 35%.
Cuts GPU energy consumption by 15%.
Abstract
GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource manager that operates at server scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to limit OOMs and interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method…
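To make the filtering and capping ideas in the abstract concrete, here is a minimal Python sketch of a collocation-aware placement check. It is an illustration under stated assumptions, not CARMA's actual policy: the `GpuState` record, the `pick_gpu` function, the 80% utilization cap, the 1024 MiB memory headroom, and the "most free memory wins" tie-break are all hypothetical choices made for this example. The task's memory need is assumed to come from some external estimator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuState:
    """Hypothetical snapshot of one GPU, as a monitoring loop might record it."""
    gpu_id: int
    mem_total_mib: int
    mem_used_mib: int
    sm_util_pct: float  # recent SM utilization, 0-100


def pick_gpu(
    gpus: List[GpuState],
    est_mem_mib: int,
    util_cap_pct: float = 80.0,    # assumed cap, not a value from the paper
    mem_headroom_mib: int = 1024,  # assumed safety margin against OOM
) -> Optional[int]:
    """Return the id of a GPU judged safe for collocation, or None.

    A GPU is filtered out as high-risk if it is already above the
    utilization cap or if its free memory cannot hold the estimated
    need plus the headroom.
    """
    candidates = [
        g for g in gpus
        if g.sm_util_pct <= util_cap_pct
        and (g.mem_total_mib - g.mem_used_mib) >= est_mem_mib + mem_headroom_mib
    ]
    if not candidates:
        return None  # caller would queue the task instead of risking an OOM
    # Prefer the GPU with the most free memory among the safe ones.
    best = max(candidates, key=lambda g: g.mem_total_mib - g.mem_used_mib)
    return best.gpu_id


gpus = [
    GpuState(gpu_id=0, mem_total_mib=40960, mem_used_mib=30000, sm_util_pct=95.0),
    GpuState(gpu_id=1, mem_total_mib=40960, mem_used_mib=20000, sm_util_pct=50.0),
    GpuState(gpu_id=2, mem_total_mib=40960, mem_used_mib=38000, sm_util_pct=10.0),
]
chosen = pick_gpu(gpus, est_mem_mib=8000)
too_big = pick_gpu(gpus, est_mem_mib=30000)
```

In this toy setup GPU 0 is rejected for exceeding the utilization cap and GPU 2 for lacking free memory, so the 8000 MiB task lands on GPU 1, while the 30000 MiB task fits nowhere and would have to wait.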
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Big Data and Digital Economy
