CARMA: Collocation-Aware Resource Manager

Ehsan Yousefzadeh-Asl-Miandoab; Florina M. Ciorba; Pınar Tözün

arXiv:2508.19073 · cs.DC · February 24, 2026

TL;DR

CARMA is a resource management system that improves GPU utilization for deep learning workloads by intelligently collocating tasks while minimizing memory errors and performance interference.

Contribution

CARMA introduces a collocation-aware GPU resource manager with risk analysis, memory need estimation, and recovery techniques to enhance efficiency and robustness.

Findings

01. Increases GPU SM utilization by 54%.

02. Reduces workload makespan by 35%.

03. Cuts GPU energy consumption by 15%.

Abstract

GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource manager that operates at server scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out high-risk GPUs; (2) task placement policies that cap GPU utilization to limit OOMs and interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method…
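The filtering and capping ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `GpuState` fields, the 80% utilization cap, the 1024 MiB safety margin, and the "least-utilized GPU wins" selection rule are all assumptions chosen for illustration, and the memory estimate is taken as a given input rather than produced by a real estimator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuState:
    """Monitoring snapshot of one GPU (hypothetical fields, not CARMA's)."""
    gpu_id: int
    total_mem_mib: int
    used_mem_mib: int
    sm_util_pct: float  # recent average SM utilization, 0-100

    @property
    def free_mem_mib(self) -> int:
        return self.total_mem_mib - self.used_mem_mib


def eligible_gpus(
    gpus: List[GpuState],
    est_mem_mib: int,
    sm_util_cap_pct: float = 80.0,      # assumed cap, not from the paper
    mem_safety_margin_mib: int = 1024,  # assumed margin, not from the paper
) -> List[GpuState]:
    """Drop GPUs where collocating the new task looks risky."""
    candidates = []
    for gpu in gpus:
        # Interference risk: the GPU is already busier than the cap allows.
        if gpu.sm_util_pct > sm_util_cap_pct:
            continue
        # OOM risk: estimated need plus margin exceeds the free memory.
        if gpu.free_mem_mib < est_mem_mib + mem_safety_margin_mib:
            continue
        candidates.append(gpu)
    return candidates


def place_task(gpus: List[GpuState], est_mem_mib: int) -> Optional[int]:
    """Return the chosen GPU id, or None to keep the task queued."""
    candidates = eligible_gpus(gpus, est_mem_mib)
    if not candidates:
        return None
    # Assumed policy: least-utilized GPU first, then most free memory.
    best = min(candidates, key=lambda g: (g.sm_util_pct, -g.free_mem_mib))
    return best.gpu_id


if __name__ == "__main__":
    server = [
        GpuState(0, 40960, 30000, 35.0),  # too little free memory
        GpuState(1, 40960, 8000, 90.0),   # above the utilization cap
        GpuState(2, 40960, 10000, 50.0),  # passes both checks
    ]
    print(place_task(server, est_mem_mib=12000))
```

Returning `None` instead of forcing a placement reflects the trade-off the abstract describes: a task that waits costs some queueing time, whereas a task that crashes with an OOM or slows its neighbors can erase the gains from collocation.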

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Big Data and Digital Economy