Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

Shruti Dongare; Redwan Ibne Seraj Khan; Hadeel Albahar; Nannan Zhao; Diego Melendez Maita; Ali R. Butt

arXiv:2512.10271 · cs.DC · December 12, 2025

TL;DR

This paper introduces RLTune, a reinforcement learning (RL)-based dynamic scheduling framework for heterogeneous GPU clusters. On large-scale deep learning workloads it improves GPU utilization by up to 20% and reduces queueing delay by up to 81%, without requiring per-job profiling.

Contribution

RLTune is an application-agnostic scheduler that combines RL-driven job prioritization with mixed-integer linear programming (MILP)-based job-to-node mapping, enabling scalable and efficient DL workload management on heterogeneous GPU clusters.
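The two-stage idea described above (rank jobs first, then map the ranked jobs to nodes) can be illustrated with a minimal sketch. This is not RLTune's implementation: the `prioritize` and `map_jobs` functions, the job and node tuples, and the weighted-speed objective are all hypothetical. The scoring function stands in for a learned RL policy, and the mapping step uses exhaustive search over a tiny binary assignment problem in place of a real MILP solver.

```python
from itertools import product


def prioritize(jobs, score):
    """Order jobs by descending score (hypothetical stand-in for an RL policy)."""
    return sorted(jobs, key=score, reverse=True)


def map_jobs(jobs, nodes):
    """Assign each job to one node, or leave it queued (None).

    jobs:  list of (name, gpus_needed, weight) tuples
    nodes: dict of node name -> (free_gpus, relative_speed)

    Maximizes the sum of weight * relative_speed over placed jobs, subject
    to per-node GPU capacity. Exhaustive search is used here only because
    the instance is tiny; it is an assumption standing in for a MILP solver.
    """
    choices = [None] + list(nodes)
    best_value, best_plan = -1.0, {}
    for plan in product(choices, repeat=len(jobs)):
        used = {name: 0 for name in nodes}
        value = 0.0
        for (_, gpus, weight), node in zip(jobs, plan):
            if node is None:
                continue  # job stays in the queue
            used[node] += gpus
            value += weight * nodes[node][1]
        feasible = all(used[n] <= nodes[n][0] for n in nodes)
        if feasible and value > best_value:
            best_value = value
            best_plan = {job[0]: node for job, node in zip(jobs, plan)}
    return best_plan, best_value


# Illustrative data only: three jobs and two GPU node types.
jobs = [("a", 4, 1.0), ("b", 2, 3.0), ("c", 4, 2.0)]
nodes = {"v100": (4, 1.0), "a100": (4, 2.0)}

ordered = prioritize(jobs, score=lambda job: job[2])
plan, value = map_jobs(ordered[:3], nodes)
print(plan, value)
```

Separating the stages keeps the search small: the ranking step decides which jobs enter the current batch, and only that batch is passed to the mapping step. Exhaustive search grows exponentially with batch size, so a real deployment would hand the same formulation to an optimization solver.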

Findings

01 GPU utilization increased by up to 20%

02 Queueing delay reduced by up to 81%

03 Job completion time shortened by up to 70%

Abstract

Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up…

Taxonomy

Topics: Cloud Computing and Resource Management · Big Data and Digital Economy · Parallel Computing and Optimization Techniques