Decentralized Distributed Proximal Policy Optimization (DD-PPO) for High Performance Computing Scheduling on Multi-User Systems
Matthew Sgambati, Aleksandar Vakanski, Matthew Anderson

TL;DR
This paper presents a scalable, decentralized RL scheduler for HPC job scheduling, based on DD-PPO, that is reported to outperform traditional rule-based schedulers and existing RL methods when trained on large job-trace datasets.
Contribution
Introduces a novel RL-based HPC scheduler built on the decentralized DD-PPO algorithm, which scales training across multiple workers without centralized parameter updates and improves performance over existing methods.
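To make "without centralized parameter updates" concrete, here is a minimal sketch of one common decentralized pattern: each worker computes a gradient on its own data, the gradients are averaged with an all-reduce, and every worker applies the same update locally, so no parameter server is involved. This is not the paper's implementation. The linear-regression loss stands in for a PPO loss, the all-reduce is simulated in a single process, and every name (local_gradient, all_reduce_mean, decentralized_step, make_batch) is hypothetical.

```python
import random

def local_gradient(params, batch):
    """Mean-squared-error gradient for a linear model (stand-in for a PPO loss gradient)."""
    grad = [0.0] * len(params)
    for features, target in batch:
        error = sum(w * f for w, f in zip(params, features)) - target
        for i, f in enumerate(features):
            grad[i] += 2.0 * error * f / len(batch)
    return grad

def all_reduce_mean(worker_grads):
    """Simulated all-reduce: every worker receives the element-wise mean gradient."""
    n = len(worker_grads)
    mean = [sum(column) / n for column in zip(*worker_grads)]
    return [list(mean) for _ in worker_grads]

def decentralized_step(worker_params, worker_batches, lr=0.1):
    """One synchronous step: local gradients, all-reduce, identical local updates."""
    grads = [local_gradient(p, b) for p, b in zip(worker_params, worker_batches)]
    reduced = all_reduce_mean(grads)
    return [[w - lr * g for w, g in zip(p, r)] for p, r in zip(worker_params, reduced)]

def make_batch(rng, true_weights, size=32):
    """Noise-free synthetic data so that convergence is easy to check."""
    batch = []
    for _ in range(size):
        features = [rng.gauss(0.0, 1.0) for _ in true_weights]
        target = sum(w * f for w, f in zip(true_weights, features))
        batch.append((features, target))
    return batch

rng = random.Random(0)
true_weights = [1.5, -2.0]
# Four workers, each holding its own copy of the parameters.
worker_params = [[0.0, 0.0] for _ in range(4)]
for _ in range(200):
    batches = [make_batch(rng, true_weights) for _ in worker_params]
    worker_params = decentralized_step(worker_params, batches)
```

Because every worker starts from the same parameters and applies the same averaged gradient, the copies stay identical without any node acting as a central coordinator. In a real multi-node setup the simulated all-reduce would be replaced by a collective communication call.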
Findings
DD-PPO outperforms rule-based schedulers in HPC environments.
DD-PPO demonstrates superior scalability with large datasets.
Experimental validation used over 11.5 million HPC job traces.
Abstract
Resource allocation in High Performance Computing (HPC) environments presents a complex and multifaceted challenge for job scheduling algorithms. Beyond the efficient allocation of system resources, schedulers must account for and optimize multiple performance metrics, including job wait time and system utilization. While traditional rule-based scheduling algorithms dominate the current deployments of HPC systems, the increasing heterogeneity and scale of those systems are expected to challenge the efficiency and flexibility of those algorithms in minimizing job wait time and maximizing utilization. Recent research efforts have focused on leveraging advancements in Reinforcement Learning (RL) to develop more adaptable and intelligent scheduling strategies. These RL-based scheduling approaches have explored a range of algorithms, from Deep Q-Networks (DQN) to Proximal Policy Optimization…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Mobile Crowdsensing and Crowdsourcing
Methods: Decentralized Distributed Proximal Policy Optimization
