Decentralized Distributed Proximal Policy Optimization (DD-PPO) for High Performance Computing Scheduling on Multi-User Systems

Matthew Sgambati; Aleksandar Vakanski; Matthew Anderson

arXiv:2505.03946 · cs.DC · May 8, 2025

Open Access

TL;DR

This paper presents a scalable, decentralized RL scheduler for HPC jobs, built on DD-PPO, that outperforms traditional rule-based schedulers and existing RL methods while handling large datasets and multiple system metrics.

Contribution

Introduces a novel RL-based scheduler built on the decentralized DD-PPO algorithm, which enables scalable, efficient HPC scheduling without centralized parameter updates and improves performance over existing methods.
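
To make "without centralized parameter updates" concrete, the following is a minimal sketch, not the paper's implementation. It assumes a synchronous scheme in which every worker averages gradients with its peers (as an all-reduce would) and applies the same update to its own parameter copy, so no parameter server is involved. The function names, learning rate, and gradient values are hypothetical and chosen only for illustration.

```python
from typing import List


def allreduce_mean(grads: List[List[float]]) -> List[float]:
    """Average per-worker gradients element-wise, as an all-reduce would."""
    n_workers = len(grads)
    return [sum(column) / n_workers for column in zip(*grads)]


def decentralized_step(
    params: List[List[float]],
    local_grads: List[List[float]],
    lr: float = 0.1,
) -> List[List[float]]:
    """Apply the same averaged gradient to every worker's own parameter copy.

    No worker acts as a central parameter server: each one holds a full
    copy of the parameters and updates it locally (hypothetical sketch).
    """
    avg = allreduce_mean(local_grads)
    return [
        [p - lr * g for p, g in zip(worker_params, avg)]
        for worker_params in params
    ]


# Three hypothetical workers that start from identical parameters.
params = [[1.0, 2.0] for _ in range(3)]
# Each worker computes a different gradient from its own rollouts.
local_grads = [[0.3, -0.6], [0.6, 0.0], [0.0, 0.6]]
params = decentralized_step(params, local_grads)
```

Because every worker applies the identical averaged gradient, the parameter copies stay in sync after each step without any single node owning the model.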

Findings

01. DD-PPO outperforms rule-based schedulers in HPC environments.

02. DD-PPO demonstrates superior scalability with large datasets.

03. Experimental validation used over 11.5 million HPC job traces.

Abstract

Resource allocation in High Performance Computing (HPC) environments presents a complex and multifaceted challenge for job scheduling algorithms. Beyond the efficient allocation of system resources, schedulers must account for and optimize multiple performance metrics, including job wait time and system utilization. While traditional rule-based scheduling algorithms dominate the current deployments of HPC systems, the increasing heterogeneity and scale of those systems are expected to challenge the efficiency and flexibility of those algorithms in minimizing job wait time and maximizing utilization. Recent research efforts have focused on leveraging advancements in Reinforcement Learning (RL) to develop more adaptable and intelligent scheduling strategies. Recent RL-based scheduling approaches have explored a range of algorithms, from Deep Q-Networks (DQN) to Proximal Policy Optimization…

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Mobile Crowdsensing and Crowdsourcing

Methods: Decentralized Distributed Proximal Policy Optimization