Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks
Christopher W. F. Parsonson, Zacharaya Shabka, Alessandro Ottino, and Georgios Zervas

TL;DR
This paper introduces PAC-ML, a reinforcement learning and graph neural network-based method that optimizes job partitioning in distributed computing, significantly reducing blocking rates compared to traditional maximum parallelization strategies.
Contribution
It presents a novel approach combining reinforcement learning and graph neural networks to determine optimal job partitioning for improved throughput and reduced blocking in distributed systems.
Findings
PAC-ML achieves up to 56.2% lower blocking rates than maximum parallelization.
It outperforms maximum parallelization strategies in diverse JCT scenarios.
The method adapts effectively to different user-defined JCT requirements.
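To make the blocking-rate comparison in these findings concrete, here is a minimal toy sketch in Python. It is not PAC-ML and not the paper's simulator. It assumes that a job is blocked when it cannot get workers or cannot meet its user-defined maximum JCT, that work divides perfectly across workers with no communication cost, and that accepted jobs hold their workers for the whole snapshot. The names `Job`, `max_parallel`, `just_enough`, and `blocking_rate` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    work: float     # sequential run time (arbitrary units)
    max_jct: float  # largest completion time the user will accept


def completion_time(job, workers):
    # Toy assumption: work divides perfectly, with no communication overhead.
    return job.work / workers


def max_parallel(job, free):
    # Baseline policy: take every worker that is currently free.
    return free


def just_enough(job, free):
    # Hand-written heuristic: smallest worker count that still meets max_jct.
    return math.ceil(job.work / job.max_jct)


def blocking_rate(jobs, policy, cluster_size):
    # Fraction of jobs rejected in a single snapshot where no job departs.
    free = cluster_size
    blocked = 0
    for job in jobs:
        workers = policy(job, free) if free > 0 else 0
        accepted = (
            0 < workers <= free
            and completion_time(job, workers) <= job.max_jct
        )
        if accepted:
            free -= workers
        else:
            blocked += 1
    return blocked / len(jobs)


jobs = [Job(work=8.0, max_jct=4.0) for _ in range(4)]
print(blocking_rate(jobs, max_parallel, cluster_size=8))  # 0.75
print(blocking_rate(jobs, just_enough, cluster_size=8))   # 0.0
```

In this toy setting the first job under `max_parallel` finishes quickly but occupies the whole cluster, so the three jobs behind it are blocked. Partitioning each job only as far as its JCT requirement demands lets all four run. According to the summary above, PAC-ML learns the partitioning decision with reinforcement learning and a graph neural network instead of relying on a fixed rule like `just_enough`.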
Abstract
From natural language processing to genome sequencing, large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and instead must be distributed across multiple devices. This has motivated the research of new compute and network systems capable of handling such tasks. In particular, recent work has focused on developing management schemes which decide how to allocate distributed resources such that some overall objective, such as minimising the job completion time (JCT), is optimised. However, such studies omit explicit consideration of how much a job should be distributed, usually assuming that maximum distribution is desirable. In this work, we show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate. To address this, we…
Taxonomy
Topics: Cloud Computing and Resource Management · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques
Methods: Graph Neural Network
