Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning
Martin Asenov, Qiwen Deng, Gingfung Yeung, Adam Barker

TL;DR
This paper introduces a reinforcement learning method that automatically tunes the weights of a cluster scheduler's scoring functions, improving job performance and cluster utilization without requiring expert intervention.
Contribution
It presents a novel RL-based approach for dynamic scheduler tuning that adapts to different workloads and cluster setups, outperforming fixed and manually tuned weights.
Findings
Average performance improvement of 33% over fixed weights
12% improvement over the best baseline in lab scenarios
Effective generalization to unseen clusters and workloads
Abstract
Efficiently allocating incoming jobs to nodes in large-scale clusters can lead to substantial improvements in both cluster utilization and job performance. In order to allocate incoming jobs, cluster schedulers usually rely on a set of scoring functions to rank feasible nodes. Results from individual scoring functions are usually weighted equally, which could lead to sub-optimal deployments as the one-size-fits-all solution does not take into account the characteristics of each workload. Tuning the weights of scoring functions, however, requires expert knowledge and is computationally expensive. This paper proposes a reinforcement learning approach for learning the weights in scheduler scoring algorithms with the overall objective of improving the end-to-end performance of jobs for a given cluster. Our approach is based on percentage improvement reward, frame-stacking, and limiting…
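To make the scoring step described in the abstract concrete, here is a minimal Python sketch of ranking feasible nodes by a weighted sum of scoring functions. It is an illustration under stated assumptions, not the authors' implementation: the node fields (`cpu_used`, `mem_used`), the two scoring functions, and the helper names are all hypothetical. The `percentage_improvement` helper shows only one plausible reading of a "percentage improvement" signal (relative reduction in job completion time against a baseline); the paper's exact reward definition is not given in this summary.

```python
from typing import Callable, Dict, List

Node = Dict[str, float]           # hypothetical node description (utilization in [0, 1])
ScoreFn = Callable[[Node], float]  # maps a node to a score in [0, 100]


def least_allocated(node: Node) -> float:
    """Hypothetical scoring function: prefer nodes with more free CPU."""
    return 100.0 * (1.0 - node["cpu_used"])


def balanced_usage(node: Node) -> float:
    """Hypothetical scoring function: prefer nodes with similar CPU and memory usage."""
    return 100.0 * (1.0 - abs(node["cpu_used"] - node["mem_used"]))


def rank_nodes(nodes: Dict[str, Node],
               score_fns: List[ScoreFn],
               weights: List[float]) -> List[str]:
    """Return node names sorted by weighted total score, best first."""
    totals = {
        name: sum(w * fn(node) for fn, w in zip(score_fns, weights))
        for name, node in nodes.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


def percentage_improvement(baseline_time: float, tuned_time: float) -> float:
    """Assumed reward shape: percent reduction in completion time vs. a baseline."""
    return 100.0 * (baseline_time - tuned_time) / baseline_time


nodes = {
    "a": {"cpu_used": 0.2, "mem_used": 0.8},
    "b": {"cpu_used": 0.5, "mem_used": 0.5},
}
fns = [least_allocated, balanced_usage]

equal_ranking = rank_nodes(nodes, fns, [1.0, 1.0])  # equal weights
tuned_ranking = rank_nodes(nodes, fns, [3.0, 1.0])  # CPU headroom weighted higher
```

With equal weights the balanced node `b` ranks first, while weighting CPU headroom more heavily moves `a` to the top. The choice of weights alone changes the placement decision, and that choice is what the learned policy is meant to tune per workload.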
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · IoT and Edge/Fog Computing
