Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning

Martin Asenov; Qiwen Deng; Gingfung Yeung; Adam Barker

arXiv:2603.10545 · cs.LG · March 12, 2026

TL;DR

This paper introduces a reinforcement learning method that automatically tunes the weights of a cluster scheduler's scoring functions, improving job performance and cluster utilization without requiring expert intervention.

Contribution

The paper presents an RL-based approach to dynamic scheduler tuning that adapts to different workloads and cluster setups and outperforms both fixed and manually tuned weights.

Findings

01. Average performance improvement of 33% over fixed weights

02. 12% improvement over the best baseline in lab scenarios

03. Effective generalization to unseen clusters and workloads

Abstract

Efficiently allocating incoming jobs to nodes in large-scale clusters can lead to substantial improvements in both cluster utilization and job performance. In order to allocate incoming jobs, cluster schedulers usually rely on a set of scoring functions to rank feasible nodes. Results from individual scoring functions are usually weighted equally, which could lead to sub-optimal deployments as the one-size-fits-all solution does not take into account the characteristics of each workload. Tuning the weights of scoring functions, however, requires expert knowledge and is computationally expensive. This paper proposes a reinforcement learning approach for learning the weights in scheduler scoring algorithms with the overall objective of improving the end-to-end performance of jobs for a given cluster. Our approach is based on percentage improvement reward, frame-stacking, and limiting…
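The weighted-scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` fields, the two scoring functions, and the weight values are hypothetical, and `percentage_improvement` is only one plausible reading of a "percentage improvement reward" (relative reduction in job completion time against a baseline run), since the abstract does not give the exact definition.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical node description; fractions are in [0, 1]."""
    name: str
    free_cpu: float
    free_mem: float


ScoreFn = Callable[[Node], float]


def free_cpu_score(node: Node) -> float:
    # Illustrative scoring function: prefer nodes with more free CPU.
    return node.free_cpu


def free_mem_score(node: Node) -> float:
    # Illustrative scoring function: prefer nodes with more free memory.
    return node.free_mem


def rank_nodes(nodes: Sequence[Node],
               score_fns: Sequence[ScoreFn],
               weights: Sequence[float]) -> List[Node]:
    """Rank feasible nodes by the weighted sum of their scores, best first."""
    if len(score_fns) != len(weights):
        raise ValueError("need exactly one weight per scoring function")

    def total(node: Node) -> float:
        return sum(w * fn(node) for w, fn in zip(weights, score_fns))

    return sorted(nodes, key=total, reverse=True)


def percentage_improvement(baseline_time: float, tuned_time: float) -> float:
    """Assumed reward shape: percent reduction in completion time vs. baseline."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return 100.0 * (baseline_time - tuned_time) / baseline_time


if __name__ == "__main__":
    nodes = [Node("a", free_cpu=0.9, free_mem=0.2),
             Node("b", free_cpu=0.4, free_mem=0.8)]
    fns = [free_cpu_score, free_mem_score]

    # Equal weights (the common default) versus weights favouring CPU.
    print([n.name for n in rank_nodes(nodes, fns, [1.0, 1.0])])
    print([n.name for n in rank_nodes(nodes, fns, [3.0, 1.0])])
    print(percentage_improvement(baseline_time=200.0, tuned_time=150.0))
```

In this toy setting, changing the weights changes which node is ranked first. That choice is what a learned policy would be adjusting, with a reward of this kind signalling whether the new weights helped relative to the baseline.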

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · IoT and Edge/Fog Computing