Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
Qiong Wu, Zhenming Liu

TL;DR
Rosella is a self-driving distributed scheduler that automatically adapts to heterogeneous cluster environments, achieving high throughput and low latency for real-time web and AI applications.
Contribution
The paper introduces a self-driving, distributed scheduling approach that learns the compute environment and adjusts its scheduling policy in real time, reducing queue lengths and improving response times in heterogeneous clusters.
Findings
Reduces the maximum queue length from O(log n) to O(log log n)
Significantly decreases task response time
Adapts quickly to environment changes
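To give a feel for what a drop from O(log n) to O(log log n) in maximum queue length can look like, here is a minimal sketch of the classic balls-into-bins experiment, in which the same gap appears: each task joins either one randomly chosen queue or the shorter of two randomly probed queues. This is not Rosella's algorithm. It assumes identical workers, no task departures, and no learning, and the names `place_tasks` and `compare` are hypothetical, chosen only for illustration.

```python
import random


def place_tasks(n, choices, rng):
    """Assign n tasks to n identical workers (illustrative assumption).

    Each task probes `choices` workers uniformly at random and joins
    the shortest queue among those probed. Returns the queue lengths.
    """
    queues = [0] * n
    for _ in range(n):
        probed = [rng.randrange(n) for _ in range(choices)]
        target = min(probed, key=lambda w: queues[w])
        queues[target] += 1
    return queues


def compare(n=10_000, seed=0):
    """Return (max queue with one probe, max queue with two probes)."""
    rng = random.Random(seed)
    one_probe = max(place_tasks(n, 1, rng))
    two_probes = max(place_tasks(n, 2, rng))
    return one_probe, two_probes


if __name__ == "__main__":
    one_probe, two_probes = compare()
    print(f"one random probe:  max queue length = {one_probe}")
    print(f"two random probes: max queue length = {two_probes}")
```

In this simplified setting, probing a second queue typically yields a visibly smaller maximum queue length for large n. Handling heterogeneous workers, which is the case the paper targets, is not modeled here.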
Abstract
Large-scale interactive web services and advanced AI applications make sophisticated decisions in real-time, based on executing a massive amount of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Rosella automatically learns the compute environment and adjusts its scheduling policy in real-time. The solution provides high throughput and low latency simultaneously because it runs in…
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Age of Information Optimization
