Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

Lauritz Thamsen; Ilya Verbitskiy; Sasho Nedelkoski; Vinh Thuy Tran; Vinicius Meyer; Miguel G. Xavier; Odej Kao; Cesar A. F. De Rose

arXiv:2102.07199·cs.DC·February 16, 2021

TL;DR

Hugo is a cluster scheduler that uses reinforcement learning to optimize the co-location of data-parallel jobs, improving resource utilization and reducing job runtimes in distributed data processing systems.

Contribution

The paper introduces a scheduler that combines offline job grouping with online reinforcement learning to adaptively optimize how co-located jobs share resources.

Findings

01 Reduces Spark job runtimes by up to 12.5%

02 Increases resource utilization

03 Bounds waiting times

Abstract

Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs. This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism…
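As an illustration of the idea in the abstract (offline job groups plus online learning of how well groups share resources), here is a minimal sketch. It is not the authors' algorithm: the class `CoLocationLearner`, the `reward` function, the group names, and the epsilon-greedy update are all assumptions chosen for brevity. The sketch assumes each job already carries a group label from an offline step and that utilization and interference are measured as values in [0, 1].

```python
import random
from collections import defaultdict


def reward(cpu_util, io_util, interference):
    """Hypothetical reward: high mean utilization, low interference (all in [0, 1])."""
    return (cpu_util + io_util) / 2 - interference


class CoLocationLearner:
    """Toy learner that estimates how well two job groups co-locate."""

    def __init__(self, learning_rate=0.1, epsilon=0.1, seed=0):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # (running_group, candidate_group) -> running estimate of the reward
        self.value = defaultdict(float)

    def select(self, running_group, queued_jobs):
        """Pick one (job_id, group) pair from the queue.

        With probability epsilon a random job is explored; otherwise the job
        whose group has the highest estimate next to running_group is chosen.
        """
        if self.rng.random() < self.epsilon:
            return self.rng.choice(queued_jobs)
        return max(queued_jobs, key=lambda job: self.value[(running_group, job[1])])

    def update(self, running_group, candidate_group, observed_reward):
        """Move the estimate for this group pair toward the observed reward."""
        key = (running_group, candidate_group)
        self.value[key] += self.learning_rate * (observed_reward - self.value[key])


if __name__ == "__main__":
    learner = CoLocationLearner(epsilon=0.0)  # no exploration, for a repeatable demo
    # A CPU-bound job next to an I/O-bound job: high utilization, little interference.
    learner.update("cpu_bound", "io_bound", reward(0.9, 0.8, 0.05))
    # Two CPU-bound jobs together: they contend for the same resource.
    learner.update("cpu_bound", "cpu_bound", reward(0.9, 0.2, 0.45))
    queue = [("j1", "cpu_bound"), ("j2", "io_bound")]
    print(learner.select("cpu_bound", queue))
```

Learning per group pair rather than per job pair is what lets such a learner generalize to jobs it has not seen before, which is one plausible reason to combine grouping with online learning. A real scheduler would also need a rule that bounds how long a job can wait in the queue; this sketch omits it.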