CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
Sudarsanan Rajasekaran (1), Manya Ghobadi (1), Aditya Akella (2)
(1) Massachusetts Institute of Technology, (2) UT Austin

TL;DR
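CASSINI is a network-aware job scheduler for ML clusters that interleaves the communication phases of jobs sharing a network link. Compared with state-of-the-art ML schedulers, it improves average and tail job completion times by up to 1.6x and 2.5x, respectively, and reduces ECN-marked packets by up to 33x.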
Contribution
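CASSINI introduces a geometric abstraction of each job's communication pattern and an affinity graph that yields a time shift for a subset of jobs, so that the communication phases of jobs sharing the same network link are interleaved rather than overlapping.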
Findings
Up to 1.6x faster average job completion time
Up to 2.5x faster tail job completion time
Reduces ECN-marked packets by up to 33x
Abstract
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an affinity graph that finds a series of time-shift values to adjust the communication phases of a subset of jobs, such that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that compared to the state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number of ECN marked packets in the cluster by up to 33x.
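The time-shift idea in the abstract can be illustrated with a small sketch. This is not CASSINI's geometric abstraction or affinity graph; it is a brute-force stand-in, assuming each job is a periodic on/off communication pattern described by a hypothetical `(period, comm_len)` pair in discrete time slots, and that the shared link can carry one job's traffic at a time. The names `hyper_period`, `is_communicating`, `excess_slots`, and `best_shift` are illustrative only.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Least common multiple of all job periods (assumed integer time slots)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def is_communicating(period, comm_len, shift, t):
    """True if a job that communicates for the first comm_len slots of each
    period, delayed by shift slots, is using the link at slot t."""
    return (t - shift) % period < comm_len


def excess_slots(jobs, shifts):
    """Total demand above a link capacity of one job per slot, summed over
    one hyper-period. Zero means the jobs are perfectly interleaved."""
    horizon = hyper_period([period for period, _ in jobs])
    excess = 0
    for t in range(horizon):
        active = sum(
            is_communicating(period, comm_len, shift, t)
            for (period, comm_len), shift in zip(jobs, shifts)
        )
        excess += max(0, active - 1)
    return excess


def best_shift(job_a, job_b):
    """Try every shift of job_b within its period (job_a stays at shift 0)
    and return the smallest shift with the least contention."""
    period_b = job_b[0]
    return min(range(period_b), key=lambda s: excess_slots([job_a, job_b], [0, s]))


if __name__ == "__main__":
    a = (10, 4)  # hypothetical job: 10-slot iteration, 4 slots of communication
    b = (10, 4)
    shift = best_shift(a, b)
    print("contention without shift:", excess_slots([a, b], [0, 0]))
    print("chosen shift for job b:", shift)
    print("contention with shift:", excess_slots([a, b], [0, shift]))
```

With two identical jobs, delaying the second one until the first has finished its communication phase removes the contention entirely. Exhaustive search like this scales poorly with the number of jobs and links, which is the kind of cost a more structured approach such as the paper's would be expected to avoid.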
Taxonomy
Topics: Cloud Computing and Resource Management · Interconnection Networks and Systems · Parallel Computing and Optimization Techniques
