FAST: An Efficient Scheduler for All-to-All GPU Communication

Yiran Lei; Dongjoo Lee; Liangyu Zhao; Daniar Kurniawan; Chanmyeong Kim; Heetaek Jeong; Changsu Kim; Hyeonseong Choi; Liangcheng Yu; Arvind Krishnamurthy; Justine Sherry; Eriko Nurvitadhi

arXiv:2505.09764·cs.DC·March 9, 2026

Open Access

TL;DR

FAST is a scalable scheduler for All-to-All(v) communication in GPU clusters. On the skewed workloads typical of mixture-of-experts models, it outperforms state-of-the-art schedulers while cutting schedule synthesis time by orders of magnitude.

Contribution

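The paper introduces FAST, an All-to-All(v) scheduler that handles workload skew through intra-server rebalancing and enforces balanced, one-to-one scale-out transfers. It synthesizes schedules quickly enough to keep up with dynamic traffic, where existing schedulers need seconds to hours.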

Findings

01  FAST outperforms state-of-the-art schedulers on skewed workloads.

02  FAST reduces synthesis time by orders of magnitude.

03  FAST keeps scale-out transfers balanced and one-to-one, which avoids incast congestion.

Abstract

All-to-All(v) communication is a critical primitive in modern machine learning workloads, particularly mixture-of-experts (MoE) models. Efficient scheduling is challenging because of workload skew, heterogeneous two-tier fabrics, and incast congestion, and the difficulty is compounded by the dynamic nature of MoE workloads, where traffic shifts every few hundred milliseconds. Existing schedulers scale poorly: they take seconds to hours to synthesize a schedule, which makes them impractical at that timescale. We present FAST, an efficient All-to-All(v) scheduler. FAST addresses skew through intra-server rebalancing and enforces balanced, one-to-one scale-out transfers that avoid incast. Evaluated extensively on both NVIDIA H200 and AMD MI300X clusters, FAST consistently outperforms state-of-the-art solutions on skewed workloads while reducing synthesis time by orders of magnitude.
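The two ideas named in the abstract, intra-server rebalancing and one-to-one scale-out transfers, can be illustrated with a small sketch. This is not FAST's algorithm, which this page does not describe. It is a generic illustration under stated assumptions: the even-split rebalancing rule, the classic rotation (shift) schedule, and the function names `rebalance_within_server`, `rotation_schedule`, and `max_fan_in` are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple


def rebalance_within_server(per_gpu_bytes: List[int]) -> List[int]:
    """Spread a server's outgoing bytes evenly across its GPUs.

    Assumption for illustration: data can be moved freely between GPUs
    of the same server before it leaves over the scale-out network.
    """
    total = sum(per_gpu_bytes)
    n = len(per_gpu_bytes)
    base, extra = divmod(total, n)
    # The first `extra` GPUs carry one additional byte so the sum is preserved.
    return [base + (1 if i < extra else 0) for i in range(n)]


def rotation_schedule(num_servers: int) -> List[List[Tuple[int, int]]]:
    """Build a rotation schedule: in step s, server i sends to (i + s) mod N.

    Each step is a permutation, so every server has exactly one sender
    and one receiver per step (no incast at the server level).
    """
    steps = []
    for shift in range(1, num_servers):
        steps.append([(src, (src + shift) % num_servers)
                      for src in range(num_servers)])
    return steps


def max_fan_in(step: List[Tuple[int, int]]) -> int:
    """Return the largest number of senders targeting one receiver in a step."""
    counts: Dict[int, int] = {}
    for _, dst in step:
        counts[dst] = counts.get(dst, 0) + 1
    return max(counts.values())


if __name__ == "__main__":
    # A skewed server: one GPU holds most of the outgoing data.
    print(rebalance_within_server([900, 50, 30, 20]))
    for step in rotation_schedule(4):
        print(step, "max fan-in:", max_fan_in(step))
```

In this sketch a step with fan-in 1 is what "one-to-one" means: no receiver is targeted by two senders at once. Rebalancing first evens out how much each GPU contributes, so that no single GPU's link becomes the bottleneck of a step.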

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Real-Time Systems Scheduling