BandPilot: Towards Performance- and Contention-Aware GPU Dispatching in AI Clusters

Kunming Zhang; Hanlong Liao; Junyu Xue; Deke Guo; Guoming Tang

arXiv:2506.15595 · cs.DC · January 7, 2026

TL;DR

BandPilot is a novel GPU dispatching method for AI clusters that learns bandwidth models and predicts contention, significantly improving communication efficiency over traditional topology-based heuristics.

Contribution

It introduces a data-efficient bandwidth modeling and contention-aware dispatching approach that outperforms existing static heuristics in multi-tenant AI clusters.

Findings

01. Achieves 92-97% bandwidth efficiency in experiments.

02. Improves average efficiency by 20-40% over topology-compactness heuristics.

03. Effective in heterogeneous and simulated environments.

Abstract

Modern multi-tenant AI clusters are increasingly communication-bound, driven by high-volume and multi-round GPU-to-GPU collective communication. Consequently, the GPU dispatcher's choice of a physical GPU subset for each tenant largely determines the job's effective collective bandwidth and thus its performance ceiling. Existing dispatchers predominantly rely on static, topology-aware heuristics that prioritize GPU resource compactness, assuming that minimizing physical distance maximizes communication bandwidth. However, we reveal that this assumption often fails due to complex system-level bottlenecks, such as non-linear NIC saturation and inter-node link heterogeneity. This paper presents BandPilot, a performance- and contention-aware GPU dispatching primitive that optimizes effective collective bandwidth for multi-tenant AI clusters. Specifically, BandPilot learns a data-efficient…
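
The abstract's claim that compactness does not always maximize bandwidth can be illustrated with a toy example. The sketch below is not BandPilot's bandwidth model or search procedure, which the truncated abstract does not describe. The cluster layout, the bandwidth numbers, and the functions `toy_bandwidth`, `compact_pick`, and `bandwidth_aware_pick` are all hypothetical. The toy model assumes that a multi-node job is capped by the slowest NIC among the nodes it uses, which is one simple way to represent inter-node link heterogeneity.

```python
from itertools import combinations

# Hypothetical free GPUs as (node, gpu_index) pairs; every number here is made up.
FREE_GPUS = [
    ("n0", 0), ("n0", 1), ("n0", 2),
    ("n1", 0), ("n1", 1),
    ("n2", 0), ("n2", 1),
]
INTRA_NODE_BW = 300.0  # assumed intra-node bandwidth (arbitrary units)
NIC_BW = {"n0": 25.0, "n1": 50.0, "n2": 50.0}  # assumed per-node NIC bandwidth


def toy_bandwidth(subset):
    """Toy estimate: a single-node job gets the intra-node bandwidth;
    a multi-node job is capped by the slowest NIC among its nodes."""
    nodes = {node for node, _ in subset}
    if len(nodes) == 1:
        return INTRA_NODE_BW
    return min(NIC_BW[n] for n in nodes)


def compact_pick(free, k):
    """Compactness heuristic: take GPUs from the nodes with the most free GPUs first."""
    by_node = {}
    for gpu in free:
        by_node.setdefault(gpu[0], []).append(gpu)
    ordered = sorted(by_node.values(), key=len, reverse=True)
    picked = [gpu for group in ordered for gpu in group]
    return tuple(picked[:k])


def bandwidth_aware_pick(free, k):
    """Exhaustive search for the k-GPU subset with the highest toy bandwidth."""
    return max(combinations(free, k), key=toy_bandwidth)


compact = compact_pick(FREE_GPUS, 4)
aware = bandwidth_aware_pick(FREE_GPUS, 4)
print("compact:", compact, toy_bandwidth(compact))
print("bandwidth-aware:", aware, toy_bandwidth(aware))
```

In this made-up setup the compactness heuristic fills `n0` first and inherits its slow NIC, while the exhaustive search avoids `n0` entirely. Exhaustive search is only practical for tiny examples like this one, since the number of candidate subsets grows combinatorially with cluster size.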
