MonkeyTree: Near-Minimal Congestion for Multi-tenant Training via Migration
Anton A. Zabreyko, Weiyang Wang, Manya Ghobadi

TL;DR
MonkeyTree is a system that reduces network congestion in multi-tenant GPU clusters by defragmenting job placements through migration, improving training throughput and job completion times without requiring costly full-bisection bandwidth topologies.
Contribution
It introduces a novel migration-based defragmentation approach tailored for ML traffic patterns, with an ILP formulation and efficient implementation, outperforming existing congestion mitigation methods.
Findings
Achieves up to 14% faster job completion times.
Maintains p99 job completion time within 5% of ideal at high oversubscription.
Low migration overhead of approximately 9 seconds per worker.
Abstract
We present MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques. As cloud operators co-locate ML training jobs on shared, oversubscribed networks, congestion degrades training throughput for over a third of jobs. Prior approaches either rely on routing and flow scheduling, which we show have fundamental limits when traffic exceeds capacity, or require costly full-bisection bandwidth topologies with packet spraying. MonkeyTree exploits characteristics of ML training traffic: ring-based collectives generate exactly one cross-rack flow per rack a job spans, making congestion-free placements achievable. The sparse constraint structure admits abundant valid configurations, making them easy to reach with few migrations. Once reached, low fragmentation is self-reinforcing, as…
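The observation that a ring-based collective generates one cross-rack flow per rack a job spans can be illustrated with a small sketch. This is not MonkeyTree's implementation or its ILP formulation; the function names, the placement format, and the `uplinks_per_rack` capacity model (each cross-rack flow is assumed to need one full uplink) are hypothetical, chosen only to show how a placement could be checked for congestion and how a single migration might remove it.

```python
from collections import defaultdict


def cross_rack_flows(placement):
    """Count cross-rack flows leaving each rack.

    placement maps a job name to a list of rack ids, one per worker.
    Assumption: a job spanning more than one rack contributes exactly
    one cross-rack flow to every rack it occupies; a job confined to a
    single rack contributes none.
    """
    flows = defaultdict(int)
    for worker_racks in placement.values():
        racks = set(worker_racks)
        if len(racks) > 1:
            for rack in racks:
                flows[rack] += 1
    return dict(flows)


def congested_racks(placement, uplinks_per_rack):
    """Return racks whose flow count exceeds the assumed uplink capacity."""
    flows = cross_rack_flows(placement)
    return sorted(r for r, n in flows.items() if n > uplinks_per_rack)


# Hypothetical fragmented placement: job_a and job_b both span rack r0.
fragmented = {
    "job_a": ["r0", "r0", "r1", "r1"],
    "job_b": ["r0", "r2"],
    "job_c": ["r1", "r1"],
}

# Migrating job_b's worker from r0 to r2 confines job_b to a single rack.
defragmented = dict(fragmented, job_b=["r2", "r2"])

before = congested_racks(fragmented, uplinks_per_rack=1)
after = congested_racks(defragmented, uplinks_per_rack=1)
```

Under these assumptions, rack `r0` carries two cross-rack flows in the fragmented placement and exceeds its single uplink, while moving one worker leaves every rack at or below capacity.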
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
