MonkeyTree: Near-Minimal Congestion for Multi-tenant Training via Migration

Anton A. Zabreyko; Weiyang Wang; Manya Ghobadi

arXiv:2602.08296 · cs.NI · February 11, 2026

TL;DR

MonkeyTree reduces network congestion in multi-tenant GPU clusters through job-migration-based defragmentation, improving training throughput and job completion times without requiring costly network topologies.

Contribution

MonkeyTree introduces a novel migration-based defragmentation approach tailored to ML traffic patterns, with an ILP formulation and an efficient implementation, and it outperforms existing congestion mitigation methods.

Findings

01. Achieves up to 14% faster job completion times.

02. Maintains p99 job completion time within 5% of ideal at high oversubscription.

03. Incurs low migration overhead, approximately 9 seconds per worker.

Abstract

We present MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques. As cloud operators co-locate ML training jobs on shared, oversubscribed networks, congestion degrades training throughput for over a third of jobs. Prior approaches either rely on routing and flow scheduling (which we show have fundamental limits when traffic exceeds capacity) or require costly full-bisection bandwidth topologies with packet spraying. MonkeyTree exploits characteristics of ML training traffic: ring-based collectives generate exactly one cross-rack flow per rack a job spans, making congestion-free placements achievable. The sparse constraint structure admits abundant valid configurations, making them easy to reach with few migrations. Once reached, low fragmentation is self-reinforcing, as…
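
The "one cross-rack flow per rack a job spans" property can be illustrated with a minimal Python sketch. It is not the paper's ILP formulation or implementation. It assumes that each job's ring is ordered so that workers in the same rack are adjacent, that every cross-rack flow occupies one full uplink, and that a job confined to a single rack sends nothing across racks. The names `cross_rack_flows`, `congested_racks`, and `uplinks_per_rack` are hypothetical, introduced only for this example.

```python
from collections import Counter

def cross_rack_flows(placement):
    """Count outgoing cross-rack flows per rack.

    placement maps a job name to a list of rack ids, one entry per worker.
    Assumption: the ring visits racks contiguously, so a job that spans
    k > 1 racks sends exactly one flow out of each of those k racks.
    """
    load = Counter()
    for racks in placement.values():
        spanned = set(racks)
        if len(spanned) > 1:  # single-rack jobs stay off the uplinks
            for rack in spanned:
                load[rack] += 1
    return load

def congested_racks(placement, uplinks_per_rack):
    """Racks whose flow count exceeds their (hypothetical) uplink count."""
    load = cross_rack_flows(placement)
    return sorted(r for r, n in load.items() if n > uplinks_per_rack)

# A fragmented placement: jobs A, B, and D all cross rack 1's uplinks.
fragmented = {"A": [0, 0, 1, 1], "B": [1, 2], "C": [2, 2, 2, 2], "D": [0, 1]}
# The same jobs after migrating D's second worker into rack 0.
defragmented = dict(fragmented, D=[0, 0])

print(dict(cross_rack_flows(fragmented)))
print(congested_racks(fragmented, uplinks_per_rack=2))
print(congested_racks(defragmented, uplinks_per_rack=2))
```

In this toy setup, rack 1 carries three flows over two uplinks until job D is consolidated into rack 0, after which no rack exceeds its uplink count. Because each job contributes at most one flow per rack, the check reduces to counting the jobs that span each rack, which is why a single migration is enough to clear the congestion here.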

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques