Dart: Divide and Specialize for Fast Response to Congestion in RDMA-based Datacenter Networks
Jiachen Xue, Muhammad Usama Chaudhry, Balajee Vamanan, T. N. Vijaykumar, Mithuna Thottethodi

TL;DR
Dart is a congestion control approach for RDMA datacenter networks that separates receiver congestion from in-network congestion and responds quickly to each, significantly reducing latency and improving throughput.
Contribution
Dart introduces a divide-and-specialize congestion control method that isolates receiver congestion and employs novel switch hardware for fast response, outperforming existing schemes.
Findings
Achieves 60% lower latency in small-scale tests.
Reduces 99th-percentile latency by 79% in simulations.
Provides higher throughput than InfiniBand, TIMELY, and DCQCN.
Abstract
Though Remote Direct Memory Access (RDMA) promises to reduce datacenter network latencies significantly compared to TCP (e.g., 10x), end-to-end congestion control in the presence of incasts is a challenge. Targeting the full generality of the congestion problem, previous schemes rely on slow, iterative convergence to the appropriate sending rates (e.g., TIMELY takes 50 RTTs). Several papers have shown that even in oversubscribed datacenter networks most congestion occurs at the receiver. Accordingly, we propose a divide-and-specialize approach, called Dart, which isolates the common case of receiver congestion and further subdivides the remaining in-network congestion into the simpler spatially-localized and the harder spatially-dispersed cases. For receiver congestion, we propose direct apportioning of sending rates (DASR) in which a receiver for n senders directs each sender to cut…
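The abstract is cut off before it states the exact DASR rule, so the following is only a minimal sketch of the general idea of a receiver directly apportioning sending rates, rather than senders converging iteratively. It assumes, purely for illustration, that the receiver splits its link rate equally among its n active senders; the function name `apportion_rates` and the parameter `line_rate_gbps` are hypothetical and do not come from the paper.

```python
def apportion_rates(line_rate_gbps: float, n_senders: int) -> list[float]:
    """Return one sending rate per sender.

    Illustrative assumption (not taken from the paper): the receiver
    divides its link rate equally among the n senders it currently sees.
    """
    if n_senders <= 0:
        raise ValueError("n_senders must be positive")
    if line_rate_gbps <= 0:
        raise ValueError("line_rate_gbps must be positive")
    share = line_rate_gbps / n_senders
    return [share] * n_senders


if __name__ == "__main__":
    # Four senders incast into a receiver with a 100 Gbps link.
    print(apportion_rates(100.0, 4))
```

Under this assumption the receiver can hand every sender its rate in a single step, which is the contrast the abstract draws with schemes that need many round trips to converge.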
