Dart: Divide and Specialize for Fast Response to Congestion in RDMA-based Datacenter Networks

Jiachen Xue; Muhammad Usama Chaudhry; Balajee Vamanan; T. N. Vijaykumar; Mithuna Thottethodi

arXiv:1805.11158·cs.NI·January 1, 2020

TL;DR

Dart is a congestion control approach for RDMA datacenter networks that quickly isolates and responds to receiver congestion and in-network congestion, significantly reducing latency and improving throughput.

Contribution

Dart introduces a divide-and-specialize congestion control method that isolates receiver congestion and employs novel switch hardware for fast response, outperforming existing schemes.

Findings

1. Achieves 60% lower latency in small-scale tests.
2. Reduces 99th-percentile latency by 79% in simulations.
3. Provides higher throughput than InfiniBand, TIMELY, and DCQCN.

Abstract

Though Remote Direct Memory Access (RDMA) promises to reduce datacenter network latencies significantly compared to TCP (e.g., 10x), end-to-end congestion control in the presence of incasts is a challenge. Targeting the full generality of the congestion problem, previous schemes rely on slow, iterative convergence to the appropriate sending rates (e.g., TIMELY takes 50 RTTs). Several papers have shown that even in oversubscribed datacenter networks most congestion occurs at the receiver. Accordingly, we propose a divide-and-specialize approach, called Dart, which isolates the common case of receiver congestion and further subdivides the remaining in-network congestion into the simpler spatially-localized and the harder spatially-dispersed cases. For receiver congestion, we propose direct apportioning of sending rates (DASR) in which a receiver for n senders directs each sender to cut…
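
To make the receiver-side idea concrete, here is a minimal sketch of direct apportioning as a one-step rate assignment, in contrast to iterative convergence. The abstract above is truncated before it states the exact cut each sender applies, so the equal-split rule below is an assumption, as are the function name `apportion_rates`, the `line_rate_gbps` parameter, and the sender labels. It is an illustration only, not the authors' implementation.

```python
from typing import Dict, List


def apportion_rates(line_rate_gbps: float, senders: List[str]) -> Dict[str, float]:
    """Assign each active sender a sending rate in a single step.

    Assumption (not confirmed by the truncated abstract): the receiver
    divides its link rate equally among the n senders it currently sees.
    """
    if not senders:
        # No active senders, so there is nothing to apportion.
        return {}
    share = line_rate_gbps / len(senders)
    # The receiver would send this rate back to each sender directly,
    # rather than letting senders probe for it over many round trips.
    return {sender: share for sender in senders}


# Hypothetical incast: four senders target one receiver with a 100 Gb/s link.
rates = apportion_rates(100.0, ["s1", "s2", "s3", "s4"])
```

Under this assumed policy the assigned rates always sum to the receiver's link rate, so the receiver link is neither oversubscribed nor left idle once the senders apply their shares.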
