Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna

TL;DR
This paper analyzes the effectiveness of existing RoCE congestion control schemes in distributed DNN training environments, revealing their limited impact and emphasizing the need for specialized solutions tailored to training workloads.
Contribution
The study compares state-of-the-art RoCE congestion control schemes against PFC on distributed training platforms in detail, highlights their inadequacies, and argues for solutions optimized for training workloads.
Findings
Existing schemes have minimal impact on training performance.
Distributed training platforms have unique network characteristics.
Specialized congestion control schemes are necessary for optimal training.
Abstract
RDMA over Converged Ethernet (RoCE) has gained significant traction for datacenter networks due to its compatibility with conventional Ethernet-based fabric. However, the RDMA protocol is efficient only on (nearly) lossless networks, emphasizing the vital role of congestion control on RoCE networks. Unfortunately, the native RoCE congestion control scheme, based on Priority Flow Control (PFC), suffers from many drawbacks such as unfairness, head-of-line blocking, and deadlock. Therefore, in recent years many schemes have been proposed to provide additional congestion control for RoCE networks to minimize PFC drawbacks. However, these schemes are proposed for general datacenter environments. In contrast to the general datacenters that are built using commodity hardware and run general-purpose workloads, high-performance distributed training platforms deploy high-end accelerators and…
Taxonomy
Topics: Software-Defined Networks and 5G · Advanced Optical Network Technologies · Network Traffic and Congestion Control
