MLTCP: Congestion Control for DNN Training

Sudarsanan Rajasekaran; Sanjoli Narang; Anton A. Zabreyko; Manya Ghobadi

arXiv:2402.09589 · cs.NI · February 16, 2024

Open Access

TL;DR

MLTCP is a congestion control technique that speeds up DNN training in shared GPU clusters. It lets the communication phases of jobs competing for network bandwidth interleave, so flows settle into a stable pattern and training iteration times drop.

Contribution

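MLTCP introduces a simple, scalable modification to existing congestion control algorithms: each flow scales its congestion window based on the number of bytes sent in each training iteration. The change takes 30-60 lines of code in Reno, CUBIC, or DCQCN and accelerates DNN training in shared clusters.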

Findings

01. Up to 2x reduction in average training iteration time.

02. Up to 4x reduction in 99th percentile iteration time.

03. Stable interleaving of flows within a few training iterations.

Abstract

We present MLTCP, a technique to augment today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication phases of jobs that compete for network bandwidth to interleave with each other, thereby utilizing the network efficiently. At the heart of MLTCP lies a very simple principle based on a key conceptual insight: DNN training flows should scale their congestion window size based on the number of bytes sent at each training iteration. We show that integrating this principle into today's congestion control protocols is straightforward: by adding 30-60 lines of code to Reno, CUBIC, or DCQCN, MLTCP stabilizes flows of different jobs into an interleaved state within a few training iterations, regardless of the number of competing flows or the start time of each flow. Our experiments with popular DNN training jobs demonstrate that…
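The principle stated in the abstract, that a flow's congestion window should scale with the bytes sent in the current training iteration, can be illustrated with a minimal Python sketch layered on Reno-style additive increase. This is not the authors' implementation. The class name `ScaledRenoWindow`, the linear scaling function, its `slope` and `intercept` constants, and the choice to favor flows that are further through their iteration are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class ScaledRenoWindow:
    """Reno-style congestion avoidance whose growth depends on iteration progress.

    Hypothetical sketch: the scaling function and constants are assumptions,
    not values taken from the paper.
    """

    bytes_per_iteration: int  # assumed known: bytes the job sends per iteration
    cwnd: float = 10.0        # congestion window, in packets
    slope: float = 1.0        # hypothetical tuning constant
    intercept: float = 0.5    # hypothetical tuning constant
    bytes_sent: int = 0       # bytes acknowledged so far in this iteration

    def scale(self) -> float:
        """Linear function of the fraction of this iteration's bytes sent."""
        ratio = min(self.bytes_sent / self.bytes_per_iteration, 1.0)
        return self.slope * ratio + self.intercept

    def on_ack(self, acked_bytes: int) -> None:
        """Additive increase, scaled by iteration progress."""
        self.bytes_sent += acked_bytes
        self.cwnd += self.scale() / self.cwnd

    def on_loss(self) -> None:
        """Multiplicative decrease, unchanged from standard Reno."""
        self.cwnd = max(self.cwnd / 2.0, 1.0)

    def on_iteration_end(self) -> None:
        """Reset the byte counter when a new training iteration begins."""
        self.bytes_sent = 0


# Two competing flows: one just starting its communication phase, one nearly done.
early = ScaledRenoWindow(bytes_per_iteration=1000)
late = ScaledRenoWindow(bytes_per_iteration=1000, bytes_sent=900)
early.on_ack(100)
late.on_ack(100)
print(early.cwnd < late.cwnd)
```

Under these assumptions, the flow that is further through its iteration grows its window faster on each ACK. That asymmetry is one plausible way for competing jobs to drift apart in time instead of repeatedly colliding, which is the interleaving behavior the abstract describes.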

Taxonomy

Topics: Advanced Data Processing Techniques