DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
Zechen Ma, Zixi Qu, Jinyan Yi, David Lin, Yashar Ganjali

TL;DR
This paper introduces DBLP, a novel transport protocol for distributed ML training that dynamically adjusts gradient loss tolerance to improve resilience against network microbursts, reducing training time and tail latency.
Contribution
The paper presents a phase-aware, burst-resilient transport protocol that incorporates model-level insights to optimize gradient communication during distributed training.
Findings
DBLP reduces training time by up to 33.9%.
DBLP achieves up to 5.88x latency speedups during microbursts.
DBLP maintains stable training performance under high-loss events.
Abstract
Distributed machine learning (ML) training has become a necessity with the prevalence of billion- to trillion-parameter-scale models. While prior work has improved training efficiency from the ML perspective at the application layer, it often fails to address transient congestion events at the network layer that introduce severe tail latency and training-time variability, thereby undermining the quality of service (QoS) of distributed ML training systems. Existing network optimizations treat all gradients equally and thus fail to integrate sufficient model-training insights into communication protocol design. In this paper, we present Dynamic Bounded-Loss Protocol (DBLP), a burst-resilient, training-phase-aware, and hardware-agnostic transport protocol that incorporates model-level tolerance properties into gradient communication. By dynamically adjusting gradient loss tolerance across…
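The abstract is cut off before it describes the mechanism, so the following is only a minimal sketch of what a phase-dependent loss bound could look like, not the authors' design. Everything in it is an assumption made for illustration: the function names `loss_tolerance` and `assemble_gradient`, the linear schedule, its direction (looser early, tighter late), the default bounds, and the zero-filling of missing chunks.

```python
from typing import List, Optional


def loss_tolerance(step: int, total_steps: int,
                   early: float = 0.10, late: float = 0.01) -> float:
    """Hypothetical schedule: tolerated fraction of lost gradient chunks.

    Linearly interpolates from `early` (first step) to `late` (last step).
    The direction and the default values are illustrative assumptions.
    """
    if total_steps <= 1:
        return late
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return early + (late - early) * progress


def assemble_gradient(chunks: List[Optional[List[float]]],
                      chunk_len: int,
                      tolerance: float) -> Optional[List[float]]:
    """Accept a partially received gradient if the loss is within the bound.

    `chunks` holds one entry per expected chunk; None marks a chunk that
    never arrived. Returns the flattened gradient with missing chunks
    zero-filled, or None when too much was lost (the caller would then
    wait for retransmission instead of proceeding).
    """
    if not chunks:
        return None
    lost = sum(1 for c in chunks if c is None)
    if lost / len(chunks) > tolerance:
        return None
    flat: List[float] = []
    for c in chunks:
        flat.extend(c if c is not None else [0.0] * chunk_len)
    return flat


if __name__ == "__main__":
    received = [[1.0, 2.0], None, [3.0, 4.0], [5.0, 6.0]]  # one of four chunks lost
    print(assemble_gradient(received, chunk_len=2, tolerance=0.25))
    print(assemble_gradient(received, chunk_len=2, tolerance=0.10))
```

With a bound of 0.25 the receiver proceeds with the three chunks it has, while a bound of 0.10 makes it hold out for the missing chunk. That trade-off between waiting out a microburst and accepting a bounded loss is what the abstract alludes to.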
