DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training
Weiyan Wang, Cengguang Zhang, Liu Yang, Kai Chen, Kun Tan

TL;DR
DS-Sync is a novel synchronization method for distributed DNN training that improves communication efficiency by dividing workers into groups and shuffling them, significantly reducing training time without losing accuracy.
Contribution
Introduces DS-Sync, a new divide-and-shuffle synchronization approach that addresses network bottlenecks while ensuring convergence in distributed deep learning.
Findings
Achieves up to 94% reduction in training time.
Maintains model accuracy comparable to existing methods.
Proven to converge under non-convex, smooth conditions.
Abstract
Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN training in today's production clusters. However, due to the global synchronization nature, its performance can be significantly influenced by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contentions. Existing solutions, either system-level optimizations strengthening BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic optimizations replacing BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy. In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, by taking into account the network…
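The divide-and-shuffle idea summarized above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the excerpt does not say how DS-Sync forms groups from the network topology or what exactly is exchanged inside a group. Here each worker holds a single scalar, groups are drawn uniformly at random each round, and workers simply average within their group. The names `divide_and_shuffle_round`, `spread`, and `group_size` are hypothetical.

```python
import random


def divide_and_shuffle_round(values, group_size, rng):
    """One toy round: shuffle worker ids, divide them into groups,
    and replace each worker's value with its group's average.

    Assumption: uniform random grouping; the paper's grouping rule
    is not given in the excerpt above.
    """
    ids = list(range(len(values)))
    rng.shuffle(ids)  # shuffle step: new group membership every round
    new_values = list(values)
    for start in range(0, len(ids), group_size):
        group = ids[start:start + group_size]  # divide step
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            new_values[i] = mean  # synchronize only inside the group
    return new_values


def spread(values):
    """Gap between the largest and smallest worker value."""
    return max(values) - min(values)


rng = random.Random(0)
workers = [float(i) for i in range(8)]  # 8 workers with distinct values
global_mean = sum(workers) / len(workers)

history = [spread(workers)]
for _ in range(20):
    workers = divide_and_shuffle_round(workers, group_size=2, rng=rng)
    history.append(spread(workers))

print("initial spread:", history[0])
print("final spread:  ", history[-1])
print("mean preserved:", abs(sum(workers) / len(workers) - global_mean) < 1e-9)
```

In this toy setting no round ever involves all workers at once, yet the group averages keep the global mean unchanged and the spread between workers never grows. That is the intuition behind avoiding a global barrier while still mixing information across the whole cluster over successive rounds.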
Taxonomy
Topics: Advanced Memory and Neural Computing · Interconnection Networks and Systems · Ferroelectric and Negative Capacitance Devices
