Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution
Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, Cheng Li

TL;DR
DHelix introduces a micro-structure with strand interleaving to co-schedule operators in distributed LLM training, significantly reducing communication costs and improving GPU utilization across various models and clusters.
Contribution
The paper proposes DHelix, a novel micro-structure with strand interleaving that enhances efficiency and reduces communication overhead in distributed LLM training, compatible with existing parallelism methods.
Findings
Achieves 12-40% MFU improvement on A40 clusters.
Reduces communication overhead significantly across models.
Enables effective cross-node tensor parallelism on H100 clusters.
Abstract
The growth of Large Language Models (LLMs) has necessitated large-scale distributed training. Highly optimized frameworks, however, still suffer significant losses in Model FLOPS Utilization (MFU, often below 50%) due to large communication volumes. Meanwhile, our comprehensive profiling shows that the computation- and communication-intensive operators overlap well. This paper introduces DHelix, a novel micro-structure, inspired by the DNA structure, that dramatically improves the efficiency of LLM training. Central to DHelix's design is Strand Interleaving (SI), which views the continuous stream of training micro-batches through a GPU as two strands. DHelix juxtaposes the forward and backward passes of the two strands and performs a systematic optimization for an SI plan that co-schedules the operators from the opposite strands, enabled by operator-level overlap profiling results and a…
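To make the co-scheduling idea concrete, here is a minimal toy sketch in Python. It is not DHelix's SI planner: the abstract describes a systematic optimization driven by operator-level overlap profiling, whereas this sketch swaps in a simple greedy pairing. The operator names, the durations, and the assumption that a compute operator and a communication operator overlap perfectly (pair cost equals the longer of the two) are all hypothetical, chosen only to show why pairing operators from opposite strands can hide communication time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    kind: str   # "compute" or "comm"
    ms: float   # hypothetical duration in milliseconds


def serial_time(strand_a, strand_b):
    """Run every operator of both strands back to back, with no overlap."""
    return sum(op.ms for op in strand_a) + sum(op.ms for op in strand_b)


def interleaved_time(strand_a, strand_b):
    """Greedy toy pairing that preserves operator order within each strand.

    When the next operators of the two strands use different resources
    (one compute, one comm), they are co-executed and the pair costs
    max(durations). Otherwise the operator from strand_a runs alone.
    """
    i = j = 0
    total = 0.0
    schedule = []
    while i < len(strand_a) and j < len(strand_b):
        a, b = strand_a[i], strand_b[j]
        if a.kind != b.kind:
            total += max(a.ms, b.ms)      # assumed ideal overlap
            schedule.append((a.name, b.name))
            i += 1
            j += 1
        else:
            total += a.ms                 # same resource: no overlap
            schedule.append((a.name, None))
            i += 1
    for op in strand_a[i:]:               # leftovers run alone
        total += op.ms
        schedule.append((op.name, None))
    for op in strand_b[j:]:
        total += op.ms
        schedule.append((None, op.name))
    return total, schedule


# Hypothetical forward pass of one micro-batch (strand A).
fwd = [
    Op("attn_fwd", "compute", 4.0),
    Op("all_reduce_fwd_1", "comm", 3.0),
    Op("mlp_fwd", "compute", 5.0),
    Op("all_reduce_fwd_2", "comm", 3.0),
]
# Hypothetical backward pass of another micro-batch (strand B).
bwd = [
    Op("all_reduce_bwd_1", "comm", 3.0),
    Op("mlp_bwd", "compute", 8.0),
    Op("all_reduce_bwd_2", "comm", 3.0),
    Op("attn_bwd", "compute", 6.0),
]

if __name__ == "__main__":
    print("serial ms:", serial_time(fwd, bwd))
    total, plan = interleaved_time(fwd, bwd)
    print("interleaved ms:", total)
    for pair in plan:
        print(pair)
```

With these made-up numbers, running the two strands back to back takes 35 ms, while the greedy pairing takes 23 ms because every communication operator is hidden behind a compute operator from the opposite strand. Real operators rarely overlap perfectly, which is presumably why the paper relies on measured overlap profiles rather than a fixed rule like this one.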
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Ferroelectric and Negative Capacitance Devices · Machine Learning and ELM
