Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

Haiquan Wang; Chaoyi Ruan; Jia He; Jiaqi Ruan; Chengjie Tang; Xiaosong Ma; Cheng Li

arXiv:2411.15871·cs.DC·November 26, 2024

Open Access

TL;DR

DHelix introduces a micro-structure with strand interleaving to co-schedule operators in distributed LLM training, significantly reducing communication costs and improving GPU utilization across various models and clusters.

Contribution

The paper proposes DHelix, a novel micro-structure with strand interleaving that enhances efficiency and reduces communication overhead in distributed LLM training, compatible with existing parallelism methods.

Findings

01

Achieves 12-40% MFU improvement on A40 clusters.

02

Reduces communication overhead significantly across models.

03

Enables effective cross-node tensor parallelism on H100 clusters.
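For readers unfamiliar with the MFU metric used in these findings, the arithmetic can be sketched as follows. This is a minimal illustration, assuming the widely used estimate of roughly 6 FLOPs per parameter per trained token (forward plus backward, attention FLOPs ignored). The function name and all numbers are hypothetical and are not taken from the paper.

```python
def model_flops_utilization(n_params, tokens_per_second, peak_flops_per_second):
    """Approximate MFU as achieved training FLOPs/s over hardware peak FLOPs/s.

    Assumption: ~6 FLOPs per parameter per token (forward + backward),
    ignoring attention FLOPs. This is a common estimate, not the paper's
    exact accounting.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical example: a 7B-parameter model processing 1,500 tokens/s
# per GPU on a device with a 150 TFLOP/s peak.
mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_second=1500.0,
    peak_flops_per_second=150e12,
)
print(f"MFU = {mfu:.2%}")
```

Under this estimate, time spent waiting on communication lowers tokens per second and therefore lowers MFU directly, which is why hiding communication behind computation raises the metric.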

Abstract

The growth of Large Language Models (LLMs) has necessitated large-scale distributed training. Highly optimized frameworks, however, still suffer significant losses in Model FLOPs Utilization (MFU, often below 50%) due to large communication volumes. Meanwhile, our comprehensive profiling shows that the computation- and communication-intensive operators overlap well. This paper introduces DHelix, a novel micro-structure, inspired by the DNA structure, that dramatically improves the efficiency of LLM training. Central to DHelix's design is Strand Interleaving (SI), which views the continuous stream of training micro-batches through a GPU as two strands. DHelix juxtaposes the forward and backward passes of the two strands and performs a systematic optimization for an SI plan that co-schedules the operators from the opposite strands, enabled by operator-level overlap profiling results and a…
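The intuition behind co-scheduling operators from two strands can be illustrated with a toy cost model. The sketch below is not DHelix's optimizer: it pairs operators in simple lockstep and assumes that a compute operator and a communication operator overlap perfectly while two operators of the same kind serialize. The `Op` type, operator names, and durations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    kind: str   # "compute" or "comm"
    ms: float   # hypothetical duration in milliseconds


def serial_time(strand_a, strand_b):
    """Run every operator of both strands back to back, with no overlap."""
    return sum(op.ms for op in strand_a) + sum(op.ms for op in strand_b)


def interleaved_time(strand_a, strand_b):
    """Pair the i-th operators of the two strands (toy lockstep plan).

    Assumption: operators of different kinds overlap perfectly (cost is
    the longer of the two); operators of the same kind serialize.
    """
    total = 0.0
    for a, b in zip(strand_a, strand_b):
        total += max(a.ms, b.ms) if a.kind != b.kind else a.ms + b.ms
    # Unpaired tail of the longer strand runs alone.
    paired = min(len(strand_a), len(strand_b))
    longer = strand_a if len(strand_a) > len(strand_b) else strand_b
    total += sum(op.ms for op in longer[paired:])
    return total


# Forward pass of one micro-batch (strand A).
forward = [
    Op("attn_fwd", "compute", 4.0),
    Op("all_reduce", "comm", 3.0),
    Op("mlp_fwd", "compute", 5.0),
    Op("all_reduce", "comm", 3.0),
]
# Backward pass of another micro-batch (strand B).
backward = [
    Op("grad_all_reduce", "comm", 3.0),
    Op("mlp_bwd", "compute", 8.0),
    Op("grad_all_reduce", "comm", 3.0),
    Op("attn_bwd", "compute", 6.0),
]

serial_ms = serial_time(forward, backward)
interleaved_ms = interleaved_time(forward, backward)
print(serial_ms, interleaved_ms)
```

In this idealized setup every communication operator is hidden behind a compute operator from the opposite strand. Real operators interfere with each other to varying degrees, which is where operator-level overlap profiling of the kind the abstract mentions would come in.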

Taxonomy

Topics: Modular Robots and Swarm Intelligence · Ferroelectric and Negative Capacitance Devices · Machine Learning and ELM