Analyzing Communication Predictability in LLM Training
Wenxue Li, Xiangzhou Liu, Yuxuan Li, Yilun Jin, Zhenghang Ren, Xudong Liao, Han Tian, Bo Ren, Zhizhen Zhong, Guyue Liu, Ying Zhang, Kai Chen

TL;DR
This paper systematically analyzes communication predictability in LLM training, develops an analytical model for communication overhead, and introduces ConfigTuner, a tool that improves training throughput and reduces configuration search complexity.
Contribution
The paper provides the first systematic formulation of communication predictability in LLM training with hybrid parallelism and introduces ConfigTuner, a tool for optimizing training configurations.
Findings
ConfigTuner increases throughput by up to 1.36× over Megatron-LM.
The analytical model accurately estimates communication overhead.
ConfigTuner reduces search complexity compared to Alpa.
Abstract
Effective communication is essential in distributed training, with predictability being one of its most significant characteristics. However, existing studies primarily focus on exploiting predictability through online profiling for runtime optimization, without a systematic understanding of it. In this work, we aim to systematically formulate communication predictability in distributed training, particularly in Large Language Models (LLMs) that utilize hybrid parallelism. Our analysis focuses on both traffic patterns and communication overhead. Specifically, we investigate predictable traffic patterns in typical LLMs and evaluate how various factors influence GPU utilization and effective bandwidth (two critical variables affecting communication overhead). Furthermore, we develop an analytical formulation to estimate communication overhead in LLM training, which is validated with high…
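To make the idea of estimating communication overhead from effective bandwidth concrete, here is a minimal sketch using the textbook alpha-beta cost model for a ring all-reduce. This is not the paper's analytical formulation, which this summary does not reproduce. The function name, parameter names, and the choice of ring all-reduce are all assumptions made for illustration.

```python
def ring_allreduce_time(message_bytes, num_gpus, effective_bandwidth, latency_s=0.0):
    """Estimate ring all-reduce time with the standard alpha-beta model.

    Illustrative assumption only, not the model from the paper.
    message_bytes:       size of the tensor being reduced, in bytes
    num_gpus:            number of participants in the ring
    effective_bandwidth: achieved per-link bandwidth, in bytes per second
    latency_s:           fixed per-step startup latency (alpha), in seconds
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if effective_bandwidth <= 0:
        raise ValueError("effective_bandwidth must be positive")
    if num_gpus == 1:
        return 0.0  # nothing to communicate with a single participant

    # A ring all-reduce is a reduce-scatter followed by an all-gather,
    # each taking (num_gpus - 1) steps that move one chunk per step.
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (latency_s + chunk_bytes / effective_bandwidth)


if __name__ == "__main__":
    # Hypothetical numbers: 8 GB of gradients, 8 GPUs, 100 GB/s effective bandwidth.
    estimate = ring_allreduce_time(8e9, 8, 100e9)
    print(f"estimated all-reduce time: {estimate:.3f} s")
```

With these hypothetical inputs the estimate is about 0.14 s. Halving the effective bandwidth doubles the bandwidth term, which illustrates why effective bandwidth is treated as a critical variable for communication overhead.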
Taxonomy
Topics: Software-Defined Networks and 5G · Big Data and Digital Economy · Cloud Computing and Resource Management
