Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training

Guanbin Xu; ZhenGuo Xu; Yuzhe Li; Youhui Bai; Ping Gong; Chaoyi Ruan; Cheng Li

arXiv:2602.20656 · cs.DC · February 25, 2026

TL;DR

Lagom is a system that co-tunes communication parameters to balance resource usage between overlapped communication and computation in distributed large-model training. Using a unified cost model and a priority-based search algorithm, it reports 1.07-1.33x and 1.03-1.27x speedups over NCCL and AutoCCL, respectively.

Contribution

Lagom introduces a unified cost model and a priority-based search algorithm to efficiently optimize communication parameters for overlapping communication and computation in distributed training.

Findings

01

Achieves 1.07-1.33x speedup over NCCL in evaluations on high- and low-bandwidth GPU clusters.

02

Achieves 1.03-1.27x speedup over AutoCCL across diverse models and parallelizations.

03

Reduces the complexity of optimizing communication parameters from exponential to linear.

Abstract

Overlapping communication with computation is crucial for distributed large-model training, yet optimizing it, especially when computation becomes the bottleneck, remains challenging. We present Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. By introducing a unified cost model and a priority-based search algorithm, Lagom reduces optimization complexity from exponential to linear. Evaluations on high- and low-bandwidth GPU clusters show that Lagom achieves 1.07-1.33x and 1.03-1.27x speedup over NCCL and AutoCCL, respectively, across diverse models and parallelizations.
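The abstract does not spell out the search procedure, so the following is only a toy sketch of why tuning parameters one at a time in a fixed priority order can be linear in the number of candidates, while exhaustive tuning is exponential in the number of parameters. It is not Lagom's algorithm. The parameter names, candidate values, priority order, and the `toy_cost` function are all hypothetical stand-ins chosen for illustration.

```python
from itertools import product

# Hypothetical tunable communication parameters and candidate values.
# These names and numbers are illustrative assumptions, not taken from the paper.
SEARCH_SPACE = {
    "num_channels": [1, 2, 4, 8, 16],
    "chunk_size_kb": [64, 128, 256, 512],
    "num_threads": [128, 256, 512],
}


def toy_cost(cfg):
    """Made-up cost standing in for a real cost model (lower is better)."""
    # Communication gets faster with more channels and larger chunks...
    comm = 64.0 / cfg["num_channels"] + 512.0 / cfg["chunk_size_kb"]
    # ...but channels and threads take resources away from computation.
    contention = 0.5 * cfg["num_channels"] + cfg["num_threads"] / 128.0
    return comm + contention


def exhaustive_search(space, cost):
    """Try every combination: evaluations grow as the product of list sizes."""
    names = list(space)
    best_cfg, best_cost, evaluations = None, None, 0
    for values in product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        c = cost(cfg)
        evaluations += 1
        if best_cost is None or c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, evaluations


def priority_search(space, priority, cost):
    """Tune one parameter at a time in priority order, fixing each choice.

    Evaluations grow as the sum of list sizes. In general this is a
    heuristic and is not guaranteed to find the global optimum.
    """
    cfg = {name: values[0] for name, values in space.items()}
    evaluations = 0
    for name in priority:
        best_value, best_cost = cfg[name], None
        for value in space[name]:
            c = cost({**cfg, name: value})
            evaluations += 1
            if best_cost is None or c < best_cost:
                best_value, best_cost = value, c
        cfg[name] = best_value
    return cfg, evaluations


if __name__ == "__main__":
    order = ["num_channels", "chunk_size_kb", "num_threads"]  # assumed priority
    cfg_p, n_p = priority_search(SEARCH_SPACE, order, toy_cost)
    cfg_e, n_e = exhaustive_search(SEARCH_SPACE, toy_cost)
    print("priority search:", cfg_p, "evaluations:", n_p)
    print("exhaustive search:", cfg_e, "evaluations:", n_e)
```

With this toy space the exhaustive search evaluates 5 × 4 × 3 = 60 configurations, while the priority-ordered search evaluates 5 + 4 + 3 = 12. Adding another parameter multiplies the former and only adds to the latter, which is the general shape of the exponential-to-linear reduction the abstract claims.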

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Cloud Computing and Resource Management