NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
Zhida Jiang, Zhaolong Xing, Huichao Chai, Tianxing Sun, Qiang Peng, Baopeng Yuan, Jiaxing Wang, Hua Du, Zhixin Wu, Xuemiao Li, Yikui Cao, Xinyu Liu, Yongxiang Feng, Zhen Chen, Ke Zhang

TL;DR
NestPipe is a scalable training framework for recommendation models that uses nested pipelining to address both embedding lookup and communication bottlenecks, reaching up to 3.06x speedup on 1,536 workers.
Contribution
NestPipe introduces two techniques that exploit hierarchical sparse parallelism, Dual-Buffer Pipelining (DBP) and Frozen-Window Pipelining, to improve distributed training efficiency while preserving synchronous training semantics.
Findings
Achieves up to 3.06x speedup on 1,536 workers.
Attains 94.07% scaling efficiency.
Effectively mitigates lookup and communication bottlenecks.
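For readers unfamiliar with the metric, scaling efficiency is commonly computed as the measured speedup divided by the ideal linear speedup. The sketch below assumes that conventional definition; the listing does not say which baseline or formula the paper uses. The function name and all numbers are illustrative and not taken from the paper.

```python
def scaling_efficiency(base_workers, base_throughput,
                       scaled_workers, scaled_throughput):
    """Measured speedup divided by ideal linear speedup (conventional definition)."""
    ideal_speedup = scaled_workers / base_workers
    measured_speedup = scaled_throughput / base_throughput
    return measured_speedup / ideal_speedup


# Hypothetical figures chosen only to illustrate the arithmetic:
# going from 8 to 64 workers is an 8x ideal speedup; a measured 7.2x
# throughput gain therefore corresponds to 90% efficiency.
example_efficiency = scaling_efficiency(8, 100.0, 64, 720.0)
```

Under this definition, a value near 1.0 means throughput grows almost linearly with the number of workers.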
Abstract
Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing…
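The inter-batch idea of overlapping embedding lookup with computation can be illustrated with generic double buffering: while one batch is being computed, the lookup for the next batch is already in flight. This is a minimal sketch of that general pattern, not the paper's five-stage DBP pipeline. The functions `lookup`, `compute`, and `run_double_buffered` are hypothetical stand-ins, and the toy table is read-only, so it does not model how DBP keeps embeddings free of staleness.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(table, ids):
    # Hypothetical stand-in for a remote embedding lookup.
    return [table[i] for i in ids]


def compute(rows):
    # Hypothetical stand-in for the dense forward/backward pass.
    return sum(rows)


def run_double_buffered(table, batches):
    """Overlap the lookup of batch i+1 with the compute of batch i."""
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Fill the first buffer before the loop starts.
        pending = pool.submit(lookup, table, batches[0])
        for next_ids in list(batches[1:]) + [None]:
            current_rows = pending.result()  # buffer that is ready to use
            if next_ids is not None:
                # Start filling the second buffer in the background.
                pending = pool.submit(lookup, table, next_ids)
            results.append(compute(current_rows))
    return results


toy_table = {0: 1.0, 1: 2.0, 2: 3.0}
toy_results = run_double_buffered(toy_table, [[0, 1], [2], [1, 2]])
```

In a real training loop the table changes after every step, which is where a synchronization scheme such as the dual-buffer synchronization named in the abstract would be needed.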
