NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

Zhida Jiang; Zhaolong Xing; Huichao Chai; Tianxing Sun; Qiang Peng; Baopeng Yuan; Jiaxing Wang; Hua Du; Zhixin Wu; Xuemiao Li; Yikui Cao; Xinyu Liu; Yongxiang Feng; Zhen Chen; Ke Zhang

arXiv:2604.06956 · cs.DC · April 9, 2026

TL;DR

NestPipe is a framework for training recommendation models at scale. It uses nested pipelining to address both the embedding lookup bottleneck and the communication bottleneck, and reports up to 3.06x speedup on 1,536 workers.

Contribution

NestPipe introduces two hierarchical sparse parallelism techniques, Dual-Buffer Pipelining (DBP) and Frozen-Window Pipelining, to improve distributed training efficiency while preserving synchronous training semantics.

Findings

01. Achieves up to 3.06x speedup on 1,536 workers.

02. Attains 94.07% scaling efficiency.

03. Mitigates both lookup and communication bottlenecks.

Abstract

Modern recommendation models have grown to trillions of parameters. As clusters scale to O(1k) accelerators, distributed training bottlenecks shift from computation and memory to data movement, especially the lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing…
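
To make the inter-batch idea concrete, the sketch below shows generic two-buffer prefetching: the lookup for batch k+1 runs in the background while batch k is being computed. It is not NestPipe's five-stage pipeline, whose stages the abstract does not list. The function names (`lookup`, `train_step`, `run_double_buffered`) are hypothetical, and the table is never updated, so the sketch sidesteps the embedding staleness problem that DBP is designed to solve.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(batch_ids, table):
    """Hypothetical stand-in for an embedding lookup: gather rows by ID."""
    return [table[i] for i in batch_ids]


def train_step(embeddings):
    """Hypothetical stand-in for dense compute: sum the gathered values."""
    return sum(embeddings)


def run_double_buffered(batches, table):
    """Overlap the lookup for batch k+1 with the compute for batch k.

    Assumption: the table is read-only here. A real trainer updates
    embeddings after each step, which is where staleness can arise.
    """
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Fill the first buffer before the loop starts.
        pending = pool.submit(lookup, batches[0], table)
        for k in range(len(batches)):
            current = pending.result()  # wait until this buffer is ready
            if k + 1 < len(batches):
                # Start filling the other buffer while we compute on this one.
                pending = pool.submit(lookup, batches[k + 1], table)
            results.append(train_step(current))
    return results


if __name__ == "__main__":
    toy_table = {0: 1.0, 1: 2.0, 2: 3.0}
    print(run_double_buffered([[0, 1], [2], [1, 2]], toy_table))
```

With a single background worker, at most two buffers are live at any time: the one being consumed and the one being filled. Once the table is updated between steps, a prefetched buffer may hold rows from before the update, which is the kind of staleness the abstract says DBP avoids through dual-buffer synchronization.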
