TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation
Huichao Chai, Zhixin Wu, Xuemiao Li, Shiqing Fan, Hengfeng Wang, Maojun Peng, Lu Xu, Yaoyuan Wang, Yibo Jin, Wei Guo, Yongxiang Feng

TL;DR
TurboGR introduces an optimized training system for large-scale generative recommendation on Ascend NPUs, overcoming system bottlenecks with innovative acceleration, communication, and negative sampling techniques.
Contribution
The paper presents TurboGR, a training system that systematically addresses the challenges of scalable generative recommendation training on Ascend NPUs through three core innovations.
Findings
Supports training up to 0.2B parameters with high efficiency.
Achieves 54.71% MFU and near-linear scalability (0.97).
Reduces inter-device imbalance from 47% to 2.4%.
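For readers unfamiliar with these metrics, the sketch below shows one common way such figures are computed. The formulas are assumptions for illustration only: this summary does not state the exact definitions the paper uses, and the function names and inputs are hypothetical.

```python
def mfu(achieved_flops_per_s, peak_flops_per_s):
    """Model FLOPs utilization: useful model FLOP/s over hardware peak FLOP/s."""
    return achieved_flops_per_s / peak_flops_per_s


def scaling_efficiency(throughput_n, throughput_1, n_devices):
    """Measured throughput on n devices relative to ideal linear scaling."""
    return throughput_n / (n_devices * throughput_1)


def imbalance(per_device_load):
    """Assumed definition: how far the busiest device exceeds the mean load."""
    mean = sum(per_device_load) / len(per_device_load)
    return (max(per_device_load) - mean) / mean


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the formulas.
    print(scaling_efficiency(throughput_n=776.0, throughput_1=100.0, n_devices=8))
    print(imbalance([147, 100, 100, 53]))
```

Under these assumed definitions, a scaling efficiency of 1.0 would mean perfectly linear scaling, and an imbalance of 0 would mean every device carries the same load.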
Abstract
Generative recommendation (GR) has emerged as a promising paradigm that replaces fragmented, scenario-specific architectures with unified Transformer-based models, exhibiting scaling-law behavior where recommendation quality improves systematically with increased model capacity and training data. However, deploying GR at scale on Ascend NPUs faces fundamental system-level challenges. These challenges are further exacerbated on Ascend NPUs due to the absence of high-performance implementations for jagged operators and the architectural mismatch between irregular sparse primitives and NPU's dense-computation-optimized design. In this paper, we present TurboGR, an Ascend-affinity training system for generative recommendation that systematically addresses these bottlenecks through three core innovations: (i) Ascend-affinity jagged acceleration, including fusion operators that eliminate…
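The abstract refers to jagged operators without defining them. As general background, a jagged (ragged) batch stores variable-length user sequences as one flat values array plus an offsets array, so no padding is needed. The sketch below illustrates that layout in plain Python. It is not TurboGR's implementation, and all function names are hypothetical.

```python
from itertools import accumulate


def to_jagged(seqs):
    """Flatten variable-length sequences into (values, offsets)."""
    values = [x for s in seqs for x in s]
    # offsets[i]:offsets[i+1] delimits sequence i inside `values`.
    offsets = [0] + list(accumulate(len(s) for s in seqs))
    return values, offsets


def jagged_sum(values, offsets):
    """Per-sequence reduction that touches only real elements."""
    return [
        sum(values[offsets[i]:offsets[i + 1]])
        for i in range(len(offsets) - 1)
    ]


def padded_waste(seqs):
    """Fraction of a dense, max-length-padded batch that would be padding."""
    max_len = max(len(s) for s in seqs)
    total = max_len * len(seqs)
    real = sum(len(s) for s in seqs)
    return (total - real) / total


if __name__ == "__main__":
    batch = [[1, 2, 3], [4], [5, 6]]
    values, offsets = to_jagged(batch)
    print(values, offsets)
    print(jagged_sum(values, offsets))
    print(padded_waste(batch))
```

The irregular, offset-driven indexing in `jagged_sum` is the kind of access pattern the abstract contrasts with hardware optimized for dense computation, while `padded_waste` shows the cost of the dense alternative.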
