FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost

Chenhao Feng; Haoli Zhang; Shakhzod Ali-Zade; Yanli Zhao; Liang Luo; Jennifer Cao; Lisen Deng; Siqiao Chen; Chenyu Zhao; Tristan Rice; Daniel Johnson; Min Si; Tiantu Xu; Yi Zhang; Siqi Yan; Chuanhao Zhuge; Min Ni; Bi Xue; Qunshu Zhang; Shen Li

arXiv:2604.24073 · cs.LG · April 28, 2026

TL;DR

FreeScale is a distributed training method for sequence recommendation models that significantly reduces computational inefficiencies and resource under-utilization on large GPU clusters.

Contribution

It introduces load balancing, communication overlapping, and SM-Free techniques to address stragglers and communication bottlenecks in large-scale training.

Findings

01. Achieves up to 90.3% reduction in computational bubbles.

02. Effectively mitigates stragglers and communication delays.

03. Improves GPU resource utilization in real-world workloads.

Abstract

Modern industrial Deep Learning Recommendation Models typically extract user preferences by analyzing sequential interaction histories and then generate predictions from these derived interests. The inherent heterogeneity of the data frequently results in substantial under-utilization of computational resources during large-scale training, primarily due to computational bubbles caused by severe stragglers and slow blocking communications. This paper introduces FreeScale, a solution designed to (1) mitigate the straggler problem by carefully load-balancing input samples, (2) minimize blocking communication by overlapping prioritized embedding communications with computation, and (3) resolve GPU resource competition during computation-communication overlap by communicating through SM-Free techniques. Empirical evaluation demonstrates that…
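
To make the straggler problem in point (1) concrete, here is a minimal sketch of one generic way to balance variable-length samples across workers: a greedy longest-first assignment. It is not FreeScale's algorithm, which the abstract does not spell out. The function name `balance_by_length`, the example lengths, and the assumption that per-sample cost grows linearly with sequence length are all illustrative.

```python
import heapq


def balance_by_length(seq_lengths, num_workers):
    """Greedy longest-first assignment of samples to workers.

    Assumption (not from the paper): the cost of a sample is
    proportional to its sequence length.
    Returns (assignment, loads), where assignment[w] lists the sample
    indices given to worker w and loads[w] is that worker's total length.
    """
    # Min-heap of (current_load, worker_id): the least-loaded worker pops first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    # Place the longest samples first so short ones can fill the gaps.
    order = sorted(range(len(seq_lengths)),
                   key=lambda i: seq_lengths[i], reverse=True)
    for i in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(i)
        heapq.heappush(heap, (load + seq_lengths[i], w))

    loads = [sum(seq_lengths[i] for i in idxs) for idxs in assignment]
    return assignment, loads


# Hypothetical interaction-history lengths for one global batch.
lengths = [512, 64, 48, 300, 128, 32, 256, 16]
assignment, loads = balance_by_length(lengths, num_workers=2)
print(loads)
```

With a naive split by position, one worker could receive most of the long histories while the others idle at the next synchronization point. The greedy assignment keeps per-worker totals close, which shrinks that idle time under the stated linear-cost assumption.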
