Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
Xiangru Lian, Binhang Yuan, Xuefeng Zhu, Yulong Wang, Yongjun He, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu, Xuezhong Qiu, Zhishan Liu

TL;DR
Persia is a novel distributed system and hybrid training algorithm designed to efficiently train extremely large-scale recommender models up to 100 trillion parameters, addressing memory and computation challenges.
Contribution
The paper introduces Persia, a system combining a new hybrid training algorithm with system co-design to enable scalable training of trillion-parameter models.
Findings
Successfully trained models with up to 100 trillion parameters.
Demonstrated improved training efficiency and accuracy.
System is publicly available for broader use.
Abstract
Deep learning-based models have dominated the current landscape of production recommender systems. Furthermore, recent years have witnessed exponential growth in model scale--from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, training such models is challenging even within industrial-scale data centers. This difficulty is inherited from the staggering heterogeneity of the training computation--the model's embedding layer could include more than 99.99% of the total model size, which is extremely memory-intensive, while the rest of the neural network is increasingly computation-intensive. To support the training of such huge models, an efficient distributed…
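To make the abstract's point about heterogeneity concrete, the following back-of-the-envelope sketch computes what share of a model's parameters sits in the embedding layer and how much memory those embeddings would need. All sizes here (the number of IDs, the embedding dimension, the dense-network size) are hypothetical values chosen for illustration and are not taken from the paper; the helper names are likewise invented for this sketch.

```python
def embedding_share(num_ids, embedding_dim, dense_params):
    """Return (embedding parameter count, total count, embedding fraction)."""
    embedding_params = num_ids * embedding_dim
    total_params = embedding_params + dense_params
    return embedding_params, total_params, embedding_params / total_params


def fp32_terabytes(num_params):
    """Storage for num_params 32-bit floats, in terabytes (1 TB = 1e12 bytes)."""
    return num_params * 4 / 1e12


# Hypothetical configuration (assumption, not from the paper):
# one trillion sparse IDs, 100-dimensional embeddings, a 1-billion-parameter dense network.
emb, total, share = embedding_share(
    num_ids=10**12, embedding_dim=100, dense_params=10**9
)

print(f"total parameters:       {total:.3e}")
print(f"embedding share:        {share:.4%}")
print(f"fp32 embedding storage: {fp32_terabytes(emb):.0f} TB")
```

Under these assumed sizes the embedding table alone reaches 100 trillion parameters and accounts for more than 99.99% of the model, so the memory cost lies almost entirely in the embeddings while the comparatively small dense network carries the compute-heavy part of each step.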
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Recommender Systems and Techniques
