Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

Haidong Rong; Yangzihao Wang; Feihu Zhou; Junjie Zhai; Haiyang Wu; Rui Lan; Fan Li; Han Zhang; Yuekui Yang; Zhenyu Guo; Di Wang

arXiv:1909.04823 · cs.LG · June 2, 2020

TL;DR

This paper introduces DES, a fully synchronous distributed training framework for large-scale recommender systems that reduces communication costs and improves convergence and accuracy.

Contribution

DES is the first to enable fully synchronous training for large-scale recommender systems, significantly reducing communication and enhancing model performance.

Findings

1. Achieves up to 68.7% communication savings.
2. Outperforms PS-based frameworks in throughput.
3. Improves CTR and AUC in industrial scenarios.

Abstract

We present Distributed Equivalent Substitution (DES) training, a novel distributed training framework for large-scale recommender systems with dynamic sparse features. DES introduces fully synchronous training to large-scale recommender systems for the first time by reducing communication, which lets the training of commercial recommender systems converge faster and reach a better CTR. DES requires much less communication because it substitutes weights-rich operators with computationally equivalent sub-operators and aggregates partial results instead of transmitting the huge sparse weights directly through the network. Because it uses synchronous training on large-scale Deep Learning Recommendation Models (DLRMs), DES achieves a higher AUC (Area Under the ROC Curve). We successfully apply DES training to multiple popular DLRMs in industrial scenarios. Experiments show that our implementation…
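The idea of replacing a weights-rich operator with equivalent sub-operators and aggregating partial results can be illustrated with a toy example. The sketch below is not the authors' implementation: it simulates workers inside a single process with NumPy, uses a plain first-order (linear) operator over one-hot sparse features, and assumes a simple modulo sharding rule. The names `full_forward`, `sharded_forward`, and `num_workers` are hypothetical.

```python
import numpy as np


def full_forward(weights, feature_ids):
    """Baseline: gather every active weight in one place, then sum them."""
    return float(weights[feature_ids].sum())


def sharded_forward(weights, feature_ids, num_workers):
    """Equivalent computation split into per-worker sub-operators.

    Assumption: worker `rank` owns the weights whose id % num_workers == rank.
    For simplicity every simulated worker indexes the same array here; in a
    real deployment each worker would hold only its own shard.
    """
    partials = []
    for rank in range(num_workers):
        # Each worker selects only the active features it owns.
        local_ids = feature_ids[feature_ids % num_workers == rank]
        # It emits one scalar partial result instead of its weights.
        partials.append(float(weights[local_ids].sum()))
    # Stand-in for an all-reduce (sum) across workers.
    return sum(partials), partials


rng = np.random.default_rng(0)
weights = rng.normal(size=1000)                          # toy sparse weight table
feature_ids = rng.choice(1000, size=50, replace=False)   # active features of one sample

logit_full = full_forward(weights, feature_ids)
logit_sharded, partials = sharded_forward(weights, feature_ids, num_workers=4)
```

In this toy setting each worker contributes a single scalar per sample, while the baseline has to move all 50 active weights to one place. The two results agree up to floating-point rounding, which is the sense in which the sub-operators are computationally equivalent to the original operator.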
