Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism

Vipul Gupta; Dhruv Choudhary; Ping Tak Peter Tang; Xiaohan Wei; Xing Wang; Yuzhen Huang; Arun Kejariwal; Kannan Ramchandran; Michael W. Mahoney

arXiv:2010.08899 · cs.LG · May 24, 2021

TL;DR

This paper introduces Dynamic Communication Thresholding (DCT), a compression framework for hybrid (data plus model) parallelism in training large recommendation models. DCT cuts the volume of gradients and activations sent over the network and shortens training time without loss in model quality.

Contribution

The paper presents a novel DCT framework that efficiently compresses communication in hybrid parallelism, enabling scalable training of large recommendation models with minimal overhead.

Findings

01. DCT reduces communication by at least 100x in data parallelism.

02. DCT reduces communication by at least 20x in model parallelism.

03. Deploying DCT in production improved end-to-end training time of a recommender model by 37%, with no loss in model performance.

Abstract

In this paper, we consider hybrid parallelism -- a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP) -- to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and…
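
The hard-thresholding idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `ThresholdCompressor`, the parameters `keep_fraction` and `refresh_every`, and the rule of deriving the threshold from the k-th largest magnitude are all assumptions made for illustration. The abstract states only that a hard threshold filters what is communicated and that the threshold is updated once every few thousand iterations.

```python
from typing import List, Optional


def hard_threshold(values: List[float], tau: float) -> List[float]:
    """Zero every entry whose magnitude is below tau; keep the rest unchanged."""
    return [v if abs(v) >= tau else 0.0 for v in values]


def threshold_for_fraction(values: List[float], keep_fraction: float) -> float:
    """Return the k-th largest magnitude, so roughly keep_fraction of entries survive.

    Assumption: the excerpt does not say how the threshold is chosen; a
    top-k magnitude rule is used here purely for illustration.
    """
    k = max(1, int(keep_fraction * len(values)))
    return sorted((abs(v) for v in values), reverse=True)[k - 1]


class ThresholdCompressor:
    """Hypothetical compressor that refreshes its threshold only periodically."""

    def __init__(self, keep_fraction: float = 0.01, refresh_every: int = 1000):
        self.keep_fraction = keep_fraction
        self.refresh_every = refresh_every  # iterations between threshold updates
        self.tau: Optional[float] = None
        self.step = 0

    def compress(self, grad: List[float]) -> List[float]:
        # The sort is the expensive part, so it runs only on refresh steps;
        # every other step reuses the cached threshold.
        if self.tau is None or self.step % self.refresh_every == 0:
            self.tau = threshold_for_fraction(grad, self.keep_fraction)
        self.step += 1
        return hard_threshold(grad, self.tau)


if __name__ == "__main__":
    # Toy gradient with distinct magnitudes 1..100 and alternating signs.
    grad = [float(i) * (-1) ** i for i in range(1, 101)]
    compressor = ThresholdCompressor(keep_fraction=0.25, refresh_every=1000)
    sparse = compressor.compress(grad)
    print(sum(1 for v in sparse if v != 0.0), "of", len(grad), "entries kept")
```

Reusing a cached threshold between refreshes trades some accuracy in the kept fraction for a cheap elementwise comparison on most steps, which is the computational saving the abstract attributes to infrequent threshold updates. In a real system the surviving entries would be sent in a sparse index-and-value format rather than as a dense list of zeros.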
