Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

Hao Feng; Boyuan Zhang; Fanjiang Ye; Min Si; Ching-Hsiang Chu; Jiannan Tian; Chunxing Yin; Summer Deng; Yuchen Hao; Pavan Balaji; Tong Geng; Dingwen Tao

arXiv:2407.04272 · cs.LG · October 2, 2024

Open Access

TL;DR

This paper presents a dual-level adaptive lossy compression method that reduces all-to-all communication overhead in DLRM training, yielding a 1.38× training speedup with minimal accuracy loss.

Contribution

We introduce a novel error-bounded lossy compression algorithm with a dual-level adaptive strategy tailored for DLRM training on GPUs.

Findings

01. Achieves 1.38× training speedup

02. Maintains high model accuracy with minimal impact

03. Optimized for PyTorch GPU tensors

Abstract

DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. To mitigate this, we introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training. We develop a novel error-bounded lossy compression algorithm, informed by an in-depth analysis of embedding data features, to achieve high compression ratios. Moreover, we introduce a dual-level adaptive strategy for error-bound adjustment, spanning both table-wise and iteration-wise aspects, to balance the compression benefits with the potential impacts on…
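To make the idea of error-bounded lossy compression concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: it uses simple uniform quantization, which is one common way to guarantee a per-value error bound. The helper `adaptive_error_bound`, its parameters (`base_eb`, `table_scale`, `decay`), and the choice to shrink the bound as training proceeds are all hypothetical, introduced only to illustrate what a bound that varies per table and per iteration could look like.

```python
import random


def compress(values, eb):
    """Map each float to an integer code; reconstruction error stays within eb
    (up to floating-point rounding)."""
    step = 2.0 * eb  # bin width: any value is at most eb from its bin center
    return [round(v / step) for v in values]


def decompress(codes, eb):
    """Reconstruct approximate floats from the integer codes."""
    step = 2.0 * eb
    return [c * step for c in codes]


def adaptive_error_bound(base_eb, table_scale, iteration, decay=0.999):
    """Hypothetical dual-level bound (an assumption, not from the paper):
    a per-table scale multiplied by a per-iteration decay factor."""
    return base_eb * table_scale * (decay ** iteration)


# Stand-in for one embedding row that would be sent in the all-to-all exchange.
random.seed(0)
embedding_row = [random.uniform(-1.0, 1.0) for _ in range(16)]

# Assumed settings: this table tolerates half the base bound, at iteration 100.
eb = adaptive_error_bound(base_eb=0.01, table_scale=0.5, iteration=100)

codes = compress(embedding_row, eb)
restored = decompress(codes, eb)
max_err = max(abs(a - b) for a, b in zip(embedding_row, restored))

print("error bound:", eb)
print("max reconstruction error:", max_err)
```

The integer codes occupy a much smaller range than the original floats, so they can be packed or entropy-coded before communication. A larger error bound gives fewer distinct codes and a higher compression ratio at the cost of accuracy, which is the trade-off an adaptive error-bound strategy has to balance.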

Taxonomy

Topics: Recommender Systems and Techniques