A Frequency-aware Software Cache for Large Recommendation System Embeddings
Jiarui Fang, Geng Zhang, Jiatong Han, Shenggui Li, Zhengda Bian, Yongbin Li, Jin Liu, Yang You

TL;DR
This paper introduces a frequency-aware software cache for large recommendation system embeddings, enabling efficient GPU training by dynamically managing embedding data between CPU and GPU memory.
Contribution
It presents a novel GPU-based cache approach that leverages frequency statistics to optimize embedding management in DLRMs, scalable to multiple GPUs.
Findings
Keeps only 1.5% of the embedding parameters in GPU memory while sustaining a decent end-to-end training speed.
Achieves efficient training speed with minimal GPU memory usage.
Supports synchronized updates and multi-GPU scaling.
Abstract
Deep learning recommendation models (DLRMs) are widely used at Internet companies. The embedding tables of DLRMs are too large to fit entirely in GPU memory. We propose a GPU-based software cache approach that dynamically manages the embedding table across CPU and GPU memory by leveraging the ID frequency statistics of the target dataset. The proposed software cache is efficient at training entire DLRMs on GPU with synchronized updates. It also scales to multiple GPUs in combination with widely used hybrid parallel training approaches. An evaluation of our prototype system shows that keeping only 1.5% of the embedding parameters on the GPU is enough to obtain a decent end-to-end training speed.
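The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `FreqAwareCache`, its methods, and the eviction rule (evict the cached row with the lowest dataset frequency, after writing it back) are assumptions made for illustration, and NumPy arrays stand in for both CPU and GPU memory.

```python
import numpy as np


class FreqAwareCache:
    """Toy frequency-aware cache (hypothetical, for illustration only).

    `cpu_weight` holds every embedding row; `gpu_weight` stands in for the
    small GPU-resident buffer. Both are plain NumPy arrays here.
    """

    def __init__(self, cpu_weight, id_freq, cache_ratio=0.015):
        self.cpu_weight = cpu_weight
        num_rows, dim = cpu_weight.shape
        self.capacity = max(1, int(num_rows * cache_ratio))
        self.gpu_weight = np.zeros((self.capacity, dim), dtype=cpu_weight.dtype)
        # rank[i] is the position of id i when ids are sorted by
        # descending frequency (0 means the most frequent id).
        order = np.argsort(-np.asarray(id_freq), kind="stable")
        self.rank = np.empty(num_rows, dtype=np.int64)
        self.rank[order] = np.arange(num_rows)
        self.slot_of = {}                      # id -> row in gpu_weight
        self.free_slots = list(range(self.capacity))
        self.hits = 0
        self.misses = 0

    def _evict_one(self, protected):
        # Evict the least frequent cached id that the current batch
        # does not need, writing its (possibly updated) row back.
        victim = max(
            (i for i in self.slot_of if i not in protected),
            key=lambda i: self.rank[i],
        )
        slot = self.slot_of.pop(victim)
        self.cpu_weight[victim] = self.gpu_weight[slot]
        return slot

    def lookup(self, ids):
        """Return the embedding rows for `ids`, loading misses on demand."""
        ids = [int(i) for i in ids]
        needed = list(dict.fromkeys(ids))      # unique ids, order preserved
        if len(needed) > self.capacity:
            raise ValueError("batch needs more rows than the cache can hold")
        protected = set(needed)
        for i in needed:
            if i in self.slot_of:
                self.hits += 1
                continue
            self.misses += 1
            slot = self.free_slots.pop() if self.free_slots else self._evict_one(protected)
            self.gpu_weight[slot] = self.cpu_weight[i]
            self.slot_of[i] = slot
        return self.gpu_weight[[self.slot_of[i] for i in ids]]


if __name__ == "__main__":
    table = np.arange(20, dtype=np.float32).reshape(10, 2)
    freq = [100, 1, 50, 2, 3, 4, 5, 6, 7, 8]
    cache = FreqAwareCache(table, freq, cache_ratio=0.3)
    cache.lookup([0, 2, 1])
    cache.lookup([0, 3])                       # id 1 is the coldest, so it is evicted
    print(sorted(cache.slot_of), cache.hits, cache.misses)
```

In this sketch the frequency statistics are collected once, before training, and only decide which row is evicted; a real system would also have to batch the CPU-to-GPU transfers and keep gradient updates synchronized, which the sketch does not model.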
Taxonomy
Topics: Recommender Systems and Techniques · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
