Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
Youngeun Kwon, Minsoo Rhu

TL;DR
This paper introduces ScratchPipe, a novel GPU-based embedding cache architecture for personalized recommendation systems that captures both past and future accesses, enabling faster training without relying on large CPU memory.
Contribution
The paper proposes a new embedding cache design that leverages properties of RecSys training to keep the active working set in GPU memory, overcoming the limitations of prior GPU embedding caches.
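To illustrate the difference between a cache that only reacts to past accesses and one that also looks at upcoming accesses, here is a minimal Python sketch. It is not the ScratchPipe implementation. It assumes that the embedding rows needed by each mini-batch are known before that batch runs, and that staging them can overlap with the previous training step. The function names `lru_hits` and `lookahead_hits` and the toy access trace are hypothetical.

```python
from collections import OrderedDict


def lru_hits(batches, capacity):
    """Reactive cache: a row is cached only after it has been accessed (LRU eviction)."""
    cache = OrderedDict()
    hits = total = 0
    for batch in batches:
        for row in batch:
            total += 1
            if row in cache:
                hits += 1
                cache.move_to_end(row)  # mark as most recently used
            else:
                cache[row] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used row
    return hits, total


def lookahead_hits(batches, capacity):
    """Look-ahead cache: rows needed by a batch are staged before that batch runs.

    Assumption: the staging step would overlap with the previous training
    step, so it is not counted as a miss on the critical path.
    """
    cache = set()
    hits = total = 0
    for batch in batches:
        needed = set(batch)
        if len(needed) > capacity:
            raise ValueError("working set of one batch exceeds cache capacity")
        missing = needed - cache
        # Evict only rows that the upcoming batch does not need.
        evictable = [r for r in cache if r not in needed]
        while len(cache) + len(missing) > capacity:
            cache.discard(evictable.pop())
        cache |= missing
        for row in batch:
            total += 1
            hits += row in cache
    return hits, total


if __name__ == "__main__":
    # Hypothetical access trace: each inner list holds the embedding rows of one mini-batch.
    trace = [[1, 2, 3], [4, 5, 6], [1, 2, 7], [8, 9, 4]]
    print("LRU        (hits, total):", lru_hits(trace, capacity=4))
    print("Look-ahead (hits, total):", lookahead_hits(trace, capacity=4))
```

In this toy trace the reactive cache misses whenever a row was evicted before its reuse, while the look-ahead variant hits on every access by construction, as long as one batch's working set fits in the cache. The sketch ignores transfer cost, write-back of updated rows, and pipelining details.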
Findings
Enables GPU memory-speed training of embeddings
Reduces memory bandwidth bottlenecks in RecSys training
Outperforms existing cache-based approaches
Abstract
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which is at odds with the slow CPU memory and causes performance overheads. Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such a cache design. In this work, we present a fundamentally different approach in…
Taxonomy
Topics: Recommender Systems and Techniques · Stochastic Gradient Optimization Techniques · Caching and Content Delivery
