Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
Rishabh Jain, Vivek M. Bhasi, Adwait Jog, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das

TL;DR
This paper improves GPU-based DLRM inference by identifying bottlenecks in the embedding stage, applying compiler optimizations, and introducing software prefetching and L2 pinning techniques, achieving up to 103% speedup in the embedding stage and up to 77% in overall inference.
Contribution
The paper introduces software prefetching and L2 pinning techniques that improve embedding-stage performance in GPU DLRM inference.
Findings
The embedding stage is the main bottleneck in GPU DLRM inference.
Compiler optimizations can improve embedding kernel performance by up to 53%.
The proposed techniques yield up to 103% speedup in the embedding stage and up to 77% in overall inference.
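As a rough illustration of what the embedding stage computes, the sketch below gathers rows from an embedding table and pools them per sample. It is not the paper's kernel: the function name `embedding_bag_sum`, the sum pooling, the offsets-based bag layout, and the tiny table are all assumptions made for illustration. The data-dependent row gather in the inner loop is the kind of irregular memory access that typically makes such lookups expensive.

```python
def embedding_bag_sum(table, indices, offsets):
    """Sum-pool the rows of `table` selected by each bag of indices.

    offsets[b] is where bag b starts in `indices`; bag b ends where
    bag b + 1 starts (or at the end of `indices` for the last bag).
    """
    dim = len(table[0])
    pooled = []
    for b, start in enumerate(offsets):
        end = offsets[b + 1] if b + 1 < len(offsets) else len(indices)
        acc = [0.0] * dim
        for row_id in indices[start:end]:
            row = table[row_id]  # data-dependent gather: irregular access
            for d in range(dim):
                acc[d] += row[d]
        pooled.append(acc)
    return pooled


# Hypothetical toy data: 4 embedding rows of dimension 2, two bags.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
indices = [0, 2, 1, 3, 3]  # bag 0 -> rows 0, 2; bag 1 -> rows 1, 3, 3
offsets = [0, 2]
result = embedding_bag_sum(table, indices, offsets)
```

In a real model the table holds millions of rows, so consecutive lookups rarely touch nearby memory. Prefetching upcoming rows and keeping frequently used rows resident in cache are general strategies for this access pattern.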
Abstract
Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are being increasingly preferred for executing DLRM inference. However, serving newer DLRMs, while meeting acceptable latencies, continues to remain challenging, making traditional deployments increasingly more GPU-hungry, resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading up to a 3.2x embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence…
Taxonomy
Topics: Brain Tumor Detection and Classification · Recommender Systems and Techniques · Machine Learning in Healthcare
