TL;DR
This paper introduces HugeCTR Hierarchical Parameter Server (HPS), a GPU-optimized distributed framework that significantly reduces inference latency and increases throughput for large-scale recommendation models using a hierarchical storage and caching system.
Contribution
The paper presents a novel GPU-specialized inference framework with hierarchical storage and caching, enabling low-latency, high-throughput recommendation inference at large scale.
Findings
Achieves 5-62x speedup over CPU baselines.
Reduces end-to-end inference latency significantly.
Enables multi-GPU deployment for higher QPS.
Abstract
Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework, that combines a high-performance GPU embedding cache with an hierarchical storage architecture, to realize low-latency retrieval of embeddings for online…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
