A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models

Yingcan Wei, Matthias Langer, Fan Yu, Minseok Lee, Kingsley Liu, Jerry Shi, Joey Wang

arXiv:2210.08804 · cs.IR · October 18, 2022

TL;DR

This paper introduces HugeCTR Hierarchical Parameter Server (HPS), a GPU-optimized distributed framework that significantly reduces inference latency and increases throughput for large-scale recommendation models using a hierarchical storage and caching system.

Contribution

The paper presents a novel GPU-specialized inference framework with hierarchical storage and caching, enabling low-latency, high-throughput recommendation inference at large scale.

Findings

01 Achieves 5-62x speedup over CPU baselines.

02 Reduces end-to-end inference latency significantly.

03 Enables multi-GPU deployment for higher QPS.

Abstract

Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, and search. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework that combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online…
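The idea of a small, fast embedding cache sitting in front of larger, slower storage can be illustrated with a toy sketch. This is not the HPS implementation or its API: the class name `TieredEmbeddingStore`, the LRU eviction policy, and the use of plain Python dictionaries as storage tiers are all assumptions made for illustration, and an in-memory `OrderedDict` merely stands in for the GPU embedding cache.

```python
from collections import OrderedDict


class TieredEmbeddingStore:
    """Toy lookup hierarchy: a bounded cache in front of slower tiers.

    Hypothetical sketch only. The cache stands in for a GPU embedding
    cache; each slower tier is a plain dict ordered fastest first.
    """

    def __init__(self, cache_capacity, slower_tiers):
        self.cache_capacity = cache_capacity
        self.cache = OrderedDict()        # least recently used entry first
        self.slower_tiers = slower_tiers  # list of dicts, fastest first
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        # Fast path: the embedding is already cached.
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.cache[key]

        # Slow path: walk the tiers until one of them holds the key.
        self.misses += 1
        for tier in self.slower_tiers:
            if key in tier:
                vector = tier[key]
                self._insert(key, vector)
                return vector
        raise KeyError(key)

    def _insert(self, key, vector):
        # Promote the embedding into the cache, evicting the LRU entry
        # when the cache grows past its capacity.
        self.cache[key] = vector
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)


if __name__ == "__main__":
    memory_tier = {1: [0.1, 0.2]}
    disk_tier = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
    store = TieredEmbeddingStore(cache_capacity=2,
                                 slower_tiers=[memory_tier, disk_tier])
    for key in (1, 1, 2, 3):
        store.lookup(key)
    print(store.hits, store.misses)
```

In this sketch, frequently requested keys are served from the cache while rare keys fall through to the slower tiers, which is the general motivation for a cache-plus-hierarchy design; a real system would batch lookups and refresh cached entries asynchronously.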

Code & Models

Repositories

nvidia-merlin/hugectr (official)
