Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, and Mark Hempstead

arXiv:2011.02084·cs.DC·November 13, 2020

TL;DR

This paper characterizes distributed inference for terabyte-scale deep learning recommendation models. It reports marginal latency overhead (P99 latency increased by only 1%) and potential resource-efficiency gains, findings intended to guide system design for scalable recommendation serving.

Contribution

The paper characterizes scale-out recommendation inference on data-center serving infrastructure, evaluates strategies for distributing embeddings across servers, and identifies the resulting trade-offs, providing foundational insights for building efficient distributed serving solutions.

Findings

01

Marginal latency overhead with distributed inference (P99 latency increased by only 1%)

02

Latency and compute overheads are mainly due to embedding distribution and input sparsity

03

Distributed inference can improve resource efficiency in data-center recommendation serving

Abstract

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers. This work is a first step for the systems research community to develop novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure. This work specifically explores latency-bounded inference systems, compared to the throughput-oriented training…
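
As a rough illustration of dividing one model's memory footprint across several servers, the sketch below greedily assigns embedding tables to whichever server currently holds the least data. It is a toy example under stated assumptions, not the placement scheme evaluated in the paper: the function name, table names, and sizes are all hypothetical.

```python
import heapq


def shard_tables_by_capacity(table_sizes_gb, num_servers):
    """Greedy, capacity-balanced placement of embedding tables (illustrative only).

    table_sizes_gb: dict mapping a hypothetical table name to its size in GB.
    Returns (placement, loads): table name -> server id, and GB held per server.
    """
    # Min-heap of (gb_used, server_id), so the least-loaded server pops first.
    heap = [(0.0, server) for server in range(num_servers)]
    heapq.heapify(heap)

    placement = {}
    # Placing the largest tables first tends to give a more even split.
    for name, size in sorted(table_sizes_gb.items(), key=lambda kv: kv[1], reverse=True):
        used, server = heapq.heappop(heap)
        placement[name] = server
        heapq.heappush(heap, (used + size, server))

    # Recompute per-server load from the final placement.
    loads = [0.0] * num_servers
    for name, server in placement.items():
        loads[server] += table_sizes_gb[name]
    return placement, loads


if __name__ == "__main__":
    # Hypothetical table sizes; real models hold far more tables than this.
    sizes = {"user_id": 40.0, "item_id": 30.0, "category": 20.0, "region": 10.0}
    placement, loads = shard_tables_by_capacity(sizes, num_servers=2)
    print(placement)
    print(loads)
```

This heuristic balances memory capacity only. It ignores lookup frequency and network cost, which matter for the latency and compute overheads that the findings attribute to embedding distribution and input sparsity.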
