Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, and Mark Hempstead

TL;DR
This paper investigates distributed inference for large-scale deep learning recommendation models, showing minimal latency overhead and potential efficiency gains, thus guiding future system design for scalable recommendation serving.
Contribution
It characterizes scale-out recommendation inference, evaluates embedding strategies, and identifies trade-offs, providing foundational insights for developing efficient distributed serving solutions.
Findings
Marginal latency overhead with distributed inference (P99 latency increased by only 1%)
Latency and compute overheads are mainly due to embedding distribution and input sparsity
Distributed inference can improve resource efficiency in data-center recommendation serving
Abstract
Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes, which load an entire model onto a single server, cannot support this scale. One way to support it is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers. This work is a first step for the systems research community toward developing novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure. This work specifically explores latency-bounded inference systems, compared to the throughput-oriented training…
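To make the idea of dividing a model's memory requirements across servers concrete, here is a minimal sketch of capacity-driven placement of embedding tables. It is an illustration only: this excerpt does not describe the paper's sharding strategies, so the greedy largest-first heuristic, the function name `shard_tables`, and the table names and sizes are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def shard_tables(
    table_sizes_gb: Dict[str, float], num_servers: int
) -> Tuple[List[List[str]], List[float]]:
    """Assign embedding tables to servers with a greedy heuristic.

    Hypothetical illustration, not the paper's method: tables are placed
    largest first, each onto the server with the least memory used so far.
    Returns the table names per server and the memory used per server (GB).
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")

    shards: List[List[str]] = [[] for _ in range(num_servers)]
    loads: List[float] = [0.0] * num_servers

    # Largest tables first, so the big ones are spread out before the small
    # ones fill in the remaining gaps.
    by_size = sorted(table_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size_gb in by_size:
        target = loads.index(min(loads))  # least-loaded server so far
        shards[target].append(name)
        loads[target] += size_gb

    return shards, loads


if __name__ == "__main__":
    # Made-up table sizes in GB, chosen only to exercise the function.
    tables = {"user_id": 400.0, "item_id": 300.0, "category": 200.0, "geo": 100.0}
    placement, memory = shard_tables(tables, num_servers=2)
    for server, (names, used) in enumerate(zip(placement, memory)):
        print(f"server {server}: {names} ({used:.0f} GB)")
```

A placement like this balances only memory capacity. It ignores how often each table is accessed and the network cost of gathering results from several servers, which are the kinds of latency and compute overheads the findings above attribute to embedding distribution and input sparsity.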
