Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation

Liu Ke; Udit Gupta; Mark Hempstead; Carole-Jean Wu; Hsien-Hsin S. Lee; Xuan Zhang

arXiv:2203.07424·cs.DC·March 16, 2022

TL;DR

Hercules is a framework that optimizes personalized recommendation inference serving in heterogeneous datacenters, significantly improving throughput, resource utilization, and power efficiency through a two-stage optimization process.

Contribution

Hercules introduces a novel two-stage optimization with gradient-based search and heterogeneity-aware cluster provisioning for recommendation serving.

Findings

01

Up to 9.0x latency-bounded throughput improvement.

02

47.7% cluster capacity saving.

03

23.7% power reduction.

Abstract

Personalized recommendation is an important class of deep-learning applications that powers a large collection of internet services and consumes a considerable amount of datacenter resources. As the scale of production-grade recommendation systems continues to grow, optimizing their serving performance and efficiency in a heterogeneous datacenter is important and can translate into infrastructure capacity saving. In this paper, we propose Hercules, an optimized framework for personalized recommendation inference serving that targets diverse industry-representative models and cloud-scale heterogeneous systems. Hercules performs a two-stage optimization procedure - offline profiling and online serving. The first stage searches the large under-explored task scheduling space with a gradient-based search algorithm achieving up to 9.0x latency-bounded throughput improvement on individual…
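The "latency-bounded throughput" metric mentioned above can be illustrated with a minimal sketch: among profiled configurations, pick the one with the highest throughput whose tail latency still meets a latency target. This is not the gradient-based search from the paper; it is a plain exhaustive selection over a handful of points. The `Measurement` type, the `batch_size` knob, the 50 ms target, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Measurement:
    """One profiled configuration (all fields are hypothetical)."""
    batch_size: int          # assumed scheduling knob, for illustration only
    throughput_qps: float    # made-up queries per second
    p99_latency_ms: float    # made-up tail latency in milliseconds


def latency_bounded_throughput(
    measurements: Iterable[Measurement], sla_ms: float
) -> Optional[Measurement]:
    """Return the highest-throughput measurement whose tail latency
    is within the target, or None if no configuration qualifies."""
    feasible = [m for m in measurements if m.p99_latency_ms <= sla_ms]
    return max(feasible, key=lambda m: m.throughput_qps, default=None)


# Illustrative numbers only; they are not taken from the paper.
profile = [
    Measurement(batch_size=16, throughput_qps=1200.0, p99_latency_ms=8.0),
    Measurement(batch_size=64, throughput_qps=3100.0, p99_latency_ms=19.0),
    Measurement(batch_size=256, throughput_qps=5200.0, p99_latency_ms=47.0),
    Measurement(batch_size=1024, throughput_qps=6400.0, p99_latency_ms=130.0),
]

best = latency_bounded_throughput(profile, sla_ms=50.0)
```

With these made-up numbers, the largest batch size has the highest raw throughput but violates the 50 ms target, so the selection falls back to the best configuration that stays within it. A real scheduling space is far larger than four points, which is presumably why the paper uses a search algorithm rather than enumeration.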
