DisaggRec: Architecting Disaggregated Systems for Large-Scale Personalized Recommendation

Liu Ke; Xuan Zhang; Benjamin Lee; G. Edward Suh; Hsien-Hsin S. Lee

arXiv:2212.00939·cs.DC·December 5, 2022·5 cites

TL;DR

DisaggRec introduces a disaggregated system architecture for large-scale recommendation serving, significantly reducing total cost of ownership and improving resource utilization and system reliability in evolving datacenter environments.

Contribution

It presents DisaggRec, a novel disaggregated system that enables independent scaling of compute and memory, reducing TCO and enhancing flexibility for recommendation workloads.

Findings

01. Up to 49.3% TCO reduction with disaggregation.

02. DisaggRec achieves 21%-43.6% TCO savings with new hardware.

03. Resource idleness in monolithic servers wastes up to 30% TCO.

Abstract

Deep learning-based personalized recommendation systems are widely used for online user-facing services in production datacenters, where a large amount of hardware resources are procured and managed to reliably provide low-latency services without disruption. As the recommendation models continue to evolve and grow in size, our analysis projects that datacenters deployed with monolithic servers will spend up to 12.4x total cost of ownership (TCO) to meet the requirement of model size and complexity over the next three years. Moreover, through in-depth characterization, we reveal that the monolithic server-based cluster suffers resource idleness and wastes up to 30% TCO by provisioning resources in fixed proportions. To address this challenge, we propose DisaggRec, a disaggregated system for large-scale recommendation serving. DisaggRec achieves the independent decoupled scaling-out of…
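To make the abstract's point about fixed-proportion provisioning concrete, the following is a minimal sketch in Python, not the paper's cost model. Every name and number in it (`monolithic_servers`, `disaggregated_nodes`, the demand figures, the per-unit costs) is a hypothetical chosen for illustration, and it ignores network, power, and reliability costs that a real TCO analysis would include. It shows only why a cluster that must buy compute and memory bundled together can end up with idle resources when a workload needs them in a different ratio.

```python
import math


def monolithic_servers(compute_demand, memory_demand, cpu_per_server, mem_per_server):
    """Servers needed when each server bundles compute and memory in a fixed ratio.

    The cluster must satisfy whichever resource is the bottleneck, so the
    other resource is over-provisioned.
    """
    return max(
        math.ceil(compute_demand / cpu_per_server),
        math.ceil(memory_demand / mem_per_server),
    )


def disaggregated_nodes(compute_demand, memory_demand, cpu_per_node, mem_per_node):
    """Compute-node and memory-node counts when the two pools scale independently."""
    return (
        math.ceil(compute_demand / cpu_per_node),
        math.ceil(memory_demand / mem_per_node),
    )


# Hypothetical workload: memory-heavy relative to the server's fixed ratio.
COMPUTE_DEMAND, MEMORY_DEMAND = 400, 1000   # abstract resource units (assumed)
CPU_PER_NODE, MEM_PER_NODE = 10, 10         # units per server or node (assumed)
COST_PER_CPU, COST_PER_MEM = 3, 1           # cost per unit (assumed)

n_mono = monolithic_servers(COMPUTE_DEMAND, MEMORY_DEMAND, CPU_PER_NODE, MEM_PER_NODE)
n_cpu, n_mem = disaggregated_nodes(COMPUTE_DEMAND, MEMORY_DEMAND, CPU_PER_NODE, MEM_PER_NODE)

# Cost of buying bundled servers versus buying each pool separately.
cost_mono = n_mono * (CPU_PER_NODE * COST_PER_CPU + MEM_PER_NODE * COST_PER_MEM)
cost_disagg = n_cpu * CPU_PER_NODE * COST_PER_CPU + n_mem * MEM_PER_NODE * COST_PER_MEM

# Fraction of purchased compute that sits idle in the monolithic cluster.
idle_compute = 1 - COMPUTE_DEMAND / (n_mono * CPU_PER_NODE)
savings = (cost_mono - cost_disagg) / cost_mono

print(f"monolithic servers: {n_mono}, idle compute fraction: {idle_compute:.2f}")
print(f"disaggregated nodes: {n_cpu} compute + {n_mem} memory")
print(f"relative cost difference in this toy setup: {savings:.2f}")
```

The figures this toy setup produces depend entirely on the assumed demands and costs and should not be read as reproducing the paper's reported savings.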

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Caching and Content Delivery