Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

Zhaoxia (Summer) Deng; Jongsoo Park; Ping Tak Peter Tang; Haixin Liu; Jie (Amy) Yang; Hector Yuen; Jianyu Huang; Daya Khudia; Xiaohan Wei; Ellie Wen; Dhruv Choudhary; Raghuraman Krishnamoorthi; Carole-Jean Wu; Satish Nadathur; Changkyu Kim; Maxim Naumov; Sam Naghshineh; Mikhail Smelyanskiy

arXiv:2105.12676 · cs.LG · May 27, 2021

Open Access

TL;DR

This paper discusses how low-precision hardware architectures can be effectively used for large-scale recommendation model inference, highlighting strategies, optimizations, and tools developed to maintain accuracy and efficiency in production environments.

Contribution

The paper presents practical methods and tools for adapting recommendation models to low-precision hardware, enabling deployment of more complex models with improved efficiency in industry.

Findings

01

Low-precision architectures can handle complex recommendation models effectively.

02

Significant hardware and software optimizations are necessary for accuracy preservation.

03

Low-precision hardware saves datacenter capacity while enabling deployment of models with up to 5X complexity.

Abstract

The tremendous success of machine learning (ML) and the unabated growth in ML model complexity have motivated many ML-specific designs in both CPU and accelerator architectures to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. These architectures often exhibit impressive compute throughput on benchmark ML models. Nevertheless, production models such as the recommendation systems important to Facebook's personalization services are demanding and complex: these systems must serve billions of users per month responsively with low latency while maintaining high prediction accuracy, notwithstanding computations with many tens of billions of parameters per inference. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Stochastic Gradient Optimization Techniques