Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale
Zhaoxia (Summer) Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu, Jie (Amy) Yang, Hector Yuen, Jianyu Huang, Daya Khudia, Xiaohan Wei, Ellie Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole-Jean Wu, Satish Nadathur, Changkyu Kim, Maxim Naumov, Sam Naghshineh

TL;DR
This paper discusses how low-precision hardware architectures can be effectively used for large-scale recommendation model inference, highlighting strategies, optimizations, and tools developed to maintain accuracy and efficiency in production environments.
Contribution
The paper presents practical methods and tools for adapting recommendation models to low-precision hardware, enabling deployment of more complex models with improved efficiency in industry.
Findings
Low-precision architectures can handle complex recommendation models effectively.
Significant hardware and software optimizations are necessary for accuracy preservation.
Low-precision deployment saved data center capacity while enabling models with up to 5X the complexity.
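To make the accuracy concern in the findings concrete, here is a minimal sketch of generic 8-bit affine quantization applied to a toy embedding table. It is not the paper's method, which this page does not describe. The function names (`quantize_8bit`, `dequantize_8bit`, `max_abs_error`), the toy values, and the comparison between one shared scale and one scale per row are all assumptions chosen for illustration.

```python
def quantize_8bit(values):
    """Affine 8-bit quantization of a list of floats (illustrative only).

    Returns (codes, scale, offset), where each code is an int in [0, 255]
    and a value is approximated by code * scale + offset.
    """
    lo, hi = min(values), max(values)
    # A constant vector would give a zero scale, so fall back to 1.0.
    scale = (hi - lo) / 255.0 or 1.0
    codes = [min(255, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_8bit(codes, scale, offset):
    """Map 8-bit codes back to approximate float values."""
    return [c * scale + offset for c in codes]


def max_abs_error(original, restored):
    """Largest absolute difference between two equal-length lists."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical two-row "embedding table" whose rows differ in magnitude.
small_row = [0.001 * i for i in range(-8, 8)]
large_row = [1.0 * i for i in range(-8, 8)]

# Option A: one scale and offset shared by the whole table.
flat = small_row + large_row
codes, scale, offset = quantize_8bit(flat)
restored = dequantize_8bit(codes, scale, offset)
tablewide_err = max_abs_error(small_row, restored[:len(small_row)])

# Option B: a separate scale and offset for each row.
row_codes, row_scale, row_offset = quantize_8bit(small_row)
rowwise_err = max_abs_error(
    small_row, dequantize_8bit(row_codes, row_scale, row_offset)
)

print("small-row error, shared scale: ", tablewide_err)
print("small-row error, per-row scale:", rowwise_err)
```

With a shared scale, the large row dictates the step size and the small row loses most of its resolution; a per-row scale keeps the rounding error within half a step of that row's own range. This is one simple reason that choosing quantization granularity can matter for accuracy.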
Abstract
The tremendous success of machine learning (ML) and the unabated growth in ML model complexity have motivated many ML-specific designs in both CPU and accelerator architectures to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Impressive compute throughputs are indeed often exhibited by these architectures on benchmark ML models. Nevertheless, production models such as recommendation systems important to Facebook's personalization services are demanding and complex: these systems must serve billions of users per month responsively with low latency while maintaining high prediction accuracy, notwithstanding computations with many tens of billions of parameters per inference. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Stochastic Gradient Optimization Techniques
