Quantized Inference for OneRec-V2
Yi Su, Xinchen Luo, Hongtao Cheng, Ziteng Shu, Yunfeng Zhao, Fangyu Zhang, Jiaqiang Liu, Xiao Liang, Yiwu Liu, Ruiming Tang

TL;DR
This paper demonstrates that low-precision FP8 quantization can significantly accelerate large-scale recommender system inference without quality loss, by leveraging the more controlled statistics and compute-intensive nature of modern recommendation models like OneRec-V2.
Contribution
The paper introduces an FP8 post-training quantization framework tailored for recommender systems, achieving substantial latency reduction and throughput increase while maintaining model quality.
Findings
49% reduction in inference latency
92% increase in throughput
No degradation in core metrics during online testing
Abstract
Quantized inference has demonstrated substantial system-level benefits in large language models while preserving model quality. In contrast, reliably applying low-precision quantization to recommender systems remains challenging in industrial settings. This difficulty arises from differences in training paradigms, architectural patterns, and computational characteristics, which lead to distinct numerical behaviors in weights and activations. Traditional recommender models often exhibit high-magnitude and high-variance weights and activations, making them more sensitive to quantization-induced perturbations. In addition, recommendation workloads frequently suffer from limited hardware utilization, limiting the practical gains of low-precision computation. In this work, we revisit low-precision inference in the context of generative recommendation. Through empirical distribution analysis,…
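The sensitivity described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's framework: it assumes the E4M3 variant of FP8 (3 mantissa bits, maximum finite value 448) and simple per-tensor absolute-maximum scaling, and the helper names `round_to_e4m3`, `fake_quantize`, and `mean_abs_error` are hypothetical. It shows how one high-magnitude value stretches the scale and pushes the remaining values into the coarse low end of the FP8 range.

```python
import math

# Assumption: E4M3 FP8 format (3 mantissa bits, largest finite value 448).
FP8_E4M3_MAX = 448.0
MIN_NORMAL_EXP = -6   # smallest normal exponent in E4M3
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest value on a simulated E4M3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)          # saturate instead of overflowing
    _, e = math.frexp(mag)                   # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)         # clamp into the subnormal range
    step = 2.0 ** (exp - MANTISSA_BITS)      # grid spacing at this exponent
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def fake_quantize(values):
    """Quantize then dequantize with one per-tensor abs-max scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / FP8_E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in values]


def mean_abs_error(values):
    """Mean absolute error introduced by the quantize/dequantize round trip."""
    deq = fake_quantize(values)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)


# Illustrative data only: a well-behaved tensor and the same tensor
# with a single large outlier appended.
controlled = [0.5, -0.25, 0.1, 0.75]
with_outlier = controlled + [50000.0]

print(mean_abs_error(controlled))
print(mean_abs_error(with_outlier))
```

In this toy setup the outlier sets the scale, so the small entries land in the subnormal region and one of them rounds to zero. This is the kind of quantization-induced perturbation the abstract attributes to high-magnitude, high-variance weights and activations.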
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Advanced Neural Network Applications · Multimodal Machine Learning Applications
