Quantized Inference for OneRec-V2

Yi Su; Xinchen Luo; Hongtao Cheng; Ziteng Shu; Yunfeng Zhao; Fangyu Zhang; Jiaqiang Liu; Xiao Liang; Yiwu Liu; Ruiming Tang

arXiv:2603.11486·cs.IR·March 13, 2026

TL;DR

This paper demonstrates that low-precision FP8 quantization can significantly accelerate large-scale recommender system inference without quality loss, by leveraging the more controlled statistics and compute-intensive nature of modern recommendation models like OneRec-V2.

Contribution

The paper introduces an FP8 post-training quantization framework tailored to recommender systems, reducing inference latency and increasing throughput while maintaining model quality.

Findings

01. 49% reduction in inference latency

02. 92% increase in throughput

03. No degradation in core metrics during online testing

Abstract

Quantized inference has demonstrated substantial system-level benefits in large language models while preserving model quality. In contrast, reliably applying low-precision quantization to recommender systems remains challenging in industrial settings. This difficulty arises from differences in training paradigms, architectural patterns, and computational characteristics, which lead to distinct numerical behaviors in weights and activations. Traditional recommender models often exhibit high-magnitude and high-variance weights and activations, making them more sensitive to quantization-induced perturbations. In addition, recommendation workloads frequently suffer from limited hardware utilization, limiting the practical gains of low-precision computation. In this work, we revisit low-precision inference in the context of generative recommendation. Through empirical distribution analysis,…
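The abstract's claim that high-magnitude, high-variance tensors are more sensitive to quantization can be illustrated with a small simulation. The sketch below is not the paper's framework, whose details are not given in this excerpt. It assumes the FP8 E4M3 format (maximum finite value 448, 3 mantissa bits) and simple per-tensor absmax scaling, and the helper names `round_to_e4m3`, `fake_quant_per_tensor`, and `rel_error` are hypothetical. It compares the round-trip error on a well-behaved tensor with the error on the same tensor after a single large outlier is added.

```python
import numpy as np

E4M3_MAX = 448.0      # largest finite value in FP8 E4M3 (assumed format)
E4M3_MIN_EXP = -6     # exponent of the smallest normal value, 2**-6
MANTISSA_BITS = 3     # explicit mantissa bits in E4M3


def round_to_e4m3(x):
    """Simulate rounding to the nearest E4M3 value (saturating, no NaN handling)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(x)                       # x = m * 2**exp, 0.5 <= |m| < 1
    e = np.maximum(exp - 1, E4M3_MIN_EXP)      # tiny values share the subnormal grid
    step = np.ldexp(1.0, e - MANTISSA_BITS)    # spacing of representable values
    return np.round(x / step) * step


def fake_quant_per_tensor(x):
    """Quantize and dequantize with one absmax scale for the whole tensor."""
    scale = np.max(np.abs(x)) / E4M3_MAX
    return round_to_e4m3(x / scale) * scale


def rel_error(ref, approx):
    """Relative L2 error of approx with respect to ref."""
    return float(np.linalg.norm(ref - approx) / np.linalg.norm(ref))


rng = np.random.default_rng(0)
bulk = rng.standard_normal(4096)        # well-behaved values
with_outlier = np.append(bulk, 1e6)     # same values plus one extreme entry

err_controlled = rel_error(bulk, fake_quant_per_tensor(bulk))
# Measure error on the same bulk entries, ignoring the outlier itself.
err_outlier = rel_error(bulk, fake_quant_per_tensor(with_outlier)[:-1])

print(f"relative error, controlled tensor: {err_controlled:.4f}")
print(f"relative error, tensor with outlier: {err_outlier:.4f}")
```

In this toy setting the outlier inflates the shared scale so much that most ordinary values fall below the smallest FP8 step and round to zero. That is one plausible reading of why more controlled weight and activation statistics make low-precision inference easier to apply.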

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Advanced Neural Network Applications · Multimodal Machine Learning Applications