NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations

Yejing Wang; Shengyu Zhou; Jinyu Lu; Ziwei Liu; Langming Liu; Maolin Wang; Wenlin Zhang; Feng Li; Wenbo Su; Pengjie Wang; Jian Xu; Xiangyu Zhao

arXiv:2511.18793 · cs.AI · February 4, 2026

Open Access

TL;DR

NEZHA is a novel decoding architecture for generative recommendation systems that significantly reduces inference latency without compromising quality, enabling real-time industrial applications.

Contribution

NEZHA introduces a self-drafting autoregressive head and a hash set verifier, achieving hyperspeed decoding for LLM-based recommendation.
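
This summary does not say how the hash set verifier works. As a minimal sketch of one plausible reading, assume each item is identified by a tuple of semantic-ID tokens and that a drafted candidate is accepted only if its tuple appears in a precomputed set of valid items. The names `SemanticId`, `build_valid_id_set`, and `verify_drafts` are hypothetical and do not come from the paper.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical representation: each item is identified by a tuple of
# semantic-ID tokens. This encoding is an assumption for illustration only.
SemanticId = Tuple[int, ...]


def build_valid_id_set(catalog: Iterable[SemanticId]) -> Set[SemanticId]:
    """Precompute a hash set of every token tuple that maps to a real item."""
    return set(catalog)


def verify_drafts(
    drafts: List[SemanticId], valid_ids: Set[SemanticId]
) -> List[SemanticId]:
    """Keep drafted candidates that name a real item, preserving draft order.

    Each check is an average-case O(1) set lookup, so verification needs
    no extra model forward pass. Duplicate drafts are dropped.
    """
    seen: Set[SemanticId] = set()
    accepted: List[SemanticId] = []
    for cand in drafts:
        if cand in valid_ids and cand not in seen:
            seen.add(cand)
            accepted.append(cand)
    return accepted


if __name__ == "__main__":
    valid = build_valid_id_set([(3, 17, 42), (3, 17, 43), (8, 1, 5)])
    drafted = [(3, 17, 42), (3, 99, 0), (8, 1, 5), (3, 17, 42)]
    print(verify_drafts(drafted, valid))  # [(3, 17, 42), (8, 1, 5)]
```

The design point this illustrates is that a set lookup replaces a model-based verifier: its cost does not grow with model size, which is consistent with the latency concern raised in the abstract. How the actual system builds and queries its set may differ.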

Findings

01. Achieves hyperspeed decoding without quality loss

02. Successfully deployed on Taobao, boosting advertising revenue

03. Serves hundreds of millions of users daily

Abstract

Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, its practical application is severely hindered by high inference latency, which makes it infeasible for high-throughput, real-time services and limits its overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically rely on separate draft models and model-based verifiers, which require additional training and add latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary…

Taxonomy

Topics: Recommender Systems and Techniques · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)