RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference

Jiarui Wang; Huichao Chai; Yuanhang Zhang; Zongjin Zhou; Wei Guo; Xingkun Yang; Qiang Tang; Bo Pan; Jiawei Zhu; Ke Cheng; Yuting Yan; Shulan Wang; Yingjie Zhu; Zhengfan Yuan; Jiaqi Huang; Yuhan Zhang; Xiaosong Sun; Zhinan Zhang; Hong Zhu; Yongsheng Zhang; Tiantian Dong; Zhong Xiao; Deliang Liu; Chengzhou Lu; Yuan Sun; Zhiyuan Chen; Xinming Han; Zaizhu Liu; Yaoyuan Wang; Ziyang Zhang; Yong Liu; Jinxin Xu; Yajing Sun; Zhoujun Yu; Wenting Zhou; Qidong Zhang; Zhengyong Zhang; Zhonghai Gu; Yibo Jin; Yongxiang Feng; Pengfei Zuo

arXiv:2601.01712·cs.DC·January 6, 2026

Open Access

TL;DR

RelayGR is a scalable system that pre-infers long user behavior sequences for generative recommendation, reducing latency and increasing throughput by efficiently caching and reusing inference results across pipeline stages.

Contribution

The paper introduces RelayGR, a novel production system that enables cross-stage relay-race inference for generative recommendation, addressing long sequence processing challenges at industrial scale.

Findings

01. Supports up to 1.5× longer sequences under a fixed P99 SLO.

02. Improves SLO-compliant throughput by up to 3.6×.

03. Effectively manages cache footprint and resource utilization.

Abstract

Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate…
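
The core idea in the abstract, computing the item-independent user-behavior prefix once and reusing it at ranking time, can be illustrated with a toy sketch. This is not RelayGR's implementation: `encode_prefix`, `score_candidate`, and `PrefixCache` are hypothetical names, the "model" is a trivial arithmetic stand-in, and the cache is a single in-process dictionary rather than a store that spans pipeline stages and devices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical stand-in for the item-independent part of a GR model.
# It folds a user's behavior tokens into a small reusable state.
PrefixState = Tuple[int, int]  # (sum of tokens, number of tokens)


def encode_prefix(behavior_tokens: List[int]) -> PrefixState:
    """Toy 'expensive' prefix computation over the behavior sequence."""
    return (sum(behavior_tokens), len(behavior_tokens))


def score_candidate(state: PrefixState, candidate_item: int) -> float:
    """Toy item-dependent step: closer to the mean behavior scores higher."""
    total, count = state
    mean = total / count if count else 0.0
    return -abs(mean - candidate_item)


@dataclass
class PrefixCache:
    """Assumed in-memory cache keyed by user id (illustrative only)."""
    store: Dict[str, PrefixState] = field(default_factory=dict)
    prefix_computations: int = 0  # counts how often the prefix was computed

    def pre_infer(self, user_id: str, behavior_tokens: List[int]) -> None:
        # Runs before ranking, off the ranking critical path.
        if user_id not in self.store:
            self.store[user_id] = encode_prefix(behavior_tokens)
            self.prefix_computations += 1

    def rank(self, user_id: str, behavior_tokens: List[int],
             candidates: List[int]) -> List[int]:
        state = self.store.get(user_id)
        if state is None:
            # Cache miss: fall back to computing the prefix during ranking.
            state = encode_prefix(behavior_tokens)
            self.prefix_computations += 1
        # Only the item-dependent scoring runs once per candidate.
        return sorted(candidates,
                      key=lambda item: score_candidate(state, item),
                      reverse=True)


cache = PrefixCache()
history = [10, 20, 30]
cache.pre_infer("u1", history)                  # earlier pipeline stage
ranked = cache.rank("u1", history, [5, 19, 40])  # ranking stage reuses the prefix
print(ranked, cache.prefix_computations)
```

In this sketch the prefix is computed once however many candidates are scored, which is the reuse the abstract describes. The difficulties the abstract goes on to list, such as keeping the cached prefix available across stages and fitting the cache footprint across devices, are deliberately left out.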

Taxonomy

Topics: Caching and Content Delivery · Recommender Systems and Techniques · Advanced Neural Network Applications