RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching

Zhan Zhao; Yuxin Wang; Amelie Chi Zhou

arXiv:2605.07443 · cs.DC · May 11, 2026

TL;DR

RcLLM is a distributed inference system that accelerates generative recommendation by decomposing prompts into reusable blocks, so that KV caches can be reused beyond a shared prefix, and by reducing redundant attention computation. It cuts Time-To-First-Token (TTFT) by up to 9.51x.

Contribution

The paper proposes RcLLM, a novel system that enhances generative recommendation efficiency through Beyond-Prefix KV Caching and stratified distributed storage, enabling real-time deployment.

Findings

1. Reduces Time-To-First-Token (TTFT) by up to 9.51x.

2. Supports large item catalogs with a stratified distributed storage design.

3. Maintains recommendation accuracy while significantly improving latency.

Abstract

Large Language Models (LLMs) are transforming recommendation from ranking into a generative task, but industrial deployment remains limited by the high latency of processing long, personalized prompts. Standard prefix caching provides limited benefit because reuse in recommendation workloads is often non-contiguous across user histories and item contexts. We present RcLLM, a distributed inference system for generative recommendation with Beyond-Prefix KV Caching. RcLLM decomposes prompts into reusable blocks and supports large item catalogs with a stratified distributed storage design: compact user-history caches are replicated for zero-latency retrieval, while massive item caches are sharded using similarity-aware placement. To reduce redundant quadratic attention computation, RcLLM combines an affinity-based global scheduler that improves data locality with a selective attention…
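To make the contrast with prefix caching concrete, here is a minimal Python sketch. It is not RcLLM's implementation. It assumes a prompt is already split into blocks (system text, user history, item descriptions) and that each block's KV cache is keyed by a content hash. The names `block_key`, `prefix_hits`, `block_hits`, and `pick_worker` are hypothetical. The sketch only counts cache hits and picks the worker with the most locally cached blocks. It does not model how cached KV tensors are corrected for position or cross-block attention.

```python
import hashlib
from typing import Dict, Sequence, Set, Tuple

Block = Tuple[int, ...]  # token ids of one prompt block (hypothetical unit)


def block_key(block: Block) -> str:
    """Content hash of a block, independent of where it sits in the prompt."""
    return hashlib.sha256(repr(block).encode("utf-8")).hexdigest()


def prefix_hits(prompt: Sequence[Block], cached_prompt: Sequence[Block]) -> int:
    """Blocks reusable under prefix caching: only the leading matching run."""
    hits = 0
    for new, old in zip(prompt, cached_prompt):
        if new != old:
            break
        hits += 1
    return hits


def block_hits(prompt: Sequence[Block], store: Set[str]) -> int:
    """Blocks reusable when lookup is by content, regardless of position."""
    return sum(1 for block in prompt if block_key(block) in store)


def pick_worker(prompt: Sequence[Block], shards: Dict[str, Set[str]]) -> str:
    """Assumed affinity rule: route to the worker caching the most blocks.

    Ties go to the lexicographically smallest worker name.
    """
    return max(sorted(shards), key=lambda w: block_hits(prompt, shards[w]))


# Hypothetical blocks: one system prompt, two user histories, two items.
SYSTEM, HIST_U1, HIST_U2 = (1, 2, 3), (10, 11), (20, 21)
ITEM_A, ITEM_B = (100, 101), (200, 201)

earlier = [SYSTEM, HIST_U1, ITEM_A, ITEM_B]   # previously served prompt
incoming = [SYSTEM, HIST_U2, ITEM_B, ITEM_A]  # new user, same items, reordered
store = {block_key(b) for b in earlier}

shards = {
    "worker-0": {block_key(ITEM_A), block_key(ITEM_B)},
    "worker-1": {block_key(SYSTEM)},
}

print(prefix_hits(incoming, earlier))  # 1: reuse stops at the first mismatch
print(block_hits(incoming, store))     # 3: system block and both item blocks
print(pick_worker(incoming, shards))   # worker-0
```

In this toy case the differing user history breaks the prefix after one block, while content-keyed lookup still finds the two item blocks further along. This is the non-contiguous reuse pattern the abstract describes.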
