RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
Zhan Zhao, Yuxin Wang, Amelie Chi Zhou

TL;DR
RcLLM is a distributed inference system that accelerates generative recommendation by decomposing prompts into reusable blocks and caching their KV states beyond the shared prefix, reducing Time-To-First-Token (TTFT) by up to 9.51x.
Contribution
The paper proposes RcLLM, a novel system that enhances generative recommendation efficiency through Beyond-Prefix KV Caching and stratified distributed storage, enabling real-time deployment.
Findings
Reduces Time-To-First-Token (TTFT) by up to 9.51x.
Supports large item catalogs with a stratified distributed storage design.
Maintains recommendation accuracy while significantly improving latency.
Abstract
Large Language Models (LLMs) are transforming recommendation from ranking into a generative task, but industrial deployment remains limited by the high latency of processing long, personalized prompts. Standard prefix caching provides limited benefit because reuse in recommendation workloads is often non-contiguous across user histories and item contexts. We present RcLLM, a distributed inference system for generative recommendation with Beyond-Prefix KV Caching. RcLLM decomposes prompts into reusable blocks and supports large item catalogs with a stratified distributed storage design: compact user-history caches are replicated for zero-latency retrieval, while massive item caches are sharded using similarity-aware placement. To reduce redundant quadratic attention computation, RcLLM combines an affinity-based global scheduler that improves data locality with a selective attention…
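To illustrate why prefix caching helps little when reuse is non-contiguous, here is a minimal sketch that compares the tokens reusable under prefix matching with those reusable under block-level lookup. It is not RcLLM's implementation: the function names (`block_key`, `prefix_reuse`, `block_reuse`), the content-hash keying, and the toy token IDs are all assumptions made for illustration. It only counts candidate tokens for reuse and ignores the attention-side work needed to keep results accurate when cached blocks appear at new positions.

```python
import hashlib
from typing import List, Set, Tuple

Block = Tuple[int, ...]  # a prompt block as a tuple of token IDs (hypothetical layout)


def block_key(block: Block) -> str:
    """Content hash of one block, independent of where it sits in the prompt."""
    return hashlib.sha256(repr(block).encode("utf-8")).hexdigest()


def prefix_reuse(cached_prompt: List[Block], new_prompt: List[Block]) -> int:
    """Tokens reusable under (block-granular) prefix caching:
    only the leading run of identical blocks counts."""
    reused = 0
    for old, new in zip(cached_prompt, new_prompt):
        if old != new:
            break
        reused += len(new)
    return reused


def block_reuse(cached_keys: Set[str], new_prompt: List[Block]) -> int:
    """Tokens reusable when any previously seen block can be looked up by content."""
    return sum(len(b) for b in new_prompt if block_key(b) in cached_keys)


# Toy blocks: a shared instruction, two user histories, three item descriptions.
system = (1, 2, 3)
user_a = (10, 11, 12, 13)
user_b = (20, 21, 22, 23)
item_x = (30, 31)
item_y = (40, 41)
item_z = (50, 51)

cached_prompt = [system, user_a, item_x, item_y]
new_prompt = [system, user_b, item_y, item_x, item_z]

cached_keys = {block_key(b) for b in cached_prompt}

prefix_tokens = prefix_reuse(cached_prompt, new_prompt)
block_tokens = block_reuse(cached_keys, new_prompt)
total_tokens = sum(len(b) for b in new_prompt)

print(f"prefix reuse: {prefix_tokens}/{total_tokens} tokens")
print(f"block reuse:  {block_tokens}/{total_tokens} tokens")
```

In this toy request the new user's history breaks the prefix right after the shared instruction, so prefix matching stops there, while content-keyed lookup still finds the two item blocks that appear later and in a different order.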
