MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

Xin Wang; Chi Ma; Shaobin Chen; Pu Wang; Menglei Zhou; Junyi Qiu; Qiaorui Chen; Jiayu Sun; Shijie Liu; Zehuan Wang; Lei Yu; Chuan Liu; Fei Jiang; Wei Lin; Hao Wang; Jiawei Jiang; Xiao Yan

arXiv:2604.22881·cs.LG·April 28, 2026

TL;DR

MTServe is a hierarchical cache system that virtualizes GPU memory with host RAM to efficiently serve generative recommendation models, significantly reducing inference costs and maintaining high cache hit ratios.

Contribution

The paper introduces MTServe, a scalable cache management system with system-level optimizations for generative recommendation inference.

Findings

01

Achieves up to 3.1× speedup on both public and production datasets.

02

Maintains cache hit ratios above 98.5%.

03

Effectively virtualizes GPU memory using host RAM.

Abstract

Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1× speedup while maintaining near-perfect hit ratios (>98.5%).
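
To make the two-tier idea in the abstract concrete, the following is a minimal sketch of a hierarchical cache in which a small fast tier (standing in for GPU memory) spills into a larger tier (standing in for host RAM). It is not MTServe's implementation: the class name `TwoTierCache`, its methods, and the `compute_state` callback are all hypothetical, plain LRU is swapped in for the paper's locality-driven replacement policy, and the hybrid storage layout and asynchronous transfer pipeline are not modeled.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-tier cache (hypothetical, not MTServe's API).

    `gpu` models the small fast tier and `host` the larger backup tier.
    Both use plain LRU ordering, which is an assumption for illustration.
    """

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu = OrderedDict()
        self.host = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.hits = 0
        self.misses = 0

    def _put_gpu(self, user_id, state):
        # Insert into the fast tier as most recently used.
        self.gpu[user_id] = state
        self.gpu.move_to_end(user_id)
        # Demote least recently used entries to the backup tier.
        while len(self.gpu) > self.gpu_capacity:
            victim, victim_state = self.gpu.popitem(last=False)
            self.host[victim] = victim_state
            self.host.move_to_end(victim)
            # Drop the oldest backup entries once the backup tier is full.
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)

    def get(self, user_id, compute_state):
        """Return the cached state, recomputing it only on a full miss."""
        if user_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(user_id)
            return self.gpu[user_id]
        if user_id in self.host:
            # Hit in the backup tier: promote the entry to the fast tier.
            self.hits += 1
            state = self.host.pop(user_id)
            self._put_gpu(user_id, state)
            return state
        # Miss in both tiers: pay the full encoding cost.
        self.misses += 1
        state = compute_state(user_id)
        self._put_gpu(user_id, state)
        return state

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In this sketch, a request for a user whose state was demoted to the backup tier counts as a hit and avoids re-encoding, which is the effect the abstract attributes to using host RAM as a backup store. A real system would also have to hide the cost of moving the state back to the GPU, which this toy ignores.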
