One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
Wenjun Yu, Shuguang Han, Amelie Chi Zhou

TL;DR
HELM adaptively manages GPU HBM partitioning and request routing for generative recommender inference, significantly reducing latency and improving SLO satisfaction across diverse workloads.
Contribution
The paper introduces HELM, a runtime system with PPO-based adaptive memory allocation and request routing to optimize HBM usage in generative recommender serving.
Findings
Reduces P99 latency by 24-38% compared to static policies.
Achieves 93.5-99.6% SLO satisfaction across workloads.
Maintains decision latency of 32 microseconds with near-optimal memory ratios.
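The findings above are stated in terms of P99 latency and SLO satisfaction. The summary does not say how the paper computes these metrics, so the following is only a minimal sketch of one common convention (nearest-rank percentile, SLO satisfaction as the fraction of requests within a latency budget). The function names, the synthetic latencies, and the 95 ms budget are all illustrative assumptions, not values from the paper.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to at least 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def slo_satisfaction(samples, slo_ms):
    """Fraction of requests whose latency is within the SLO budget (hypothetical budget)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(1 for s in samples if s <= slo_ms) / len(samples)

def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline`, e.g. 0.3 means 30% lower."""
    return (baseline - improved) / baseline

# Synthetic per-request latencies in milliseconds (illustrative only).
latencies_ms = list(range(1, 101))
p99 = percentile_nearest_rank(latencies_ms, 99)
satisfied = slo_satisfaction(latencies_ms, slo_ms=95)
```

Under this convention, a "24-38% P99 reduction" would correspond to `relative_reduction(p99_static, p99_adaptive)` falling between 0.24 and 0.38, where both arguments are P99 values measured on the same workload.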
Abstract
Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30% latency improvement unrealized. While online reallocation is required to close this gap, naive approaches introduce H2D refill traffic on the critical path, causing P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves decision latency while…
