Fast State Restoration in LLM Serving with HCache
Shiwei Gao, Youmin Chen, Jiwu Shu

TL;DR
HCache restores evicted LLM state (the KV cache) from intermediate activations instead of recomputing it from the original tokens or reloading an offloaded copy. This lowers restoration latency and storage overhead by balancing computation against I/O.
Contribution
The paper introduces HCache, a new approach that restores LLM states from intermediate activations, improving efficiency over existing methods like KV offload and recomputation.
Findings
HCache reduces time to first token (TTFT) by up to 1.93X compared to KV offload.
HCache consumes 1.92-2.40X less storage space than KV offload.
HCache achieves up to a 5.73X TTFT reduction over token recomputation.
Abstract
The growing complexity of LLM usage today, e.g., multi-round conversation and retrieval-augmented generation (RAG), makes contextual states (i.e., KV cache) reusable across user requests. Given the capacity constraints of GPU memory, only a limited number of contexts can be cached on GPU for reusing. Existing inference systems typically evict part of the KV cache and restore it by recomputing it from the original tokens or offloading it to host storage for later retrieval, both of which introduce substantial computational or I/O overheads. We propose HCache, a novel LLM state restoration method. Its key idea is to restore LLM states from intermediate activations and thus utilize computational and I/O resources with low overhead. We enhance HCache with two techniques, including i) a bubble-free restoration scheduler that integrates resource-complementary methods to optimize the balance…
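The abstract states the key idea but not its mechanics. The following is a minimal sketch of what "restoring from intermediate activations" can look like, under assumptions that are not taken from the text above: the saved activation is the input to a layer's attention block, the keys and values are plain linear projections of that activation, positional encoding is ignored, and the key/value width equals the model width. All names (`restore_kv`, `w_k`, `w_v`, the toy sizes) are hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Build a rows x cols matrix of values drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def restore_kv(activation, w_k, w_v):
    """Rebuild K and V for one layer from a saved activation (assumed scheme).

    Only the two projections are re-run; no attention or feed-forward
    computation over the cached tokens is needed.
    """
    return matmul(activation, w_k), matmul(activation, w_v)


rng = random.Random(0)
n_tokens, d_model = 4, 8  # toy sizes, for illustration only

# Hypothetical projection weights of one layer (they stay loaded with the model).
w_k = rand_matrix(d_model, d_model, rng)
w_v = rand_matrix(d_model, d_model, rng)

# Intermediate activation for the cached tokens at this layer.
hidden = rand_matrix(n_tokens, d_model, rng)

# Prefill: K and V are produced from the activation.
k_cache, v_cache = restore_kv(hidden, w_k, w_v)

# Eviction: keep only a copy of the activation, drop K and V.
saved_activation = [row[:] for row in hidden]

# Restoration: reload the activation and re-project it.
restored_k, restored_v = restore_kv(saved_activation, w_k, w_v)

# Values that would have to be transferred under each option in this toy setup.
kv_floats = 2 * n_tokens * d_model
activation_floats = n_tokens * d_model

print(restored_k == k_cache and restored_v == v_cache)  # True
print(kv_floats / activation_floats)  # 2.0
```

In this toy setup the saved activation holds half as many values as K and V together, and restoring it costs two matrix products rather than a full forward pass over the tokens. That is the sense in which such a scheme sits between KV offload (I/O only) and token recomputation (compute only). The sketch makes no claim about how HCache itself lays out or schedules these steps.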
