Efficient LLM Inference with Kcache
Qiaozhi He, Zhihua Wu

TL;DR
This paper introduces KCache, a training-free technique that alleviates the memory overhead of the KV Cache during LLM inference. The authors report a 40% throughput improvement over the baseline while maintaining accuracy.
Contribution
KCache is proposed as a replacement for the standard KV Cache during inference. The authors argue that the full KV Cache is not necessary, and that KCache reduces memory pressure and raises throughput without any retraining of the model.
Findings
KCache improves the throughput of popular LLMs by 40% over the baseline.
KCache maintains model accuracy.
KCache alleviates the memory bottleneck introduced by the traditional KV Cache.
Abstract
Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures efficient sequence generation by caching previously computed KV states. However, it also introduces significant memory overhead. We discovered that KV Cache is not necessary and propose a novel KCache technique to alleviate the memory bottleneck during LLM inference. KCache can be used directly for inference without any training process. Our evaluations show that KCache improves the throughput of popular LLMs by 40% over the baseline while maintaining accuracy.
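To make the "significant memory overhead" concrete, here is a minimal sketch of the standard back-of-the-envelope estimate for KV Cache size. It is not taken from the paper. The function name `kv_cache_bytes`, its parameters, and the model dimensions in the example (32 layers, 32 heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures reported by the authors.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_value=2, tensors_per_token=2):
    """Estimate the memory needed to cache attention states.

    tensors_per_token=2 counts one K tensor and one V tensor per layer,
    which is what a conventional KV Cache stores. All arguments are
    hypothetical inputs chosen by the caller, not values from the paper.
    """
    return (tensors_per_token * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


if __name__ == "__main__":
    # Assumed dimensions, roughly the shape of a 7B-parameter decoder.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                          seq_len=4096, batch_size=1)
    print(f"{size / 2**30:.2f} GiB per sequence")
```

Under these assumed dimensions the cache comes to 2 GiB for a single 4096-token sequence, and it grows linearly with both batch size and sequence length, which is why the cache becomes a bottleneck for long-text workloads.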
Taxonomy
Topics: Algorithms and Data Compression
