Efficient LLM Inference with Kcache

Qiaozhi He; Zhihua Wu

arXiv:2404.18057 · cs.CL · April 30, 2024

TL;DR

This paper introduces KCache, a training-free alternative to the KV Cache that alleviates the memory bottleneck in LLM inference and improves throughput by 40% over the baseline while keeping accuracy.

Contribution

KCache replaces the conventional KV Cache during inference, reducing memory use and improving throughput. It can be applied directly to existing models with no retraining.

Findings

01

KCache improves the inference throughput of popular LLMs by 40% over the baseline.

02

KCache maintains model accuracy.

03

KCache alleviates the memory bottleneck introduced by the traditional KV Cache.

Abstract

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache is one of the most widely used techniques in the industry. It ensures efficient sequence generation by caching previously computed KV states. However, it also introduces significant memory overhead. We discovered that KV Cache is not necessary and propose a novel technique, KCache, to alleviate the memory bottleneck during LLM inference. KCache can be used directly for inference without any training process. Our evaluations show that KCache improves the throughput of popular LLMs by 40% over the baseline while keeping accuracy.
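To give a sense of the memory overhead the abstract refers to, here is a rough back-of-the-envelope sketch of how large a standard KV Cache can get. The function name `kv_cache_bytes` and the model shape used below (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from the paper, and the sketch models only the conventional KV Cache, not KCache itself.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for all layers.

    Hypothetical helper for illustration only; bytes_per_elem=2 assumes
    16-bit (fp16/bf16) storage.
    """
    # The factor 2 accounts for one key vector and one value vector
    # per token, per KV head, per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * n_layers * per_token_per_layer


if __name__ == "__main__":
    # Assumed, illustrative configuration (not taken from the paper).
    total = kv_cache_bytes(batch_size=8, seq_len=4096,
                           n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"{total / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Under these assumptions the cache grows linearly with both batch size and sequence length, which is why it tends to become the limiting factor for long-context, high-throughput serving.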

Taxonomy

Topics: Algorithms and Data Compression