Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

Zheng Wang; Boxiao Jin; Zhongzhi Yu; Minjia Zhang

arXiv:2407.08454·cs.CL·July 23, 2024·1 cites

TL;DR

This paper introduces KVMerger, an adaptive KV cache merging method that compresses cache data for large language models, enabling efficient long-context processing with limited memory while maintaining high performance.

Contribution

The paper presents a novel, dataset-independent KV cache merging algorithm that improves long-context LLM performance under memory constraints without significant accuracy loss.

Findings

01. KVMerger outperforms existing methods such as H2O and CaM on long-context tasks.

02. It maintains high performance at 50% and 35% KV cache budgets.

03. It is effective for models including Llama2-7B-chat and Llama2-13B-chat.
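
To give a rough sense of what the 50% and 35% budgets mean in memory terms, the sketch below estimates KV cache size for the two models named above. It is a back-of-the-envelope calculation, not something taken from the paper: it assumes fp16 storage, batch size 1, a 4096-token context, and that a "budget" is the fraction of tokens whose key/value states are retained. The helper names `kv_cache_bytes` and `budget_table` are hypothetical. The layer counts and hidden sizes are the published Llama 2 shapes.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_value=2, batch_size=1):
    """Bytes needed to store keys and values for seq_len cached tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    # bytes_per_value=2 assumes fp16 (an assumption, not stated on this page).
    return 2 * num_layers * hidden_size * seq_len * bytes_per_value * batch_size


# Llama 2 7B and 13B use standard multi-head attention, so the per-token
# KV width equals the hidden size.
MODELS = {
    "Llama2-7B-chat": {"num_layers": 32, "hidden_size": 4096},
    "Llama2-13B-chat": {"num_layers": 40, "hidden_size": 5120},
}


def budget_table(seq_len=4096, budgets=(1.0, 0.5, 0.35)):
    """Return (model, budget, tokens_kept, GiB) rows for each model and budget."""
    rows = []
    for name, cfg in MODELS.items():
        for budget in budgets:
            # Assumption: the budget is the fraction of cached tokens retained.
            kept = int(seq_len * budget)
            gib = kv_cache_bytes(cfg["num_layers"], cfg["hidden_size"], kept) / 2**30
            rows.append((name, budget, kept, gib))
    return rows


if __name__ == "__main__":
    for name, budget, kept, gib in budget_table():
        print(f"{name:16s} budget={budget:.2f} tokens={kept:5d} cache={gib:.3f} GiB")
```

Under these assumptions the full cache for Llama2-7B-chat at 4096 tokens is 2 GiB per sequence, so a 50% budget keeps it to 1 GiB. The cache grows linearly with both context length and batch size, which is why compression matters most in long-context serving.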

Abstract

How to efficiently serve Large Language Models (LLMs) has become a pressing issue because of their huge computational cost in their autoregressive generation process. To mitigate computational costs, LLMs often employ the KV Cache technique to improve the generation speed. While improving the computational efficiency, the storage requirements of the KV cache are substantial, particularly in long-context scenarios, leading to significant memory consumption. Existing KV cache eviction methods often degrade the performance of LLMs in long-context scenarios due to the information loss introduced by eviction. In this paper, we propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks without significant performance degradation under constrained memory budgets. Our approach is inspired by the intriguing observation that key…

Taxonomy

Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems

Methods: Sparse Evolutionary Training