KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Jiayi Yuan; Hongyi Liu; Shaochen Zhong; Yu-Neng Chuang; Songchen Li; Guanchu Wang; Duy Le; Hongye Jin; Vipin Chaudhary; Zhaozhuo Xu; Zirui Liu; Xia Hu

arXiv:2407.01527·cs.CL·October 10, 2024

TL;DR

This paper provides a comprehensive benchmark of various approaches to enable long context processing in large language models, revealing new insights and offering a valuable resource for future research.

Contribution

It introduces a taxonomy of methods and evaluates over ten approaches across multiple tasks, filling a gap in systematic benchmarking of long context techniques.

Findings

01. Revealed previously unknown phenomena in long context methods
02. Provided insights into the trade-offs of different approaches
03. Established a benchmark for future research in long context LLMs

Abstract

Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs. Multiple schools of efficiency-driven approaches, such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures, have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this…
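
To make the "growing size of the KV cache" concrete, here is a rough back-of-the-envelope sketch, not taken from the paper. It assumes a standard multi-head attention decoder that stores one key and one value vector per layer, per KV head, per token. The function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, and the low-bit figure ignores the scale and zero-point overhead that real quantization schemes add.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_value=16):
    """Approximate KV cache size in bytes (hypothetical helper).

    Counts one key and one value vector per layer, per KV head, per token.
    Ignores quantization metadata (scales, zero points) and allocator overhead.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


GIB = 1024 ** 3

# Assumed, illustrative model shape: 32 layers, 32 KV heads, head_dim 128.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

fp16_4k = kv_cache_bytes(seq_len=4096, **shape)
fp16_32k = kv_cache_bytes(seq_len=32768, **shape)
int4_32k = kv_cache_bytes(seq_len=32768, bits_per_value=4, **shape)
# Token dropping modeled crudely as caching only a quarter of the tokens.
dropped_32k = kv_cache_bytes(seq_len=32768 // 4, **shape)

for label, size in [("fp16, 4k tokens", fp16_4k),
                    ("fp16, 32k tokens", fp16_32k),
                    ("4-bit, 32k tokens", int4_32k),
                    ("fp16, 32k tokens, keep 25%", dropped_32k)]:
    print(f"{label}: {size / GIB:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length, and both quantization and token dropping shrink it by a constant factor; what each of them costs in task quality is the question the benchmark addresses.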

Code & Models

Repositories

henryzhongsc/longctx_bench
PyTorch · Official

Taxonomy

Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies