BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

Junqi Zhao; Zhijin Fang; Shu Li; Shaohui Yang; Shichao He

arXiv:2410.23079·cs.CL·October 31, 2024

TL;DR

BUZZ introduces a beehive-structured sparse key-value cache for large language models, significantly reducing memory usage and increasing inference speed while maintaining high accuracy in various NLP tasks.

Contribution

The paper presents BUZZ, a novel KV caching algorithm with a beehive structure that efficiently segments tokens to improve LLM inference performance.

Findings

01. Reduces cache memory by 2.5x in LLM inference.

02. Maintains over 99% accuracy in long-text summarization.

03. Outperforms state-of-the-art in multi-document question answering by 7.69%.
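
To give a rough sense of scale for the 2.5x figure, the sketch below estimates the size of a dense KV cache and what a 2.5x reduction would leave. The model shape (32 layers, 32 KV heads, head dimension 128, fp16, 4096 tokens, batch size 1) is an illustrative assumption for a 7B-class model, not a configuration taken from the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed for a dense KV cache of one sequence (batch size 1)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed, illustrative model shape (not from the paper).
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
reduced = full / 2.5  # the reported 2.5x reduction applied to this estimate

GIB = 1024 ** 3
print(f"dense cache:   {full / GIB:.2f} GiB")
print(f"after 2.5x:    {reduced / GIB:.2f} GiB")
```

Under these assumptions the dense cache is 2 GiB and the reduced cache is about 0.8 GiB per sequence; the actual savings depend on the model and sequence length used in the paper's experiments.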

Abstract

Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by $2.5 \times$ …
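
The abstract's two ingredients, a sliding window over recent tokens and per-chunk selection of important historical tokens, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_kept_indices`, the fixed `segment` size, the use of a single accumulated attention score per token, and the rule of keeping exactly one top-scoring token per segment are all hypothetical choices made for the example.

```python
from typing import List


def select_kept_indices(scores: List[float], window: int, segment: int) -> List[int]:
    """Return sorted token indices to keep in a sparse KV cache.

    scores  -- one importance score per cached token (assumed here to be
               accumulated attention; the paper may score tokens differently)
    window  -- number of most recent tokens that are always kept
    segment -- size of each chunk of historical tokens (assumed fixed)
    """
    if window < 0 or segment <= 0:
        raise ValueError("window must be >= 0 and segment must be > 0")

    n = len(scores)
    split = max(0, n - window)  # tokens before `split` are "history"

    kept: List[int] = []
    # Walk the history in chunks and keep the local heavy hitter of each chunk.
    for start in range(0, split, segment):
        end = min(start + segment, split)
        best = max(range(start, end), key=lambda i: scores[i])
        kept.append(best)

    # Always keep the sliding window of recent tokens.
    kept.extend(range(split, n))
    return kept


scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.5, 0.4, 0.7, 0.6]
print(select_kept_indices(scores, window=3, segment=3))  # [1, 4, 6, 7, 8, 9]
```

Picking the maximum within each chunk, rather than a global top-k, spreads the retained tokens across the whole history, which is one plausible reading of "important tokens in local neighborhoods".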


Code & Models

Repositories

junqizhao888/buzz-llm (PyTorch, official)


Taxonomy

Topics: Network Packet Processing and Optimization · Algorithms and Data Compression · Advanced Data Storage Technologies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings