BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He

TL;DR
BUZZ introduces a beehive-structured sparse key-value cache for large language models, significantly reducing memory usage and increasing inference speed while maintaining high accuracy in various NLP tasks.
Contribution
The paper presents BUZZ, a novel KV caching algorithm with a beehive structure that efficiently segments tokens to improve LLM inference performance.
Findings
Reduces cache memory by 2.5x in LLM inference.
Maintains over 99% accuracy in long-text summarization.
Outperforms state-of-the-art in multi-document question answering by 7.69%.
Abstract
Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by …
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNetwork Packet Processing and Optimization · Algorithms and Data Compression · Advanced Data Storage Technologies
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
