H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen

TL;DR
This paper introduces H$_2$O, a method that reduces KV cache memory in large language model inference by retaining only the most impactful tokens, called Heavy Hitters, which raises throughput and lowers latency.
Contribution
The paper proposes a new KV cache eviction policy based on Heavy Hitters, with a theoretical guarantee, that improves inference efficiency for large language models.
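To make the idea of an attention-based eviction policy concrete, here is a minimal sketch, not the authors' implementation. It assumes each cached token keeps a running sum of the attention weight it has received, and that once the cache exceeds a fixed budget the lowest-scoring token is dropped. The function name `update_cache`, the `budget` and `recent` parameters, and the choice to protect the most recent tokens from eviction are all illustrative assumptions.

```python
from typing import Dict, List


def update_cache(acc: Dict[int, float], attn_row: Dict[int, float],
                 budget: int, recent: int) -> List[int]:
    """Accumulate one decoding step's attention weights, then evict.

    acc      maps cached token position -> accumulated attention score
    attn_row maps token position -> attention weight at this step
    budget   maximum number of tokens kept in the cache (assumed knob)
    recent   number of newest tokens never evicted (assumed knob)
    Returns the positions evicted at this step.
    """
    if not 0 <= recent < budget:
        raise ValueError("need 0 <= recent < budget")

    # Add this step's attention weights to the running totals.
    for pos, weight in attn_row.items():
        acc[pos] = acc.get(pos, 0.0) + weight

    evicted: List[int] = []
    while len(acc) > budget:
        positions = sorted(acc)
        protected = set(positions[-recent:]) if recent > 0 else set()
        candidates = [p for p in positions if p not in protected]
        # The token with the lowest accumulated score is the least "heavy".
        victim = min(candidates, key=lambda p: acc[p])
        del acc[victim]
        evicted.append(victim)
    return evicted


if __name__ == "__main__":
    scores: Dict[int, float] = {}
    steps = [
        {0: 1.0},
        {0: 0.9, 1: 0.1},
        {0: 0.5, 1: 0.1, 2: 0.4},
        {0: 0.4, 1: 0.05, 2: 0.15, 3: 0.4},
    ]
    for row in steps:
        dropped = update_cache(scores, row, budget=3, recent=1)
        print("evicted:", dropped, "kept:", sorted(scores))
```

In this toy trace, token 0 receives most of the attention at every step and therefore stays in the cache, while a token that attracts little attention is the first to go once the budget is exceeded.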
Findings
H$_2$O improves throughput by up to 29x on OPT models.
H$_2$O reduces latency by up to 1.9x with the same batch size.
Heavy Hitters are naturally occurring tokens that dominate attention scores.
Abstract
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in…
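The linear scaling of the KV cache with sequence length and batch size can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a standard decoder-only transformer that stores one key and one value vector per token per layer. The function name and the model dimensions (32 layers, hidden size 4096, 16-bit values) are illustrative assumptions, not figures taken from this page.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, batch_size: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for a standard decoder-only transformer.

    The factor 2 accounts for storing both keys and values.
    All arguments are illustrative; real systems may differ
    (for example, when keys and values are shared across heads).
    """
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_value


if __name__ == "__main__":
    # Assumed example model: 32 layers, hidden size 4096, fp16 values.
    for batch, seq in [(1, 2048), (2, 2048), (1, 4096), (16, 2048)]:
        size_gib = kv_cache_bytes(32, 4096, batch, seq) / 2**30
        print(f"batch={batch:>2} seq_len={seq} -> {size_gib:.1f} GiB")
```

Doubling either the batch size or the sequence length doubles the estimate, which is why a fixed cache budget matters most for long generations and large batches.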
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: OPT · GPT-NeoX
