H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang; Ying Sheng; Tianyi Zhou; Tianlong Chen; Lianmin Zheng; Ruisi Cai; Zhao Song; Yuandong Tian; Christopher Ré; Clark Barrett; Zhangyang Wang; Beidi Chen

arXiv:2306.14048·cs.LG·December 20, 2023·28 cites

TL;DR

This paper introduces H$_2$O, a KV cache eviction method that reduces memory usage in large language model inference by retaining only the most impactful tokens, called Heavy Hitters. It improves throughput by up to 29x and reduces latency by up to 1.9x.

Contribution

The paper proposes a new KV cache eviction policy based on Heavy Hitters, with a theoretical guarantee, that improves inference efficiency for large language models.
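To make the idea of a Heavy Hitter eviction policy concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes that a token's importance is the sum of the attention weights it has received so far, that the cache has a fixed `budget`, and that the last `num_recent` tokens are protected from eviction. The class name `HeavyHitterCache` and both parameters are hypothetical and chosen for illustration.

```python
class HeavyHitterCache:
    """Toy KV cache index that evicts the token with the lowest accumulated
    attention score (illustrative assumption, not the paper's code)."""

    def __init__(self, budget, num_recent):
        if not 0 <= num_recent < budget:
            raise ValueError("need 0 <= num_recent < budget")
        self.budget = budget          # maximum number of cached tokens
        self.num_recent = num_recent  # newest tokens that are never evicted
        self.positions = []           # cached token positions, oldest first
        self.score = {}               # position -> accumulated attention weight

    def step(self, new_pos, attn):
        """Cache token `new_pos`, add the attention weights `attn`
        (position -> weight) produced by its query, and evict one token
        if the cache exceeds its budget. Returns the evicted position or None."""
        self.positions.append(new_pos)
        self.score[new_pos] = 0.0
        for pos, weight in attn.items():
            if pos in self.score:
                self.score[pos] += weight

        if len(self.positions) <= self.budget:
            return None

        # Only tokens outside the protected recent window are candidates.
        cutoff = len(self.positions) - self.num_recent
        candidates = self.positions[:cutoff]
        evicted = min(candidates, key=lambda p: self.score[p])
        self.positions.remove(evicted)
        del self.score[evicted]
        return evicted


cache = HeavyHitterCache(budget=3, num_recent=1)
cache.step(0, {0: 1.0})
cache.step(1, {0: 0.9, 1: 0.1})
cache.step(2, {0: 0.8, 1: 0.05, 2: 0.15})
evicted = cache.step(3, {0: 0.7, 1: 0.05, 2: 0.15, 3: 0.1})
print(evicted, cache.positions)
```

In this toy run, token 0 keeps receiving most of the attention mass and stays cached, while the weakly attended token 1 is the one dropped once the budget is exceeded. A real implementation would track scores per layer and per head and would remove the matching key and value tensors.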

Findings

01. H$_2$O improves throughput by up to 29x on OPT models.

02. H$_2$O reduces latency by up to 1.9x with the same batch size.

03. Heavy Hitters are naturally occurring tokens that dominate attention scores.

Abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in…
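The linear scaling of the KV cache with sequence length and batch size can be illustrated with a short calculation. The sketch below assumes standard multi-head attention, where every layer stores one key vector and one value vector of width `hidden_size` per token. The function name `kv_cache_bytes` and the model dimensions in the example are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes, assuming one key and one value
    vector of width `hidden_size` per token per layer (illustrative model)."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * num_layers * batch_size
            * seq_len * hidden_size * bytes_per_elem)


# Hypothetical 32-layer model with hidden size 4096, stored in 16-bit floats.
one_seq = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=2048, batch_size=1)
batch_of_16 = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=2048, batch_size=16)

print(one_seq / 2**30, "GiB for one sequence")
print(batch_of_16 / 2**30, "GiB for a batch of 16")
```

Under these assumptions a single 2048-token sequence already needs 1 GiB of cache, and the requirement grows in direct proportion to the batch size, which is why limiting the number of cached tokens frees memory for larger batches.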

Code & Models

Videos

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: OPT · GPT-NeoX