SubGen: Token Generation in Sublinear Time and Memory

Amir Zandieh; Insu Han; Vahab Mirrokni; Amin Karbasi

arXiv:2402.06082 · cs.LG · February 12, 2024 · 1 citation

Open Access

TL;DR

SubGen is an attention decoding method for large language models that runs in sublinear time and memory. It clusters key embeddings online and samples values online, which makes long-context token generation more efficient.

Contribution

The paper presents a KV caching technique with sublinear complexity, built on online clustering of keys and online sampling of values. The method is provably accurate and empirically outperforms existing KV cache compression methods.

Findings

01 SubGen achieves sublinear memory and time complexity.

02 It outperforms existing KV cache compression methods.

03 The approach has a tight theoretical error bound.

Abstract

Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the necessity to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, our focus is on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this key insight, we have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $ℓ_{2}$ sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed SubGen. Not only does this algorithm ensure a sublinear memory footprint and…
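The key-clustering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' SubGen algorithm: the class `ClusteredKeyCache`, the radius `delta`, the per-cluster sample size `samples_per_cluster`, and the nearest-center assignment rule are all assumptions made for illustration. The sketch keeps one center, a count, and a small uniform reservoir of keys per cluster, then estimates the softmax normalizer $\sum_i \exp(\langle q, k_i \rangle)$ from those samples alone. The online $ℓ_{2}$ sampling on values is not shown.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class ClusteredKeyCache:
    """Hypothetical sketch: threshold-based online clustering of keys,
    with a fixed-size uniform reservoir of keys kept per cluster."""

    def __init__(self, delta, samples_per_cluster, seed=0):
        self.delta = delta            # cluster radius (assumed hyperparameter)
        self.t = samples_per_cluster  # keys retained per cluster (assumed)
        self.rng = random.Random(seed)
        self.centers = []             # first key seen in each cluster
        self.counts = []              # number of keys assigned to each cluster
        self.samples = []             # uniform reservoir of each cluster's keys

    def add(self, key):
        # Find the nearest existing center.
        best, best_dist = None, float("inf")
        for i, center in enumerate(self.centers):
            d = math.dist(key, center)
            if d < best_dist:
                best, best_dist = i, d

        if best is None or best_dist > self.delta:
            # No center within delta: open a new cluster.
            self.centers.append(list(key))
            self.counts.append(1)
            self.samples.append([list(key)])
            return

        self.counts[best] += 1
        reservoir = self.samples[best]
        if len(reservoir) < self.t:
            reservoir.append(list(key))
        else:
            # Reservoir sampling: the new key replaces a stored one
            # with probability t / count, keeping the sample uniform.
            j = self.rng.randrange(self.counts[best])
            if j < self.t:
                reservoir[j] = list(key)

    def estimate_partition(self, query):
        # Per cluster: (cluster size) * (mean of exp(<q, k>) over sampled keys).
        total = 0.0
        for count, reservoir in zip(self.counts, self.samples):
            mean = sum(math.exp(dot(query, k)) for k in reservoir) / len(reservoir)
            total += count * mean
        return total
```

Under these assumptions, stored state grows with the number of clusters times `samples_per_cluster` rather than with the number of tokens, so the memory saving depends on how strongly the keys cluster. Calling `add` for each new key and `estimate_partition(query)` at decode time is the intended usage.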

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Focus