ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads
Zhuorui Liu, Chen Zhang, Dawei Song

TL;DR
ZigzagAttention speeds up long-context inference in large language models by making each layer consist exclusively of retrieval heads or exclusively of streaming heads, which reduces latency while largely maintaining performance.
Contribution
The paper proposes a novel head assignment criterion that enforces exclusive retrieval or streaming heads per layer, reducing latency and memory usage in long-context LLMs.
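To make the idea of a layer-exclusive assignment concrete, here is a minimal sketch. It is not the paper's criterion, which this summary does not spell out. It assumes a hypothetical per-head "retrieval score" is already available and labels whole layers by their mean score; the function name `assign_layers` and the `retrieval_fraction` parameter are illustrative only.

```python
from statistics import mean


def assign_layers(head_scores, retrieval_fraction=0.5):
    """Label each layer as all-retrieval or all-streaming.

    head_scores: one list per layer holding a hypothetical retrieval
    score per head (higher means more retrieval-like). This ranking
    heuristic is an assumption for illustration, not the paper's method.
    """
    layer_means = [mean(scores) for scores in head_scores]
    n_retrieval = round(len(layer_means) * retrieval_fraction)
    # Rank layers from most to least retrieval-like.
    ranked = sorted(range(len(layer_means)),
                    key=lambda i: layer_means[i], reverse=True)
    retrieval_layers = set(ranked[:n_retrieval])
    # Every head in a layer shares that layer's label.
    return ["retrieval" if i in retrieval_layers else "streaming"
            for i in range(len(layer_means))]


if __name__ == "__main__":
    scores = [[0.9, 0.8], [0.1, 0.2], [0.6, 0.4], [0.3, 0.1]]
    print(assign_layers(scores))
```

Because the label is chosen per layer rather than per head, no layer ever mixes the two head types, which is the property the contribution statement describes.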
Findings
Reduced latency compared to baseline methods.
Negligible performance degradation with exclusive head assignment.
Competitive performance among considered baselines.
Abstract
With the rapid development of large language models (LLMs), handling long context has become one of the vital abilities of LLMs. This long-context ability comes with deployment difficulties, especially due to the increased consumption of KV cache. Some prior work aims to optimize the memory footprint of the KV cache, inspired by the observation that attention heads can be categorized into retrieval heads, which are of great significance, and streaming heads, which are of less significance. Typically, identifying the streaming heads and waiving the KV cache in those heads largely reduces the overhead without hurting performance much. However, since employing both retrieval and streaming heads in one layer decomposes one large round of attention computation into two small ones, it may unexpectedly bring extra latency on accessing and indexing tensors.…
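The trade-off described in the abstract can be illustrated with a rough cost model. This is a sketch under stated assumptions: streaming heads are assumed to keep only a few initial "sink" tokens plus a recent window (a common streaming-attention setup that this summary does not confirm), and the number of separate attention rounds per layer is used as a crude stand-in for the tensor access and indexing latency. The function names and default sizes are hypothetical.

```python
def kv_entries_per_head(kind, context_len, sink=4, recent=256):
    """KV cache entries one head keeps (sink/recent sizes are assumed)."""
    if kind == "retrieval":
        return context_len  # full cache
    return min(context_len, sink + recent)  # truncated cache


def layer_cost(head_kinds, context_len, sink=4, recent=256):
    """Return (kv_entries, attention_rounds) for a single layer."""
    kv = sum(kv_entries_per_head(k, context_len, sink, recent)
             for k in head_kinds)
    # A layer mixing both head types needs one attention round per type.
    rounds = len(set(head_kinds))
    return kv, rounds


if __name__ == "__main__":
    n = 10_000
    mixed = [["retrieval", "streaming"] * 2] * 2          # two mixed layers
    exclusive = [["retrieval"] * 4, ["streaming"] * 4]    # two exclusive layers
    for name, layers in (("mixed", mixed), ("exclusive", exclusive)):
        costs = [layer_cost(layer, n) for layer in layers]
        print(name, sum(c[0] for c in costs), sum(c[1] for c in costs))
```

In this toy setting both layouts hold the same number of KV entries, but the exclusive layout needs half as many attention rounds, which is the kind of saving the abstract points to.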
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Data Compression Techniques
