Beyond KV Caching: Shared Attention for Efficient LLMs
Bingli Liao, Danilo Vasconcellos Vargas

TL;DR
This paper proposes a Shared Attention mechanism that reduces computational and memory costs in large language models by sharing attention weights across layers, maintaining performance while improving efficiency.
Contribution
Introduces a novel Shared Attention mechanism that shares attention weights across layers, reducing resource usage in LLMs without significant accuracy loss.
Findings
Significantly fewer FLOPs, because layers that share attention weights do not recompute them.
Smaller KV cache during inference.
Minimal accuracy loss on standard benchmarks.
Abstract
The efficiency of large language models (LLMs) remains a critical challenge, particularly in contexts where computational resources are limited. Traditional attention mechanisms in these models, while powerful, require significant computational and memory resources due to the necessity of recalculating and storing attention weights across different layers. This paper introduces a novel Shared Attention (SA) mechanism, designed to enhance the efficiency of LLMs by directly sharing computed attention weights across multiple layers. Unlike previous methods that focus on sharing intermediate Key-Value (KV) caches, our approach utilizes the isotropic tendencies of attention distributions observed in advanced LLMs post-pretraining to reduce both the computational flops and the size of the KV cache required during inference. We empirically demonstrate that implementing SA across various LLMs…
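To make the idea in the abstract concrete, the following is a minimal NumPy sketch of sharing attention weights across a span of layers. It is an illustration under stated assumptions, not the authors' implementation: it uses a single head, no MLP or normalization, and the names `causal_attention_weights`, `shared_attention_stack`, and `shared_span` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention_weights(x, w_q, w_k):
    """Return softmax(Q K^T / sqrt(d)) with a causal mask, shape (seq, seq)."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out positions above the diagonal (future tokens).
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores)


def shared_attention_stack(x, layers, shared_span):
    """Run a toy stack of attention layers.

    layers: list of dicts holding 'w_q', 'w_k', 'w_v' matrices (hypothetical layout).
    shared_span: (start, end) inclusive; layers start+1..end reuse the
        attention weights computed at layer `start`.
    Returns the output and how many times attention weights were computed.
    """
    start, end = shared_span
    shared = None
    weights_computed = 0
    for i, p in enumerate(layers):
        if start < i <= end:
            # Reuse: no Q/K projection, no softmax for this layer.
            attn = shared
        else:
            attn = causal_attention_weights(x, p["w_q"], p["w_k"])
            weights_computed += 1
            if i == start:
                shared = attn
        # Each layer still applies its own value projection, plus a residual.
        x = x + attn @ (x @ p["w_v"])
    return x, weights_computed


rng = np.random.default_rng(0)
layers = [
    {name: rng.normal(scale=0.1, size=(8, 8)) for name in ("w_q", "w_k", "w_v")}
    for _ in range(6)
]
tokens = rng.normal(size=(5, 8))
out, n_computed = shared_attention_stack(tokens, layers, shared_span=(2, 4))
```

In this sketch, layers 3 and 4 never touch their query and key matrices, so they would also need no cached keys during decoding; only the value projection remains layer-specific. Which layers can share weights without hurting accuracy is an empirical question that the sketch does not address.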
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cryptography and Data Security
Methods: Softmax · Attention Is All You Need · Focus
