Beyond KV Caching: Shared Attention for Efficient LLMs

Bingli Liao; Danilo Vasconcellos Vargas

arXiv:2407.12866·cs.CL·July 19, 2024

TL;DR

This paper proposes Shared Attention (SA), a mechanism that reduces the computational and memory cost of large language models by reusing computed attention weights across multiple layers, with minimal accuracy loss on standard benchmarks.

Contribution

Introduces a novel Shared Attention mechanism that shares attention weights across layers, reducing resource usage in LLMs without significant accuracy loss.

Findings

01. Significant reduction in computational FLOPs.

02. Decreased KV cache size during inference.

03. Minimal accuracy loss on standard benchmarks.
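The cache-size finding can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes that a layer which reuses attention weights from an earlier layer no longer needs its own key cache and keeps only its value cache. That assumption, the function name `kv_cache_bytes`, and the model dimensions are all illustrative and are not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_reusing_layers, n_kv_heads, head_dim,
                   seq_len, bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes.

    Assumption (illustrative): each of the `n_reusing_layers` layers that
    reuse attention weights stores only V, while every other layer stores
    both K and V.
    """
    per_tensor = batch * n_kv_heads * head_dim * seq_len * bytes_per_elem
    n_tensors = 2 * n_layers - n_reusing_layers  # drop one K per reusing layer
    return n_tensors * per_tensor


# Hypothetical configuration: 32 layers, 32 KV heads of dimension 128,
# a 4096-token context, fp16 storage (2 bytes per element).
full = kv_cache_bytes(32, 0, 32, 128, 4096)
shared = kv_cache_bytes(32, 8, 32, 128, 4096)
saving = 1 - shared / full
```

Under these assumptions the saving grows linearly with the number of layers that reuse weights, and it is capped at 50% because the value cache is still required in every layer.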

Abstract

The efficiency of large language models (LLMs) remains a critical challenge, particularly in contexts where computational resources are limited. Traditional attention mechanisms in these models, while powerful, require significant computational and memory resources due to the necessity of recalculating and storing attention weights across different layers. This paper introduces a novel Shared Attention (SA) mechanism, designed to enhance the efficiency of LLMs by directly sharing computed attention weights across multiple layers. Unlike previous methods that focus on sharing intermediate Key-Value (KV) caches, our approach utilizes the isotropic tendencies of attention distributions observed in advanced LLMs post-pretraining to reduce both the computational FLOPs and the size of the KV cache required during inference. We empirically demonstrate that implementing SA across various LLMs…
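As a rough illustration of the idea described in the abstract, the sketch below computes causal softmax attention weights once and reuses them in later layers, so those layers skip the query/key projections and the softmax. It is a single-head, pure-Python toy. The names `attention_weights` and `run_layers`, the `share_from` parameter, and the policy that every layer after `share_from` reuses the weights are assumptions for illustration, not the authors' implementation (the metacarbon/shareAtt repository holds that).

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x, w_q, w_k):
    """Causal scaled dot-product attention weights, softmax(QK^T / sqrt(d))."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return [
        softmax([s / scale if j <= i else float("-inf")
                 for j, s in enumerate(row)])
        for i, row in enumerate(scores)
    ]


def run_layers(x, layers, share_from=None):
    """Apply attention layers with residual connections.

    Layers with index > share_from reuse the weights computed at layer
    `share_from` (hypothetical policy). Returns the output and the number
    of times attention weights were actually computed.
    """
    shared, computed = None, 0
    for idx, layer in enumerate(layers):
        if shared is not None and idx > share_from:
            weights = shared  # reuse: no Q/K projection, no softmax
        else:
            weights = attention_weights(x, layer["w_q"], layer["w_k"])
            computed += 1
            if idx == share_from:
                shared = weights
        v = matmul(x, layer["w_v"])  # values are still computed per layer
        out = matmul(weights, v)
        x = [[a + b for a, b in zip(xr, orow)] for xr, orow in zip(x, out)]
    return x, computed


# Toy input: 3 tokens, model dimension 2, four layers with identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [{"w_q": identity, "w_k": identity, "w_v": identity}
          for _ in range(4)]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

_, n_full = run_layers(x, layers)                  # every layer computes weights
_, n_shared = run_layers(x, layers, share_from=1)  # layers 2 and 3 reuse layer 1
```

In this toy the shared run computes attention weights for two of the four layers. The value projection is still evaluated in every layer, which is why sharing weights removes the query/key work but not the value cache.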

Code & Models

Repositories

metacarbon/shareAtt (PyTorch, official)

Taxonomy

Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cryptography and Data Security

Methods: Softmax · Attention Is All You Need · Focus