Cross-layer Attention Sharing for Pre-trained Large Language Models

Yongyu Mu; Yuzhang Wu; Yuchun Fan; Chenglong Wang; Hengyu Li; Jiali Zeng; Qiaozhi He; Murun Yang; Fandong Meng; Jie Zhou; Tong Xiao; Jingbo Zhu

arXiv:2408.01890 · cs.CL · October 20, 2025

TL;DR

This paper introduces LISA, a lightweight substitute for self-attention in pre-trained large language models. It reduces redundancy in the attention mechanism by sharing attention weights across layers, improving efficiency with little loss in response quality.

Contribution

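LISA combines tiny feed-forward networks, which align attention heads between adjacent layers, with low-rank approximations so that attention weights can be shared across layers. This targets the two obstacles the authors identify: sharing the weight matrix directly without rearranging attention heads is ineffective, and shallow layers are vulnerable to small deviations in attention weights.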
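The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`align_heads`, `lisa_like_attention`), the ReLU two-layer mixer over the head dimension, the single low-rank score matrix shared by all heads, and every tensor size are hypothetical choices made for the example.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def align_heads(prev_attn, w1, w2):
    """Mix the head dimension with a tiny two-layer FFN (assumed design).

    prev_attn: [H][T][T] attention weights from the layer below.
    w1: [H][hidden], w2: [hidden][H]. Returns [H][T][T] aligned scores;
    only the causal positions (j <= i) are filled in.
    """
    H, T = len(prev_attn), len(prev_attn[0])
    out = [[[0.0] * T for _ in range(T)] for _ in range(H)]
    for i in range(T):
        for j in range(i + 1):
            heads = [[prev_attn[h][i][j] for h in range(H)]]        # 1 x H
            hidden = [[max(0.0, v) for v in matmul(heads, w1)[0]]]  # ReLU
            mixed = matmul(hidden, w2)[0]                           # 1 x H
            for h in range(H):
                out[h][i][j] = mixed[h]
    return out

def lisa_like_attention(prev_attn, x, wq_low, wk_low, w1, w2):
    """Approximate a layer's attention weights from the previous layer's.

    x: [T][d] hidden states; wq_low, wk_low: [d][r] low-rank projections.
    The low-rank Q K^T term is an assumed correction added to the aligned scores.
    """
    H, T, r = len(prev_attn), len(x), len(wq_low[0])
    q, k = matmul(x, wq_low), matmul(x, wk_low)          # T x r each
    diff = matmul(q, [list(col) for col in zip(*k)])     # T x T
    aligned = align_heads(prev_attn, w1, w2)
    attn = []
    for h in range(H):
        rows = []
        for i in range(T):
            # Causal mask: position i only scores positions 0..i.
            scores = [aligned[h][i][j] + diff[i][j] / math.sqrt(r) for j in range(i + 1)]
            rows.append(softmax(scores) + [0.0] * (T - i - 1))
        attn.append(rows)
    return attn

def rand_matrix(rows, cols):
    """Small random weights, standing in for trained parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
H, T, d, r, hidden = 2, 3, 4, 2, 3   # toy sizes, not from the paper
# Uniform causal attention as a stand-in for the previous layer's weights.
prev = [[[1.0 / (i + 1)] * (i + 1) + [0.0] * (T - i - 1) for i in range(T)] for _ in range(H)]
attn = lisa_like_attention(prev, rand_matrix(T, d), rand_matrix(d, r),
                           rand_matrix(d, r), rand_matrix(H, hidden), rand_matrix(hidden, H))
```

The point of the sketch is the cost profile: the only per-layer matrix products on the hidden states use the narrow width `r`, while the full-width Q and K projections of a standard attention layer are not computed at all.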

Findings

01. Reduces redundant attention calculations in 53%-84% of the total layers

02. Achieves 6x compression of Q and K matrices

03. Improves throughput by up to 40.1% in LLaMA models
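This page does not say how the 6x figure is obtained. As a purely illustrative assumption, the arithmetic below shows how narrowing the output width of the Q and K projections changes their parameter count; the helper names and the sizes (768 and 128) are hypothetical and chosen only so the ratio comes out at 6.

```python
def projection_params(d_model: int, d_out: int) -> int:
    """Number of weights in a bias-free linear projection from d_model to d_out."""
    return d_model * d_out

def qk_compression(d_model: int, d_full: int, d_low: int) -> float:
    """Ratio of Q+K projection parameters before and after narrowing the output width."""
    full = 2 * projection_params(d_model, d_full)   # one Q and one K projection
    low = 2 * projection_params(d_model, d_low)
    return full / low

# Hypothetical sizes, not taken from the paper.
ratio = qk_compression(d_model=768, d_full=768, d_low=128)
print(ratio)  # 6.0
```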

Abstract

To enhance the efficiency of the attention mechanism within large language models (LLMs), previous works primarily compress the KV cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It's intuitive to reduce the redundancy by sharing attention weights across layers. However, further analysis reveals two challenges: (1) Directly sharing the weight matrix without carefully rearranging the attention heads proves to be ineffective; (2) Shallow layers are vulnerable to small deviations in attention weights. Driven by these insights, we introduce LISA, a lightweight substitute for self-attention in well-trained LLMs. LISA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate…
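Challenge (1) in the abstract can be made concrete with a toy example: two adjacent layers may hold the same attention patterns but in a different head order, so comparing or sharing head h with head h looks worse than it should. The sketch below is an assumption-laden illustration, not the paper's procedure: cosine similarity and the brute-force permutation search are stand-ins chosen for brevity, and the attention maps are made-up values.

```python
import math
from itertools import permutations

def flatten(matrix):
    """Flatten a T x T attention map into a single list."""
    return [v for row in matrix for v in row]

def cosine(a, b):
    """Cosine similarity between two flattened attention maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_similarity(layer_a, layer_b, order):
    """Average similarity when head h of layer_a is paired with head order[h] of layer_b."""
    sims = [cosine(flatten(layer_a[h]), flatten(layer_b[order[h]]))
            for h in range(len(layer_a))]
    return sum(sims) / len(sims)

def best_head_order(layer_a, layer_b):
    """Brute-force the head pairing with the highest mean similarity (toy sizes only)."""
    heads = range(len(layer_a))
    return max(permutations(heads),
               key=lambda order: mean_similarity(layer_a, layer_b, order))

# Made-up causal attention maps for 2 tokens and 2 heads.
local_head = [[1.0, 0.0], [0.1, 0.9]]   # second token attends mostly to itself
first_head = [[1.0, 0.0], [0.9, 0.1]]   # second token attends mostly to the first token
layer_l = [local_head, first_head]
# The next layer holds similar patterns, but with the heads in swapped order.
layer_next = [[[1.0, 0.0], [0.85, 0.15]], [[1.0, 0.0], [0.2, 0.8]]]

identity = (0, 1)
order = best_head_order(layer_l, layer_next)
print(order)                                                  # (1, 0)
print(mean_similarity(layer_l, layer_next, identity) <
      mean_similarity(layer_l, layer_next, order))            # True
```

A factorial search over head orders is only feasible for toy sizes, which is one reason a learned alignment, such as the tiny feed-forward networks the abstract mentions, is a more practical choice.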

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · ALIGN