Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA
Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin

TL;DR
This paper introduces a cost-effective method using Procrustes analysis and L0 regularization to convert multi-head attention into grouped-query attention in large language models, significantly reducing KV heads while maintaining performance.
Contribution
The proposed method enables flexible conversion of MHA to GQA with high compression ratios, improving inference efficiency without substantial performance loss.
Findings
Compressed up to 87.5% of KV heads in LLaMA2-7B
Achieved 75% KV head compression in Sheared-LLaMA-1.3B
Maintained acceptable performance after pruning
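The compression ratios above translate directly into KV head counts and cache size. The sketch below is illustrative arithmetic only. It assumes the publicly documented LLaMA2-7B configuration (32 layers, 32 attention heads, head dimension 128) and fp16 storage, none of which is stated in this summary, and the helper names are hypothetical.

```python
def kv_heads_after(num_heads: int, compression: float) -> int:
    """KV heads remaining after removing the given fraction of them."""
    return round(num_heads * (1 - compression))


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per token: one key and one value vector
    per KV head per layer. bytes_per_value=2 assumes fp16."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed LLaMA2-7B shape: 32 layers, 32 heads, head_dim 128.
mha_heads = 32
gqa_heads = kv_heads_after(mha_heads, 0.875)  # 87.5% of KV heads removed

mha_bytes = kv_cache_bytes_per_token(32, mha_heads, 128)
gqa_bytes = kv_cache_bytes_per_token(32, gqa_heads, 128)

print(f"KV heads: {mha_heads} -> {gqa_heads}")
print(f"KV cache per token: {mha_bytes} B -> {gqa_bytes} B")
```

Under these assumptions the per-token cache shrinks by the same factor as the KV head count, which is why the KV head ratio is the quantity the method targets.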
Abstract
Large language models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, as the model size and the input sequence's length increase, the linearly increasing key-value (KV) cache significantly degrades inference throughput. Therefore, grouped-query attention (GQA), as an alternative to multi-head attention (MHA), has been widely introduced into LLMs. In this work, we propose a cost-effective method for converting MHA into GQA with any compression ratio of KV heads. The key point of our method lies in the application of Procrustes analysis to the attention heads, which enhances the similarity among attention heads while preserving computational invariance, thereby improving the model's post-training performance. Subsequently, we employ L0 regularization to prune redundant parameters. The model after pruning can be…
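The idea of aligning heads while preserving computational invariance can be illustrated with a generic orthogonal Procrustes step. This is a minimal sketch, not the authors' implementation. It assumes a single head's value projection `w_v` and its slice of the output projection `w_o`, uses random data, and aligns the head's value activations to a reference head. All names and shapes are hypothetical.

```python
import numpy as np


def orthogonal_procrustes(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix Q minimizing ||a @ Q - b||_F."""
    u, _, vt = np.linalg.svd(a.T @ b)
    return u @ vt


rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 64, 32, 8  # toy sizes, chosen for illustration

x = rng.standard_normal((n_tokens, d_model))       # token representations
w_v_ref = rng.standard_normal((d_model, d_head))   # reference head's value projection
w_v = rng.standard_normal((d_model, d_head))       # value projection of the head to align
w_o = rng.standard_normal((d_head, d_model))       # that head's slice of the output projection

v_ref = x @ w_v_ref
v = x @ w_v

# Rotate this head's value space toward the reference head.
q = orthogonal_procrustes(v, v_ref)
w_v_aligned = w_v @ q
w_o_aligned = q.T @ w_o  # undo the rotation on the output side

dist_before = np.linalg.norm(v - v_ref)
dist_after = np.linalg.norm(x @ w_v_aligned - v_ref)
print(f"distance to reference: {dist_before:.3f} -> {dist_after:.3f}")
```

Because `q` is orthogonal, `w_v_aligned @ w_o_aligned` equals `w_v @ w_o`, so the head's contribution to the layer output is unchanged while its value activations move closer to the reference head. Heads that are more similar are then easier to merge into a shared KV head.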
Taxonomy
Topics: Design Education and Practice
Methods: Softmax · Attention Is All You Need · Pruning
