Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA

Qingyun Jin; Xiaohui Song; Feng Zhou; Zengchang Qin

arXiv:2412.20677 · cs.CL · July 29, 2025

TL;DR

This paper introduces a cost-effective method for converting multi-head attention (MHA) into grouped-query attention (GQA) in large language models. It first aligns attention heads with Procrustes analysis, then prunes redundant parameters with L0 regularization, removing up to 87.5% of KV heads with acceptable performance degradation.

Contribution

The proposed method converts MHA to GQA at any compression ratio of KV heads, including high ones, which shrinks the KV cache at inference time without substantial performance loss.

Findings

01 Compressed up to 87.5% of the KV heads in LLaMA2-7B

02 Achieved 75% KV-head compression in Sheared-LLaMA-1.3B

03 Maintained acceptable performance after pruning
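
To put the 87.5% figure in concrete terms, the arithmetic below estimates KV-cache size before and after compression. It is a rough sketch, not taken from the paper: the LLaMA2-7B configuration (32 layers, 32 attention heads, head dimension 128), the 4096-token sequence, and fp16 storage are assumptions, and `kv_heads_after` and `kv_cache_bytes` are hypothetical helper names.

```python
def kv_heads_after(n_heads: int, compression: float) -> int:
    """KV heads remaining after removing a `compression` fraction of them."""
    return round(n_heads * (1.0 - compression))


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV-cache size in bytes; the leading 2 counts keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed LLaMA2-7B shape: 32 layers, 32 heads, head_dim 128 (not stated in the source).
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

kept = kv_heads_after(N_HEADS, 0.875)                          # 32 -> 4 KV heads
mha_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
gqa_bytes = kv_cache_bytes(N_LAYERS, kept, HEAD_DIM, SEQ_LEN)

print(f"KV heads kept: {kept}")
print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB, GQA cache: {gqa_bytes / 2**20:.0f} MiB")
```

Under these assumptions the cache for one 4096-token sequence drops from 2048 MiB to 256 MiB, since cache size scales linearly with the number of KV heads.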

Abstract

Large language models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, as the model size and the input sequence's length increase, the linearly increasing key-value (KV) cache significantly degrades inference throughput. Therefore, grouped-query attention (GQA), as an alternative to multi-head attention (MHA), has been widely introduced into LLMs. In this work, we propose a cost-effective method for converting MHA into GQA with any compression ratio of KV heads. The key point of our method lies in the application of Procrustes analysis to the attention heads, which enhances the similarity among attention heads while preserving computational invariance, thereby improving the model's post-training performance. Subsequently, we employ $L_{0}$ regularization to prune redundant parameters. The model after pruning can be…
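
The alignment idea in the abstract can be illustrated with a small orthogonal Procrustes example. This is a minimal sketch under simplifying assumptions, not the authors' implementation: it aligns random projection weights directly, ignores positional encodings, and uses made-up dimensions. The names `procrustes_rotation`, `w_q`, `w_k`, and `w_k_ref` are hypothetical. It shows that rotating a head's query and key projections by the same orthogonal matrix leaves the attention scores unchanged while moving the key projection closer to a reference head.

```python
import numpy as np


def procrustes_rotation(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||a @ R - b||_F (orthogonal Procrustes via SVD)."""
    u, _, vt = np.linalg.svd(a.T @ b)
    return u @ vt


rng = np.random.default_rng(0)
d_model, d_head = 64, 16                      # illustrative sizes only

w_q = rng.standard_normal((d_model, d_head))      # query projection of one head
w_k = rng.standard_normal((d_model, d_head))      # key projection of the same head
w_k_ref = rng.standard_normal((d_model, d_head))  # key projection of a reference head

# Rotate this head's key space toward the reference head's key space.
r = procrustes_rotation(w_k, w_k_ref)
w_q_aligned, w_k_aligned = w_q @ r, w_k @ r

# Attention logits are unchanged because R @ R.T is the identity.
x = rng.standard_normal((8, d_model))
scores_before = (x @ w_q) @ (x @ w_k).T
scores_after = (x @ w_q_aligned) @ (x @ w_k_aligned).T

dist_before = np.linalg.norm(w_k - w_k_ref)
dist_after = np.linalg.norm(w_k_aligned - w_k_ref)
print(np.allclose(scores_before, scores_after), dist_after <= dist_before)
```

Because the same rotation is applied to both projections, the product of queries and keys is preserved, which is one way to read the abstract's "computational invariance"; heads brought closer together this way lose less when they are later merged into a shared KV head.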

Taxonomy

Topics: Design Education and Practice

Methods: Softmax · Attention Is All You Need · Pruning