Optimised Grouped-Query Attention Mechanism for Transformers
Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

TL;DR
This paper introduces AsymGQA, an activation-informed method for asymmetrically grouping queries in multi-head attention, improving performance and efficiency in large language models.
Contribution
The paper proposes AsymGQA, an activation-informed approach that converts MHA to GQA using asymmetric (uneven) groups rather than evenly sized neighbour groups, improving model accuracy within the same model size budget.
Findings
AsymGQA outperforms traditional GQA in accuracy.
LLaMA-2-7B with AsymGQA gains 7.5% accuracy on MMLU compared to neighbour grouping.
Addresses the performance-efficiency trade-off in GQA.
Abstract
Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.
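The neighbour grouping described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `neighbour_groups` and `merge_kv_heads`, the weight layout, and the use of mean-pooling to merge key/value heads within a group are all assumptions. The uneven partition is an arbitrary hand-picked example; the activation-informed procedure AsymGQA uses to choose its groups is not described in this summary and is not reproduced here.

```python
import numpy as np


def neighbour_groups(num_heads, num_groups):
    """Evenly split consecutive head indices into groups (neighbour grouping)."""
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def merge_kv_heads(w, groups):
    """Merge per-head projection weights so each group shares one head.

    Assumption: heads in a group are merged by mean-pooling.
    w:       array of shape (num_heads, d_model, d_head)
    returns: array of shape (len(groups), d_model, d_head)
    """
    return np.stack([w[idx].mean(axis=0) for idx in groups])


# Hypothetical key-projection weights for an 8-head MHA layer.
rng = np.random.default_rng(0)
w_k = rng.standard_normal((8, 16, 4))

# Neighbour grouping: four even groups of adjacent heads.
even = neighbour_groups(8, 4)

# Asymmetric grouping: same number of groups, uneven sizes, non-adjacent heads.
# This partition is made up for illustration only.
uneven = [[0, 5, 6], [1], [2, 3, 7], [4]]

w_k_even = merge_kv_heads(w_k, even)
w_k_uneven = merge_kv_heads(w_k, uneven)
```

Both partitions produce four shared key heads, so the parameter count of the merged projection is identical; only the assignment of query heads to groups differs.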
Taxonomy
Topics: Web Data Mining and Analysis · Graph Theory and Algorithms · Advanced Database Systems and Queries
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
