Optimised Grouped-Query Attention Mechanism for Transformers

Yuang Chen; Cheng Zhang; Xitong Gao; Robert D. Mullins; George A. Constantinides; Yiren Zhao

arXiv:2406.14963 · cs.LG · June 24, 2024

TL;DR

This paper introduces AsymGQA, an activation-informed method for asymmetrically grouping the queries of a multi-head attention layer when converting it to grouped-query attention, improving accuracy in large language models within the same model size budget.

Contribution

The paper proposes AsymGQA, a novel approach that groups queries asymmetrically when converting MHA to GQA, improving model accuracy without increasing model size.

Findings

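01. AsymGQA achieves higher accuracy than GQA with neighbour grouping within the same model size budget.

02. LLaMA-2-7B with AsymGQA scores 7.5% higher on MMLU than the same model with neighbour grouping.

03. AsymGQA addresses GQA's trade-off between model performance and hardware efficiency.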

Abstract

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.
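The abstract's description of the MHA-to-GQA conversion can be illustrated with a small sketch. It assumes that the heads in a group are merged by mean-pooling their key (or value) projection weights, as in the original GQA work; the abstract itself does not state the merge rule. The function names, the toy weights, and the hand-picked uneven grouping are hypothetical, and the activation-informed search that AsymGQA uses to choose its groups is not reproduced here.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def neighbour_groups(num_heads: int, num_groups: int) -> List[List[int]]:
    """Evenly split head indices 0..num_heads-1 into contiguous groups."""
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def merge_heads(head_weights: Sequence[Matrix],
                groups: Sequence[Sequence[int]]) -> List[Matrix]:
    """Mean-pool per-head key (or value) weights within each group.

    Assumption: mean-pooling is the merge rule. Groups may have any size,
    so the same function covers even and uneven groupings.
    """
    merged = []
    for group in groups:
        rows = len(head_weights[group[0]])
        cols = len(head_weights[group[0]][0])
        pooled = [
            [sum(head_weights[h][r][c] for h in group) / len(group)
             for c in range(cols)]
            for r in range(rows)
        ]
        merged.append(pooled)
    return merged


# Toy example: 4 heads, each with a 1x2 key projection (illustrative values).
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]

even = neighbour_groups(4, 2)   # neighbour grouping: two groups of two heads
asym = [[0], [1, 2, 3]]         # hypothetical uneven grouping, hand-picked

even_keys = merge_heads(key_heads, even)
asym_keys = merge_heads(key_heads, asym)
```

Both groupings produce two shared key heads, so the key/value parameter count is the same; only the assignment of query heads to groups differs, which is the degree of freedom an asymmetric grouping exploits.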

Taxonomy

Topics: Web Data Mining and Analysis · Graph Theory and Algorithms · Advanced Database Systems and Queries

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention