Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza

TL;DR
This paper introduces advanced variants of Grouped Query Attention (GQA) for Transformers, leveraging key head norms for adaptive grouping, which improves accuracy and reduces memory for long-sequence tasks, demonstrated on vision datasets.
Contribution
The paper proposes Key-Distributed GQA and Dynamic Key-Distributed GQA, enhancing GQA with norm-based adaptive query grouping, and introduces Perturbed GQA to add variability, advancing efficient attention mechanisms.
Findings
DGQA improves accuracy by up to 8% on ViT-L relative to GQA and the other variants.
Norm-based grouping enhances model performance.
Adaptive grouping reduces memory without sacrificing accuracy.
Abstract
The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass,…
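The two ideas named in the abstract, mean-pooling key-value heads and allocating queries by key-head norm ratios, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mean_pool_heads`, `allocate_queries`), the use of plain Python lists as flattened head vectors, and the largest-remainder rounding rule are all assumptions made for illustration, since the abstract does not specify how norm ratios are turned into integer group sizes.

```python
import math


def mean_pool_heads(heads, group_size):
    """Average each run of `group_size` consecutive heads into one head.

    `heads` is a list of equal-length vectors, one flattened vector per head
    (a hypothetical representation chosen for this sketch).
    """
    if len(heads) % group_size != 0:
        raise ValueError("number of heads must be divisible by group_size")
    pooled = []
    for start in range(0, len(heads), group_size):
        group = heads[start:start + group_size]
        # Element-wise mean across the heads in this group.
        pooled.append([sum(col) / group_size for col in zip(*group)])
    return pooled


def l2_norm(vec):
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in vec))


def allocate_queries(key_heads, num_query_heads):
    """Split `num_query_heads` across key heads in proportion to their L2 norms.

    Assumption: fractional shares are rounded with the largest-remainder
    rule so that the counts always sum to `num_query_heads`.
    """
    norms = [l2_norm(k) for k in key_heads]
    total = sum(norms)
    if total == 0:
        raise ValueError("all key heads have zero norm")
    shares = [num_query_heads * n / total for n in norms]
    counts = [int(s) for s in shares]  # floor of each non-negative share
    leftover = num_query_heads - sum(counts)
    # Hand the remaining query heads to the largest fractional parts.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Four toy key heads pooled down to two, as in uniform GQA.
pooled = mean_pool_heads([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], 2)

# Three toy key heads with norms 5, 10, and 5 sharing eight query heads.
counts = allocate_queries([[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]], 8)
```

With the toy inputs above, the key head with twice the norm receives twice as many query heads, whereas uniform GQA would give every group the same number. Under this rounding rule a key head with a very small norm can receive zero queries; whether and how the paper guards against that is not stated in the abstract.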
Taxonomy
Topics: Advanced Database Systems and Queries · Data Stream Mining Techniques · Distributed systems and fault tolerance
Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax
