Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

Zohaib Khan; Muhammad Khaquan; Omer Tafveez; Burhanuddin Samiwala; Agha Ali Raza

arXiv:2408.08454·cs.CV·August 29, 2024

TL;DR

This paper introduces variants of Grouped Query Attention (GQA) for Transformers that use key-head norms to group queries adaptively, which improves accuracy and reduces memory for long-sequence tasks; results are demonstrated on vision datasets.

Contribution

The paper proposes Key-Distributed GQA and Dynamic Key-Distributed GQA, enhancing GQA with norm-based adaptive query grouping, and introduces Perturbed GQA to add variability, advancing efficient attention mechanisms.
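
The idea of norm-based query grouping can be illustrated with a small sketch. The code below is not taken from the paper or its repository: the function names `key_head_norms` and `allocate_queries`, the largest-remainder rounding, and the guarantee of at least one query head per key head are all assumptions made for illustration. It only shows one plausible way to turn key-head norm ratios into group sizes.

```python
import math


def key_head_norms(key_heads):
    """Frobenius norm of each key head (a head is a list of rows of floats)."""
    return [math.sqrt(sum(x * x for row in head for x in row)) for head in key_heads]


def allocate_queries(norms, num_query_heads):
    """Split num_query_heads across key heads in proportion to their norms.

    Assumptions of this sketch (not confirmed by the source): every key head
    gets at least one query head, and fractional shares are resolved with
    largest-remainder rounding.
    """
    num_groups = len(norms)
    if num_query_heads < num_groups:
        raise ValueError("need at least one query head per key head")
    total = sum(norms)
    spare = num_query_heads - num_groups  # heads left after one per group
    if total == 0:
        shares = [spare / num_groups] * num_groups  # fall back to a uniform split
    else:
        shares = [spare * n / total for n in norms]
    counts = [1 + int(s) for s in shares]
    # Hand out the heads lost to flooring, largest fractional part first.
    leftover = num_query_heads - sum(counts)
    order = sorted(range(num_groups), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Three key heads whose norms are in the ratio 1 : 2 : 3, shared by 8 query heads.
print(allocate_queries([5.0, 10.0, 15.0], 8))
```

With equal norms the allocation collapses to the uniform split used by standard GQA, for example `allocate_queries([1.0, 1.0, 1.0, 1.0], 8)` gives two query heads per group.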

Findings

01 DGQA improves accuracy by up to 8% on ViT-L.
02 Norm-based grouping enhances model performance.
03 Adaptive grouping reduces memory without sacrificing accuracy.

Abstract

The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads, which flexibly reduces the overall number of parameters and the memory requirements without significantly compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass,…
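
The mean-pooling step that the abstract attributes to standard GQA can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: heads are represented as nested lists rather than tensors, groups are formed from consecutive heads of equal count, and the name `mean_pool_heads` is hypothetical.

```python
def mean_pool_heads(heads, num_groups):
    """Average consecutive, equally sized blocks of heads into num_groups heads.

    heads: list of H matrices (each a list of rows of floats), all the same shape.
    Returns a list of num_groups matrices of that shape.
    """
    if num_groups <= 0 or len(heads) % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    size = len(heads) // num_groups
    pooled = []
    for g in range(num_groups):
        block = heads[g * size:(g + 1) * size]
        # zip(*block) pairs up matching rows across the block's heads;
        # zip(*rows) then pairs up matching entries within those rows.
        pooled.append([
            [sum(vals) / size for vals in zip(*rows)]
            for rows in zip(*block)
        ])
    return pooled


# Four key heads (one row, two columns each) pooled into two shared heads.
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[10.0, 20.0]], [[30.0, 40.0]]]
print(mean_pool_heads(key_heads, 2))
```

Here four key heads shrink to two, so each pooled head is shared by the query heads of its group; the same pooling would be applied to the value heads.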

Code & Models

Repositories

zohaib-khan5040/key-driven-gqa
PyTorch · Official

Taxonomy

Topics: Advanced Database Systems and Queries · Data Stream Mining Techniques · Distributed systems and fault tolerance

Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax