GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

TL;DR
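This paper introduces grouped-query attention (GQA), a generalization of multi-query attention (MQA) that uses an intermediate number of key-value heads, and a recipe for uptraining existing multi-head checkpoints with about 5% of the original pre-training compute. Uptrained GQA reaches quality close to multi-head attention at speed comparable to MQA.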
Contribution
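It proposes GQA, an attention mechanism that sits between multi-head and multi-query attention by sharing each key-value head across a group of query heads, and a recipe for uptraining existing multi-head checkpoints into MQA or GQA models with a small fraction of the original pre-training compute.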
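The checkpoint-conversion step of the uptraining recipe can be sketched as follows. In the paper, the key and value projection heads of the multi-head checkpoint are mean-pooled within each group before the additional pre-training; the function name `mean_pool_kv_heads`, the `(num_heads, d_model, d_head)` weight layout, and the toy sizes below are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def mean_pool_kv_heads(w, num_groups):
    """Pool per-head K or V projection weights into `num_groups` shared heads.

    w: array of shape (num_heads, d_model, d_head)  (assumed layout).
    Returns an array of shape (num_groups, d_model, d_head).
    """
    num_heads, d_model, d_head = w.shape
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    heads_per_group = num_heads // num_groups
    # Consecutive heads form a group; average the weights inside each group.
    grouped = w.reshape(num_groups, heads_per_group, d_model, d_head)
    return grouped.mean(axis=1)

# Toy example: 8 key heads pooled into 2 groups (GQA) or 1 group (MQA).
rng = np.random.default_rng(0)
w_k = rng.normal(size=(8, 16, 4))
w_k_gqa = mean_pool_kv_heads(w_k, num_groups=2)
w_k_mqa = mean_pool_kv_heads(w_k, num_groups=1)
print(w_k_gqa.shape, w_k_mqa.shape)
```

The same function would be applied to the value projection. Pooling only initializes the converted model; the uptraining step then continues pre-training so the model adapts to the shared heads.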
Findings
Uptrained GQA achieves quality close to multi-head attention.
Uptraining with 5% of the original pre-training compute is effective.
GQA runs at speed comparable to MQA.
Abstract
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
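To make the relationship between the three attention variants concrete, here is a minimal NumPy sketch of grouped-query attention. It assumes a single sequence, no causal mask, and no input or output projections; the function names and tensor layout are illustrative choices rather than anything taken from the paper's code. With one key-value head it reduces to MQA, and with as many key-value heads as query heads it reduces to multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """Unmasked attention with shared key-value heads.

    q:    (num_q_heads, seq_len, d_head)
    k, v: (num_kv_heads, seq_len, d_head), num_kv_heads divides num_q_heads
    """
    num_q_heads, _, d_head = q.shape
    num_kv_heads = k.shape[0]
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    # Query head i attends with key-value head i // group_size.
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
num_q_heads, seq_len, d_head = 8, 5, 4
q = rng.normal(size=(num_q_heads, seq_len, d_head))
for name, num_kv_heads in [("MQA", 1), ("GQA", 2), ("MHA", 8)]:
    k = rng.normal(size=(num_kv_heads, seq_len, d_head))
    v = rng.normal(size=(num_kv_heads, seq_len, d_head))
    out = grouped_query_attention(q, k, v)
    # Keys and values cached per token during decoding: 2 * heads * d_head.
    print(name, out.shape, "cached values per token:", 2 * num_kv_heads * d_head)
```

The output shape is the same in all three cases; what changes is the number of key and value vectors that must be stored and loaded per token during decoding, which is where the inference speedup comes from.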
Code & Models
- 🤗 bigcode/starcoder2-15b · model · 5.2k dl · ♡ 665
- 🤗 dfurman/Llama-2-70B-Instruct-v0.1 · model · 21 dl · ♡ 14
- 🤗 Deci/DeciCoder-1b · model · 1.4k dl · ♡ 248
- 🤗 Deci/DeciLM-6b · model · 36 dl · ♡ 232
- 🤗 jradchenko/DeciCoder-1b · model · 18 dl
- 🤗 bigcode/starcoder2-3b · model · 107k dl · ♡ 216
- 🤗 bigcode/starcoder2-7b · model · 18k dl · ♡ 210
- 🤗 nold/starcoder2-3b-GGUF · model · 178 dl · ♡ 1
- 🤗 nold/starcoder2-7b-GGUF · model · 144 dl · ♡ 1
- 🤗 nold/starcoder2-15b-GGUF · model · 150 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Grouped-query attention · Multi-Query Attention · Softmax · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
