GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie; James Lee-Thorp; Michiel de Jong; Yury Zemlyanskiy; Federico Lebrón; Sumit Sanghai

arXiv:2305.13245·cs.CL·December 27, 2023·27 cites

TL;DR

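This paper introduces grouped-query attention (GQA), a generalization of multi-query attention (MQA) that uses an intermediate number of key-value heads, together with a recipe for uptraining existing multi-head checkpoints using about 5% of the original pre-training compute. Uptrained GQA reaches quality close to multi-head attention at speed comparable to MQA.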

Contribution

It proposes GQA as a flexible attention mechanism and a recipe for uptraining existing models into GQA with minimal extra compute.
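
The summary here does not say how a multi-head checkpoint is turned into a grouped-query one before uptraining. A minimal sketch, assuming the key (or value) projection matrices of the heads in each group are mean-pooled into one shared matrix, is shown below. The function name `mean_pool_kv_heads` and the nested-list matrix layout are illustrative assumptions, not taken from the paper's code.

```python
def mean_pool_kv_heads(heads, num_groups):
    """Merge per-head projection matrices into num_groups pooled matrices.

    heads: list of H matrices (each a list of rows of floats), one per
           original key head or value head. Assumed layout, for illustration.
    num_groups: number of key-value heads to keep (1 gives MQA,
                H leaves the multi-head layout unchanged).
    """
    num_heads = len(heads)
    if num_groups < 1 or num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    group_size = num_heads // num_groups

    pooled = []
    for g in range(num_groups):
        # Consecutive heads form one group and will share a key-value head.
        group = heads[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        # Element-wise mean over the matrices in this group.
        pooled.append([
            [sum(m[r][c] for m in group) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


# Four toy 1x2 "projection matrices", one per original head.
heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
gqa_heads = mean_pool_kv_heads(heads, num_groups=2)  # two shared heads
mqa_heads = mean_pool_kv_heads(heads, num_groups=1)  # one shared head
```

The pooled matrices would only be a starting point: the converted model is then uptrained for a small fraction of the original pre-training steps so that it adapts to the new head layout.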

Findings

01 Uptrained GQA achieves quality close to multi-head attention.

02 Uptraining with 5% of the original pre-training compute is effective.

03 GQA runs at speed comparable to MQA.

Abstract

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
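
The grouping described in the abstract can be illustrated with a small forward pass. The sketch below is a plain-Python toy under stated assumptions: the function name `grouped_query_attention`, the nested-list tensor layout, and the assignment of consecutive query heads to the same key-value head are illustrative choices, and no causal mask or output projection is applied. Setting `num_kv_heads` to 1 gives MQA, and setting it to the number of query heads gives ordinary multi-head attention.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v, num_kv_heads):
    """Scaled dot-product attention with shared key-value heads.

    q: [num_q_heads][seq_len][d]   query vectors per query head
    k: [num_kv_heads][seq_len][d]  key vectors per key-value head
    v: [num_kv_heads][seq_len][d]  value vectors per key-value head
    Returns a list shaped [num_q_heads][seq_len][d].
    """
    num_q_heads = len(q)
    if len(k) != num_kv_heads or len(v) != num_kv_heads:
        raise ValueError("k and v must have num_kv_heads heads")
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    scale = 1.0 / math.sqrt(len(q[0][0]))

    out = []
    for h in range(num_q_heads):
        kv = h // group_size  # key-value head shared by this query head's group
        head_out = []
        for q_vec in q[h]:
            scores = [scale * sum(a * b for a, b in zip(q_vec, k_vec))
                      for k_vec in k[kv]]
            weights = softmax(scores)
            # Weighted sum of the value vectors, one output dimension at a time.
            head_out.append([
                sum(w * v_vec[j] for w, v_vec in zip(weights, v[kv]))
                for j in range(len(v[kv][0]))
            ])
        out.append(head_out)
    return out


# Four query heads share two key-value heads (group size 2).
q = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]
k = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
v = [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]]]
result = grouped_query_attention(q, k, v, num_kv_heads=2)
```

Because only `num_kv_heads` sets of keys and values are stored, the key-value cache that must be loaded at each decoding step shrinks by a factor of `num_q_heads / num_kv_heads` relative to multi-head attention.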

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Grouped-query attention · Multi-Query Attention · Softmax · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings