QCQA: Quality and Capacity-aware grouped Query Attention

Vinay Joshi; Prashant Laddha; Shambhavi Sinha; Om Ji Omer; Sreenivas Subramoney

arXiv:2406.10247 · cs.CL · June 18, 2024

TL;DR

QCQA uses an evolutionary algorithm to choose query head groupings in large language models, trading off KV-cache memory against text generation quality. At a similar KV-cache size it is more accurate than GQA.

Contribution

The paper presents QCQA, a new quality and capacity-aware grouping method for query heads that improves the tradeoff between KV-cache size and model accuracy in LLM inference.
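To make "grouping query heads" concrete, here is a minimal sketch of how the key (or value) heads of several query heads can be collapsed into one shared head per group. It assumes mean-pooling as the merge rule, which is the conversion commonly used for GQA, and it is not taken from the QCQA paper. The function name `merge_kv_heads`, the toy vectors, and the unequal grouping are all hypothetical and chosen only for illustration.

```python
def merge_kv_heads(heads, groups):
    """Mean-pool per-head vectors into one shared head per group.

    heads  : list of equal-length lists, one per original key (or value) head
    groups : list of lists of head indices; groups may differ in size
    """
    merged = []
    for group in groups:
        dim = len(heads[group[0]])
        # Average each coordinate over the heads assigned to this group.
        merged.append(
            [sum(heads[h][d] for h in group) / len(group) for d in range(dim)]
        )
    return merged


# Four toy heads with two features each (made-up numbers).
heads = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]

# GQA-style grouping: neighbouring heads, equal group sizes.
equal = merge_kv_heads(heads, [[0, 1], [2, 3]])

# A grouping with unequal group sizes (illustrative assumption only).
unequal = merge_kv_heads(heads, [[0, 1, 2], [3]])

print(equal)    # [[2.0, 0.0], [0.0, 3.0]]
print(len(unequal))  # 2 shared heads are cached instead of 4
```

Both groupings cache two heads instead of four, but they discard different information: which heads share a group determines how far each merged head drifts from the originals. That choice is the one a quality-aware grouping method tries to make well.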

Findings

01 For a similar KV-cache size, QCQA achieves 20% higher accuracy than GQA without fine-tuning.

02 After fine-tuning, QCQA outperforms GQA by 10.55% in accuracy for a similar KV-cache size.

03 QCQA requires 40% less KV-cache than GQA to reach similar accuracy.

Abstract

Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of corresponding key and value heads. However, MQA and GQA decrease the KV-cache size requirements at the expense of LLM accuracy (quality of text generation). These methods do not ensure an optimal tradeoff between KV-cache size and text generation quality due to the absence of quality-aware grouping of query heads. To address this issue, we propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a…
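The memory pressure described in the abstract can be estimated with simple arithmetic: the cache stores one key and one value vector per layer, per key/value head, per token. The sketch below compares standard multi-head attention (one KV head per query head), GQA, and MQA. The model shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements) is an assumed example, not a configuration reported in the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 covers the key tensor plus the value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed example shape, not taken from the paper.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

Cache size scales linearly with the number of KV heads, so every additional merge saves memory. The accuracy cost of a merge depends on which heads are grouped, which the formula does not capture.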

Taxonomy

Topics: Advanced Database Systems and Queries · Neural Networks and Applications · Cloud Computing and Resource Management

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Multi-Query Attention