QCQA: Quality and Capacity-aware grouped Query Attention
Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney

TL;DR
QCQA uses an evolutionary algorithm to choose how query heads are grouped in large language models, giving a better balance between KV-cache memory and text generation quality than GQA.
Contribution
The paper presents QCQA, a new quality and capacity-aware grouping method for query heads that improves the tradeoff between KV-cache size and model accuracy in LLM inference.
Findings
QCQA achieves 20% higher accuracy than GQA without fine-tuning.
Post fine-tuning, QCQA outperforms GQA by 10.55% in accuracy for similar KV-cache size.
QCQA needs a 40% smaller KV-cache than GQA to reach similar accuracy.
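To make the KV-cache figures above concrete, here is a minimal sketch of how cache size scales with the number of key/value heads. The model dimensions (32 layers, 32 query heads, head dimension 128, 4096 tokens, fp16) and the function name `kv_cache_bytes` are illustrative assumptions, not values taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)   # 32 query heads in 8 groups
mqa = kv_cache_bytes(num_kv_heads=1, **cfg)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, which is why the choice of grouping controls the memory side of the tradeoff.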
Abstract
Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of corresponding key and value heads. However, MQA and GQA decrease the KV-cache size requirements at the expense of LLM accuracy (quality of text generation). These methods do not ensure an optimal tradeoff between KV-cache size and text generation quality due to the absence of quality-aware grouping of query heads. To address this issue, we propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a…
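The grouping idea described in the abstract can be sketched as follows. This is not the paper's algorithm: the evolutionary search and its objective are not reproduced here, and the grouping below is picked by hand. The sketch only shows how a given partition of query heads maps onto shared key/value heads, assuming the shared heads are formed by mean-pooling the original ones. The helper names `merge_kv_heads` and `grouped_attention` are hypothetical.

```python
import numpy as np


def merge_kv_heads(kv, groups):
    """Mean-pool per-head K or V tensors into one shared head per group.

    kv:     array of shape (num_heads, seq_len, head_dim)
    groups: list of lists of head indices that partition range(num_heads)
    """
    num_heads = kv.shape[0]
    if sorted(h for g in groups for h in g) != list(range(num_heads)):
        raise ValueError("groups must partition the head indices")
    return np.stack([kv[g].mean(axis=0) for g in groups])


def grouped_attention(q, k_shared, v_shared, groups):
    """Causal attention where each query head uses its group's shared K/V head."""
    num_heads, seq_len, head_dim = q.shape
    out = np.empty_like(q)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    for gi, group in enumerate(groups):
        for h in group:
            scores = q[h] @ k_shared[gi].T / np.sqrt(head_dim)
            scores = np.where(causal, scores, -np.inf)
            scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
            weights = np.exp(scores)
            weights /= weights.sum(axis=-1, keepdims=True)
            out[h] = weights @ v_shared[gi]
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 5, 16)) for _ in range(3))

# Hand-picked partition of 8 query heads into 4 groups (illustrative only).
groups = [[0, 3, 5, 6, 7], [1], [2], [4]]

k_shared = merge_kv_heads(k, groups)
v_shared = merge_kv_heads(v, groups)
out = grouped_attention(q, k_shared, v_shared, groups)
print(k_shared.shape, out.shape)
```

Only `k_shared` and `v_shared` would need to be cached during inference, so the cache holds 4 heads instead of 8 in this example. A quality-aware method would compare candidate partitions by how much they degrade the model's outputs rather than fixing one by hand.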
Taxonomy
Topics: Advanced Database Systems and Queries · Neural Networks and Applications · Cloud Computing and Resource Management
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Multi-Query Attention
