GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Fanxu Meng

arXiv:2605.15250 · cs.LG · May 18, 2026

TL;DR

GQLA is a hardware-adaptive attention mechanism: a single set of trained weights exposes two algebraically equivalent decoding paths, and the runtime picks the one that matches the target hardware, so large language model inference can be optimized across platforms without retraining.

Contribution

The paper proposes GQLA, a minimal modification of MLA whose trained weights enable hardware-specific decoding paths and support tensor parallelism, with no retraining and no custom kernels.

Findings

01

GQLA keeps an MQA-absorb decoding path identical to MLA's, which matches the H100 roofline almost perfectly, while changing MLA only minimally.

02

Supports up to 8-way tensor parallelism on commodity GPUs.

03

Extends pretrained models to adapt to different hardware without retraining.

Abstract

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form, which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the…
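The claim that two decoding paths can be algebraically equivalent over the same parameters rests on a standard low-rank identity: if keys and values are linear up-projections of a shared latent, the up-projections can either be applied to the cache (expanded path) or folded into the query and output (absorbed path). The sketch below is an illustration under simplifying assumptions, not the paper's formulation: it ignores positional encodings, uses one up-projection pair per group, and every name in it (`attend_expanded`, `attend_absorbed`, `w_uk`, `w_uv`, `max_path_gap`) is hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    return [dot(row, v) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attend_expanded(q, latents, w_uk, w_uv):
    """Expanded path: up-project every cached latent into a key and a value,
    then run ordinary attention over that per-group cache."""
    keys = [matvec(w_uk, c) for c in latents]
    values = [matvec(w_uv, c) for c in latents]
    scale = 1.0 / math.sqrt(len(q))
    probs = softmax([scale * dot(q, k) for k in keys])
    return [sum(p * v[i] for p, v in zip(probs, values))
            for i in range(len(values[0]))]


def attend_absorbed(q, latents, w_uk, w_uv):
    """Absorbed path: fold w_uk into the query, attend over the shared
    latent cache directly, and apply w_uv once to the mixed latent."""
    q_latent = matvec(transpose(w_uk), q)
    scale = 1.0 / math.sqrt(len(q))
    probs = softmax([scale * dot(q_latent, c) for c in latents])
    mixed = [sum(p * c[i] for p, c in zip(probs, latents))
             for i in range(len(latents[0]))]
    return matvec(w_uv, mixed)


def max_path_gap(seed=0, d_latent=4, d_head=3, n_tokens=5,
                 n_groups=2, heads_per_group=2):
    """Largest absolute output difference between the two paths over
    randomly drawn groups and query heads (toy sizes, assumed values)."""
    rng = random.Random(seed)
    latents = rand_matrix(rng, n_tokens, d_latent)
    gap = 0.0
    for _ in range(n_groups):
        # One up-projection pair per group, shared by that group's query heads.
        w_uk = rand_matrix(rng, d_head, d_latent)
        w_uv = rand_matrix(rng, d_head, d_latent)
        for _ in range(heads_per_group):
            q = [rng.uniform(-1.0, 1.0) for _ in range(d_head)]
            a = attend_expanded(q, latents, w_uk, w_uv)
            b = attend_absorbed(q, latents, w_uk, w_uv)
            gap = max(gap, max(abs(x - y) for x, y in zip(a, b)))
    return gap
```

Calling `max_path_gap()` should return a value at floating-point rounding level, since both functions compute the same quantity with the multiplications regrouped. The two paths differ only in cost profile: the expanded path stores and reads larger per-group keys and values, while the absorbed path reads the small shared latent and does more arithmetic per token, which is the kind of trade-off a runtime could match to a given compute-bandwidth ratio.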
