GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Fanxu Meng

TL;DR
GQLA is a hardware-adaptive attention mechanism whose trained weights expose two equivalent decoding paths. The runtime selects the path that suits the target hardware, so large language model inference can be optimized across platforms without retraining.
Contribution
The paper proposes GQLA, a minimal modification of Multi-head Latent Attention (MLA) that enables hardware-specific decoding paths and supports tensor parallelism without retraining or custom kernels.
Findings
GQLA matches H100 roofline performance with minimal modifications.
Supports up to 8-way tensor parallelism on commodity GPUs.
Extends pretrained models to adapt to different hardware without retraining.
Abstract
Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the…
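The claim that two decoding paths can be algebraically equivalent over the same parameters can be illustrated with a small NumPy sketch. This is not the paper's parameterization: the shapes, the per-group up-projections `W_uk` and `W_uv`, and the function names `gqa_path` and `mqa_absorb_path` are assumptions made for illustration, and positional encoding (such as RoPE) is left out entirely. The sketch only shows the generic absorption identity: expanding a shared latent cache into per-group keys and values gives the same output as folding the up-projections into the query and the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
n_heads, n_groups = 8, 2      # query heads and key/value groups
d_head, d_latent = 16, 32     # per-head dimension and latent dimension
seq_len = 5                   # number of cached tokens
heads_per_group = n_heads // n_groups

c = rng.standard_normal((seq_len, d_latent))              # shared latent cache
W_uk = rng.standard_normal((n_groups, d_latent, d_head))  # assumed key up-projection per group
W_uv = rng.standard_normal((n_groups, d_latent, d_head))  # assumed value up-projection per group
q = rng.standard_normal((n_heads, d_head))                # queries for one decode step


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def gqa_path(q, c):
    # Expand the latent into a per-group key/value cache, then attend.
    k = np.einsum("tl,gld->gtd", c, W_uk)  # (groups, tokens, d_head)
    v = np.einsum("tl,gld->gtd", c, W_uv)
    out = np.empty((n_heads, d_head))
    for h in range(n_heads):
        g = h // heads_per_group
        p = softmax(k[g] @ q[h] / np.sqrt(d_head))
        out[h] = p @ v[g]
    return out


def mqa_absorb_path(q, c):
    # Attend directly over the shared latent cache; the up-projections
    # are absorbed into the query and applied to the output instead.
    out = np.empty((n_heads, d_head))
    for h in range(n_heads):
        g = h // heads_per_group
        q_lat = W_uk[g] @ q[h]                      # query mapped into latent space
        p = softmax(c @ q_lat / np.sqrt(d_head))    # same scores as the expanded path
        out[h] = (p @ c) @ W_uv[g]                  # value up-projection applied last
    return out


out_gqa = gqa_path(q, c)
out_absorb = mqa_absorb_path(q, c)
print(np.allclose(out_gqa, out_absorb))
```

In this toy setup the expanded path stores `n_groups * 2 * d_head` values per token while the absorbed path stores `d_latent`, at the cost of more arithmetic per query. That is the kind of compute-versus-bandwidth trade-off the abstract describes the runtime choosing between, though the actual cache sizes and costs in GQLA depend on details not given here.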
