Mixture of Attention Schemes (MoAS): Learning to Route Between MHA, GQA, and MQA

Esmail Gumaan

arXiv:2512.20650 · cs.AI · December 25, 2025

Open Access

TL;DR

The paper introduces MoAS, an architecture that uses a learned router to choose between MHA, GQA, and MQA for each token, aiming to balance model quality against inference efficiency in Transformer models.

Contribution

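MoAS is a novel architecture that learns to route dynamically between attention schemes. In the reported experiments, dynamic routing reaches a lower validation loss than a static mixture of the same schemes, and it offers potential for conditional-compute efficiency.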
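To make the contrast between dynamic routing and a static mixture concrete, here is a minimal sketch in plain Python. It assumes a soft, per-token weighted sum over the three branch outputs and a bias-free linear router; the paper summary does not specify these details, so `route_token`, `mix_outputs`, the router weights, and the placeholder branch outputs are all hypothetical illustrations rather than the author's implementation.

```python
import math

SCHEMES = ("MHA", "GQA", "MQA")

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, router_weights):
    """Hypothetical linear router: one logit per scheme, then softmax."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    return softmax(logits)

def mix_outputs(scheme_outputs, weights):
    """Weighted sum of the per-scheme output vectors for one token."""
    dim = len(scheme_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, scheme_outputs))
        for d in range(dim)
    ]

# Toy inputs (assumed values, for illustration only).
token = [0.5, -1.0, 2.0]
router_weights = [
    [1.0, 0.0, 0.0],  # logit for MHA
    [0.0, 1.0, 0.0],  # logit for GQA
    [0.0, 0.0, 1.0],  # logit for MQA
]
# Placeholders standing in for each attention branch's output on this token.
scheme_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Dynamic routing: weights depend on the token.
dynamic_w = route_token(token, router_weights)
dynamic_out = mix_outputs(scheme_outputs, dynamic_w)

# Static mixture: the same fixed weights for every token.
static_w = [1.0 / len(SCHEMES)] * len(SCHEMES)
static_out = mix_outputs(scheme_outputs, static_w)

# Scheme with the largest routing weight for this token.
chosen = SCHEMES[max(range(len(SCHEMES)), key=lambda i: dynamic_w[i])]
print(chosen, dynamic_w, dynamic_out, static_out)
```

A hard (top-1) variant would run only the `chosen` branch for each token, which is one way the conditional-compute potential could be realized; whether the paper does this is not stated in the summary.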

Findings

01

Dynamic routing reaches a lower validation loss than static mixtures on WikiText-2.

02

MoAS achieves performance close to the MHA baseline.

03

Code is publicly available for reproducibility.

Abstract

The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory requirements during inference. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage but often at the cost of model performance. In this work, we propose Mixture of Attention Schemes (MoAS), a novel architecture that dynamically selects the optimal attention scheme (MHA, GQA, or MQA) for each token via a learned router. We demonstrate that dynamic routing performs better than static averaging of schemes and achieves performance competitive with the MHA baseline while offering potential for conditional compute efficiency. Experimental results on WikiText-2 show that dynamic routing (val loss 2.3074) outperforms a static…
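The memory trade-off described in the abstract can be illustrated with a back-of-the-envelope KV cache calculation. The sketch below uses the standard accounting (one key and one value tensor per layer, sized by the number of KV heads); the layer count, head dimension, sequence length, and GQA group count are assumed values chosen for illustration, not the paper's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of the KV cache in bytes.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit storage (e.g. fp16 or bf16).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model: 12 layers, 12 query heads of dimension 64, 2048 tokens.
config = dict(num_layers=12, head_dim=64, seq_len=2048)

# Number of KV heads per scheme (GQA with 4 groups is an assumption).
kv_heads = {"MHA": 12, "GQA": 4, "MQA": 1}

sizes = {name: kv_cache_bytes(num_kv_heads=h, **config)
         for name, h in kv_heads.items()}

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, which is why MQA and GQA are attractive at inference time even though they may cost some model quality.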

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Healthcare