Mixture of Attention Schemes (MoAS): Learning to Route Between MHA, GQA, and MQA
Esmail Gumaan

TL;DR
The paper introduces MoAS, a dynamic attention scheme that chooses between MHA, GQA, and MQA for each token, balancing model quality and inference efficiency in Transformer models.
Contribution
MoAS is a novel architecture that learns to route between attention schemes (MHA, GQA, MQA) per token. Dynamic routing achieves lower validation loss than a static mixture of the same schemes, and it offers potential for conditional-compute efficiency.
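To make the difference between dynamic routing and a static mixture concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the soft (softmax-weighted) gate, the linear router, and all names (`route_token`, `mix_outputs`, `static_gate`, the toy weights and outputs) are assumptions made for illustration. The paper's router may, for instance, pick a single scheme per token instead of blending.

```python
import math

SCHEMES = ("MHA", "GQA", "MQA")

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token_vec, router_weights):
    """Hypothetical linear router: one weight row per scheme, softmax over logits."""
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    return softmax(logits)

def static_gate():
    """Static mixture: every token gets the same uniform weights."""
    return [1.0 / len(SCHEMES)] * len(SCHEMES)

def mix_outputs(scheme_outputs, gate):
    """Blend the per-scheme attention outputs for one token using gate weights."""
    dim = len(scheme_outputs[0])
    return [sum(g * out[i] for g, out in zip(gate, scheme_outputs)) for i in range(dim)]

# Toy values (assumed, not from the paper).
token = [0.5, -1.0, 2.0, 0.0]
router_weights = [
    [1.0, 0.0, 0.0, 0.0],  # MHA logit
    [0.0, 1.0, 0.0, 0.0],  # GQA logit
    [0.0, 0.0, 1.0, 0.0],  # MQA logit
]
# Stand-ins for the outputs each attention scheme would produce for this token.
scheme_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

gate = route_token(token, router_weights)           # token-dependent weights
dynamic_mix = mix_outputs(scheme_outputs, gate)
static_mix = mix_outputs(scheme_outputs, static_gate())
preferred = SCHEMES[gate.index(max(gate))]
```

In this sketch the dynamic gate depends on the token, so different tokens can lean on different schemes, while the static gate weights all three equally for every token.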
Findings
Dynamic routing (validation loss 2.3074 on WikiText-2) outperforms a static mixture of the schemes.
MoAS achieves performance close to the MHA baseline.
Code is publicly available for reproducibility.
Abstract
The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory requirements during inference. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage but often at the cost of model performance. In this work, we propose Mixture of Attention Schemes (MoAS), a novel architecture that dynamically selects the optimal attention scheme (MHA, GQA, or MQA) for each token via a learned router. We demonstrate that dynamic routing performs better than static averaging of schemes and achieves performance competitive with the MHA baseline while offering potential for conditional compute efficiency. Experimental results on WikiText-2 show that dynamic routing (val loss 2.3074) outperforms a static…
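The KV-cache trade-off described in the abstract can be illustrated with simple arithmetic. The sketch below uses the standard accounting (keys plus values, per layer, per KV head); the model configuration is hypothetical and is not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one batch of sequences.

    The factor 2 accounts for storing both K and V.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical configuration: 12 layers, 12 query heads, head_dim 64, 1024 tokens.
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 12, 12, 64, 1024

kv_heads_per_scheme = {
    "MHA": N_HEADS,  # one KV head per query head
    "GQA": 4,        # assumed: 4 KV groups shared by the 12 query heads
    "MQA": 1,        # a single KV head shared by all query heads
}

cache_mib = {
    name: kv_cache_bytes(N_LAYERS, kv, HEAD_DIM, SEQ_LEN) / 2**20
    for name, kv in kv_heads_per_scheme.items()
}

for name, size in cache_mib.items():
    print(f"{name}: {size:.0f} MiB")
```

Under these assumed settings the cache shrinks in proportion to the number of KV heads: 36 MiB for MHA, 12 MiB for GQA, and 3 MiB for MQA.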
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Healthcare
