Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization
Guanghui Song, Dongping Liao, Yiren Zhao, Kejiang Ye, Cheng-zhong Xu, Xitong Gao

TL;DR
mixSGA introduces a dynamic, token-wise mixture-of-experts approach for transformer models, optimizing memory and computation efficiency without discarding tokens, and outperforming static methods across multiple benchmarks.
Contribution
The paper proposes a mixture-of-experts method that combines token-wise routing with weight sharing to dynamically allocate KV resources in transformers, improving efficiency and performance.
Findings
Achieves higher ROUGE-L scores on instruction-following tasks.
Reduces perplexity under fixed KV budgets.
Outperforms static baseline methods across multiple model families.
Abstract
Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding "low-priority" tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-expert (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling…
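The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the greedy expert-by-expert selection order, the fixed per-expert capacities, and the example of 8, 4, and 2 cached KV heads per expert are all assumptions chosen for illustration. It shows one plausible form of expert-choice routing, where each expert picks its highest-scoring tokens, every token is kept, and tokens sent to coarser-grouped experts cache fewer KV heads.

```python
from typing import List


def expert_choice_route(scores: List[List[float]], capacities: List[int]) -> List[int]:
    """Assign every token to exactly one expert (no token is dropped).

    scores[t][e] is a hypothetical learned importance score of token t
    for expert e. Experts choose in index order: expert e takes its
    capacities[e] highest-scoring tokens among those still unassigned.
    """
    num_tokens = len(scores)
    if sum(capacities) != num_tokens:
        raise ValueError("capacities must sum to the number of tokens")
    assignment = [-1] * num_tokens
    for e, cap in enumerate(capacities):
        free = [t for t in range(num_tokens) if assignment[t] == -1]
        # Highest score first; ties are broken by token position.
        free.sort(key=lambda t: (-scores[t][e], t))
        for t in free[:cap]:
            assignment[t] = e
    return assignment


def kv_heads_cached(assignment: List[int], kv_heads_per_expert: List[int]) -> int:
    """Total number of KV heads stored across all tokens."""
    return sum(kv_heads_per_expert[e] for e in assignment)


if __name__ == "__main__":
    # Four tokens, three experts (illustrative numbers only).
    scores = [
        [0.9, 0.1, 0.0],
        [0.2, 0.7, 0.1],
        [0.1, 0.2, 0.7],
        [0.3, 0.3, 0.4],
    ]
    capacities = [1, 1, 2]          # assumed fixed token budget per expert
    kv_heads_per_expert = [8, 4, 2]  # assumed: 8 query heads, group sizes 1, 2, 4

    assignment = expert_choice_route(scores, capacities)
    print(assignment)
    print(kv_heads_cached(assignment, kv_heads_per_expert))
```

With these illustrative numbers, the four tokens together cache 16 KV heads, compared with 32 if every token kept all 8 heads, while no token is discarded.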
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face and Expression Recognition · Machine Learning and ELM
Methods: OPT
