Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

Guanghui Song; Dongping Liao; Yiren Zhao; Kejiang Ye; Cheng-zhong Xu; Xitong Gao

arXiv:2506.13541 · cs.CL · June 17, 2025

TL;DR

mixSGA introduces a dynamic, token-wise mixture-of-experts approach for transformer models, optimizing memory and computation efficiency without discarding tokens, and outperforming static methods across multiple benchmarks.

Contribution

The paper proposes a mixture-of-experts method that combines token-wise routing with weight sharing to dynamically allocate KV resources in transformers, improving both efficiency and performance.
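
To make the KV-resource trade-off concrete, here is a minimal arithmetic sketch, not the paper's implementation. It assumes that a group size g means g query heads share one key-value head (as in GQA), so a token handled with group size g stores 2 × (H / g) × d values per layer. The function names, head counts, and token fractions are hypothetical and chosen only for illustration.

```python
def kv_bytes_per_token(num_query_heads, head_dim, group_size, bytes_per_value=2):
    """KV-cache bytes for one token in one layer.

    Assumption: `group_size` query heads share a single KV head, so the
    number of KV heads is num_query_heads // group_size.
    """
    if num_query_heads % group_size != 0:
        raise ValueError("group_size must divide num_query_heads")
    num_kv_heads = num_query_heads // group_size
    # Factor 2 accounts for storing both a key and a value vector per KV head.
    return 2 * num_kv_heads * head_dim * bytes_per_value


def average_kv_bytes(num_query_heads, head_dim, token_fractions, bytes_per_value=2):
    """Expected KV bytes per token when tokens are split across group sizes.

    `token_fractions` maps group_size -> fraction of tokens routed to the
    expert with that group size (hypothetical values; must sum to 1).
    """
    if abs(sum(token_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("token fractions must sum to 1")
    return sum(
        fraction * kv_bytes_per_token(num_query_heads, head_dim, g, bytes_per_value)
        for g, fraction in token_fractions.items()
    )


# Hypothetical setup: 32 query heads, head dimension 128, 16-bit values.
full = kv_bytes_per_token(32, 128, group_size=1)
mixed = average_kv_bytes(32, 128, {1: 0.25, 2: 0.25, 4: 0.5})
```

Under these assumed numbers, the mixed allocation stores half as many KV bytes per token as giving every token its own KV head per query head, while every token still keeps some KV state.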

Findings

01. Achieves higher ROUGE-L scores on instruction-following tasks.

02. Reduces perplexity under fixed KV budgets.

03. Outperforms static baseline methods across multiple model families.

Abstract

Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding "low-priority" tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-experts (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling…
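
The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, whose details are truncated above. It is a generic expert-choice scheme under stated assumptions: each expert has a fixed capacity, experts pick their highest-scoring unassigned tokens in turn, and the capacities sum to the number of tokens so that no token is discarded. The score matrix, capacities, and function name are hypothetical.

```python
def expert_choice_route(scores, capacities):
    """Assign every token to exactly one expert.

    scores[t][e] is a (hypothetical) learned importance score of token t
    for expert e. Experts choose in index order; each takes its top
    `capacities[e]` tokens among those not yet assigned.
    """
    num_tokens = len(scores)
    if sum(capacities) != num_tokens:
        raise ValueError("capacities must sum to the number of tokens")

    assignment = [None] * num_tokens
    remaining = set(range(num_tokens))
    for expert, capacity in enumerate(capacities):
        # Rank unassigned tokens by this expert's score; ties break on token index.
        ranked = sorted(remaining, key=lambda t: (-scores[t][expert], t))
        for token in ranked[:capacity]:
            assignment[token] = expert
            remaining.discard(token)
    return assignment


# Hypothetical example: 4 tokens, 3 experts (e.g. fine to coarse KV grouping).
scores = [
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.50, 0.30, 0.20],
    [0.20, 0.20, 0.60],
]
assignment = expert_choice_route(scores, capacities=[1, 1, 2])
```

Letting experts choose tokens, rather than tokens choosing experts, fixes each expert's load in advance, which is what makes a fixed KV budget straightforward to enforce in a sketch like this one.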

Taxonomy

Topics: Video Surveillance and Tracking Methods · Face and Expression Recognition · Machine Learning and ELM

Methods: OPT