SD-MoE: Spectral Decomposition for Effective Expert Specialization

Ruijun Huang; Fang Dong; Xin Zhang; Hengjie Cao; Zhendong Huang; Anrui Chen; Jixian Zhou; Mengyi Chen; Yifeng Yang; Mingzhi Dong; Yujiang Wang; Jinlong Hou; Qin Lv; Robert P. Dick; Yuan Cheng; Fan Yang; Tun Lu; Chun Zhang; Li Shang

arXiv:2602.12556 · cs.LG · February 16, 2026

TL;DR

SD-MoE introduces a spectral decomposition approach to enhance expert specialization in Mixture-of-Experts models, addressing spectral overlap issues to improve model capacity and performance with minimal extra computation.

Contribution

The paper presents a novel spectral decomposition method for MoE models that improves expert specialization and overall performance by addressing spectral overlap and alignment issues.

Findings

01. Spectral overlap among experts limits specialization.
02. Spectral decomposition improves expert differentiation.
03. SD-MoE enhances downstream task performance.

Abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting effective capacity and model performance. In this work, we analyze the parameter and gradient spaces from a spectral perspective and uncover that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by the ubiquitous low-rank structure of human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space.…
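
Observation (1) in the abstract can be made concrete with a small numerical sketch. The snippet below is an illustration only, assuming NumPy and synthetic weight matrices; it is not the authors' measurement procedure. The function names (`dominant_subspace`, `subspace_overlap`), the matrix sizes, and the choice of the mean squared cosine of principal angles as the overlap score are all assumptions made for this example.

```python
import numpy as np


def dominant_subspace(weight, k):
    """Return the top-k left singular vectors of a weight matrix as columns."""
    u, _, _ = np.linalg.svd(weight, full_matrices=False)
    return u[:, :k]


def subspace_overlap(w_a, w_b, k):
    """Mean squared cosine of the principal angles between the top-k
    left singular subspaces of two matrices.

    A value near 1 means the dominant subspaces nearly coincide; for
    unrelated random matrices the value is expected to be roughly k / d_out.
    """
    u_a = dominant_subspace(w_a, k)
    u_b = dominant_subspace(w_b, k)
    # Singular values of U_a^T U_b are the cosines of the principal angles.
    cosines = np.linalg.svd(u_a.T @ u_b, compute_uv=False)
    return float(np.mean(cosines ** 2))


# Synthetic stand-ins for expert weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_out, d_in, k = 64, 32, 4

# A strong low-rank component shared by two "experts", plus expert-specific noise.
shared = 10.0 * rng.standard_normal((d_out, k)) @ rng.standard_normal((k, d_in))
expert_a = shared + rng.standard_normal((d_out, d_in))
expert_b = shared + rng.standard_normal((d_out, d_in))

# A third matrix with no shared component, for comparison.
independent = rng.standard_normal((d_out, d_in))

overlap_shared = subspace_overlap(expert_a, expert_b, k)
overlap_random = subspace_overlap(expert_a, independent, k)

print(f"overlap (shared dominant component): {overlap_shared:.3f}")
print(f"overlap (independent matrices):      {overlap_random:.3f}")
```

In this toy setting, the two matrices built on a common low-rank component score close to 1, while the independent pair scores far lower. This is the kind of contrast that a claim of overlapping dominant spectral components implies. The same score could in principle be applied to per-expert gradient matrices to probe observation (2), though the abstract does not say which metric the paper uses.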

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing