Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap
Feilong Liu

TL;DR
This paper introduces a geometric framework for analyzing expert specialization in MoE Transformer models. It finds that functional decorrelation and representational overlap coexist, and that routing sparsity appears to shape this geometry.
Contribution
It provides a unified Jacobian-PCA-Grassmann analysis framework and uncovers the geometric structure of MoE layers across pretrained models and routing strategies.
Findings
Experts show strong functional decorrelation with low Jacobian alignment.
Routing sparsity influences the degree of functional separation and subspace divergence.
MoE layers can be viewed as decorrelated operators over overlapping submanifolds.
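To make "low Jacobian alignment" concrete, here is a minimal sketch of one plausible way to measure it: the cosine similarity of two experts' input-output Jacobians under the Frobenius inner product. This summary does not give the paper's exact alignment metric, so the metric, the toy two-layer experts, and every name below (`expert_jacobian`, `jacobian_alignment`, `d_model`, `d_hidden`) are assumptions for illustration only.

```python
import numpy as np

def expert_forward(x, W1, W2):
    """Toy two-layer expert (hypothetical): W2 @ tanh(W1 @ x)."""
    return W2 @ np.tanh(W1 @ x)

def expert_jacobian(x, W1, W2):
    """Analytic Jacobian d(output)/d(input) of the toy expert at x."""
    h = np.tanh(W1 @ x)
    # d tanh(u)/du = 1 - tanh(u)^2, which scales each row of W1.
    return W2 @ ((1.0 - h**2)[:, None] * W1)

def jacobian_alignment(J_a, J_b):
    """Cosine similarity of two Jacobians under the Frobenius inner product.

    Assumed metric: 1 means identical local linear maps up to positive scale,
    0 means orthogonal (decorrelated) maps.
    """
    num = np.sum(J_a * J_b)
    den = np.linalg.norm(J_a) * np.linalg.norm(J_b)  # Frobenius norms
    return num / den

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 32

# Two independently initialized toy experts, standing in for trained ones.
experts = [
    (rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_model),
     rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_hidden))
    for _ in range(2)
]

x = rng.standard_normal(d_model)           # one token representation
J0 = expert_jacobian(x, *experts[0])
J1 = expert_jacobian(x, *experts[1])

self_alignment = jacobian_alignment(J0, J0)   # sanity check against itself
cross_alignment = jacobian_alignment(J0, J1)  # cross-expert alignment
print(f"self: {self_alignment:.3f}, cross: {cross_alignment:.3f}")
```

In a real MoE layer the Jacobians would come from automatic differentiation of the trained experts at routed tokens, and the alignment would be averaged over many tokens and expert pairs rather than read off a single input.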
Abstract
Mixture-of-Experts (MoE) architectures achieve scalable capacity through sparse routing, yet the geometric structure of expert specialization remains poorly understood. We introduce a unified Jacobian-PCA-Grassmann framework for analyzing MoE layers in both function space and representation space. Across pretrained MoE Transformers (Mistral, Qwen), we find a consistent structural asymmetry: experts exhibit strong functional decorrelation (consistently low, near-zero cross-expert Jacobian alignment) while their routed representations occupy distinct but partially overlapping subspaces. This indicates that functional decorrelation and representation overlap coexist rather than coincide in MoE specialization. Controlled routing experiments further indicate that routing sparsity appears to be a key factor shaping this geometry: top-k routing induces sharper functional separation and larger…
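The representation-space side of the analysis can be sketched in the same spirit: fit a PCA subspace to each expert's routed activations, then compare the subspaces through their principal angles. The example below uses the geodesic distance on the Grassmann manifold (the 2-norm of the principal angles), which is one common choice; the summary does not state which Grassmann metric the paper uses, and the synthetic activations and all function names are hypothetical.

```python
import numpy as np

def pca_basis(X, k):
    """Orthonormal basis (d x k) for the top-k principal subspace of the rows of X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T

def principal_angles(U, V):
    """Principal angles in radians between the spans of orthonormal bases U and V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    # Clip guards against singular values drifting slightly outside [0, 1].
    return np.arccos(np.clip(s, -1.0, 1.0))

def grassmann_distance(U, V):
    """Geodesic distance on the Grassmannian: 2-norm of the principal angles."""
    return np.linalg.norm(principal_angles(U, V))

rng = np.random.default_rng(0)
d, n, k = 32, 500, 4

# Synthetic stand-in for routed activations (an assumption, not real model data):
# each expert's tokens live mostly in a 4-dim subspace, and the two subspaces
# share 2 of their 4 directions.
Q, _ = np.linalg.qr(rng.standard_normal((d, 6)))
basis_a, basis_b = Q[:, 0:4], Q[:, 2:6]
X_a = 5.0 * (rng.standard_normal((n, 4)) @ basis_a.T) + 0.01 * rng.standard_normal((n, d))
X_b = 5.0 * (rng.standard_normal((n, 4)) @ basis_b.T) + 0.01 * rng.standard_normal((n, d))

U_a, U_b = pca_basis(X_a, k), pca_basis(X_b, k)
angles = np.sort(principal_angles(U_a, U_b))
distance = grassmann_distance(U_a, U_b)
print("principal angles (rad):", np.round(angles, 3))
print("Grassmann distance:", round(float(distance), 3))
```

Because the toy subspaces share two of four directions by construction, about two principal angles come out small and two close to π/2. That is the "distinct but partially overlapping" pattern described above, reproduced here only on synthetic data.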
