Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection
Dongxin Guo, Jikun Wu, and Siu Ming Yiu

TL;DR
This paper introduces an information-geometric framework for analyzing expert specialization in MoE models, providing theoretically grounded metrics that outperform existing heuristics in predicting training failure and correlating with performance.
Contribution
It offers the first rigorous geometric analysis of MoE specialization dynamics, proposing new metrics with strong theoretical justification and practical effectiveness.
Findings
Fisher Specialization Index correlates with downstream performance (r=0.91).
Fisher Heterogeneity Score predicts training failure with AUC=0.89.
Proposed metrics outperform validation-loss-based early stopping by 23%.
Abstract
Expert specialization is fundamental to Mixture-of-Experts (MoE) model success, yet existing metrics (cosine similarity, routing entropy) lack theoretical grounding and yield inconsistent conclusions under reparameterization. We present an information-geometric framework providing the first rigorous characterization of MoE specialization dynamics. Our key insight is that expert routing distributions evolve on the probability simplex equipped with the Fisher information metric, enabling formal analysis via Riemannian geometry. We prove that standard heuristic metrics violate parameterization invariance (Theorem 1), establish that specialization corresponds to geodesic flow with quantified approximation bounds (Theorem 2), and derive a failure predictor with theoretical threshold justification (Theorem 3). The framework introduces two principled metrics: Fisher Specialization Index (FSI)…
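To make the geometric premise concrete, here is a minimal sketch of the geodesic distance that the Fisher information metric induces on the probability simplex. For categorical distributions this distance has the well-known closed form d(p, q) = 2 arccos(Σ_i sqrt(p_i q_i)). The function name and the example routing vectors are hypothetical, chosen only for illustration. This is not the paper's FSI or FHS, whose definitions are not given in this summary.

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    Assumes p and q are sequences of non-negative numbers of equal length
    that each sum to 1 (for example, per-expert routing probabilities).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of entries")
    # Bhattacharyya coefficient: the inner product of the square-root vectors.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp to [0, 1] so floating-point rounding cannot push acos out of domain.
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)


# Hypothetical routing distributions over four experts.
uniform = [0.25, 0.25, 0.25, 0.25]   # router spreads tokens evenly
peaked = [0.85, 0.05, 0.05, 0.05]    # router favours expert 0
expert_0 = [1.0, 0.0, 0.0, 0.0]      # all tokens go to expert 0
expert_1 = [0.0, 1.0, 0.0, 0.0]      # all tokens go to expert 1

d_uniform_peaked = fisher_rao_distance(uniform, peaked)
d_uniform_onehot = fisher_rao_distance(uniform, expert_0)
d_disjoint = fisher_rao_distance(expert_0, expert_1)
```

The distance is bounded by π, which is reached when two routing distributions have disjoint support, as with `expert_0` and `expert_1` here. A distance of this kind depends only on the distributions themselves and not on how the router's logits are parameterized, which is the invariance property the abstract says cosine similarity and routing entropy lack.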
