TL;DR
This paper identifies a small subset of experts, called Super Experts, in Mixture-of-Experts large language models that are crucial for performance, especially in reasoning tasks, and demonstrates their unique activation patterns and importance.
Contribution
It introduces the concept of Super Experts in MoE LLMs, characterizes their activation patterns, and shows their critical role in model performance and internal dynamics.
Findings
Pruning Super Experts significantly degrades performance.
Super Experts exhibit extreme activation outliers.
Compressing Super Experts disrupts the model's attention mechanisms.
Abstract
In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in the forward inference of MoE LLMs. These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 experts causes Qwen3-30B-A3B to generate repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs: (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs is model-specific, data-agnostic, and remains unaffected by post-training processes. (ii) By pruning…
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper presents the first rigorous identification and mechanistic explanation of "Super Experts" (SEs), the primary source of systematic outlier phenomena, whose removal severely impairs performance, particularly in reasoning tasks.
2. By linking expert-level routing dynamics to Transformer-wide outlier mechanisms in MoE models, this work offers a principled account of known empirical phenomena and reveals critical vulnerabilities in current expert compression methods.
The identification of SEs heavily relies on an empirical threshold defined in Equation (6), where experts are selected based on exceeding the 99.5-percentile activation magnitude. This criterion appears rather extreme and ad-hoc, raising concerns that the observed stability of SEs may stem from the definition itself rather than reflecting an intrinsic property of the model.
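The percentile criterion this review refers to can be illustrated with a minimal sketch. Equation (6) is not reproduced on this page, so the code below implements only a plain 99.5-percentile cut; the paper's actual rule may include further conditions. The function name `select_outlier_experts`, the array `peak`, the 48 x 128 grid (chosen only so that the total matches the 6,144 experts quoted in the abstract), and the three planted outliers are all hypothetical and synthetic.

```python
import numpy as np


def select_outlier_experts(peak, percentile=99.5):
    """Flag (layer, expert) pairs whose peak activation exceeds a percentile cut.

    `peak` is a hypothetical array of shape (num_layers, num_experts) holding,
    for each expert, the largest absolute value seen in its down_proj output.
    This is an illustrative percentile rule, not the paper's Equation (6).
    """
    threshold = np.percentile(peak, percentile)
    layers, experts = np.nonzero(peak > threshold)
    return threshold, list(zip(layers.tolist(), experts.tolist()))


# Synthetic data: 48 x 128 = 6,144 entries of ordinary magnitude,
# plus three planted extreme outliers at arbitrary positions.
rng = np.random.default_rng(0)
peak = rng.uniform(0.0, 1.0, size=(48, 128))
planted = [(1, 68), (2, 82), (3, 92)]
for (layer, expert), value in zip(planted, [500.0, 400.0, 300.0]):
    peak[layer, expert] = value

threshold, selected = select_outlier_experts(peak)
print(f"threshold={threshold:.3f}, flagged={len(selected)}")
```

Note that a plain percentile cut flags roughly 0.5% of all entries no matter how the activations are distributed, so on its own it cannot separate a handful of extreme experts from merely large ones. That is one way to read the reviewer's concern that the selection may partly reflect the definition rather than the model.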
1. **Interesting Empirical Finding**: The paper identifies a clear and striking phenomenon: a tiny, identifiable subset of experts is responsible for the vast majority of model stability. The empirical evidence, particularly the stark contrast between pruning SEs versus a large number of random experts (Fig. 1, Tables 3-5), is convincing. 2. **Plausible Mechanistic Explanation**: The paper provides a commendable in-depth analysis of why these experts are so critical. It successfully links the M
1. **Is this a new discovery or a restatement?** The paper's core finding is that SEs cause massive activations (MAs), which in turn create attention sinks. The phenomenon and importance of massive activations are already established in the literature. The paper's main contribution seems to be locating the source of MAs within specific experts in MoE models. In Appendix H, the authors also analyze super weights located within Super Experts that cause massive activations. It's unclear to me whether "Supe
This work is the first to analyze the source of massive activations (MA) in Mixture-of-Experts (MoE) LLMs, providing a quantitative definition for this source by introducing the concept of Super Experts (SEs). The paper is well-written and easy to follow, and its core claims are supported by extensive experimental evidence.
See questions below.
