Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks
Andrei Chernov

TL;DR
This paper investigates how experts behave in an MoE-based LLM on the quiz-based MMLU benchmark. It finds that most experts are never activated, that gating outputs are much closer to uniform than sparse, and that performance varies widely across experts within the same layer.
Contribution
It takes a first step toward post-evaluation analysis of MoE layer properties by measuring expert contributions in an MoE LLM on the MMLU benchmark.
Findings
Most experts were never activated during inference on MMLU.
Gating network outputs are much closer to uniform than to sparse.
Expert performance varies significantly within the same layer.
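The first two findings can be illustrated with a minimal sketch. It assumes a hypothetical routing log (for each token, the ids of the experts the gate selected) and gating outputs given as probability lists; the function names, the toy data, and the use of normalized entropy as a uniformity measure are illustrative assumptions, not the paper's actual procedure.

```python
import math
from collections import Counter


def never_activated(routing_log, num_experts):
    """Return the sorted ids of experts that never appear in the routing log.

    routing_log: one entry per token, each a list of selected expert ids
    (hypothetical format, assumed for this sketch).
    """
    counts = Counter(e for selected in routing_log for e in selected)
    return [e for e in range(num_experts) if counts[e] == 0]


def normalized_entropy(probs):
    """Shannon entropy of a gating distribution divided by log(n).

    1.0 means perfectly uniform; 0.0 means all mass on a single expert.
    """
    n = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


# Toy data: 4 tokens, top-2 routing, 8 experts in the layer (all assumed).
routing_log = [[0, 1], [1, 0], [0, 1], [1, 3]]
inactive = never_activated(routing_log, num_experts=8)
uniform_score = normalized_entropy([0.25, 0.25, 0.25, 0.25])
sparse_score = normalized_entropy([1.0, 0.0, 0.0, 0.0])
```

A normalized entropy near 1.0 averaged over tokens would correspond to the "close to uniform" observation, while a long `inactive` list would correspond to experts that were never selected.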
Abstract
Recently, Large Language Models (LLMs) with Mixture of Experts (MoE) layers have gained significant attention, and many state-of-the-art LLMs now use this architecture. There is a substantial body of research on how to train such models and how to select hyperparameters for them. However, few studies focus on post-evaluation analysis of MoE layer properties. In this paper, we take a first step toward closing this gap by evaluating expert contributions on the quiz-based MMLU benchmark. We show that most experts were never activated during inference on this benchmark. Additionally, the output distribution of the gating networks is much closer to uniform than to sparse. Finally, we demonstrate that average performance varies significantly across experts within the same layer.
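One way to make "average performance of an expert" concrete is sketched below. It assumes, hypothetically, that each benchmark question is recorded as the set of experts activated in a given layer plus whether the model answered correctly, and it defines an expert's score as the accuracy over the questions on which it was activated. This definition, the record format, and all names are assumptions for illustration; the abstract does not specify how the paper computes per-expert performance.

```python
from collections import defaultdict


def expert_accuracy(records):
    """Map each expert id to the accuracy over questions where it was active.

    records: iterable of (activated_expert_ids, answered_correctly) pairs
    for a single MoE layer (hypothetical format, assumed for this sketch).
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for experts, correct in records:
        for e in set(experts):  # count each expert once per question
            total[e] += 1
            hits[e] += int(correct)
    return {e: hits[e] / total[e] for e in total}


# Toy records for one layer: (experts active on the question, correct?)
records = [
    ({0, 1}, True),
    ({0, 2}, False),
    ({1, 2}, True),
    ({0, 1}, True),
]
acc = expert_accuracy(records)
spread = max(acc.values()) - min(acc.values())
```

A large `spread` within one layer would be consistent with the reported variation across experts; experts that were never activated simply do not appear in the result.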
Taxonomy
Topics: Big Data and Business Intelligence · Business Process Modeling and Analysis · Software Engineering Techniques and Practices
Methods: Mixture of Experts
