Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li

TL;DR
Grove MoE introduces a heterogeneous expert architecture with dynamic activation, improving efficiency and performance of large language models by activating only relevant parameters based on input complexity.
Contribution
The paper proposes Grove MoE, a novel architecture with experts of varying sizes and a dynamic activation mechanism, enhancing efficiency and scalability of MoE-based LLMs.
Findings
Grove MoE models activate 3.14-3.28B parameters based on token complexity.
Grove MoE models achieve comparable performance to larger SOTA models.
Dynamic expert activation improves computational efficiency.
Abstract
The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the…
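A minimal sketch of how grouped experts with shared adjugate experts could make the activated parameter count depend on the token. It assumes that each group's adjugate expert runs once per token, no matter how many of that group's routed experts are selected. The contiguous grouping, the function names, and the parameter sizes are all hypothetical illustrations, not the authors' implementation.

```python
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k largest router scores (ties go to the lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def active_groups(selected: Sequence[int], group_size: int) -> List[int]:
    """Distinct group ids touched by the selected experts.

    Assumption: experts are grouped contiguously by index, and each group
    shares one adjugate expert that is computed at most once per token.
    """
    return sorted({expert // group_size for expert in selected})


def activated_params(
    selected: Sequence[int],
    group_size: int,
    expert_params: int,
    adjugate_params: int,
) -> int:
    """Parameters used for one token: routed experts plus distinct adjugate experts."""
    n_adjugates = len(active_groups(selected, group_size))
    return len(selected) * expert_params + n_adjugates * adjugate_params


# Hypothetical toy setup: 8 routed experts, 4 groups of 2, top-4 routing.
GROUP_SIZE = 2
EXPERT_PARAMS = 100    # illustrative size, not taken from the paper
ADJUGATE_PARAMS = 10   # illustrative size, not taken from the paper

# Token A: the top-4 experts fall into only 2 groups.
scores_a = [0.9, 0.8, 0.7, 0.6, 0.1, 0.1, 0.1, 0.1]
# Token B: the top-4 experts are spread across all 4 groups.
scores_b = [0.9, 0.1, 0.8, 0.1, 0.7, 0.1, 0.6, 0.1]

selected_a = top_k(scores_a, 4)
selected_b = top_k(scores_b, 4)

params_a = activated_params(selected_a, GROUP_SIZE, EXPERT_PARAMS, ADJUGATE_PARAMS)
params_b = activated_params(selected_b, GROUP_SIZE, EXPERT_PARAMS, ADJUGATE_PARAMS)
print(selected_a, params_a)
print(selected_b, params_b)
```

In this toy setup the number of routed experts per token is fixed, and only the number of distinct adjugate experts varies. That keeps the activated parameter count within a narrow band, which is consistent in spirit with the 3.14-3.28B range reported above.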
Peer Reviews
Decision · Submitted to ICLR 2026
The paper is generally well-written and clearly structured, making the proposed method easy to follow. The experimental evaluation is fairly comprehensive, covering multiple benchmarks and including relevant ablation studies to analyze the impact of key design choices.
* **Prior Works Comparison**: Allocating different expert capacities via structural scaling is related to a line of recent works [1,2] that explore heterogeneous expert sizes within MoE layers. However, the paper does not compare against or discuss these prior approaches.
* **Loading Balance Sensitivity**: While Grove MoE incorporates a load-balancing mechanism, the paper does not sufficiently address whether its heterogeneous expert structure is more sensitive to imbalance compared to standar
- The Grove MoE architecture presents an interesting design where experts are organized into groups sharing adjugate experts, offering a fresh perspective on dynamic computation allocation inspired by heterogeneous computing principles, which provides a new direction for improving parameter efficiency in MoE models.
- The paper demonstrates promising empirical results on a wide range of tasks spanning general knowledge, mathematical reasoning, and coding (e.g., notable improvements on AIME25
- The core mechanism of grouping experts and sharing adjugate experts is essentially a parameter-sharing strategy reminiscent of existing work on parallel blocks (e.g., AltUp), and the connection to big.LITTLE architecture appears more metaphorical than technically substantive, as the "dynamic" allocation is merely a byproduct of static grouping rather than adaptive routing.
- The paper lacks ablation studies on critical components such as the loss-free load balancing strategy (Equation 8) and
1. The paper proposes an interesting architectural change with the potential to improve MoE quality.
2. The experiments are large in scale.
3. The authors perform multiple stages of training, which resembles a modern LLM training pipeline.
1. It is unclear how much of the improvement comes from the datasets used.
2. Reproducibility is limited without details about the datasets.
3. Comparison with architectures trained in the same controlled setup is limited.
Taxonomy
Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications
