Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Haoyuan Wu; Haoxing Chen; Xiaodong Chen; Zhanchao Zhou; Tieyuan Chen; Yihong Zhuang; Guoshan Lu; Zenan Huang; Junbo Zhao; Lin Liu; Zhenzhong Lan; Bei Yu; Jianguo Li

arXiv:2508.07785·cs.CL·August 12, 2025

Open Access · 3 Models · 3 Reviews

TL;DR

Grove MoE introduces a heterogeneous expert architecture with dynamic activation, improving efficiency and performance of large language models by activating only relevant parameters based on input complexity.

Contribution

The paper proposes Grove MoE, a novel architecture with experts of varying sizes and a dynamic activation mechanism, enhancing efficiency and scalability of MoE-based LLMs.

Findings

1. Grove MoE models activate 3.14-3.28B parameters based on token complexity.

2. Grove MoE models achieve performance comparable to larger SOTA models.

3. Dynamic expert activation improves computational efficiency.

Abstract

The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the…
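A minimal sketch of how group-shared adjugate experts could make the activated compute depend on the token. It rests on assumptions not stated in the abstract: routed experts are split into contiguous groups of equal size, each group shares one adjugate expert, and that adjugate expert is counted once per token no matter how many of its group's experts the router selects. The function names, the group layout, and the cost values (`expert_cost`, `adjugate_cost`) are hypothetical and chosen only for illustration.

```python
def top_k(scores, k):
    """Indices of the k highest routing scores (ties go to the lower index)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def activated_cost(scores, k, group_size, expert_cost=1.0, adjugate_cost=0.5):
    """Illustrative compute cost for one token.

    Assumption: expert i belongs to group i // group_size, and a group's
    shared adjugate expert is counted once if any of its experts is chosen.
    """
    chosen = top_k(scores, k)
    groups = {i // group_size for i in chosen}  # distinct groups that were hit
    return len(chosen) * expert_cost + len(groups) * adjugate_cost


# Eight experts in four groups of two, with top-2 routing.
concentrated = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]  # both picks in group 0
spread = [0.9, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1]        # picks in groups 0 and 1

cost_concentrated = activated_cost(concentrated, k=2, group_size=2)
cost_spread = activated_cost(spread, k=2, group_size=2)
print(cost_concentrated, cost_spread)
```

Under these assumptions, a token whose selected experts fall in the same group activates fewer parameters than one whose experts are spread across groups, which is one way a fixed top-k router could still yield a range of activated parameters.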

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper is generally well-written and clearly structured, making the proposed method easy to follow. The experimental evaluation is fairly comprehensive, covering multiple benchmarks and including relevant ablation studies to analyze the impact of key design choices.

Weaknesses

* **Prior Works Comparison**: Allocating different expert capacities via structural scaling is related to a line of recent works [1,2] that explore heterogeneous expert sizes within MoE layers. However, the paper does not compare against or discuss these prior approaches.
* **Loading Balance Sensitivity**: While Grove MoE incorporates a load-balancing mechanism, the paper does not sufficiently address whether its heterogeneous expert structure is more sensitive to imbalance compared to standar

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The Grove MoE architecture presents an interesting design where experts are organized into groups sharing adjugate experts, offering a fresh perspective on dynamic computation allocation inspired by heterogeneous computing principles, which provides a new direction for improving parameter efficiency in MoE models.
- The paper demonstrates promising empirical results on a wide range of tasks spanning general knowledge, mathematical reasoning, and coding (e.g., notable improvements on AIME25

Weaknesses

- The core mechanism of grouping experts and sharing adjugate experts is essentially a parameter-sharing strategy reminiscent of existing work on parallel blocks (e.g., AltUp), and the connection to big.LITTLE architecture appears more metaphorical than technically substantive, as the "dynamic" allocation is merely a byproduct of static grouping rather than adaptive routing.
- The paper lacks ablation studies on critical components such as the loss-free load balancing strategy (Equation 8) and

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper proposes an interesting change in the architecture, with the potential to improve MoE quality.
2. The experiments are large in scale.
3. The authors perform multiple stages of training, which resembles a modern LLM training pipeline.

Weaknesses

1. It is unclear how much of the improvement comes from the datasets used.
2. Reproducibility is limited without details about the dataset.
3. Comparison with architectures trained in the same controlled setup is limited.

Taxonomy

Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications