MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Xiaodong Chen; Mingming Ha; Zhenzhong Lan; Jing Zhang; Jianguo Li

arXiv:2508.05257·cs.LG·August 8, 2025

TL;DR

The paper proposes MoBE, a novel compression method for MoE-based large language models that significantly reduces parameters with minimal accuracy loss by decomposing expert matrices into shared basis components.

Contribution

MoBE introduces a basis-sharing decomposition of expert matrices in MoE models, enabling effective compression while maintaining high accuracy.

Findings

01. Achieves a 24-30% parameter reduction with only a 1-2% accuracy drop.

02. Outperforms prior compression methods in accuracy retention.

03. Demonstrates effectiveness on models with up to 1 trillion parameters.

Abstract

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% relatively) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis…
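The factorization described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the abstract is truncated above, so the way the bases are combined (one coefficient vector `alpha` per expert over `m` shared basis matrices) is an assumption, as are all shapes and names (`mobe_reconstruct`, `d_out`, `d_in`, `r`, `m`). The optional `activation` argument reflects a reviewer's remark that a non-linearity is added inside the factorization; its exact form and placement are assumed here.

```python
import numpy as np

def mobe_reconstruct(A, alpha, bases, activation=None):
    """Rebuild one expert matrix as W_hat = A @ f(sum_j alpha[j] * bases[j]).

    A:      (d_out, r)    expert-specific factor
    alpha:  (m,)          expert-specific mixing coefficients (assumed form)
    bases:  (m, r, d_in)  basis matrices shared by all experts
    """
    # Contract the coefficient vector against the stack of bases -> (r, d_in).
    B = np.tensordot(alpha, bases, axes=1)
    if activation is not None:
        B = activation(B)  # assumed placement of the non-linearity
    return A @ B

# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
d_out, r, d_in, m = 8, 4, 16, 3
A = rng.standard_normal((d_out, r))
alpha = rng.standard_normal(m)
bases = rng.standard_normal((m, r, d_in))

W_hat = mobe_reconstruct(A, alpha, bases)
print(W_hat.shape)
```

With `d_in` larger than `d_out`, the shared factor `B` is the larger of the two, which matches the abstract's description of `B` as the "relatively larger matrix" and is why sharing it across experts saves parameters.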

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The problem is important for deploying trillion-level MoE. The idea is simple but quite effective. The paper proposes to decompose the up/gate matrix into shared basis matrices B across experts to capture the common information across experts and keep matrix A per expert to encode specific information, and to add non-linearity inside the matrix factorization to enhance representational power.
- The paper is well-written. The equations and algorithm steps are easy to follow.
- The paper conduct

Weaknesses

- The paper offers limited theory or formal approximation guarantees. Most of the support comes from empirical studies.
- The choice of hyper-parameters lacks guidance, including the choice of basis count m and the rank r. The compression rate and the accuracy frontiers are not fully mapped.
- No study of light-weight finetuning or knowledge distillation to close the last 1%-2% gap.
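To make the hyper-parameter concern concrete, a rough parameter count shows how the basis count m and the rank r together set the compression ratio for one group of expert matrices. This is a back-of-the-envelope sketch under assumed shapes: each expert keeps its own A of size `d_out × r` plus `m` mixing coefficients, and `m` basis matrices of size `r × d_in` are shared. The function name and the example sizes are hypothetical, and the figure it produces covers a single matrix type in a single layer, so it is not comparable to the whole-model reduction reported in the paper.

```python
def mobe_param_ratio(n_experts, d_out, d_in, r, m):
    """Ratio of factorized to dense parameter count for one matrix type.

    Assumed layout: per-expert A (d_out x r) and m coefficients,
    plus m shared basis matrices (r x d_in).
    """
    dense = n_experts * d_out * d_in
    per_expert = n_experts * (d_out * r + m)
    shared = m * r * d_in
    return (per_expert + shared) / dense

# Hypothetical sizes, not taken from any model in the paper.
ratio = mobe_param_ratio(n_experts=64, d_out=1024, d_in=4096, r=1024, m=16)
print(f"factorized / dense = {ratio:.3f}")
```

Under these assumptions, increasing `m` or `r` raises the ratio (less compression), which is the trade-off the reviewer asks the authors to map against accuracy.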

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- **Novel and theoretically sound compression framework**
  - The MoBE formulation, where each expert is a weighted sum of basis experts, provides a principled way to capture and exploit inter-expert redundancy ($\text{Expert}_i = \sum_j \alpha_{ij} \cdot \text{Basis}_j$, Eq. 1; Sec. 3.2; p.4). This is a clear and impactful contribution.
  - The framework naturally separates shared knowledge (the basis experts) from specialized knowledge (the combination coefficients), offering a more struct

Weaknesses

- **Missing key experimental results and references**
  - The paper repeatedly references **Table 3** for key quantitative results that are central to its claims of outperforming baselines. However, **Table 3 does not exist** in the manuscript or its appendices. This is a critical omission that makes it impossible to verify the core experimental findings.
  - The review references **Table 8** and **Figure 9** in the appendices for further analysis, but these elements are also **not found** in th

Reviewer 03 · Rating 8 · Confidence 3

Strengths

A meaningful architectural re-parameterisation of MoE experts that is novel relative to linear SVD-sharing approaches and practically validated at unprecedented model scales. Results seem impressive and should be reproducible (I'm assuming there will be a link to code if the paper is accepted).

Weaknesses

- Report end-to-end efficiency, not just parameter counts.
- Strengthen parity and scalability of baselines: D2-MoE is omitted on trillion-scale models for feasibility; include either (a) scaled-down controlled runs at matched ratios, or (b) additional scalable baselines, so large-model wins aren’t confounded by method availability.
- Broaden ablations/analyses: in particular, I'd be interested in an analysis involving downstream tasks.
