TL;DR
The paper proposes MoBE, a compression method for MoE-based large language models that cuts parameter count by 24-30% with only a 1-2% accuracy drop by factorizing each expert's up/gate matrices and sharing basis matrices across experts.
Contribution
MoBE introduces a basis-sharing decomposition of expert matrices in MoE models, enabling effective compression while maintaining high accuracy.
Findings
Achieves 24-30% parameter reduction with only 1-2% accuracy drop.
Outperforms prior compression methods in accuracy retention.
Demonstrates effectiveness on models with up to 1 trillion parameters.
Abstract
The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% in relative terms) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis…
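The factorization described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract is cut off above, so the exact form of the basis combination is an assumption, and the optional activation is taken only from a reviewer's description of "non-linearity inside the matrix factorization". All names (`mobe_reconstruct`, `param_counts`, `alpha`, `bases`) and all dimensions are hypothetical, and the toy parameter counts are not meant to reproduce the 24-30% figure reported in the paper.

```python
import numpy as np


def silu(x):
    """SiLU activation; used here only as an example non-linearity (assumption)."""
    return x / (1.0 + np.exp(-x))


def mobe_reconstruct(A, alpha, bases, activation=None):
    """Rebuild one weight matrix per expert from shared basis matrices.

    A:     (n_experts, p, r)  expert-specific factors
    alpha: (n_experts, m)     per-expert mixing coefficients (hypothetical name)
    bases: (m, r, d)          basis matrices shared by all experts
    """
    # B_i = sum_j alpha[i, j] * bases[j], a linear combination of the bases
    B = np.einsum("ij,jrd->ird", alpha, bases)
    if activation is not None:
        # Assumed placement of the non-linearity mentioned by a reviewer
        B = activation(B)
    # W_i = A_i @ B_i, computed for all experts as a batched matmul
    return A @ B


def param_counts(n_experts, p, d, r, m):
    """Parameters for dense expert matrices vs. the factorized form."""
    dense = n_experts * p * d
    compressed = n_experts * p * r + m * r * d + n_experts * m
    return dense, compressed


# Toy dimensions chosen for illustration only
rng = np.random.default_rng(0)
n_experts, p, d, r, m = 8, 16, 64, 16, 2
A = rng.standard_normal((n_experts, p, r))
alpha = rng.standard_normal((n_experts, m))
bases = rng.standard_normal((m, r, d))

W_hat = mobe_reconstruct(A, alpha, bases, activation=silu)
dense, compressed = param_counts(n_experts, p, d, r, m)
print(W_hat.shape, dense, compressed)
```

In this sketch the savings come from storing only `m` basis matrices of size `r x d` instead of one per expert, so the benefit grows with the number of experts relative to `m`.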
Peer Reviews
Decision · ICLR 2026 Poster
- The problem is important for deploying trillion-parameter-scale MoE models. The idea is simple but quite effective: the paper decomposes each up/gate matrix so that basis matrices B are shared across experts to capture common information, keeps matrix A per expert to encode expert-specific information, and adds a non-linearity inside the matrix factorization to enhance representational power.
- The paper is well-written. The equations and algorithm steps are easy to follow.
- The paper conduct
- The paper offers limited theory or formal approximation guarantees; most of the support comes from empirical studies.
- The choice of hyper-parameters lacks guidance, including the basis count m and the rank r. The frontier of compression rate versus accuracy is not fully mapped.
- No study of lightweight fine-tuning or knowledge distillation to close the last 1-2% gap.
- **Novel and theoretically sound compression framework**
  - The MoBE formulation, where each expert is a weighted sum of basis experts, provides a principled way to capture and exploit inter-expert redundancy ($\text{Expert}_i = \sum_j \alpha_{ij} \cdot \text{Basis}_j$, Eq. 1; Sec. 3.2; p.4). This is a clear and impactful contribution.
  - The framework naturally separates shared knowledge (the basis experts) from specialized knowledge (the combination coefficients), offering a more struct
- **Missing key experimental results and references**
  - The paper repeatedly references **Table 3** for key quantitative results that are central to its claims of outperforming baselines. However, **Table 3 does not exist** in the manuscript or its appendices. This is a critical omission that makes it impossible to verify the core experimental findings.
  - The review references **Table 8** and **Figure 9** in the appendices for further analysis, but these elements are also **not found** in th
A meaningful architectural re-parameterisation of MoE experts that is novel relative to linear SVD-sharing approaches and practically validated at unprecedented model scales. Results seem impressive and should be reproducible (I'm assuming there will be a link to code if the paper is accepted).
- Report end-to-end efficiency, not just parameter counts.
- Strengthen parity and scalability of baselines: D2-MoE is omitted on trillion-scale models for feasibility; include either (a) scaled-down controlled runs at matched ratios, or (b) additional scalable baselines, so large-model wins aren't confounded by method availability.
- Broaden ablations/analyses: in particular, I'd be interested in an analysis involving downstream tasks.
