ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts
Zheyue Tan, Zhiyuan Li, Tao Yuan, Dong Zhou, Weilin Liu, Yueqing Zhuang, Yadong Li, Guowei Niu, Cheng Qin, Zhuyu Yao, Congyi Liu, Haiyang Xu, Boxun Li, Guohao Dai, Bo Zhao, Yu Wang

TL;DR
ReXMoE introduces a novel MoE architecture that reuses experts across layers with a progressive scaling routing strategy, enhancing model expressiveness and performance without increasing parameters.
Contribution
It proposes ReXMoE, a new MoE design that allows expert reuse across layers and a progressive routing method to improve scalability and effectiveness.
Findings
ReXMoE outperforms traditional layer-local MoE models in language tasks.
The approach improves performance across models from 0.5B to 7B parameters.
ReXMoE maintains efficiency while enhancing model expressiveness.
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scaling Large Language Models (LLMs). MoE improves efficiency by activating only a subset of experts per token. Recent work shows that fine-grained experts substantially enrich the combinatorial flexibility of active experts and enhance model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. Under a fixed parameter budget, this forces a careful trade-off between expert dimensionality and routing diversity. We describe ReXMoE, a novel MoE architecture that improves routing beyond existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert…
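The cross-layer reuse idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the grouping rule (consecutive blocks of `reuse` layers share one pool), the top-k softmax router, and all function names are assumptions made for illustration. It shows how pooling experts from adjacent layers enlarges the number of possible expert combinations while the total number of experts stays fixed.

```python
import math

def candidate_layers(layer, num_layers, reuse):
    """Layers whose experts `layer` may route to.

    Assumption: layers are split into consecutive groups of `reuse`
    layers, and every layer in a group shares the group's experts.
    """
    start = (layer // reuse) * reuse
    return list(range(start, min(start + reuse, num_layers)))

def route(scores, top_k):
    """Select the top_k experts by router score and softmax-normalise
    their weights. `scores` maps (layer, expert) ids to router logits."""
    chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
    z = sum(math.exp(scores[e]) for e in chosen)
    return {e: math.exp(scores[e]) / z for e in chosen}

def num_combinations(experts_per_layer, reuse, top_k):
    """Number of distinct top_k expert subsets available to one layer."""
    return math.comb(experts_per_layer * reuse, top_k)

if __name__ == "__main__":
    # Layer 5 of 8 with reuse 4 routes over the experts of layers 4..7.
    print(candidate_layers(5, 8, 4))
    # Layer-local routing versus a pool shared across 4 layers.
    print(num_combinations(8, 1, 2), num_combinations(8, 4, 2))
```

With 8 experts per layer and top-2 routing, the layer-local pool offers 28 combinations, while a pool shared across 4 layers offers 496. The only extra parameters in this sketch are the wider router outputs, which matches the "minimal overhead" framing in the summary.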
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper introduces a conceptually elegant approach to MoE design by enabling cross-layer expert reuse, representing a meaningful departure from conventional layer-local routing. While parameter sharing exists in prior work, applying it specifically to MoE blocks with Progressive Scaling Routing offers a fresh perspective on balancing expert capacity and routing diversity. The minimal overhead (only router parameters) and consistent improvements across multiple model scales (0.5B to 7B) demon
- The results in Figure 2(a) reveal substantial prefill speed degradation (up to 77% slowdown for short sequences with R8 configuration), which significantly limits the practical applicability of REXMOE in latency-sensitive applications. While the authors acknowledge this issue stems from increased I/O operations due to the larger expert pool, they do not explore potential mitigation strategies or provide detailed profiling to identify the exact bottlenecks. The paper would benefit from: (1) a b
This paper presents REXMOE, a method that removes the layer-local routing restriction in MoE architectures, and proposes a Progressive Scaling Routing (PSR) strategy that gradually enlarges the candidate expert pool during training, thereby reducing language modeling loss and improving downstream task accuracy.
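As a rough illustration of "gradually enlarging the candidate expert pool during training", the sketch below grows the pool linearly from the layer-local experts to the full shared pool and masks router logits for experts not yet available. The linear schedule, the ordering of experts (local experts first), and the function names are hypothetical; the summary does not specify the schedule the authors use.

```python
def pool_size_at(step, total_steps, local_experts, reuse):
    """Candidate pool size at a training step.

    Hypothetical linear schedule: start with the layer's own
    `local_experts` and reach `local_experts * reuse` at `total_steps`.
    """
    full = local_experts * reuse
    if total_steps <= 0:
        return full
    frac = min(max(step / total_steps, 0.0), 1.0)
    return local_experts + int(frac * (full - local_experts))

def mask_scores(scores, pool_size):
    """Keep router logits for the first `pool_size` experts and set the
    rest to -inf so they can never be selected by top-k routing.

    Assumption: the layer's own experts come first in `scores`.
    """
    return [s if i < pool_size else float("-inf")
            for i, s in enumerate(scores)]

if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, pool_size_at(step, 100, local_experts=8, reuse=4))
```

Under these assumptions the pool holds 8 experts at step 0, 20 at the midpoint, and all 32 at the end of the schedule.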
1. There is a lack of theoretical analysis on the effectiveness of REXMOE, particularly regarding the expert combination numbers and PSR mentioned by the authors.
2. Ablation experiments indicate that it is PSR rather than cross-layer expert reuse that yields substantial improvements. Thus, one may question the necessity of cross-layer reuse, given that such reuse would affect pipeline parallelism (pp) and expert parallelism (ep) strategies when the model scales. This is particularly critical for
- Compared to existing models, the proposed approach improves model performance by increasing the expert pool for each layer while maintaining the same number of parameters.
- To address the limitation that simply expanding the expert pool leads to only marginal performance improvements, the authors propose a novel training methodology. Through an ablation study, they demonstrate not only the effectiveness of the new architecture but also its practical applicability.
- They observed that task-sp
- As the size of the expert pool increases, there is a significant slowdown in the prefill stage. While it is acknowledged that decoding speed plays a more critical role in inference, the slowdown during the prefill phase becomes a weakness of this methodology, especially considering that token sequences can be quite long in recent large language models (LLMs).
- The authors conducted experiments by varying the reuse frequency $r$, and according to their claims, performance improves as $r$ incre
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Topic Modeling · Domain Adaptation and Few-Shot Learning
