TL;DR
This paper introduces a router-aware importance sampling method to stabilize and enhance reinforcement learning training for Mixture-of-Experts models, leading to better convergence and performance.
Contribution
It presents a novel router-guided rescaling strategy for importance sampling, addressing instability issues in MoE RL training.
Findings
Improved training stability and convergence in MoE RL models
Enhanced final performance of MoE models with the proposed method
Demonstrated effectiveness across multiple experiments
Abstract
Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a…
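The abstract does not give the exact formula, so the following is only a minimal sketch of what a router-guided rescaling of IS weights could look like, not the authors' implementation. It assumes the ingredients the reviews mention: an absolute log-difference between old and new router probabilities, multiplicative aggregation across layers, a floor `gamma_min`, and a clipped importance ratio. The function names, the top-K selection rule, and the default values are all hypothetical.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def router_shift_weight(old_logits, new_logits, top_k=2, gamma_min=0.8):
    """Hypothetical per-token router-shift weight in [gamma_min, 1].

    old_logits, new_logits: router logits of shape (layers, tokens, experts)
    under the behaviour (old) policy and the current (new) policy.
    """
    old_lp = log_softmax(old_logits)
    new_lp = log_softmax(new_logits)
    # Assumption: compare only the experts the old router selected (top-K).
    top_idx = np.argsort(old_lp, axis=-1)[..., -top_k:]
    old_sel = np.take_along_axis(old_lp, top_idx, axis=-1)
    new_sel = np.take_along_axis(new_lp, top_idx, axis=-1)
    # Mean absolute log-difference over selected experts: (layers, tokens).
    shift = np.abs(new_sel - old_sel).mean(axis=-1)
    # Multiplicative aggregation across layers:
    # prod_l exp(-shift_l) == exp(-sum_l shift_l).
    weight = np.exp(-shift.sum(axis=0))
    # Assumption: floor the weight so no token is suppressed entirely.
    return np.maximum(weight, gamma_min)


def rescaled_is_ratio(new_logp, old_logp, weight, clip=0.2):
    """Clipped token-level importance ratio, scaled by the router weight."""
    ratio = np.exp(new_logp - old_logp)
    ratio = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return weight * ratio
```

Under these assumptions, tokens whose routing distribution has drifted far from the behaviour policy contribute less to the update, which is one plausible reading of how the rescaling could reduce gradient variance. In a real training loop the weight would presumably be treated as a constant with respect to the gradient, but the source does not say.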
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The proposed method addresses a contemporary stability problem at the heart of the LLM post-training pipeline. The paper presents convincing empirical evidence that the proposed methods work on modern open-weight MoE models such as Qwen3-30B-A3B and on a range of contemporary RL benchmark tasks. Last but not least, while maybe a bit ad hoc, the proposed solution is relatively simple and boils down to a soft-regularization term, in contrast to some recent alternative methods that instead propose
While the paper presents end-to-end results for the proposed methods, there are only a few detailed analyses and ablation studies, even though the design includes several choices such as using the absolute log-difference or aggregating them multiplicatively. While these choices seem intuitively reasonable, they are often neither theoretically nor experimentally confirmed.
(1) The paper is well-motivated. It deals with a challenging problem and proposes a well-explored solution. (2) The paper performs a comprehensive empirical validation: the evaluation is rigorous, using both small-scale ablation studies and large-scale models across five diverse mathematical reasoning benchmarks. (3) The algorithm design is well-elaborated.
(1) Limited task and model scope: the paper's results are limited to mathematical reasoning tasks and the Qwen family of models. (2) Ablation study depth: while the main components are justified, a more detailed ablation study within RSPO itself (for instance, isolating the individual contribution of the router shift ratio from the geometric mean aggregation) would provide deeper insight into which aspect is most critical for the observed gains.
- The paper pinpoints a practical instability in MoE RL training.
- The proposed fix is simple but sensible. The router-shift weighting makes intuitive sense and is easy to integrate.
- The topic is timely and practically significant for large-scale RLHF/RLVR pipelines.
- Experiments demonstrate performance boosts and stability improvements on competitive models. The ablations on router freezing and replay variants add useful context.
- The novelty over GSPO/GMPO is limited. The router-shift term is the main difference, and there’s no theoretical analysis to back its variance-reduction claim.
- Important details (values for K, gamma_min, and clipping thresholds) are missing, making it hard to judge reproducibility or sensitivity.
- Lack of quantitative understanding of the stability improvement. The paper does not measure or visualize the actual variance reduction, distribution of importance ratios, or clipping rates. Includi
The method specifically addresses unique MoE failure modes. RSPO demonstrates significantly more stable training than GRPO (which suffers reward collapse) and outperforms GSPO and GMPO on the evaluation tasks, but only marginally. The authors provide some analysis of why RSPO prevents collapse.
I think the results are not especially strong: the comparison with GSPO and GMPO does not show any significant benefit, at least on the tasks presented. Another weakness is that the evaluation tasks all come from a very similar domain, raising questions about how well this method would work on a broader set of domains.
