MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

Weitao Jia; Jinghui Lu; Haiyang Yu; Siqi Wang; Guozhi Tang; An-Lan Wang; Weijie Yin; Dingkang Yang; Yuxiang Nie; Bin Shan; Hao Feng; Irene Li; Kun Yang; Han Wang; Jingqun Tang; Teng Fu; Changhong Jin; Chao Feng; Xiaohui Lv; Can Huang

arXiv:2508.09670 · cs.AI · December 19, 2025

TL;DR

This paper introduces MEML-GRPO, a multi-expert mutual learning framework for reinforcement learning with verifiable rewards (RLVR). It improves the reasoning performance of large language models by mitigating reward sparsity and sharing knowledge among experts.

Contribution

The paper presents a new multi-expert mutual learning approach that uses diverse prompts and inter-expert knowledge transfer to improve RLVR in large language models.

Findings

01 Achieves an average performance gain of 4.89% with Qwen.

02 Achieves an average performance gain of 11.33% with Llama.

03 Effectively overcomes reward sparsity issues in RLVR.

Abstract

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model's performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO…
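The reward-sparsity problem described in the abstract can be illustrated with a small sketch. It assumes the group-standardized advantage commonly described for GRPO (reward minus the group mean, divided by the group standard deviation); the function name `group_relative_advantages`, the `eps` constant, and the example reward lists are hypothetical and are not taken from the paper. The second group, in which one response sampled under a different expert prompt happens to be correct, is only an illustration of why broader sampling can restore a learning signal, not a description of how MEML-GRPO actually groups responses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group.

    Assumption: GRPO-style advantage (r - group mean) / (group std + eps).
    `eps` is a hypothetical stabilizer to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled answer is wrong, so every verifiable reward is 0.
all_wrong = [0.0, 0.0, 0.0, 0.0]

# Hypothetical group in which one response (e.g. from a different
# expert prompt) is verified as correct.
one_correct = [0.0, 0.0, 0.0, 1.0]

adv_all_wrong = group_relative_advantages(all_wrong)
adv_one_correct = group_relative_advantages(one_correct)

print(adv_all_wrong)    # all advantages are zero: no learning signal
print(adv_one_correct)  # the correct response gets a positive advantage
```

When all rewards in a group are identical, every advantage is zero and the policy gradient for that group vanishes. A single correct response is enough to produce nonzero advantages, which is the intuition behind sampling a broader range of responses.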
