Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng

TL;DR
This paper introduces RGR-GRPO, a rubric-driven reinforcement learning framework that improves multi-domain reasoning in large language models by using rubrics to supply dense rewards and offline guidance, outperforming baseline RL methods across 14 benchmarks.
Contribution
The paper presents RGR-GRPO, a novel RL framework leveraging rubrics for multi-domain reasoning, enabling better exploration and outperforming existing methods.
Findings
RGR-GRPO outperforms baseline RL methods across 14 benchmarks.
Achieves +7.0% to +8.4% improvements in key reasoning tasks.
Maintains stable entropy and enhances pass@k performance.
Abstract
Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose RGR-GRPO (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL…
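As a rough illustration of what "dense rewards during GRPO training" can mean, the sketch below scores each sampled response by the weighted share of rubric criteria it satisfies, then converts the rewards of one sampled group into group-relative advantages (reward minus group mean, divided by group standard deviation), which is the normalization GRPO is generally described as using. The abstract does not give the paper's actual reward formula, so the weighted-fraction scoring, the function names `rubric_reward` and `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def rubric_reward(judgments, weights):
    """Weighted fraction of rubric criteria satisfied, in [0, 1].

    Hypothetical scoring rule: judgments[i] is True when the response
    meets criterion i, and weights[i] is that criterion's importance.
    """
    if len(judgments) != len(weights):
        raise ValueError("judgments and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for ok, w in zip(judgments, weights) if ok) / total


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group of responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices, not the paper's.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three responses to one prompt, judged against a 3-item rubric.
weights = [2.0, 1.0, 1.0]
group_judgments = [
    [True, True, True],     # satisfies every criterion
    [True, False, True],    # misses the second criterion
    [False, False, False],  # satisfies none
]
rewards = [rubric_reward(j, weights) for j in group_judgments]
advantages = group_relative_advantages(rewards)
```

Compared with a single pass/fail verifier, a graded score like this lets partially correct responses in the same group receive different advantages, which is one way a rubric can make the training signal denser.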
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling
