Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley

TL;DR
This paper introduces rubric-grounded reinforcement learning, where a language model judge scores responses along multiple criteria, leading to improved reasoning and generalization in scientific and technical tasks.
Contribution
The authors formalize a new RL framework using multi-criterion, rubric-based rewards from an LLM judge, and demonstrate its effectiveness with scientific document rubrics and reasoning benchmarks.
Findings
The GRPO-trained model achieves 71.7% normalized reward on held-out rubric evaluation.
Improves performance on reasoning benchmarks GSM8K, MATH, GPQA Main, and GPQA Diamond.
Structured, document-grounded rewards enhance transferability of reasoning behaviors.
Abstract
We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned…
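The partial-credit signal described in the abstract can be illustrated with a minimal sketch. It assumes each rubric criterion carries a positive weight and the judge returns a score in [0, 1] per criterion; the criterion names, the weighted-average aggregation, and both function names are hypothetical and are not taken from the paper. The second function shows the group-relative normalization commonly used in GRPO, where each sampled response is scored relative to the other responses for the same prompt.

```python
from statistics import mean, pstdev


def normalized_rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores (each assumed in [0, 1]).

    Hypothetical aggregation: the paper's exact rubric schema is not shown here.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rubric with made-up criteria and weights.
weights = {"correct_method": 2.0, "cites_units": 1.0, "states_assumptions": 1.0}
scores = {"correct_method": 1.0, "cites_units": 0.5, "states_assumptions": 0.0}

reward = normalized_rubric_reward(scores, weights)
advantages = group_relative_advantages([reward, 0.25, 1.0])
```

A response that satisfies only some criteria receives a reward strictly between 0 and 1, which is the contrast with a binary outcome reward; the group-relative step then turns those rewards into a signal that is positive for above-average responses and negative for below-average ones.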
