Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Manish Bhattarai; Ismael Boureima; Nishath Rajiv Ranasinghe; Scott Pakin; Dan O'Malley

arXiv:2605.08061·cs.AI·May 11, 2026

TL;DR

This paper introduces rubric-grounded reinforcement learning, where a language model judge scores responses along multiple criteria, leading to improved reasoning and generalization in scientific and technical tasks.

Contribution

The authors formalize a new RL framework using multi-criterion, rubric-based rewards from an LLM judge, and demonstrate its effectiveness with scientific document rubrics and reasoning benchmarks.

Findings

01

The GRPO-trained Llama-3.1-8B-Instruct model achieves 71.7% normalized reward on held-out rubric evaluation.

02

Improves performance on reasoning benchmarks GSM8K, MATH, GPQA Main, and GPQA Diamond.

03

Structured, document-grounded rewards enhance transferability of reasoning behaviors.

Abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned…
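The abstract does not spell out the reward formula, so the following is only a minimal sketch of how a weighted, partial-credit rubric reward and GRPO-style group-relative advantages might be computed. Everything named here is an assumption made for illustration: the `Criterion` type, the criterion names and weights, the convention that the judge returns a score in [0, 1] per criterion, the use of the population standard deviation, and the `eps` stabilizer. It is not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical criterion label
    weight: float  # hypothetical relative importance (positive)


def normalized_reward(rubric, scores):
    """Weighted partial-credit reward in [0, 1].

    `scores` maps criterion name -> judge score, assumed to lie in [0, 1].
    Criteria the judge did not score count as 0.
    """
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = 0.0
    for c in rubric:
        s = min(1.0, max(0.0, scores.get(c.name, 0.0)))  # clamp to [0, 1]
        earned += c.weight * s
    return earned / total


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Each reward is centered on the group mean and divided by the group
    standard deviation (population form assumed here), plus `eps`.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric and judge outputs for three sampled responses.
rubric = [
    Criterion("states_method", 2.0),
    Criterion("reports_units", 1.0),
    Criterion("cites_result", 1.0),
]
judge_scores = [
    {"states_method": 1.0, "reports_units": 1.0, "cites_result": 1.0},
    {"states_method": 1.0, "reports_units": 0.0, "cites_result": 0.5},
    {},  # nothing satisfied
]

rewards = [normalized_reward(rubric, s) for s in judge_scores]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Under these assumptions, a response that satisfies only some criteria still earns a nonzero reward, which is the partial-credit signal the abstract contrasts with a binary outcome. Responses in the same group are then ranked relative to one another rather than against an absolute threshold.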
