Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal; Anthony Wang; Elaine Lau; Vaskar Nath; Yunzhong He; Bing Liu; Sean Hendryx

arXiv:2507.17746·cs.LG·October 6, 2025

TL;DR

This paper introduces RaR, a reinforcement learning method that uses rubric-based feedback as rewards to improve large language models' reasoning in complex, real-world tasks beyond verifiable domains.

Contribution

RaR extends RLVR to non-verifiable domains by leveraging rubric-based feedback as reward signals, demonstrating improved performance and alignment in medical and science reasoning tasks.

Findings

01

RaR achieves a relative improvement of up to 31% on HealthBench.

02

RaR outperforms baseline models on GPQA-Diamond.

03

Rubrics as rewards improve model alignment and reduce variance.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce Rubrics as Rewards (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to 31% on HealthBench and 7% on…
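One way to picture "aggregating rubric feedback into rewards" is a normalized weighted sum over per-criterion judge verdicts. The sketch below is illustrative only and is not taken from the paper's code: the `RubricItem` type, the `explicit_reward` function, and every weight other than "Essential": 1.0 (the one value quoted in the reviews) are assumptions, and the judge verdicts are supplied by hand rather than produced by an LLM judge.

```python
from dataclasses import dataclass

# Hypothetical category-to-weight mapping. Only "Essential": 1.0 is quoted
# in the reviews; the other two values are placeholders for illustration.
CATEGORY_WEIGHTS = {"Essential": 1.0, "Important": 0.7, "Optional": 0.3}


@dataclass
class RubricItem:
    criterion: str   # natural-language description of what a good answer does
    category: str    # must be a key of the weight mapping
    satisfied: bool  # verdict for this criterion (would come from a judge model)


def explicit_reward(items, weights=CATEGORY_WEIGHTS):
    """Return the weighted fraction of satisfied criteria, in [0, 1].

    Raises KeyError if an item uses a category missing from `weights`.
    """
    total = sum(weights[item.category] for item in items)
    if total == 0:
        return 0.0  # empty rubric: no signal
    earned = sum(weights[item.category] for item in items if item.satisfied)
    return earned / total


# Usage: one response scored against a three-item rubric.
rubric = [
    RubricItem("States the correct diagnosis", "Essential", True),
    RubricItem("Mentions relevant contraindications", "Important", False),
    RubricItem("Uses clear, patient-friendly language", "Optional", True),
]
print(round(explicit_reward(rubric), 2))  # prints 0.65
```

Normalizing by the total weight keeps rewards comparable across prompts whose rubrics differ in length, which matters when each training instance carries its own rubric.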

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper tackles a timely and important limitation of RLVR by moving beyond domains with verifiable, binary reward signals, introducing a pragmatic and interpretable alternative via structured rubrics for open-ended reasoning.
2. Empirical evaluation is thorough, utilizing large and diverse medical and science benchmarks (HealthBench, GPQA-Diamond), and benchmarking several methods including Direct-Likert, Reference-Likert, and filtered SFT, ensuring a fair assessment.
3. The paper offers the

Weaknesses

1. Limited Generalization and Scope: Experimental evaluation is confined to domains with relatively structured reasoning (medicine, science) and primarily question-answering tasks. The approach is not shown to generalize to less-structured or dialog/agentic settings noted in Section 9.
2. Implementation Ambiguities: Category-to-weight mapping for explicit aggregation ("Essential": 1.0, etc.) is somewhat arbitrary, and no optimization or sensitivity analysis is provided. Table 2 suggests some rob

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The authors propose an innovative approach that uses rubric-based rewards to train models on tasks lacking verifiable reward signals in RL.
2. The experiments and analyses demonstrate that rubric-based rewards align well with expert human preferences and are effective when the LLM-as-a-Judge model is sufficiently capable.
3. The authors conduct experiments in the medical and scientific domains to demonstrate the effectiveness of the proposed rubric-based reinforcement learning paradigm.

Weaknesses

1. The authors employ large-scale, instance-level rubric labeling for all training samples and use an LLM-as-a-Judge model (GPT-4o-mini) to generate rubric-based rewards in training. While this approach incurs substantial computational costs (each rubric evaluation consumes additional tokens), the resulting performance gains, particularly on GPQA, are marginal (only 1.1 points compared to the Reference-Likert baseline).
2. This paper uses only a single model, Qwen2.5-7B, as the policy model

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The paper successfully bridges a critical gap between RLVR and preference-based RLHF by introducing a structured, multi-criteria intermediate: the rubric. The method demonstrates consistent and significant improvements over strong baselines across two challenging domains, medicine and science. The relative gains of up to 31% on HealthBench are substantial.

Weaknesses

The entire RaR pipeline's success is contingent on the quality of the synthetically generated rubrics. While ablations show the importance of using reference answers, the inherent limitations of LLM-based generation are not fully addressed. The performance ceiling is thus tied to the rubric generator's capabilities. The method requires generating a unique, detailed rubric for each of the 20k+ training instances and then using an LLM judge to evaluate 16 rollouts per prompt. This is computational

Taxonomy

Topics: Innovative Teaching and Learning Methods