Step-wise Rubric Rewards for LLM Reasoning

Weichu Xie; Haozhe Zhao; Wenpu Liu; Yongfu Zhu; Liang Chen; Minghao Ye; Zirong Chen; Yuqi Xu; Shuai Dong; Ziyue Wang; Xinbo Xu; Kean Shi; Ruoyu Wu; Xiaoying Zhang; Wenqi Shao; Baobao Chang; Nan Duan; Jiaqi Wang

arXiv:2605.17291 · cs.LG · May 19, 2026

TL;DR

This paper introduces Step-wise Rubrics as Rewards (SRaR), a reinforcement learning framework that gives large language models fine-grained, step-level supervision in order to improve reasoning accuracy and to curb reward hacking through unbounded self-correction.

Contribution

SRaR uses an LLM judge to attribute rubric scores to individual reasoning steps, normalizes the per-step scores, and combines the rewards with a decoupled advantage estimator. The measured gains in accuracy and reasoning faithfulness are listed under Findings.
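The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the listing does not define the normalization or the decoupled estimator, so the sketch assumes one plausible reading in which outcome rewards and per-step judge scores are z-normalized separately over a group of rollouts and then summed. The names `zscore`, `stepwise_advantages`, and `step_weight` are hypothetical, and the per-step scores are assumed to have been produced by a judge already.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize a non-empty list of floats (population std, eps for stability)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def stepwise_advantages(outcome_rewards, step_scores, step_weight=0.5):
    """Return one advantage per step for every rollout in a group.

    outcome_rewards: one final-answer reward per rollout.
    step_scores: for each rollout, a list of per-step judge scores.
    ASSUMPTION: the two signals are normalized independently ("decoupled")
    and added; the paper's actual estimator may differ.
    """
    # Outcome channel: normalize across rollouts in the group.
    outcome_adv = zscore(outcome_rewards)

    # Step channel: normalize across all steps of all rollouts in the group.
    flat = [s for rollout in step_scores for s in rollout]
    flat_adv = zscore(flat)

    result, start = [], 0
    for adv, rollout in zip(outcome_adv, step_scores):
        step_adv = flat_adv[start:start + len(rollout)]
        start += len(rollout)
        # Each step gets the shared outcome advantage plus its own step term.
        result.append([adv + step_weight * s for s in step_adv])
    return result


if __name__ == "__main__":
    # Two rollouts: the first answers correctly but contains one bad step.
    advantages = stepwise_advantages(
        outcome_rewards=[1.0, 0.0],
        step_scores=[[1.0, 0.0], [1.0, 1.0]],
    )
    for i, adv in enumerate(advantages):
        print(f"rollout {i}: {[round(a, 3) for a in adv]}")
```

Under this assumed scheme, the bad step in the correct rollout receives a smaller advantage than the good step next to it, which a single response-level scalar cannot express.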

Findings

01. SRaR improves accuracy by 3.57 and 2.75 points on two benchmarks.

02. Raises Faithful Reasoning Rate from 34.5% to 46.7%.

03. Reduces self-correction looping from 48.1% to 26.5%.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute…
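The mis-credited step rates quoted in the abstract (wrong steps rewarded in correct-answer responses, correct steps penalized in incorrect-answer responses) can be computed with a simple tally. The sketch below only illustrates that arithmetic: the function name `miscredited_step_rates` and the input format are hypothetical, and it assumes each step already carries a correct/incorrect label, since the excerpt does not say how the paper obtains those labels.

```python
def miscredited_step_rates(responses):
    """Measure how a response-level reward mis-credits individual steps.

    responses: iterable of (answer_correct, step_labels) pairs, where
    step_labels is a list of booleans (True means the step is correct).
    Returns (wrong_but_rewarded_rate, correct_but_penalized_rate).
    ASSUMPTION: every step inherits the sign of the final-answer reward.
    """
    wrong_rewarded = steps_in_correct = 0
    right_penalized = steps_in_incorrect = 0

    for answer_correct, step_labels in responses:
        if answer_correct:
            # All steps are rewarded, including the wrong ones.
            steps_in_correct += len(step_labels)
            wrong_rewarded += sum(1 for ok in step_labels if not ok)
        else:
            # All steps are penalized, including the correct ones.
            steps_in_incorrect += len(step_labels)
            right_penalized += sum(1 for ok in step_labels if ok)

    def rate(count, total):
        return count / total if total else 0.0

    return (
        rate(wrong_rewarded, steps_in_correct),
        rate(right_penalized, steps_in_incorrect),
    )


if __name__ == "__main__":
    toy = [
        (True, [True, True, False, True]),  # correct answer, one wrong step
        (False, [True, False]),             # wrong answer, one correct step
    ]
    print(miscredited_step_rates(toy))  # (0.25, 0.5)
```

On the toy data, one of four steps in the correct-answer response is wrong yet rewarded, and one of two steps in the incorrect-answer response is correct yet penalized; these toy numbers are illustrative and unrelated to the paper's measurements.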
