Reinforcement Learning with Rubric Anchors

Zenan Huang; Yihong Zhuang; Guoshan Lu; Zeyu Qin; Haokai Xu; Tianyu Zhao; Ru Peng; Jiaqi Hu; Zhanming Shen; Xiaomeng Hu; Xijun Gu; Peiyi Tu; Jiaxin Liu; Wenyu Chen; Yuzhuo Fu; Zhiting Fan; Yanmei Gu; Yuanyuan Wang; Zhengkai Yang; Jianguo Li; Junbo Zhao

arXiv:2508.12790·cs.AI·August 19, 2025

TL;DR

This paper extends reinforcement learning with verifiable rewards to open-ended tasks by using a large, structured rubric-based reward system, enabling improved language model performance and more human-like responses.

Contribution

The paper introduces a rubric-based reward framework that extends RLVR to open-ended tasks, builds what the authors describe as the largest rubric reward system to date (over 10,000 rubrics), and reports improvements on open-ended benchmarks.

Findings

01  Achieved a +5.2% improvement on open-ended benchmarks using 5K+ training samples

02  Outperformed a 671B-parameter model by +2.4% on certain tasks

03  Enabled fine-grained stylistic control for more human-like responses

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues…
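To make the contrast in the abstract concrete, the following is a minimal sketch of the two reward styles: a verifiable reward that checks an answer against a known reference, and a rubric-based reward that aggregates several criterion scores. It is not the paper's implementation. The `Rubric` class, its fields, the weighted-mean aggregation, and the simple rule-based graders are all assumptions made for illustration; in a real system each grader would more plausibly be an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rubric:
    """One scoring criterion. Field names are hypothetical, not the paper's schema."""
    name: str
    weight: float
    # Hypothetical grader: maps a response to a score in [0, 1].
    # A plain function stands in here for what would likely be an LLM judge.
    grade: Callable[[str], float]


def verifiable_reward(answer: str, reference: str) -> float:
    """RLVR-style reward: 1.0 if the answer matches a known reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def rubric_reward(response: str, rubrics: List[Rubric]) -> float:
    """Weighted mean of per-rubric scores (an assumed aggregation rule)."""
    total_weight = sum(r.weight for r in rubrics)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    # Clip each grader's output to [0, 1] before weighting.
    score = sum(r.weight * min(1.0, max(0.0, r.grade(response))) for r in rubrics)
    return score / total_weight


# Toy rubrics for an open-ended reply; the criteria are invented for this example.
rubrics = [
    Rubric("non_empty", 1.0, lambda text: 1.0 if text.strip() else 0.0),
    Rubric("concise", 1.0, lambda text: 1.0 if len(text.split()) <= 50 else 0.0),
    Rubric("avoids_cliche", 2.0, lambda text: 0.0 if "as an ai" in text.lower() else 1.0),
]

print(verifiable_reward(" 42 ", "42"))
print(rubric_reward("Thanks for asking. Here is a short answer.", rubrics))
print(rubric_reward("As an AI, I cannot say.", rubrics))
```

The verifiable reward is binary and needs a reference answer, while the rubric reward yields a graded score for outputs that have no single correct answer, which is the property that lets RL training reach open-ended tasks.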

Code & Models

Models

inclusionAI/Rubicon-Preview (Hugging Face model · 88 downloads · 25 likes)

Taxonomy

Topics: Robotic Mechanisms and Dynamics · Gear and Bearing Dynamics Analysis