Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
Leheng Sheng, Wenchang Ma, Ruixin Hong, Xiang Wang, An Zhang, Tat-Seng Chua

TL;DR
This paper introduces RLCER, a novel reinforcement learning approach that uses self-evolving rubrics to reward chain-of-thought reasoning in language models without human labels, improving performance and robustness.
Contribution
It proposes a self-evolving rubric-based reward mechanism for chain-of-thought reasoning, reducing reliance on human annotations and enhancing model training and inference.
Findings
RLCER outperforms outcome-centric RLVR in reasoning tasks.
Self-evolving rubrics provide reliable supervision signals without outcome rewards.
Rubrics as in-prompt hints improve inference-time reasoning performance.
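To make the last two findings concrete, here is a minimal sketch of how rubric-based CoT scoring and in-prompt rubric hints might look. It is not the paper's implementation: the `Rubric` class, the `judge` callable, the weighted-average scoring, the additive combination with the outcome reward, and the `alpha` coefficient are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rubric:
    """Hypothetical rubric: a natural-language criterion with a positive weight."""
    criterion: str
    weight: float = 1.0


def rubric_reward(cot: str, rubrics: List[Rubric],
                  judge: Callable[[str, str], bool]) -> float:
    """Weighted fraction of rubrics that `judge` says the CoT satisfies (assumed scheme)."""
    total = sum(r.weight for r in rubrics)
    if total <= 0:
        return 0.0
    satisfied = sum(r.weight for r in rubrics if judge(cot, r.criterion))
    return satisfied / total


def combined_reward(outcome_correct: bool, cot_score: float,
                    alpha: float = 0.5) -> float:
    """Outcome reward plus a scaled CoT reward; `alpha` is an assumed coefficient."""
    return float(outcome_correct) + alpha * cot_score


def with_rubric_hints(question: str, rubrics: List[Rubric]) -> str:
    """Prepend rubric criteria to a prompt as inference-time hints (assumed format)."""
    hints = "\n".join(f"- {r.criterion}" for r in rubrics)
    return f"When reasoning, try to satisfy these criteria:\n{hints}\n\nQuestion: {question}"


# Toy usage: a keyword-matching judge stands in for an LLM-based judge.
rubrics = [Rubric("verify", 2.0), Rubric("units", 1.0)]
toy_judge = lambda cot, criterion: criterion in cot
score = rubric_reward("I verify the result by substitution.", rubrics, toy_judge)
reward = combined_reward(True, score)
prompt = with_rubric_hints("What is 3 * 4?", rubrics)
```

In a real setup the judge would presumably be a language model and the rubrics would be regenerated as training proceeds; the fixed list here only shows the data flow.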
Abstract
Although chain-of-thought (CoT) plays a crucial role in LLM reasoning, rewarding it directly is difficult: training a reward model (RM) demands heavy human labeling effort, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation and can evolve gradually. Inspired by recent self-evolving training methods, we propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt…
Taxonomy
Topics: Embodied and Extended Cognition · Machine Learning in Healthcare · Reinforcement Learning in Robotics
