Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han

TL;DR
This paper introduces Co-rewarding, a self-supervised reinforcement learning framework that enhances the reasoning abilities of large language models by improving training stability and performance without relying on human annotations.
Contribution
It proposes a novel self-supervised RL method with dual instantiations that mitigate training collapse and improve reasoning performance in LLMs, surpassing existing self-rewarding approaches.
Findings
Outperforms other self-rewarding baselines by +3.31% on average.
Achieves +7.49% improvement on Llama-3.2-3B-Instruct.
Reaches or surpasses ground-truth reward performance in several benchmarks.
Abstract
While reinforcement learning with verifiable rewards (RLVR) is effective at improving the reasoning ability of large language models (LLMs), its reliance on human-annotated labels creates a scaling dilemma, especially for complex tasks. Recent self-rewarding methods explore a label-free alternative for unlocking the reasoning capabilities of LLMs, yet they frequently suffer training collapse, because a single-view supervision signal easily forms a self-consistent illusion that leads to reward hacking. Inspired by the success of self-supervised learning, we propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) Co-rewarding-I is a data-side instantiation that derives reward signals from…
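The abstract is cut off before it says where Co-rewarding-I takes its reward signal from, so the following is only an illustrative sketch of the single-view versus cross-view distinction, not the authors' implementation. It assumes, based on the reviewers' remarks about paraphrasing, that the second view is a rephrased version of the same question, and it assumes a majority vote over sampled final answers as the pseudo-label. All function names are hypothetical.

```python
from collections import Counter


def majority_answer(answers):
    """Most frequent final answer among sampled rollouts (ties go to the first seen)."""
    return Counter(answers).most_common(1)[0][0]


def single_view_rewards(answers):
    """Single-view self-reward: rollouts are scored against their own majority vote."""
    label = majority_answer(answers)
    return [1.0 if a == label else 0.0 for a in answers]


def cross_view_rewards(answers_orig, answers_rephrased):
    """Cross-view sketch (assumption): each view is scored against the OTHER view's vote."""
    label_from_orig = majority_answer(answers_orig)
    label_from_rephrased = majority_answer(answers_rephrased)
    rewards_orig = [1.0 if a == label_from_rephrased else 0.0 for a in answers_orig]
    rewards_rephrased = [1.0 if a == label_from_orig else 0.0 for a in answers_rephrased]
    return rewards_orig, rewards_rephrased


# Toy rollouts: final answers sampled for a question and for its rephrasing.
answers_orig = ["12", "12", "12", "7"]
answers_rephrased = ["7", "7", "12", "7"]

print(single_view_rewards(answers_orig))
print(cross_view_rewards(answers_orig, answers_rephrased))
```

In this toy case the single-view reward endorses whatever the model already agrees with itself on, while the cross-view reward withholds credit when the two views disagree. That is one way to read the "self-consistent illusion" the abstract describes.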
Peer Reviews
Decision: ICLR 2026 Poster
+ Introduces a clear and well-motivated framework that targets a real weakness in current self-rewarding RL methods: instability and collapse caused by single-view reward signals.
+ Extensive empirical results across multiple models, datasets, and baselines, including ablations that isolate the contribution of each component.
+ The paper is overall well-written, with clear mathematical formulation and structured presentation.
+ The reliance on high-quality paraphrasing for Co-rewarding-I is insufficiently examined. The framework may degrade when rephrasing quality is low or domain-specific, yet no robustness analysis is provided. Although the paper claims that rephrased questions should yield similar reasoning outcomes, in practice the reasoning trace can vary a lot depending on how the question is phrased. Thus, how do the authors ensure the meta-transferability of both reasoning and final answers during rephrasing?
1. The paper's claims are supported by comprehensive experiments. The authors validate their method's effectiveness across a diverse range of models (Qwen series, Llama-3.2-3B-Instruct) and multiple training datasets (MATH, DAPO-14k, OpenRS). The evaluation is similarly thorough, spanning not only in-domain mathematical reasoning but also out-of-domain tasks like code generation and general abilities (MMLU-Pro, IFEval). This extensive validation strongly supports the paper's conclusions. 2. A key
1. Although I like the idea of Co-rewarding-I, it does not seem that effective. It relies on an additional, stronger model to revise the training data but offers only marginal improvement across both training datasets, and sometimes it may harm performance. The authors try to show the effectiveness of Co-rewarding-I with an ablation study, but the ablation is not fair in my view: the fair comparison should be a model trained on the union of the original and rephrased data instead of training on them separately sinc
The paper sets out to define a stable and performant approach to RL finetuning for reasoning without use of ground truth labels and the results demonstrate the effectiveness of the proposed method. Three models are used across in-domain and out-of-domain benchmarks and two separate training datasets are considered. Combined with suitable ablations and analysis of training dynamics (length, reward) this makes for a comprehensive results section.
The training datasets used are ones for which ground truth labels are available. It seems important to validate the method in a setting that is better motivated by self-supervised methods (i.e., those without the availability of verifiable rewards during training). The written communication of the paper could be improved. There are grammatical errors throughout. The related work currently features in the appendix and should be in the main paper text. The paper primarily reports pass@1 result
Code & Models
- 🤗 TMLR-Group-HF/Co-rewarding-I-Qwen3-1.7B-Base-MATH · model · 7 dl
- 🤗 TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-MATH · model · 6 dl
- 🤗 TMLR-Group-HF/Co-rewarding-I-Qwen2.5-3B-MATH · model · 5 dl · ♡ 1
- 🤗 TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-MATH · model · 40 dl · ♡ 1
- 🤗 TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH · model · 47 dl · ♡ 1
- 🤗 TMLR-Group-HF/Co-rewarding-I-Qwen2.5-7B-MATH · model · 5 dl · ♡ 1
- 🤗 TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-MATH · model · 6 dl · ♡ 1
- 🤗 TMLR-Group-HF/Entropy-Qwen3-8B-Base-MATH · model · 5 dl · ♡ 1
- 🤗 TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-MATH · model · 7 dl · ♡ 1
- 🤗 TMLR-Group-HF/GT-Qwen3-8B-Base-MATH · model · 11 dl · ♡ 2
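For readers who want to try one of the listed checkpoints, a minimal loading sketch follows. It assumes the repositories are standard causal-LM checkpoints loadable with the Hugging Face `transformers` library, which this page does not confirm. The repository IDs are copied from the listing, and `load_checkpoint` is a hypothetical helper name.

```python
# Repository IDs copied verbatim from the listing above.
CHECKPOINTS = [
    "TMLR-Group-HF/Co-rewarding-I-Qwen3-1.7B-Base-MATH",
    "TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-MATH",
    "TMLR-Group-HF/Co-rewarding-I-Qwen2.5-3B-MATH",
    "TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-MATH",
    "TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH",
    "TMLR-Group-HF/Co-rewarding-I-Qwen2.5-7B-MATH",
    "TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-MATH",
    "TMLR-Group-HF/Entropy-Qwen3-8B-Base-MATH",
    "TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-MATH",
    "TMLR-Group-HF/GT-Qwen3-8B-Base-MATH",
]


def load_checkpoint(repo_id):
    """Load tokenizer and model for one repository.

    Assumption: the repository holds a causal-LM checkpoint in the standard
    transformers format. The import is deferred so that listing the IDs does
    not require transformers to be installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


# Example (downloads weights, so it is left commented out):
# tokenizer, model = load_checkpoint(CHECKPOINTS[0])
```

The Majority-Voting, Entropy, Self-Certainty, and GT entries appear to be baseline checkpoints trained on the same base model, which would make side-by-side comparison with the Co-rewarding-I checkpoint straightforward.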
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
