Self-rewarding correction for mathematical reasoning
Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang

TL;DR
This paper introduces a novel self-rewarding reasoning framework for large language models that enables autonomous error detection, correction, and iterative refinement without external feedback, improving mathematical reasoning performance.
Contribution
The paper proposes a two-stage algorithmic framework for training self-rewarding models using self-generated data, enhancing autonomous reasoning and correction capabilities.
Findings
Outperforms intrinsic self-correction methods.
Achieves performance comparable to systems that rely on external reward models.
Demonstrates effectiveness on Llama-3 and Qwen-2.5 models.
Abstract
We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms.…
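The self-correction loop described in the abstract (attempt, self-evaluate, revise, decide when to stop) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `self_correct`, `toy_generate`, and `toy_self_evaluate`, the `max_rounds` parameter, and the history format are all hypothetical, and the two callables stand in for what the paper describes as a single model performing both roles. The toy evaluator is a hard-coded stand-in, not a learned self-reward.

```python
from typing import Callable, List, Tuple

# Hypothetical types; the abstract does not specify any API.
History = List[Tuple[str, bool]]            # (attempt, self-judged correct?)
Generate = Callable[[str, History], str]    # (problem, history) -> new attempt
SelfEvaluate = Callable[[str, str], bool]   # (problem, attempt) -> accept?


def self_correct(
    problem: str,
    generate: Generate,
    self_evaluate: SelfEvaluate,
    max_rounds: int = 3,
) -> Tuple[str, History]:
    """Attempt, self-evaluate, and revise until accepted or out of rounds."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history: History = []
    for _ in range(max_rounds):
        attempt = generate(problem, history)         # reason step by step
        accepted = self_evaluate(problem, attempt)   # model's own verdict
        history.append((attempt, accepted))
        if accepted:                                 # terminate refinement
            break
    # Return the last attempt, whether or not it was accepted.
    return history[-1][0], history


# Toy stand-ins (assumptions for illustration only).
def toy_generate(problem: str, history: History) -> str:
    # First attempt is wrong; any revision gives the right answer.
    return "5" if not history else "4"


def toy_self_evaluate(problem: str, attempt: str) -> bool:
    # A real self-rewarding model would judge its own output here.
    return attempt == "4"


answer, trace = self_correct("2 + 2 = ?", toy_generate, toy_self_evaluate)
print(answer)
print(trace)
```

In a real system both callables would be backed by the same model, so the stopping decision requires no external reward model or verifier at inference time.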
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications
Methods: Focus
