RLSR: Reinforcement Learning from Self Reward
Toby Simonds, Kevin Lopez, Akira Yoshiyama, Dominique Garmier

TL;DR
This paper introduces a method in which a large language model judges its own solutions to generate reward signals, enabling reinforcement learning without external ground truth and extending autonomous self-improvement to domains where verifiable rewards are impractical.
Contribution
The authors demonstrate that LLMs can self-judge solutions to provide effective rewards, eliminating the need for external verification and enabling reinforcement learning in complex, reward-scarce domains.
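As a rough illustration of the idea, the sketch below turns a model's own verdict on a candidate solution into a binary reward. The `judge` callable, the prompt wording, and the YES/NO convention are all hypothetical placeholders chosen for this example; the paper's actual prompts and reward shaping are not given in this summary.

```python
from typing import Callable


def self_reward(problem: str, solution: str, judge: Callable[[str], str]) -> float:
    """Return 1.0 if the judge accepts the solution, else 0.0.

    `judge` is a hypothetical stand-in for a call to the same model that
    produced `solution`; no ground-truth answer is consulted anywhere.
    """
    # Illustrative prompt format only (an assumption, not the paper's prompt).
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Proposed solution:\n{solution}\n\n"
        "Is the proposed solution correct? Answer YES or NO."
    )
    verdict = judge(prompt).strip().upper()
    # Anything other than an explicit YES is treated as a rejection.
    return 1.0 if verdict.startswith("YES") else 0.0
```

In a reinforcement learning loop, this scalar would take the place of the reward normally supplied by an external verifier; a stub such as `lambda prompt: "YES"` is enough to exercise the function without a model.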
Findings
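Models can provide reliable reward signals by judging their own solutions, without ground-truth answers.
On Countdown puzzles and integration problems, training with self-judged rewards reaches performance comparable to training with formal verification.
Qwen 2.5 7B DeepSeek Distilled, trained with self-rewards, qualifies for the MIT Integration Bee competition.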
Abstract
Large language models can generate solutions to complex problems, but training them with reinforcement learning typically requires verifiable rewards that are expensive to create and not possible for all domains. We demonstrate that LLMs can effectively self-improve through self-judging without reference solutions, leveraging the inherent asymmetry between generating and verifying solutions. Our experiments show that models can provide reliable reward signals without ground truth answers, enabling reinforcement learning in domains where verifiable rewards are impractical. By implementing self-judging across Countdown puzzles and integration problems, we achieve performance comparable to formal verification without ground truth solutions. Most notably, Qwen 2.5 7B DeepSeek Distilled trained with self-rewards qualifies for the prestigious MIT Integration Bee competition, performance…
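The asymmetry between generating and verifying is easy to see on Countdown: finding an expression that hits the target requires search, while checking a candidate takes one pass over it. The sketch below is a formal checker of the kind that self-judging is compared against, not the self-judge itself. It assumes a common rule set (each given number used at most once, operators `+ - * /`); the exact variant used in the paper is not stated in this summary, and the function name is hypothetical.

```python
import ast
from collections import Counter
from fractions import Fraction

# Binary operators permitted in a candidate expression.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an AST node exactly, recording every integer literal used."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_reward(expression, numbers, target):
    """Return 1.0 if `expression` reaches `target` using only `numbers`."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    # Assumed rule: each provided number may be used at most once.
    if Counter(used) - Counter(numbers):
        return 0.0
    return 1.0 if value == target else 0.0
```

For example, `countdown_reward("(25 - 5) * 3 + 4", [25, 5, 3, 4], 64)` accepts the candidate, while an expression that uses a number outside the given list is rejected. Exact rational arithmetic via `Fraction` avoids floating-point mismatches when division is involved.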
