Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Shelly Bensal; Umar Jamil; Christopher Bryant; Melisa Russak; Kiran Kamble; Dmytro Mozolevskyi; Muayad Ali; Waseem AlShikh

arXiv:2505.24726·cs.CL·June 2, 2025

TL;DR

This paper introduces a framework in which large language models improve themselves through self-reflection and reinforcement learning, raising performance on complex, verifiable tasks when only binary feedback is available.

Contribution

The paper presents a two-stage method that combines self-reflection with reinforcement learning, enabling models to improve without synthetic data and with only binary success/failure feedback.

Findings

01 Up to 34.7% improvement at math equation writing

02 Up to 18.1% improvement at function calling

03 Smaller fine-tuned models outperform larger models from the same family

Abstract

We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation…
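
The two-stage loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Episode`, `reflect_retry_reward`, `generate`, and `verify`, the prompt wording, and the 0/1 reward values are all assumptions made for the example. The abstract only states that a failed attempt triggers a self-reflection, that the model retries with that reflection in context, and that the reflection tokens are rewarded if the retry succeeds.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Episode:
    first_attempt: str
    reflection: Optional[str]      # None when the first attempt already succeeded
    second_attempt: Optional[str]  # None when no retry was needed
    reflection_reward: float       # reward credited to the reflection tokens only


def reflect_retry_reward(
    task: str,
    generate: Callable[[str], str],      # hypothetical: prompt -> model output
    verify: Callable[[str, str], bool],  # hypothetical: binary success check
) -> Episode:
    """Run one reflect-and-retry episode and compute the reflection reward."""
    first = generate(task)
    if verify(task, first):
        # Success on the first try: nothing to reflect on, nothing to reward.
        return Episode(first, None, None, 0.0)

    # Stage 1: the model comments on its own failed attempt.
    reflection = generate(
        f"{task}\n\nPrevious attempt (failed):\n{first}\n\nReflect on what went wrong."
    )
    # Stage 2: the model retries with the self-reflection in context.
    second = generate(f"{task}\n\nSelf-reflection:\n{reflection}\n\nTry again.")

    # Only binary feedback is used; the reward attaches to the reflection.
    reward = 1.0 if verify(task, second) else 0.0
    return Episode(first, reflection, second, reward)


# Toy stand-ins for a model and a verifier (assumptions for illustration only).
def toy_generate(prompt: str) -> str:
    if "Reflect on" in prompt:
        return "I added instead of multiplying."
    if "Self-reflection:" in prompt:
        return "6"
    return "5"


def toy_verify(task: str, answer: str) -> bool:
    return answer == "6"


episode = reflect_retry_reward("What is 2 * 3?", toy_generate, toy_verify)
```

In a real training setup, `reflection_reward` would feed a reinforcement learning update applied to the reflection tokens; the choice of optimizer and how the reward is assigned per token are outside what the abstract specifies.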
