TL;DR
ICRL is a reinforcement learning framework that trains language models to internalize self-critique, improving performance on reasoning tasks without external feedback.
Contribution
The paper presents a novel method for jointly training a solver and a critic from a shared backbone, with new stabilization techniques, so that the solver internalizes critique and improves without it.
Findings
Achieved 6.4-point and 7.0-point improvements on agentic and mathematical reasoning tasks.
The learned 8B critic performs comparably to 32B critics while using fewer tokens.
Demonstrated consistent performance gains across diverse benchmarks.
Abstract
Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a…
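The abstract says the critic is rewarded according to the solver's subsequent performance gain but does not give the formula here. The sketch below is one plausible reading, not the paper's definition: it assumes the reward is the solver's mean success rate with the critique minus its mean success rate without it. The function name `critic_reward` and the use of binary rollout outcomes are illustrative assumptions.

```python
from typing import Sequence


def critic_reward(
    base_outcomes: Sequence[float],
    critiqued_outcomes: Sequence[float],
) -> float:
    """Hypothetical critic reward: solver gain attributable to the critique.

    base_outcomes: solver rollout scores without critique (e.g. 1.0 = solved).
    critiqued_outcomes: solver rollout scores on the same query, conditioned
    on the critique. Both the averaging and the subtraction are assumptions.
    """
    if not base_outcomes or not critiqued_outcomes:
        raise ValueError("need at least one rollout on each side")
    base_rate = sum(base_outcomes) / len(base_outcomes)
    critiqued_rate = sum(critiqued_outcomes) / len(critiqued_outcomes)
    # Positive when the critique helped, zero when it changed nothing,
    # negative when it made the solver worse.
    return critiqued_rate - base_rate


# Solver succeeds on 1 of 4 rollouts unaided and 3 of 4 after critique.
gain = critic_reward([1, 0, 0, 0], [1, 1, 0, 1])
print(gain)  # 0.5
```

Under this reading, a critique that leaves the solver's success rate unchanged earns nothing, which matches the abstract's stated aim of incentivizing actionable feedback.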
