Teaching Large Reasoning Models Effective Reflection
Hanbin Wang, Jingwei Song, Jinpeng Li, Qi Zhu, Fei Mi, Ganqu Cui, Yasheng Wang, Lifeng Shang

TL;DR
This paper introduces SCFT and RLERR, novel training methods that improve large reasoning models by fostering effective self-reflection, leading to better reasoning accuracy and reflection quality on challenging benchmarks.
Contribution
The paper presents two training methods for large reasoning models (LRMs): SCFT, which fine-tunes a model on its own critiques after filtering them by rejection sampling, and RLERR, a reinforcement learning method that rewards effective reflection.
Findings
SCFT improves critique quality and reasoning accuracy.
RLERR guides models to internalize self-correction effectively.
Methods outperform state-of-the-art baselines on AIME benchmarks.
Abstract
Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks, often by engaging in self-reflective behaviors such as self-critique and backtracking. However, not all reflections are beneficial; many are superficial, offering little to no improvement over the original answer and incurring computation overhead. In this paper, we identify and address the problem of superficial reflection in LRMs. We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self-generated critiques. SCFT prompts models to critique their own outputs, filters high-quality critiques through rejection sampling, and fine-tunes the model using a critique-based objective. Building on this strong foundation, we further introduce Reinforcement Learning with Effective Reflection Rewards (RLERR). RLERR…
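The SCFT filtering step described in the abstract (sample self-critiques, keep only the high-quality ones) can be illustrated with a minimal sketch. This is not the authors' implementation. The acceptance rule used here, keeping a critique only when its verdict agrees with whether the answer matches a reference answer, is an assumption, since the abstract does not state the criterion. The names `Critique`, `sample_critique`, and `rejection_sample_critiques` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    text: str             # critique the model wrote about its own answer
    judged_correct: bool  # the critique's verdict on that answer


def rejection_sample_critiques(
    question: str,
    answer: str,
    reference_answer: str,
    sample_critique: Callable[[str, str], Critique],
    num_samples: int = 4,
) -> List[Critique]:
    """Draw several self-critiques and keep those that pass the filter.

    ASSUMPTION: a critique counts as "high quality" when its verdict
    agrees with the ground truth (exact match against the reference).
    The paper's actual filter may differ.
    """
    answer_is_correct = answer.strip() == reference_answer.strip()
    kept: List[Critique] = []
    for _ in range(num_samples):
        # sample_critique is a hypothetical stand-in for prompting the
        # model to critique its own output.
        critique = sample_critique(question, answer)
        if critique.judged_correct == answer_is_correct:
            kept.append(critique)  # accepted sample
        # Otherwise the sample is rejected and discarded.
    return kept
```

Under this assumption, the accepted critiques would form the fine-tuning set for the critique-based objective; the objective itself is not specified in the abstract and is not sketched here.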
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
