Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization
Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Qirui Mi, Guoqing Liu, Zexu Sun, Mengyue Yang, Dong Li, Weiyu Ma, Ning Yang, Jian Zhao, Jianye Hao, Haifeng Zhang, Jun Wang

TL;DR
This paper introduces EVOLVE, a framework that enhances large language models' ability to self-refine responses through combined training and inference optimization, leading to improved performance and broader self-improvement capabilities.
Contribution
The paper presents a novel synergistic training-inference optimization method to activate and evolve LLMs' self-refinement ability, which was previously limited or ineffective.
Findings
Evolved Self-Refinement improves response quality and consistency.
Models surpass GPT-4o on multiple benchmarks with refined responses.
Self-Refinement generalizes well to out-of-domain reasoning tasks.
Abstract
Self-Refinement refers to a model's ability to revise its own responses to produce improved outputs. This capability can also serve as a fundamental mechanism for Self-Improvement, for example, by reconstructing datasets with refined results to enhance intrinsic model performance. However, our comprehensive experiments reveal that large language models (LLMs) show no clear evidence of inherent Self-Refinement and may even experience response quality degradation after Self-Refinement. To address this issue, we propose EVOLVE, a simple and effective framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We first explore optimization methods during training to activate the model's Self-Refinement capability. Then, at inference, we investigate various generation strategies to further enhance and utilize Self-Refinement while supplying the necessary…
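As a rough illustration of the inference-time side, the sketch below chains several refinement rounds, each conditioned on the prompt and the previous response. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `generate` and `refine`, the `num_rounds` parameter, and the toy stand-in models are all hypothetical, and the abstract does not specify the prompts or stopping rules EVOLVE uses.

```python
from typing import Callable, List


def chain_of_self_refinement(
    prompt: str,
    generate: Callable[[str], str],
    refine: Callable[[str, str], str],
    num_rounds: int = 2,
) -> List[str]:
    """Return the initial draft followed by each successive refinement.

    `generate` and `refine` are hypothetical wrappers around an LLM call;
    they are assumptions made for this sketch, not part of the paper.
    """
    responses = [generate(prompt)]
    for _ in range(num_rounds):
        # Each round sees the original prompt and the latest response.
        responses.append(refine(prompt, responses[-1]))
    return responses


# Toy stand-ins so the sketch runs without a model (assumptions only).
def toy_generate(prompt: str) -> str:
    return f"draft answer to: {prompt}"


def toy_refine(prompt: str, previous: str) -> str:
    return previous + " (revised)"


history = chain_of_self_refinement(
    "What is 2 + 2?", toy_generate, toy_refine, num_rounds=2
)
final_response = history[-1]
```

Keeping the full history rather than only the last response makes it possible to check whether later rounds actually improve on the draft, which matters given the abstract's observation that untrained models may degrade after refinement.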
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper studies an interesting problem of improving "self refinement", which can be thought of as an important form of reasoning for LLMs.
- The evaluation setup follows a well-established method, similar to e.g. Self-Rewarding Language Models (Yuan et al., ICML 2024, https://openreview.net/pdf?id=0NphYCmgua) and Iterative Reasoning Preference Optimization (Pang et al., NeurIPS 2024; the paper cites only the arXiv preprint by Yuan et al.).
- The paper surfaces a significant reasoni

## Self-Refinement vs Self-Improvement
- I think the paper should define more clearly what is meant by self-refinement versus self-improvement. I think that the authors mean that self-refinement is an inference-time procedure, while self-improvement means using the model's generations in additional training.
- While existing models show a roughly neutral to strongly negative impact when using self-refinement, the methods proposed in this paper do not result in models which can significantly ref

- The paper addresses a central open question in self-improving LLMs: do models inherently self-refine? The empirical finding (they do not) is significant and timely.
- It is a comprehensive, end-to-end solution that correctly identifies that a new capability requires both training and a mechanism to apply it. It also identifies and formalizes self-refinement activation and training.
- The empirical results seem to be quite strong.

- While results on AlpacaEval2/Arena-Hard are strong, evaluation is largely in general-instruction domains. Math results are promising but limited. The paper could include a broader set of tasks (coding, safety, knowledge-intensive QA) to test whether self-refinement holds across modalities and task difficulty.
- EVOLVE uses reward-model scoring to build preference data. Reward-model bias or reward hacking is lightly discussed but not fully quantified. Maybe the authors can report RM perplexity dr

- The performance is good.
- They introduce a new loss function to optimize the policy for enabling self-correction.
- They explore four different ways to use the SR capability during inference. The comparison among the four patterns is interesting. (It also feels somewhat expected, as Chain of Self-Refinement follows a process similar to the training distribution. I believe there is still room for exploration in how SR is applied at inference time. For example, from a test-time scaling perspective

- Only small models (7B, 8B) are tested. It would be even better if experiments were conducted on larger models.
- In Figure 1, the authors use Qwen2.5-7B and Gemma 2-9B, but these models are not used in Section 4.1. Why? It would be preferable to conduct a more comprehensive set of experiments covering all models.

- The paper is overall well-written and easy to follow.
- The proposed algorithm works for non-Qwen models, which is relatively uncommon in recent work.
- A detailed ablation study is provided to support the proposed algorithm.

- The base model and test datasets are relatively weak. No difficult datasets such as AIME or OlympiadBench are tested.
- Baselines are relatively weak. Although several algorithms are compared (e.g., Iterative DPO, SRPO), popular algorithms such as GRPO, DAPO, and GSPO are not. This is very important since DeepSeek-R1 has already shown that self-reflection capability can be learned directly through these algorithms.
- While the claim is on enabling self-reflection capability of the LLMs,
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Balanced Selection
