A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models

Jinyi Han; Xinyi Wang; Haiquan Zhao; Tingyun li; Zishang Jiang; Sihang Jiang; Jiaqing Liang; Xin Lin; Weikang Zhou; Zeye Sun; Fei Yu; Yanghua Xiao

arXiv:2508.12903·cs.CL·October 7, 2025

TL;DR

ProActive Self-Refinement (PASR) enables large language models to dynamically decide when and how to refine their outputs during generation, improving accuracy and efficiency across multiple tasks.

Contribution

This paper introduces PASR, a proactive method allowing LLMs to refine outputs during generation based on internal states, unlike prior reactive approaches.

Findings

01

PASR reduces token consumption by 41.6% on Qwen3-8B.

02

PASR improves accuracy by 8.2%.

03

PASR outperforms standard generation across 10 diverse tasks.

Abstract

Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The paper presents strong results for PASR, comparing it against 8 other self-refinement baselines.
2. PASR is token-efficient: it refines effectively while using fewer tokens than many baselines.
3. The paper is well written and clearly explains every reward used. It also presents a comprehensive ablation study of PASR.

Weaknesses

1. The paper could also ablate the individual rewards to show which ones are most effective and have the greatest impact on downstream accuracy.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* A large number of datasets covering diverse reasoning capabilities are explored.
* The authors evaluate against a large set of baselines, which makes their results stronger and better contextualized.
* The method is written relatively cleanly and explained well. The reward design section is particularly informative, as it details and justifies each component of the multi-dimensional reward. The intuition is clear and helpful here.
* The proposed method consistently outperforms base

Weaknesses

* Only Qwen models of similar sizes are tested in this work. It would be important to show that the findings generalize to other models, especially in light of recent discussion that Qwen's SFT makes it perform very differently from other base models at inference time. A scaling comparison would also be useful: how do these results change on larger or smaller models of the same architecture? It's unclear if rollouts can be obtained in a zero-shot way with other reasoning models while fo

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The idea of moving from reactive post-hoc correction to proactive in-process self-refinement is novel.
2. MDP formulation and GRPO-based RL training are clearly presented.
3. Effective reward scheme design: combines structure, correctness, and refinement quality to balance precision and efficiency.
4. Strong results are shown across diverse tasks and baselines (Self-Refine, PTR, SCoRe, RISE).

Weaknesses

1. The accuracy and refinement rewards depend on another LLM for scoring, which introduces potential bias and circular evaluation concerns (especially if the same family of models is used for training and evaluation).
2. Only schematic examples (e.g., the logic puzzle in Figure 1) are shown. More real examples illustrating PASR’s step-by-step refinement and error correction would improve interpretability.
3. RL overhead compared to SFT methods is not analyzed.
4. The paper focuses mainly on rewa
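
Reviewer 03 refers to GRPO-based RL training with a reward that combines structure, correctness, and refinement quality. The sketch below only illustrates that general idea and is not the paper's implementation: `combined_reward`, its equal default weights, and the example scores are hypothetical. The group normalization follows the standard GRPO recipe of subtracting the group mean and dividing by the group standard deviation (using the population standard deviation is an arbitrary choice here).

```python
from statistics import mean, pstdev


def combined_reward(structure, correctness, refinement, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of three reward components.

    The component names mirror the review text; the weights are assumptions.
    """
    w_s, w_c, w_r = weights
    return w_s * structure + w_c * correctness + w_r * refinement


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up component scores (structure, correctness, refinement) for three
# rollouts sampled for the same prompt.
rollouts = [(1.0, 1.0, 0.5), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
rewards = [combined_reward(*scores) for scores in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that score above the group average receive a positive advantage and those below receive a negative one, so the policy update depends only on how responses compare within the same group.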

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques