# Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

**Authors:** Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, Furu Wei, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

arXiv: 2509.00084 · 2026-04-28

## TL;DR

This paper introduces the Refinement Gap, a metric that quantifies how much parallel self-refinement improves over majority voting, and Generative Self-Refinement (GSR), a parallel test-time scaling framework that transfers the refinement policy of larger teacher models to smaller students. GSR achieves state-of-the-art results among parallel aggregation methods on five mathematical benchmarks.

## Contribution

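The paper proposes the Refinement Gap metric and the GSR framework. GSR jointly trains a single model to generate strong candidates and to refine them into a better final answer, and the learned refinement skill transfers across model scales and families and to an out-of-distribution domain.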

## Key findings

- The Refinement Gap shows a clear scaling trend with model size and is only weakly correlated with base capability.
- GSR outperforms other parallel aggregation methods on five mathematical benchmarks.
- The learned refinement skill transfers across model scales and families and generalizes to an out-of-distribution domain.
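
This summary does not give the formula for the Refinement Gap, only that it quantifies the relative improvement of self-refinement beyond majority voting. The sketch below is one plausible reading, not the authors' definition: it assumes the gap is the accuracy gain of the refined answer over the majority-vote answer, divided by majority-vote accuracy. The function names `accuracy` and `refinement_gap` are hypothetical.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of problems answered correctly."""
    if not correct:
        raise ValueError("need at least one problem")
    return sum(correct) / len(correct)


def refinement_gap(
    refined_correct: Sequence[bool],
    majority_correct: Sequence[bool],
) -> float:
    """Relative improvement of refinement over majority voting.

    ASSUMPTION: gap = (acc_refine - acc_majority) / acc_majority.
    The paper's exact definition may differ.
    """
    if len(refined_correct) != len(majority_correct):
        raise ValueError("both sequences must cover the same problems")
    acc_majority = accuracy(majority_correct)
    if acc_majority == 0:
        raise ValueError("relative improvement is undefined at zero accuracy")
    return (accuracy(refined_correct) - acc_majority) / acc_majority


# Toy data (not from the paper): per-problem correctness on four problems.
refined = [True, True, True, False]
majority = [True, True, False, False]
gap = refinement_gap(refined, majority)
```

Under this assumed definition, a positive value means the refined answers beat majority voting on the same candidate pool, and a negative value means refinement hurt.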

## Abstract

Test-time scaling (TTS) has gained widespread attention for enhancing LLM reasoning. Existing approaches such as Best-of-N and majority voting are limited as their performance depends on the quality of candidate responses, making them unable to produce a correct solution when all candidates are incorrect. Parallel self-refinement, generating multiple candidates and synthesizing a refined answer conditioned on them, offers a promising alternative, but the underlying mechanism driving its effectiveness remains obscure. To bridge this gap in understanding, we introduce a new metric, the Refinement Gap, designed to quantify the relative improvement of self-refinement beyond majority voting. We show that the Refinement Gap exhibits a clear scaling trend with model size and is only weakly correlated with the base capability. Based on this discovery, we propose Generative Self-Refinement (GSR), a parallel test-time scaling framework that transfers the refinement policy from larger teacher models with higher refinement gap into smaller students. Crucially, GSR jointly trains a single model to generate strong candidates and refine a better final answer based on these candidates. Experimental results demonstrate that our method achieves state-of-the-art performance across five mathematical benchmarks over other parallel aggregation methods, while the learned refinement skill transfers across multiple model scales and families and exhibits robust generalization to an out-of-distribution domain.
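
The contrast the abstract draws between majority voting and parallel self-refinement can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `generate` stands in for any text-in, text-out model call, the prompt wording in `build_refinement_prompt` is invented, and none of it reproduces the authors' prompts or training setup.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent candidate answer.

    The result is always one of the candidates, so it cannot be
    correct when every candidate is wrong.
    """
    if not answers:
        raise ValueError("need at least one candidate")
    return Counter(answers).most_common(1)[0][0]


def build_refinement_prompt(question: str, candidates: List[str]) -> str:
    """Assemble a prompt that conditions on all candidates (hypothetical wording)."""
    numbered = "\n".join(
        f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates)
    )
    return (
        f"Question: {question}\n"
        f"{numbered}\n"
        "Review the candidates and write an improved final answer."
    )


def parallel_self_refine(
    question: str, generate: Callable[[str], str], n: int
) -> str:
    """Sample n candidates, then ask the same model to synthesize a final answer."""
    candidates = [generate(question) for _ in range(n)]
    return generate(build_refinement_prompt(question, candidates))


# Toy usage with a stub model (hypothetical): it answers "5" to the bare
# question and "7" once it sees a refinement prompt.
def stub_model(prompt: str) -> str:
    return "7" if "Candidate 1:" in prompt else "5"


voted = majority_vote(["5", "6", "5"])
refined = parallel_self_refine("What is 3 + 4?", stub_model, n=3)
```

In the toy run, voting can only return one of the sampled answers, while the refinement step is free to produce an answer that appears in none of them. Whether a real model actually does so is the question the Refinement Gap is meant to measure.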

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00084/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00084/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/2509.00084/full.md

---
Source: https://tomesphere.com/paper/2509.00084