$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

Pinzheng Wang; Shuli Xu; Juntao Li; Yu Luo; Dong Li; Jianye Hao; Min Zhang

arXiv:2603.07197·cs.AI·March 10, 2026

Open Access · 3 Reviews

TL;DR

Re$^2$ enables large language models to improve reasoning by learning to abandon unproductive paths and restart, significantly increasing redo behavior and boosting performance without supervised fine-tuning.

Contribution

Introducing Re$^2$, a reinforcement learning method that allows LLMs to selectively restart reasoning paths, enhancing reasoning efficiency and accuracy without prior supervised training.

Findings

1. Redo behavior increased from 0.5% to over 30%.
2. Performance gains over standard RLVR with same compute.
3. Notable improvements with increased test samples.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized. To this end, we introduce Reinforcement Learning with Re-solving (Re$^{2}$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^{2}$ applies pure reinforcement learning…
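The restart behavior described in the abstract can be pictured as a simple inference loop. The sketch below is illustrative only: the `RESOLVE_MARKER` string, the `generate` callable, and the `max_rounds` cap are hypothetical names introduced here, not details taken from the paper.

```python
from typing import Callable, Optional

# Hypothetical marker the model would emit to abandon the current attempt.
RESOLVE_MARKER = "<re-solve>"


def solve_with_restarts(
    question: str,
    generate: Callable[[str], str],
    max_rounds: int = 4,
) -> Optional[str]:
    """Query the model from scratch until it commits to an answer.

    Returns the first response that does not contain the restart marker,
    or None if every round asked to re-solve.
    """
    for _ in range(max_rounds):
        response = generate(question)  # each round starts a fresh chain of thought
        if RESOLVE_MARKER not in response:
            return response  # the model committed to a final answer
    return None  # round budget exhausted without a committed answer
```

Any function that maps a prompt to a completion can be passed as `generate`; the cap on rounds keeps the total token budget bounded when the model keeps restarting.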

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The method is well-motivated. I appreciate that the authors discuss the problem in Section 3 before introducing the proposed method.
- The idea of allowing the model to restart its reasoning makes a lot of sense to me.
- The paper is well-written and easy to follow.

Weaknesses

- The performance gain in Table 1 seems limited to me. I understand that whether +5.8 is significant or just noise may be subjective. I suggest including confidence intervals when the results are this close.
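One simple way to attach an interval to an accuracy gap like the one the reviewer mentions is a normal-approximation (Wald) interval for a difference of two proportions. This is a generic statistical sketch, not the paper's evaluation protocol; the function name and the numbers in the usage lines are hypothetical.

```python
import math


def accuracy_diff_ci(acc_a: float, acc_b: float, n: int, z: float = 1.96):
    """Normal-approximation (Wald) interval for acc_a - acc_b.

    Treats the two accuracies as independent proportions, each measured
    on n problems; z = 1.96 gives a nominal 95% interval.
    """
    diff = acc_a - acc_b
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return diff - z * se, diff + z * se


# Hypothetical accuracies and benchmark size, for illustration only.
low, high = accuracy_diff_ci(acc_a=0.60, acc_b=0.55, n=500)
print(f"difference CI: [{low:.3f}, {high:.3f}]")
```

When both systems are scored on the same problems, a paired analysis (for example, bootstrapping over problems) is usually the better choice, because it accounts for per-problem correlation.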

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The core idea is original in the RLVR context: instead of privileging a single forward chain, the policy is explicitly trained to reconsider and restart when early evidence suggests confusion. This operationalizes a common human strategy and formalizes it with a clear three-way action space and a principled reward for resolve based on out-of-group success rates with a finite-round geometric expectation, yielding an intuitively aligned decision rule between “continue” and “restart.” Methodologi
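One possible reading of the "finite-round geometric expectation" mentioned by the reviewer is a truncated geometric series over restart rounds. The sketch below assumes, purely for illustration, that every attempt is independent and ends in a correct answer, a restart, or a wrong answer with fixed probabilities; the function and parameter names are hypothetical, and the paper's actual reward definition may differ.

```python
def expected_success(p_correct: float, p_resolve: float, max_rounds: int) -> float:
    """Probability of eventually answering correctly within max_rounds attempts.

    Assumes each attempt independently ends in a correct answer with
    probability p_correct, a restart with probability p_resolve, and a
    wrong answer otherwise.
    """
    if p_correct < 0 or p_resolve < 0 or p_correct + p_resolve > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    # Round k is reached only if the previous k attempts all chose to restart.
    return sum(p_correct * p_resolve**k for k in range(max_rounds))
```

For example, with `p_correct=0.3`, `p_resolve=0.5`, and two rounds, the value is 0.3 + 0.5 × 0.3 = 0.45. Under this simplified model, restarting is attractive whenever that value exceeds the chance that the current path still reaches the correct answer.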

Weaknesses

Although the truncate-then-continue method is shown to be effective and better than DAPO, its rollout procedure is much slower. I would be interested to see how much computational overhead this continuation step adds.

Comparisons emphasize DAPO, but the space of modern RLVR baselines is broader (e.g., GRPO-style group objectives, critique-assisted RL, efficient RL variants), and decoding-time strategies that terminate low-confidence chains or backtrack could provide

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper proposes a novel re-solving paradigm that lets LLMs restart reasoning when current paths fail, breaking from the one-shot chain-of-thought assumption in prior RLVR work. 2. The approach is rigorously implemented, with a clear algorithm, fair baselines, and strong empirical validation across multiple benchmarks.

Weaknesses

While the paper presents an interesting framework, the underlying mechanisms behind the method's effectiveness are not sufficiently analyzed. It remains unclear why the re-solving behavior leads to higher reasoning accuracy or a higher performance ceiling. For example, does restarting primarily improve exploration diversity, mitigate early trajectory bias, or simply increase the effective number of rollouts? A deeper behavioral or theoretical analysis would make the contribution more convincing.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications