ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Difan Jiao; Qianfeng Wen; Blair Yang; Zhenwei Tang; Ashton Anderson

arXiv:2604.01591·cs.AI·April 8, 2026

TL;DR

ThinkTwice is a two-phase training framework that enhances large language models' reasoning and self-refinement abilities through joint optimization, leading to significant performance improvements on mathematical benchmarks.

Contribution

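It introduces a joint training approach based on Group Relative Policy Optimization (GRPO) that improves both reasoning and self-refinement with a single binary correctness reward, without giving the model correctness signals or critique annotations on the solutions it refines.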

Findings

1. ThinkTwice outperforms baseline methods on five reasoning benchmarks.

2. Self-refinement in ThinkTwice significantly boosts accuracy after one iteration.

3. Training dynamics show a curriculum where early errors are corrected and solutions are preserved as models improve.

Abstract

We introduce ThinkTwice, a simple two-phase framework, based on Group Relative Policy Optimization (GRPO), that jointly optimizes LLMs to solve reasoning problems and to refine their own answers. In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems. Both phases use the same binary correctness reward, and the model is given no correctness signals or critique annotations on the solutions it refines. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training…
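The alternating schedule and the shared binary reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exact-match reward check, the even/odd step assignment, and the refinement prompt template are all hypothetical, and the advantage formula is the standard GRPO group normalization (reward minus group mean, divided by group standard deviation), assumed here with a population standard deviation.

```python
from statistics import mean, pstdev


def binary_reward(answer: str, reference: str) -> float:
    """Binary correctness reward: 1.0 on exact match, else 0.0.
    (Exact string match is an assumption; real graders are more lenient.)"""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style advantage: normalize each reward against the
    mean and standard deviation of its own group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_for_step(step: int) -> str:
    """Alternate phases within each pair of training steps.
    (Even = solve, odd = refine is a hypothetical convention.)"""
    return "solve" if step % 2 == 0 else "refine"


def build_refine_prompt(problem: str, draft: str) -> str:
    """Hypothetical refinement prompt. It deliberately carries no hint
    about whether the draft is correct and no critique of it."""
    return (
        f"Problem: {problem}\n"
        f"Previous solution: {draft}\n"
        "Review the previous solution and give a final answer."
    )


if __name__ == "__main__":
    # One group of four sampled answers to the same problem.
    sampled = ["42", "41", "40", "42"]
    rewards = [binary_reward(a, "42") for a in sampled]
    print(phase_for_step(0), rewards, group_relative_advantages(rewards))
    print(phase_for_step(1))
    print(build_refine_prompt("What is 6 * 7?", sampled[1]))
```

In this sketch, a group whose samples are all correct or all incorrect has zero standard deviation, so every advantage is zero and that group contributes no learning signal. The same reward and advantage functions serve both phases; only the prompt changes.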

Code & Models

Repositories

csslab/ThinkTwice (GitHub)
