PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Yuhua Jiang; Yuwen Xiong; Yufeng Yuan; Chao Xin; Wenyuan Xu; Yu Yue; Qianchuan Zhao; Lin Yan

arXiv:2506.10406·cs.CL·June 13, 2025


Open Access·4 Reviews

TL;DR

PAG introduces a unified reinforcement learning framework enabling large language models to self-correct by selectively verifying and revising their outputs, improving reasoning accuracy without external verifiers.

Contribution

It proposes a novel verify-then-revise mechanism within a multi-turn RL paradigm, enhancing LLM self-correction and verification capabilities in a unified approach.

Findings

01. PAG improves self-correction accuracy across reasoning benchmarks.

02. Self-verification with PAG outperforms self-consistency methods.

03. Selective revision reduces unnecessary model revisions.
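The contrast in the second finding can be illustrated with a minimal sketch. Plain self-consistency takes a majority vote over N sampled answers, whereas a self-verified variant first drops the samples that the model's own verifier rejects. The filter-then-vote rule, the fallback to the full pool, and the names `majority_vote`, `self_verified_vote`, and `verify` are assumptions made for illustration; this page does not state how the paper aggregates verified samples.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Self-consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


def self_verified_vote(answers: List[str], verify: Callable[[str], bool]) -> str:
    """Hypothetical self-verified Best-of-N.

    Keep only answers the verifier accepts, then vote among them.
    If the verifier rejects everything, fall back to a plain majority vote.
    """
    accepted = [a for a in answers if verify(a)]
    pool = accepted if accepted else answers
    return majority_vote(pool)


# Toy data: the wrong answer "7" is more frequent than the right answer "9".
samples = ["7", "7", "7", "9", "9"]
toy_verify = lambda a: a == "9"  # stand-in for the model's verification turn

plain = majority_vote(samples)
verified = self_verified_vote(samples, toy_verify)
```

In this toy case the plain vote returns the frequent but wrong answer, while the verified vote recovers the minority answer, which is the kind of gain a reliable verifier can give at test time. With an unreliable verifier the filter can equally discard correct samples.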

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates…
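The verify-then-revise workflow described in the abstract can be sketched as a short control loop. This is an illustration under stated assumptions, not the authors' implementation: `generate`, `verify`, and `revise` are hypothetical callables standing in for the same LLM prompted in its policy and verifier roles, and `max_revisions` is an assumed cap on the number of turns.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    role: str  # "policy" or "verifier"
    text: str


def verify_then_revise(
    question: str,
    generate: Callable[[str], str],      # hypothetical: first policy attempt
    verify: Callable[[str, str], bool],  # hypothetical: True if judged correct
    revise: Callable[[str, str], str],   # hypothetical: policy's new attempt
    max_revisions: int = 1,              # assumed cap, not from the paper
) -> Tuple[str, List[Turn]]:
    """Selective revision: revise only when verification flags an error."""
    answer = generate(question)
    transcript = [Turn("policy", answer)]
    for _ in range(max_revisions):
        judged_correct = verify(question, answer)
        transcript.append(Turn("verifier", "correct" if judged_correct else "wrong"))
        if judged_correct:
            break  # no second attempt when the verifier accepts the answer
        answer = revise(question, answer)
        transcript.append(Turn("policy", answer))
    return answer, transcript


# Toy stand-ins for the model, used only to exercise the control flow.
final, log = verify_then_revise(
    "2 + 2 = ?",
    generate=lambda q: "5",
    verify=lambda q, a: a == "4",
    revise=lambda q, a: "4",
)
```

The early `break` is the point of contrast with always-revise schemes: an answer the verifier accepts is returned unchanged, so no second attempt is generated for it.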

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The paper is well written and clearly presented, making it easy to follow.
2. The model introduces three concrete and well-motivated components (turn-independent optimization, bonus reward, and RoleAdvNorm), which are thoughtfully engineered to stabilize multi-turn RL training.
3. The empirical results are strong against baselines, especially in test-time scaling against majority voting.
4. The contributions are empirically well-demonstrated through detailed ablation studies and analysis.
5. …

Weaknesses

1. The evaluation is limited to mathematical reasoning (Table 2), lacking experiments on logical or commonsense reasoning tasks that could better demonstrate generality.
2. The most significant concern is the absence of strong single-turn training baselines such as GRPO, DAPO, or DPO. To convincingly argue for the efficiency and effectiveness of training, comparisons with these methods are needed.
3. In Figure 5, the test-time scalability comparison between self-verify BoN and majority voting

Reviewer 02·Rating 4·Confidence 4

Strengths

1. This selective revision explicitly tackles collapse and yields better revision efficiency than always-revise baselines. Originality lies in unifying policy and verifier into one model with a *selective* verify-then-revise loop, plus stable multi-turn RL adaptations (turn-independent advantages + RoleAdvNorm + improvement bonus).
2. On Qwen2.5-7B, PAG reaches 82.3% Acc.@final on MATH500 and the best average 38.3%, outperforming Single-Turn, Direct Multi-Turn, and SCoRe; verifier performance o

Weaknesses

1. Core results focus primarily on math; extensions to logic/coding are in the appendix but with much less emphasis.
2. The baselines are a bit "tricky": SCoRe is re-implemented (I know the code is not released), and for non-RL baselines such as self-consistency with majority voting at large N (policy-only BoN), the experiments are not run under the same compute budget and decoding settings.
3. Appendix shows that dropping reward on either the first policy turn or final output collapses one role

Reviewer 03·Rating 8·Confidence 5

Strengths

1. The proposed method simplifies over existing works on LLM self-correction by unifying verification and solving within the same model. As mentioned in the summary, this removes the need for any SFT or RL fine-tuning as a warm start to RL training, which is a significant improvement.
2. The changes made to RL training mentioned in the summary are quite reasonable and minor. They are also shown through ablations to be necessary for getting the proposed method to work well. They confirm findings

Weaknesses

1. The improvements from PAG are modest, especially in Tables 2 (math) and 9 (code generation) when compared to SCoRe, although this could still be considered interesting, as the method is simpler.

Reviewer 04·Rating 4·Confidence 5

Strengths

**[S1]** The paper is well written and the proposed method (PAG) is conceptually sound.
**[S2]** The experiments are generally well executed across multiple datasets.

Weaknesses

**[W1: Unclear benefit over single-turn RL].** While multi-turn verification sounds intuitively useful, it remains unclear what concrete benefit it provides over strong single-turn RL methods (e.g., GRPO or PPO with a single verification step), which already achieve strong reasoning and self-verification performance [1]. Prior works have shown that single-turn RL-trained models can already conduct implicit self-checks or “rethink” their answers during inference [2]. Furthermore, the author shoul


Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications

Methods: Policy as Generative Verifier (PAG)