Interactive Critique-Revision Training for Reliable Structured LLM Generation

Fei Xu Yu; Zuyuan Zhang; Mahdi Imani; Nathaniel D. Bastian; Tian Lan

arXiv:2605.08327 · cs.LG · May 12, 2026

TL;DR

This paper introduces DPA-GRPO, a novel paired-action training method for structured LLM decision-making, improving accuracy and reliability in tasks requiring local correctness and global consistency.

Contribution

The paper presents a generator-verifier training framework in which the verifier intervenes through structured safety assurance cases (SACs), improving structured decision accuracy over existing methods.

Findings

01. DPA-GRPO outperforms zero-shot and RL baselines in structured decision tasks.

02. Training increases silent acceptance and reduces missed errors.

03. Method improves calibrated revision behavior for both generator and verifier.

Abstract

In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, LLM outputs must be locally correct, globally consistent, and auditable against task-specific rules. Existing refinement methods often rely on heuristic debate, self-play, or LLM-generated supervision, creating a second-order assurance problem. We propose DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization), a paired-action training method for a two-player generator-verifier game with structured verifier interventions. The generator proposes outputs and may revise them when challenged; the verifier either remains silent or raises a safety assurance case (SAC) containing a claim, argument, and evidence. These SAC/no-SAC and KEEP/REVISE decisions induce paired counterfactual action groups, which DPA-GRPO uses for role-specific KL-regularized GRPO updates. We analyze…
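As a rough illustration of the paired-action idea in the abstract, the sketch below scores each binary decision (SAC vs. no-SAC for the verifier, KEEP vs. REVISE for the generator) against its counterfactual partner. It assumes the standard GRPO normalization, (reward minus group mean) divided by group standard deviation; the abstract does not give the paper's exact advantage formula, reward definition, or KL term, so the reward values, the `SafetyAssuranceCase` class, and the `group_relative_advantages` helper are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class SafetyAssuranceCase:
    """Hypothetical container; the three fields are the ones named in the abstract."""
    claim: str
    argument: str
    evidence: str


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization over one action group.

    `rewards` maps each action in the group to its scalar reward.
    This is an assumed form, not the paper's confirmed update rule.
    """
    values = list(rewards.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std over the group
    return {action: (r - mu) / (sigma + eps) for action, r in rewards.items()}


# Hypothetical rewards for the two paired counterfactual groups.
verifier_group = {"SAC": 1.0, "NO_SAC": 0.0}      # verifier: intervene or stay silent
generator_group = {"KEEP": 0.2, "REVISE": 0.8}    # generator: keep or revise output

verifier_adv = group_relative_advantages(verifier_group)
generator_adv = group_relative_advantages(generator_group)

# When both actions earn the same reward, neither is preferred.
tied_adv = group_relative_advantages({"SAC": 0.5, "NO_SAC": 0.5})
```

With only two actions per group, this normalization reduces to a signed preference: the better action gets an advantage near +1 and its counterfactual near -1, and a tie yields zero for both. Each role's advantages would then weight that role's own policy update.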
