Interactive Critique-Revision Training for Reliable Structured LLM Generation
Fei Xu Yu, Zuyuan Zhang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan

TL;DR
This paper introduces DPA-GRPO, a paired-action training method for structured LLM decision-making that improves accuracy and reliability on tasks requiring both local correctness and global consistency.
Contribution
It presents a generator-verifier training framework with safety assurance case (SAC) interventions, improving structured decision accuracy over existing methods.
Findings
DPA-GRPO outperforms zero-shot and RL baselines in structured decision tasks.
Training increases silent acceptance by the verifier and reduces missed errors.
The method improves calibrated revision behavior for both the generator and the verifier.
Abstract
In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, LLM outputs must be locally correct, globally consistent, and auditable against task-specific rules. Existing refinement methods often rely on heuristic debate, self-play, or LLM-generated supervision, creating a second-order assurance problem. We propose DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization), a paired-action training method for a two-player generator-verifier game with structured verifier interventions. The generator proposes outputs and may revise them when challenged; the verifier either remains silent or raises a safety assurance case (SAC) containing a claim, argument, and evidence. These SAC/no-SAC and KEEP/REVISE decisions induce paired counterfactual action groups, which DPA-GRPO uses for role-specific KL-regularized GRPO updates. We analyze…
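To make the paired-action idea concrete, here is a minimal Python sketch of how a two-action counterfactual group could be turned into group-relative advantages. It is an illustration under stated assumptions, not the paper's implementation: the `SafetyAssuranceCase` class, the `group_relative_advantages` helper, the mean/standard-deviation normalization, and all reward values are hypothetical, and the KL regularization term mentioned in the abstract is omitted.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict


@dataclass(frozen=True)
class SafetyAssuranceCase:
    """Hypothetical container for a verifier intervention (claim, argument, evidence)."""
    claim: str
    argument: str
    evidence: str


def group_relative_advantages(rewards: Dict[str, float], eps: float = 1e-8) -> Dict[str, float]:
    """Normalize each action's reward against its own group.

    Assumption: advantages are (reward - group mean) / (group std + eps),
    the usual GRPO-style normalization; the paper may define this differently.
    """
    values = list(rewards.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {action: (r - mu) / (sigma + eps) for action, r in rewards.items()}


# Hypothetical rewards for the two paired counterfactual groups.
verifier_group = {"SAC": 1.0, "NO_SAC": 0.0}      # verifier: challenge or stay silent
generator_group = {"KEEP": 0.2, "REVISE": 0.8}    # generator: keep or revise when challenged

verifier_adv = group_relative_advantages(verifier_group)
generator_adv = group_relative_advantages(generator_group)
```

With only two actions per group, the better action receives a positive advantage and its counterfactual a negative one of equal size; when both actions earn the same reward, both advantages are zero and the pair contributes no update signal. Each role's advantages would then feed that role's own policy update.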
