PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning

Wenquan Lu; Hai Huang; Enqi Liu; Randall Balestriero

arXiv:2602.03190·cs.LG·May 12, 2026


1 Repo 7 Models

TL;DR

PrAg-PO is a reinforcement learning method that improves mathematical reasoning in language models by training on a mix of prompt templates and output formats, which raises both accuracy and robustness.

Contribution

The paper introduces Prompt Augmented Policy Optimization (PrAg-PO), a simple policy optimization method that mixes prompt templates with template-specific format rewards during training to increase rollout diversity and robustness in reasoning models.
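The idea of pairing each prompt template with its own format reward can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the two templates, their wording, the regular expressions, the 0/1 reward values, and the helper names `build_prompt` and `format_reward`.

```python
import random
import re

# Hypothetical templates (not from the paper): each pairs an instruction
# with a regex that checks the output format that instruction asks for.
TEMPLATES = {
    "boxed": {
        "instruction": "Solve the problem step by step and put the final answer in \\boxed{}.",
        "pattern": re.compile(r"\\boxed\{[^{}]+\}"),
    },
    "answer_tag": {
        "instruction": "Reason inside <think></think>, then give the final answer inside <answer></answer>.",
        "pattern": re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL),
    },
}


def build_prompt(question, rng):
    """Pick one template at random and wrap the question with its instruction."""
    name = rng.choice(sorted(TEMPLATES))
    prompt = f"{TEMPLATES[name]['instruction']}\n\nProblem: {question}"
    return name, prompt


def format_reward(template_name, completion):
    """Return 1.0 if the completion follows the sampled template's format, else 0.0."""
    pattern = TEMPLATES[template_name]["pattern"]
    return 1.0 if pattern.search(completion) else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    name, prompt = build_prompt("What is 6 * 7?", rng)
    print(name)
    print(prompt)
```

In this sketch the format check depends on which template was sampled, so a completion that is well formatted for one template earns no format reward under another.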

Findings

01. PrAg-PO outperforms existing methods like GRPO and DAPO in reasoning accuracy.

02. PrAg-PO mitigates premature training collapse.

03. PrAg-PO achieves competitive results on mathematics benchmarks.

Abstract

Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have shown strong potential for improving the mathematical reasoning capabilities of large language models. While a growing body of work seeks to improve training entropy, rollout diversity, and exploration, most existing methods still train models with a single fixed reasoning prompt or template, which can encourage prompt-specific overfitting and unstable training dynamics. In this work, we introduce Prompt Augmented Policy Optimization (PrAg-PO), a simple policy optimization method that mixes prompt templates with template-specific format rewards during training. By encouraging models to generate reasoning traces under diverse instructions and output formats, PrAg-PO increases rollout diversity and improves robustness. Compared with GRPO and DAPO, PrAg-PO achieves significantly higher reasoning…
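For readers unfamiliar with the group-relative baseline the abstract refers to, the sketch below shows the standard GRPO-style advantage: each rollout's reward is normalized by the mean and standard deviation of the rewards in its group. How a format reward is combined with a correctness reward is an assumption here: `total_reward`, its `format_weight` of 0.1, the use of the population standard deviation, and the `eps` value are illustrative choices, not values reported by the authors.

```python
from statistics import mean, pstdev


def total_reward(correct, format_ok, format_weight=0.1):
    """Hypothetical combination: correctness plus a small format bonus."""
    return float(correct) + format_weight * float(format_ok)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts in the same group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one question: (answer correct?, format followed?)
    group = [(True, True), (False, True), (True, False), (False, False)]
    rewards = [total_reward(c, f) for c, f in group]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

When every rollout in a group receives the same reward, all advantages in this sketch are zero, so that group contributes no learning signal.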

Code & Models

Repositories

wenquanlu/PrAg-PO (GitHub)

Models
