TL;DR
PrAg-PO is a reinforcement learning method that trains language models on a mix of prompt templates and output formats, improving the accuracy and robustness of their mathematical reasoning.
Contribution
The paper introduces Prompt Augmented Policy Optimization (PrAg-PO), a simple approach that mixes prompt templates with template-specific format rewards to increase rollout diversity and robustness in reasoning models.
Findings
PrAg-PO outperforms existing methods like GRPO and DAPO in reasoning accuracy.
PrAg-PO mitigates premature training collapse.
PrAg-PO achieves competitive results on mathematics benchmarks.
Abstract
Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have shown strong potential for improving the mathematical reasoning capabilities of large language models. While a growing body of work seeks to improve training entropy, rollout diversity, and exploration, most existing methods still train models with a single fixed reasoning prompt or template, which can encourage prompt-specific overfitting and unstable training dynamics. In this work, we introduce Prompt Augmented Policy Optimization (PrAg-PO), a simple policy optimization method that mixes prompt templates with template-specific format rewards during training. By encouraging models to generate reasoning traces under diverse instructions and output formats, PrAg-PO increases rollout diversity and improves robustness. Compared with GRPO and DAPO, PrAg-PO achieves significantly higher reasoning…
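To make the idea of mixing prompt templates with template-specific format rewards more concrete, here is a minimal sketch. It is not the authors' implementation: the two templates, the regular expressions, the `format_weight` value, the way format and correctness rewards are combined, and all function names are assumptions chosen for illustration. The advantage computation follows the usual group-relative normalization used by GRPO (reward minus group mean, divided by group standard deviation).

```python
import random
import re
import statistics

# Hypothetical templates (not taken from the paper): each pairs an instruction
# with a regex that both checks the output format and captures the final answer.
TEMPLATES = [
    {
        "name": "boxed",
        "instruction": "Solve the problem step by step and put the final answer in \\boxed{}.",
        "pattern": re.compile(r"\\boxed\{([^{}]+)\}"),
    },
    {
        "name": "answer_tag",
        "instruction": "Reason inside <think></think>, then give the final answer inside <answer></answer>.",
        "pattern": re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL),
    },
]


def build_prompt(question, rng):
    """Sample one template at random and prepend its instruction to the question."""
    template = rng.choice(TEMPLATES)
    return template, f"{template['instruction']}\n\n{question}"


def reward(template, completion, gold, format_weight=0.1):
    """Score a completion against the format of the template it was sampled under.

    Assumption: a completion that ignores its template's format earns 0;
    otherwise it earns format_weight, plus 1.0 if the extracted answer
    matches the reference answer exactly.
    """
    match = template["pattern"].search(completion)
    if match is None:
        return 0.0
    correct = match.group(1).strip() == gold.strip()
    return format_weight + (1.0 if correct else 0.0)


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    template, prompt = build_prompt("What is 2 + 3?", rng)
    print(template["name"])
    print(prompt)
```

In a training loop, each question would get a freshly sampled template, every rollout in the group would be scored with that template's own format check, and the resulting advantages would feed the policy-gradient update. Scoring against the sampled template, rather than a single fixed format, is what ties the format reward to the prompt in this sketch.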
Code & Models
- 🤗 daviddavidlu/DAPO-with-prompt-augmentation-step2820 · model · 98 dl
- 🤗 daviddavidlu/DAPO-with-prompt-augmentation-step2720 · model · 90 dl
- 🤗 daviddavidlu/DAPO-with-prompt-augmentation-step2480 · model · 84 dl
- 🤗 daviddavidlu/PrAg-PO-DeepSeek-R1-Distill-Qwen-1.5B-step1100 · model · 40 dl
- 🤗 daviddavidlu/PrAg-PO-DeepSeek-R1-Distill-Qwen-1.5B-step1160 · model · 41 dl
- 🤗 daviddavidlu/PrAg-PO-Qwen3-1.7b-step1520 · model · 36 dl
- 🤗 daviddavidlu/PrAg-PO-Qwen3-1.7b-step720 · model · 86 dl
