P^2O: Joint Policy and Prompt Optimization
Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun

TL;DR
P^2O alternates policy updates with prompt optimization to improve reinforcement learning with large language models, especially on hard samples where every rollout fails and standard methods receive no learning signal.
Contribution
It introduces a joint optimization framework using GEPA and context distillation to enhance LLM reasoning and out-of-distribution performance.
Findings
Restores advantage signals in RLVR on hard samples.
Outperforms standard GRPO and surpasses baselines that are given doubled rollout budgets.
Achieves up to 9.5% performance improvement.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on "hard samples" where all rollouts fail. This lack of variance eliminates crucial learning signals. For these intractable samples, simply scaling up rollout budgets offers limited gains. We introduce Joint Policy and Prompt Optimization (P^2O) to mitigate this collapse by alternating continuous policy updates with discrete prompt evolution. P^2O leverages the GEPA algorithm to discover successful reasoning prompts for intractable instances. Via context distillation, the model internalizes these prompt-induced gains directly into its parameters, removing the need for inference-time prompting. Empirically, P^2O restores critical advantage signals, significantly outperforming standard GRPO and surpassing baselines with doubled rollout budgets,…
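The advantage collapse described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used group-normalized advantage (reward minus group mean, divided by group standard deviation) with binary verifiable rewards; the function name `group_advantages` and the `eps` constant are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's rollouts.

    Assumed form: (r - mean) / (std + eps). This is a common GRPO-style
    normalization, used here only to illustrate the collapse.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard sample: every rollout fails, so all rewards are identical.
hard = group_advantages([0.0] * 8)

# Tractable sample: one rollout out of four succeeds.
mixed = group_advantages([0.0, 0.0, 1.0, 0.0])

print(hard)   # every entry is zero: no gradient signal
print(mixed)  # the successful rollout gets a positive advantage
```

Under this assumption, adding more rollouts to a sample whose success rate is effectively zero leaves the rewards identical and the advantages at zero, which is consistent with the abstract's remark that scaling rollout budgets offers limited gains on such samples.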
