Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias
Rahaf Abu Hara, Vaibbhav Murarri, Claudio Zito

TL;DR
R2PO is a two-stage LLM framework for policy search that augments scalar reward feedback with trajectory-level behavioral evidence. The paper reports faster, more stable learning and better performance than prior methods across ten environments.
Contribution
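The paper proposes R2PO, an LLM-based policy search method that separates global search (a Search-LLM proposing policy parameters) from behavior-grounded revision (a Critic-LLM editing those parameters after inspecting rollouts). It also addresses salience bias in the revision step, which the authors report improves learning efficiency and stability.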
Findings
R2PO achieves the highest mean best reward across ten environments.
R2PO reaches near-optimal performance faster than prior methods.
Mitigating salience bias improves policy revision effectiveness.
Abstract
Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience…
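The propose, execute, critique, and select loop described in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the 1-D toy environment, the single-parameter policy `theta`, and the functions `search_propose` and `critic_revise` are all hypothetical stand-ins. In particular, the two "LLM" stages are replaced by a random perturbation and a hand-written heuristic so the example runs without any model, and the selection rule (keep whichever of candidate and revision scores higher) is an assumption about how high-variance edits might be filtered.

```python
import random
from statistics import mean


def rollout(theta, rng, steps=20):
    """Toy 1-D task (hypothetical): move from x=0 toward a goal at x=1.

    The policy has one parameter, a gain `theta`. Returns the trajectory
    as a list of (state, action, reward) tuples.
    """
    x, traj = 0.0, []
    for _ in range(steps):
        action = theta * (1.0 - x)
        x = x + action + rng.gauss(0.0, 0.01)
        reward = -abs(1.0 - x)
        traj.append((x, action, reward))
    return traj


def evaluate(theta, rng, episodes=5):
    """Run several rollouts; return the scalar score and the raw trajectories."""
    trajs = [rollout(theta, rng) for _ in range(episodes)]
    score = mean(sum(r for _, _, r in t) for t in trajs)
    return score, trajs


def search_propose(best_theta, rng):
    """Stand-in for the Search-LLM: a global proposal that sees only the incumbent."""
    return best_theta + rng.gauss(0.0, 0.5)


def critic_revise(theta, trajs):
    """Stand-in for the Critic-LLM: a revision grounded in observed states."""
    overshoots = mean(max(s for s, _, _ in t) for t in trajs) > 1.05
    final_x = mean(t[-1][0] for t in trajs)
    if overshoots:
        return theta - 0.2  # agent shot past the goal: lower the gain
    if final_x < 0.95:
        return theta + 0.2  # agent stalled short of the goal: raise the gain
    return theta


def r2po_sketch(iterations=30, seed=0):
    rng = random.Random(seed)
    best_theta = 0.0
    best_score, _ = evaluate(best_theta, rng)
    history = [best_score]
    for _ in range(iterations):
        cand = search_propose(best_theta, rng)      # stage 1: global search
        cand_score, trajs = evaluate(cand, rng)     # environment executes it
        revised = critic_revise(cand, trajs)        # stage 2: grounded revision
        rev_score, _ = evaluate(revised, rng)
        # Selection (assumed rule): keep the better of candidate and revision,
        # and accept it only if it beats the incumbent.
        theta, score = max(
            [(cand, cand_score), (revised, rev_score)], key=lambda p: p[1]
        )
        if score > best_score:
            best_theta, best_score = theta, score
        history.append(best_score)
    return best_theta, best_score, history


if __name__ == "__main__":
    theta, score, history = r2po_sketch()
    print(f"best theta={theta:.3f}, best score={score:.3f}")
```

Note that `critic_revise` receives the trajectories rather than the score alone, which is the distinction the abstract draws: a scalar such as 0.45 cannot say whether the agent overshot or stalled, while the observed states can.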
