Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

Rahaf Abu Hara; Vaibbhav Murarri; Claudio Zito

arXiv:2605.08315·cs.LG·May 12, 2026

TL;DR

This paper introduces R2PO, a two-stage LLM framework for policy search that augments scalar reward feedback with trajectory-level behavioral evidence, yielding faster, more stable learning and better performance across ten environments.

Contribution

The paper proposes R2PO, an LLM-based policy search method that revises candidate policies using evidence from observed trajectories and addresses salience bias, improving learning efficiency and stability.

Findings

01. R2PO achieves the highest mean best reward across ten environments.
02. R2PO reaches near-optimal performance faster than prior methods.
03. Mitigating salience bias improves policy revision effectiveness.

Abstract

Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience…
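
The propose, execute, revise, and select loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 1-D environment, the linear policy, and the functions `search_propose` and `critic_revise` are hypothetical stand-ins for the Search-LLM and Critic-LLM, which in the paper are language models rather than hand-written heuristics. The point it tries to show is that the critic step reads states and actions from the rollouts, while the selection step keeps a revision only if it scores better.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Params = Tuple[float, float]  # (w, b) of a toy linear policy: action = w * state + b


@dataclass
class Step:
    state: float
    action: float
    reward: float


def rollout(params: Params, rng: random.Random, horizon: int = 5) -> List[Step]:
    """Toy 1-D environment (an assumption): the agent should drive the state to 0."""
    w, b = params
    state = rng.uniform(-1.0, 1.0)
    traj = []
    for _ in range(horizon):
        action = w * state + b
        next_state = state + action
        traj.append(Step(state, action, -abs(next_state)))
        state = next_state
    return traj


def evaluate(params: Params, rng: random.Random, n_rollouts: int = 20):
    """Return the scalar score (mean return) together with the raw trajectories."""
    trajs = [rollout(params, rng) for _ in range(n_rollouts)]
    mean_return = sum(sum(s.reward for s in t) for t in trajs) / n_rollouts
    return mean_return, trajs


def search_propose(best: Params, rng: random.Random) -> Params:
    """Hypothetical stand-in for the Search-LLM: perturb the best parameters so far."""
    w, b = best
    return (w + rng.gauss(0.0, 0.3), b + rng.gauss(0.0, 0.3))


def critic_revise(params: Params, trajs: List[List[Step]]) -> Params:
    """Hypothetical stand-in for the Critic-LLM: use observed behavior, not the score.

    It measures where rollouts end up on average and shifts the bias against that drift.
    """
    w, b = params
    final_states = [t[-1].state + t[-1].action for t in trajs]
    drift = sum(final_states) / len(final_states)
    return (w, b - 0.5 * drift)


def r2po_sketch(iterations: int = 30, seed: int = 0):
    rng = random.Random(seed)
    best: Params = (0.0, 0.0)
    best_score, _ = evaluate(best, rng)
    for _ in range(iterations):
        candidate = search_propose(best, rng)          # stage 1: global search
        cand_score, trajs = evaluate(candidate, rng)   # environment executes it
        revised = critic_revise(candidate, trajs)      # stage 2: grounded revision
        rev_score, _ = evaluate(revised, rng)
        # Selection: a candidate or revision is kept only if it beats the incumbent.
        for params, score in ((candidate, cand_score), (revised, rev_score)):
            if score > best_score:
                best, best_score = params, score
    return best, best_score
```

Calling `r2po_sketch()` returns the best `(w, b)` pair found and its estimated mean return. Because scores come from a finite number of noisy rollouts, the selection step here is only a crude filter; the sketch does not attempt to model the salience bias the abstract mentions.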
