Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm
Prakul Sunil Hiremath

TL;DR
This paper introduces RAPO, a method that suppresses harmful replay in reinforcement learning through environment-level modifications, improving safety without substantially sacrificing task performance.
Contribution
The paper proposes RAPO, a novel environment augmentation technique that reduces harmful replay in RL, addressing a key safety challenge under delayed harm scenarios.
Findings
RAPO reduces re-amplification gain from 0.98 to 0.33 on large graphs, a relative reduction of roughly 66%.
RAPO retains 82% of task return despite environment modifications.
Disabling environment deformation during replay restores re-amplification, confirming environment-level deformation as causal.
Abstract
Safety in reinforcement learning (RL) is typically enforced through objective shaping while keeping environment dynamics stationary with respect to observable state-action pairs. Under delayed harm, this can lead to replay: after a washout period, reintroducing the same stimulus under matched observable conditions reproduces a similar harmful cascade. We introduce the Replay Suppression Diagnostic (RSD), a controlled exposure-decay-replay protocol that isolates this failure mode under frozen-policy evaluation. We show that, under stationary observable transition kernels, replay cannot be structurally suppressed without inducing a persistent shift in replay-time action distributions. Motivated by platform-mediated systems, we propose Regret-Aware Policy Optimization (RAPO), which augments the environment with persistent harm-trace and scar fields and applies a bounded,…
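The exposure-decay-replay protocol described above can be illustrated with a toy sketch. This is not the paper's implementation. The cascade model, the names `run_phase` and `reamplification_gain`, the parameters `spread` and `scar_rate`, and the definition of gain as replay harm divided by exposure harm are all assumptions made for illustration. The frozen policy is abstracted away by feeding the same stimulus in both phases.

```python
def run_phase(steps, stimulus, scar, deform, spread=0.9, scar_rate=0.5):
    """Run one stimulus-driven cascade; return (total harm, updated scar)."""
    level = stimulus
    total_harm = 0.0
    for _ in range(steps):
        # Hypothetical deformation: accumulated scar damps propagation.
        damp = 1.0 / (1.0 + scar) if deform else 1.0
        level = spread * level * damp
        total_harm += level
        scar += scar_rate * level  # scar only grows; it never washes out
    return total_harm, scar


def reamplification_gain(deform, steps=10, stimulus=1.0):
    """Replay harm divided by exposure harm (assumed definition)."""
    exposure_harm, scar = run_phase(steps, stimulus, 0.0, deform)
    # Decay phase: the cascade itself dies out (replay starts from a fresh
    # stimulus below), but the scar carries over into replay.
    replay_harm, _ = run_phase(steps, stimulus, scar, deform)
    return replay_harm / exposure_harm


if __name__ == "__main__":
    print("stationary dynamics:", reamplification_gain(deform=False))
    print("with scar memory:  ", reamplification_gain(deform=True))
```

In this toy model, stationary dynamics reproduce the cascade exactly on replay, so the gain is 1. With a persistent scar the replay cascade is damped from its first step, so the gain falls below 1. The specific values depend entirely on the assumed parameters and say nothing about the paper's reported numbers.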
