Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm

Prakul Sunil Hiremath

arXiv:2604.07428·cs.LG·April 10, 2026

TL;DR

This paper introduces RAPO, a method that suppresses harmful replay in reinforcement learning through environment-level modifications, improving safety while retaining 82% of task return.

Contribution

The paper proposes RAPO, an environment augmentation technique that adds persistent harm-trace and scar fields to the environment, suppressing harmful replay in RL under delayed harm.

Findings

01 RAPO reduces re-amplification gain from 0.98 to 0.33 on large graphs.

02 RAPO retains 82% of task return despite environment modifications.

03 Disabling environment deformation during replay restores re-amplification, confirming environment-level deformation as causal.

Abstract

Safety in reinforcement learning (RL) is typically enforced through objective shaping while keeping environment dynamics stationary with respect to observable state-action pairs. Under delayed harm, this can lead to replay: after a washout period, reintroducing the same stimulus under matched observable conditions reproduces a similar harmful cascade. We introduce the Replay Suppression Diagnostic (RSD), a controlled exposure-decay-replay protocol that isolates this failure mode under frozen-policy evaluation. We show that, under stationary observable transition kernels, replay cannot be structurally suppressed without inducing a persistent shift in replay-time action distributions. Motivated by platform-mediated systems, we propose Regret-Aware Policy Optimization (RAPO), which augments the environment with persistent harm-trace and scar fields and applies a bounded,…
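The exposure-decay-replay structure described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the paper's method: the function names `run_rsd` and `reamplification_gain`, the scalar `scar` field, the linear damping rule, and the definition of re-amplification gain as the ratio of replay-phase harm to exposure-phase harm (this excerpt does not define the metric). The stimulus is fixed in both phases to mimic frozen-policy evaluation.

```python
def run_rsd(use_scar, exposure_steps=5, washout_steps=20, replay_steps=5,
            scar_rate=0.2, max_damping=0.8):
    """Toy exposure-decay-replay run. Returns (exposure_harm, replay_harm).

    Hypothetical model: each step with a stimulus produces harm, and a
    persistent scalar scar field (if enabled) damps later harm by a bounded
    amount. This is not the RAPO update rule from the paper.
    """
    scar = 0.0  # assumed environment-level memory; persists across phases

    def phase(steps, stimulus):
        nonlocal scar
        total = 0.0
        for _ in range(steps):
            # Bounded deformation: damping never exceeds max_damping.
            damping = min(max_damping, scar) if use_scar else 0.0
            harm = stimulus * (1.0 - damping)
            total += harm
            if use_scar:
                scar += scar_rate * harm  # harm leaves a persistent trace
        return total

    exposure = phase(exposure_steps, 1.0)  # exposure: stimulus present
    phase(washout_steps, 0.0)              # washout: stimulus absent
    replay = phase(replay_steps, 1.0)      # replay: same stimulus again
    return exposure, replay


def reamplification_gain(exposure_harm, replay_harm):
    """Assumed metric: replay harm relative to exposure harm."""
    return replay_harm / exposure_harm


if __name__ == "__main__":
    for use_scar in (False, True):
        exposure, replay = run_rsd(use_scar)
        gain = reamplification_gain(exposure, replay)
        print(f"use_scar={use_scar}: gain={gain:.2f}")
```

In this toy model the stationary environment reproduces the full cascade on replay (gain of 1), while the scarred environment yields a gain below 1 because the scar survives the washout period. The numbers it prints are unrelated to the 0.98 and 0.33 figures reported above.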
