ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek

TL;DR
ReLAPSe introduces a reinforcement learning framework that efficiently restores erased concepts in diffusion models by directly leveraging model feedback, enabling scalable and near-real-time concept recovery.
Contribution
It pioneers a policy-based adversarial approach using reinforcement learning with verifiable rewards for concept restoration in unlearned diffusion models.
Findings
ReLAPSe achieves near-real-time concept recovery.
It outperforms existing optimization-based methods in efficiency.
It effectively restores fine-grained identities and styles across various unlearning techniques.
Abstract
Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal.…
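The abstract names the diffusion model's noise prediction loss as the feedback signal. A minimal sketch of what such a signal could look like follows, assuming the reward is the negative mean squared error between the sampled noise and the noise the model predicts for a candidate prompt. This is an illustration under stated assumptions, not the authors' implementation: `toy_predictor`, the `trigger` token, `TARGET`, and every function name here are hypothetical stand-ins, and a real setup would query the unlearned diffusion model instead.

```python
import math
import random

# Hypothetical "erased" target image, flattened to a short vector for illustration.
TARGET = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -2.0]

def add_noise(x0, eps, alpha_bar):
    """Standard forward diffusion: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def toy_predictor(x_t, prompt, alpha_bar):
    """Hypothetical stand-in for an unlearned model's noise predictor.

    Assumption: the concept is still latent, so a prompt containing the
    hypothetical token "trigger" lets the model recover the noise exactly;
    any other prompt yields an uninformative all-zero prediction.
    """
    if "trigger" in prompt:
        a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        return [(x - a * t) / b for x, t in zip(x_t, TARGET)]
    return [0.0] * len(x_t)

def noise_prediction_reward(predict_noise, x0, prompt, alpha_bar, rng):
    """Reward = negative MSE between the sampled noise and the predicted noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, eps, alpha_bar)
    eps_hat = predict_noise(x_t, prompt, alpha_bar)
    mse = sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)
    return -mse

# Same seed for both calls so both prompts are scored against identical noise.
r_good = noise_prediction_reward(toy_predictor, TARGET, "a trigger photo", 0.5, random.Random(0))
r_bad = noise_prediction_reward(toy_predictor, TARGET, "a plain photo", 0.5, random.Random(0))
```

In this toy setting the prompt that reactivates the latent concept earns a reward near zero, while the uninformative prompt is penalized. A policy trained with such a signal would be pushed toward prompts that lower the model's own denoising error on images of the erased concept.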
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
