ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek

arXiv:2602.00350 · cs.CV · February 3, 2026

TL;DR

ReLAPSe introduces a reinforcement learning framework that restores concepts erased from unlearned diffusion models by using feedback from the target model itself as the training signal, which makes concept recovery scalable and close to real time.

Contribution

It introduces a policy-based adversarial approach to concept restoration in unlearned diffusion models, in which the agent is trained with Reinforcement Learning with Verifiable Rewards (RLVR).
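RLVR updates a policy from rewards that can be checked automatically. This page does not say which policy-gradient estimator ReLAPSe uses, so the sketch below assumes a GRPO-style update, a common choice in RLVR: several candidate prompts are sampled for one target concept, their rewards are standardized within the group, and a REINFORCE surrogate loss raises the log-probability of above-average prompts. The helper names and all numbers are hypothetical.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of prompts sampled for the same concept.

    Hypothetical helper: GRPO-style normalization, not a confirmed detail of ReLAPSe.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def reinforce_loss(log_probs: Sequence[float], advantages: Sequence[float]) -> float:
    """Surrogate loss -mean(A_i * log pi(prompt_i)).

    Minimizing it with autograd would raise the log-probability of prompts whose
    reward beat the group mean. Here log_probs are plain floats, so this only
    evaluates the scalar; no gradients are computed.
    """
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(log_probs)


# Toy group: four candidate prompts for one erased concept (made-up numbers).
rewards = [-0.9, -0.5, -0.1, -0.5]    # e.g. negated denoising losses
log_probs = [-3.0, -2.0, -1.5, -2.5]  # log pi(prompt_i) under the current policy
advantages = group_relative_advantages(rewards)
loss = reinforce_loss(log_probs, advantages)
print([round(a, 3) for a in advantages], round(loss, 3))
```

Normalizing within the group removes the need for a learned value function: a prompt is rewarded only for doing better than its siblings on the same concept.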

Findings

1. ReLAPSe achieves near-real-time concept recovery.
2. It is more efficient than existing optimization-based methods.
3. It restores fine-grained identities and styles across a range of unlearning techniques.

Abstract

Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage face fundamental limitations: optimization-based methods are computationally expensive because they run an iterative search for every instance, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal.…
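The abstract names the diffusion model's noise prediction loss as the reward but is cut off before the details. A minimal sketch, assuming the standard DDPM epsilon-prediction objective, that reference latents of the erased concept are available, and that the reward is simply the negated loss; `noise_prediction_reward`, `toy_denoiser`, and the toy schedule are hypothetical stand-ins, not ReLAPSe's code.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical denoiser interface: (noisy latent x_t, timestep t, prompt) -> predicted noise.
Denoiser = Callable[[List[float], int, str], List[float]]


def noise_prediction_reward(
    denoiser: Denoiser,
    x0: Sequence[float],
    prompt: str,
    alpha_bar: Sequence[float],
    num_samples: int = 4,
    rng: Optional[random.Random] = None,
) -> float:
    """Reward = -(mean epsilon-prediction MSE) over random timesteps.

    Assumption, not confirmed by the abstract: x0 is a latent of a reference
    image of the erased concept, and a lower loss under `prompt` means the
    prompt pulls the unlearned model back toward that concept.
    """
    # With no rng passed, a fresh seeded generator is used, so every prompt
    # is scored on the same timesteps and noise draws.
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_samples):
        t = rng.randrange(len(alpha_bar))
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        a = alpha_bar[t]
        # DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
        x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
        eps_hat = denoiser(x_t, t, prompt)
        total += sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(eps)
    return -total / num_samples


# Toy check: a made-up 4-dimensional "latent" and a linearly decreasing alpha_bar.
alpha_bar = [1.0 - 0.02 * (i + 1) for i in range(40)]
x0 = [0.5, -1.0, 2.0, 0.0]


def toy_denoiser(x_t: List[float], t: int, prompt: str) -> List[float]:
    a = alpha_bar[t]
    if prompt == "recovering prompt":
        # Stand-in for a prompt that still elicits the concept: exact inverse of the forward process.
        return [(v - math.sqrt(a) * c) / math.sqrt(1.0 - a) for v, c in zip(x_t, x0)]
    # Stand-in for the unlearned behaviour: predicts zero noise.
    return [0.0] * len(x_t)


r_good = noise_prediction_reward(toy_denoiser, x0, "recovering prompt", alpha_bar)
r_bad = noise_prediction_reward(toy_denoiser, x0, "plain prompt", alpha_bar)
print(r_good, r_bad)
```

Because the score comes from the diffusion model's own denoising objective, no external classifier is needed, which is presumably what "model-intrinsic" refers to; in a real setup the denoiser would be the unlearned model's denoising network and x0 an encoded image latent.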

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications