REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models
Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, Renyang Liu

TL;DR
REFORGE is a black-box adversarial framework that tests the robustness of image generation model unlearning methods, revealing vulnerabilities and emphasizing the need for more robust unlearning techniques.
Contribution
This paper introduces REFORGE, a novel multi-modal attack framework that evaluates the robustness of image generation model unlearning in black-box settings.
Findings
REFORGE significantly increases attack success rates.
It achieves stronger semantic alignment and higher efficiency.
It exposes vulnerabilities in current unlearning methods.
Abstract
Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness of IGMU under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses…
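The masking idea in the abstract (concentrating perturbation noise on concept-relevant regions) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `attention_to_mask` and `apply_masked_perturbation`, the `keep_fraction` threshold, and the L-infinity budget `epsilon` are all assumptions made for illustration, and the attention map is taken as a given 2-D array rather than extracted from a real model.

```python
import numpy as np


def attention_to_mask(attn, keep_fraction=0.3):
    """Turn a 2-D attention map into a soft mask in [0, 1].

    Hypothetical helper: pixels below the (1 - keep_fraction) quantile
    of the normalized map are zeroed; the rest keep their normalized weight.
    """
    attn = np.asarray(attn, dtype=np.float64)
    lo, hi = attn.min(), attn.max()
    if hi - lo < 1e-12:
        # A flat map singles out no region, so nothing is perturbed.
        return np.zeros_like(attn)
    norm = (attn - lo) / (hi - lo)
    threshold = np.quantile(norm, 1.0 - keep_fraction)
    return np.where(norm >= threshold, norm, 0.0)


def apply_masked_perturbation(image, delta, mask, epsilon=8 / 255):
    """Add a mask-weighted, budget-limited perturbation to an image.

    image: H x W x C array in [0, 1]
    delta: H x W x C raw perturbation
    mask:  H x W array in [0, 1]
    epsilon: assumed L-infinity budget (not taken from the paper)
    """
    delta = np.clip(delta, -epsilon, epsilon)
    perturbed = image + mask[..., None] * delta  # broadcast mask over channels
    return np.clip(perturbed, 0.0, 1.0)


# Toy usage: a gray image whose "concept region" is a 3x3 patch.
rng = np.random.default_rng(0)
image = np.full((8, 8, 3), 0.5)
attn = np.zeros((8, 8))
attn[2:5, 2:5] = 1.0
mask = attention_to_mask(attn, keep_fraction=0.2)
delta = rng.uniform(-1.0, 1.0, size=image.shape)
adv = apply_masked_perturbation(image, delta, mask)
```

In this toy setup, pixels outside the masked patch are left untouched, which is one simple way to trade attack strength against visual fidelity. A black-box attack would additionally need a query-based search over `delta`, which this sketch leaves out.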
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Hate Speech and Cyberbullying Detection
