TL;DR
SafeRedir is a novel inference-time framework that redirects unsafe prompts to safe regions in image generation models, enabling robust unlearning without retraining or modifying the models.
Contribution
It introduces a prompt embedding redirection method with a safety classifier and token-level interventions, improving safety and robustness in image generation models.
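As a rough illustration of the idea, the sketch below shows one way a safety classifier could gate token-level redirection of prompt embeddings. It is not the paper's algorithm: the linear classifier, the `safe_anchor` vector, the probability-weighted interpolation rule, and the names `unsafe_score` and `redirect_tokens` are all assumptions made for this toy example.

```python
import math
from typing import List

Vector = List[float]


def unsafe_score(embedding: Vector, weights: Vector, bias: float) -> float:
    """Hypothetical linear safety classifier: sigmoid(w . e + b)."""
    logit = sum(w * e for w, e in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def redirect_tokens(
    tokens: List[Vector],
    safe_anchor: Vector,
    weights: Vector,
    bias: float,
    threshold: float = 0.5,
) -> List[Vector]:
    """Move token embeddings flagged as unsafe toward a safe anchor.

    Assumed rule (not from the source): a token whose unsafe probability p
    exceeds the threshold is replaced by (1 - p) * token + p * safe_anchor.
    Tokens at or below the threshold are copied through unchanged.
    """
    redirected = []
    for tok in tokens:
        p = unsafe_score(tok, weights, bias)
        if p <= threshold:
            redirected.append(list(tok))
        else:
            redirected.append(
                [(1.0 - p) * t + p * a for t, a in zip(tok, safe_anchor)]
            )
    return redirected


# Toy usage: two 2-d "token embeddings"; only the first dimension
# drives the hypothetical classifier.
tokens = [[-5.0, 1.0], [5.0, 1.0]]
result = redirect_tokens(
    tokens, safe_anchor=[0.0, 0.0], weights=[1.0, 0.0], bias=0.0
)
```

In this toy setup the first token scores far below the threshold and passes through untouched, while the second is pulled almost entirely onto the anchor. Because the intervention acts only on embeddings at inference time, the generation model's weights are never modified, which matches the summary's "without retraining" claim.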
Findings
Effective unlearning of unsafe concepts demonstrated across multiple tasks.
High preservation of benign content and image quality.
Enhanced resistance to adversarial prompt attacks.
Abstract
Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading them to reproduce unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, because such filters have limited robustness and lack fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, but they require costly retraining, degrade the quality of benign generations, or fail to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding…
