CGCE: Classifier-Guided Concept Erasure in Generative Models
Viet Nguyen, Vishal M. Patel

TL;DR
CGCE introduces a scalable, classifier-guided framework for robust concept erasure in generative models, effectively balancing safety and quality without altering original model weights.
Contribution
The paper presents a novel plug-and-play method that enhances concept erasure robustness in generative models using lightweight classifiers on text embeddings.
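As a rough illustration of what a "lightweight classifier on text embeddings" could look like, the sketch below trains a logistic-regression probe on synthetic vectors that stand in for text-encoder outputs. It is not the authors' implementation: the embedding dimension, the synthetic data, and the names `train_linear_probe` and `concept_score` are all assumptions made for this example. The summary does not say how CGCE uses the classifier's output, so only the detection step is shown. The generator itself is never touched, which matches the plug-and-play claim.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_linear_probe(X, y, lr=0.5, steps=500):
    """Fit a logistic-regression probe on embeddings X of shape (n, d)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)          # predicted probability of the concept
        grad = p - y                    # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def concept_score(embedding, w, b):
    """Probability that an embedding carries the target concept."""
    return float(sigmoid(embedding @ w + b))

# Synthetic stand-ins for text-encoder outputs (assumption: a real setup
# would use embeddings from the generative model's own text encoder).
rng = np.random.default_rng(0)
dim = 16
direction = np.zeros(dim)
direction[0] = 1.0                      # hypothetical "concept direction"
safe = rng.normal(size=(200, dim)) - 2.0 * direction
unsafe = rng.normal(size=(200, dim)) + 2.0 * direction

X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(200), np.ones(200)])
w, b = train_linear_probe(X, y)

# Fraction of training embeddings the probe labels correctly.
train_accuracy = float(((sigmoid(X @ w + b) > 0.5) == (y == 1)).mean())
```

A probe like this has only `dim + 1` parameters, so it can be trained and stored separately for each concept while the generative model's weights stay fixed.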
Findings
Achieves state-of-the-art robustness against adversarial attacks.
Maintains high generative quality on safe prompts.
Applies to a range of text-to-image (T2I) and text-to-video (T2V) models.
Abstract
Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
