CGCE: Classifier-Guided Concept Erasure in Generative Models

Viet Nguyen; Vishal M. Patel

arXiv:2511.05865 · cs.CV · November 26, 2025

TL;DR

CGCE introduces a scalable, classifier-guided framework for robust concept erasure in generative models, effectively balancing safety and quality without altering original model weights.

Contribution

The paper presents a novel plug-and-play method that enhances concept erasure robustness in generative models using lightweight classifiers on text embeddings.
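To make the idea of a lightweight classifier on text embeddings concrete, here is a minimal sketch, not the authors' implementation. It fits a logistic-regression probe on fixed embedding vectors and scores a prompt embedding for the presence of a target concept. Everything in it is an assumption for illustration: the synthetic vectors stand in for the output of a frozen text encoder, the names `train_linear_probe` and `concept_score` are hypothetical, and this summary does not specify the paper's classifier architecture, its training data, or how the classifier's output is used during generation.

```python
import numpy as np


def train_linear_probe(embeddings, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe on fixed embeddings (hypothetical helper).

    Only the probe parameters (w, b) are learned; the embeddings, and the
    generative model that would consume them, are left untouched.
    """
    n, d = embeddings.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = embeddings @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels  # gradient of mean cross-entropy w.r.t. logits (times n)
        w -= lr * (embeddings.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def concept_score(embedding, w, b):
    """Return the probe's probability that the target concept is present."""
    return 1.0 / (1.0 + np.exp(-(embedding @ w + b)))


# Synthetic stand-in for text-encoder outputs (assumption: a real setup would
# embed actual safe and unsafe prompts with the model's own text encoder).
rng = np.random.default_rng(0)
dim = 16
concept_direction = np.zeros(dim)
concept_direction[0] = 1.0

safe = rng.normal(0.0, 1.0, (200, dim)) - 2.0 * concept_direction
unsafe = rng.normal(0.0, 1.0, (200, dim)) + 2.0 * concept_direction

X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(200), np.ones(200)])

w, b = train_linear_probe(X, y)
train_accuracy = ((concept_score(X, w, b) > 0.5) == y).mean()
print(f"probe training accuracy: {train_accuracy:.3f}")
```

Because the probe reads embeddings and never writes to the generator, it can in principle be attached to or removed from a model without retraining it, which is the sense of "plug-and-play" used above.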

Findings

01. Achieves state-of-the-art robustness against adversarial attacks.

02. Maintains high generative quality on safe prompts.

03. Applicable to various text-to-image (T2I) and text-to-video (T2V) models.

Abstract

Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection