Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation
Xin Zhao, Xiaojun Chen, Yuexin Xuan, Zhendong Zhao, Xiaojun Jia,, Xinfeng Li, Xiaofeng Wang

TL;DR
Buster is a novel framework that embeds semantic backdoors into text encoders to effectively and efficiently prevent NSFW content generation in text-to-image models, outperforming existing methods.
Contribution
It introduces a scalable, robust backdoor injection method into text encoders that redirects harmful prompts to benign outputs with minimal fine-tuning time.
Findings
Achieves at least 91.2% NSFW content removal rate.
Fine-tunes the text encoder within five minutes.
Outperforms nine state-of-the-art baselines.
Abstract
The rise of deep learning models in the digital era has raised substantial concerns regarding the generation of Not-Safe-for-Work (NSFW) content. Existing defense methods primarily involve model fine-tuning and post-hoc content moderation. Nevertheless, these approaches largely lack scalability in eliminating harmful content, degrade the quality of benign image generation, or incur high inference costs. To address these challenges, we propose an innovative framework named \textit{Buster}, which injects backdoors into the text encoder to prevent NSFW content generation. Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts. Additionally, Buster employs energy-based training data generation through Langevin dynamics for adversarial knowledge augmentation, thereby ensuring robustness in harmful concept…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Malware Detection Techniques · Web Application Security Vulnerabilities · Security and Verification in Computing
