AutoDebias: Automated Framework for Debiasing Text-to-Image Models
Hongyi Cai, Mohammad Mahdinur Rahman, Mingkang Dong, Muxin Pu, Moqyad Alqaily, Jie Li, Xinfeng Li, Jialie Shen, Meikang Qiu, Qingsong Wen

TL;DR
AutoDebias is an automated framework that detects and mitigates malicious biases in text-to-image models. It addresses subtle backdoor attacks without prior knowledge of the specific attack type, while maintaining image quality.
Contribution
The paper introduces AutoDebias, an automated method that uses vision-language models to identify and neutralize malicious biases in T2I models without prior knowledge of the attack.
Findings
Detects malicious patterns with 91.6% accuracy.
Reduces backdoor success rate from 90% to negligible levels.
Preserves image quality and diversity after debiasing.
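For readers unfamiliar with the metric, the sketch below shows one plausible way a backdoor success rate could be computed. It assumes the rate is the fraction of trigger-prompted generations in which the injected bias appears; the summary does not define the metric, and the function name and the per-image boolean flags are hypothetical.

```python
def backdoor_success_rate(bias_flags):
    """Fraction of triggered generations that show the injected bias.

    bias_flags: one boolean per generated image (hypothetical input),
    True when a detector or annotator judged the bias to be present.
    """
    flags = list(bias_flags)
    if not flags:
        raise ValueError("need at least one generation to compute a rate")
    return sum(1 for flag in flags if flag) / len(flags)


# Illustrative data only: 9 of 10 triggered images show the bias.
before = backdoor_success_rate([True] * 9 + [False])
# Illustrative data only: no triggered image shows the bias after debiasing.
after = backdoor_success_rate([False] * 10)
```

Under this assumed definition, a drop from 90% to a negligible level means that almost no trigger-prompted image still exhibits the injected pattern.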
Abstract
Text-to-Image (T2I) models generate high-quality images but are vulnerable to malicious backdoor attacks that inject harmful biases (e.g., trigger-activated gender or racial stereotypes). Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberately and subtly injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack types. Specifically, AutoDebias leverages vision-language models to detect trigger-activated visual patterns and constructs neutralization guides by generating counter-prompts. These guides drive a CLIP-guided training process that breaks the harmful associations while preserving the original model's image quality and diversity. Unlike methods designed for natural bias, AutoDebias effectively addresses…
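The abstract describes counter-prompts driving a CLIP-guided training process but does not give the objective. As a minimal sketch of the general idea only, the hinge-style loss below penalizes a generated image whose embedding is closer to the biased description than to the counter-prompt. The function names, the margin parameter, and the hinge form are assumptions for illustration, not the paper's formulation, and plain Python lists stand in for real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neutralization_loss(image_emb, biased_text_emb, counter_text_emb, margin=0.0):
    """Hypothetical hinge loss on embedding similarities.

    The loss is zero once the image is at least `margin` more similar to
    the counter-prompt than to the biased description, and positive otherwise.
    """
    sim_biased = cosine(image_emb, biased_text_emb)
    sim_counter = cosine(image_emb, counter_text_emb)
    return max(0.0, margin + sim_biased - sim_counter)


# Toy 2-D embeddings (illustrative only, not real CLIP outputs).
biased = [1.0, 0.0]
counter = [0.0, 1.0]
loss_when_biased = neutralization_loss([1.0, 0.0], biased, counter)
loss_when_neutral = neutralization_loss([0.0, 1.0], biased, counter)
```

In an actual training loop, the embeddings would come from a CLIP image encoder and text encoder, and the loss would be backpropagated into the T2I model; a separate term would typically be needed to preserve image quality and diversity, which this sketch does not cover.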
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Hate Speech and Cyberbullying Detection
