BadFair: Backdoored Fairness Attacks with Group-conditioned Triggers
Jiaqi Xue, Qian Lou, Mengxin Zheng

TL;DR
BadFair is a novel backdoored attack that manipulates fairness in models, remaining stealthy under normal conditions but causing targeted discrimination when triggered, with high success rates and minimal accuracy loss.
Contribution
We introduce BadFair, a new backdoor attack method that specifically targets fairness mechanisms, exposing vulnerabilities and bypassing existing fairness detection techniques.
Findings
Achieves over 85% attack success rate on target groups
Maintains high model accuracy with minimal loss
Consistently causes significant discrimination in targeted groups
Abstract
Attacking fairness is crucial because compromised models can introduce biased outcomes, undermining trust and amplifying inequalities in sensitive applications like hiring, healthcare, and law enforcement. This highlights the urgent need to understand how fairness mechanisms can be exploited and to develop defenses that ensure both fairness and robustness. We introduce BadFair, a novel backdoored fairness attack methodology. BadFair stealthily crafts a model that operates with accuracy and fairness under regular conditions but, when activated by certain triggers, discriminates and produces incorrect results for specific groups. This type of attack is particularly stealthy and dangerous, as it circumvents existing fairness detection methods, maintaining an appearance of fairness in normal use. Our findings reveal that BadFair achieves a more than 85% attack success rate in attacks aimed…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
