Safety Is Not Universal: The Selective Safety Trap in LLM Alignment
Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho

TL;DR
This paper uncovers the 'Selective Safety Trap' in LLMs, showing models defend some groups better than others, and introduces MiJaBench, a bilingual benchmark to evaluate demographic safety disparities.
Contribution
It introduces MiJaBench, a large bilingual benchmark for auditing demographic safety disparities in LLMs, and demonstrates that safety alignment varies across groups and improves with targeted optimization.
Findings
Safety defense rates vary by up to 42% across demographic groups within the same model.
Current safety alignment learns group-specific safeguards rather than general harm mitigation.
Targeted DPO improves safety generalization to unseen demographics and attack strategies.
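To make the first finding concrete, here is a minimal sketch of how a per-group defense rate and the spread between the best- and worst-defended groups could be computed. It assumes each evaluated response has already been labeled as defended or not. The function names, the record layout, and the toy data are hypothetical illustrations, not the paper's evaluation code or results.

```python
from collections import defaultdict


def defense_rates(records):
    """Return {group: fraction of prompts the model defended against}.

    `records` is an iterable of (group, defended) pairs, where `defended`
    is True when the model refused or safely handled the adversarial prompt.
    This layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    defended_counts = defaultdict(int)
    for group, was_defended in records:
        totals[group] += 1
        defended_counts[group] += int(was_defended)
    return {group: defended_counts[group] / totals[group] for group in totals}


def max_gap(rates):
    """Difference between the best- and worst-defended groups."""
    return max(rates.values()) - min(rates.values())


# Toy, made-up labels for two placeholder groups (not data from the paper).
records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = defense_rates(records)
gap = max_gap(rates)
print(rates, gap)
```

A single aggregate rate over all records would hide this spread, which is why the per-group breakdown matters for the kind of audit described here.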
Abstract
Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under generic categories such as "Identity Hate", obscuring vulnerabilities toward specific populations. In this work, we expose the Selective Safety Trap: a systemic failure mode where models robustly defend specific populations while leaving underrepresented communities highly vulnerable to identical adversarial attacks. To systematically audit this phenomenon, we introduce MiJaBench, a bilingual (English-Portuguese) adversarial benchmark comprising 43,961 controlled jailbreaking prompts across 16 minority groups. By evaluating 14 state-of-the-art LLMs on MiJaBench, we curate 615,454 prompt-response pairs that compose MiJaBench-Align, revealing that safety alignment is not a uniform semantic capability but a demographic hierarchy, with defense rates…
