Characterizing Selective Refusal Bias in Large Language Models
Adel Khorramrouz, Sharon Levy

TL;DR
This paper investigates how safety guardrails in large language models can unintentionally introduce bias by refusing to generate harmful content that targets some demographic groups but not others, highlighting the need for more equitable safety measures.
Contribution
It characterizes the selective refusal bias in LLM safety guardrails across multiple demographic attributes and explores its implications for fairness and robustness.
Findings
Evidence of selective refusal bias across gender, nationality, religion, and sexual orientation.
Refusal rates vary significantly among different demographic groups.
An indirect attack that targets previously refused groups reveals additional safety vulnerabilities.
Abstract
Safety guardrails in large language models (LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.
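The per-group refusal rate that the abstract refers to can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record format, the group labels, the sample data, and the helper names `refusal_rates` and `refusal_gap` are all assumptions made for illustration, and the max-minus-min gap is only one simple way to summarize disparity.

```python
from collections import defaultdict


def refusal_rates(records):
    """Return the fraction of refused prompts per group.

    `records` is an iterable of (group, refused) pairs, where `refused`
    is True when the model declined the prompt (hypothetical format).
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for group, refused in records:
        totals[group] += 1
        refusals[group] += int(refused)
    return {group: refusals[group] / totals[group] for group in totals}


def refusal_gap(rates):
    """Difference between the highest and lowest per-group refusal rate."""
    return max(rates.values()) - min(rates.values())


# Made-up toy data, not results from the paper.
records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = refusal_rates(records)
gap = refusal_gap(rates)
print(rates, gap)
```

A gap near zero would suggest the guardrail refuses at similar rates regardless of the targeted group, while a large gap would be consistent with the selective refusal bias the paper describes.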
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Authorship Attribution and Profiling
